Blitzed Again (Jul 02 2009)

John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 914197 - Posted: 5 Jul 2009, 2:42:44 UTC - in response to Message 914192.  

Rather than speculating in a vacuum, I suggest any of you could simply try compressing a WU or two. You'd find that setiathome_enhanced WUs are moderately compressible, astropulse_v505 only slightly. If compressed downloads were implemented, it would be akin to adding another 25 MBits/second to the download bandwidth; enough to help short term, but hardly a permanent fix.
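
For anyone who wants to try it, a minimal sketch of such a test in Python, assuming a workunit has been saved locally under the hypothetical name workunit.sah:

import zlib

# Measure how compressible a downloaded workunit is.
with open("workunit.sah", "rb") as f:   # placeholder filename
    raw = f.read()

packed = zlib.compress(raw, 9)          # DEFLATE, the same algorithm gzip uses
saving = 100.0 * (1 - len(packed) / len(raw))
print("%d -> %d bytes (%.1f%% saved)" % (len(raw), len(packed), saving))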

BOINC 5.x and later have libCurl which is prepared to decompress files sent with gzip compression. The download servers run Apache, which can be configured to gzip data being sent. All that would be needed is to set the warning that all users must upgrade to BOINC 5.x or later, give it long enough to be seen and adopted, then reconfigure the download servers.
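
In Python terms, the client side of that amounts to something like the sketch below (the URL is a placeholder; a real server only compresses when configured to do so):

import gzip
import urllib.request

# Advertise gzip support, the way a gzip-aware client such as libCurl does.
req = urllib.request.Request(
    "http://example.com/workunit.sah",   # hypothetical download URL
    headers={"Accept-Encoding": "gzip"})
with urllib.request.urlopen(req) as resp:
    body = resp.read()
    if resp.headers.get("Content-Encoding") == "gzip":
        body = gzip.decompress(body)     # transparent decompression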

The real issue is whether the download servers can handle the extra load of doing the compression; an earlier suggestion for a trial at SETI Beta wouldn't test that effectively. And although the 7zip format does compress better than gzip, it takes more memory and time to do the compression. It would also take project-specific coding to implement, rather than using the features BOINC already has.
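
The ratio-versus-CPU-time tradeoff is easy to measure; a rough sketch, again using the hypothetical workunit.sah, comparing DEFLATE (gzip) with LZMA (the algorithm behind 7-Zip):

import lzma
import time
import zlib

with open("workunit.sah", "rb") as f:    # placeholder filename
    raw = f.read()

# Compare compressed size and CPU time for each algorithm.
for name, compress in (("gzip/zlib", lambda d: zlib.compress(d, 9)),
                       ("7z/lzma", lzma.compress)):
    start = time.perf_counter()
    out = compress(raw)
    elapsed = time.perf_counter() - start
    print("%s: %d bytes in %.1f ms" % (name, len(out), elapsed * 1000.0))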

Edit:
Correct me if I am wrong, but isn't it a text file that is sent each way?

The payload in Enhanced work is 256KB of nearly random data encoded in uue fashion, so can be sent as text. The payload in AP work is 8MB of pure 8 bit nearly random data. In both cases, it's the task of the science applications to ferret out the few cases where the data deviates from random noise.

                                                           Joe

The enhanced files are being transmitted as XML, not binary; therefore there is something to work with - even if the underlying data is completely random.

The one enhanced WU that I have compresses by 27% with a fairly weak compression technique. That is assuming that I used the right file for the test.

The AP task I tried compressing got 11% with one of the better compression techniques that WinZip has at its disposal. This is a somewhat disappointing number.

The question still remains as to whether the CPU time spent compressing the tasks is worth the saved bandwidth. Typically the project is shorter on CPU time than on bandwidth, although this is not true in all cases.


BOINC WIKI
zpm
Volunteer tester
Joined: 25 Apr 08
Posts: 284
Credit: 1,659,024
RAC: 0
United States
Message 914206 - Posted: 5 Jul 2009, 3:05:19 UTC - in response to Message 914197.  

In my opinion, any dual-core processor doesn't spend a lot of time wrapping it up and compressing... a quad even less...



I recommend Secunia PSI: http://secunia.com/vulnerability_scanning/personal/
Go Georgia Tech.
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 914219 - Posted: 5 Jul 2009, 3:37:02 UTC - in response to Message 914206.  

In my opinion, any dual-core processor doesn't spend a lot of time wrapping it up and compressing... a quad even less...

Especially when it's not doing anything else.
But when the CPU is busy with other tasks, the system is low on available memory, and the disk sub-system is busy with other throughput, you will find that compressing even a small file can take a while - even more so when you've got to do 10, or 100, or 1,000 per second to keep up with demand.

Grant
Darwin NT
zpm
Volunteer tester
Joined: 25 Apr 08
Posts: 284
Credit: 1,659,024
RAC: 0
United States
Message 914224 - Posted: 5 Jul 2009, 3:45:19 UTC - in response to Message 914219.  
Last modified: 5 Jul 2009, 3:46:12 UTC

I'll still say "A byte saved is a byte that can be spent later." This concept is helping DD@H and H@H when it comes to paying for bandwidth.


Hey, if I were to win the lottery, I would donate a lot to BOINC projects...

I recommend Secunia PSI: http://secunia.com/vulnerability_scanning/personal/
Go Georgia Tech.
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 914233 - Posted: 5 Jul 2009, 4:24:04 UTC - in response to Message 914224.  

I'll still say "A byte saved is a byte that can be spent later." This concept is helping DD@H and H@H when it comes to paying for bandwidth.

The problem here isn't just one of bandwidth - it's resources overall. Compression certainly might ease the bandwidth problem, but the load it imposes could cause a new bottleneck elsewhere in the system.


Hey, if I were to win the lottery, I would donate a lot to BOINC projects...

Same here.
Unfortunately I missed out on the recent $120,000,000 draw.
Grant
Darwin NT
W-K 666
Volunteer tester

Joined: 18 May 99
Posts: 19048
Credit: 40,757,560
RAC: 67
United Kingdom
Message 914264 - Posted: 5 Jul 2009, 6:37:47 UTC

Lossless compression is often done with the Lempel-Ziv-Welch (LZW) algorithm or a variation of it.
It works by examining the file and placing each string into a dictionary; if the same string is found again, it is substituted with a shorter code. As the dictionary builds up, the most common strings get the shortest substitution codes, while at the other end unique strings are not compressed.
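
As a concrete illustration of the dictionary idea, here is a toy LZW compressor in Python (a sketch of the scheme described above, not the exact variant any particular tool uses):

def lzw_compress(data):
    # Start with one dictionary entry per possible byte value.
    table = {bytes([i]): i for i in range(256)}
    current = b""
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate              # keep extending the match
        else:
            codes.append(table[current])     # emit the code for the known string
            table[candidate] = len(table)    # the new string enters the dictionary
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes

Feed it repetitive text and the code count drops well below the byte count; feed it noise and you get nearly one code per input byte, which is why a random payload barely shrinks.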

As the Seti data files consist of a header followed by random data, the header can be compressed considerably but the random data will only be compressed a very small amount. The header is about the same size for MB and AP tasks.
As the MB files are short (350 Kbyte), the header is a larger fraction of the file, so they will compress more than the AP files, which are 8 Mbyte. So seeing figures of 27% for MB and 11% for AP is probably close to what is expected.
rob smith
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22189
Credit: 416,307,556
RAC: 380
United Kingdom
Message 914277 - Posted: 5 Jul 2009, 7:42:34 UTC

Perhaps the problem is the distribution algorithm?
Looking at my 4 computers, the oldest processes "normal" WUs in either 34 hrs or 10 hrs, compared with the 3.5 hrs or 1 hr of the fastest. The other day the slowest got quite a number of slow work units, which it is still plodding through, and the fastest got a very large number of its fastest kind, which it devoured without any problems.
Obviously, on completion the WUs are returned, and if they come back very rapidly and in very large numbers they will flood the available bandwidth. So why not ensure that the faster machines get the "bigger" WUs, and the slower machines the "smaller" ones?

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
Andy Lee Robinson
Joined: 8 Dec 05
Posts: 630
Credit: 59,973,836
RAC: 0
Hungary
Message 914287 - Posted: 5 Jul 2009, 8:19:24 UTC - in response to Message 914264.  
Last modified: 5 Jul 2009, 8:35:21 UTC

As the Seti data files consist of a header followed by random data, the header can be compressed considerably but the random data will only be compressed a very small amount. The header is about the same size for MB and AP tasks.
As the MB files are short (350 Kbyte), the header is a larger fraction of the file, so they will compress more than the AP files, which are 8 Mbyte. So seeing figures of 27% for MB and 11% for AP is probably close to what is expected.


Random data is like concrete. It cannot be compressed because there are too few recurring patterns to be worth substituting.

AP files contain binary data.
MB files contain base64-encoded binary data, and so are 33% bigger than if they were plain binary. They are still not very compressible with LZW because the base64 codes are also random, just drawn from 64 symbols instead of the 256 for binary.
base64 can be compressed by removing the redundant 2 bits of each byte and then replacing them afterwards, but this is a different algorithm from LZW!
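
The arithmetic is easy to check in Python with random bytes standing in for the noise-like payload (a sketch; exact sizes vary slightly from run to run):

import base64
import os
import zlib

raw = os.urandom(256 * 1024)                     # noise, like an MB payload
text = base64.b64encode(raw)

print(len(text) / len(raw))                      # ~1.33: the base64 growth
print(len(zlib.compress(raw, 9)) / len(raw))     # ~1.0: noise won't shrink
print(len(zlib.compress(text, 9)) / len(text))   # ~0.75: only the two spare
                                                 #   bits per character come back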

Therefore, the lowest hanging fruit to reduce bandwidth demand is to remove the base64 encoding for the data in the MB files.
If the newer AP workunits work without it, then why not for MB too?

Andy.
ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 20265
Credit: 7,508,002
RAC: 20
United Kingdom
Message 914304 - Posted: 5 Jul 2009, 10:48:01 UTC - in response to Message 914159.  

I'm not so sure about compression. My understanding was that random noise wouldn't compress very much at all, as there are no patterns in it to exploit. Now a WU with a signal might compress because of the pattern of the signal. ...

Very good idea, and that has parallels to the statistical analysis used to break the German Enigma codes.

But... Would you get anything that could be distinguished from the background noise? Even for when Arecibo is sweeping across the sky...?

Try the idea on a selection of VLARs?

Keep searchin',
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
Spacey_88

Joined: 14 May 99
Posts: 1
Credit: 9,843,643
RAC: 6
United States
Message 914318 - Posted: 5 Jul 2009, 13:23:34 UTC

Getting back to a point much earlier in this thread about the system being designed not to be overloaded by noisy work because there are multiple splitters working on multiple files... Has anyone noticed that the splitter working on 05mr09af appears to be 'stuck'? It's been at the same point since the servers came back up on Tuesday...

I haven't watched to see if any of the others get stuck, but it may be worth looking at... are there really 6 splitters working on 6 different files, or do we have times when there are only one or two splitters making work, and that compounds the problems with limited bandwidth?
John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 914369 - Posted: 5 Jul 2009, 16:37:20 UTC - in response to Message 914206.  

In my opinion, any dual-core processor doesn't spend a lot of time wrapping it up and compressing... a quad even less...


The processing bottleneck would be at the server, not at the client. Now imagine having to compress ten to twenty tasks per second in order to feed the outbound queue. Similar for decompressing the result data.


BOINC WIKI
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 914380 - Posted: 5 Jul 2009, 17:26:01 UTC - in response to Message 914369.  

In my opinion, any dual-core processor doesn't spend a lot of time wrapping it up and compressing... a quad even less...


The processing bottleneck would be at the server, not at the client. Now imagine having to compress ten to twenty tasks per second in order to feed the outbound queue. Similar for decompressing the result data.

I agree that compression is tough.

... but what we're doing right now is taking the binary data and converting it to text. The mime type is x-setiathome which is legal but non-standard, and I can't tell how efficient it is by looking at it. If it's like Base64 it's 4 characters for every 3 bytes; if it's like ASCII85, it's 5 characters for every 4 bytes.

Then it converts back on the other end.

Shifting from a text encoding to binary (the way AP does it) could save 15% or more, save storage on disk, etc.

Of course, the science application would have to understand more than one encoding method.
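
For what it's worth, the per-encoding overheads are easy to compare in Python (a sketch; Python's a85encode is one ASCII85 variant):

import base64
import os

raw = os.urandom(4096)                    # stand-in payload
for name, encode in (("base64", base64.b64encode),
                     ("ascii85", base64.a85encode)):
    print(name, len(encode(raw)) / len(raw))   # ~1.33 and ~1.25 respectively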
ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 20265
Credit: 7,508,002
RAC: 20
United Kingdom
Message 914384 - Posted: 5 Jul 2009, 17:46:45 UTC - in response to Message 914380.  

... Shifting from a text encoding to binary (the way AP does it) could save 15% or more, save storage on disk, etc.

Of course, the science application would have to understand more than one encoding method.

The code is already there as used in AP...

Just add an "if" statement? (And someone to code it.)

Happy crunchin',
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
Gary Charpentier
Volunteer tester
Joined: 25 Dec 00
Posts: 30639
Credit: 53,134,872
RAC: 32
United States
Message 914389 - Posted: 5 Jul 2009, 17:57:53 UTC - in response to Message 914380.  

In my opinion, any dual-core processor doesn't spend a lot of time wrapping it up and compressing... a quad even less...


The processing bottleneck would be at the server, not at the client. Now imagine having to compress ten to twenty tasks per second in order to feed the outbound queue. Similar for decompressing the result data.

I agree that compression is tough.

... but what we're doing right now is taking the binary data and converting it to text. The mime type is x-setiathome which is legal but non-standard, and I can't tell how efficient it is by looking at it. If it's like Base64 it's 4 characters for every 3 bytes; if it's like ASCII85, it's 5 characters for every 4 bytes.

Then it converts back on the other end.

Shifting from a text encoding to binary (the way AP does it) could save 15% or more, save storage on disk, etc.

Of course, the science application would have to understand more than one encoding method.

re mime type
This morning it came to my attention that we've been sending out workunits with the "application/x-troff-man" mime type. This was because files with numerical suffixes (like workunits) are assumed to be man pages. This may have been causing some blockages at firewalls. I changed the mime type to "text/plain."


I get the feeling some Apache tables have been borked.

Elsewhere I believe someone said they were UUencoded. That's about the worst choice from a bandwidth standpoint, but perhaps the only 100% compatible choice at the time BOINC was put in place. Today I'm sure a different encoding could be chosen.

As to compression, and some tests people have run: first, don't forget that when you compress the file you are compressing a file that was expanded into plain text for transmission. At least one of every 8 bits is a zero, so 12.5% compression is automatic in that case. And whether it's UUCode or Base64 or anything else, none of these use all 128 possible 7-bit configurations, so even more automatic compression is a given. But all this automatic compression is illusory: to transmit, you have to expand back to plain text.

The numbers I saw quoted looked like MB is UUCode and AP is Base64 with random incompressible data inside as expected.

Perhaps the best compression would be to transmit in a full 8-bit binary mode. I'm not sure, however, whether that is 100% compatible with all the equipment in use on the net. I know that everything sold today is, but who has some 20-year-old box with a couple of percent of SETI users behind it?

1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 914418 - Posted: 5 Jul 2009, 19:21:38 UTC - in response to Message 914389.  

re mime type
This morning it came to my attention that we've been sending out workunits with the "application/x-troff-man" mime type. This was because files with numerical suffixes (like workunits) are assumed to be man pages. This may have been causing some blockages at firewalls. I changed the mime type to "text/plain."


I get the feeling some Apache tables have been borked.

That's talking about the HTTP header, and the only thing that would look at that is some nosy, paranoid firewall (they're supposed to be paranoid).

I'm talking about the internal type in the XML-ish work-unit.

That is x-setiathome. It's a private encoding (or it wouldn't start with "x-"), and it doesn't have to follow UUENCODE because SETI@home "owns" both ends of the process.

Looking at the data, it isn't hex, which would be kind-of dumb. UUENCODE and Base64 look a lot alike, and are about as efficient at six bits per character.

It could be something like ASCII85, which encodes four bytes into five characters.

What it isn't is raw binary.

Binary could work because HTTP can handle binary (or you couldn't have graphics on web pages; GIFs and JPEGs are binary), and it would have the least CPU load, especially on the server end.
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 914421 - Posted: 5 Jul 2009, 19:23:36 UTC - in response to Message 914384.  

... Shifting from a text encoding to binary (the way AP does it) could save 15% or more, save storage on disk, etc.

Of course, the science application would have to understand more than one encoding method.

The code is already there as used in AP...

Just add an "if" statement? (And someone to code it.)

Happy crunchin',
Martin

... and test it, and then time to deploy, and it'd be nice to give the optimized app developers a chance to add it to their apps.

The actual coding is probably the easiest part.
Jason

Joined: 30 Mar 07
Posts: 1
Credit: 106,096
RAC: 0
Australia
Message 914561 - Posted: 6 Jul 2009, 1:54:25 UTC - in response to Message 914048.  



There's still a cost issue there. SETI@Home leases space from the University, and has to work within the constraints set forth by the University's dictates, which include power requirements for both the servers and the air conditioning to cool the servers. Then there's the issue of staff wages (they do try to get paid for their work).

I mention this because if the plan is to get other universities involved, you are effectively doubling the financial strain on the project. Other scientists at other universities will have to purchase servers (or look for donations), lease space, pay for their power usage, pay for the connection to the internet, etc. And of course those scientists will want to get paid as well.



How about rather than getting other unis involved, you get Nvidia involved? The deciding factor for me between an ATI card and an Nvidia card was CUDA. Both cards play the games I want to play, but only one will crunch. Amp up CUDA support messages on the SETI website and I'm sure that Nvidia would find hosting *all* the SETI boxes with a monster pipe a very profitable move. Win Win Win all round. Berkeley gets the boxes and their associated costs off site, we get WUs to crunch, Nvidia gets advertising worth millions, and the SETI team spends money on research rather than air conditioning and leasing space. Cheers Jason =:)
zpm
Volunteer tester
Joined: 25 Apr 08
Posts: 284
Credit: 1,659,024
RAC: 0
United States
Message 914566 - Posted: 6 Jul 2009, 2:09:17 UTC - in response to Message 914561.  
Last modified: 6 Jul 2009, 2:09:46 UTC

Well, hey, that would be nice, but people are working on an OpenCL platform that will allow Nvidia and ATI cards to crunch... it would go much faster if ATI weren't so "do it yourself"... though if it came out, there goes the incentive for Nvidia.

I recommend Secunia PSI: http://secunia.com/vulnerability_scanning/personal/
Go Georgia Tech.
OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 914584 - Posted: 6 Jul 2009, 3:22:41 UTC - in response to Message 914561.  



There's still a cost issue there. SETI@Home leases space from the University, and has to work within the constraints set forth by the University's dictates, which include power requirements for both the servers and the air conditioning to cool the servers. Then there's the issue of staff wages (they do try to get paid for their work).

I mention this because if the plan is to get other universities involved, you are effectively doubling the financial strain on the project. Other scientists at other universities will have to purchase servers (or look for donations), lease space, pay for their power usage, pay for the connection to the internet, etc. And of course those scientists will want to get paid as well.



How about rather than getting other unis involved, you get Nvidia involved? The deciding factor for me between an ATI card and an Nvidia card was CUDA. Both cards play the games I want to play, but only one will crunch. Amp up CUDA support messages on the SETI website and I'm sure that Nvidia would find hosting *all* the SETI boxes with a monster pipe a very profitable move. Win Win Win all round. Berkeley gets the boxes and their associated costs off site, we get WUs to crunch, Nvidia gets advertising worth millions, and the SETI team spends money on research rather than air conditioning and leasing space. Cheers Jason =:)


nVidia was only interested in getting the CUDA code working for marketing purposes. They are not interested in picking up the tab for the entire project when there are plenty of other CUDA-capable projects out there that do not need help.
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 914602 - Posted: 6 Jul 2009, 4:00:45 UTC - in response to Message 914421.  

... Shifting from a text encoding to binary (the way AP does it) could save 15% or more, save storage on disk, etc.

Of course, the science application would have to understand more than one encoding method.

The code is already there as used in AP...

Just add an "if" statement? (And someone to code it.)

Happy crunchin',
Martin

... and test it, and then time to deploy, and it'd be nice to give the optimized app developers a chance to add it to their apps.

The actual coding is probably the easiest part.

Agreed, basically importing code from AP sources to the MB splitter and application. Building for all platforms could be problematic; note that on the Applications page some haven't been updated since 5.12.

The x-setiathome format carries 6 bits per byte; I've never compared details to see if it's identical to UUencoding or Base64. The application code will use the same decoding if it's called Sun Binary, and of course the original data recorder used a Sun system.

The rate at which new builds can be tested at SETI Beta is fairly slow, so adapting the optimized apps is usually no problem. Obviously with incompatible WUs there would have to be a new application name, so no issues with running the new work on old apps.
                                                                Joe