Blitzed Again (Jul 02 2009)



Message boards : Technical News : Blitzed Again (Jul 02 2009)

Previous · 1 · 2 · 3 · 4 · 5 · Next
Author Message
Grant (SSSF)
Joined: 19 Aug 99
Posts: 5683
Credit: 56,105,286
RAC: 49,802
Australia
Message 914176 - Posted: 5 Jul 2009, 1:45:31 UTC - in response to Message 914165.


Lossless_data_compression
By operation of the pigeonhole principle, no lossless compression algorithm can efficiently compress all possible data, and completely random data streams cannot be compressed.
I would expect that most of the data downloaded would be mostly random, hence unlikely to compress much at all.
The result data would probably be more compressible, but how much more? And given how small it already is, any savings in bandwidth could well be offset by the processing required at the other end to expand it all again.
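Grant's point follows directly from the pigeonhole principle and is easy to check with any general-purpose lossless compressor. A minimal sketch using Python's zlib (the data here is illustrative stand-ins, not actual workunits):

```python
import os
import zlib

# Stand-ins for the two cases (not actual SETI@Home data):
random_data = os.urandom(256 * 1024)              # noise-like payload
patterned_data = b"the quick brown fox " * 13107  # ~256 KB with structure

random_ratio = len(zlib.compress(random_data, 9)) / len(random_data)
patterned_ratio = len(zlib.compress(patterned_data, 9)) / len(patterned_data)

print(f"random:    {random_ratio:.3f}")    # ~1.0: essentially incompressible
print(f"patterned: {patterned_ratio:.3f}")  # a tiny fraction of the original
```

Even at zlib's maximum effort, the random buffer comes out at (or slightly above) its original size, while the patterned one collapses.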
____________
Grant
Darwin NT.

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4196
Credit: 1,028,895
RAC: 259
United States
Message 914192 - Posted: 5 Jul 2009, 2:11:36 UTC
Last modified: 5 Jul 2009, 2:18:39 UTC

Rather than speculating in a vacuum, I suggest any of you could simply try compressing a WU or two. You'd find that setiathome_enhanced WUs are moderately compressible, astropulse_v505 only slightly. If compressed downloads were implemented, it would be akin to adding another 25 MBits/second to the download bandwidth; enough to help short term, but hardly a permanent fix.

BOINC 5.x and later have libCurl which is prepared to decompress files sent with gzip compression. The download servers run Apache, which can be configured to gzip data being sent. All that would be needed is to set the warning that all users must upgrade to BOINC 5.x or later, give it long enough to be seen and adopted, then reconfigure the download servers.
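The server-side half of the change Joe describes might look like the following Apache mod_deflate fragment. This is a sketch only: the module path, directory path, and compression level are assumptions, not the project's actual configuration. Clients running BOINC 5.x+ would advertise gzip support via libcurl's Accept-Encoding header.

```apacheconf
# Sketch only: paths and the level here are illustrative assumptions.
LoadModule deflate_module modules/mod_deflate.so

<Directory "/data/workunit_downloads">
    SetOutputFilter DEFLATE
    # A low level keeps the per-request CPU cost down; most of the gain
    # would come from the compressible header anyway.
    DeflateCompressionLevel 1
</Directory>
```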

The real issue is whether the download servers can handle the extra load of doing the compression; an earlier suggestion for a trial at SETI Beta wouldn't test that effectively. And although the 7zip format does compress better than gzip, it takes more memory and time to do the compression. It would also take project-specific coding to implement, rather than using the features BOINC already has.

Edit:

Correct me if I am wrong, but isn't it a text file that is sent each way?

The payload in Enhanced work is 256KB of nearly random data encoded in uue fashion, so can be sent as text. The payload in AP work is 8MB of pure 8 bit nearly random data. In both cases, it's the task of the science applications to ferret out the few cases where the data deviates from random noise.

Joe

John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24055
Credit: 516,934
RAC: 134
United States
Message 914197 - Posted: 5 Jul 2009, 2:42:44 UTC - in response to Message 914192.

Rather than speculating in a vacuum, I suggest any of you could simply try compressing a WU or two. You'd find that setiathome_enhanced WUs are moderately compressible, astropulse_v505 only slightly. If compressed downloads were implemented, it would be akin to adding another 25 MBits/second to the download bandwidth; enough to help short term, but hardly a permanent fix.

BOINC 5.x and later have libCurl which is prepared to decompress files sent with gzip compression. The download servers run Apache, which can be configured to gzip data being sent. All that would be needed is to set the warning that all users must upgrade to BOINC 5.x or later, give it long enough to be seen and adopted, then reconfigure the download servers.

The real issue is whether the download servers can handle the extra load of doing the compression; an earlier suggestion for a trial at SETI Beta wouldn't test that effectively. And although the 7zip format does compress better than gzip, it takes more memory and time to do the compression. It would also take project-specific coding to implement, rather than using the features BOINC already has.

Edit:
Correct me if I am wrong, but isn't it a text file that is sent each way?

The payload in Enhanced work is 256KB of nearly random data encoded in uue fashion, so can be sent as text. The payload in AP work is 8MB of pure 8 bit nearly random data. In both cases, it's the task of the science applications to ferret out the few cases where the data deviates from random noise.

Joe

The enhanced files are being transmitted as XML, not binary, therefore, there is something to work with - even if the underlying data is completely random.

The one enhanced WU that I have compresses by 27% with a fairly weak compression technique. That is assuming that I used the right file for the test.

The AP task I tried compressing got 11% with one of the better compression techniques that WinZip has at its disposal. This is a somewhat disappointing number.

The question still remains as to whether the CPU time spent compressing the tasks is worth the saved bandwidth. Typically the project is shorter on CPU time than on bandwidth, although this is not true in all cases.
____________


BOINC WIKI

zpm
Volunteer tester
Joined: 25 Apr 08
Posts: 284
Credit: 1,462,477
RAC: 3,299
United States
Message 914206 - Posted: 5 Jul 2009, 3:05:19 UTC - in response to Message 914197.

In my opinion, any dual-core processor doesn't spend a lot of time wrapping it up and compressing... a quad even less...


____________

I recommend Secunia PSI: http://secunia.com/vulnerability_scanning/personal/
Go Georgia Tech.

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5683
Credit: 56,105,286
RAC: 49,802
Australia
Message 914219 - Posted: 5 Jul 2009, 3:37:02 UTC - in response to Message 914206.

In my opinion, any dual-core processor doesn't spend a lot of time wrapping it up and compressing... a quad even less...

Especially when it's not doing anything else.
But when the CPU is busy with other tasks, the system is low on available memory, and the disk sub-system is busy with other throughput, you will find that compressing even a small file can take a while, even more so when you've got to do 10, or 100, or 1,000 per second to keep up with demand.

____________
Grant
Darwin NT.

zpm
Volunteer tester
Joined: 25 Apr 08
Posts: 284
Credit: 1,462,477
RAC: 3,299
United States
Message 914224 - Posted: 5 Jul 2009, 3:45:19 UTC - in response to Message 914219.
Last modified: 5 Jul 2009, 3:46:12 UTC

I'll still say "A byte saved is a byte that can be spent later." This concept is helping DD@H and H@H when it comes to paying for bandwidth.


Hey, if I were to win the lottery, I would donate a lot to BOINC projects...
____________

I recommend Secunia PSI: http://secunia.com/vulnerability_scanning/personal/
Go Georgia Tech.

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5683
Credit: 56,105,286
RAC: 49,802
Australia
Message 914233 - Posted: 5 Jul 2009, 4:24:04 UTC - in response to Message 914224.

I'll still say "A byte saved is a byte that can be spent later." This concept is helping DD@H and H@H when it comes to paying for bandwidth.

The problem here isn't just one of bandwidth, it's resources overall. Compression certainly might ease the bandwidth problem, but the load it imposes could then cause a new bottleneck elsewhere in the system.


Hey, if I were to win the lottery, I would donate a lot to BOINC projects...

Same here.
Unfortunately I missed out on the recent $120,000,000 draw.
____________
Grant
Darwin NT.

WinterKnight
Volunteer tester
Joined: 18 May 99
Posts: 8486
Credit: 23,009,631
RAC: 14,861
United Kingdom
Message 914264 - Posted: 5 Jul 2009, 6:37:47 UTC

Lossless compression is usually done with a Lempel-Ziv algorithm such as Lempel-Ziv-Welch (LZW) or a variation of it.
It works by examining the file and placing each string into a dictionary; if the same string is found again, it is substituted with a shorter code. As the dictionary builds up, the most popular strings get the shortest substitution codes, while at the other end unique strings are not compressed.
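That dictionary-substitution scheme can be sketched as a toy LZW coder. This is a simplified illustration, not production code: real LZW packs the codes into variable-width bit fields, whereas this version just returns a list of integer codes.

```python
def lzw_compress(data: bytes) -> list[int]:
    """Emit integer codes; the dictionary starts with all 256 single bytes."""
    dictionary = {bytes([i]): i for i in range(256)}
    w, out = b"", []
    for b in data:
        wc = w + bytes([b])
        if wc in dictionary:
            w = wc                         # grow the current match
        else:
            out.append(dictionary[w])      # emit code for longest known string
            dictionary[wc] = len(dictionary)  # register the new, longer string
            w = bytes([b])
    if w:
        out.append(dictionary[w])
    return out

def lzw_decompress(codes: list[int]) -> bytes:
    """Rebuild the same dictionary on the fly; inverse of lzw_compress."""
    dictionary = {i: bytes([i]) for i in range(256)}
    w = dictionary[codes[0]]
    result = bytearray(w)
    for k in codes[1:]:
        if k in dictionary:
            entry = dictionary[k]
        elif k == len(dictionary):         # the classic LZW corner case
            entry = w + w[:1]
        else:
            raise ValueError("bad LZW code")
        result += entry
        dictionary[len(dictionary)] = w + entry[:1]
        w = entry
    return bytes(result)

sample = b"TOBEORNOTTOBEORTOBEORNOT"
codes = lzw_compress(sample)
assert lzw_decompress(codes) == sample     # lossless round trip
print(len(sample), "bytes ->", len(codes), "codes")
```

On repetitive input like the sample it emits fewer codes than there are input bytes; on random input the dictionary rarely hits, which is exactly why the noise-dominated payloads barely shrink.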

As the SETI data files consist of a header followed by random data, the header can be compressed considerably but the random data will only compress a very small amount. The header is about the same size for MB and AP tasks.
As the MB files are small (350 KByte), the header makes up a larger share of them, so they will compress proportionally more than the 8 MByte AP files. Seeing figures of 27% for MB and 11% for AP is probably close to what should be expected.

rob smith
Volunteer tester
Joined: 7 Mar 03
Posts: 8122
Credit: 52,300,962
RAC: 78,227
United Kingdom
Message 914277 - Posted: 5 Jul 2009, 7:42:34 UTC

Perhaps the problem is the distribution algorithm?
Looking at my 4 computers, the oldest processes "normal" WUs in either 34 hrs or 10 hrs, compared with the 3.5 hrs or 1 hr of the fastest. The other day the slowest got quite a number of slow work units, which it is still plodding through, and the fastest got a very large number of its fastest, which it devoured without any problems.
Obviously on completion the WUs are returned, and if they come back very rapidly and in very large numbers they will flood the available bandwidth. So why not ensure that the faster machines get the "bigger" WUs, and the slower the "smaller"?

____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

Andy Lee Robinson
Joined: 8 Dec 05
Posts: 615
Credit: 39,239,180
RAC: 24,375
Hungary
Message 914287 - Posted: 5 Jul 2009, 8:19:24 UTC - in response to Message 914264.
Last modified: 5 Jul 2009, 8:35:21 UTC

As the Seti data files consist of a header followed by random data the header can be compressed considerably but the random data will only be compressed a very small amount. The header is about the same size for MB and AP tasks.
As the MB files are short, 350 Kbyte, they will compress more than the AP files which are 8 Mbyte. So seeing figures of 27% for MB and 11% for AP is probably close to what is expected.


Random data is like concrete. It cannot be compressed because there are too few recurring patterns to be worth substituting.

AP files contain binary data.
MB files contain base64-encoded binary data, and so are 33% bigger than if they were plain binary. They are still not very compressible with LZW because the base64 codes are also effectively random, just drawn from 64 symbols instead of the 256 of binary.
base64 can be compressed by removing the redundant 2 bits of each byte and then restoring them afterwards, but this is a different algorithm from LZW!
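"Removing the redundant 2 bits of each byte" is exactly what decoding does. A quick sketch with Python's standard base64 module (the payload is made-up sample data, not a real workunit):

```python
import base64
import os

raw = os.urandom(3000)                 # stand-in payload, not a real WU
encoded = base64.b64encode(raw)        # 3 bytes -> 4 characters

print(len(raw), "->", len(encoded))    # 3000 -> 4000: one third larger
# Undoing the 33% overhead is just decoding, no dictionary coder needed:
assert base64.b64decode(encoded) == raw
```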

Therefore, the lowest hanging fruit to reduce bandwidth demand is to remove the base64 encoding for the data in the MB files.
If the newer AP workunits work without it, then why not for MB too?

Andy.

ML1
Volunteer tester
Joined: 25 Nov 01
Posts: 8263
Credit: 4,070,696
RAC: 516
United Kingdom
Message 914304 - Posted: 5 Jul 2009, 10:48:01 UTC - in response to Message 914159.

I'm not so sure about compression. My understanding was that random noise wouldn't compress very much at all as there are no patterns in it to exploit to compress. Now a WU with a signal might compress because of the pattern of the signal. ...

Very good idea, and that has parallels to the statistical analysis used to break the German Enigma codes.

But... Would you get anything that could be distinguished from the background noise? Even for when Arecibo is sweeping across the sky...?

Try the idea on a selection of VLARs?

Keep searchin',
Martin

____________
See new freedom: Mageia4
Linux Voice See & try out your OS Freedom!
The Future is what We make IT (GPLv3)

Spacey_88
Joined: 14 May 99
Posts: 1
Credit: 2,217,867
RAC: 0
United States
Message 914318 - Posted: 5 Jul 2009, 13:23:34 UTC

Getting back to a point much earlier in this thread about the system being designed not to be overloaded by noisy work because there are multiple splitters working on multiple files... Has anyone noticed that the splitter working on 05mr09af appears to be 'stuck'? It's been at the same point since the servers came back up on Tuesday...

I haven't watched to see if any of the others get stuck, but it may be worth looking at... Are there really 6 splitters working on 6 different files, or are there times when only one or two splitters are making work, compounding the problems with limited bandwidth?
____________

John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24055
Credit: 516,934
RAC: 134
United States
Message 914369 - Posted: 5 Jul 2009, 16:37:20 UTC - in response to Message 914206.

In my opinion, any dual-core processor doesn't spend a lot of time wrapping it up and compressing... a quad even less...


The processing bottleneck would be at the server, not at the client. Now imagine having to compress ten to twenty tasks per second in order to feed the outbound queue. Similar for decompressing the result data.
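A rough way to gauge that server-side cost is to time compression of a simulated task. Everything here is a hypothetical stand-in (the header layout and sizes are assumptions, not the real workunit format):

```python
import gzip
import os
import time

# Hypothetical MB-sized task: a small, highly compressible XML-ish header
# followed by a mostly-random payload.
header = b"<workunit><header>" + b"x" * 10_000 + b"</header>"
task = header + os.urandom(356_000)

n = 20
start = time.perf_counter()
for _ in range(n):
    compressed = gzip.compress(task, compresslevel=6)
elapsed = time.perf_counter() - start

print(f"~{n / elapsed:.0f} tasks/second on one core")
print(f"compressed to {len(compressed) / len(task):.1%} of original")
```

Whether the resulting tasks-per-second figure clears the ten-to-twenty needed to feed the outbound queue depends entirely on the server hardware and what else it is doing, which is John's point.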
____________


BOINC WIKI

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 914380 - Posted: 5 Jul 2009, 17:26:01 UTC - in response to Message 914369.

In my opinion, any dual-core processor doesn't spend a lot of time wrapping it up and compressing... a quad even less...


The processing bottleneck would be at the server, not at the client. Now imagine having to compress ten to twenty tasks per second in order to feed the outbound queue. Similar for decompressing the result data.

I agree that compression is tough.

... but what we're doing right now is taking the binary data and converting it to text. The mime type is x-setiathome, which is legal but non-standard, and I can't tell how efficient it is by looking at it. If it's like Base64, it's 4 characters for every 3 bytes; if it's like ASCII85, it's 5 characters for every 4 bytes.

Then it converts back on the other end.

Shifting from a text encoding to binary (the way AP does it) could save 15% or more, save storage on disk, etc.

Of course, the science application would have to understand more than one encoding method.
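The relative overheads of the two candidate text encodings can be checked with Python's standard base64 module (the payload is illustrative; the actual x-setiathome encoding is unknown here):

```python
import base64
import os

raw = os.urandom(4000)        # made-up payload, not a real workunit

b64 = base64.b64encode(raw)   # 4 chars per 3 bytes -> ~33% overhead
a85 = base64.a85encode(raw)   # 5 chars per 4 bytes -> ~25% overhead

print(f"base64:  {len(b64) / len(raw):.3f}x")
print(f"ascii85: {len(a85) / len(raw):.3f}x")
```

Raw binary, of course, would be 1.0x, which is why dropping the text encoding entirely is the bigger win.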
____________

ML1
Volunteer tester
Joined: 25 Nov 01
Posts: 8263
Credit: 4,070,696
RAC: 516
United Kingdom
Message 914384 - Posted: 5 Jul 2009, 17:46:45 UTC - in response to Message 914380.

... Shifting from a text encoding to binary (the way AP does it) could save 15% or more, save storage on disk, etc.

Of course, the science application would have to understand more than one encoding method.

The code is already there as used in AP...

Just add an "if" statement? (And someone to code it.)

Happy crunchin',
Martin

____________
See new freedom: Mageia4
Linux Voice See & try out your OS Freedom!
The Future is what We make IT (GPLv3)

Gary Charpentier
Volunteer tester
Joined: 25 Dec 00
Posts: 12056
Credit: 6,372,698
RAC: 8,548
United States
Message 914389 - Posted: 5 Jul 2009, 17:57:53 UTC - in response to Message 914380.

it my opinion, any dual-core processor spends not a lot of time wrapping it up and compressing....a quad even less....


The processing bottleneck would be at the server, not at the client. Now imagine having to compress ten to twenty tasks per second in order to feed the outbound queue. Similar for decompressing the result data.

I agree that compression is tough.

... but what we're doing right now is taking the binary data and converting it to text. The mime type is x-setiathome, which is legal but non-standard, and I can't tell how efficient it is by looking at it. If it's like Base64, it's 4 characters for every 3 bytes; if it's like ASCII85, it's 5 characters for every 4 bytes.

Then it converts back on the other end.

Shifting from a text encoding to binary (the way AP does it) could save 15% or more, save storage on disk, etc.

Of course, the science application would have to understand more than one encoding method.

re mime type
This morning it came to my attention that we've been sending out workunits with the "application/x-troff-man" mime type. This was because files with numerical suffixes (like workunits) are assumed to be man pages. This may have been causing some blockages at firewalls. I changed the mime type to "text/plain."


I get the feeling some Apache tables have been borked.

Elsewhere I believe someone said they were UUcodes. About the worst choice from a bandwidth standpoint but perhaps the only 100% compatible choice at the time BOINC was put in place. Today I'm sure that a different encoding can be chosen.

As to compression, and some tests people have run: first, don't forget that when you compress the file, you are compressing a file that was expanded to be transmitted as plain text. At least one of every 8 bits is a zero, so 12.5% compression is automatic in that case. Whether it's UUCode or Base64 or anything else, none of these use all 128 possible 7-bit configurations, so even more automatic compression is a given. But all this automatic compression is illusory: to transmit, you have to expand back to plain text.

The numbers I saw quoted looked like MB is UUCode and AP is Base64 with random incompressible data inside as expected.

Perhaps the best compression would be to transmit in a full 8-bit binary mode. I'm not sure, however, whether that is 100% compatible with all the equipment in use on the net. I know that everything sold today is, but who has what 20-year-old box with a couple of percent of SETI users behind it?

____________

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 914418 - Posted: 5 Jul 2009, 19:21:38 UTC - in response to Message 914389.

re mime type
This morning it came to my attention that we've been sending out workunits with the "application/x-troff-man" mime type. This was because files with numerical suffixes (like workunits) are assumed to be man pages. This may have been causing some blockages at firewalls. I changed the mime type to "text/plain."


I get the feeling some Apache tables have been borked.

That's talking about the HTTP header, and the only thing that would look at that is some nosy, paranoid firewall (they're supposed to be paranoid).

I'm talking about the internal type in the XML-ish work-unit.

That is x-setiathome. It's a private encoding, or it wouldn't start with "x-" and it doesn't have to follow UUENCODE because SETI@Home "owns" both ends of the process.

Looking at the data, it isn't hex, which would be kind-of dumb. UUENCODE and Base64 look a lot alike, and are about as efficient at six bits per character.

It could be something like ASCII85, which encodes four bytes into five characters.

What it isn't is raw binary.

Binary could work because HTTP can handle binary (or you couldn't have graphics on web pages; GIFs and JPEGs are binary), and it would have the least CPU load, especially on the server end.
____________

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 914421 - Posted: 5 Jul 2009, 19:23:36 UTC - in response to Message 914384.

... Shifting from a text encoding to binary (the way AP does it) could save 15% or more, save storage on disk, etc.

Of course, the science application would have to understand more than one encoding method.

The code is already there as used in AP...

Just add an "if" statement? (And someone to code it.)

Happy crunchin',
Martin

... and test it, and then time to deploy, and it'd be nice to give the optimized apps a chance to add it to their apps.

The actual coding is probably the easiest part.
____________

Jason
Joined: 30 Mar 07
Posts: 1
Credit: 106,096
RAC: 0
Australia
Message 914561 - Posted: 6 Jul 2009, 1:54:25 UTC - in response to Message 914048.



There's still a cost issue there. SETI@Home leases space from the University, and has to work within the constraints set forth by the University's dictates, which include power requirements for both the servers and the air conditioning to cool the servers. Then there's the issue of staff wages (they do try to get paid for their work).

I mention this because if the plan is to get other universities involved, you are effectively doubling the financial strain on the project. Other scientists at other universities will have to purchase servers (or look for donations), lease space, pay for their power usage, pay for the connection to the internet, etc. And of course those scientists will want to get paid as well.



How about, rather than getting other unis involved, you get Nvidia involved? The deciding factor for me between an ATI card and an Nvidia card was CUDA. Both cards play the games I want to play, but only one will crunch. Amp up the CUDA support messages on the SETI website and I'm sure Nvidia would find hosting *all* the SETI boxes with a monster pipe a very profitable move. Win-win-win all round: Berkeley gets the boxes and their associated costs off site, we get WUs to crunch, Nvidia gets advertising worth millions, and the SETI team spends money on research rather than air conditioning and leased space. Cheers, Jason =:)

zpm
Volunteer tester
Joined: 25 Apr 08
Posts: 284
Credit: 1,462,477
RAC: 3,299
United States
Message 914566 - Posted: 6 Jul 2009, 2:09:17 UTC - in response to Message 914561.
Last modified: 6 Jul 2009, 2:09:46 UTC

Well, hey, that would be nice, but people are working on an OpenCL platform that will allow Nvidia and ATI cards to crunch... It would go much faster if ATI weren't so "do it yourself"... and if that came out, there goes the incentive for Nvidia.
____________

I recommend Secunia PSI: http://secunia.com/vulnerability_scanning/personal/
Go Georgia Tech.



Copyright © 2014 University of California