Message boards :
Number crunching :
News from the top - file compression
Author | Message |
---|---|
Astro Send message Joined: 16 Apr 02 Posts: 8026 Credit: 600,015 RAC: 0 |
Email from Dr. Anderson: David Anderson to boinc_projects, boinc_dev More options 4:01 pm (6 minutes ago) Libcurl has the ability to handle HTTP replies that are compressed using the 'deflate' and 'gzip' encoding types. Previously the BOINC client didn't enable this feature, but starting with the next version of the client (5.4) it does. This means that BOINC projects will be able to reduce network bandwidth to data servers (and possibly server disk space) by using HTTP compression, without mucking around with applications. This is described here: http://boinc.berkeley.edu/files.php#compression -- David Interesting. |
Francis Noel Send message Joined: 30 Aug 05 Posts: 452 Credit: 142,832,523 RAC: 94 |
Very interesting indeed [snip]Use the Apache 2.0 mod_deflate module to automatically compress files on the fly. [/snip] So depending on the compression ratio, the bandwidth usage could drop quite a bit. Consider the added workload on the webserver's CPU though, as it seems it'll be the one compressing said data. Since the "deflate" occurs on the fly, the received data can't be saved to disk in the compressed state for deferred decompression. Just my 2c mambo |
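For reference, a minimal sketch of the kind of mod_deflate setup the quoted doc describes. This is ordinary Apache 2.0 mod_deflate/mod_headers usage, not taken from any actual BOINC server config:

```apache
# Sketch only - not a real BOINC server config.
<IfModule mod_deflate.c>
    # Compress text-like responses on the fly; files that are already
    # packed (e.g. .gz result files) should be left alone.
    AddOutputFilterByType DEFLATE text/plain text/html text/xml
    # Requires mod_headers: tell caches to keep the compressed and
    # uncompressed variants apart.
    Header append Vary Accept-Encoding
</IfModule>
```

Because the compression happens per-request, this is exactly the on-the-fly CPU cost the post is worried about; pre-compressed `.gz` files on disk avoid it.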
Michael Send message Joined: 21 Aug 99 Posts: 4608 Credit: 7,427,891 RAC: 18 |
Very interesting indeed I agree. I don't see any real benefit to anyone. It doesn't save disk space, and it increases CPU load. Although, it DOES decrease bandwidth usage. I am going to take a WU and compress it with gzip and see what can be saved. |
Michael Send message Joined: 21 Aug 99 Posts: 4608 Credit: 7,427,891 RAC: 18 |
356K -rw-r--r-- 1 madcow users 354K 2006-02-24 19:07 18mr01ab.23782.14977.186078.1.217 268K -rw-r--r-- 1 madcow users 266K 2006-02-24 19:09 18mr01ab.23782.14977.186078.1.217.gz |
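The same measurement can be repeated anywhere with Python's gzip module. The payload below is a synthetic stand-in (the filename and contents are made up), and repetitive synthetic data compresses far better than the ~75%-of-original result shown above for a real WU:

```python
import gzip

# Hypothetical stand-in for a workunit file; real WU data is much less
# repetitive and compresses far less well than this.
data = b"18mr01ab sample workunit payload\n" * 10_000

compressed = gzip.compress(data)
ratio = len(compressed) / len(data)
print(f"{len(data)} -> {len(compressed)} bytes ({ratio:.1%} of original)")
```

Running this against an actual downloaded WU file instead of the synthetic payload reproduces the experiment in the post.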
John McLeod VII Send message Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0 |
It was quickly discovered that it broke a couple of programs that did gzip for themselves (CPDN and Einstein, I believe). It has been removed for the time being, and will be put back so that it is based on a tag in the file description. BOINC WIKI |
SURVEYOR Send message Joined: 19 Oct 02 Posts: 375 Credit: 608,422 RAC: 0 |
My computers always upload and download two wu at a time. If BOINC would only do one at a time that would cut the bandwidth usage in half. The Alpha project is testing 4.4 meg units that process in a minute, but take 8 to 10 minutes to upload, which slowed my DSL to below dialup speed. If I'm on the internet I have to suspend the project [cpu] with uploads to surf the net. I had made the same comment a year ago, but who am I, just another ALPHA tester. Just my 2.25 cents Fred BOINC Alpha, BOINC Beta, LHC Alpha, Einstein Alpha |
MikeSW17 Send message Joined: 3 Apr 99 Posts: 1603 Credit: 2,700,523 RAC: 0 |
My computers always upload and download two wu at a time. How's that? You will still download the second WU and use the bandwidth, just later on. If you're going to download 5Mb of WUs a day, then it makes no difference if you d/l 1Mb in 5 lots over the day or one lot of 5Mb; you still need 5Mb of data transfer a day. |
Tigher Send message Joined: 18 Mar 04 Posts: 1547 Credit: 760,577 RAC: 0 |
My computers always upload and download two wu at a time. I think he might have meant it would reduce contention at a particular point in time rather than that there would be less work overall? |
Astro Send message Joined: 16 Apr 02 Posts: 8026 Credit: 600,015 RAC: 0 |
Since I started this, I feel it necessary to add in the next conversations as a point of clarity for everyone.
Bruce Allen (Einstein) replied: Bruce Allen <xxxxxxxxxx@gravity.phys.uwm.edu> to David, boinc_projects, boinc_dev Feb 24 (15 hours ago) David, some projects (including E@H) are already sending/returning files which are 'zipped'. We need to make sure that the cgi file_upload_handler program does not automatically uncompress files unless this has been requested specifically by the project. Cheers.
Then David came back with: [boinc_alpha] compression bug in 5.3.21 David Anderson to boinc_alpha Feb 24 (14 hours ago) We quickly found that the support for gzip compression breaks Einstein@home and CPDN, which do their own compression. We're fixing this and it will be in 5.3.22. -- David
Then he threw out a proposal: [boinc_dev] gzip compression proposal David Anderson to boinc_dev Feb 24 (13 hours ago) My earlier email about gzip compression (and the 5.3.20 client) wasn't completely thought out. Having the client blindly accept gzip encoding breaks projects (like Einstein and CPDN) that already use .gz files and decompress them in the application. I fixed this (in 5.3.21) by having Curl not accept gzip encoding. But it would be nice to allow application-transparent gzip encoding, since (compared with "deflate") it saves disk space on the server. Here's my proposal for how we can do this without breaking current projects: A <file_info> element (e.g. in a workunit or app_version) can include an optional <gzip/> element. Let's say the file name (the <name> element) is 'foobar'. This tells BOINC that: - the filename on the server is 'foobar.gz' - when downloading the file, Curl should accept the gzip encoding - the filename of the (uncompressed) file on the client is 'foobar'. Can anyone foresee problems with this? David
5.3.22 was released yesterday. One user has reported that it does not work as a fix for the compression issue. Note: in hindsight I wish I had never said a thing. LOL |
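Pieced together from David's description above, the proposed markup would look something like this. This is a sketch of the proposal only, not a released BOINC schema; everything beyond `<name>` and `<gzip/>` is inferred:

```xml
<!-- Sketch of the proposed scheme, following David's email; this layout
     is not taken from an actual BOINC workunit template. -->
<file_info>
    <name>foobar</name>  <!-- name of the (uncompressed) file on the client -->
    <gzip/>              <!-- tells BOINC: the server stores foobar.gz, and
                              Curl should accept gzip encoding on download -->
</file_info>
```

The point of the opt-in tag is that projects like Einstein and CPDN, which ship `.gz` files and decompress them inside the application, simply never emit `<gzip/>` and are untouched.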
SURVEYOR Send message Joined: 19 Oct 02 Posts: 375 Credit: 608,422 RAC: 0 |
If BOINC would transfer one WU at a time there is a better chance of not timing out, plus it would reduce the WU requests by 1/2. One WU at a time rather than two should give a better chance of completing the one WU before starting on the second. The server would have 1/2 the load. Just my 2 cents Fred BOINC Alpha, BOINC Beta, LHC Alpha, Einstein Alpha |
Jim-R. Send message Joined: 7 Feb 06 Posts: 1494 Credit: 194,148 RAC: 0 |
If BOINC would transfer one WU at a time there is a better chance of not timing out, plus it would reduce the WU requests by 1/2. One WU at a time rather than two should give a better chance of completing the one WU before starting on the second. The server would have 1/2 the load. Unless the system were combining both WU's into one "package" to send, I don't see where it would make any difference on a heavily loaded server. If the server is not sending a WU to you it will be sending one to another user. If both WU's were combined into one package, failure of any part of it would cause a failure for both WU's. Sending them individually would allow the possibility of one getting through even if the other were trashed. Also, a server doesn't send two of anything at the same time. Data is split up into packets, each with an address of the receiving computer. One packet is sent at a time; however, to us "slow" users it seems like you are actually getting two (or 10 or 20) WU's at a time. What is really happening is, if the server were just sending two WU's out, it sends a packet for WU1, then one for WU2, then another WU1 and another WU2, etc., until all the packets that make up both WU's are received. If WU1 gets trashed in the transfer, it's still possible that WU2 would be received correctly. And as I said before, on a heavily loaded server, if it weren't sending you two WU's it would be sending one to you and countless others to other users. I don't know the speed of the network they are using, but it could easily run into the "thousands" of WU's "apparently" (as I said, to us slow humans) being sent at one time, when in reality only *one* packet is going to *one* WU at any particular instant in time. Back to the topic of this thread: file compression would reduce the number of individual packets that need to be sent to make up a WU. 
I've seen numbers saying about a 33% saving in file size, so if it took 300 packets to make up a WU, the same WU could be sent compressed in 200 packets. If the number of WU's being requested at a time remained the same, this would reduce the load on the server by 1/3; or, if the load on the server remained the same, that same server could serve half again as many WU's (1/(2/3) = 1.5x) in a given period of time. Jim Some people plan their life out and look back at the wealth they've had. Others live life day by day and look back at the wealth of experiences and enjoyment they've had. |
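The packet arithmetic above can be checked directly. The WU size and the ~1500-byte MTU below are illustrative assumptions chosen to reproduce the 300-packet figure, not measured values:

```python
import math

MTU = 1500                # rough payload bytes per Ethernet packet (assumed)
wu_size = 450_000         # hypothetical workunit size in bytes
wu_gzip = wu_size * 2 // 3  # ~33% smaller; integer math keeps it exact

packets_plain = math.ceil(wu_size / MTU)  # 450,000 / 1500 = 300 packets
packets_gzip = math.ceil(wu_gzip / MTU)   # 300,000 / 1500 = 200 packets
print(packets_plain, packets_gzip)        # -> 300 200
```

Fewer packets per WU means the same server-side packet rate moves correspondingly more WU's.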
Michael Send message Joined: 21 Aug 99 Posts: 4608 Credit: 7,427,891 RAC: 18 |
356K -rw-r--r-- 1 madcow users 354K 2006-02-24 19:07 18mr01ab.23782.14977.186078.1.217 268K -rw-r--r-- 1 madcow users 266K 2006-02-24 19:09 18mr01ab.23782.14977.186078.1.217.gz I agree with the 33% reduction in file size... I suspect a greater load on the servers though during compress/uncompress. I am not sure what the true benefit would then be other than bandwidth/disk use... there WILL be a greater load on the CPU. |
Jim-R. Send message Joined: 7 Feb 06 Posts: 1494 Credit: 194,148 RAC: 0 |
Yep, I should have pointed that out. There will be overhead in it that I was not calculating, but at the time I was trying to address the point that you do not get "two wu's at the same time" even though it may seem like it. So even if you try to download one or a thousand, they are only going to be sent one packet at a time. So by the time I addressed the issue of compression I left out some of the finer points. Yes, the CPU would be worked harder to compress the files, but if they were somehow compressed to begin with it would help. It would transfer the extra load to, say, the splitter. Split the wu's and compress them, then send them to the server to be distributed. The splitter would work a little harder but the server wouldn't be loaded as much. And the receiving end doesn't care where they were compressed. Jim Some people plan their life out and look back at the wealth they've had. Others live life day by day and look back at the wealth of experiences and enjoyment they've had. |
Michael Send message Joined: 21 Aug 99 Posts: 4608 Credit: 7,427,891 RAC: 18 |
Hmm, true. I wonder if the client side could do the uncompress? I say pass that overhead off to the users... it would be minimal on our end. |
Michael Send message Joined: 21 Aug 99 Posts: 4608 Credit: 7,427,891 RAC: 18 |
I timed how long it takes to compress and uncompress: Compress madcow@madness2:~/ZIP_TEST> time gzip -c 17my01aa.28386.15936.840898.1.125 >> 17my01aa.28386.15936.840898.1.125.gz real 0m0.057s user 0m0.032s sys 0m0.008s madcow@madness2:~/ZIP_TEST> Uncompress madcow@madness2:~/ZIP_TEST> time gunzip 17my01aa.28386.15936.840898.1.125.gz real 0m0.008s user 0m0.004s sys 0m0.000s madcow@madness2:~/ZIP_TEST> It's fairly minimal... until you start doing hundreds at a time. watcha thunk? |
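The same timing experiment in portable form, for anyone without the original WU file. The payload here is a made-up stand-in of roughly the same size as the ~350K file timed above, so absolute times will differ from the shell run:

```python
import gzip
import time

# Hypothetical payload of roughly workunit size (~350 KB); not real WU data.
data = b"17my01aa sample payload line\n" * 12_000

t0 = time.perf_counter()
packed = gzip.compress(data)       # analogous to `gzip -c file`
t1 = time.perf_counter()
unpacked = gzip.decompress(packed)  # analogous to `gunzip file.gz`
t2 = time.perf_counter()

print(f"compress: {t1 - t0:.4f}s  decompress: {t2 - t1:.4f}s")
```

As in the shell timings, decompression is typically several times cheaper than compression, which is why pushing the decompress to the client is the cheap half of the deal.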
Tigher Send message Joined: 18 Mar 04 Posts: 1547 Credit: 760,577 RAC: 0 |
In the discussions on these boards about 10 months ago and again in the dev discussions last July the idea was "taken apart" to see how it would benefit. Individual projects make their own arrangements and some are already compressed. There was NO WARMTH for this idea amongst devs last July and Dr A was silent on the matter. I wonder why he now sees it as a good idea? I think it was more than adequately argued back then as a worthwhile idea even just to overcome the horrendous server capacity problems prevailing at that time. I will stress again my support for this just to get the reduced data sets. I will add again that the data sets should be merged and data moved in bulk and not on a single WU basis. This will see a massive reduction in TCP overheads and thoroughly de-stress the servers! Here's looking forward to being ignored for I think the fourth time by the devs on this one....albeit they raised it this time around. |
trux Send message Joined: 6 Feb 01 Posts: 344 Credit: 1,127,051 RAC: 0 |
I have wondered for a long time why the compression mechanism is not used at S@H, especially since the zip libraries have been present in the BOINC client source code for a long time. I understand that the CPU load might have been a concern when downloading WU's, but (as Honza already pointed out on our forum) there is no reason for not using compression when downloading the applications: they need to be compressed only once for all the hundreds of thousands of downloads, saving a great part of the bandwidth. So, for example, imagine the pending transition to S@H Enhanced - the new application together with the FFTW library that is included weighs 2.63MB. When zipped, it compresses to 1.13MB (more than 50% compression). Now, there are ~835,000 hosts in S@H. If all of them were to download the new application, it would represent 2,196,050 MB ~= 2.2TB! By simply using compression a single time on the server side (meaning virtually no ongoing CPU load), you can cut the bandwidth consumption by more than 50%, to just ~1TB! Well, I know that probably not all hosts will download the new application, but still we are speaking about really huge volumes of data. Also, for the WU's the CPU load is actually much lower (at least in the download direction) - you need to compress each WU only once, but each of them is downloaded by users at least four times - meaning the CPU load per WU is at least four times lower than if it were compressed for each download. trux BOINC software Freediving Team Czech Republic |
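trux's back-of-the-envelope numbers check out; here they are worked through explicitly, using his figures (2.63MB application, 1.13MB zipped, ~835,000 hosts, with 1 TB taken as 1,000,000 MB):

```python
hosts = 835_000        # approximate S@H host count, from the post
app_mb = 2.63          # uncompressed application + FFTW library
zipped_mb = 1.13       # the same package after zip

plain_tb = hosts * app_mb / 1_000_000    # total download volume, uncompressed
zipped_tb = hosts * zipped_mb / 1_000_000  # total volume if served pre-zipped

print(f"uncompressed: {plain_tb:.2f} TB  zipped: {zipped_tb:.2f} TB")
# -> uncompressed: 2.20 TB  zipped: 0.94 TB
```

The key asymmetry is that the file is compressed once but downloaded hundreds of thousands of times, so the per-download server CPU cost is effectively zero.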
MikeSW17 Send message Joined: 3 Apr 99 Posts: 1603 Credit: 2,700,523 RAC: 0 |
My computers always upload and download two wu at a time. Would there be? Instead of 1 user downloading 2 WUs at any moment, at some future instant there would be 2 users downloading 1 unit each. Same bandwidth requirement (possibly marginally higher due to connection setup)? |
Jim-R. Send message Joined: 7 Feb 06 Posts: 1494 Credit: 194,148 RAC: 0 |
My computers always upload and download two wu at a time. It would be very nearly the same since it would still be sending out two sets of packets, and depending on the way the splitter/wu creation works, since the different wu's created have different id numbers, I'm assuming that they are really separate files, (unless there is one base file and the header id info is added as it is sent out) to two different destinations (the packets would be addressed to two separate users instead of both packets going to one user.) There would be just a slight chance of packet collisions with only one client receiving both sets of packets, however it shouldn't be enough to jeopardize the transfers. I have routinely downloaded anywhere from 3 or 4 to dozens of files at one time and the download bandwidth was distributed pretty much evenly between downloads. Very few would fail. Sure there might have been collisions where a few packets would be missed and have to be resent, but it was not enough to notice in the actual download speeds. And something else to consider, it has already been seen that a normal wu can be compressed by 1/3 so that would mean 1/3 less packets have to be delivered so that means 1/3 less chance of collisions, 1/3 less bandwidth used for the same amount of data transfer, all with just a slight bit more cpu usage, and again, why would the server necessarily have to be the one to compress the data. It could be compressed on whatever machine (splitter?) actually creates the wu's. and stored in the database in it's compressed format. There would probably be a little bit of overhead here though, as the header would have to be read as it was entered into the database, so it would have to be uncompressed into memory while the relevant data was read from the header. I'm more or less rambling around with ideas here since I don't know exactly how the process works and when and where things get done. 
But as Trux was saying, using compression would save a lot of bandwidth with just a small overhead in CPU usage. Even on the fairly small WU's, saving 100kb each doesn't sound like much until you multiply it by thousands of WU's. Jim Some people plan their life out and look back at the wealth they've had. Others live life day by day and look back at the wealth of experiences and enjoyment they've had. |
W-K 666 Send message Joined: 18 May 99 Posts: 19080 Credit: 40,757,560 RAC: 67 |
You all seem to be going on about bandwidth here, but where is the CPU power coming from to do the compression/decompression at the server end? The image, linked below, is from Tom's Hardware, and the amount compressed is about 800 * one WU, but the splitters sometimes get up to 20/sec (thanks Scarecrow), so that's only 40 secs of splitter time. And the CPUs at UCB are not as powerful as a Pentium D 900 series. |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.