Message boards : Technical News : Mirror (Feb 26 2009)

Matt Lebofsky (Joined: 1 Mar 99, Posts: 1444, Credit: 957,058, RAC: 0)

Random day today for me: catching up on various documentation/sysadmin/data pipeline tasks. Not very glamorous.

The question was raised: why don't we compress workunits to save bandwidth? I forget the exact arguments, but I think it's a combination of small things that, when added together, make this a very low priority:

1. The programming overhead to the splitters, clients, etc. - however minor it may be, it's still labor and (even worse) testing.
2. The concern that binary data will freak out some incredibly protective proxies or ISPs (the traffic is all going over port 80).
3. The amount of bandwidth we'd gain by compressing workunits is relatively minor considering the possible effort of making it so.
4. This is really only a problem (so far) during client download phases - workunits alone don't really clobber the network except for short, infrequent events (like right after the weekly outage). We may actually be implementing better download logic so that Coral Cache isn't used via a redirect, which may solve this latter issue.

Anyway... this idea comes up from time to time within our group, and we usually determine we have bigger fish to fry. Or lower-hanging fruit.

Oh - I guess that's the end of this month's thread title theme: names of lakes in or around the Sierras that I've been to.

- Matt

-- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

PhonAcq (Joined: 14 Apr 01, Posts: 1656, Credit: 30,658,217, RAC: 1)

I understand that when the bandwidth only spikes for 'short' periods of time, it may not be a priority to fix it. However, per a not-so-recent news release, S@H's need for more computers and volunteers was going to increase many-fold. In light of this hoped-for need, can one actually afford to deprioritize the bandwidth issue?

More generally, if a virtual stress test could be performed with 10x more hosts than are currently involved (the news release suggests a potential factor of 500 - yikes!), what parts of S@H wouldn't scale and would need to be re-engineered?

Matt Lebofsky (Joined: 1 Mar 99, Posts: 1444, Credit: 957,058, RAC: 0)

> I understand that when the bandwidth only spikes for 'short' periods of time, it may not be a priority to fix it. However, per a not-so-recent news release, S@H's need for more computers and volunteers was going to increase many-fold. In light of this hoped-for need, can one actually afford to deprioritize the bandwidth issue?

Well, to clarify - bandwidth is definitely a priority. Solving it by compressing workunits? Not so much... It wasn't a priority in the past for the reasons I stated earlier, and it isn't a priority in the near future since we'll need a lot more bandwidth than compressing workunits would provide. This is why we're exploring the bigger options mentioned in an earlier thread.

- Matt

-- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Dr. C.E.T.I. (Joined: 29 Feb 00, Posts: 16019, Credit: 794,685, RAC: 0)

As usual - Thanks for the Posting, Matt - It's really appreciated here . . . < wonderin' how many ever notice that re: the quote ;) >

BOINC Wiki . . . Science Status Page . . .

1mp0£173 (Joined: 3 Apr 99, Posts: 8423, Credit: 356,897, RAC: 0)

> this is really only a problem (so far) during client download phases - workunits alone don't really clobber the network except for short, infrequent events (like right after the weekly outage).

Question: could the client binaries themselves be located off-site (or closer to your feed), perhaps a single server just for those files?

John McLeod VII (Joined: 15 Jul 99, Posts: 24806, Credit: 790,712, RAC: 0)

> this is really only a problem (so far) during client download phases - workunits alone don't really clobber the network except for short, infrequent events (like right after the weekly outage).

That technique was causing quite a bit of the grief. The problem was the redirect, which apparently a large number of firewalls and ISPs do not allow. The ISP substitutes a page noting that this is not allowed, and the substitute page downloads "correctly". BOINC thinks it has the file until it does a checksum, at which point all tasks that relied on that file error out, new ones are downloaded, and they run into the same problem. Churning like this is a good way of filling the bandwidth. If the BOINC client knew where to go look for the files instead of following a redirect, it would work better. Another possibility would be to have something like BitTorrent, where BOINC would ask the tracker for the files and the tracker would tell BOINC where to fetch them.

BOINC WIKI

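To make the churn described above concrete, here is a minimal sketch (in Python, not BOINC's actual code) of the kind of post-download checksum test that rejects a substituted "not allowed" page; the filename and hash are hypothetical:

```python
import hashlib

def verify_download(path, expected_md5):
    """Return True only if the downloaded file matches the expected checksum.

    A proxy/ISP that silently substitutes an HTML "not allowed" page still
    yields a "successful" download; the checksum is what exposes it.
    """
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            md5.update(chunk)
    return md5.hexdigest() == expected_md5

# Hypothetical usage - in reality the expected hash arrives with the app/workunit info.
if not verify_download("setiathome_app.exe", "0123456789abcdef0123456789abcdef"):
    print("checksum mismatch: discard the file, error out dependent tasks, re-fetch")
```
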
Josef W. Segur (Joined: 30 Oct 99, Posts: 4504, Credit: 1,414,761, RAC: 0)

> I understand that when the bandwidth only spikes for 'short' periods of time, it may not be a priority to fix it. However, per a not-so-recent news release, S@H's need for more computers and volunteers was going to increase many-fold. In light of this hoped-for need, can one actually afford to deprioritize the bandwidth issue?

The A2010 ALFALFA Spring 2009 observations will continue until the end of April, so the midrange work they provide gives a convenient period to decide how to handle the increase in VHAR 'shorties' after that. I guess the delivery pipeline would make the critical time the second or third week of May. I agree compression wouldn't be enough for a permanent solution, but it might ease problems temporarily.

Joe

Mike O (Joined: 1 Sep 07, Posts: 428, Credit: 6,670,998, RAC: 0)

Thanks Matt. I didn't know why compression wasn't being used - that's why I started with "Here's a crazy idea." It seems pretty clear now, actually. You would need to have a beta server and testers.

Not Ready Reading BRAIN. Abort/Retry/Fail?

1mp0£173 (Joined: 3 Apr 99, Posts: 8423, Credit: 356,897, RAC: 0)

> this is really only a problem (so far) during client download phases - workunits alone don't really clobber the network except for short, infrequent events (like right after the weekly outage).

Wearing my ISP hat for a moment, I'm not sure why an ISP would block the redirect. Corporate America is a different story.

Do the applications have to come from the same download servers as the work? Seems like the easiest solution would be to tell BOINC to look for clients.setiathome.ssl.berkeley.EDU or somesuch and download from there. Then that could be one, or several, distributed servers around the planet just based on the number of "A" records in DNS.

... or it could be mapped to the same IP on smaller projects.

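As a rough illustration of that DNS idea (the hostname below is the hypothetical one suggested above, not an existing record): publishing several A records for one name spreads clients across mirrors with no client-side changes.

```python
import socket

# Hypothetical hostname from the suggestion above - it does not exist today.
HOST = "clients.setiathome.ssl.berkeley.edu"

try:
    # gethostbyname_ex returns (canonical_name, aliases, address_list);
    # with multiple A records published, different clients pick different servers.
    name, aliases, addresses = socket.gethostbyname_ex(HOST)
    print(f"{name} resolves to {len(addresses)} address(es): {addresses}")
except socket.gaierror:
    print(f"{HOST} does not resolve (it is only an example)")
```
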
Bryan Price (Joined: 3 Apr 99, Posts: 3, Credit: 509,972, RAC: 0)

> Second, the concern that binary data will freak out some incredibly protective proxies or ISPs (the traffic is all going over port 80).

But binary data is already going over port 80. A lot of web pages already get compressed by the web server before they get transferred to the browser, which then proceeds to uncompress the stream to render it.

2/26/2009 10:59:55 PM||Libraries: libcurl/7.19.0 OpenSSL/0.9.8i zlib/1.2.3

That's why libcurl includes zlib: to unpack the data stream. The client appears to already be ready to at least receive, if not also send, compressed streams. Caveat vendor: I'm not a programmer, and looking at the documentation I don't see a way of handling it on the sending side - not that it would be necessary. And if the server is already doing THAT processing, zipping the file isn't going to help the bandwidth one little bit.

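One low-effort way to check whether a given download URL already arrives gzip-compressed is to request it with Accept-Encoding and inspect the response headers; a minimal sketch (the URL is a placeholder, not a real download link):

```python
import urllib.request

# Placeholder URL - substitute an actual workunit or application download link.
URL = "http://boinc.example.edu/download/some_file"

req = urllib.request.Request(URL, headers={"Accept-Encoding": "gzip"})
with urllib.request.urlopen(req) as resp:
    # "gzip" here means the server already compresses in transit, so
    # pre-zipping the files would buy little extra bandwidth.
    print("Content-Encoding:", resp.headers.get("Content-Encoding", "none"))
```
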
DJStarfox (Joined: 23 May 01, Posts: 1066, Credit: 1,226,053, RAC: 2)

I agree that implementing your own compression within BOINC would require a significant amount of work, and the rewards are not large enough given your time and financial constraints. However... have you considered HTTP-based compression? Apache could compress the work unit downloads in real time using gzip or deflate. You'd get about 27% smaller downloads this way. Here's a website that shows how to edit the Apache config file:

http://www.serverwatch.com/tutorials/article.php/10825_3514866_2

I'd have to check the BOINC code, but this should be very straightforward. The one disadvantage of this feature is that CPU load on the web servers will increase; memory requirements for compression are minimal.

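For reference, a minimal sketch of the kind of mod_deflate stanza that tutorial walks through (the module path and MIME types are assumptions and would need adjusting to the actual download servers):

```
# Enable on-the-fly compression of text-like responses (mod_deflate).
LoadModule deflate_module modules/mod_deflate.so

<IfModule mod_deflate.c>
    # Workunits travel as XML-wrapped text, so these types are the candidates.
    AddOutputFilterByType DEFLATE text/plain text/xml application/xml
    # Old clients with broken gzip handling can be excluded if needed.
    BrowserMatch ^Mozilla/4 gzip-only-text/html
</IfModule>
```

As noted above, whether the client side actually advertises Accept-Encoding is the piece that would need checking in the BOINC code.
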
DJStarfox (Joined: 23 May 01, Posts: 1066, Credit: 1,226,053, RAC: 2)

Oh, I didn't see your post, Bryan. Yes, if it's already being compressed via HTTP in real time, then there's no point in using gzip/deflate. But if it's not... then that should be a quick fix to enable.

Edit: tcpdump would show whether the data is compressed. Can someone run tcpdump while fetching work from SETI to verify?

Dirk Sadowski (Joined: 6 Apr 07, Posts: 7105, Credit: 147,663,825, RAC: 5)

BTW, for days now nearly every work request has resulted in the PC getting only one new WU, so BOINC asks again and again because the cache can't fill up. More server load... Is there a reason why this happens?

1mp0£173 (Joined: 3 Apr 99, Posts: 8423, Credit: 356,897, RAC: 0)

> But binary data is already going over port 80. A lot of web pages already get compressed by the web server before they get transferred to the browser, which then proceeds to uncompress the stream to render it.

I think we're forgetting GIFs and JPGs, to name a couple.

Richard Haselgrove (Joined: 4 Jul 99, Posts: 14679, Credit: 200,643,578, RAC: 874)

> That technique was causing quite a bit of the grief. The problem was the redirect, which apparently a large number of firewalls and ISPs do not allow. [...] If the BOINC client knew where to go look for the files instead of following a redirect, it would work better.

The problem wasn't so much with the location of the files on another server; it was with the technique used to tell the BOINC client where to collect the files from. Matt made the (very reasonable) point that 'caching' (of any description) means that the relief server doesn't have to be manually loaded with the new files when anything changes. But subject to that limitation, and the need to issue Matt with a pair of long-range kicking boots in case anything goes wrong, establishing an application download server with a different URL closer to the head-end of the 1Gb/s feed might buy a bit of time.

Einstein does something like this - even supplying BOINC with a set of multiple download URLs - and it seems to work well in general, with just occasional glitches if a mirror site goes off-air. Comparing notes with Einstein might throw up some more ideas.

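A minimal sketch of the mirror idea in client-side terms (the URLs are invented, and this is not how the BOINC client is actually structured): publish several download URLs per file and simply fall through to the next one when a mirror is off-air.

```python
import urllib.request
import urllib.error

# Invented mirror URLs, purely to illustrate the failover idea.
MIRRORS = [
    "http://mirror-a.example.edu/apps/",
    "http://mirror-b.example.org/apps/",
]

def fetch_from_mirrors(filename):
    """Try each mirror in turn; a dead mirror just means a retry elsewhere."""
    for base in MIRRORS:
        try:
            with urllib.request.urlopen(base + filename, timeout=30) as resp:
                return resp.read()
        except (urllib.error.URLError, OSError):
            continue  # this mirror is off-air or unreachable: try the next one
    raise RuntimeError(f"no mirror could supply {filename}")
```
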
Andy Lee Robinson (Joined: 8 Dec 05, Posts: 630, Credit: 59,973,836, RAC: 0)

> However... have you considered HTTP-based compression? Apache could compress the work unit downloads in real time using gzip or deflate. You'd get about 27% smaller downloads this way.

gzip compression over HTTP is great for servers that generate significant html/xml/css/js traffic, and while it's advisable to use on this forum, that traffic is a tiny fraction of the WU bandwidth requirements.

Compressibility is related to entropy: in terms of compression, a given block of data has varying degrees of 'sponginess'. You could take a large sponge, compress it, feed it through a gas pipe, and it expands again on reaching the exit. Same principle. However, we are analysing noise that is random and has no identifiable redundancy - to any compressor it is indistinguishable from concrete!

The only viable solution I'd recommend is to repackage the XML workunits with binary CDATA instead of base64 encoding. I don't agree that this would cause any problems with routing or content filtering - HTTP has to handle all forms of binary transmission, or we would have no web! If the apps are properly XML-compliant and don't make assumptions about content encoding, they should not need to be rewritten. Transmission should be transparent regardless of content encoding, and much bandwidth could be saved, as with Astropulse WUs.

Andy.

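A quick numeric sketch of both points: random noise doesn't compress, and base64 inflates the payload by about a third, which is the part that binary packaging (or on-the-fly deflate) can claw back. The sizes and encoding assumption below are illustrative only:

```python
import base64
import os
import zlib

payload = os.urandom(256 * 1024)       # stand-in for 256 KB of raw receiver noise

b64 = base64.b64encode(payload)        # roughly how the data rides inside the XML
deflated = zlib.compress(b64, 9)       # what gzip/deflate could do to that text

print(f"raw binary     : {len(payload):>7} bytes")
print(f"base64 encoded : {len(b64):>7} bytes  (~33% larger)")
print(f"base64+deflate : {len(deflated):>7} bytes (recovers most of the overhead,")
print("                 but the noise itself never shrinks below its raw size)")
```
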
Robert (Joined: 2 May 00, Posts: 5, Credit: 12,853,177, RAC: 10)

I just received a 1200+ hour Astropulse work unit (the computer is a P4 at 2.65 GHz) - isn't that kinda long?

Cosmic_Ocean (Joined: 23 Dec 00, Posts: 3027, Credit: 13,516,867, RAC: 13)

> That technique was causing quite a bit of the grief. The problem was the redirect, which apparently a large number of firewalls and ISPs do not allow. [...] Churning like this is a good way of filling the bandwidth.

I was kind of thinking more along the lines of: don't try to download any work for an application that still needs to be downloaded until the application itself has downloaded successfully. The problem was that WUs were being assigned, BOINC was saying "hey, I need this application to run these tasks", it would try to get the application, the app download would fail, all the WUs would be dumped, and the process would start over again at the next connect interval.

Another thing that would have greatly helped keep that situation from spiraling out of control is what I believe would be a better way to do the quota system. Instead of doubling the quota for each good result returned, it should just be +2. It was pointed out that if you are on an 8-CPU system and you turn in 800+ bad tasks, it only takes eleven (11) good results to bring the quota back to 800; a 4-CPU host takes 10, and a 2-CPU host only takes 8. Then there are the CUDA quotas that were thrown into the mix as well, with their own multiplier. I think +2 instead of 2x would keep problem computers at bay very nicely. It doesn't even have to be +2 - it can be +5 - just as long as it's not multiplication.

Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving up)

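To put rough numbers on that, here is a small sketch comparing how fast a trashed daily quota climbs back under a doubling rule versus an additive one. The floor, cap, and 8-CPU figure of 800 are assumptions taken from the example above, so the counts may differ by one or two from the figures quoted:

```python
def results_to_recover(start, target, bump):
    """Count how many good results it takes to climb from `start` back to `target`."""
    quota, steps = start, 0
    while quota < target:
        quota = bump(quota)
        steps += 1
    return steps

TARGET = 800   # assumed cap: 8 CPUs x 100 tasks/day
FLOOR = 1      # assumed quota left after a long run of errored tasks

print("doubling rule:", results_to_recover(FLOOR, TARGET, lambda q: q * 2), "good results")
print("plus-two rule:", results_to_recover(FLOOR, TARGET, lambda q: q + 2), "good results")
```
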
Kaylie (Joined: 26 Jul 08, Posts: 39, Credit: 333,106, RAC: 0)

This may be a dumb question, but... how do I know if I have a problem computer, and how can I fix it if I do? (Not wanting to be part of the problem.) Thanks.

Cosmic_Ocean (Joined: 23 Dec 00, Posts: 3027, Credit: 13,516,867, RAC: 13)

> This may be a dumb question, but...

A problem computer is one that returns an excessive number of errors. Each error reduces your daily CPU quota by one. Typically you want to be at 100 all the time, but something as simple as missing a deadline will count as an error and reduce the quota. You can look at your hosts page and see what the daily quotas are for all of your systems. If you find one that is lower than 90, there's a problem that needs to be addressed. Now that I've said that, there are many different problems and, of course, multiple solutions per problem, so I'm not going to make this post any longer than this.

Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving up)
