Thousand Island (Feb 23 2009)

Author	Message
Matt Lebofsky Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0	Message 868789 - Posted: 23 Feb 2009, 21:06:51 UTC Our outbound traffic has been pegged since Friday. This may seem like only a download problem, but it even affects uploads, as the basic syn/ack handshaking packets on the upload server get dropped along with the rest of the download packets that can't make it through the dam. After discussions with Eric and Jeff, here's what we gather is happening. We use coral cache to reduce our bandwidth needs. Coral cache is an easy-to-use, free, third-party system which does some nice distributed caching just by redirecting the right apache requests to their servers. For example, when somebody wants to download the latest astropulse client, they go to our download server and are then redirected automatically to the coral cache server. The redirect is of a form such that, if the coral cache server hasn't done so already, it downloads the latest astropulse client from us, caches it, and then sends it to the requester. Once cached, it doesn't need to contact our servers again. So, in essence, all but one of the client downloads originate from sources outside our lab, thus saving us lots of bandwidth. That brings us to problem 1. Many ISPs don't like redirects to third-party IPs. This is understandable. What happens in this case is a client downloads a new application, but instead of getting the actual executable they get a blob of HTML saying "this ISP doesn't like third party redirects," etc. Obviously the checksum of this HTML blob won't match the executable checksum, resulting in an application download checksum error. This has been a known problem. So we've only been using coral cache during the first couple of weeks after a new application is made available, to reduce the pain of the download rush. 
A small fraction of our users will be inconvenienced by those redirect errors, but they'll get their clients in due time when coral cache is turned off after the initial "wave." But then there's problem 2. An application download checksum error (a) doesn't cause exponential backoff and (b) causes all workunits also requested by this particular client to be errored out and resent. This is at least the behavior in older, yet still commonly used, BOINC clients. Dave said most of that has been addressed, but if there are still bugs they'll be fixed. In any case, what we saw this weekend was a confluence of these two problems. This may not have been an issue before due to lighter traffic patterns, but we sure fell off the deep end this time. Maybe there was a small set of heavily active clients this time around causing most of the pain. And once the network gets pegged, all hell breaks loose, and it takes a while to heal itself. Eric actually had most of this figured out before we arrived today, and had already turned off coral cache. At least the broken redirects spiraling out of control would stop happening. He also adjusted the tcp settings on the upload server to help get uploads partially working again (instead of only 2% of uploads getting through, now it's about 50%). The plan is to let this current state of indigestion pass on its own, and if needed change some BOINC settings (if not also BOINC code) so that future coral cache attempts will be direct links as opposed to apache redirects. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude ID: 868789 ·
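
To make problem 2 concrete, here is a minimal Python sketch of what checking a download against its checksum and backing off exponentially after a mismatch could look like. It is not BOINC code: the function names, the use of MD5, the 60-second base delay and the 4-hour cap are all assumptions chosen for illustration.

```python
import hashlib


def checksum_ok(payload: bytes, expected_md5: str) -> bool:
    """Return True if the downloaded bytes hash to the expected hex digest.

    MD5 is an assumption here; the thread does not say which hash is used.
    """
    return hashlib.md5(payload).hexdigest() == expected_md5.lower()


def backoff_seconds(failures: int, base: float = 60.0, cap: float = 4 * 3600.0) -> float:
    """Delay before the next retry: base * 2^(failures - 1), limited to cap.

    The base and cap values are hypothetical, not BOINC defaults.
    """
    if failures <= 0:
        return 0.0
    return min(cap, base * 2 ** (failures - 1))


def next_retry_delay(payload: bytes, expected_md5: str, failures: int) -> tuple[int, float]:
    """Return (updated failure count, seconds to wait before retrying).

    A good download resets the failure count; a checksum mismatch (for
    example an ISP's HTML error page instead of the executable) increments
    it, so repeated failures are retried less and less often.
    """
    if checksum_ok(payload, expected_md5):
        return 0, 0.0
    failures += 1
    return failures, backoff_seconds(failures)
```

With a rule like this, a client that keeps receiving an HTML blob instead of the executable would wait 60, 120, 240, ... seconds between attempts rather than retrying immediately, which is the behavior the post says was missing on checksum errors.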

PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1	Message 868802 - Posted: 23 Feb 2009, 21:33:05 UTC Thanks Matt for trying to quell the burning angst out here. ID: 868802 ·

nemesis Send message Joined: 12 Oct 99 Posts: 1408 Credit: 35,074,350 RAC: 0	Message 868813 - Posted: 23 Feb 2009, 22:30:49 UTC hey Phon, watch it with the "burning angst"! there could be kids reading this! jk Thanks for the info Matt! ID: 868813 ·

Mike O Send message Joined: 1 Sep 07 Posts: 428 Credit: 6,670,998 RAC: 0	Message 868816 - Posted: 23 Feb 2009, 22:37:35 UTC So if I'm understanding this, the upload/download freeze was caused by software? We, or rather I, started a fundraising effort to try to help with the connection problem we are seeing more and more of lately. We talked and debated over what would be the best cure for this. 1. A better connection for the servers to the net. or.. 2. A faster server to deal with the higher demands caused by CUDA or.. 3. ? Matt, you are the expert here and no one will dispute that so please.. What is the number one thing that would help fix the jam ups? There are a few, including me, that want to help in any way we can. Even though we are mostly broke, we can help pull some donations in and rally for a specific goal by a specific date. Tell us what you need and maybe the community can find a way to make it so. One more thing.. THANK YOU FOR YOUR HARD WORK and I mean HARD WORK on this project! It has to be pure insanity at times. Not Ready Reading BRAIN. Abort/Retry/Fail? ID: 868816 ·

arkayn Volunteer tester Send message Joined: 14 May 99 Posts: 4438 Credit: 55,006,323 RAC: 0	Message 868817 - Posted: 23 Feb 2009, 22:38:50 UTC - in response to Message 868789. "That brings us to problem 1. Many ISPs don't like redirects to third-party IPs" Some anti-virus/anti-spyware software also flags the redirect as suspect. Trend Micro Web Reputation - Feedback Submission Form URL: http://boinc2.ssl.berkeley.edu.nyud.net/sah/download_fanout/coral/astropulse_5.03_windows_intelx86.exe Current rating: This URL is currently listed as malicious. Trend Micro Web Reputation Query - Online System Type a website in the field below to: • Check its reputation ranking/score • Submit feedback about a certain website Complete website: Only HTTP and HTTPS are supported. (e.g., http://www.trendmicro.com) Web reputation result: This URL is currently listed as malicious. ID: 868817 ·
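
The flagged URL above shows the shape of the coral cache redirect: the original host name with `.nyud.net` appended. A small Python sketch of that rewrite follows; the function name `coralize` is made up for illustration, and the un-suffixed origin URL used in the usage note is an assumption inferred from the flagged URL rather than something stated in the thread.

```python
from urllib.parse import urlsplit, urlunsplit

# Suffix taken from the flagged URL quoted in the post above.
CORAL_SUFFIX = ".nyud.net"


def coralize(url: str) -> str:
    """Return url with the coral cache suffix appended to its host name.

    URLs without a host, or already carrying the suffix, are returned
    unchanged so the rewrite is safe to apply twice.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    if not host or host.endswith(CORAL_SUFFIX):
        return url
    netloc = host + CORAL_SUFFIX
    if parts.port:
        # Preserve an explicit port if the original URL had one.
        netloc += f":{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

For example, `coralize("http://boinc2.ssl.berkeley.edu/sah/download_fanout/coral/astropulse_5.03_windows_intelx86.exe")` yields the `boinc2.ssl.berkeley.edu.nyud.net` URL that Trend Micro flagged. Because the host name changes, ISP filters and security tools treat the download as coming from an unrelated third party.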

Dr. C.E.T.I. Send message Joined: 29 Feb 00 Posts: 16019 Credit: 794,685 RAC: 0	Message 868843 - Posted: 23 Feb 2009, 23:29:32 UTC . . . Thanks for the Updates Matt - Eric, whatever you did - it worked Sir Accolades to Each of You @ Berkeley - hard work IS your middle names . . . BOINC Wiki . . . Science Status Page . . . ID: 868843 ·

Matt Lebofsky Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0	Message 868856 - Posted: 24 Feb 2009, 0:15:00 UTC - in response to Message 868816. Last modified: 24 Feb 2009, 0:15:14 UTC "So if I'm understanding this, the upload/download freeze was caused by software?" Well, not software so much as the implementation of a service used to get around our bandwidth limitations. So yeah, the main bottleneck when problems like this arise is our connection to the internet, which maxes out at 100Mb/sec (though we are paying for 1Gb/sec - long story). We discussed several solutions today at our general meeting. Each has its major cons. All will be quite expensive in time and/or dollars. So the key right now is to work with what we've got as best we can (which, 99% of the time, is plenty) while exploring improvements for the future. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude ID: 868856 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 868859 - Posted: 24 Feb 2009, 0:23:43 UTC I did some quick googling over the weekend, and was surprised to come up with a quote of around $3,000 for the most visible raw ingredient - a mile of armoured, rodent-proof, direct-burial, 24-core single mode 9/125 LSOH cable (Optix CST). Would it be possible to break down the current budget element of $80,000 for the network upgrade a bit further, so we can see where the other $77,000 would go? ID: 868859 ·

SoNic Send message Joined: 24 Dec 00 Posts: 140 Credit: 2,963,627 RAC: 0	Message 868860 - Posted: 24 Feb 2009, 0:26:47 UTC - in response to Message 868859. Last modified: 24 Feb 2009, 0:31:44 UTC "Would it be possible to break down the current budget element of $80,000 for the network upgrade a bit further, so we can see where the other $77,000 would go?" Is that cable the kind that buries itself? And eventually terminates itself in the fiber patch panel too? :) Now, digging would not cost too much if you can do it by renting a "Ditch Witch" and hiring a smart contractor, but if pavement is in the way, things start to get ugly. And yes, you need to know what is already buried there so you don't dig through electrical cables or irrigation lines. ID: 868860 ·

1mp0£173 Volunteer tester Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0	Message 868861 - Posted: 24 Feb 2009, 0:31:18 UTC - in response to Message 868860. "Would it be possible to break down the current budget element of $80,000 for the network upgrade a bit further, so we can see where the other $77,000 would go?" "Is that cable the kind that buries itself? And eventually terminates itself in the fiber patch panel too? :)" ... and handles all its own routing? :) Sorry, I couldn't help it. The biggest cost is probably labor. ... and it all has to be done with stuff that CNS likes, because CNS is going to be responsible for it once it is installed. Most likely, the "breakdown" is: SETI@Home: "Hello, CNS, we want more bandwidth, what will it cost?" CNS: "We need to upgrade routers, dig holes, run fiber, $80,000?" ... and since CNS is the only game on campus, that's what it costs. I'm sure their pricing is reasonable. ID: 868861 ·

KWSN Sir Clark Volunteer tester Send message Joined: 17 Aug 02 Posts: 139 Credit: 1,002,493 RAC: 8	Message 868869 - Posted: 24 Feb 2009, 1:00:34 UTC So Eric had it figured out already...... Methinks we're looking in the wrong place. Any chance of getting Arecibo to check his brain out for psychic radio signals or such like. Thanks for the heads-up and keep up the great work. ID: 868869 ·

Mike O Send message Joined: 1 Sep 07 Posts: 428 Credit: 6,670,998 RAC: 0	Message 868870 - Posted: 24 Feb 2009, 1:11:06 UTC Thanks Matt for the info.. So the hardware is capable 99% of the time. That's comforting news :) As most of us had figured, it's the 100meg plug. At 80k to utilize the 1G, that's gonna be hard to get raised any time soon. HOWEVER.. I will try :) Also.. There are many that would like to make small donations, under 10 dollars. Is there a way that Blurf can set up a PayPal account so people can donate $5.00 if they can? It's not much but every dime helps. I know there may be legal issues.. but Blurf is a very trusted member of the community. If Blurf held the money and then just made a lump donation in his name, wouldn't that skip that issue? There will need to be a tax shelter however.. a non-profit org? Also, a post office box where people can mail small checks would probably be convenient for others. He could fund the PO box with part of the donations gathered? Of course, Blurf would need to be willing to take on the responsibility of handling the donations. I'm just trying to find revenue in any way that's legal ;) Eric.. You're OK in my book too! Can't wait to see the fruit of your labors! Not Ready Reading BRAIN. Abort/Retry/Fail? ID: 868870 ·

PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1	Message 868879 - Posted: 24 Feb 2009, 1:43:00 UTC What percentage of that 90+% bandwidth we saw was due to some sort of error? Could any of that be regained by improved boinc software? (I'm still hoping somebody can give us a 'better' caching algorithm that doesn't hit the server so much.) ID: 868879 ·

Neil Blaikie Volunteer tester Send message Joined: 17 May 99 Posts: 143 Credit: 6,652,341 RAC: 0	Message 868901 - Posted: 24 Feb 2009, 2:36:56 UTC My recommendation for raising funds quickly would be to approach companies with a major sponsorship drive. The tax year is coming to a close and a donation to an educational institution is tax deductible. It would need a cruncher or group of crunchers with a good solid background in financial planning to be able to offer larger companies anything. If a guy can make himself a millionaire by selling pixels on a screen as advertising, then something new could be done to help Berkeley in some way. For those who don't remember it or never saw it, I posted a link: millionairepixel. 16,000 people donating $5 = $80,000; yes, a small amount of money each, but a large number of people required. It also depends on whether campus would allow anything like that in the first place, since SETI staff are all part of the bigger Berkeley picture. A faster server with a huge number of drives and enough RAM to cope with constant I/O, I'm sure, wouldn't go amiss either. As Matt mentioned some time ago somewhere, they also have to check that the "lab" can take any more high power equipment with the current electrical supply they have; overload = disaster. Kudos to each of you at Berkeley though for the amazing hard work you guys have done to get things running smoothly again. Hope everything goes equally smoothly tomorrow during the outage. ID: 868901 ·

gomeyer Volunteer tester Send message Joined: 21 May 99 Posts: 488 Credit: 50,370,425 RAC: 0	Message 868911 - Posted: 24 Feb 2009, 3:18:03 UTC Last modified: 24 Feb 2009, 3:19:14 UTC Matt - That was so well explained that even I understood it. Thanks. [edit] BTW, I still think these rollouts should only be launched earlier in the week. [/edit] ID: 868911 ·

Nick Send message Joined: 17 May 99 Posts: 96 Credit: 17,356,094 RAC: 0	Message 868914 - Posted: 24 Feb 2009, 3:28:54 UTC It's certainly nice to see this communication and such but the question remains, when are we going to be able to resume operations? I can upload 1 or 2 completed tasks an hour but nothing downloads. ID: 868914 ·

Sandy Allen Send message Joined: 16 Feb 08 Posts: 2 Credit: 542,829 RAC: 0	Message 868927 - Posted: 24 Feb 2009, 3:54:43 UTC - in response to Message 868870. I agree with being able to donate via PayPal. I helped with one of the presidential campaigns. Every request for money was for $5, $10, $25..not $1K or $10K. Donations could be made in many forms, including at one point PayPal. It was ridiculous the amount of money that was raised this way. ID: 868927 ·

archae86 Send message Joined: 31 Aug 99 Posts: 909 Credit: 1,582,816 RAC: 0	Message 868931 - Posted: 24 Feb 2009, 4:09:05 UTC - in response to Message 868914. Last modified: 24 Feb 2009, 4:47:46 UTC "It's certainly nice to see this communication and such but the question remains, when are we going to be able to resume operations? I can upload 1 or 2 completed tasks an hour but nothing downloads." It depends on who "we" is. I have hosts with very low SETI work fractions which were able to get fresh work even late yesterday, as they did not have an excessive number of pending uploads. According to this message from the guy who maintains it, the relevant piece of software inhibits work requests when a host has a pending upload count greater than twice its number of CPUs. For a fast host with a high fraction of time devoted to SETI, this means it has to finish uploading the great majority of the work it completed but was unable to upload during the recent unpleasantness. Exponential backoff means that some of the older work won't even retry very often. Overall this is likely a good thing, as it spreads out the mass attack of work requests compared to what would otherwise occur. [edited for minor readability typo] ID: 868931 ·
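
The work-request rule described above can be written down in a few lines. This is a sketch of the rule as stated in the post (no new work while pending uploads exceed twice the CPU count), not the actual BOINC client code; the function and parameter names are hypothetical.

```python
def may_request_work(pending_uploads: int, ncpus: int) -> bool:
    """Return True if the host is allowed to ask the server for new work.

    Rule as described in the post: requests are inhibited when the number
    of pending uploads is greater than twice the number of CPUs.
    """
    return pending_uploads <= 2 * ncpus


# A 4-CPU host with 8 results waiting to upload may still request work;
# with 9 waiting it has to clear part of its backlog first.
print(may_request_work(pending_uploads=8, ncpus=4))
print(may_request_work(pending_uploads=9, ncpus=4))
```

This is why a slow host with little SETI work kept getting fresh tasks while fast hosts with a large upload backlog did not.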

Nick Send message Joined: 17 May 99 Posts: 96 Credit: 17,356,094 RAC: 0	Message 868932 - Posted: 24 Feb 2009, 4:16:16 UTC - in response to Message 868931. This is making sense. I have some older machines and they have been able to download new work units while at the same time my new fast machines, with a big backlog can't. ID: 868932 ·

Batman Send message Joined: 17 Dec 00 Posts: 8 Credit: 84,508 RAC: 0	Message 868960 - Posted: 24 Feb 2009, 7:54:18 UTC - in response to Message 868789. "But then there's problem 2. An application download checksum error (a) doesn't cause exponential backoff and (b) causes all workunits also requested by this particular client to be errored out and resent." This would explain why I have processed virtually nothing for the past month while my computer has been running almost 24x7?! After a week of processing an AP WU it gets errored out?! I am very upset about having all my time and computer resources I have donated for the past month wasted. Now that I have learned how to disable AP I won't be wasting any more effort on that. Please tell me that it is not true, maybe I'll turn it back on. ID: 868960 ·

©2024 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.