Message boards : Number crunching : Panic Mode On (28) Server problems
Matthew S. McCleary Send message Joined: 9 Sep 99 Posts: 121 Credit: 2,288,242 RAC: 0 |
It's situations such as this -- regardless of what the actual cause is -- that chase people away from crunching for SETI@home. Simply acknowledging that a problem exists and a solution is being looked for, whether the problem is Berkeley's or elsewhere, goes a long way towards calming everyone's nerves. We're not getting that, though, obviously. |
rebest Send message Joined: 16 Apr 00 Posts: 1296 Credit: 45,357,093 RAC: 0 |
With all due respect to Ned and Pappa, the Cricket Graphs don't lie. There has been a steady, overall reduction in throughput going back a week; well before the cooling went out in the closet. There are occasional upward spikes, to be sure, but the trend is obvious. Two weeks ago, everything was chugging along just fine and this thread was practically dormant. We understand about weekly outages and emergencies like the A/C. But something else is clearly not right. ???? Join the PACK! |
Roundel Send message Joined: 1 Feb 06 Posts: 21 Credit: 6,850,211 RAC: 0 |
Not sure if others have gone through since that one went through last night, but I'm now dry on a few machines and almost dry overall across the fleet. Can't upload, and now I'm getting errors that there are no jobs available on a dry machine. Oh well, all the hardware can take a much-needed rest. |
Roundel Send message Joined: 1 Feb 06 Posts: 21 Credit: 6,850,211 RAC: 0 |
Well, that's interesting, especially if you look at the monthly range. I hadn't noticed any connectivity problems until this whole situation arose at the beginning of the week. I wonder if a router or switch has been dying a slow death and finally gave up the ghost. |
PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
Yes, over the month there is an obvious trend, but look at the yearly chart; the recent performance is in the noise! (But don't tell Matt or he may defer fixing the problem to work on other issues.) |
PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
Monitoring my upload process, I see very few making it through at present. What is frustrating is that I see a lot that get as far as 100% uploaded, only to be rejected and queued up to try again. The last bit of handshaking fails and causes the system to repeat an upload that appears to have been completed. This is not a new observation. Because this type of failure obviously consumes bandwidth and server resources, and because the behavior has been around 'forever', has any effort been made to remedy it? |
Dorphas Send message Joined: 16 May 99 Posts: 118 Credit: 8,007,247 RAC: 0 |
Don't know what this may mean in the bigger picture, but I just had one machine upload about 50 workunits... but I can't get them to report at all. |
Highlander Send message Joined: 5 Oct 99 Posts: 167 Credit: 37,987,668 RAC: 16 |
My rumor: I think the last big power outage in the Bay Area damaged the ISP hardware, and the ISP has set up a 10 Mbit link for emergency use. But this is really only my guess about the situation. Whatever it really is, I hope it can all be fixed in the near future (many uploads waiting on my side ^^). - Performance is not a simple linear function of the number of CPUs you throw at the problem. - |
hiamps Send message Joined: 23 May 99 Posts: 4292 Credit: 72,971,319 RAC: 0 |
The only way I can get any to upload is keep pressing buttons...This project backoff is for the birds, I would rather see them fix the problem than cripple the client. Some get thru but then the project wants to backoff for 2 hours like that is going to do anything but delay the problem. Official Abuser of Boinc Buttons... And no good credit hound! |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
The only way I can get any to upload is keep pressing buttons...This project backoff is for the birds, I would rather see them fix the problem than cripple the client. Some get thru but then the project wants to backoff for 2 hours like that is going to do anything but delay the problem.
I did a bit of button-pushing this morning, and got one machine down to one upload pending (it only had about a dozen in total, so I wasn't adding much to the load!). Nothing on the reporting front, until it tried again of its own accord while I was on the phone at 15:24.
SETI@home 18/02/2010 15:24:41 Requesting 718981 seconds of new work, and reporting 10 completed tasks
SETI@home 18/02/2010 15:24:56 Scheduler RPC succeeded [server version 611]
SETI@home 18/02/2010 15:24:56 Message from server: (Project has no jobs available)
Says it all, really. |
Dave Send message Joined: 29 Mar 02 Posts: 778 Credit: 25,001,396 RAC: 0 |
I know it makes us feel good - and I'm the same - but remember that all this manual button-pushing actually makes things worse, because it's putting more load on the server. The backoffs, though annoying, are there to spread the load across the thousands of clients out there. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Yes, over the month there is an obvious trend, but look at the yearly chart; the recent performance is in the noise! (But don't tell Matt or he may defer fixing the problem to work on other issues.) Up to and including Week 3 on the monthly chart, they were only splitting tapes for MultiBeam work - they were wrestling with major Astropulse database problems. Astropulse splitting restarted during Week 4, and accounts for the higher average throughput since then (there hasn't been a regular supply of AP work since last May, and AP-crunchers' caches are drier than Death Valley). Every AP unit split gets gobbled up instantly. It's gone quiet again on the AP front now, because all loaded tapes have been split. Other peaks and troughs relate to the variety in Angle Range for the MB work recently: if a recording was made during a high AR sky survey, the resulting WUs are processed (and hence downloaded) at four(-ish) times the rate of other ARs. And the flatline since Monday is another story entirely..... |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
I know it makes us feel good - and I'm the same - but remember that all this manual button-pushing actually makes things worse, because it's putting more load on the server. The backoffs, though annoying, are there to spread the load across the thousands of clients out there. I haven't touched the buttons on the machines with 74 - 100 - 138 pending transfers, honest! |
Iona Send message Joined: 12 Jul 07 Posts: 790 Credit: 22,438,118 RAC: 0 |
I'm getting the same problems as everyone else.....one WU has been stuck at uploading for almost 3 days! If anything does manage to upload, then almost invariably, it does not get reported and if it does, I break out some Bollinger! Without a doubt, something is amiss with the comms....would some long lengths of string and a few tins be any better? Don't take life too seriously, as you'll never come out of it alive! |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
It's situations such as this -- regardless of what the actual cause is -- that chase people away from crunching for SETI@home. Simply acknowledging that a problem exists and a solution is being looked for, whether the problem is Berkeley's or elsewhere, goes a long way towards calming everyone's nerves. We're not getting that, though, obviously.
As I said this morning, I honestly believe that by the time they left the lab yesterday evening, the staff weren't aware that there was a communications problem. And remember that by "the staff", we are talking about a tiny number of heavily-multitasking individuals - of the eight people on the project page, two have left, one is still writing up his PhD thesis, and only four have any operational responsibility at all. Remember the timeline for this outage:
Started around 9am Monday - a National Public Holiday, when I doubt any of them had more than a cursory eye on the lab.
Tuesday - Matt's first day back after a week's holiday. Catch up, back up, start recovery - then the aircon blows.
Wednesday - get the temperatures under control, then start up the complicated inter-dependent mess of second-hand servers.
In the meantime, as hiamps' and my button-pushing experiments have shown, work is trickling back - slowly, but enough to register on their radar as "it's working" (Matt has said as much after previous semi-outages, like when one of the two download servers went down). It's at times like this that I - still - really miss having an official, technical channel for reporting problems direct to the heart of the ops room. These message boards don't meet the need, because there are too many false positives: most of the problems we discuss here relate to our own machines, and very few - two or three a year, at most - relate to Berkeley problems that the staff aren't already fully aware of. Technical News might be a better venue, but all too often - like last night, when it might have made a difference - that degenerates into general off-topic chit-chat too. And, to ride an old hobby-horse of mine - at other projects the Moderator team would step forward to fill the gap. They know my views on that, and I theirs - no need to reiterate. |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
With all due respect to Ned and Pappa, the Cricket Graphs don't lie. There has been a steady, overall reduction in throughput going back a week; well before the cooling went out in the closet. There are occasional upward spikes, to be sure, but the trend is obvious. .... and with all due respect, the Cricket graphs do not lie, but what they're saying is not always 100% obvious -- they measure just one parameter. Very strange things start to happen when you go from about 95% loading past 100% and up into the higher ranges.
Now, a lot of the rest is based on my own observations of systems I can look at directly, and parallel behaviour I'm seeing at SETI. For each TCP connection that is open, there is a control block. The server gets a TCP "SYN" packet, it creates a control block, and returns SYN+ACK. Once the connection is up, each packet that comes in to the server is searched against the control blocks for the one matching that packet (same source and destination IP and port), and the control block then matches the packet to the task processing it.
If you have 100 open connections, you have 100 control blocks, 100 threads, and everything goes pretty fast. If you have 10,000 open connections, searching the control blocks takes 100 times longer, and the operating system is managing 100 times more threads. A lot more goes to overhead. ... and when the server is spending too much time on overhead, it isn't answering new connections properly, or servicing the ones it has, and bandwidth goes DOWN.
Now, I can't see the internal server metrics, but I do know that by design SETI operates at higher than normal loads and is more likely to push into this strange realm where high loads show as low bandwidth. It's a bit like a SYN-flood attack, without the malice. |
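To make that point concrete, here is a toy model of the effect Ned describes: if every incoming packet costs a linear search through the table of open connections, the per-packet cost grows with load and useful throughput falls. This is purely illustrative; the function, constants, and costs below are invented for the sketch and have nothing to do with SETI's actual server code.

```python
# Toy model: useful throughput vs. number of open connections when every
# incoming packet costs a linear search of the connection table.
# All constants are made up for illustration; this is not SETI's server.

def goodput(connections, cpu_budget=1_000_000.0,
            work_per_packet=50.0, lookup_cost_per_entry=0.02):
    """Packets/sec of useful work completed.

    Each packet costs a fixed amount of real work plus a scan of the
    connection table (the 'control block' search), so the per-packet
    cost grows with the number of open connections.
    """
    cost_per_packet = work_per_packet + lookup_cost_per_entry * connections
    return cpu_budget / cost_per_packet

if __name__ == "__main__":
    for n in (100, 1_000, 10_000, 50_000, 100_000):
        print(f"{n:>7} connections -> ~{goodput(n):8.0f} useful packets/sec")
```

Run it and the useful packets/sec drops by well over an order of magnitude between 100 and 100,000 open connections, even though the server is "busier" than ever -- which is exactly the "high load shows as low bandwidth" symptom on the graphs.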
Bill Walker Send message Joined: 4 Sep 99 Posts: 3868 Credit: 2,697,267 RAC: 0 |
According to Cricket, downloads have started again. Expect it will take days to clear the backlog, though. That's why we run other projects. |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
The only way I can get any to upload is keep pressing buttons...This project backoff is for the birds, I would rather see them fix the problem than cripple the client. Some get thru but then the project wants to backoff for 2 hours like that is going to do anything but delay the problem. The correct fix is to make the backoffs much, much bigger, or to get someone to write a really, really big check every month for a bigger server room, more servers, more electricity, and more A/C. If the backoffs were dramatically bigger, then the majority of upload attempts that did happen would be successful, and the flow of inbound work would be near the theoretical maximum -- and the overall recovery would be faster. Backoffs are your friend. |
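For anyone wondering what "much, much bigger backoffs" buys, here is a rough sketch of the textbook idea of randomized exponential backoff. It is illustrative only, not the real BOINC client logic (whose constants and policy differ): each failure pushes the next retry further out, and the random jitter stops thousands of clients from retrying in lockstep.

```python
import random

# Sketch of randomized exponential backoff (illustrative only; not the
# actual BOINC client code).

def next_retry_delay(failures, base=60.0, cap=4 * 3600.0):
    """Seconds to wait before the next upload attempt.

    The delay doubles with each consecutive failure, is capped, and gets
    random jitter so clients don't all hit the server at the same instant.
    """
    delay = min(cap, base * (2 ** failures))
    return random.uniform(delay / 2, delay)  # spread retries out

if __name__ == "__main__":
    for failures in range(8):
        print(f"after {failures} failures: retry in ~{next_retry_delay(failures):7.0f} s")
```

With delays spread out like that, the attempts that do reach the server mostly succeed instead of colliding, which is why the overall recovery ends up faster even though each individual client waits longer.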
zoom3+1=4 Send message Joined: 30 Nov 03 Posts: 66298 Credit: 55,293,173 RAC: 49 |
Erhm, not sure if anyone noticed the news page; many here have already, as when I converted to Water Cooling. :D Now if only we could upload. Matt, where are you? Savoir-Faire is everywhere! The T1 Trust, T1 Class 4-4-4-4 #5550, America's First HST |
Rick Send message Joined: 3 Dec 99 Posts: 79 Credit: 11,486,227 RAC: 0 |
Seti lives on a very short shoestring. They do what they can with the funds at their disposal. When things are going as planned it's fine, but there's no headroom to deal with the massive loads that hit those same servers after an outage. Since there are no funds for a massive upgrade of the server farm to deal with these rare events, they have done the only thing they can, which is to program a safety net into the client: the backoff logic. That logic is actually a very reasonable way to give the servers a chance to dig their way out of a bad situation. It's really basic queueing theory. You have a limited resource, and in some cases you just can't service everyone at the same time, so you create a queue to keep things organized. Nobody likes being in the queue, but the alternative is much uglier. In the long run it's the only way to be fair and allow the machinery to work in an efficient manner. The backoff is a way of pushing the queues out into the field so the servers don't have to waste precious resources managing all those requests themselves. If we allow the process to do what it's supposed to do, everything will catch up eventually. |
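For the curious, the "basic queueing theory" Rick mentions can be illustrated with the classic M/M/1 textbook formula, where the average time a request spends in the system is W = 1 / (mu - lambda). This is only a textbook sketch with invented numbers, not a model of SETI's actual server configuration, but it shows why waiting time blows up as utilisation approaches 100% and why shedding load back onto clients via backoffs helps.

```python
# Textbook M/M/1 queue: average time a request spends in the system is
# W = 1 / (mu - lambda), with mu the service rate and lambda the arrival
# rate. Numbers below are invented purely for illustration.

def avg_time_in_system(arrival_rate, service_rate):
    """Mean seconds a request waits plus gets served (M/M/1, needs rho < 1)."""
    if arrival_rate >= service_rate:
        return float("inf")  # past 100% load the queue grows without bound
    return 1.0 / (service_rate - arrival_rate)

if __name__ == "__main__":
    service_rate = 100.0  # requests/sec the server can handle (made up)
    for utilisation in (0.50, 0.90, 0.95, 0.99, 1.00):
        w = avg_time_in_system(utilisation * service_rate, service_rate)
        print(f"load {utilisation:4.0%}: average time in system ~ {w:.2f} s")
```

The jump from 95% to 99% load roughly quintuples the average time in the system, and at 100% it never catches up at all, which is the "no headroom" situation the servers hit after an outage.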