Panic Mode On (28) Server problems

Matthew S. McCleary
Joined: 9 Sep 99
Posts: 121
Credit: 2,288,242
RAC: 0
United States
Message 971198 - Posted: 18 Feb 2010, 15:03:04 UTC

It's situations such as this -- regardless of what the actual cause is -- that chase people away from crunching for SETI@home. Simply acknowledging that a problem exists and a solution is being looked for, whether the problem is Berkeley's or elsewhere, goes a long way towards calming everyone's nerves. We're not getting that, though, obviously.
ID: 971198
Profile rebest Project Donor
Volunteer tester
Joined: 16 Apr 00
Posts: 1296
Credit: 45,357,093
RAC: 0
United States
Message 971200 - Posted: 18 Feb 2010, 15:03:54 UTC

With all due respect to Ned and Pappa, the Cricket Graphs don't lie. There has been a steady, overall reduction in throughput going back a week, well before the cooling went out in the closet. There are occasional upward spikes, to be sure, but the trend is obvious.

Two weeks ago, everything was chugging along just fine and this thread was practically dormant. We understand about weekly outages and emergencies like the A/C. But something else is clearly not right.

????

Join the PACK!
ID: 971200
Roundel
Joined: 1 Feb 06
Posts: 21
Credit: 6,850,211
RAC: 0
United States
Message 971201 - Posted: 18 Feb 2010, 15:04:49 UTC
Last modified: 18 Feb 2010, 15:06:28 UTC

Not sure if any others have gone through since that one went through last night, but I'm now dry on a few machines and almost dry overall across the fleet. Can't upload, and now I'm getting errors that there are no jobs available on a dry machine.
Oh well, all the hardware can take a much-needed rest.
ID: 971201
Roundel
Joined: 1 Feb 06
Posts: 21
Credit: 6,850,211
RAC: 0
United States
Message 971204 - Posted: 18 Feb 2010, 15:09:57 UTC

Well, that's interesting, especially if you look at the monthly range. I hadn't noticed any connectivity problems until this whole situation arose at the beginning of the week. I wonder if a router or switch has been dying a slow death and finally gave up the ghost.
ID: 971204
PhonAcq
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 971217 - Posted: 18 Feb 2010, 15:46:44 UTC - in response to Message 971200.  

Yes, over the month there is an obvious trend, but look at the yearly chart; the recent performance is in the noise! (But don't tell Matt or he may defer fixing the problem to work on other issues.)
ID: 971217
PhonAcq
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 971218 - Posted: 18 Feb 2010, 15:47:37 UTC - in response to Message 971162.  

Monitoring my upload process, I see very few making it through at present. What is frustrating is that I see a lot that get as far as 100% uploaded, only to be rejected and queued up to try again. The last bit of handshaking fails and causes the system to repeat work (the upload) that appears to have been completed. This is not a new observation.

Because it obviously takes bandwidth and server resources to execute this type of failure, and because the behavior has been around 'forever', has any effort been made to remedy it?
ID: 971218
Dorphas
Joined: 16 May 99
Posts: 118
Credit: 8,007,247
RAC: 0
United States
Message 971224 - Posted: 18 Feb 2010, 16:00:03 UTC
Last modified: 18 Feb 2010, 16:00:32 UTC

Don't know what this may mean in the bigger picture, but I just had one machine upload about 50 workunits... but I can't get them to report at all.
ID: 971224
Highlander
Joined: 5 Oct 99
Posts: 167
Credit: 37,987,668
RAC: 16
Germany
Message 971226 - Posted: 18 Feb 2010, 16:01:42 UTC

My rumor:

I think the last big power outage in the Bay Area damaged the ISP's hardware, and the ISP has set up a 10 Mbit link for emergency use.

But this is really only my guess about the situation.

Whatever it really is, I hope it can all be fixed in the near future (many uploads waiting on my side ^^).


- Performance is not a simple linear function of the number of CPUs you throw at the problem. -
ID: 971226
Profile hiamps
Volunteer tester
Joined: 23 May 99
Posts: 4292
Credit: 72,971,319
RAC: 0
United States
Message 971236 - Posted: 18 Feb 2010, 16:25:45 UTC

The only way I can get any to upload is to keep pressing buttons... This project backoff is for the birds; I would rather see them fix the problem than cripple the client. Some get through, but then the project wants to back off for 2 hours, as if that is going to do anything but delay the problem.
Official Abuser of Boinc Buttons...
And no good credit hound!
ID: 971236
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 971243 - Posted: 18 Feb 2010, 16:47:15 UTC - in response to Message 971236.  

The only way I can get any to upload is to keep pressing buttons... This project backoff is for the birds; I would rather see them fix the problem than cripple the client. Some get through, but then the project wants to back off for 2 hours, as if that is going to do anything but delay the problem.

I did a bit of button-pushing this morning, and got one machine down to one upload pending (it only had about a dozen in total, so I wasn't adding much to the load!).

Nothing on the reporting front, until it tried again of its own accord while I was on the phone at 15:24.

SETI@home	18/02/2010 15:24:41	Requesting 718981 seconds of new work, and reporting 10 completed tasks
SETI@home	18/02/2010 15:24:56	Scheduler RPC succeeded [server version 611]
SETI@home	18/02/2010 15:24:56	Message from server: (Project has no jobs available)

Says it all, really.
ID: 971243
Dave
Joined: 29 Mar 02
Posts: 778
Credit: 25,001,396
RAC: 0
United Kingdom
Message 971247 - Posted: 18 Feb 2010, 16:54:28 UTC

I know it makes us feel good - and I'm the same - but remember that all this manual button-pushing actually makes things worse, because it puts more load on the server. The backoffs, though annoying, are there to spread the load across the thousands of clients out there.
ID: 971247
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 971248 - Posted: 18 Feb 2010, 16:55:50 UTC - in response to Message 971217.  

Yes, over the month there is an obvious trend, but look at the yearly chart; the recent performance is in the noise! (But don't tell Matt or he may defer fixing the problem to work on other issues.)

Up to and including Week 3 on the monthly chart, they were only splitting tapes for MultiBeam work - they were wrestling with major Astropulse database problems.

Astropulse splitting restarted during Week 4, and accounts for the higher average throughput since then (there hasn't been a regular supply of AP work since last May, and AP-crunchers' caches are drier than Death Valley). Every AP unit split gets gobbled up instantly. It's gone quiet again on the AP front now, because all loaded tapes have been split.

Other peaks and troughs relate to the variety in Angle Range for the MB work recently: if a recording was made during a high AR sky survey, the resulting WUs are processed (and hence downloaded) at four(-ish) times the rate of other ARs.

And the flatline since Monday is another story entirely.....
ID: 971248
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 971249 - Posted: 18 Feb 2010, 16:57:37 UTC - in response to Message 971247.  

I know it makes us feel good - and I'm the same - but remember that all this manual button-pushing actually makes things worse, because it puts more load on the server. The backoffs, though annoying, are there to spread the load across the thousands of clients out there.

I haven't touched the buttons on the machines with 74 - 100 - 138 pending transfers, honest!
ID: 971249
Iona
Joined: 12 Jul 07
Posts: 790
Credit: 22,438,118
RAC: 0
United Kingdom
Message 971250 - Posted: 18 Feb 2010, 17:01:44 UTC

I'm getting the same problems as everyone else... one WU has been stuck at uploading for almost 3 days! If anything does manage to upload, then almost invariably it does not get reported, and if it does, I break out some Bollinger! Without a doubt, something is amiss with the comms... would some long lengths of string and a few tins be any better?



Don't take life too seriously, as you'll never come out of it alive!
ID: 971250
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 971254 - Posted: 18 Feb 2010, 17:22:43 UTC - in response to Message 971198.  

It's situations such as this -- regardless of what the actual cause is -- that chase people away from crunching for SETI@home. Simply acknowledging that a problem exists and a solution is being looked for, whether the problem is Berkeley's or elsewhere, goes a long way towards calming everyone's nerves. We're not getting that, though, obviously.

As I said this morning, I honestly believe that by the time they left the lab yesterday evening, the staff weren't aware that there was a communications problem. And remember that by "the staff", we are talking about a tiny number of heavily-multitasking individuals - of the eight people on the project page, two have left, one is still writing up his PhD thesis, and only four have any operational responsibility at all.

Remember the timeline for this outage:

Started around 9am Monday - a National Public Holiday, when I doubt any of them had more than a cursory eye on the lab.

Tuesday - Matt's first day back after a week's holiday. Catch up, back up, start recovery - then the aircon blows.

Wednesday - get the temperatures under control, then start up the complicated inter-dependent mess of second-hand servers.

In the meantime, as hiamps' and my button-pushing experiments have shown, work is trickling back - slowly, but enough to register on their radar as "it's working" (Matt has said as much after previous semi-outages, like when one of the two download servers went down).

It's at times like this that I - still - really miss having an official, technical channel for reporting problems direct to the heart of the ops room. These message boards don't meet the need, because there are too many false positives: most of the problems we discuss here relate to our own machines, and very few - two or three a year, at most - relate to Berkeley problems that the staff aren't already fully aware of. Technical News might be a better venue, but all too often - like last night, when it might have made a difference - that degenerates into general off-topic chit-chat too. And, to ride an old hobby-horse of mine - at other projects the Moderator team would step forward to fill the gap. They know my views on that, and I theirs - no need to reiterate.
ID: 971254
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 971256 - Posted: 18 Feb 2010, 17:26:43 UTC - in response to Message 971200.  
Last modified: 18 Feb 2010, 17:27:14 UTC

With all due respect to Ned and Pappa, the Cricket Graphs don't lie. There has been a steady, overall reduction in throughput going back a week, well before the cooling went out in the closet. There are occasional upward spikes, to be sure, but the trend is obvious.

.... and with all due respect, the Cricket graphs do not lie, but what they're saying is not always 100% obvious -- they measure just one parameter.

Very strange things start to happen when you go from about 95% loading past 100% and up into the higher ranges.

Now, a lot of the rest is based on my own observations of systems I can look at directly, and on parallel behaviour I'm seeing at SETI.

For each TCP connection that is open, there is a control block. When the server gets a TCP "SYN" packet, it creates a control block and returns SYN+ACK.

Once the connection is up, each time a packet comes in to the server, the server searches through the control blocks for the one matching that packet (same source and destination IP and port), and the control block then matches the packet to the task processing it.

If you have 100 open connections, you have 100 control blocks, 100 threads, and everything goes pretty fast.

If you have 10,000 open connections, searching the control blocks takes 100 times longer, and the operating system is managing 100 times more threads. A lot more goes to overhead.

... and when the server is spending too much time on overhead, it isn't answering new connections properly, or servicing the ones it has, and bandwidth goes DOWN.

Now, I can't see the internal server metrics, but I do know that by design SETI operates at higher than normal loads and is more likely to push into this strange realm where high loads show up as low bandwidth.

It's a bit like a SYN-Flood attack, without the malice.
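
The linear search described above is a simplification - real TCP stacks typically hash on the source/destination address and port 4-tuple - but the scaling effect is easy to see in a toy model. A minimal Python sketch, purely illustrative; the addresses, counts, and function names are invented and have nothing to do with SETI's actual servers:

import random

# Toy model of the control-block search described above: each open TCP
# connection is a (src_ip, src_port, dst_ip, dst_port) tuple, and every
# incoming packet is matched against the list with a linear scan.

def make_control_blocks(n):
    # 192.0.2.0/24 and 198.51.100.0/24 are documentation ranges; placeholders only.
    return [("192.0.2.%d" % (i % 250), 1024 + i, "198.51.100.1", 80)
            for i in range(n)]

def avg_scan_cost(blocks, packets=1000):
    """Average number of control blocks examined per incoming packet."""
    examined = 0
    for _ in range(packets):
        target = random.choice(blocks)          # the connection this packet belongs to
        for position, block in enumerate(blocks, start=1):
            if block == target:
                examined += position
                break
    return examined / packets

for n in (100, 10_000):
    blocks = make_control_blocks(n)
    print(f"{n:>6} open connections: ~{avg_scan_cost(blocks):.0f} blocks examined per packet")

With a hundred connections the scan is trivial; with ten thousand, every packet costs roughly a hundred times more work, and that overhead is exactly the kind of thing that shows up as falling bandwidth on the graphs.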
ID: 971256
Profile Bill Walker
Joined: 4 Sep 99
Posts: 3868
Credit: 2,697,267
RAC: 0
Canada
Message 971257 - Posted: 18 Feb 2010, 17:27:18 UTC

According to Cricket, downloads have started again. I expect it will take days to clear the backlog, though. That's why we run other projects.

ID: 971257
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 971259 - Posted: 18 Feb 2010, 17:31:46 UTC - in response to Message 971236.  
Last modified: 18 Feb 2010, 17:32:00 UTC

The only way I can get any to upload is to keep pressing buttons... This project backoff is for the birds; I would rather see them fix the problem than cripple the client. Some get through, but then the project wants to back off for 2 hours, as if that is going to do anything but delay the problem.

The correct fix is to make the backoffs much, much bigger, or to get someone to write a really, really big check every month for a bigger server room, more servers, more electricity, and more A/C.

If the backoffs were dramatically bigger, then the majority of upload attempts that did happen would be successful, and the flow of inbound work would be near the theoretical maximum -- and the overall recovery would be faster.

Backoffs are your friend.
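
For what it's worth, the standard way to do this is exponential backoff with random jitter, so retries spread out over time instead of arriving in synchronized waves. A minimal Python sketch of the idea - the base delay, cap, and function names are invented for illustration, and this is not BOINC's actual retry policy:

import random
import time

def backoff_delays(base=60.0, cap=4 * 3600.0, factor=2.0):
    """Yield retry delays: an exponentially growing ceiling with full random jitter, capped."""
    attempt = 0
    while True:
        ceiling = min(cap, base * (factor ** attempt))
        yield random.uniform(0, ceiling)   # jitter keeps clients from retrying in lockstep
        attempt += 1

def upload_with_backoff(try_upload, delays, max_attempts=10):
    """Call try_upload() until it succeeds or attempts run out, sleeping between tries."""
    for attempt in range(1, max_attempts + 1):
        if try_upload():
            return True
        if attempt < max_attempts:
            time.sleep(next(delays))
    return False

# Demo against a flaky "server" that accepts roughly 1 in 10 attempts;
# tiny delays are used here only so the example finishes quickly.
flaky_upload = lambda: random.random() < 0.1
print("uploaded:", upload_with_backoff(flaky_upload, backoff_delays(base=0.1, cap=2.0)))

The bigger the ceiling grows, the smaller the fraction of attempts that hit an already-saturated server - which is the trade-off described above: each individual upload waits longer, but the overall recovery finishes sooner.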
ID: 971259
Profile zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65709
Credit: 55,293,173
RAC: 49
United States
Message 971261 - Posted: 18 Feb 2010, 17:35:44 UTC - in response to Message 971186.  

Erhm, not sure if anyone noticed the news page:

Projects are down due to a server closet air conditioning failure.
We have to power down most of our computers until this is fixed. 17 Feb 2010 2:36:55 UTC

http://setiathome.berkeley.edu/index.php

Apologies if this has already been pointed out, but this could be what's going on :)

My friend, if the AC had not been fixed, we would not be talking right now.......the servers would still be down.

There is a comms problem that existed before the AC failure, and still persists.



I see ........ then I stand corrected :)

Still, it's a nice chance to give the PC a clean :)

I suspect many dust bunnies are meeting their maker about now.

Many have met their maker here already, when I converted to water cooling. :D

Now if only we could upload. Matt, where are you?
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 971261
Rick
Joined: 3 Dec 99
Posts: 79
Credit: 11,486,227
RAC: 0
United States
Message 971267 - Posted: 18 Feb 2010, 18:04:58 UTC

SETI lives on a very short shoestring. They do what they can with the funds at their disposal. When things are going as planned it's fine, but there's no headroom to deal with the massive loads that hit those same servers after an outage. Since there are no funds to do a massive upgrade of the server farm to deal with these rare events, they have done the only thing they can, which is to program a safety net into the client: the backoff logic. That logic is actually a very reasonable way to give the servers a chance to dig their way out of a bad situation.

It's really basic queueing theory. You have a limited resource and in some cases you just can't service everyone at the same time so you create a queue to keep things organized. Nobody likes being in the queue but the alternative is much uglier. In the long run it's the only way to be fair and allow the machinery to work in an efficient manner. The backoff is a way of pushing the queues out into the field so the servers don't have to waste precious resources managing all those requests themselves. If we allow the process to do what it's supposed to do, everything will catch up eventually.
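
That intuition can be made concrete with the textbook M/M/1 queue, where the average time a request spends in the system is 1/(mu - lambda) for arrival rate lambda and service rate mu: waits stay modest at half load and explode as utilization approaches 100%. A quick Python sketch with made-up rates (nothing to do with SETI's real capacity):

def mm1_time_in_system(arrival_rate, service_rate):
    """Average time a request spends in an M/M/1 queue (waiting plus service)."""
    if arrival_rate >= service_rate:
        return float("inf")        # demand exceeds capacity: the queue grows without bound
    return 1.0 / (service_rate - arrival_rate)

SERVICE_RATE = 100.0               # requests per second the server can handle (invented)

for utilization in (0.50, 0.90, 0.95, 0.99, 1.01):
    t = mm1_time_in_system(utilization * SERVICE_RATE, SERVICE_RATE)
    if t == float("inf"):
        print(f"utilization {utilization:.0%}: unbounded -- requests back up faster than they drain")
    else:
        print(f"utilization {utilization:.0%}: average time in system = {t * 1000:6.1f} ms")

Which is also why the observation further up the thread - that very strange things start happening past about 95% load - is exactly what the maths predicts.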
ID: 971267