Panic Mode On (77) Server Problems?


Message boards : Number crunching : Panic Mode On (77) Server Problems?

Grant (SSSF)
Joined: 19 Aug 99
Posts: 6391
Credit: 73,908,777
RAC: 49,098
Australia
Message 1287500 - Posted: 24 Sep 2012, 20:56:16 UTC - in response to Message 1287465.

I can track batches of shorties that my 24 core box is processing via its UPS load stats. It's only about a 2% drop in power when running shorties, but at that machine's power level it is apparent.
Snapshot of the past few weeks.
http://www.hal6000.com/seti/images/ups_load-month.png

What software are you using to monitor the UPS?
I bought a couple of cheap 1400VA UPSs & the software that came with them was crap.
____________
Grant
Darwin NT.

tbret
Project donor
Volunteer tester
Joined: 28 May 99
Posts: 3113
Credit: 232,379,832
RAC: 371,975
United States
Message 1287503 - Posted: 24 Sep 2012, 21:06:10 UTC

It's after noon in Berkeley and I haven't seen any indication that the fellas in the lab have changed anything.

I assume they know there is a problem with Uploads.

I've got computers showing "project backoff" times of up to almost six hours and similarly sad "retry" times.

The repeating wave on the incoming side of the cricket graph makes me think there is "something" that doesn't like "something" and whatever is broken is sneezing at semi-regular intervals, rather than the communications simply being slow because of an overabundance of data (shortie storm, AP distribution, the usual suspects).

The "sawtooth" shape of the thing (without correlation with the "outflow") makes me wonder. I hope it's making someone else wonder, too.

I don't care, of course, until there are so many un-uploaded (and therefore unreported) work units in all of our queues that all of the Clients need manual intervention to be able to "Update." We (who read here) will be fine, but I'd wager a fair number of crunchers aren't going to be willing to babysit the process for hours (retry now / update, retry now / update) to prevent the "report" queue and client_state files from becoming unmanageable for the Client.

I've noticed that several of mine have refused to auto "Update" while there are hundreds of Uploads stuck. Me? I keep up with it and try to mitigate the damage. Thank goodness I had a situation last night where I could babysit one of my rigs every few minutes for about 3 hours, "retrying" then "updating" something like 1,500 tasks (and that isn't even a particularly fast machine). I suspect that there are a lot of people who won't do that.

It really isn't important whether the work gets reported now or tomorrow or a week from next Thursday, but it would be a crying shame to have all that crunching eventually time out and go to waste because it has "broken" the Client's ability to deal with the backlog.
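
The "retry now / update" babysitting described above could, in principle, be scripted. What follows is only a minimal sketch, assuming the `boinccmd` command-line tool that ships with BOINC is on the PATH and that its `--network_available` and `--project URL update` options do what their documentation says. The project URL, the number of rounds and the pause length are assumptions for illustration, not values anyone in this thread used.

```python
import subprocess
import time

# Assumption: the SETI@home project URL as attached in the local client.
PROJECT_URL = "http://setiathome.berkeley.edu/"


def build_commands(project_url):
    """Return the two boinccmd invocations for one retry/update round."""
    return [
        # Ask the client to retry deferred network communication (stuck transfers).
        ["boinccmd", "--network_available"],
        # Ask the client to contact the project scheduler, i.e. "Update".
        ["boinccmd", "--project", project_url, "update"],
    ]


def babysit(project_url, rounds=3, pause_s=600, run=subprocess.run, sleep=time.sleep):
    """Run `rounds` retry/update rounds, pausing between them.

    `run` and `sleep` are injectable so the loop can be exercised without
    a BOINC client. Returns the number of commands issued.
    """
    issued = 0
    for i in range(rounds):
        for cmd in build_commands(project_url):
            run(cmd, check=False)  # a failed round is simply retried later
            issued += 1
        if i < rounds - 1:
            sleep(pause_s)
    return issued

# Example (needs a running BOINC client):
# babysit(PROJECT_URL, rounds=18, pause_s=600)
```

A long pause between rounds is deliberate: if the servers really are struggling, hammering them with retries is unlikely to help anyone.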

Bernie Vine
Project donor
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 7897
Credit: 35,359,553
RAC: 36,595
United Kingdom
Message 1287506 - Posted: 24 Sep 2012, 21:16:20 UTC - in response to Message 1287503.
Last modified: 24 Sep 2012, 21:21:30 UTC

Agree 100%. I have had to set all my crunchers to NNT because, unless I sit here 24/7, my fastest machine is completing WUs faster than they can be reported and the queue continues to grow. Imagine what it must be like on some of the top 10 machines.

I have just done a "ping" check to all the machines on the 208.xxx.xxx subnet I can see, and all of them show packet loss at a level my old company could not have worked at!!

PS I wouldn't "assume" that the lab knows anything is wrong. As we know, staffing levels are low, so there may actually be no one there.
____________
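
For anyone wanting to repeat the "ping" check and put a number on the packet loss, here is a rough sketch. It assumes a Unix-style `ping` whose summary line reads "N packets transmitted, M received" (Windows prints a different summary and uses `-n` rather than `-c`). The function names are made up for this example, and the host is left as a parameter because the thread does not give full addresses.

```python
import re
import subprocess

# Matches the Linux form ("7 received") and the macOS/BSD form ("7 packets received").
_SUMMARY = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def packet_loss_percent(ping_output):
    """Compute packet loss (percent) from the summary line of Unix ping output."""
    match = _SUMMARY.search(ping_output)
    if match is None:
        raise ValueError("no ping summary line found")
    sent, received = int(match.group(1)), int(match.group(2))
    if sent == 0:
        raise ValueError("no packets were sent")
    return 100.0 * (sent - received) / sent


def measure(host, count=10):
    """Ping `host` `count` times and return the observed loss in percent."""
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True,
        text=True,
        check=False,  # ping exits non-zero when packets are lost
    )
    return packet_loss_percent(result.stdout)
```

Calling `measure(host)` for each server of interest gives one loss figure per host that can be compared between a good day and a bad one.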

Waldo
Volunteer tester
Joined: 19 May 12
Posts: 174
Credit: 295,374,967
RAC: 173,506
United States
Message 1287508 - Posted: 24 Sep 2012, 21:26:42 UTC - in response to Message 1287500.
Last modified: 24 Sep 2012, 21:30:42 UTC

I can track batches of shorties that my 24 core box is processing via its UPS load stats. It's only about a 2% drop in power when running shorties, but at that machine's power level it is apparent.
Snapshot of the past few weeks.
http://www.hal6000.com/seti/images/ups_load-month.png

What software are you using to monitor the UPS?
I bought a couple of cheap 1400VA UPSs & the software that came with them was crap.

Hal would have to answer, but it looks like he is using rrdtool with SNMP or something similar that utilizes rrd.
This is just a guess and I might be wrong. Some of the newer web-enabled UPS cards do trending by default.

You might want to check out Cacti; it is one of the pieces of software I use.

PS: It is an endless cycle when WUs finish faster than they can be uploaded :(
I am sure someone is looking at it.
____________

Grant (SSSF)
Joined: 19 Aug 99
Posts: 6391
Credit: 73,908,777
RAC: 49,098
Australia
Message 1287542 - Posted: 24 Sep 2012, 23:55:09 UTC - in response to Message 1287508.

You might want to check out Cacti; it is one of the pieces of software I use.

Thanks.
When I get a chance I'll see if it recognises my UPS and can make sense of its output.
____________
Grant
Darwin NT.

Wiggo
Joined: 24 Jan 00
Posts: 9449
Credit: 115,790,281
RAC: 72,746
Australia
Message 1287553 - Posted: 25 Sep 2012, 0:34:28 UTC - in response to Message 1287542.

Well, I hope that during tomorrow's outage the guys work out what they didn't get right during the last one, as this past week has been very annoying.

Cheers.
____________

Bernie Vine
Project donor
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 7897
Credit: 35,359,553
RAC: 36,595
United Kingdom
Message 1287622 - Posted: 25 Sep 2012, 6:33:51 UTC - in response to Message 1287553.

Well, I hope that during tomorrow's outage the guys work out what they didn't get right during the last one, as this past week has been very annoying.

Cheers.

I certainly haven't had a problem "all week", just since late Saturday UTC. As there has been no fix, message or announcement on the front page, it seems likely they are still unaware of the problem.

If it is "fixed" during the outage, it is going to take a while to settle down!!
____________

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 5377
Credit: 147,536,379
RAC: 136,212
United States
Message 1287720 - Posted: 25 Sep 2012, 13:41:04 UTC - in response to Message 1287508.

I use MRTG instead of RRDtool. The cricket graphs we like to fret over so much are generated using software based on RRDtool.
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!
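
The observation quoted earlier in the thread, that shortie batches show up as a dip of about 2% in UPS load, can be turned into a simple check on whatever load samples MRTG or a similar tool records. This is only a sketch: it assumes the samples are UPS load percentages, reads "2%" as two percentage points below the typical (median) load, and the function name is invented for this example.

```python
from statistics import median


def shortie_periods(load_pct, drop=2.0):
    """Return indices of samples at least `drop` points below the median load.

    `load_pct` is a sequence of UPS load readings in percent. The median is
    used as the baseline so that a handful of low readings does not drag the
    reference level down.
    """
    baseline = median(load_pct)
    return [i for i, value in enumerate(load_pct) if baseline - value >= drop]


# Example with made-up readings: the two low samples are flagged.
samples = [60.0, 60.0, 60.0, 58.0, 57.5, 60.0, 60.0]
print(shortie_periods(samples))  # → [3, 4]
```

If shorties dominate the whole sampling window the median itself shifts, so a longer window (weeks rather than hours) gives a steadier baseline.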

Vepide
Volunteer tester
Joined: 2 Feb 08
Posts: 6
Credit: 103,183
RAC: 0
United States
Message 1287763 - Posted: 25 Sep 2012, 16:17:15 UTC - in response to Message 1286802.

I have two WUs downloading and it seems they've been downloading for days... when I hit retry it gets a few bytes then just craps out. It's not a problem with any other projects, just SETI. If they're using some kind of bandwidth throttling they should turn it off, or check to see if they have a failed NIC or something else in the pipeline.

Keith White
Joined: 29 May 99
Posts: 383
Credit: 3,646,571
RAC: 2,381
United States
Message 1287768 - Posted: 25 Sep 2012, 21:18:27 UTC
Last modified: 25 Sep 2012, 21:23:01 UTC

Well, the servers are up. It took all my results and reported them successfully, but the cricket graphs are still dead. Did you upgrade their backbone so our stuff isn't being piped through that link? I thought the new gigabit switch was only internal.

Edit: Oops, my bad, seems like I just happened to turn on internet access right at the point the servers came up. Looks like the cricket graph is now starting to go up.
____________
"Life is just nature's way of keeping meat fresh." - The Doctor

Akio
Project donor
Joined: 18 May 11
Posts: 336
Credit: 12,043,437
RAC: 39,235
United States
Message 1287801 - Posted: 25 Sep 2012, 22:52:57 UTC

No luck getting much to download here - says project has no tasks available. Must have to just wait it out for a bit.
____________

fscheel
Joined: 13 Apr 12
Posts: 73
Credit: 11,135,641
RAC: 0
United States
Message 1287805 - Posted: 25 Sep 2012, 23:00:34 UTC

Not downloading here either. :(

Bernie Vine
Project donor
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 7897
Credit: 35,359,553
RAC: 36,595
United Kingdom
Message 1287811 - Posted: 25 Sep 2012, 23:04:44 UTC

My main machine ran out of work due to the "upload" problem, now I have had 6 tasks, all shorties, all done.
Oh well see how things are shaping up tomorrow.
____________

Starman
Joined: 15 May 99
Posts: 160
Credit: 48,463,638
RAC: 67,716
Canada
Message 1287826 - Posted: 25 Sep 2012, 23:50:28 UTC

Well, overall I'm mostly good for CPU work. One machine has lots to keep it going for close to a week. Unfortunately, my brand new ATI 7870 GPU has nothing to impress me with, my ATI 4870 GPU will be out of work before I go to bed tonight, and my ATI 6300 (Turks) is also twiddling its thumbs. Oh well.
____________

Grant (SSSF)
Joined: 19 Aug 99
Posts: 6391
Credit: 73,908,777
RAC: 49,098
Australia
Message 1287836 - Posted: 26 Sep 2012, 0:14:52 UTC - in response to Message 1287801.

No luck getting much to download here - says project has no tasks available.

Same here.
And I notice the network traffic isn't maxed out, so something's gumming up the works somewhere.

____________
Grant
Darwin NT.

Starman
Joined: 15 May 99
Posts: 160
Credit: 48,463,638
RAC: 67,716
Canada
Message 1287839 - Posted: 26 Sep 2012, 0:30:48 UTC - in response to Message 1287826.

Well, just got 4 CPU WU, but the download was no better than the last 4-5 days.
____________

Wiggo
Joined: 24 Jan 00
Posts: 9449
Credit: 115,790,281
RAC: 72,746
Australia
Message 1287844 - Posted: 26 Sep 2012, 0:45:42 UTC - in response to Message 1287839.
Last modified: 26 Sep 2012, 0:49:28 UTC

It seems that the guys got onto whatever it was that was causing problems since the last outage, as all but my lowly P.O.S. Vista rig are back up to full (it surprised me, though, that that P.O.S. was able to report well over 400 in its first 2 hits at the servers once they came back up).

Now if the guys can just restrain themselves to 1 new tape at a time until the APs are done, things will go well.

Cheers.
____________

Starman
Joined: 15 May 99
Posts: 160
Credit: 48,463,638
RAC: 67,716
Canada
Message 1287845 - Posted: 26 Sep 2012, 0:53:23 UTC

What's interesting is that all the WUs I've been getting over the last 4-5 days are due on or around Oct. 5th, and all run at high priority.
____________

S@NL Etienne Dokkum
Volunteer tester
Joined: 11 Jun 99
Posts: 206
Credit: 21,677,765
RAC: 17,491
Netherlands
Message 1287889 - Posted: 26 Sep 2012, 5:08:44 UTC - in response to Message 1287811.

My main machine ran out of work due to the "upload" problem, now I have had 6 tasks, all shorties, all done.
Oh well see how things are shaping up tomorrow.


Nothing better, spewing out shorties like a mad man...

Bernie Vine
Project donor
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 7897
Credit: 35,359,553
RAC: 36,595
United Kingdom
Message 1287903 - Posted: 26 Sep 2012, 6:33:09 UTC
Last modified: 26 Sep 2012, 6:33:20 UTC

OK, so I believe that there is something WRONG. The graph has not been maxed out since it came back, yet this morning all my machines had tasks all backed off for 5-7 hours, and even with retry they may download 1 or 2 but go straight into backoff again. Why??

If the network is not maxed, why can't I connect?
____________
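
One plausible reason the backoff times look so large even when the link is not saturated: clients that back off exponentially after each failed attempt reach multi-hour delays after only a handful of consecutive failures, whatever caused those failures. The sketch below is a generic exponential backoff, not BOINC's actual algorithm. The base delay, growth factor and six-hour cap are assumptions chosen to match the figures mentioned in this thread, and real clients typically add some randomness as well.

```python
def backoff_delay(failures, base_s=60, factor=2, cap_s=6 * 3600):
    """Delay in seconds before the next attempt after `failures` consecutive failures.

    Grows as base_s * factor**failures and is clamped to cap_s.
    All parameter defaults are illustrative assumptions.
    """
    if failures < 0:
        raise ValueError("failures must be non-negative")
    return min(cap_s, base_s * factor ** failures)


# Delay after 0..9 consecutive failures, converted to hours for readability.
for n in range(10):
    print(n, round(backoff_delay(n) / 3600, 2))
```

With these assumed values the delay passes four hours after eight failures in a row and hits the six-hour cap on the ninth, so a short burst of dropped connections is enough to produce the long "backed off" times reported above.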


Copyright © 2015 University of California