Panic Mode On (77) Server Problems?


Message boards : Number crunching : Panic Mode On (77) Server Problems?

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5818
Credit: 58,941,866
RAC: 48,024
Australia
Message 1287500 - Posted: 24 Sep 2012, 20:56:16 UTC - in response to Message 1287465.

I can track batches of shorties that my 24-core box is processing via its UPS load stats. It's only about a 2% drop in power when running shorties, but at that machine's power level it is apparent.
Snapshot of the past few weeks.
http://www.hal6000.com/seti/images/ups_load-month.png

What software are you using to monitor the UPS?
I bought a couple of cheap 1400VA UPSs & the software that came with them was crap.
____________
Grant
Darwin NT.

tbret
Project donor
Volunteer tester
Joined: 28 May 99
Posts: 2785
Credit: 209,755,833
RAC: 122,468
United States
Message 1287503 - Posted: 24 Sep 2012, 21:06:10 UTC

It's after noon in Berkeley and I haven't seen any indication that the fellas in the lab have changed anything.

I assume they know there is a problem with Uploads.

I've got computers showing "project backoff" times of up to almost six hours and similarly sad "retry" times.

The repeating wave on the incoming side of the cricket graph makes me think there is "something" that doesn't like "something" and whatever is broken is sneezing at semi-regular intervals; not merely that the communications are simply slow because of an over-abundance of data. (shortie storm, AP distribution, the usual suspects)

The "sawtooth" shape of the thing (without correlation with the "outflow") makes me wonder. I hope it's making someone else wonder, too.

I don't care, of course, until there are so many un-uploaded (and therefore unreported) work units in all of our queues that all of the Clients need manual intervention to be able to "Update." We (who read here) will be fine, but I'd wager a fair number of crunchers aren't going to be willing to babysit the process for hours (retry now / update, retry now / update) to prevent the "report" queue and client_state files from becoming unmanageable for the Client.

I've noticed that several of mine have refused to auto "Update" while there are hundreds of Uploads stuck. Me? I keep up with it and try to mitigate the damage. Thank goodness I had a situation last night where I could babysit one of my rigs every few minutes for about 3 hours, "retrying" then "updating" something like 1,500 tasks (and that isn't even a particularly fast machine). I suspect that there are a lot of people who won't do that.

It really isn't important whether the work gets reported now or tomorrow or a week from next Thursday, but it would be a crying shame to have all that crunching eventually time out and go to waste because it has "broken" the Client's ability to deal with the backlog.
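
For anyone wondering how "backoff" times can climb to hours: a common pattern for retry timers of this kind is capped exponential backoff with random jitter. The sketch below is only an illustration of that general technique and is not the BOINC client's actual code. The function name `backoff_delay`, the 60-second base, and the jitter factor are assumptions; the six-hour cap is simply borrowed from the backoff times mentioned above.

```python
import random


def backoff_delay(failures, base=60.0, cap=6 * 3600.0, jitter=0.5, rng=None):
    """Return seconds to wait after `failures` consecutive failed attempts.

    Hypothetical helper: the defaults are illustrative assumptions,
    not values taken from the BOINC client.
    """
    if rng is None:
        rng = random.Random()
    # Double the wait for every consecutive failure, but never exceed the cap.
    # The exponent is clamped so very long failure streaks cannot overflow.
    delay = min(cap, base * (2 ** min(failures, 32)))
    # Shrink the wait by a random fraction (up to `jitter`) so that many
    # clients do not all retry at the same instant.
    return delay * (1.0 - jitter * rng.random())


if __name__ == "__main__":
    # Print the un-jittered schedule for the first ten failures.
    for n in range(10):
        print(n, round(backoff_delay(n, jitter=0.0)))
```

With these assumed defaults and no jitter, the wait doubles from one minute and is pinned at the six-hour cap from the ninth consecutive failure on, which is why a long run of failed uploads can leave a machine idle for hours unless someone hits "retry now".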

Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 6972
Credit: 26,378,184
RAC: 33,476
United Kingdom
Message 1287506 - Posted: 24 Sep 2012, 21:16:20 UTC - in response to Message 1287503.
Last modified: 24 Sep 2012, 21:21:30 UTC

Agree 100%. I have had to set all my crunchers to NNT because, unless I sit here 24/7, my fastest machine is completing WUs faster than they can be reported and the queue continues to grow. Imagine what it must be like on some of the top 10 machines.

I have just done a "ping" check to all the machines on the 208.xxx.xxx subnet I can see, and all of them show packet loss at a level my old company could not have worked at!!
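
If anyone wants to put a number on the packet loss rather than eyeballing it, something like the following could work. It is only a sketch: the helper names `parse_packet_loss` and `ping_loss` are made up for illustration, it assumes the system `ping` prints an English summary line ("25% packet loss" on Linux/macOS, "25% loss" on Windows), and the host in the usage note is a placeholder.

```python
import platform
import re
import subprocess
from typing import Optional

# Matches "25% packet loss" (Linux/macOS) and "25% loss" (Windows).
_LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)%\s*(?:packet\s+)?loss")


def parse_packet_loss(ping_output: str) -> Optional[float]:
    """Extract the packet-loss percentage from ping's summary text."""
    match = _LOSS_RE.search(ping_output)
    return float(match.group(1)) if match else None


def ping_loss(host: str, count: int = 20) -> Optional[float]:
    """Ping `host` `count` times and return the reported loss percentage."""
    # Windows uses -n for the packet count; most other systems use -c.
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    result = subprocess.run(
        ["ping", count_flag, str(count), host],
        capture_output=True,
        text=True,
        check=False,  # ping exits non-zero when packets are lost
    )
    return parse_packet_loss(result.stdout)
```

Calling `ping_loss("example.org")` returns the loss as a percentage, or `None` if no summary line was found (for example when the name does not resolve or the output is localised).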

PS: I wouldn't "assume" that the lab knows anything is wrong. As we know, staffing levels are low, so there may actually be no one there.
____________


Today is life, the only life we're sure of. Make the most of today.

Waldo
Volunteer tester
Joined: 19 May 12
Posts: 171
Credit: 222,266,371
RAC: 300,700
United States
Message 1287508 - Posted: 24 Sep 2012, 21:26:42 UTC - in response to Message 1287500.
Last modified: 24 Sep 2012, 21:30:42 UTC

I can track batches of shorties that my 24-core box is processing via its UPS load stats. It's only about a 2% drop in power when running shorties, but at that machine's power level it is apparent.
Snapshot of the past few weeks.
http://www.hal6000.com/seti/images/ups_load-month.png

What software are you using to monitor the UPS?
I bought a couple of cheap 1400VA UPSs & the software that came with them was crap.

Hal would have to answer, but it looks like he is using RRDtool with SNMP, or something similar that uses RRD.
This is just a guess and I might be wrong. Some of the newer web-enabled UPS cards do trending by default.

You might want to check out Cacti; it is one of the pieces of software I use.

PS: It is an endless cycle when WUs finish faster than they can be uploaded :(
I am sure someone is looking at it.
____________

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5818
Credit: 58,941,866
RAC: 48,024
Australia
Message 1287542 - Posted: 24 Sep 2012, 23:55:09 UTC - in response to Message 1287508.

You might want to check out Cacti; it is one of the pieces of software I use.

Thanks.
When I get a chance I'll see if it recognises my UPS and can make sense of its output.
____________
Grant
Darwin NT.

Wiggo
Joined: 24 Jan 00
Posts: 6948
Credit: 94,498,941
RAC: 74,847
Australia
Message 1287553 - Posted: 25 Sep 2012, 0:34:28 UTC - in response to Message 1287542.

Well, I hope that during tomorrow's outage the guys work out what they didn't get right during the last one, as this past week has been very annoying.

Cheers.
____________

Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 6972
Credit: 26,378,184
RAC: 33,476
United Kingdom
Message 1287622 - Posted: 25 Sep 2012, 6:33:51 UTC - in response to Message 1287553.

Well, I hope that during tomorrow's outage the guys work out what they didn't get right during the last one, as this past week has been very annoying.

Cheers.

I certainly haven't had a problem "all week", just since late Saturday UTC. As there has been no fix, message or announcement on the front page, it seems likely they are still unaware of the problem.

If it is "fixed" during the outage it is going to take a while to settle down!!
____________


Today is life, the only life we're sure of. Make the most of today.

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 4179
Credit: 114,415,485
RAC: 140,552
United States
Message 1287720 - Posted: 25 Sep 2012, 13:41:04 UTC - in response to Message 1287508.

I use MRTG instead of RRDtool. The cricket graphs we like to fret over so much are generated using software based on RRDtool.
____________
SETI@home classic workunits: 93,865 CPU time: 863,447 hours

Join the BP6/VP6 User Group today!

Vepide
Joined: 2 Feb 08
Posts: 6
Credit: 103,089
RAC: 0
United States
Message 1287763 - Posted: 25 Sep 2012, 16:17:15 UTC - in response to Message 1286802.

I have two WUs downloading and it seems they've been downloading for days. When I hit retry it gets a few bytes, then just craps out. It's not a problem with any other projects, just SETI. If they're using some kind of bandwidth throttling they should turn it off, or check to see if they have a failed NIC or something else in the pipeline.

Keith White
Joined: 29 May 99
Posts: 370
Credit: 2,815,027
RAC: 2,207
United States
Message 1287768 - Posted: 25 Sep 2012, 21:18:27 UTC
Last modified: 25 Sep 2012, 21:23:01 UTC

Well the servers are up, it took all my results and reported them successfully but the cricket graphs are still dead. Did you upgrade their backbone so our stuff isn't being piped through that link? I thought the new gigabit switch was only internal.

Edit: Oops, my bad, it seems I just happened to turn on internet access right as the servers came up. Looks like the cricket graph is now starting to go up.
____________
"Life is just nature's way of keeping meat fresh." - The Doctor

Sliver
Project donor
Joined: 18 May 11
Posts: 281
Credit: 7,139,754
RAC: 6,279
United States
Message 1287801 - Posted: 25 Sep 2012, 22:52:57 UTC

No luck getting much to download here - says project has no tasks available. Must have to just wait it out for a bit.
____________

fscheel
Joined: 13 Apr 12
Posts: 73
Credit: 11,135,641
RAC: 0
United States
Message 1287805 - Posted: 25 Sep 2012, 23:00:34 UTC

Not downloading here either. :(

Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 6972
Credit: 26,378,184
RAC: 33,476
United Kingdom
Message 1287811 - Posted: 25 Sep 2012, 23:04:44 UTC

My main machine ran out of work due to the "upload" problem, now I have had 6 tasks, all shorties, all done.
Oh well, we'll see how things are shaping up tomorrow.
____________


Today is life, the only life we're sure of. Make the most of today.

Starman
Joined: 15 May 99
Posts: 134
Credit: 37,065,024
RAC: 54,856
Canada
Message 1287826 - Posted: 25 Sep 2012, 23:50:28 UTC

Well, overall I'm mostly good for CPU work. One machine has lots to keep it going for close to a week. Unfortunately, my brand new ATI 7870 GPU has nothing to impress me with, my ATI 4870 GPU will be out of work before I go to bed tonight, and my ATI 6300 (Turks) is also twiddling its thumbs. Oh well.
____________

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5818
Credit: 58,941,866
RAC: 48,024
Australia
Message 1287836 - Posted: 26 Sep 2012, 0:14:52 UTC - in response to Message 1287801.

No luck getting much to download here - says project has no tasks available.

Same here.
And I notice the network traffic isn't maxed out, so something's gumming up the works somewhere.

____________
Grant
Darwin NT.

Starman
Joined: 15 May 99
Posts: 134
Credit: 37,065,024
RAC: 54,856
Canada
Message 1287839 - Posted: 26 Sep 2012, 0:30:48 UTC - in response to Message 1287826.

Well, just got 4 CPU WU, but the download was no better than the last 4-5 days.
____________

Wiggo
Joined: 24 Jan 00
Posts: 6948
Credit: 94,498,941
RAC: 74,847
Australia
Message 1287844 - Posted: 26 Sep 2012, 0:45:42 UTC - in response to Message 1287839.
Last modified: 26 Sep 2012, 0:49:28 UTC

It seems that the guys got onto whatever it was that had been causing problems since the last outage, as all but my lowly P.O.S. Vista rig are back up to full (it surprised me, though, that that P.O.S. was able to report well over 400 in its first 2 hits at the servers once they came back up).

Now if the guys can just restrain themselves to 1 new tape at a time until the APs are done, things will go well.

Cheers.
____________

Starman
Joined: 15 May 99
Posts: 134
Credit: 37,065,024
RAC: 54,856
Canada
Message 1287845 - Posted: 26 Sep 2012, 0:53:23 UTC

What's interesting is that all the WUs I've been getting the last 4-5 days are due on or around Oct. 5th, and all run high priority.
____________

S@NL Etienne Dokkum
Volunteer tester
Joined: 11 Jun 99
Posts: 161
Credit: 16,335,243
RAC: 25,726
Netherlands
Message 1287889 - Posted: 26 Sep 2012, 5:08:44 UTC - in response to Message 1287811.

My main machine ran out of work due to the "upload" problem, now I have had 6 tasks, all shorties, all done.
Oh well, we'll see how things are shaping up tomorrow.


Nothing better, spewing out shorties like a mad man...

Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 6972
Credit: 26,378,184
RAC: 33,476
United Kingdom
Message 1287903 - Posted: 26 Sep 2012, 6:33:09 UTC
Last modified: 26 Sep 2012, 6:33:20 UTC

OK, so I believe that there is something WRONG. The graph has not been maxed out since it came back, yet this morning all my machines had tasks that were backed off for 5-7 hours, and even with retry they may download 1 or 2 but go straight into backoff again. Why??

If the network is not maxed out, why can't I connect?
____________


Today is life, the only life we're sure of. Make the most of today.


Copyright © 2014 University of California