Panic Mode On (77) Server Problems?

Message boards : Number crunching : Panic Mode On (77) Server Problems?
tbret
Volunteer tester
Joined: 28 May 99
Posts: 3380
Credit: 296,162,071
RAC: 40
United States
Message 1287503 - Posted: 24 Sep 2012, 21:06:10 UTC

It's after noon in Berkeley and I haven't seen any indication that the fellas in the lab have changed anything.

I assume they know there is a problem with Uploads.

I've got computers showing "project backoff" times of up to almost six hours and similarly sad "retry" times.

The repeating wave on the incoming side of the cricket graph makes me think there is "something" that doesn't like "something" and whatever is broken is sneezing at semi-regular intervals; not merely that the communications are slow because of an over-abundance of data (shortie storm, AP distribution, the usual suspects).

The "sawtooth" shape of the thing (without correlation with the "outflow") makes me wonder. I hope it's making someone else wonder, too.

I don't care, of course, until there are so many un-uploaded (and therefore unreported) work units in all of our queues that all of the Clients need manual intervention to be able to "Update." We (who read here) will be fine, but I'd wager a fair number of crunchers aren't going to be willing to babysit the process for hours (retry now / update, retry now / update) to prevent the "report" queue and client_state files from becoming unmanageable for the Client.

I've noticed that several of mine have refused to auto "Update" while there are hundreds of Uploads stuck. Me? I keep up with it and try to mitigate the damage. Thank goodness I had a situation last night where I could babysit one of my rigs every few minutes for about 3 hours, "retrying" then "updating" something like 1,500 tasks (and that isn't even a particularly fast machine). I suspect that there are a lot of people who won't do that.

It really isn't important whether the work gets reported now or tomorrow or a week from next Thursday, but it would be a crying shame to have all that crunching eventually time-out and go to waste because it has "broken" the Client's ability to deal with the backlog.
ID: 1287503 · Report as offensive
Profile Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9954
Credit: 103,452,613
RAC: 328
United Kingdom
Message 1287506 - Posted: 24 Sep 2012, 21:16:20 UTC - in response to Message 1287503.  
Last modified: 24 Sep 2012, 21:21:30 UTC

It's after noon in Berkeley and I haven't seen any indication that the fellas in the lab have changed anything.

I assume they know there is a problem with Uploads.

I've got computers showing "project backoff" times of up-to almost six hours and similarly sad "retry" times.

The repeating wave on the incoming side of the cricket graph makes me think there is "something" that doesn't like "something" and whatever is broken is sneezing at semi-regular intervals; not merely that the communications are simply slow because of an over-abundance of data. (shortie storm, AP distribution, the usual suspects)

The "sawtooth" shape of the thing (without correlation with the "outflow") makes me wonder. I hope it's making someone else wonder, too.

I don't care, of course, until there are so many un-uploaded (and therefore unreported) work units in all of our queues that all of the Clients need manual intervention to be able to "Update." We (who read here) will be fine, but I'd wager a fair number of crunchers aren't going to be willing to babysit the process for hours (retry now / update, retry now / update) to prevent the "report" queue and client_state files from becoming unmanageable for the Client.

I've noticed that several of mine have refused to auto "Update" while there are hundreds of Uploads stuck. Me? I keep-up with it and try to mitigate the damage. Thank goodness I had a situation last night where I could babysit one of my rigs every few minutes for about 3 hours "retrying" then "updating" of something like 1,500 tasks (and that isn't even a particularly fast machine). I suspect that there are a lot of people who won't do that.

It really isn't important whether the work gets reported now or tomorrow or a week from next Thursday, but it would be a crying shame to have all that crunching eventually time-out and go to waste because it has "broken" the Client's ability to deal with the backlog.

Agree 100%. I have had to set all my crunchers to NNT (No New Tasks) because, unless I sit here 24/7, my fastest machine is completing WUs faster than they can be reported and the queue continues to grow. Imagine what it must be like on some of the top 10 machines.

I have just done a "ping" check to all the machines on the 208.xxx.xxx subnet I can see, and all of them show packet loss at a level my old company could not have worked at!!
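
For anyone wanting to repeat that sort of check, here is a minimal sketch in Python, not the method used above. It assumes a Unix-style ping that accepts the -c flag and prints a summary line such as "25% packet loss"; the function names and the example host are made up for illustration.

```python
import re
import subprocess

# Matches "25% packet loss" (Linux/macOS summary) and "(25% loss)" (Windows summary).
_LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)%\s*(?:packet\s+)?loss")


def parse_packet_loss(ping_output):
    """Return the packet-loss percentage found in ping's summary text, or None."""
    match = _LOSS_RE.search(ping_output)
    return float(match.group(1)) if match else None


def measure_packet_loss(host, count=20):
    """Ping `host` `count` times and return the reported loss in percent.

    Assumption: a Unix-style ping where `-c` sets the packet count.
    Returns None if no summary line could be found.
    """
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True,
        text=True,
        check=False,  # ping exits non-zero on loss; we still want its output
    )
    return parse_packet_loss(result.stdout)


# Example call (hypothetical host, substitute the machine you want to test):
# print(measure_packet_loss("example.org"))
```

Anything more than a percent or two of loss on a wired link is usually worth a closer look, though what counts as acceptable depends on the network.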

PS: I wouldn't "assume" that the lab knows anything is wrong; as we know, staffing levels are low, so there may actually be no one there.
ID: 1287506 · Report as offensive
Profile Waldo
Volunteer tester
Joined: 19 May 12
Posts: 174
Credit: 317,086,018
RAC: 0
United States
Message 1287508 - Posted: 24 Sep 2012, 21:26:42 UTC - in response to Message 1287500.  
Last modified: 24 Sep 2012, 21:30:42 UTC

I can track batches of shorties that my 24 core box is processing via its UPS load stats. It's only about a 2% drop in power when running shorties, but at that machine's power level it is apparent.
Snapshot of the past few weeks.
http://www.hal6000.com/seti/images/ups_load-month.png

What software are you using to monitor the UPS?
I bought a couple of cheap 1400VA UPSs & the software that came with them was crap.

Hal would have to answer, but it looks like he is using rrdtool with SNMP or something similar that utilizes RRD.
This is just a guess and I might be wrong. Some of the newer web-enabled UPS cards do trending by default.

You might want to check out Cacti; it is one of the pieces of software I use.

PS: It is an endless cycle when WUs finish faster than they can be uploaded :(
I am sure someone is looking at it.
ID: 1287508 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13731
Credit: 208,696,464
RAC: 304
Australia
Message 1287542 - Posted: 24 Sep 2012, 23:55:09 UTC - in response to Message 1287508.  

You might want to check out Cacti; it is one of the pieces of software I use.

Thanks.
When I get a chance I'll see if it recognises my UPS and can make sense of its output.
Grant
Darwin NT
ID: 1287542 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1287553 - Posted: 25 Sep 2012, 0:34:28 UTC - in response to Message 1287542.  

Well, I hope that during tomorrow's outage the guys work out what they didn't get right during the last one, as this past week has been very annoying.

Cheers.
ID: 1287553 · Report as offensive
Profile Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9954
Credit: 103,452,613
RAC: 328
United Kingdom
Message 1287622 - Posted: 25 Sep 2012, 6:33:51 UTC - in response to Message 1287553.  

Well, I hope that during tomorrow's outage the guys work out what they didn't get right during the last one, as this past week has been very annoying.

Cheers.

I certainly haven't had a problem "all week", just since late Saturday UTC. As there was no fix, message or announcement on the front page, it seems likely they are still unaware of the problem.

If it is "fixed" during the outage it is going to take a while to settle down!!
ID: 1287622 · Report as offensive
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1287720 - Posted: 25 Sep 2012, 13:41:04 UTC - in response to Message 1287508.  

I can track batches of shorties that my 24 core box is processing via its UPS load stats. It's only about a 2% drop in power when running shorties, but at that machine's power level it is apparent.
Snapshot of the past few weeks.
http://www.hal6000.com/seti/images/ups_load-month.png

What software are you using to monitor the UPS?
I bought a couple of cheap 1400VA UPSs & the software that came with them was crap.

Hal would have to answer, but it looks like he is using rrdtool with SNMP or something similar that utilizes RRD.
This is just a guess and I might be wrong. Some of the newer web-enabled UPS cards do trending by default.

You might want to check out Cacti; it is one of the pieces of software I use.

PS: It is an endless cycle when WUs finish faster than they can be uploaded :(
I am sure someone is looking at it.

I use MRTG instead of RRDtool. The cricket graphs we like to fret over so much are generated using software based on RRDtool.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
ID: 1287720 · Report as offensive
Vepide
Volunteer tester

Joined: 2 Feb 08
Posts: 6
Credit: 103,183
RAC: 0
United States
Message 1287763 - Posted: 25 Sep 2012, 16:17:15 UTC - in response to Message 1286802.  

I have two WUs downloading and it seems they've been downloading for days. When I hit retry it gets a few bytes, then just craps out. It's not a problem with any other projects, just SETI. If they're using some kind of bandwidth throttling they should turn it off, or check to see if they have a failed NIC or something else in the pipeline.
ID: 1287763 · Report as offensive
Keith White
Joined: 29 May 99
Posts: 392
Credit: 13,035,233
RAC: 22
United States
Message 1287768 - Posted: 25 Sep 2012, 21:18:27 UTC
Last modified: 25 Sep 2012, 21:23:01 UTC

Well the servers are up, it took all my results and reported them successfully but the cricket graphs are still dead. Did you upgrade their backbone so our stuff isn't being piped through that link? I thought the new gigabit switch was only internal.

Edit: Oops, my bad, seems like I just happened to turn on internet access at the point the servers came up. Looks like the cricket graph is now starting to go up.
"Life is just nature's way of keeping meat fresh." - The Doctor
ID: 1287768 · Report as offensive
Profile Akio
Joined: 18 May 11
Posts: 375
Credit: 32,129,242
RAC: 0
United States
Message 1287801 - Posted: 25 Sep 2012, 22:52:57 UTC

No luck getting much to download here - says project has no tasks available. Must have to just wait it out for a bit.
ID: 1287801 · Report as offensive
fscheel

Joined: 13 Apr 12
Posts: 73
Credit: 11,135,641
RAC: 0
United States
Message 1287805 - Posted: 25 Sep 2012, 23:00:34 UTC

Not downloading here either. :(
ID: 1287805 · Report as offensive
Profile Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9954
Credit: 103,452,613
RAC: 328
United Kingdom
Message 1287811 - Posted: 25 Sep 2012, 23:04:44 UTC

My main machine ran out of work due to the "upload" problem, now I have had 6 tasks, all shorties, all done.
Oh well see how things are shaping up tomorrow.
ID: 1287811 · Report as offensive
Starman
Joined: 15 May 99
Posts: 204
Credit: 81,351,915
RAC: 25
Canada
Message 1287826 - Posted: 25 Sep 2012, 23:50:28 UTC

Well, overall I'm mostly good for CPU work. One machine has lots to keep it going for close to a week. Unfortunately, my brand new ATI 7870 GPU has nothing to impress me with, my ATI 4870 GPU will be out of work before I go to bed tonight, and my ATI 6300 (Turks) is also twiddling its thumbs. Oh well.
ID: 1287826 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13731
Credit: 208,696,464
RAC: 304
Australia
Message 1287836 - Posted: 26 Sep 2012, 0:14:52 UTC - in response to Message 1287801.  

No luck getting much to download here - says project has no tasks available.

Same here.
And I notice the network traffic isn't maxed out, so something's gumming up the works somewhere.

Grant
Darwin NT
ID: 1287836 · Report as offensive
Starman
Joined: 15 May 99
Posts: 204
Credit: 81,351,915
RAC: 25
Canada
Message 1287839 - Posted: 26 Sep 2012, 0:30:48 UTC - in response to Message 1287826.  

Well, just got 4 CPU WU, but the download was no better than the last 4-5 days.
ID: 1287839 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1287844 - Posted: 26 Sep 2012, 0:45:42 UTC - in response to Message 1287839.  
Last modified: 26 Sep 2012, 0:49:28 UTC

It seems that the guys got onto whatever it was that was causing problems since the last outage, as all but my lowly P.O.S. Vista rig are back up to full (it surprised me, though, that that P.O.S. was able to report well over 400 in its first 2 hits at the servers once they came back up).

Now if the guys can just restrain themselves to 1 new tape at a time until the APs are done, things will go well.

Cheers.
ID: 1287844 · Report as offensive
Starman
Joined: 15 May 99
Posts: 204
Credit: 81,351,915
RAC: 25
Canada
Message 1287845 - Posted: 26 Sep 2012, 0:53:23 UTC

What's interesting is that all the WUs I've been getting the last 4-5 days are due on or around Oct. 5th, and all run high priority.
ID: 1287845 · Report as offensive
Profile S@NL Etienne Dokkum
Volunteer tester
Joined: 11 Jun 99
Posts: 212
Credit: 43,822,095
RAC: 0
Netherlands
Message 1287889 - Posted: 26 Sep 2012, 5:08:44 UTC - in response to Message 1287811.  

My main machine ran out of work due to the "upload" problem, now I have had 6 tasks, all shorties, all done.
Oh well see how things are shaping up tomorrow.


Nothing better, spewing out shorties like a mad man...
ID: 1287889 · Report as offensive
Profile Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9954
Credit: 103,452,613
RAC: 328
United Kingdom
Message 1287903 - Posted: 26 Sep 2012, 6:33:09 UTC
Last modified: 26 Sep 2012, 6:33:20 UTC

OK, so I believe that there is something WRONG. The graph has not been maxed out since it came back, yet this morning all my machines had tasks, all backed off for 5-7 hours, and even with retry they may download 1 or 2 but go straight into backoff again. Why??

If network is not maxed why can't I connect?
ID: 1287903 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1287910 - Posted: 26 Sep 2012, 7:13:54 UTC - in response to Message 1287903.  

OK, so I believe that there is something WRONG. The graph has not been maxed out since it came back, yet this morning all my machines had tasks, all backed off for 5-7 hours, and even with retry they may download 1 or 2 but go straight into backoff again. Why??

If network is not maxed why can't I connect?

Without APs in the system the cricket graph is sitting about where you would normally expect it to, but as to why you can't get a look in, that is probably me & others sucking up whatever is available.

Cheers.
ID: 1287910 · Report as offensive