Panic Mode On (79) Server Problems?

MikeN

Joined: 24 Jan 11
Posts: 319
Credit: 64,719,409
RAC: 85
United Kingdom
Message 1309911 - Posted: 24 Nov 2012, 20:36:22 UTC - in response to Message 1309909.  


Overnight, a few hours ago, I managed to report & download some work.
About 50 minutes ago I got a Scheduler timeout again, then failure when receiving data from the peer; now it's nothing but couldn't connect to server, repeated over & over again.
Looks like still more work is required.


Cricket graphs show everything maxed out so I suspect it is just a feeding frenzy after such a long outage. Everyone trying to refill their WU allowance at the same time. I only know of two solutions:

1. Serious update button abuse
2. Patience: try again in a few hours when things have settled down a little.
ID: 1309911
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1309918 - Posted: 24 Nov 2012, 20:51:18 UTC - in response to Message 1309911.  
Last modified: 24 Nov 2012, 20:56:31 UTC

Cricket graphs show everything maxed out so I suspect it is just a feeding frenzy after such a long outage.

Nope, that results in different errors such as the well known time out.
And if they kept the routing through the campus network it would mean it's not affected by the upload & download traffic.


EDIT- just pinged the Scheduler, looks like it's back off the campus network.
However, packet loss varies between 0 & 25%; previously it was around 75%, and even then it was possible to contact the Scheduler. You'd rarely get a reply, but you could contact it.
Grant
Darwin NT
ID: 1309918
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1309939 - Posted: 24 Nov 2012, 22:04:11 UTC - in response to Message 1309918.  

Cricket graphs show everything maxed out so I suspect it is just a feeding frenzy after such a long outage.

Nope, that results in different errors such as the well known time out.
And if they kept the routing through the campus network it would mean it's not affected by the upload & download traffic.


EDIT- just pinged the Scheduler, looks like it's back off the campus network.
However, packet loss varies between 0 & 25%; previously it was around 75%, and even then it was possible to contact the Scheduler. You'd rarely get a reply, but you could contact it.



Hmm, it could be system load related: one system has managed to contact the Scheduler a couple of times, but most attempts still end in Couldn't connect to server errors; the other system is completely unable to.
But if that's the case it means they've changed some settings somewhere. In the past, no matter how bad the load, you could generally contact the Scheduler. It's just been this last month & a bit where we started getting the timeouts.
As it is, inbound traffic is only around 10Mb/s; usually it's around 14, closer to 20 after an outage.
And I suspect the low inbound traffic is due to the inability to contact the Scheduler.
Grant
Darwin NT
ID: 1309939
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1309944 - Posted: 24 Nov 2012, 22:10:09 UTC - in response to Message 1309939.  
Last modified: 24 Nov 2012, 22:10:39 UTC

Cricket graphs show everything maxed out so I suspect it is just a feeding frenzy after such a long outage.

Nope, that results in different errors such as the well known time out.
And if they kept the routing through the campus network it would mean it's not affected by the upload & download traffic.


EDIT- just pinged the Scheduler, looks like it's back off the campus network.
However, packet loss varies between 0 & 25%; previously it was around 75%, and even then it was possible to contact the Scheduler. You'd rarely get a reply, but you could contact it.



Hmm, it could be system load related: one system has managed to contact the Scheduler a couple of times, but most attempts still end in Couldn't connect to server errors; the other system is completely unable to.
But if that's the case it means they've changed some settings somewhere. In the past, no matter how bad the load, you could generally contact the Scheduler. It's just been this last month & a bit where we started getting the timeouts.
As it is, inbound traffic is only around 10Mb/s; usually it's around 14, closer to 20 after an outage.
And I suspect the low inbound traffic is due to the inability to contact the Scheduler.

Inbound traffic is about average....86K MB results received in the last hour.
The rigs have been getting a tad bit of work here and there. Mostly CPU tasks, so the GPUs have been mostly falling back to Einstein.

Oh well, the kitties will take whatever they can claw out of the servers for now and happily crunch it.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1309944
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1309952 - Posted: 24 Nov 2012, 22:32:20 UTC - in response to Message 1309944.  
Last modified: 24 Nov 2012, 22:34:16 UTC

Inbound traffic is about average....86K MB results received in the last hour.

The number of results per hour may be, but the number of bytes per second is way down. And since the number of results being returned is about average, that means it must be the traffic to the Scheduler that's dropped off significantly as it's no longer using the campus network.
Grant
Darwin NT
ID: 1309952
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1309959 - Posted: 24 Nov 2012, 22:43:16 UTC - in response to Message 1309952.  

Inbound traffic is about average....86K MB results received in the last hour.

The number of results per hour may be, but the number of bytes per second is way down. And since the number of results being returned is about average, that means it must be the traffic to the Scheduler that's dropped off significantly as it's no longer using the campus network.

Other way round. At the moment, it's been switched back, so that the scheduler traffic is attempting to use the Hurricane Electric link we're used to seeing on the gigabitethernet2_3 Cricket graph.

That's the same setting that we had at the beginning of this week, and the beginning of the month. The difference seems to be that for the last three weeks, our requests reached the scheduler, but the replies got lost: now, the difficulty seems to be connecting to the scheduler in the first place, so that the full request message doesn't get sent.
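
A rough sketch of that distinction, for anyone sorting their own log lines. It assumes the error wording quoted in this thread; treating "Failure when receiving data from the peer" as a lost reply is an assumption made here for illustration, and `failure_stage` is a hypothetical helper name, not part of BOINC.

```python
def failure_stage(message):
    """Guess at which stage a scheduler request failed, from the log wording.

    "connect": the request never reached the scheduler.
    "reply":   the request was sent but the reply was lost or never arrived.
    """
    text = message.lower()
    if "couldn't connect to server" in text:
        return "connect"
    # Assumption: timeouts and receive failures both mean the reply went missing.
    if "timeout" in text or "failure when receiving data from the peer" in text:
        return "reply"
    return "unknown"


print(failure_stage("Scheduler request failed: Couldn't connect to server"))
print(failure_stage("Scheduler request failed: Failure when receiving data from the peer"))
```

Counting the two stages separately over a few weeks of log would show whether the failures really have shifted from lost replies to failed connections.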
ID: 1309959
Bill G
Joined: 1 Jun 01
Posts: 1282
Credit: 187,688,550
RAC: 182
United States
Message 1309960 - Posted: 24 Nov 2012, 22:44:36 UTC - in response to Message 1309952.  

One of my computers had run dry....with a 5 day backoff, so I did an update...got 20 lost tasks, then while they were downloading got another 121 tasks.

Now here is the interesting thing: normally you get a max of two downloads at a time....for a while there I was getting 3 at the same time. Of course one was an AP, so it would appear that things are changing for the AP crunching.

SETI@home classic workunits 4,019
SETI@home classic CPU time 34,348 hours
ID: 1309960
Keith White
Joined: 29 May 99
Posts: 392
Credit: 13,035,233
RAC: 22
United States
Message 1309970 - Posted: 24 Nov 2012, 23:31:03 UTC

It's been 7 hours since recovery and I'm still getting

11/24/2012 6:22:06 PM | | Project communication failed: attempting access to reference site
11/24/2012 6:22:06 PM | SETI@home | Scheduler request failed: Failure when receiving data from the peer
11/24/2012 6:22:10 PM | | Internet access OK - project servers may be temporarily down.

At least I got through once, okay twice to report the 100+ done workunits but that was hours ago.
"Life is just nature's way of keeping meat fresh." - The Doctor
ID: 1309970
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1309972 - Posted: 24 Nov 2012, 23:38:06 UTC - in response to Message 1308515.  
Last modified: 24 Nov 2012, 23:53:01 UTC

Quick script to provide the percentage of failures on your machine.

Thanks for that.

Scheduler Requests: 821
Scheduler Success: 42 %
Scheduler Failure: 57 %
Scheduler Timeout: 22 % of total
Scheduler Timeout: 38 % of failures

That's since the 26/10/2012 18:30hrs (UTC +9:30), so just under one month's worth.


Other system

Scheduler Requests: 431
Scheduler Success: 54 %
Scheduler Failure: 45 %
Scheduler Timeout: 0 % of total
Scheduler Timeout: 0 % of failures

That's since 9/11/2012 16:30 (UTC +9:30), so just over 2 weeks' worth.
Grant
Darwin NT
ID: 1309972
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1309974 - Posted: 24 Nov 2012, 23:39:28 UTC - in response to Message 1309970.  

It's been 7 hours since recovery and I'm still getting
11/24/2012 6:22:06 PM | SETI@home | Scheduler request failed: Failure when receiving data from the peer

For me it's mostly Couldn't connect to server errors.

Grant
Darwin NT
ID: 1309974
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1309979 - Posted: 24 Nov 2012, 23:58:01 UTC

Quick script to provide the percentage of failures on your machine.


So I decided to run this on mine. Had to slightly modify it since the old builds of BOINC had successes worded as "scheduler request succeeded" rather than "completed".

Since Jan 30, 2012:

Scheduler Requests: 15685
Scheduler Success: 89 %
Scheduler Failure: 10 %
Scheduler Timeout: 1 % of total
Scheduler Timeout: 14 % of failures


Since Oct 1, 2012:

Scheduler Requests: 2877
Scheduler Success: 72 %
Scheduler Failure: 27 %
Scheduler Timeout: 3 % of total
Scheduler Timeout: 13 % of failures


Since Nov 1, 2012:

Scheduler Requests: 564
Scheduler Success: 63 %
Scheduler Failure: 36 %
Scheduler Timeout: 16 % of total
Scheduler Timeout: 43 % of failures



My log size is set to 100MB. Currently it is coming up on 9MB since it started Jan 30.
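
The original script isn't reproduced in this part of the thread, so here is a minimal Python sketch of the same idea, not the script itself. It assumes successes are logged as "Scheduler request completed" (or "succeeded" on old builds), failures as "Scheduler request failed: ...", and that a timeout has the word "timeout" in the failure reason. It also assumes percentages are truncated to whole numbers, which would explain success and failure adding up to 99 in the figures above. Only the first sample line is taken from this thread; the other three are made up for illustration.

```python
# Wording assumptions: see the note above. Matching is case-insensitive.
SUCCESS_MARKERS = ("scheduler request completed", "scheduler request succeeded")
FAILURE_MARKER = "scheduler request failed"


def percent(part, whole):
    """Whole-number percentage, truncated; 0 when there is nothing to divide by."""
    return 100 * part // whole if whole else 0


def scheduler_stats(lines):
    """Count scheduler successes, failures and timeouts in BOINC log lines."""
    success = failure = timeout = 0
    for line in lines:
        text = line.lower()
        if any(marker in text for marker in SUCCESS_MARKERS):
            success += 1
        elif FAILURE_MARKER in text:
            failure += 1
            reason = text.split(FAILURE_MARKER, 1)[1]
            if "timeout" in reason:
                timeout += 1
    total = success + failure  # assumption: every request ends in one of the two
    return {
        "requests": total,
        "success_pct": percent(success, total),
        "failure_pct": percent(failure, total),
        "timeout_pct_of_total": percent(timeout, total),
        "timeout_pct_of_failures": percent(timeout, failure),
    }


def format_report(stats):
    """Lay the numbers out the same way as the figures posted in this thread."""
    return "\n".join([
        f"Scheduler Requests: {stats['requests']}",
        f"Scheduler Success: {stats['success_pct']} %",
        f"Scheduler Failure: {stats['failure_pct']} %",
        f"Scheduler Timeout: {stats['timeout_pct_of_total']} % of total",
        f"Scheduler Timeout: {stats['timeout_pct_of_failures']} % of failures",
    ])


SAMPLE_LOG = [
    "11/24/2012 6:22:06 PM | SETI@home | Scheduler request failed: Failure when receiving data from the peer",
    "11/24/2012 6:30:00 PM | SETI@home | Scheduler request failed: Couldn't connect to server",
    "11/24/2012 6:40:00 PM | SETI@home | Scheduler request failed: Timeout was reached",
    "11/24/2012 6:50:00 PM | SETI@home | Scheduler request completed: got 0 new tasks",
]

print(format_report(scheduler_stats(SAMPLE_LOG)))
```

To run it on a real log, pass an open file instead of SAMPLE_LOG, since `scheduler_stats` accepts any iterable of lines. Limiting it to "since Nov 1" would need a date filter on the first column, which is left out here.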
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1309979
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1309984 - Posted: 25 Nov 2012, 0:07:06 UTC - in response to Message 1309970.  

At least I got through once, okay twice to report the 100+ done workunits but that was hours ago.

One machine has managed to connect a few times- taking 2 minutes or so to get a response.

Grant
Darwin NT
ID: 1309984
Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1309985 - Posted: 25 Nov 2012, 0:13:04 UTC - in response to Message 1309984.  

I've managed to get to my 200 task limit, all but about 10 of them are Shorties,

Claggy
ID: 1309985
zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65746
Credit: 55,293,173
RAC: 49
United States
Message 1310014 - Posted: 25 Nov 2012, 2:06:53 UTC - in response to Message 1309985.  

I've managed to get to my 200 task limit, all but about 10 of them are Shorties,

Claggy

200? I got 100 and that's it...
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 1310014
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1310015 - Posted: 25 Nov 2012, 2:10:43 UTC - in response to Message 1310014.  

100 per CPU/GPU = 200 on a GPU host
ID: 1310015
zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65746
Credit: 55,293,173
RAC: 49
United States
Message 1310017 - Posted: 25 Nov 2012, 2:37:44 UTC - in response to Message 1310015.  

100 per CPU/GPU = 200 on a GPU host

I only use my GPUs of course...
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 1310017
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1310022 - Posted: 25 Nov 2012, 2:58:44 UTC - in response to Message 1310017.  
Last modified: 25 Nov 2012, 2:59:36 UTC

100 per CPU/GPU = 200 on a GPU host

I only use my GPUs of course...

0 CPU + 100 GPU = 100
That's why you have a 100 WU cache only.
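
As a sketch of that rule, assuming the limit is simply 100 tasks for each resource type the host actually uses (`task_limit` and `PER_RESOURCE_LIMIT` are made-up names for illustration, not server settings):

```python
PER_RESOURCE_LIMIT = 100  # per CPU / per GPU limit, as described in this thread


def task_limit(uses_cpu, uses_gpu):
    """Total task limit for a host, given which resource types it crunches on."""
    return PER_RESOURCE_LIMIT * (int(uses_cpu) + int(uses_gpu))


print(task_limit(uses_cpu=True, uses_gpu=True))   # 200
print(task_limit(uses_cpu=False, uses_gpu=True))  # 100
```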
ID: 1310022
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1310026 - Posted: 25 Nov 2012, 3:12:39 UTC
Last modified: 25 Nov 2012, 4:06:06 UTC

Get Shorty!

I thought 4 minute shorties were a pain until I saw where the one and only AstroPulse I snagged went. Check it out: http://setiathome.berkeley.edu/workunit.php?wuid=1110816835
After working out how long 660,846.5 seconds is, I can't find it in me to complain about a 4 minute shorty...
ID: 1310026
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1310030 - Posted: 25 Nov 2012, 4:13:24 UTC

Wooow.. 7.64 days of run time but only 2 seconds of CPU time before erroring out. That's brutal.
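
For anyone checking the arithmetic, a quick sketch of the conversion from the 660,846.5 seconds quoted above (`split_seconds` is just an illustrative helper):

```python
def split_seconds(total):
    """Break a run time in seconds into days, hours, minutes and seconds."""
    days, rem = divmod(total, 86400)   # 86,400 seconds in a day
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return int(days), int(hours), int(minutes), seconds


run_time = 660846.5
print(run_time / 86400)         # a little under 7.65 days
print(split_seconds(run_time))  # (7, 15, 34, 6.5)
```

So that is 7 days, 15 hours, 34 minutes and 6.5 seconds of wall-clock time.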
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1310030
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1310044 - Posted: 25 Nov 2012, 5:22:07 UTC - in response to Message 1307257.  

Timeouts, Timeouts, Timeouts!!!!!!!

Now it's Couldn't connect to server, Couldn't connect to server, Couldn't connect to server!!!
Grant
Darwin NT
ID: 1310044


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.