Message boards : Number crunching : Panic Mode On (79) Server Problems?
MikeN · Joined: 24 Jan 11 · Posts: 319 · Credit: 64,719,409 · RAC: 85
Cricket graphs show everything maxed out, so I suspect it is just a feeding frenzy after such a long outage: everyone trying to refill their WU allowance at the same time. I only know of two solutions:
1. Serious update button abuse.
2. Patience; try again in a few hours when things have settled down a little.
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13746 · Credit: 208,696,464 · RAC: 304
> Cricket graphs show everything maxed out so I suspect it is just a feeding frenzy after such a long outage.
Nope, that results in different errors, such as the well-known timeout. And if they kept the routing through the campus network, it would mean it's not affected by the upload & download traffic.
EDIT - Just pinged the Scheduler; it looks like it's back off the campus network. However, packet loss varies between 0 & 25%. Previously it was around 75%, and even then it was possible to contact the Scheduler. You'd rarely get a reply, but you could contact it.
Grant
Darwin NT
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13746 · Credit: 208,696,464 · RAC: 304
> Cricket graphs show everything maxed out so I suspect it is just a feeding frenzy after such a long outage.
Hmm, it could be system-load related: one system has managed to contact the Scheduler a couple of times (though most attempts still give "Couldn't connect to server" errors), while the other system is completely unable to. But if that's the case, it means they've changed some settings somewhere. In the past, no matter how bad the load, you could generally contact the Scheduler; it's just been this last month and a bit that we started getting the timeouts. As it is, inbound traffic is only around 10Mb/s; usually it's around 14, closer to 20 after an outage. And I suspect the low inbound traffic is due to the inability to contact the Scheduler.
Grant
Darwin NT
kittyman · Joined: 9 Jul 00 · Posts: 51468 · Credit: 1,018,363,574 · RAC: 1,004
> Cricket graphs show everything maxed out so I suspect it is just a feeding frenzy after such a long outage.
Inbound traffic is about average.... 86K MB results received in the last hour. The rigs have been getting a tad bit of work here and there. Mostly CPU tasks, so the GPUs have been mostly falling back to Einstein. Oh well, the kitties will take whatever they can claw out of the servers for now and happily crunch it.
"Freedom is just Chaos, with better lighting." - Alan Dean Foster
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13746 · Credit: 208,696,464 · RAC: 304
> Inbound traffic is about average....86K MB results received in the last hour.
The number of results per hour may be, but the number of bytes per second is way down. And since the number of results being returned is about average, that means it must be the traffic to the Scheduler that's dropped off significantly, as it's no longer using the campus network.
Grant
Darwin NT
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14653 · Credit: 200,643,578 · RAC: 874
> Inbound traffic is about average....86K MB results received in the last hour.
Other way round. At the moment it's been switched back, so that the scheduler traffic is attempting to use the Hurricane Electric link we're used to seeing on the gigabitethernet2_3 Cricket graph. That's the same setting that we had at the beginning of this week, and the beginning of the month. The difference seems to be that for the last three weeks our requests reached the scheduler but the replies got lost; now the difficulty seems to be connecting to the scheduler in the first place, so that the full request message doesn't get sent.
Bill G · Joined: 1 Jun 01 · Posts: 1282 · Credit: 187,688,550 · RAC: 182
One of my computers had run dry, with a 5-day backoff, so I did an update: got 20 lost tasks, then while they were downloading got another 121 tasks. Now here is the interesting thing: normally you get a max of two downloads at a time, but for a while there I was getting 3 at the same time, and of course one was an AP, so it would appear that things are changing for the AP crunching.
SETI@home classic workunits: 4,019
SETI@home classic CPU time: 34,348 hours
Keith White · Joined: 29 May 99 · Posts: 392 · Credit: 13,035,233 · RAC: 22
It's been 7 hours since recovery and I'm still getting:
11/24/2012 6:22:06 PM | | Project communication failed: attempting access to reference site
11/24/2012 6:22:06 PM | SETI@home | Scheduler request failed: Failure when receiving data from the peer
11/24/2012 6:22:10 PM | | Internet access OK - project servers may be temporarily down.
At least I got through once, okay twice, to report the 100+ done workunits, but that was hours ago.
"Life is just nature's way of keeping meat fresh." - The Doctor
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13746 · Credit: 208,696,464 · RAC: 304
> Quick script to provide the percentage of failures on your machine.
Thanks for that.
Scheduler Requests: 821
Scheduler Success: 42 %
Scheduler Failure: 57 %
Scheduler Timeout: 22 % of total
Scheduler Timeout: 38 % of failures
That's since 26/10/2012 18:30hrs (UTC +9:30), so just under one month's worth.
Other system:
Scheduler Requests: 431
Scheduler Success: 54 %
Scheduler Failure: 45 %
Scheduler Timeout: 0 % of total
Scheduler Timeout: 0 % of failures
That's since 9/11/2012 16:30 (UTC +9:30), so just over 2 weeks' worth.
Grant
Darwin NT
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13746 · Credit: 208,696,464 · RAC: 304
> It's been 7 hours since recovery and I'm still getting
For me it's mostly "Couldn't connect to server" errors.
Grant
Darwin NT
Cosmic_Ocean · Joined: 23 Dec 00 · Posts: 3027 · Credit: 13,516,867 · RAC: 13
> Quick script to provide the percentage of failures on your machine.
So I decided to run this on mine. I had to slightly modify it, since the old builds of BOINC had successes worded as "scheduler request succeeded" rather than "completed".
Since Jan 30, 2012:
Scheduler Requests: 15685
Scheduler Success: 89 %
Scheduler Failure: 10 %
Scheduler Timeout: 1 % of total
Scheduler Timeout: 14 % of failures
Since Oct 1, 2012:
Scheduler Requests: 2877
Scheduler Success: 72 %
Scheduler Failure: 27 %
Scheduler Timeout: 3 % of total
Scheduler Timeout: 13 % of failures
Since Nov 1, 2012:
Scheduler Requests: 564
Scheduler Success: 63 %
Scheduler Failure: 36 %
Scheduler Timeout: 16 % of total
Scheduler Timeout: 43 % of failures
My log size is set to 100MB. Currently it is coming up on 9MB since it started Jan 30.
Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving up)
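[Editor's note: the "quick script" the posters are running isn't reproduced in the thread. As a rough illustration only, a minimal Python sketch of a log parser along these lines might look like the following. The function name, sample log lines, and match strings are assumptions; as noted in the post above, BOINC's exact message wording varies between client versions.]

```python
# Hypothetical sketch of a BOINC message-log parser like the one discussed.
# Counts scheduler contacts by scanning each log line for the phrases the
# posters mention: "Scheduler request completed" (newer clients),
# "Scheduler request succeeded" (older builds), and "Scheduler request failed",
# treating failures that mention a timeout as the timeout bucket.

def scheduler_stats(lines):
    success = failure = timeout = 0
    for line in lines:
        if ("Scheduler request completed" in line
                or "Scheduler request succeeded" in line):
            success += 1
        elif "Scheduler request failed" in line:
            failure += 1
            if "timeout" in line.lower() or "timed out" in line.lower():
                timeout += 1
    total = success + failure
    return {
        "requests": total,
        "success_pct": round(100 * success / total) if total else 0,
        "failure_pct": round(100 * failure / total) if total else 0,
        "timeout_pct_total": round(100 * timeout / total) if total else 0,
        "timeout_pct_failures": round(100 * timeout / failure) if failure else 0,
    }

# Invented sample lines in roughly the client's log style, for demonstration.
sample = [
    "24-Nov-2012 18:22:06 [SETI@home] Scheduler request completed",
    "24-Nov-2012 18:25:10 [SETI@home] Scheduler request failed: Couldn't connect to server",
    "24-Nov-2012 18:30:44 [SETI@home] Scheduler request failed: Timeout was reached",
    "24-Nov-2012 18:40:02 [SETI@home] Scheduler request completed",
]
print(scheduler_stats(sample))
```

In practice you would feed it the client's saved message log (e.g. read the file line by line) rather than an in-memory sample.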
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13746 · Credit: 208,696,464 · RAC: 304
> At least I got through once, okay twice to report the 100+ done workunits but that was hours ago.
One machine has managed to connect a few times, taking 2 minutes or so to get a response.
Grant
Darwin NT
Claggy · Joined: 5 Jul 99 · Posts: 4654 · Credit: 47,537,079 · RAC: 4
I've managed to get to my 200 task limit; all but about 10 of them are Shorties.
Claggy
zoom3+1=4 · Joined: 30 Nov 03 · Posts: 65759 · Credit: 55,293,173 · RAC: 49
> I've managed to get to my 200 task limit, all but about 10 of them are Shorties,
200? I got 100 and that's it...
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
juan BFP · Joined: 16 Mar 07 · Posts: 9786 · Credit: 572,710,851 · RAC: 3,799
100 per CPU/GPU = 200 on a GPU host
zoom3+1=4 · Joined: 30 Nov 03 · Posts: 65759 · Credit: 55,293,173 · RAC: 49
> 100 per CPU/GPU = 200 on a GPU host
I only use my GPUs, of course...
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
juan BFP · Joined: 16 Mar 07 · Posts: 9786 · Credit: 572,710,851 · RAC: 3,799
> 100 per CPU/GPU = 200 on a GPU host
0 CPU + 100 GPU = 100. That's why you have a 100 WU cache only.
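[Editor's note: the cache arithmetic in the exchange above (a 100-task cap per resource type, so 200 for a host crunching on CPU and GPU but only 100 for a GPU-only host) can be sketched as follows. The function and the cap constant are illustrative, not anything from the BOINC code.]

```python
# Hypothetical illustration of the per-resource task limit being discussed.
PER_RESOURCE_LIMIT = 100  # assumed cap, as stated in the posts above

def task_limit(uses_cpu, uses_gpu):
    # One 100-task allowance per resource type actually in use.
    return PER_RESOURCE_LIMIT * (int(uses_cpu) + int(uses_gpu))

print(task_limit(uses_cpu=True, uses_gpu=True))   # CPU + GPU host -> 200
print(task_limit(uses_cpu=False, uses_gpu=True))  # GPU-only host -> 100
```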
TBar · Joined: 22 May 99 · Posts: 5204 · Credit: 840,779,836 · RAC: 2,768
Get Shorty! I thought 4-minute shorties were a pain until I saw where the 1 and only AstroPulse I snagged went. Check it out: http://setiathome.berkeley.edu/workunit.php?wuid=1110816835
After working out how long 660,846.5 seconds is, I can't find it in me to complain about a 4-minute shorty...
Cosmic_Ocean · Joined: 23 Dec 00 · Posts: 3027 · Credit: 13,516,867 · RAC: 13
Wooow.. 7.64 days of run time, but only 2 seconds of CPU time before erroring out. That's brutal.
Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving up)
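[Editor's note: as a quick check of the figure quoted above, the 660,846.5-second run time works out like this:]

```python
# Convert the quoted AstroPulse run time from seconds to days and hours.
run_time_s = 660_846.5
days, rem = divmod(run_time_s, 86_400)  # 86,400 seconds per day
hours = rem / 3_600
print(f"{days:.0f} days {hours:.1f} hours")  # 7 days 15.6 hours, i.e. ~7.65 days
```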
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13746 · Credit: 208,696,464 · RAC: 304
> Timeouts, Timeouts, Timeouts!!!!!!!
Now it's Couldn't connect to server, Couldn't connect to server, Couldn't connect to server!!!
Grant
Darwin NT
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.