Panic Mode On (74) Server problems?

Author	Message
arkayn Volunteer tester Send message Joined: 14 May 99 Posts: 4438 Credit: 55,006,323 RAC: 0	Message 1232796 - Posted: 18 May 2012, 16:12:44 UTC In other news, I still have plenty of work after this outage. I just set NNT for the time being, but I am working on my uploads. ID: 1232796 ·

David S Volunteer tester Send message Joined: 4 Oct 99 Posts: 18352 Credit: 27,761,924 RAC: 12	Message 1232799 - Posted: 18 May 2012, 16:15:21 UTC - in response to Message 1232778. 2 days? That's one shockingly bad power company. Ooooooyyy..... David Sitting on my butt while others boldly go, Waiting for a message from a small furry creature from Alpha Centauri. ID: 1232799 ·

Khangollo Send message Joined: 1 Aug 00 Posts: 245 Credit: 36,410,524 RAC: 0	Message 1232865 - Posted: 18 May 2012, 17:35:33 UTC Last modified: 18 May 2012, 17:51:05 UTC At first I got 2 new WUs which flew through at over 100 kB/s. That got me a bit worried. Now I just got a large batch of 34 WUs, all of them stuck at 1% and retrying indefinitely. Yay, we're at 100+% again!! Situation normal 8-) Edit: also, I'm getting some of this on newly assigned WUs: [error] File 23mr10ad.21569.11693.3.10.156 has wrong size: expected 375356, got 0; Checksum or signature error for 23mr10ad.21569.11693.3.10.156. And this is on different machines/platforms. Perhaps not everything was shut down cleanly? :( ID: 1232865 ·
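For anyone who wants to count how many of these errors a client log contains, here is a minimal sketch. It assumes the message wording is exactly as quoted in the post above; the pattern and the function names `parse_wrong_size` and `is_empty_download` are hypothetical helpers, not part of BOINC.

```python
import re

# Pattern modeled on the error text quoted in the post above.
# This is an assumption about the log wording, not taken from BOINC source.
WRONG_SIZE = re.compile(
    r"File (?P<name>\S+) has wrong size: "
    r"expected (?P<expected>\d+), got (?P<got>\d+)"
)


def parse_wrong_size(line):
    """Return (file name, expected bytes, received bytes), or None if no match."""
    m = WRONG_SIZE.search(line)
    if m is None:
        return None
    return m.group("name"), int(m.group("expected")), int(m.group("got"))


def is_empty_download(line):
    """True when the line reports a wrong-size error with zero bytes received."""
    parsed = parse_wrong_size(line)
    return parsed is not None and parsed[2] == 0
```

A "got 0" result means the client received an empty file, which would fit the file being unavailable on the server side at the time rather than a partial transfer.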

Kevin Olley Send message Joined: 3 Aug 99 Posts: 906 Credit: 261,085,289 RAC: 572	Message 1232876 - Posted: 18 May 2012, 17:48:24 UTC Oh dear, managed to upload and report, then the first couple of downloads appear. They be shorties. Kevin ID: 1232876 ·

Cosmic_Ocean Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13	Message 1232883 - Posted: 18 May 2012, 17:55:52 UTC Last modified: 18 May 2012, 17:57:25 UTC Got all of mine uploaded and reported. MB-only machine reported all of its tasks and got a full 2.5-day cache worth of new work all at once. Then the download/checksum errors happened. This WU looks like it'll be fun. _0 and _1 both failed on download. _2 managed to get it, supposedly. _3 and _4 (me) failed on download. _5 is unsent at the moment. Also means my consecutive valid streak of over 1,000 has been reset to 1. For those of you with GPUs, it doesn't take long to get a big streak.. but for a CPU that does 3-5 MBs/day.. it takes a while. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving-up) ID: 1232883 ·

arkayn Volunteer tester Send message Joined: 14 May 99 Posts: 4438 Credit: 55,006,323 RAC: 0	Message 1232903 - Posted: 18 May 2012, 18:32:27 UTC I am still working on my uploads; both machines have got about half uploaded or so. ID: 1232903 ·

Khangollo Send message Joined: 1 Aug 00 Posts: 245 Credit: 36,410,524 RAC: 0	Message 1232935 - Posted: 18 May 2012, 19:12:35 UTC - in response to Message 1232883. This WU looks like it'll be fun. _0 and _1 both failed on download. _2 managed to get it, supposedly. _3 and _4 (me) failed on download. _5 is unsent at the moment. Check this one - Too many errors (may have bug). I'm getting a huge amount of download errors now. What's going on? ID: 1232935 ·

Donald L. Johnson Send message Joined: 5 Aug 02 Posts: 8240 Credit: 14,654,533 RAC: 20	Message 1233000 - Posted: 18 May 2012, 20:35:59 UTC - in response to Message 1232940. This WU looks like it'll be fun. _0 and _1 both failed on download. _2 managed to get it, supposedly. _3 and _4 (me) failed on download. _5 is unsent at the moment. Check this one - Too many errors (may have bug). I'm getting a huge amount of download errors now. What's going on? Yeah, me too, lots of these: 1735 SETI@home 2012-05-18 21:17:31 [error] File 23fe11ab.28540.3508.8.10.14 has wrong size: expected 375361, got 0 1736 SETI@home 2012-05-18 21:17:31 [error] Checksum or signature error for 23fe11ab.28540.3508.8.10.14 Could that be due to the RAID resynchs Matt mentioned earlier? Or maybe just saturation on the download pipe? Donald Infernal Optimist / Submariner, retired ID: 1233000 ·

Josef W. Segur Volunteer developer Volunteer tester Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0	Message 1233051 - Posted: 18 May 2012, 21:51:22 UTC - in response to Message 1233000. This WU looks like it'll be fun. _0 and _1 both failed on download. _2 managed to get it, supposedly. _3 and _4 (me) failed on download. _5 is unsent at the moment. Check this one - Too many errors (may have bug). I'm getting a huge amount of download errors now. What's going on? Yeah, me too, lots of these: 1735 SETI@home 2012-05-18 21:17:31 [error] File 23fe11ab.28540.3508.8.10.14 has wrong size: expected 375361, got 0 1736 SETI@home 2012-05-18 21:17:31 [error] Checksum or signature error for 23fe11ab.28540.3508.8.10.14 Could that be due to the RAID resynchs Matt mentioned earlier? Or maybe just saturation on the download pipe? See Khangollo's message 1232644, posted before the power failure, showing that kind of error. Whatever the problem is, it's not an aftereffect of the power failure. Joe ID: 1233051 ·

Keith T. Volunteer tester Send message Joined: 23 Aug 99 Posts: 962 Credit: 537,293 RAC: 9	Message 1233071 - Posted: 18 May 2012, 22:17:21 UTC - in response to Message 1233051. This WU looks like it'll be fun. _0 and _1 both failed on download. _2 managed to get it, supposedly. _3 and _4 (me) failed on download. _5 is unsent at the moment. Check this one - Too many errors (may have bug). I'm getting a huge amount of download errors now. What's going on? Yeah, me too, lots of these: 1735 SETI@home 2012-05-18 21:17:31 [error] File 23fe11ab.28540.3508.8.10.14 has wrong size: expected 375361, got 0 1736 SETI@home 2012-05-18 21:17:31 [error] Checksum or signature error for 23fe11ab.28540.3508.8.10.14 Could that be due to the RAID resynchs Matt mentioned earlier? Or maybe just saturation on the download pipe? See Khangollo's message 1232644, posted before the power failure, showing that kind of error. Whatever the problem is, it's not an aftereffect of the power failure. Joe I suspect some of the drive volumes were unreachable by the download servers shortly before everything shut down. Are all the servers on the same UPS ? ID: 1233071 ·

arkayn Volunteer tester Send message Joined: 14 May 99 Posts: 4438 Credit: 55,006,323 RAC: 0	Message 1233080 - Posted: 18 May 2012, 22:29:19 UTC Finally done uploading for me. ID: 1233080 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 1233082 - Posted: 18 May 2012, 22:31:09 UTC - in response to Message 1233071. I suspect some of the drive volumes were unreachable by the download servers shortly before everything shut down. Are all the servers on the same UPS ? I'm pretty sure they've got several UPSs. Just had a look at my tasks, and I've got quite a few download errors as well. If you look at the WUs, that occurred for everyone else trying to download those WUs, although there are several which were just downloaded today with no problems. So your theory sounds good. The Scheduler & download servers were still up, but the work file storage was down (or at least inaccessible) at the time. Might be an idea for someone who can to notify the staff: many of those WUs have been cancelled (Too many errors (may have bug) Work Unit cancelled) when in actual fact they are OK. Grant Darwin NT ID: 1233082 ·

Josef W. Segur Volunteer developer Volunteer tester Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0	Message 1233087 - Posted: 18 May 2012, 22:39:29 UTC - in response to Message 1233071. See Khangollo's message 1232644, posted before the power failure, showing that kind of error. Whatever the problem is, it's not an aftereffect of the power failure. Joe I suspect some of the drive volumes were unreachable by the download servers shortly before everything shut down. Are all the servers on the same UPS ? IIRC they only have 120 volt 20 amp AC circuits in the closet so are restricted to using 2200 VA UPS units. I don't know how many servers are on each UPS, but one UPS certainly won't handle all the listed servers. The post I referenced was made long enough before the power outage I don't think they're related, though of course it was in the recovery period from the scheduled Tuesday outage so is similar to the current conditions that way. Joe ID: 1233087 ·
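The circuit arithmetic behind that limit can be sketched as follows. The 120 V, 20 A and 2200 VA figures come from the post above; the 5000 VA total load in the usage note is a made-up number for illustration only, and both function names are hypothetical.

```python
def circuit_capacity_va(volts, amps):
    """Apparent power a single-phase circuit can supply, in volt-amperes."""
    return volts * amps


def ups_units_needed(total_load_va, ups_rating_va=2200):
    """Smallest number of UPS units whose combined rating covers the load.

    Uses ceiling division. Real sizing would also consider watts versus VA,
    runtime and headroom, none of which are modeled here.
    """
    return -(-total_load_va // ups_rating_va)
```

For example, `circuit_capacity_va(120, 20)` gives 2400 VA, so a 2200 VA unit fits on one such circuit, and a hypothetical 5000 VA of servers would need `ups_units_needed(5000)`, that is 3 units.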

Slavac Volunteer tester Send message Joined: 27 Apr 11 Posts: 1932 Credit: 17,952,639 RAC: 0	Message 1233093 - Posted: 18 May 2012, 22:45:01 UTC - in response to Message 1233087. They have several UPS systems in the closet. I'll ask Eric if they need upgrades/more. Executive Director GPU Users Group Inc. - brad@gpuug.org ID: 1233093 ·

Keith T. Volunteer tester Send message Joined: 23 Aug 99 Posts: 962 Credit: 537,293 RAC: 9	Message 1233100 - Posted: 18 May 2012, 22:52:11 UTC - in response to Message 1233082. I suspect some of the drive volumes were unreachable by the download servers shortly before everything shut down. Are all the servers on the same UPS ? I'm pretty sure they've got several UPSs. Just had a look at my tasks, and I've got quite a few download errors as well. If you look at the WUs, that occurred for everyone else trying to download those WUs, although there are several which were just downloaded today with no problems. So your theory sounds good. The Scheduler & download servers were still up, but the work file storage was down (or at least inaccessible) at the time. Might be an idea for someone who can to notify the staff: many of those WUs have been cancelled (Too many errors (may have bug) Work Unit cancelled) when in actual fact they are OK. I hope they are on the case. Someone seems to have cancelled some WUs before they reached the max number of errors, e.g. this one http://setiathome.berkeley.edu/workunit.php?wuid=992536710 which still has 2 tasks in progress, but has been cancelled. I thought there was more than one UPS for all those boxes. I suspect the drive arrays may have gone down before the servers due to the higher power requirement of all those spinning motors. ID: 1233100 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1233106 - Posted: 18 May 2012, 22:57:06 UTC - in response to Message 1233087. See Khangollo's message 1232644, posted before the power failure, showing that kind of error. Whatever the problem is, it's not an aftereffect of the power failure. Joe I suspect some of the drive volumes were unreachable by the download servers shortly before everything shut down. Are all the servers on the same UPS ? IIRC they only have 120 volt 20 amp AC circuits in the closet so are restricted to using 2200 VA UPS units. I don't know how many servers are on each UPS, but one UPS certainly won't handle all the listed servers. The post I referenced was made long enough before the power outage I don't think they're related, though of course it was in the recovery period from the scheduled Tuesday outage so is similar to the current conditions that way. Joe Owing to timezones, I don't usually see the Tuesday maintenance outage recovery. When I went to bed on Tuesday, the outage was over: message boards were up: old work had been reported, and new work had been allocated: and I had a large download queue. When I peeked in on Wednesday morning, Cricket was already flatlined. All my queued downloads had completed, except some were showing as download error. That suggests that the zero file size error is more likely to be a splitter problem... ...except I suppose we could concoct an explanation where the download servers, switches and routers were all on a UPS which kept the downloads alive for the first few minutes of the outage, but the workunit storage array was either unprotected, or on a UPS which ran out of juice before the communications link died. That might create the symptoms observed. ID: 1233106 ·

Josef W. Segur Volunteer developer Volunteer tester Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0	Message 1233115 - Posted: 18 May 2012, 23:13:28 UTC Another issue, maybe... Although SETI Beta is up, the usual http://setiweb.ssl.berkeley.edu/beta/ link gives a 404 not found status for me. http://setiathome.ssl.berkeley.edu/beta/ or just http://setiathome.berkeley.edu/beta/ will get the front page, but the cookies (and the certificate for logging in) are all for setiweb. Anyone else seeing the same? Joe ID: 1233115 ·

shizaru Volunteer tester Send message Joined: 14 Jun 04 Posts: 1130 Credit: 1,967,904 RAC: 0	Message 1233143 - Posted: 18 May 2012, 23:54:46 UTC - in response to Message 1233115. Another issue, maybe... Although SETI Beta is up, the usual http://setiweb.ssl.berkeley.edu/beta/ link gives a 404 not found status for me. http://setiathome.ssl.berkeley.edu/beta/ or just http://setiathome.berkeley.edu/beta/ will get the front page, but the cookies (and the certificate for logging in) are all for setiweb. Anyone else seeing the same? Joe Clicked on all three and they all work (here). ID: 1233143 ·

arkayn Volunteer tester Send message Joined: 14 May 99 Posts: 4438 Credit: 55,006,323 RAC: 0	Message 1233154 - Posted: 19 May 2012, 0:37:13 UTC - in response to Message 1233143. Another issue, maybe... Although SETI Beta is up, the usual http://setiweb.ssl.berkeley.edu/beta/ link gives a 404 not found status for me. http://setiathome.ssl.berkeley.edu/beta/ or just http://setiathome.berkeley.edu/beta/ will get the front page, but the cookies (and the certificate for logging in) are all for setiweb. Anyone else seeing the same? Joe Clicked on all three and they all work (here). All 3 links work fine for me as well. ID: 1233154 ·

Misfit Volunteer tester Send message Joined: 21 Jun 01 Posts: 21804 Credit: 2,815,091 RAC: 0	Message 1233157 - Posted: 19 May 2012, 1:09:22 UTC - in response to Message 1233106. Last modified: 19 May 2012, 1:10:01 UTC Power went down between Tuesday 10:35PM PDT and a few hours before. Beta URLs are working for me. No main GPU work available though. me@rescam.org ID: 1233157 ·


SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.