Panic Mode On (74) Server problems?

Tutankhamon "Communist"
Volunteer tester
Joined: 1 Nov 08
Posts: 6081
Credit: 37,578,298
RAC: 14,638
Sweden
Message 1232940 - Posted: 18 May 2012, 19:20:16 UTC - in response to Message 1232935.
Last modified: 18 May 2012, 19:21:01 UTC

This WU looks like it'll be fun. _0 and _1 both failed on download. _2 managed to get it, supposedly. _3 and _4 (me) failed on download. _5 is unsent at the moment.

Check this one - Too many errors (may have bug).

I'm getting a huge amount of download errors now. What's going on?



Yeah, me too, lots of these:


1735 SETI@home 2012-05-18 21:17:31 [error] File 23fe11ab.28540.3508.8.10.14 has wrong size: expected 375361, got 0
1736 SETI@home 2012-05-18 21:17:31 [error] Checksum or signature error for 23fe11ab.28540.3508.8.10.14
This is a test of the Emergency Moron System. Had there been a real moron in the room, there would've been a small mushroom cloud in the place where the idiot had been standing.
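
For anyone wanting to count how many downloads came back empty, the two client log lines quoted above can be scanned with a short script. This is only a sketch: the regular expression is inferred from those two lines alone, and `zero_byte_downloads` is a hypothetical helper name, not part of BOINC.

```python
import re

# Pattern inferred from the single "wrong size" line quoted in this post;
# other client versions may word the message differently (assumption).
WRONG_SIZE = re.compile(
    r"\[error\] File (?P<name>\S+) has wrong size: "
    r"expected (?P<expected>\d+), got (?P<got>\d+)"
)


def zero_byte_downloads(lines):
    """Return the names of files whose download arrived with 0 bytes."""
    names = []
    for line in lines:
        match = WRONG_SIZE.search(line)
        if match and int(match.group("got")) == 0:
            names.append(match.group("name"))
    return names


sample = [
    "1735 SETI@home 2012-05-18 21:17:31 [error] File 23fe11ab.28540.3508.8.10.14 "
    "has wrong size: expected 375361, got 0",
    "1736 SETI@home 2012-05-18 21:17:31 [error] Checksum or signature error "
    "for 23fe11ab.28540.3508.8.10.14",
]
print(zero_byte_downloads(sample))
```

The checksum line is deliberately ignored here, since it follows from the empty file rather than being a separate failure.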

ID: 1232940
Donald L. Johnson
Joined: 5 Aug 02
Posts: 8205
Credit: 4,324,207
RAC: 5,357
United States
Message 1233000 - Posted: 18 May 2012, 20:35:59 UTC - in response to Message 1232940.

This WU looks like it'll be fun. _0 and _1 both failed on download. _2 managed to get it, supposedly. _3 and _4 (me) failed on download. _5 is unsent at the moment.

Check this one - Too many errors (may have bug).

I'm getting a huge amount of download errors now. What's going on?



Yeah, me too, lots of these:


1735 SETI@home 2012-05-18 21:17:31 [error] File 23fe11ab.28540.3508.8.10.14 has wrong size: expected 375361, got 0
1736 SETI@home 2012-05-18 21:17:31 [error] Checksum or signature error for 23fe11ab.28540.3508.8.10.14

Could that be due to the RAID resynchs Matt mentioned earlier?
Or maybe just saturation on the download pipe?
Donald
Infernal Optimist / Submariner, retired

ID: 1233000
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1233051 - Posted: 18 May 2012, 21:51:22 UTC - in response to Message 1233000.

This WU looks like it'll be fun. _0 and _1 both failed on download. _2 managed to get it, supposedly. _3 and _4 (me) failed on download. _5 is unsent at the moment.

Check this one - Too many errors (may have bug).

I'm getting a huge amount of download errors now. What's going on?

Yeah, me too, lots of these:

1735 SETI@home 2012-05-18 21:17:31 [error] File 23fe11ab.28540.3508.8.10.14 has wrong size: expected 375361, got 0
1736 SETI@home 2012-05-18 21:17:31 [error] Checksum or signature error for 23fe11ab.28540.3508.8.10.14

Could that be due to the RAID resynchs Matt mentioned earlier?
Or maybe just saturation on the download pipe?

See Khangollo's message 1232644, posted before the power failure, showing that kind of error. Whatever the problem is, it's not an aftereffect of the power failure.
                                                                  Joe

ID: 1233051
Keith T.
Volunteer tester
Joined: 23 Aug 99
Posts: 743
Credit: 244,276
RAC: 0
United Kingdom
Message 1233071 - Posted: 18 May 2012, 22:17:21 UTC - in response to Message 1233051.

This WU looks like it'll be fun. _0 and _1 both failed on download. _2 managed to get it, supposedly. _3 and _4 (me) failed on download. _5 is unsent at the moment.

Check this one - Too many errors (may have bug).

I'm getting a huge amount of download errors now. What's going on?

Yeah, me too, lots of these:

1735 SETI@home 2012-05-18 21:17:31 [error] File 23fe11ab.28540.3508.8.10.14 has wrong size: expected 375361, got 0
1736 SETI@home 2012-05-18 21:17:31 [error] Checksum or signature error for 23fe11ab.28540.3508.8.10.14

Could that be due to the RAID resynchs Matt mentioned earlier?
Or maybe just saturation on the download pipe?

See Khangollo's message 1232644, posted before the power failure, showing that kind of error. Whatever the problem is, it's not an aftereffect of the power failure.
                                                                  Joe


I suspect some of the drive volumes were unreachable by the download servers shortly before everything shut down. Are all the servers on the same UPS ?

ID: 1233071
arkayn
Project Donor
Volunteer tester
Joined: 14 May 99
Posts: 4097
Credit: 51,575,815
RAC: 1,694
United States
Message 1233080 - Posted: 18 May 2012, 22:29:19 UTC

Finally done uploading for me.



ID: 1233080
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 7474
Credit: 90,811,416
RAC: 44,917
Australia
Message 1233082 - Posted: 18 May 2012, 22:31:09 UTC - in response to Message 1233071.

I suspect some of the drive volumes were unreachable by the download servers shortly before everything shut down. Are all the servers on the same UPS ?

I'm pretty sure they've got several UPSs.
Just had a look at my tasks, and I've got quite a few download errors as well.
If you look at the WUs, that occurred for everyone else trying to download those WUs, although there are several which were just downloaded today with no problems.
So your theory sounds good.
The Scheduler & download servers were still up, but the work file storage was down (or at least inaccessible) at the time.

Might be an idea for someone who can to notify the staff: many of those WUs have been cancelled (Too many errors (may have bug) Work Unit cancelled) when in actual fact they are OK.
Grant
Darwin NT

ID: 1233082
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1233087 - Posted: 18 May 2012, 22:39:29 UTC - in response to Message 1233071.


See Khangollo's message 1232644, posted before the power failure, showing that kind of error. Whatever the problem is, it's not an aftereffect of the power failure.
                                                                  Joe

I suspect some of the drive volumes were unreachable by the download servers shortly before everything shut down. Are all the servers on the same UPS ?

IIRC they only have 120 volt 20 amp AC circuits in the closet so are restricted to using 2200 VA UPS units. I don't know how many servers are on each UPS, but one UPS certainly won't handle all the listed servers.
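
The circuit limit mentioned above can be checked with quick arithmetic. This is a rough sketch: it only multiplies the quoted voltage and current, and says nothing about what any individual server draws, which the thread does not state.

```python
def circuit_va(volts, amps):
    """Apparent-power ceiling of a single AC circuit, in volt-amperes."""
    return volts * amps


ceiling = circuit_va(120, 20)   # the 120 volt, 20 amp circuit quoted above
ups_rating = 2200               # the UPS size quoted above, in VA

print(ceiling)                  # 2400
print(ups_rating <= ceiling)    # True
```

So a 2200 VA unit sits just under the 2400 VA a single such circuit can supply, which is consistent with the restriction described.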

The post I referenced was made long enough before the power outage I don't think they're related, though of course it was in the recovery period from the scheduled Tuesday outage so is similar to the current conditions that way.
                                                                  Joe

ID: 1233087
Slavac
Volunteer tester
Joined: 27 Apr 11
Posts: 1932
Credit: 17,952,639
RAC: 0
United States
Message 1233093 - Posted: 18 May 2012, 22:45:01 UTC - in response to Message 1233087.

They have several UPS systems in the closet. I'll ask Eric if they need upgrades/more.




Executive Director GPU Users Group Inc. -
brad@gpuug.org

ID: 1233093
Keith T.
Volunteer tester
Joined: 23 Aug 99
Posts: 743
Credit: 244,276
RAC: 0
United Kingdom
Message 1233100 - Posted: 18 May 2012, 22:52:11 UTC - in response to Message 1233082.

I suspect some of the drive volumes were unreachable by the download servers shortly before everything shut down. Are all the servers on the same UPS ?

I'm pretty sure they've got several UPSs.
Just had a look at my tasks, and I've got quite a few download errors as well.
If you look at the WUs, that occurred for everyone else trying to download those WUs, although there are several which were just downloaded today with no problems.
So your theory sounds good.
The Scheduler & download servers were still up, but the work file storage was down (or at least inaccessible) at the time.

Might be an idea for someone who can to notify the staff: many of those WUs have been cancelled (Too many errors (may have bug) Work Unit cancelled) when in actual fact they are OK.


I hope they are on the case. Someone seems to have cancelled some WUs before they reached the max number of errors, e.g. this one, http://setiathome.berkeley.edu/workunit.php?wuid=992536710, which still has 2 tasks in progress but has been cancelled.

I thought there was more than one UPS for all those boxes. I suspect the drive arrays may have gone down before the servers due to the higher power requirement of all those spinning motors.

ID: 1233100
Richard Haselgrove
Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 11130
Credit: 83,450,104
RAC: 40,682
United Kingdom
Message 1233106 - Posted: 18 May 2012, 22:57:06 UTC - in response to Message 1233087.


See Khangollo's message 1232644, posted before the power failure, showing that kind of error. Whatever the problem is, it's not an aftereffect of the power failure.
                                                                  Joe

I suspect some of the drive volumes were unreachable by the download servers shortly before everything shut down. Are all the servers on the same UPS ?

IIRC they only have 120 volt 20 amp AC circuits in the closet so are restricted to using 2200 VA UPS units. I don't know how many servers are on each UPS, but one UPS certainly won't handle all the listed servers.

The post I referenced was made long enough before the power outage I don't think they're related, though of course it was in the recovery period from the scheduled Tuesday outage so is similar to the current conditions that way.
                                                                  Joe

Owing to timezones, I don't usually see the Tuesday maintenance outage recovery.

When I went to bed on Tuesday, the outage was over: message boards were up: old work had been reported, and new work had been allocated: and I had a large download queue.

When I peeked in on Wednesday morning, Cricket was already flatlined. All my queued downloads had completed, except some were showing as download error. That suggests that the zero file size error is more likely to be a splitter problem...

...except I suppose we could concoct an explanation where the download servers, switches and routers were all on a UPS which kept the downloads alive for the first few minutes of the outage, but the workunit storage array was either unprotected, or on a UPS which ran out of juice before the communications link died. That might create the symptoms observed.

ID: 1233106
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1233115 - Posted: 18 May 2012, 23:13:28 UTC

Another issue, maybe...

Although SETI Beta is up, the usual http://setiweb.ssl.berkeley.edu/beta/ link gives a 404 not found status for me. http://setiathome.ssl.berkeley.edu/beta/ or just http://setiathome.berkeley.edu/beta/ will get the front page, but the cookies (and the certificate for logging in) are all for setiweb. Anyone else seeing the same?

                                                                   Joe

ID: 1233115
Alex Storey
Volunteer tester
Joined: 14 Jun 04
Posts: 1083
Credit: 1,950,467
RAC: 195
Greece
Message 1233143 - Posted: 18 May 2012, 23:54:46 UTC - in response to Message 1233115.

Another issue, maybe...

Although SETI Beta is up, the usual http://setiweb.ssl.berkeley.edu/beta/ link gives a 404 not found status for me. http://setiathome.ssl.berkeley.edu/beta/ or just http://setiathome.berkeley.edu/beta/ will get the front page, but the cookies (and the certificate for logging in) are all for setiweb. Anyone else seeing the same?
                                                                   Joe


Clicked on all three and they all work (here).

ID: 1233143
arkayn
Project Donor
Volunteer tester
Joined: 14 May 99
Posts: 4097
Credit: 51,575,815
RAC: 1,694
United States
Message 1233154 - Posted: 19 May 2012, 0:37:13 UTC - in response to Message 1233143.

Another issue, maybe...

Although SETI Beta is up, the usual http://setiweb.ssl.berkeley.edu/beta/ link gives a 404 not found status for me. http://setiathome.ssl.berkeley.edu/beta/ or just http://setiathome.berkeley.edu/beta/ will get the front page, but the cookies (and the certificate for logging in) are all for setiweb. Anyone else seeing the same?
                                                                   Joe


Clicked on all three and they all work (here).


All 3 links work fine for me as well.


ID: 1233154
Misfit
Volunteer tester
Joined: 21 Jun 01
Posts: 21790
Credit: 2,510,901
RAC: 0
United States
Message 1233157 - Posted: 19 May 2012, 1:09:22 UTC - in response to Message 1233106.
Last modified: 19 May 2012, 1:10:01 UTC

Power went down sometime in the few hours before Tuesday 10:35 PM PDT.

Beta URLs are working for me. No main GPU work available though.



Join BOINC Synergy!

ID: 1233157
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 7474
Credit: 90,811,416
RAC: 44,917
Australia
Message 1233171 - Posted: 19 May 2012, 1:47:49 UTC - in response to Message 1233157.

No main GPU work available though.

?
GPU work is all I'm getting at the moment. Not expecting any CPU work till the GPU cache has been topped up sufficiently.
So far there's only been a couple of shorties in sight, so hopefully it won't take too long.

Grant
Darwin NT

ID: 1233171
Slavac
Volunteer tester
Joined: 27 Apr 11
Posts: 1932
Credit: 17,952,639
RAC: 0
United States
Message 1233174 - Posted: 19 May 2012, 1:49:08 UTC - in response to Message 1233171.

Just got word from Eric. Good news is that there doesn't appear to have been any lost hardware. We were worried about lost hardware in light of Bane's death along with Dan's old workstation during the last outage.




Executive Director GPU Users Group Inc. -
brad@gpuug.org

ID: 1233174
kittyman
Project Donor
Volunteer tester
Joined: 9 Jul 00
Posts: 45842
Credit: 814,460,047
RAC: 121,999
United States
Message 1233176 - Posted: 19 May 2012, 1:51:39 UTC - in response to Message 1233174.

Just got word from Eric. Good news is that there doesn't appear to have been any lost hardware. We were worried about lost hardware in light of Bane's death along with Dan's old workstation during the last outage.

Excellent news....
Kitties make wonderful traveling companions on your journey through life.

Have made a few friends in this life.
Most were cats.

ID: 1233176
Misfit
Volunteer tester
Joined: 21 Jun 01
Posts: 21790
Credit: 2,510,901
RAC: 0
United States
Message 1233177 - Posted: 19 May 2012, 1:53:28 UTC - in response to Message 1233171.

No main GPU work available though.

?
GPU work is all I'm getting at the moment. Not expecting any CPU work till the GPU cache has been topped up sufficiently.
So far there's only been a couple of shorties in sight, so hopefully it won't take too long.

I had to complain about it before I got any apparently.

Join BOINC Synergy!

ID: 1233177
David S
Project Donor
Volunteer tester
Joined: 4 Oct 99
Posts: 17030
Credit: 20,909,837
RAC: 6,032
United States
Message 1233217 - Posted: 19 May 2012, 3:52:21 UTC - in response to Message 1233071.

This WU looks like it'll be fun. _0 and _1 both failed on download. _2 managed to get it, supposedly. _3 and _4 (me) failed on download. _5 is unsent at the moment.

Check this one - Too many errors (may have bug).

I'm getting a huge amount of download errors now. What's going on?

Yeah, me too, lots of these:

1735 SETI@home 2012-05-18 21:17:31 [error] File 23fe11ab.28540.3508.8.10.14 has wrong size: expected 375361, got 0
1736 SETI@home 2012-05-18 21:17:31 [error] Checksum or signature error for 23fe11ab.28540.3508.8.10.14

Could that be due to the RAID resynchs Matt mentioned earlier?
Or maybe just saturation on the download pipe?

See Khangollo's message 1232644, posted before the power failure, showing that kind of error. Whatever the problem is, it's not an aftereffect of the power failure.
                                                                  Joe

I suspect some of the drive volumes were unreachable by the download servers shortly before everything shut down. Are all the servers on the same UPS ?

Since these download errors were already being discussed nearly four hours before the power went out, they can't very well be a result of it.

If the Cricket graph is a reliable indicator of the time (I don't see why it wouldn't be), the outage started at just about precisely (suspiciously precisely) 2000 PDT, or 0300 UTC. I have 39 such download errors, most if not all from several hours before 0300 UTC, so once again, they are clearly not a result of the power outage.
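
The PDT-to-UTC step above is easy to get wrong by a day, so here is a small sketch of the conversion. The calendar date is only a placeholder (the thread does not pin it down), PDT is taken as a fixed UTC-7 offset, and `pdt_to_utc` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# PDT treated as a fixed UTC-7 offset; enough for a single summer timestamp.
PDT = timezone(timedelta(hours=-7), "PDT")


def pdt_to_utc(local):
    """Interpret a naive datetime as PDT wall-clock time and convert to UTC."""
    return local.replace(tzinfo=PDT).astimezone(timezone.utc)


# Placeholder date (assumption): 15 May 2012, 2000 PDT.
outage_start = pdt_to_utc(datetime(2012, 5, 15, 20, 0))
print(outage_start.isoformat())  # 2012-05-16T03:00:00+00:00
```

Note that 2000 PDT lands at 0300 UTC on the following calendar day, which matters when comparing against timestamps in client logs.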

What I'm wondering about is that some of my cancelled WUs still show as in progress for other users.

David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


ID: 1233217
Claggy
Project Donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4622
Credit: 46,333,554
RAC: 3,115
United Kingdom
Message 1233306 - Posted: 19 May 2012, 6:54:14 UTC - in response to Message 1233115.
Last modified: 19 May 2012, 6:54:37 UTC

Another issue, maybe...

Although SETI Beta is up, the usual http://setiweb.ssl.berkeley.edu/beta/ link gives a 404 not found status for me. http://setiathome.ssl.berkeley.edu/beta/ or just http://setiathome.berkeley.edu/beta/ will get the front page, but the cookies (and the certificate for logging in) are all for setiweb. Anyone else seeing the same?
                                                                   Joe

I'm already logged on to the first URL (I normally have to go through the certificate error to do so); looking at my account on the 2nd and 3rd URLs gives me the certificate error.

Claggy

ID: 1233306