Storage machine crash....

Message boards : News : Storage machine crash....
Eric Korpela Project Donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1338
Credit: 36,845,996
RAC: 62,011
United States
Message 1979420 - Posted: 8 Feb 2019, 21:34:28 UTC

A machine that was holding 15% of our outgoing workunits has crashed and refuses to start back up. Short term it means that attempts to access those workunits will cause an error until the workunit is marked as bad.

Sorry for the inconvenience.
@SETIEric

Gary Charpentier Crowdfunding Project Donor, Special Project $250 donor
Volunteer tester
Joined: 25 Dec 00
Posts: 24722
Credit: 45,426,790
RAC: 26,766
United States
Message 1979430 - Posted: 8 Feb 2019, 22:09:51 UTC

eeeeekk

FurryGuy
Volunteer tester
Joined: 1 Jun 04
Posts: 3
Credit: 6,851,785
RAC: 7,983
United States
Message 1979445 - Posted: 8 Feb 2019, 22:54:31 UTC - in response to Message 1979420.  

Eric Korpela wrote: "A machine that was holding 15% of our outgoing workunits has crashed and refuses to start back up. Short term it means that attempts to access those workunits will cause an error until the workunit is marked as bad. Sorry for the inconvenience."

So......

should we wait for the server to catch up on its own, or should we abort any stalled download WUs?

Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 10951
Credit: 155,350,505
RAC: 81,590
Australia
Message 1979446 - Posted: 8 Feb 2019, 22:58:47 UTC - in response to Message 1979445.  
Last modified: 8 Feb 2019, 23:00:42 UTC

FurryGuy wrote: "should we wait for the server to catch up on its own, or should we abort any stalled download WUs?"

I wouldn't abort them. There are stalled downloads occurring even with the new work units being allocated for download, and they will download eventually. It's just taking a while for them to start, often with extended pauses and restarts before they finally download.

Oh, and even though the splitters show as running, they're not actually producing much work at the moment. So work is going to remain extremely scarce for some time yet.
Grant
Darwin NT

ronssito
Joined: 8 Feb 00
Posts: 13
Credit: 25,242,189
RAC: 68,093
United States
Message 1979514 - Posted: 9 Feb 2019, 2:36:35 UTC

Thanks and great job guys!

Chris S Crowdfunding Project Donor, Special Project $75 donor
Volunteer tester
Joined: 19 Nov 00
Posts: 41500
Credit: 41,928,838
RAC: 104
Message 1979540 - Posted: 9 Feb 2019, 6:40:48 UTC

Thanks for keeping us in touch with events Eric.

J3P-0
Joined: 1 Dec 11
Posts: 42
Credit: 10,858,037
RAC: 82,875
United States
Message 1979582 - Posted: 9 Feb 2019, 17:27:30 UTC
Last modified: 9 Feb 2019, 17:35:41 UTC

Thanks for the update. As of 11:25 CST this morning, I have a bunch of tasks waiting to report and nothing downloaded. Should I continue to wait or abort, or will it pick back up when storage is back online?

EDIT: seems a reboot fixed my issue.

rob smith Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 17182
Credit: 382,851,767
RAC: 142,540
United Kingdom
Message 1979594 - Posted: 9 Feb 2019, 18:54:20 UTC

Don't abort them. Only part of the storage system has failed, and there is no way for us to tell whether a task was distributed from the failed part or from the part that is working correctly (only about 15% of the tasks were being served by the failed computer).
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

Eric Korpela Project Donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1338
Credit: 36,845,996
RAC: 62,011
United States
Message 1979678 - Posted: 10 Feb 2019, 7:28:47 UTC

The system eventually came back up and we're getting the missing workunits back online as quickly as we can. There will still be some download errors as things will be out of synchronization for a while. Some workunits that exist in the database may not have been flushed to disk before the system went down (although in theory our disk controllers shouldn't allow that to happen).
@SETIEric

Brent Norman Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer tester
Joined: 1 Dec 99
Posts: 2736
Credit: 420,699,286
RAC: 858,601
Canada
Message 1979679 - Posted: 10 Feb 2019, 7:53:04 UTC - in response to Message 1979678.  

Thanks for the update, Eric.
I noticed some tasks making progress now, such as this one.

Ray Cameron
Joined: 2 Sep 04
Posts: 5
Credit: 107,439
RAC: 485
Canada
Message 1980093 - Posted: 13 Feb 2019, 14:06:44 UTC - in response to Message 1979420.  

Any idea when there will be data available to process?

Ray

Chris S Crowdfunding Project Donor, Special Project $75 donor
Volunteer tester
Joined: 19 Nov 00
Posts: 41500
Credit: 41,928,838
RAC: 104
Message 1980100 - Posted: 13 Feb 2019, 14:23:52 UTC

A hassle you could have done without, I expect, Eric! Thanks for keeping us workers at the coalface in touch.

rob smith Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 17182
Credit: 382,851,767
RAC: 142,540
United Kingdom
Message 1980102 - Posted: 13 Feb 2019, 14:26:54 UTC

As the servers are coming back to life after a ~24-hour break, one can expect it to be a S-L-O-W process.
(Given the time they started to come back, I would guess that it's an automated job, triggered by some task or other coming back to life.)
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

Ray Cameron
Joined: 2 Sep 04
Posts: 5
Credit: 107,439
RAC: 485
Canada
Message 1980157 - Posted: 13 Feb 2019, 19:56:08 UTC - in response to Message 1980093.  

2:55 Eastern time, and my computer just downloaded work. I'm now processing!

Eric Korpela Project Donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1338
Credit: 36,845,996
RAC: 62,011
United States
Message 1980206 - Posted: 14 Feb 2019, 1:28:21 UTC

Sorry for the late notice. The problems we had bloated the result table to about double its normal size. Hopefully it will be back down to normal next week.
@SETIEric

Sirius B Project Donor
Volunteer tester
Joined: 26 Dec 00
Posts: 20479
Credit: 2,762,744
RAC: 279
Ireland
Message 1980209 - Posted: 14 Feb 2019, 1:42:31 UTC - in response to Message 1980206.  

Wow, still on campus at 17:20? Whatever will CRL say?

Thanks for what you guys do. :-)

ronssito
Joined: 8 Feb 00
Posts: 13
Credit: 25,242,189
RAC: 68,093
United States
Message 1980298 - Posted: 14 Feb 2019, 15:24:06 UTC

3rd consecutive day of no BOINC stats update

Eric Korpela Project Donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1338
Credit: 36,845,996
RAC: 62,011
United States
Message 1980347 - Posted: 14 Feb 2019, 19:40:04 UTC - in response to Message 1980298.  

ronssito wrote: "3rd consecutive day of no BOINC stats update"

Not sure why that would be. Our stats files are in place and have current timestamps.

https://setiathome.berkeley.edu/stats/
@SETIEric

Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 12824
Credit: 134,088,481
RAC: 41,367
United Kingdom
Message 1980349 - Posted: 14 Feb 2019, 19:50:45 UTC - in response to Message 1980347.  

BOINCstats has processed one now. Probably all three days' worth in one go, judging by the figures on my account.

ronssito
Joined: 8 Feb 00
Posts: 13
Credit: 25,242,189
RAC: 68,093
United States
Message 1980355 - Posted: 14 Feb 2019, 20:11:12 UTC
Last modified: 14 Feb 2019, 20:15:22 UTC

My BOINC stats update each morning, so tomorrow we shall see.

©2019 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.