Down Time (May 01 2007)

Message boards : Technical News : Down Time (May 01 2007)
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 557665 - Posted: 1 May 2007, 21:53:12 UTC
Last modified: 1 May 2007, 22:28:25 UTC

This was one of those days. Sometime in the early morning MySQL on sidious crashed and rebooted itself. It had minor indigestion and restarted on its own just fine. Eric had to restart the BOINC projects to clean the pipes.

But when I came in I found Eric dissecting our master database server, thumper. That's never a good sign. He and Jeff informed me that it lost the ability to see any of its internal drives. Tests throughout the day confirmed that diagnosis - there's something dead between the power supply and the disk controllers so the drives don't even spin up. Booting from a DVD and running "fdisk" shows nothing. This system has a "preliminary" motherboard, which is one of the reasons we got it for free, but it has no hardware support.

Meanwhile I went ahead with the usual database backup/compression while we figured out what the heck we're gonna do. We're pretty confident the data is intact, and as long as some server somewhere can mount the 24 SATA drives that make up the database, the SETI@home science data will be perfectly intact. Failing that, we can recover from tape, but unfortunately we're at a bad point in the backup cycle so the most recent tape is a week old.

Since data loss is most likely not an issue, the upshot of thumper being down is that we can't run the splitters or the assimilators. I just restarted the scheduler, but we only had about 300,000 results to process. I checked again just now and it's already down to about 281,000. Brace yourselves for a long outage.

[Edit: things are looking better regarding previously mentioned inability to procure a replacement. In other words, we might get another server relatively quickly.]

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 557665 · Report as offensive
Jamie

Joined: 8 Feb 01
Posts: 28
Credit: 11,078,008
RAC: 0
United States
Message 557687 - Posted: 1 May 2007, 22:27:15 UTC

Are the scheduler errors that I'm getting related to this issue, or is it something separate?
ID: 557687 · Report as offensive
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 557688 - Posted: 1 May 2007, 22:29:27 UTC - in response to Message 557687.  
Last modified: 1 May 2007, 22:30:06 UTC

Are the scheduler errors that I'm getting related to this issue, or is it something separate?


That's just due to the servers coming online after the 3-hour database backup, so they are swamped with requests and dropping connections. It'll get better soon, but when work runs out (in about 3-6 hours) it'll get worse.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 557688 · Report as offensive
Iztok s52d (and friends)

Joined: 12 Jan 01
Posts: 136
Credit: 393,469,375
RAC: 116
Slovenia
Message 557689 - Posted: 1 May 2007, 22:32:35 UTC - in response to Message 557665.  

[...]
Since data loss is most likely not an issue, the upshot of thumper being down is that we can't run the splitters or the assimilators. I just restarted the scheduler, but we only had about 300,000 results to process. I checked again just now and it's already down to about 281,000. Brace yourselves for a long outage.

- Matt


Hi! Maybe the splitters can store data somewhere temporarily;
once thumper is back and tested, you can just load this data and then resume the assimilators?

Good luck with boxes!

73
Iztok
ID: 557689 · Report as offensive
John-James-Connellan

Joined: 16 Jan 00
Posts: 4
Credit: 824,683
RAC: 0
Ireland
Message 557712 - Posted: 1 May 2007, 23:20:25 UTC - in response to Message 557689.  

[...]
Since data loss is most likely not an issue, the upshot of thumper being down is that we can't run the splitters or the assimilators. I just restarted the scheduler, but we only had about 300,000 results to process. I checked again just now and it's already down to about 281,000. Brace yourselves for a long outage.

- Matt


Hi! Maybe the splitters can store data somewhere temporarily;
once thumper is back and tested, you can just load this data and then resume the assimilators?

Good luck with boxes!

73
Iztok

Should there be a second master science database as backup? (Perhaps a generous sponsor may be able to help.) From Passive SETI Alpha Tester Brendan
ID: 557712 · Report as offensive
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 557719 - Posted: 1 May 2007, 23:26:36 UTC - in response to Message 557712.  

Maybe the splitters can store data somewhere temporarily;
once thumper is back and tested, you can just load this data and then resume the assimilators?

When splitters first create work they need access to the master science database to store the workunit information, so that when results eventually appear on the other end of the pipeline each one will have a workunit to match it.
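As a rough illustration of that dependency, here is a minimal sketch. It is not the project's actual code: the table layout, the function names (`open_science_db`, `split`, `assimilate`), and the use of in-memory SQLite as a stand-in for the master science database are all assumptions made for the example.

```python
import sqlite3


def open_science_db():
    """Hypothetical stand-in for the master science database (in-memory SQLite)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE workunit (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
    db.execute(
        "CREATE TABLE result ("
        "id INTEGER PRIMARY KEY, "
        "workunit_id INTEGER NOT NULL REFERENCES workunit(id), "
        "payload TEXT)"
    )
    return db


def split(db, name):
    """Splitter step: record the workunit before any work is handed out."""
    cur = db.execute("INSERT INTO workunit (name) VALUES (?)", (name,))
    db.commit()
    return cur.lastrowid


def assimilate(db, workunit_name, payload):
    """Assimilator step: store a returned result only if its workunit is on record."""
    row = db.execute(
        "SELECT id FROM workunit WHERE name = ?", (workunit_name,)
    ).fetchone()
    if row is None:
        # Work created while the database was unreachable would end up here.
        raise LookupError(f"no workunit record for {workunit_name!r}")
    db.execute(
        "INSERT INTO result (workunit_id, payload) VALUES (?, ?)", (row[0], payload)
    )
    db.commit()
    return row[0]
```

In this toy version, a result for a workunit that was never written to the database cannot be matched, which is why splitting has to wait until the database is reachable again.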

Should there be a second master science database as backup?

Of course there *should* be, but lack of resources (a.k.a. money) dictated that our policy was to be satisfied with a RAIDed database server with tape backups just in case. Of course RAID doesn't do you any good when, according to the OS, *every* disk suddenly disappears. I don't want to claim anything prematurely, but in light of this we may get a backup server after all.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 557719 · Report as offensive
WlanWorks1

Joined: 11 Jan 06
Posts: 7
Credit: 446,368
RAC: 0
Message 557878 - Posted: 2 May 2007, 3:41:21 UTC - in response to Message 557688.  

Are the scheduler errors that I'm getting related to this issue, or is it something separate?


That's just due to the servers coming online after the 3-hour database backup, so they are swamped with requests and dropping connections. It'll get better soon, but when work runs out (in about 3-6 hours) it'll get worse.

- Matt



This question may be premature, and is most likely covered by your response above, but I am going to ask anyway. The downtime today, I now know, was due to the "phantom hard drives" on your end. At first, my messages advised that I was requesting new work but that the project servers may be down. However, since 4:06pm and CURRENTLY, when I attempt to connect via "update", I am getting the message "(not requesting new work or reporting completed tasks)".

Now, I know that most likely I will not gain new work, and I have ceased attempts to contact the servers as it will only add to the traffic congestion. What I want to know, though, is whether that message is related to the downed servers and hard drives and their coming back online, or whether I have a separate issue all of a sudden. More importantly, will I be able to request and obtain new work without any action on my end?
ID: 557878 · Report as offensive
Odysseus
Volunteer tester
Joined: 26 Jul 99
Posts: 1808
Credit: 6,701,347
RAC: 6
Canada
Message 557883 - Posted: 2 May 2007, 3:59:21 UTC - in response to Message 557878.  

[…] when I attempt to connect, via "update" - I am getting the message "(not requesting new work or reporting completed tasks)".

Now, I know that most likely I will not gain new work, and I have ceased attempts to contact the servers as it will only add to the traffic congestion. What I want to know, though, is whether that message is related to the downed servers and hard drives and their coming back online, or whether I have a separate issue all of a sudden. More importantly, will I be able to request and obtain new work without any action on my end?

That message isn’t usually anything to worry about, unless you do have current tasks that are “Ready to report”. The first part just means BOINC thinks you have—or have had—enough work from the project for the time being. Do you run any other projects? Did you still have S@h work in progress or queued when you tried the update? Only if the answer to both questions is “no” would I assume there’s any problem at your end.

ID: 557883 · Report as offensive
Profile KWSN - Chicken of Angnor
Volunteer developer
Volunteer tester
Joined: 9 Jul 99
Posts: 1199
Credit: 6,615,780
RAC: 0
Austria
Message 557927 - Posted: 2 May 2007, 5:33:44 UTC - in response to Message 557665.  

[...]
Since data loss is most likely not an issue, the upshot of thumper being down is that we can't run the splitters or the assimilators. I just restarted the scheduler, but we only had about 300,000 results to process. I checked again just now and it's already down to about 281,000. Brace yourselves for a long outage.

[Edit: things are looking better regarding previously mentioned inability to procure a replacement. In other words, we might get another server relatively quickly.]

- Matt

Phew, at least it isn't the disks themselves - spot of luck in an unlucky overall situation. Data is always more valuable than hardware...

That said, good luck with procuring a replacement, and maybe a new system board for Thumper so the master science DB will have a live backup.

Regards,
Simon.
ID: 557927 · Report as offensive
Paul J. Bennett

Joined: 6 Oct 06
Posts: 4
Credit: 185,279
RAC: 0
United States
Message 557930 - Posted: 2 May 2007, 5:53:43 UTC - in response to Message 557665.  

This was one of those days. Sometime in the early morning MySQL on sidious crashed and rebooted itself. It had minor indigestion and restarted on its own just fine. Eric had to restart the BOINC projects to clean the pipes.

But when I came in I found Eric dissecting our master database server, thumper. That's never a good sign. He and Jeff informed me that it lost the ability to see any of its internal drives. Tests throughout the day confirmed that diagnosis - there's something dead between the power supply and the disk controllers so the drives don't even spin up. Booting from a DVD and running "fdisk" shows nothing. This system has a "preliminary" motherboard, which is one of the reasons we got it for free, but it has no hardware support.

Meanwhile I went ahead with the usual database backup/compression while we figured out what the heck we're gonna do. We're pretty confident the data is intact, and as long as some server somewhere can mount the 24 SATA drives that make up the database, the SETI@home science data will be perfectly intact. Failing that, we can recover from tape, but unfortunately we're at a bad point in the backup cycle so the most recent tape is a week old.

Since data loss is most likely not an issue, the upshot of thumper being down is that we can't run the splitters or the assimilators. I just restarted the scheduler, but we only had about 300,000 results to process. I checked again just now and it's already down to about 281,000. Brace yourselves for a long outage.

[Edit: things are looking better regarding previously mentioned inability to procure a replacement. In other words, we might get another server relatively quickly.]

- Matt

I suppose that is why I am getting the message STATUS COMMUNICATION DEFERRED.
ID: 557930 · Report as offensive
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51544
Credit: 1,018,363,574
RAC: 1,004
United States
Message 557956 - Posted: 2 May 2007, 7:34:16 UTC

Hope that Sun comes through for you and gets you Thumpin' again quickly!
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 557956 · Report as offensive
Profile Qurmo
Volunteer tester
Joined: 2 May 07
Posts: 8
Credit: 306,878
RAC: 0
Belgium
Message 557997 - Posted: 2 May 2007, 10:36:57 UTC

Hi,

I just switched projects, so I haven't got any work from you guys yet. Is there an indication of when I would be able to get work? I hope your problems get solved quickly ;)
ID: 557997 · Report as offensive
TarracoServer
Volunteer tester

Joined: 11 Apr 07
Posts: 38
Credit: 595,022
RAC: 0
Spain
Message 558003 - Posted: 2 May 2007, 10:55:48 UTC

Hi!
Yes, that's a real problem.
Maybe this is a bad idea (or maybe not), but what about making a daily backup copy (only of the new data, not the whole database) and storing it on a virtual HD on the net? I don't know how much storage that DB needs daily (I suppose several GB, not several TB ;)), but there are some free virtual HD space servers that could store that data. (Of course, for safety, it could be a good idea to use a couple of those servers with the same data stored.)
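A minimal sketch of the "only the new data" idea, under loud assumptions: the `signal` table, its ever-increasing `id` column, and both function names are hypothetical, and in-memory SQLite stands in for the real database. Nothing here reflects how the SETI@home database is actually organized.

```python
import json
import sqlite3


def make_demo_db():
    """Hypothetical table with an auto-incrementing id, used only for this example."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE signal (id INTEGER PRIMARY KEY, payload TEXT)")
    return db


def export_new_rows(db, last_id):
    """Return (dump, new_last_id) covering only rows added after last_id.

    The dump is one JSON object per line, ready to be copied to remote storage.
    """
    rows = db.execute(
        "SELECT id, payload FROM signal WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()
    dump = "\n".join(json.dumps({"id": r[0], "payload": r[1]}) for r in rows)
    new_last_id = rows[-1][0] if rows else last_id
    return dump, new_last_id
```

Each daily run would pass in the `new_last_id` saved from the previous run, so only rows added since then are exported. This only catches inserts; rows that are updated or deleted in place would need a different scheme.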

ID: 558003 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13961
Credit: 208,696,464
RAC: 304
Australia
Message 558022 - Posted: 2 May 2007, 11:36:43 UTC - in response to Message 557997.  

Is there an indication of when I would be able to get work?

Not really, but from the nature of the problem I'd suggest a couple of days, if all goes well.
Grant
Darwin NT
ID: 558022 · Report as offensive
Profile Qurmo
Volunteer tester
Joined: 2 May 07
Posts: 8
Credit: 306,878
RAC: 0
Belgium
Message 558032 - Posted: 2 May 2007, 12:42:37 UTC

OK, thanks for the info. I hope it will be sooner, but we will see. So I'm on hold ;)
ID: 558032 · Report as offensive
Profile A planet

Joined: 4 Jun 06
Posts: 13
Credit: 1,233,155
RAC: 0
Netherlands
Message 558039 - Posted: 2 May 2007, 13:46:47 UTC

Well, at least I see it has been accepting my completed tasks.
But I am unable to request new work. Is there maybe an alternative solution available, or something to work on while thumper is down?
ID: 558039 · Report as offensive
Profile Steven - KO4E
Joined: 21 Jun 99
Posts: 53
Credit: 2,434,487
RAC: 0
United States
Message 558042 - Posted: 2 May 2007, 14:12:11 UTC - in response to Message 558039.  

Well, at least I see it has been accepting my completed tasks.
But I am unable to request new work. Is there maybe an alternative solution available, or something to work on while thumper is down?

Yes, just attach to another project and do work for them while SETI fixes the hardware.

SETI@home classic workunits 5,429
SETI@home classic CPU time 73,472 hours
ID: 558042 · Report as offensive
Profile Kikarn

Joined: 21 Jan 03
Posts: 1
Credit: 3,351,670
RAC: 0
Sweden
Message 558047 - Posted: 2 May 2007, 14:31:46 UTC

Why doesn't SETI work? I've been trying to upload and download work,
and I get no response.
Why doesn't it work? Is it not important?

ID: 558047 · Report as offensive
Conrad Human
Volunteer tester

Joined: 17 Nov 00
Posts: 67
Credit: 2,009,224
RAC: 0
South Africa
Message 558051 - Posted: 2 May 2007, 14:44:51 UTC - in response to Message 558047.  
Last modified: 2 May 2007, 14:45:39 UTC

Please read http://setiathome.berkeley.edu/forum_thread.php?id=39188&nowrap=true#557665

As I still have about 1 day's worth of work: does this affect beta as well?

Oh well, if it goes, it goes with a bang (someone find a 24-port SATA controller).

I am sure we will have an update from Matt later today.


Why doesn't SETI work? I've been trying to upload and download work,
and I get no response.
Why doesn't it work? Is it not important?


ID: 558051 · Report as offensive
sideband@seti.usa
Joined: 19 Jun 99
Posts: 25
Credit: 2,774,864
RAC: 0
United States
Message 558052 - Posted: 2 May 2007, 14:48:02 UTC

Could this explain why, over the last week or so, my RAC seems to have fallen (to the tune of 1K), while the output of my machines has remained relatively constant (aside from Bishop's burp and Twiggy's downtime)?

I've noted that other members of my team have been experiencing similar drops, etc, and was wondering what was going on there, too...
73 de AI8W, Chris

Abdico Concussio Fidens Servo Libertas Semper!

ID: 558052 · Report as offensive
©2025 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.