Down Time (May 01 2007)

Author	Message
Matt Lebofsky Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0	Message 557665 - Posted: 1 May 2007, 21:53:12 UTC Last modified: 1 May 2007, 22:28:25 UTC This was one of those days. Sometime in the early morning MySQL on sidious crashed and rebooted itself. It had minor indigestion and restarted on its own just fine. Eric had to restart the BOINC projects to clean the pipes. But when I came in I found Eric dissecting our master database server, thumper. That's never a good sign. He and Jeff informed me that it lost the ability to see any of its internal drives. Tests throughout the day confirmed that diagnosis - there's something dead between the power supply and the disk controllers so the drives don't even spin up. Booting from a DVD and running "fdisk" shows nothing. This system has a "preliminary" motherboard, which is one of the reasons we got it for free, but it has no hardware support. Meanwhile I went ahead with the usual database backup/compression while we figured out what the heck we're gonna do. We're pretty confident the data is intact and as long as some server somewhere can mount the 24 SATA drives that make up the database the SETI@home science data will be perfectly intact. Failing that, we can recover from tape but unfortunately we're at a bad point in the backup cycle so the most recent tape is a week old. Since data loss is most likely not an issue, the upshot of thumper being down is that we can't run the splitters or the assimilators. I just restarted the scheduler, but we only had about 300,000 results to process. I checked again just now and it's already down to about 281,000. Brace yourselves for a long outage. [Edit: things are looking better regarding previously mentioned inability to procure a replacement. In other words, we might get another server relatively quickly.]
- Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude ID: 557665 ·

Jamie Send message Joined: 8 Feb 01 Posts: 28 Credit: 11,078,008 RAC: 0	Message 557687 - Posted: 1 May 2007, 22:27:15 UTC Are the scheduler errors that I'm getting related to this issue, or is it something separate? ID: 557687 ·

Matt Lebofsky Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0	Message 557688 - Posted: 1 May 2007, 22:29:27 UTC - in response to Message 557687. Last modified: 1 May 2007, 22:30:06 UTC Are the scheduler errors that I'm getting related to this issue, or is it something separate? That's just due to the servers coming online after the 3-hour database backup, so they are swamped with requests and dropping connections. It'll get better soon, but when work runs out (in about 3-6 hours) it'll get worse. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude ID: 557688 ·

Iztok s52d (and friends) Send message Joined: 12 Jan 01 Posts: 136 Credit: 393,469,375 RAC: 116	Message 557689 - Posted: 1 May 2007, 22:32:35 UTC - in response to Message 557665. Since data loss is most likely not an issue, the upshot of thumper being down is that we can't run the splitters or the assimilators. I just restarted the scheduler, but we only had about 300,000 results to process. I checked again just now and it's already down to about 281,000. Brace yourselves for a long outage. - Matt Hi! Maybe splitters can store data somewhere temporary, once thumper is back and tested, you can just load this data and then resume assimilators? Good luck with boxes! 73 Iztok ID: 557689 ·

John-James-Connellan Send message Joined: 16 Jan 00 Posts: 4 Credit: 824,683 RAC: 0	Message 557712 - Posted: 1 May 2007, 23:20:25 UTC - in response to Message 557689. Since data loss is most likely not an issue, the upshot of thumper being down is that we can't run the splitters or the assimilators. I just restarted the scheduler, but we only had about 300,000 results to process. I checked again just now and it's already down to about 281,000. Brace yourselves for a long outage. - Matt Hi! Maybe splitters can store data somewhere temporary, once thumper is back and tested, you can just load this data and then resume assimilators? Good luck with boxes! 73 Iztok Should there be a second master science database as backup? (Perhaps a generous sponsor may be able to help.) from Passive Seti Alpha Tester Brendan ID: 557712 ·

Matt Lebofsky Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0	Message 557719 - Posted: 1 May 2007, 23:26:36 UTC - in response to Message 557712. Maybe splitters can store data somewhere temporary, once thumper is back and tested, you can just load this data and then resume assimilators? When splitters first create work they need access to the master science database to store the workunit information - so that when results eventually appear on the other end of the pipeline they will have a workunit to match it. Should there be a second master science database as backup? Of course there should be, but lack of resources (a.k.a money) dictated that our policy was to be satisfied with a RAIDed database server with tape backups just in case. Of course RAID doesn't do you any good when every disk suddenly disappears according to the OS. I don't want to claim anything prematurely, but in light of this we may get a backup server after all. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude ID: 557719 ·
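A minimal sketch of the dependency Matt describes, for readers who want to see it concretely. It is an illustration under stated assumptions, not the project's actual code: the `ScienceDB` class, the `split` and `assimilate` functions, and the record fields are all hypothetical names, and the database is just an in-memory dictionary. It shows why a splitter cannot simply stash work somewhere temporary: a result can only be matched if its workunit record was stored when the work was created.

```python
from dataclasses import dataclass, field


@dataclass
class ScienceDB:
    """Hypothetical in-memory stand-in for the master science database."""
    workunits: dict = field(default_factory=dict)
    available: bool = True

    def _check(self):
        # Simulates the outage: every operation fails while the server is down.
        if not self.available:
            raise ConnectionError("master science database is unreachable")

    def store_workunit(self, wu_id, info):
        self._check()
        self.workunits[wu_id] = info

    def lookup_workunit(self, wu_id):
        self._check()
        return self.workunits[wu_id]


def split(db, chunk_name, wu_id):
    """Splitter step: record the workunit before it is handed out."""
    db.store_workunit(wu_id, {"source": chunk_name})
    return wu_id


def assimilate(db, result):
    """Assimilator step: match a returned result to its stored workunit."""
    workunit = db.lookup_workunit(result["wu_id"])
    return {"workunit": workunit, "signal": result["signal"]}


if __name__ == "__main__":
    db = ScienceDB()
    split(db, "chunk-001", wu_id=1)
    print(assimilate(db, {"wu_id": 1, "signal": 0.5}))

    db.available = False  # the database server goes away
    try:
        split(db, "chunk-002", wu_id=2)
    except ConnectionError as err:
        print("splitter stalled:", err)
```

In this toy version both ends of the pipeline stall as soon as `available` is false, which mirrors the statement that neither the splitters nor the assimilators can run while thumper is down.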

WlanWorks1 Send message Joined: 11 Jan 06 Posts: 7 Credit: 446,368 RAC: 0	Message 557878 - Posted: 2 May 2007, 3:41:21 UTC - in response to Message 557688. Are the scheduler errors that I'm getting related to this issue, or is it something separate? That's just due to the servers coming online after the 3-hour database backup, so they are swamped with requests and dropping connections. It'll get better soon, but when work runs out (in about 3-6 hours) it'll get worse. - Matt This question may be premature, and is most likely covered by your response above, but I am going to ask anyway. The downtime today, I now know, was due to the "phantom hard drives" on your end. At first, my messages were advising that I was requesting new work but the project servers may be down. However, after 4:06pm and CURRENTLY, when I attempt to connect via "update", I am getting the message "(not requesting new work or reporting completed tasks)". Now, I know that most likely I will not gain new work, and I have ceased attempts to contact the servers as it will only add to the traffic congestion. What I want to know, though, is whether that message is related to the downed servers and hard drives and their coming back online, or do I have a separate issue all of a sudden? More importantly, will I be able to request and obtain new work without any action on my end? ID: 557878 ·

Odysseus Volunteer tester Send message Joined: 26 Jul 99 Posts: 1808 Credit: 6,701,347 RAC: 6	Message 557883 - Posted: 2 May 2007, 3:59:21 UTC - in response to Message 557878. […] when I attempt to connect, via "update" - I am getting the message "(not requesting new work or reporting completed tasks)". Now, I know that most likely I will not gain new work, and I have ceased attempts to contact the servers as it will only add to the traffic congestion. What I want to know, though, is that message related to the downed servers and hard drives and their coming back on line, or do I have a separate issue all of a sudden? More importantly, will I be able to request and obtain new work without any actions on My end? That message isn't usually anything to worry about, unless you do have current tasks that are "Ready to report". The first part just means BOINC thinks you have (or have had) enough work from the project for the time being. Do you run any other projects? Did you still have S@h work in progress or queued when you tried the update? Only if the answer to both questions is "no" would I assume there's any problem at your end. ID: 557883 ·

KWSN - Chicken of Angnor Volunteer developer Volunteer tester Send message Joined: 9 Jul 99 Posts: 1199 Credit: 6,615,780 RAC: 0	Message 557927 - Posted: 2 May 2007, 5:33:44 UTC - in response to Message 557665. [...] Since data loss is most likely not an issue, the upshot of thumper being down is that we can't run the splitters or the assimilators. I just restarted the scheduler, but we only had about 300,000 results to process. I checked again just now and it's already down to about 281,000. Brace yourselves for a long outage. [Edit: things are looking better regarding previously mentioned inability to procure a replacement. In other words, we might get another server relatively quickly.] - Matt Phew, at least it isn't the disks themselves - spot of luck in an unlucky overall situation. Data is always more valuable than hardware... That said, good luck with procuring a replacement, and maybe a new system board for Thumper so the master science DB will have a live backup. Regards, Simon. Donate to SETI@Home via PayPal! Optimized SETI@Home apps + Information ID: 557927 ·

Paul J. Bennett Send message Joined: 6 Oct 06 Posts: 4 Credit: 185,279 RAC: 0	Message 557930 - Posted: 2 May 2007, 5:53:43 UTC - in response to Message 557665. This was one of those days. Sometime in the early morning MySQL on sidious crashed and rebooted itself. It had minor indigestion and restarted on its own just fine. Eric had to restart the BOINC projects to clean the pipes. But when I came in I found Eric dissecting our master database server, thumper. That's never a good sign. He and Jeff informed me that it lost the ability to see any of its internal drives. Tests throughout the day confirmed that diagnosis - there's something dead between the power supply and the disk controllers so the drives don't even spin up. Booting from a DVD and running "fdisk" shows nothing. This system has a "preliminary" motherboard, which is one of the reasons we got it for free, but it has no hardware support. Meanwhile I went ahead with the usual database backup/compression while we figured out what the heck we're gonna do. We're pretty confident the data is intact and as long as some server somewhere can mount the 24 SATA drives that make up the database the SETI@home science data will be perfectly intact. Failing that, we can recover from tape but unfortunately we're at a bad point in the backup cycle so the most recent tape is a week old. Since data loss is most likely not an issue, the upshot of thumper being down is that we can't run the splitters or the assimilators. I just restarted the scheduler, but we only had about 300,000 results to process. I checked again just now and it's already down to about 281,000. Brace yourselves for a long outage. [Edit: things are looking better regarding previously mentioned inability to procure a replacement. In other words, we might get another server relatively quickly.] - Matt I suppose that is why I am getting the message STATUS COMMUNICATION DEFERRED. ID: 557930 ·

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51544 Credit: 1,018,363,574 RAC: 1,004	Message 557956 - Posted: 2 May 2007, 7:34:16 UTC Hope that Sun comes through for you and gets you Thumpin' again quickly! "Time is simply the mechanism that keeps everything from happening all at once." ID: 557956 ·

Qurmo Volunteer tester Send message Joined: 2 May 07 Posts: 8 Credit: 306,878 RAC: 0	Message 557997 - Posted: 2 May 2007, 10:36:57 UTC Hi, I just switched projects, so I haven't got any work from you guys yet. Is there any indication of when I will be able to get work? I hope your problems get solved quickly ;) ID: 557997 ·

TarracoServer Volunteer tester Send message Joined: 11 Apr 07 Posts: 38 Credit: 595,022 RAC: 0	Message 558003 - Posted: 2 May 2007, 10:55:48 UTC Hi! Yes, that's a real problem. Maybe this is a bad idea (or maybe not), but what about making a daily backup copy (only of the new data, not the whole database) and storing it on a virtual HD on the net? I don't know how much storage that DB needs daily (I suppose several GB, not several TB ;)), but there are some free virtual HD servers that could store that data. (Of course, for safety, it could be a good idea to use a couple of those servers with the same data stored on each.) ID: 558003 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13961 Credit: 208,696,464 RAC: 304	Message 558022 - Posted: 2 May 2007, 11:36:43 UTC - in response to Message 557997. Is there any indication of when I will be able to get work? Not really, but from the nature of the problem I'd suggest a couple of days, if all goes well. Grant Darwin NT ID: 558022 ·

Qurmo Volunteer tester Send message Joined: 2 May 07 Posts: 8 Credit: 306,878 RAC: 0	Message 558032 - Posted: 2 May 2007, 12:42:37 UTC OK, thanks for the intel. I hope it will be sooner, but we will see. So I'm on hold ;) ID: 558032 ·

A planet Send message Joined: 4 Jun 06 Posts: 13 Credit: 1,233,155 RAC: 0	Message 558039 - Posted: 2 May 2007, 13:46:47 UTC Well, at least I see it has been accepting my completed tasks. But I am unable to request new work. Is there maybe an alternative solution available, or something to work on while thumper is down? ID: 558039 ·

Steven - KO4E Send message Joined: 21 Jun 99 Posts: 53 Credit: 2,434,487 RAC: 0	Message 558042 - Posted: 2 May 2007, 14:12:11 UTC - in response to Message 558039. Well, at least I see it has been accepting my completed tasks. But I am unable to request new work. Is there maybe an alternative solution available, or something to work on while thumper is down? Yes, just attach to another project and do work for them while SETI fixes the hardware. SETI@home classic workunits 5,429 SETI@home classic CPU time 73,472 hours ID: 558042 ·

Kikarn Send message Joined: 21 Jan 03 Posts: 1 Credit: 3,351,670 RAC: 0	Message 558047 - Posted: 2 May 2007, 14:31:46 UTC Why doesn't SETI work? I've been trying to upload and download work, and I get no response. Why doesn't it work? Is it not important? ID: 558047 ·

Conrad Human Volunteer tester Send message Joined: 17 Nov 00 Posts: 67 Credit: 2,009,224 RAC: 0	Message 558051 - Posted: 2 May 2007, 14:44:51 UTC - in response to Message 558047. Last modified: 2 May 2007, 14:45:39 UTC Please read http://setiathome.berkeley.edu/forum_thread.php?id=39188&nowrap=true#557665 I still have about 1 day's worth of work; does this affect beta as well? Oh well, if it goes, it goes with a bang (someone find a 24-port SATA controller). I am sure we will have an update from Matt later today. Why doesn't SETI work? I've been trying to upload and download work, and I get no response. Why doesn't it work? Is it not important? ID: 558051 ·

sideband@seti.usa Send message Joined: 19 Jun 99 Posts: 25 Credit: 2,774,864 RAC: 0	Message 558052 - Posted: 2 May 2007, 14:48:02 UTC Could this explain why, over the last week or so, my RAC seems to have fallen (to the tune of 1K), while the output of my machines has remained relatively constant (aside from Bishop's burp and Twiggy's downtime)? I've noted that other members of my team have been experiencing similar drops, etc, and was wondering what was going on there, too... 73 de AI8W, Chris Abdico Concussio Fidens Servo Libertas Semper! ID: 558052 ·


SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.