Down Time III (May 03 2007)
Author | Message |
---|---|
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
Last night galileo crashed. Nobody could see the scheduler all evening. Most of our other systems were stuck hanging on its mounts, which explains the painfully slow web servers. Not sure what caused the crash, but it seems like the typical panic/reset that happens on machines that have been up for many months at a time. Upon reboot it needed to fsck a drive and went into single-user mode, waiting for somebody to log in and do so. That somebody was Jeff, around 8:30 this morning. Then it came up just fine. While scavenging for parts in our lab, Eric discovered a media converter, and I then found the right cable to let us hook up setifiler1 to the new gigabit switch via fibre. If there were any web glitches this morning, it was because we were in the process of doing this and cleaning out routing/arp tables afterwards. Now setifiler1 can talk gigabit to our other machines. Not sure if this helps much, but setifiler1 is an old but perfectly functioning Network Appliance NAS system containing, among other things, all the files that comprise the SETI@home public web site and the tape images for splitting. Jeff and I also wrapped up moving the lingering systems in the closet off the 100 Mbit switch and onto the new switch. Lots of ethernet/power cable spaghetti back there. On the science database front, the outage continues. Not much to say about that except we're still working on getting replacement hardware. Frankly, no real time estimate on that. Some people have noticed that, despite apparent claims to the contrary on our website, their clients were able to get new workunits. This is because, when BOINC clients take too long to process/return results or results fail validation, the BOINC backend puts these timed-out/unvalidated workunits back in the "to do" pile. I just checked and noticed we're still sending out workunits at the rate of 1 every 10+ seconds. Not exactly a lot... but not zero, either. 
- Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
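For readers curious how workunits can still trickle out during the outage, the requeue behaviour described in the post above can be sketched roughly as follows. This is a toy model only, not BOINC's actual backend code: the `Result` class, the `requeue_pending` and `max_sends_per_day` functions, and all field names are hypothetical and exist purely for illustration. The second function just turns the quoted "1 every 10+ seconds" into an upper bound per day.

```python
from collections import deque
from dataclasses import dataclass

SECONDS_PER_DAY = 24 * 60 * 60


@dataclass
class Result:
    """Hypothetical record of one copy of a workunit sent to a client."""
    workunit: str
    deadline: float          # time by which the client must report back
    returned: bool = False   # has the client reported a result?
    valid: bool = False      # did the returned result pass validation?


def requeue_pending(results, now, todo):
    """Put timed-out or unvalidated workunits back on the "to do" pile."""
    for r in results:
        timed_out = not r.returned and now > r.deadline
        failed_validation = r.returned and not r.valid
        if timed_out or failed_validation:
            todo.append(r.workunit)
    return todo


def max_sends_per_day(seconds_between_sends):
    """Upper bound on workunits sent per day at a given spacing."""
    return SECONDS_PER_DAY / seconds_between_sends


results = [
    Result("wu_a", deadline=100.0),                              # never returned, past deadline
    Result("wu_b", deadline=100.0, returned=True, valid=True),   # finished normally
    Result("wu_c", deadline=100.0, returned=True, valid=False),  # failed validation
    Result("wu_d", deadline=500.0),                              # still within its deadline
]
todo = requeue_pending(results, now=200.0, todo=deque())
print(list(todo))             # workunits eligible to be sent out again
print(max_sends_per_day(10))  # at most this many per day at one send every 10 s
```

In this sketch only `wu_a` and `wu_c` go back on the pile, and since the real spacing is "10+" seconds, the per-day figure is a ceiling rather than an estimate.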
Carlos Joined: 9 Jun 99 Posts: 30717 Credit: 57,275,487 RAC: 157 |
Well not great news, but thank you for taking the time to let us all know what is going on. If there are any suggestions as to what we the general crunching public can do please let us know. And thank you for your hard work and dedication. |
Fuzzy Hollynoodles Joined: 3 Apr 99 Posts: 9659 Credit: 251,998 RAC: 0 |
When it rains it pours. :-( Keep up the good job, Eric and co. "I'm trying to maintain a shred of dignity in this world." - Me |
Xur Joined: 27 Dec 00 Posts: 1 Credit: 1,023,577 RAC: 8 |
And when it pours, it is like acid rain! Good luck on getting everything back up and running! |
ML1 Joined: 25 Nov 01 Posts: 21253 Credit: 7,508,002 RAC: 20 |
When it rains it pours. :-( It is also called the "domino effect", or in more extreme cases comparisons to a "house of cards" are made. Considering the spaghetti of networking and multiple filesystem mounts that s@h relies upon, it is amazing that the system is as reliable as it is! All best luck to the sysops for working up out of this outage. May the data remain intact! Regards, Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) |
Mullet Joined: 13 Jun 02 Posts: 2 Credit: 73,503 RAC: 0 |
Hey Matt, Just an idea, you're already dating the posts, why not put the dates first, and then you won't have to make the new/ continuing news 'sticky' to keep it on top. Meanwhile, have fun, gotta love that switch room spaghetti! Keith (NetAdmin/ Cable monkey for 6 years now) |
wizardzik Joined: 1 Jan 02 Posts: 1 Credit: 674,944 RAC: 0 |
Fortunately I have some work to do... and I hope new work will be available as soon as possible. Take care |
Michael Joined: 5 Apr 04 Posts: 1 Credit: 873,964 RAC: 18 |
Just wondering if you have an ETA on when work units will be sent out?? I have been away from Seti for a year and just started again 10 days ago. Then all of a sudden POOF no work to do. Thanks Jeff and Eric for all of the hard work. Michael |
Ace Casino Joined: 5 Feb 03 Posts: 285 Credit: 29,750,804 RAC: 15 |
I've been looking around the boards and have not seen a definitive answer on whether "reporting results" is possible at all. Is it just difficult right now, or is it impossible until the hardware is replaced? If it's impossible we all might as well suspend network activity, not just suspend new tasks. Thanks |
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
ADDENDUM: Well, galileo just barfed again. At least it crashed and rebooted itself cleanly this time, and the culprit revealed itself: blown CPU. No big loss, as this system is edging towards retirement anyway, and can handle its current load with one less CPU (especially as these are ancient 400 MHz sparcs). I am making strides towards turning bruno into the scheduler just as soon as I submit this post. Probably won't be enacted until next week, though. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
zoom3+1=4 Joined: 30 Nov 03 Posts: 66362 Credit: 55,293,173 RAC: 49 |
ADDENDUM: Well, galileo just barfed again. At least it crashed and rebooted itself cleanly this time, and the culprit revealed itself: blown CPU. No big loss, as this system is edging towards retirement anyway, and can handle its current load with one less CPU (especially as these are ancient 400 MHz sparcs). Hmmmm, Attempted server suicide. Must be an old age thing with computers. ;) Another one to replace, Right after getting a backup for Thumper. Glad one less cpu didn't hurt though. :D Savoir-Faire is everywhere! The T1 Trust, T1 Class 4-4-4-4 #5550, America's First HST |
SolAirQuebec Joined: 20 Oct 06 Posts: 11 Credit: 1,483,464 RAC: 0 |
Hi Matt, I don't know if this can help, but I can lend you one of my 2 motherboards (quad Opteron Tyan K8WE (with SCSI) or quad Opteron Supermicro H8DA8, SATA & SCSI). If this can help, send me an email; after the situation comes back to "normal" you can send the board back to me. Have a nice day, Jean |
John Clark Joined: 29 Sep 99 Posts: 16515 Credit: 4,418,829 RAC: 0 |
Thanks for the updates Matt. Sorry about the additional server issues. Some day the sun will shine again! It's good to be back amongst friends and colleagues |
kevint Joined: 17 May 99 Posts: 414 Credit: 11,680,240 RAC: 0 |
Thanks for the updates Matt. That is the problem - the SUN has crashed - I wonder - how that donated server (that was not accepted) would have worked out in a situation like this. |
Roy Wall (shiny sides) Joined: 8 Nov 99 Posts: 5 Credit: 5,099,610 RAC: 0 |
ADDENDUM: Well, galileo just barfed again. At least it crashed and rebooted itself cleanly this time, and the culprit revealed itself: blown CPU. No big loss, as this system is edging towards retirement anyway, and can handle its current load with one less CPU (especially as these are ancient 400 MHz sparcs). Matt, what are the specs for a server? If you would post them we may be able to pull together one or two for backup. We do thank you for your work, hope you get some sleep soon. Roy |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13855 Credit: 208,696,464 RAC: 304 |
Last night galileo crashed. Nobody could see the scheduler all evening. I've been unable to contact the Scheduler since Thumper carked it, but no problems uploading results. 4/05/2007 11:56:30|SETI@home|Scheduler request failed: couldn't connect to server But it's not a big deal to me. *shrug* I'll just keep network access disabled till my cache runs out, return those results & then disable it again till work's available again. Grant Darwin NT |
Pappa Joined: 9 Jan 00 Posts: 2562 Credit: 12,301,681 RAC: 0 |
Kevin OOPS! I mistakenly acknowledged Seti USA in Down Time (May 01 2007) for their massive donation... I know that the Donated RAM (from Seti USA) to upgrade servers cost a fortune.... Then I see this CRAP! It diminishes the Major Hardware Donation that Happened (Thank You Seti USA)! You also bring up a Machine that would cost more to connect power to run it (it is only money and time they do not have)... Not to mention it can not connect and run Megabit Ethernet! What is the real "gain?" PLEASE LEAVE POLITICS OUT OF ALL THIS! If You want to Help, then there are ways to Help... This reply is NOT Helping! Actually, as I think about it, my reply to your reply is only showing that something is not right.... I Do appreciate Dr Dan's offer for the old multiproc... Dr Dan, Thank You! I am sorry that it did not work.... The work that everyone did on Bruno turned out better anyway... That was about 8 people's efforts, most did not want their names mentioned, they just did it... So what is the problem? I had asked you to email me so that I might be able to share what I know to present a more knowledgeable front that could move Seti Ahead... As I recall You refused... All I can say is one person pushing uphill is a very big battle, two people pushing makes it a bit easier... What happens if "WE" get 500 People Pushing in the same direction? From my last conversation with Eric, the next Donation Push is not properly defined or ready... When things are ready, I hope that the information reaches "everyone" that can help in whatever way they can help... Right Now Thumper is DEAD! Regards Pappa
Please consider a Donation to the Seti Project. |
Pappa Joined: 9 Jan 00 Posts: 2562 Credit: 12,301,681 RAC: 0 |
Roy No one has given a link to the Sun hardware that is Thumper... I will take time to give you the link so that everyone can see... Sun Fire X4500 Server Regards Pappa ADDENDUM: Well, galileo just barfed again. At least it crashed and rebooted itself cleanly this time, and the culprit revealed itself: blown CPU. No big loss, as this system is edging towards retirement anyway, and can handle its current load with one less CPU (especially as these are ancient 400 MHz sparcs). Please consider a Donation to the Seti Project. |
Jason Safoutin Joined: 8 Sep 05 Posts: 1386 Credit: 200,389 RAC: 0 |
How sad :-( I hope that the problem doesn't last too long. Thanks for the update and we all hope things improve. "By faith we understand that the universe was formed at God's command, so that what is seen was not made out of what was visible". Hebrews 11.3 |
BarryAZ Joined: 1 Apr 01 Posts: 2580 Credit: 16,982,517 RAC: 0 |
OK-- so you have been able to verify that there are problems with handling uploads which may result in 'overdue' reported workunits. So is it possible (as it has been in the past when project issues make reporting of work VERY problematic) to push out dates -- I know I have a bunch of May 5 deadlines for completed units that simply can't get uploaded. ADDENDUM: Well, galileo just barfed again. At least it crashed and rebooted itself cleanly this time, and the culprit revealed itself: blown CPU. No big loss, as this system is edging towards retirement anyway, and can handle its current load with one less CPU (especially as these are ancient 400 MHz sparcs). |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.