Meh (Nov 09 2009)



Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1441
Credit: 213,689
RAC: 0
United States
Message 946245 - Posted: 10 Nov 2009, 0:24:48 UTC

Our master mysql database server (mork) crashed on Sunday. The first crash, when we brought mork online way back when, was a "fluke" - the crash a few weeks ago was explainable (or so we thought) - but now we're in the realm of "grave concern" about this particular server. However, the result of each crash is just an annoying chunk of downtime - the actual data remain intact after recovery, and recovery goes along without too much ado. Maybe we have just been lucky so far. I could see a flat-out crash being a bit more disastrous.

Eric did the remote work of initial and post-reboot cleanup, and Dan actually came up to the lab to physically power cycle the machine, with Jeff walking him through it over the phone. I assumed we'd all just wait until the next day, when we'd all be back at the lab, to set things right (after all, we've had longer unexpected outages before). When I returned from prior obligations to find the projects up, I was pleased by the heroic effort. Still, I quickly noticed that the splitters were in a funny state that required my intervention, or else we would have immediately run out of work to send out, so I fixed all that.

Anyway, we'll have to do some extra recovery tasks tomorrow during the regular outage. This will include putting a debug kernel on mork and some other crash-test stuff that will hopefully give us clues if mork decides to disappear again.

- Matt


-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

ID: 946245
DJStarfox

Joined: 23 May 01
Posts: 1057
Credit: 802,388
RAC: 176
United States
Message 946295 - Posted: 10 Nov 2009, 4:00:18 UTC - in response to Message 946245.

I hate flaky hardware; I can appreciate the effort involved.

If the debug kernel doesn't save the errors before crashing, you could always do that trick of redirecting the console & stderr to a serial port. (Have a laptop or computer record the serial data.)

Here's the quick-n-dirty HOW-TO link if you need it.
http://tldp.org/HOWTO/Remote-Serial-Console-HOWTO/configure-kernel-grub.html
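For the recording end of that trick (the laptop capturing the serial data), a minimal Python sketch follows. It is only an illustration, not anything the project is known to run: the function name log_console_lines, the device path and the log file name are all made up for the example, and it assumes the serial port has already been set to the same speed as the server's console (for instance with stty) before the script starts.

```python
import sys
import time


def log_console_lines(source, sink, now=time.time):
    """Copy lines from a binary source to a text sink, one timestamp per line.

    `source` is any binary file-like object with readline(); in real use
    that would be the opened serial device (an assumption of this sketch).
    Returns the number of lines written.
    """
    count = 0
    # readline() returns b"" only at end of stream, which ends the loop.
    for raw in iter(source.readline, b""):
        # Crash output can contain junk bytes, so never fail on decoding.
        text = raw.decode("utf-8", errors="replace").rstrip("\r\n")
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(now()))
        sink.write(f"{stamp} UTC {text}\n")
        # Flush every line so the last messages before a hang reach the disk.
        sink.flush()
        count += 1
    return count


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python serial_log.py /dev/ttyUSB0 mork-console.log
    device, logfile = sys.argv[1], sys.argv[2]
    with open(device, "rb", buffering=0) as port, \
            open(logfile, "a", encoding="utf-8") as log:
        log_console_lines(port, log)
```

Flushing after every line is deliberate: the messages printed just before the machine dies are the ones that matter, and the timestamps make it easier to line them up with the time of the outage.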

ID: 946295
Keith T.
Volunteer tester
Joined: 23 Aug 99
Posts: 743
Credit: 244,276
RAC: 0
United Kingdom
Message 946374 - Posted: 10 Nov 2009, 12:44:02 UTC

Is any of the hardware in your server closet of the vintage where it could be prone to the "Capacitor Plague" (http://en.wikipedia.org/wiki/Capacitor_plague)?

ID: 946374
Keith T.
Volunteer tester
Joined: 23 Aug 99
Posts: 743
Credit: 244,276
RAC: 0
United Kingdom
Message 959975 - Posted: 1 Jan 2010, 23:49:09 UTC
Last modified: 2 Jan 2010, 0:02:09 UTC

Happy New Year to all the staff. Thanks for working on a holiday to get the project back on line.

I suspect it was Mork which crashed again today. Any news on the hardware side of things? Could a PSU or UPS be causing power spikes due to insufficient filtering, maybe some capacitors just on the limits of tolerance?

[edit]changed "suffering from spikes" to "causing spikes".[/edit]

ID: 959975



 
©2016 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.