Meh (Nov 09 2009)


Message boards : Technical News : Meh (Nov 09 2009)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1391
Credit: 74,079
RAC: 0
United States
Message 946245 - Posted: 10 Nov 2009, 0:24:48 UTC

Our master mysql database server (mork) crashed on Sunday. The first crash when we brought mork on line way back when was a "fluke" - the crash a few weeks ago was explainable (or so we thought) - but now we're in the realm of "grave concern" about this particular server. However, the result of each crash is just an annoying chunk of downtime - the actual data remain intact after recovery, and recovery goes along without too much ado. Maybe we have just been lucky so far. I could see a flat out crash being a bit more disastrous.

Eric did the remote work of initial and post-reboot cleanup; Dan actually came up to the lab to physically power cycle the machine, which Jeff walked him through over the phone. I assumed we'd all just wait until the next day when we're all back at the lab to set things right (after all, we've had longer unexpected outages before). When I returned from prior obligations to find the projects up I was pleased by the heroic effort. Still, I quickly noticed that the splitters were in a funny state which required my intervention or else we would have immediately run out of work to send out, so I fixed all that.

Anyway, we'll have to do some extra recovery tasks tomorrow during the regular outage. This will include putting a debug kernel on mork and some other crash-test stuff that will hopefully give us clues if mork decides to disappear again.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

DJStarfox
Joined: 23 May 01
Posts: 1045
Credit: 569,325
RAC: 97
United States
Message 946295 - Posted: 10 Nov 2009, 4:00:18 UTC - in response to Message 946245.

I hate flaky hardware; I can appreciate the effort involved.

If the debug kernel doesn't save the errors before crashing, you could always do that trick of redirecting the console & stderr to a serial port. (Have a laptop or computer record the serial data.)

Here's the quick-n-dirty HOW-TO link if you need it.
http://tldp.org/HOWTO/Remote-Serial-Console-HOWTO/configure-kernel-grub.html
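
For the recording side of that trick, something like the following might do on the laptop. It is only a minimal sketch, not anything the project actually runs: the function name record_console is made up for illustration, and it assumes the serial port has already been set to the same speed as the kernel console (with stty, for example) so that it can be read like an ordinary binary stream.

```python
import io
import time


def record_console(source: io.IOBase, log: io.TextIOBase, clock=time.time) -> int:
    """Copy console lines from `source` to `log`, one timestamped line each.

    `source` is any binary stream with readline(); on the recording machine
    it would be the already-configured serial device (an assumption here).
    Returns the number of lines recorded.
    """
    count = 0
    # readline() returns b"" at end of stream, which ends the loop.
    for raw in iter(source.readline, b""):
        # Crash output can contain junk bytes; replace them rather than fail.
        text = raw.decode("ascii", errors="replace").rstrip("\r\n")
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(clock()))
        log.write(f"{stamp} {text}\n")
        # Flush every line so nothing is lost if the recorder itself dies.
        log.flush()
        count += 1
    return count
```

On the laptop this could be called with the port opened as open("/dev/ttyS0", "rb", buffering=0) and a log file opened in append mode; the device path is a guess and will differ per machine. The UTC timestamps make it easier to line up the last console output with the time of the crash.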

Keith T.
Volunteer tester
Joined: 23 Aug 99
Posts: 740
Credit: 233,186
RAC: 6
United Kingdom
Message 946374 - Posted: 10 Nov 2009, 12:44:02 UTC

Is any of the hardware in your server closet of the vintage where it could be prone to the "Capacitor Plague" (http://en.wikipedia.org/wiki/Capacitor_plague)?

Keith T.
Volunteer tester
Joined: 23 Aug 99
Posts: 740
Credit: 233,186
RAC: 6
United Kingdom
Message 959975 - Posted: 1 Jan 2010, 23:49:09 UTC
Last modified: 2 Jan 2010, 0:02:09 UTC

Happy New Year to all the staff. Thanks for working on a holiday to get the project back on line.

I suspect it was Mork which crashed again today. Any news on the hardware side of things? Could a PSU or UPS be causing power spikes due to insufficient filtering, maybe some capacitors just on the limits of tolerance?

[edit]changed "suffering from spikes" to "causing spikes".[/edit]

Copyright © 2014 University of California