Adams (Mar 23 2009)

Message boards : Technical News : Adams (Mar 23 2009)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 878647 - Posted: 23 Mar 2009, 19:30:51 UTC

We had a crazy weekend in database-land. First and foremost, we had issues with one of the root drives on thumper (the primary science database server, among other things). We didn't completely lose the drive, but smartd had been complaining recently about bad sectors, and then the whole system crashed sometime early Thursday evening. I was able to get the machine back up and the RAID resyncing from home that night, but the timing was such that poor Jeff and Eric had to deal with the fallout the next day without me (I was in Carmel playing spy music at a corporate party - things like the theme from "Get Smart").

The drive arrangement on thumper is a little bizarre. There are 48 drives that sit in a 12x4 grid, with drive #0 in the lower left corner. However, due to the ordering of the six disk controllers on the system, the root drives (a mirrored pair) show up as /dev/sdy and /dev/sdac. This gave us a bit of a headache when we first installed Linux on this machine a while ago. The root mirror has a dedicated spare, which by some coincidence happens to appear as /dev/sda.
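
To put those device names in perspective, here is a minimal Python sketch that converts a Linux sd name into its zero-based enumeration position. It assumes the usual kernel lettering (sda through sdz, then sdaa, sdab, and so on). The function name sd_index is made up for illustration, and nothing here says which physical slot in the 12x4 grid a device occupies, since that depends on the controller ordering.

```python
import string


def sd_index(name: str) -> int:
    """Return the zero-based enumeration index of a Linux sd device name.

    'sda' -> 0, 'sdz' -> 25, 'sdaa' -> 26, ... (bijective base 26).
    Whole-disk names only; partition names such as 'sda1' are rejected.
    """
    base = name.rsplit("/", 1)[-1]  # accept '/dev/sdy' as well as 'sdy'
    suffix = base[2:]
    if not base.startswith("sd") or not suffix:
        raise ValueError(f"not an sd device name: {name!r}")
    if any(c not in string.ascii_lowercase for c in suffix):
        raise ValueError(f"not a whole-disk sd name: {name!r}")

    index = 0
    for c in suffix:
        # Each letter is a digit in 1..26, most significant first.
        index = index * 26 + (ord(c) - ord("a") + 1)
    return index - 1


for dev in ("/dev/sda", "/dev/sdy", "/dev/sdac"):
    print(dev, sd_index(dev))
# /dev/sda 0
# /dev/sdy 24
# /dev/sdac 28
```

In other words, the mirrored pair are the 25th and 29th disks in the kernel's lettering, while the spare is the first.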

Since we had never really exercised an actual root drive failure on thumper, Eric and Jeff spent Friday lost in a maze of conundrums. For example, given that grub only recognizes the first four drives in a system (/dev/sd[a-d]), how were things working all along? After some head scratching and drive swapping they got thumper back online. We still need to replace a drive or two, and those just arrived this morning. Another confusing game plan awaits us as we take what we learned and actually try to apply it. Short story: we need to make a three-way mirror of the root drives, after installing grub on the spare by booting from DVD, etc. Honestly, I still don't quite get it as I write this up, but I'm hoping I will after we go through the whole procedure.

And then yesterday jocelyn (the primary MySQL server) had some issues. Eric restarted it, and things cleared up in due time without much ado. To be safe we'll do some sweeping data integrity checks on all our databases, probably during the regular outage tomorrow.

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 878647
jjwhalen
Volunteer tester
Joined: 13 Jul 99
Posts: 8
Credit: 923,128
RAC: 0
United States
Message 878655 - Posted: 23 Mar 2009, 19:49:23 UTC - in response to Message 878647.  

Read & understood.

For planning purposes, will the validators remain off until (at least) after the weekly outage?
Best wishes:)
ID: 878655
Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 878672 - Posted: 23 Mar 2009, 20:28:02 UTC - in response to Message 878655.  

For planning purposes, will the validators remain off until (at least) after the weekly outage?


Nope. I just kicked them after posting that note. A minor library path issue kept them from restarting after the jocelyn crash.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 878672
DJStarfox
Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 878763 - Posted: 24 Mar 2009, 2:44:11 UTC - in response to Message 878647.  

Wait... I thought the point of a "hot spare" was so that 1) the mirror automatically uses the spare when one drive fails, and 2) you receive notice of the failed drive without a server going down. I've not worked with six disk controllers in one system before, but my first instinct (as to why the device assignments are weird) would be to check the PCI slot numbers and order of SATA ports on the controllers (and the breakout cable/backplane). I'm sure there's a logical order to it if you had all that information.
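
For what it's worth, whether a spare has actually taken over can be read from /proc/mdstat if the mirror is Linux software RAID (md). That is an assumption here, since the thread doesn't say what kind of RAID thumper uses. The following is a minimal Python sketch; the sample text is made up for illustration (it is not real output from thumper), and parse_mdstat is a hypothetical helper name.

```python
import re

ARRAY_RE = re.compile(r"^(md\w+)\s*:\s*(\w+)")
MEMBER_RE = re.compile(r"(\w+)\[(\d+)\](?:\((\w)\))?")
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")

# Illustrative sample only: one failed member, former spare rebuilding.
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sda1[2] sdac1[1] sdy1[0](F)
      104320 blocks [2/1] [_U]

unused devices: <none>
"""


def parse_mdstat(text: str) -> dict:
    """Summarize each md array: members, spares, failed devices, degraded flag."""
    arrays = {}
    current = None
    for line in text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = {"state": header.group(2), "members": [],
                       "spares": [], "failed": [], "degraded": False}
            for dev, _slot, flag in MEMBER_RE.findall(line):
                if flag == "F":      # (F) marks a faulty device
                    current["failed"].append(dev)
                elif flag == "S":    # (S) marks an idle spare
                    current["spares"].append(dev)
                else:
                    current["members"].append(dev)
            arrays[header.group(1)] = current
            continue
        status = STATUS_RE.search(line)
        if status and current is not None:
            wanted, working = int(status.group(1)), int(status.group(2))
            # [2/1] [_U] means two slots configured, only one in sync.
            current["degraded"] = working < wanted or "_" in status.group(3)
    return arrays


info = parse_mdstat(SAMPLE)["md0"]
print(info["failed"], info["degraded"])
# ['sdy1'] True
```

On a real system the text would come from reading /proc/mdstat rather than from SAMPLE.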

I assume the root drive is RAID 1 with two drives? By default, GRUB installs to the master boot record (MBR) of /dev/sda. Booting from a spare drive would only be a problem if your RAID does not duplicate the MBR to the other drives.
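
One way to check whether the MBR really is duplicated is to compare the first sector of each drive. This is a minimal Python sketch with made-up helper names, and it assumes the classic 512-byte MBR layout: bootstrap code at the start, and the 0x55 0xAA signature in the last two bytes. Only the first 440 bytes are compared, because on many systems bytes 440-445 hold a per-disk signature. A mismatch only shows that the sectors differ, not that either drive is unbootable.

```python
SECTOR_SIZE = 512
BOOT_CODE_SIZE = 440          # compare only the bootstrap code area
BOOT_SIGNATURE = b"\x55\xaa"  # bytes 510-511 of a valid MBR


def read_first_sector(path: str) -> bytes:
    """Read sector 0 from a device or image file (devices usually need root)."""
    with open(path, "rb") as f:
        sector = f.read(SECTOR_SIZE)
    if len(sector) != SECTOR_SIZE:
        raise ValueError(f"{path}: short read ({len(sector)} bytes)")
    return sector


def has_boot_code(sector: bytes) -> bool:
    """True if the signature is present and the code area is not all zeros."""
    return sector[510:512] == BOOT_SIGNATURE and any(sector[:BOOT_CODE_SIZE])


def same_boot_code(sector_a: bytes, sector_b: bytes) -> bool:
    """True if both sectors carry identical bootstrap code."""
    return sector_a[:BOOT_CODE_SIZE] == sector_b[:BOOT_CODE_SIZE]


def make_fake_sector(boot_code: bytes) -> bytes:
    """Build a 512-byte demo sector (for illustration, not bootable)."""
    return boot_code.ljust(SECTOR_SIZE - 2, b"\x00") + BOOT_SIGNATURE


# Demo with in-memory sectors instead of real drives.
primary = make_fake_sector(b"\xeb\x48 pretend boot loader")
spare = make_fake_sector(b"")  # partitioned, but no boot loader installed

print(has_boot_code(primary), has_boot_code(spare), same_boot_code(primary, spare))
# True False False
```

On a live system the sectors would come from something like read_first_sector("/dev/sda"), with the device paths taken from the actual mirror members.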

GRUB must have a much simpler view of the drive configuration for it to work at all. It probably only sees the number of logical drives (arrays) and uses BIOS order to assign letters a, b, c, etc. to them. I'm still shocked that this was even an issue, as I've rebuilt/recovered bootable arrays many times before without issue.

Anyway, thank you Matt, Jeff, and Eric for taking care of everything without fuss. The project is alive thanks to your busting butt every day.
ID: 878763

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.