Adams (Mar 23 2009)
Author | Message |
---|---|
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
We had a crazy weekend in database-land. First and foremost, we had issues with one of the root drives on thumper (the primary science database server, among other things). We didn't completely lose the drive, but smartd has been issuing complaints recently about bad sectors, and then the whole system crashed Thursday sometime in the early evening. While I was able to get the machine back up and the RAID resyncing from home that night, the timing was such that poor Jeff and Eric had to deal with the fallout the next day without me (I was in Carmel playing spy music at a corporate party - things like the theme from "Get Smart"). The drive arrangement on thumper is a little bizarre. There are 48 drives that sit in a 12x4 grid, with drive #0 in the lower left corner. However, due to the ordering of the six disk controllers on the system, the root drives (a mirrored pair) show up as /dev/sdy and /dev/sdac. This gave us a bit of a headache when we first installed Linux on this machine a while ago. The root mirror has a dedicated spare, which by some coincidence happens to appear as /dev/sda. Since we had never really exercised an actual root drive failure on thumper, Eric and Jeff spent Friday lost in a maze of conundrums. For example, given that grub only recognizes the first four drives in a system (/dev/sd[a-d]), how were things working all along? After some head scratching and drive swapping they got thumper back online. We still need to replace a drive or two, and those just arrived this morning. Another confusing game plan awaits us as we take what we learned and actually try to apply it. Short story: we need to make a three-way mirror of the root drives, after installing grub on the spare by booting from DVD, etc. Honestly, I still don't quite get it as I write this up, but I'm hoping I will after we go through the whole procedure. And then yesterday jocelyn (the primary mysql server) had some issues. 
Eric restarted it, and things seemed to clear up without much ado in due time. To be safe we'll do some sweeping data integrity checks on all our databases, probably during the regular outage tomorrow. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
jjwhalen Joined: 13 Jul 99 Posts: 8 Credit: 923,128 RAC: 0 |
Read & understood. For planning purposes, will the validators remain off until (at least) after the weekly outage? Best wishes:) |
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
"For planning purposes, will the validators remain off until (at least) after the weekly outage?" Nope. I just kicked them after posting that note. A minor library path issue kept them from restarting after the jocelyn crash. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
DJStarfox Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2 |
Wait... I thought the point of a "hot spare" was so that 1) the mirror automatically uses the spare when one drive fails, and 2) you receive notice of the failed drive without a server going down. I've not worked with six disk controllers in one system before, but my first instinct (as to why the device assignments are weird) would be to check the PCI slot numbers and the order of SATA ports on the controllers (and the breakout cable/backplane). I'm sure there's a logical order to it if you had all that information. I assume the root drive is RAID 1 with two drives? By default, GRUB installs to the master boot record (MBR) of /dev/sda. Booting from a spare drive would only be a problem if your RAID does not duplicate the MBR to the other drives. GRUB must have a much simpler view of the drive configuration for it to work at all. It probably only sees the number of logical drives (arrays) and uses BIOS order to assign letters a, b, c, etc. to them. I'm still shocked that this was even an issue, as I've rebuilt/recovered bootable arrays many times before without issue. Anyway, thank you Matt, Jeff, and Eric for taking care of everything without fuss. The project is alive thanks to your busting butt every day. |
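
For readers puzzling over the device names in this thread: Linux hands out sd names in the order the kernel discovers disks, counting a..z, then aa, ab, and so on. The sketch below is only an illustration of that counting scheme, not anything the thumper admins ran; the helper names `sd_index` and `sd_name` are made up for this example. It shows where /dev/sdy and /dev/sdac fall in the enumeration, but it says nothing about which physical slot in the 12x4 grid a given name corresponds to, since that depends on controller ordering.

```python
import string

LETTERS = string.ascii_lowercase  # 'a'..'z'


def sd_index(name: str) -> int:
    """Return the 0-based enumeration position of a name like 'sdy' or '/dev/sdac'.

    Hypothetical helper for illustration; it only decodes the letter suffix.
    """
    base = name.rsplit("/", 1)[-1]          # drop a leading '/dev/' if present
    suffix = base[2:]
    if not base.startswith("sd") or not suffix or any(c not in LETTERS for c in suffix):
        raise ValueError(f"not an sd-style disk name: {name!r}")
    n = 0
    for ch in suffix:                        # bijective base-26: a=1 .. z=26
        n = n * 26 + (LETTERS.index(ch) + 1)
    return n - 1                             # shift so that sda is position 0


def sd_name(index: int) -> str:
    """Inverse of sd_index: 0 -> 'sda', 25 -> 'sdz', 26 -> 'sdaa'."""
    if index < 0:
        raise ValueError("index must be non-negative")
    n = index + 1
    letters = []
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters.append(LETTERS[rem])
    return "sd" + "".join(reversed(letters))


if __name__ == "__main__":
    # The three names mentioned in the thread.
    for dev in ("/dev/sda", "/dev/sdy", "/dev/sdac"):
        print(dev, "is disk number", sd_index(dev), "in enumeration order")
    # With 48 disks, the last name handed out would be:
    print("48th disk:", sd_name(47))
```

By this counting, the mirrored root pair sits at positions 24 and 28 in discovery order while the spare is at position 0, which is at least consistent with Matt's remark that the spare landing on /dev/sda was a coincidence of controller ordering.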
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.