Raw Data (Dec 14 2010)

Message boards : Technical News : Raw Data (Dec 14 2010)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Send message
Joined: 1 Mar 99
Posts: 1391
Credit: 74,079
RAC: 10
United States
Message 1056130 - Posted: 14 Dec 2010, 23:19:15 UTC

So over the weekend we had a drive failure on our raw data storage server (where the data files first land after being shipped up from Arecibo). Normally a spare drive would have been pulled in automatically, but the system got into a state where the RAID was locked up, so the splitters in turn got locked up, and we ran out of workunits. The raw data storage server was wedged badly enough that the only solution was to hard power cycle the system.

Of course, the timing was impeccable. I was busy all weekend doing a bunch of time-pressure contract work (iPhone game stuff). Dude's gotta make a living. I did put an hour or so on both Saturday night and Sunday afternoon trying to diagnose/fix the problem remotely, but didn't have the time to come into the lab. The only other qualified people to deal with this situation (Jeff and Eric) were likewise unable to do much. So it all waited until Monday morning when I got in.

I rebooted the system and sure enough it came back up okay, but was automatically resyncing the RAID... using the failed drive! Actually it wasn't clear what it was doing, so I waited for the resync to finish (around 4pm yesterday, Pacific time) to see what it actually did. Yup - it pulled in the failed drive. I figured people were starved enough for work that I fired up the splitters anyway and we were filling the pipeline soon after that.

In fact, everything was working so smoothly that we ran out of raw data to process - or at least to make multibeam workunits (we still had data to make astropulse workunits). Fine. Jeff and I took this opportunity to force fail the questionable drive on that server, and a fresh spare was sync'ed up in only a couple hours. Now we're trying our best to get more raw data onto the system (and radar blanked) and then served out to the people.
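For readers curious what "force failing" a questionable drive looks like in practice: assuming the storage server uses Linux md software RAID (the post doesn't say which RAID stack is actually involved), the workflow is roughly the following. The array and device names here are made up for illustration.

```shell
# Inspect array state and watch any resync in progress
cat /proc/mdstat

# Force-fail the questionable drive, remove it from the array,
# then add a fresh spare; md rebuilds onto the spare automatically.
mdadm --manage /dev/md0 --fail /dev/sdc1
mdadm --manage /dev/md0 --remove /dev/sdc1
mdadm --manage /dev/md0 --add /dev/sdd1

# Detailed view of array health and rebuild progress
mdadm --detail /dev/md0
```

These commands require root and a real md array, so treat them as a sketch of the procedure rather than something to paste blindly.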

Meanwhile the new servers, and the other old ones, are chugging along nicely. The downtime yesterday afforded us the opportunity to get the weekly mysql maintenance/backup done early, and I also rigged up some tests on oscar/carolyn to see if I can indeed reset the stripe sizes of the large data partitions "live." The answer is: I *should* be able to, but there are several prohibitive snags, the worst of which is that live migration takes 15 minutes per gigabyte - which in our case means about 41 days. So we'll do more tests once we're fully loaded again to see exactly what stripe size we'd prefer on oscar. Then we'll move all the data off (probably temporarily to carolyn), re-RAID the thing, then move all the data back - should take less than a day (maybe next Tuesday outage?).
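As a back-of-the-envelope check on that 41-day estimate (the ~3.9 TB partition size below is inferred from the numbers in the post, not stated in it):

```python
def migration_days(size_gb, minutes_per_gb=15):
    """Estimated live-migration time at a fixed per-gigabyte rate."""
    return size_gb * minutes_per_gb / (60 * 24)

# 15 min/GB taking about 41 days implies roughly a 3.9 TB partition:
print(migration_days(3936))  # 41.0
```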

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Claggy
Project donor
Volunteer tester
Send message
Joined: 5 Jul 99
Posts: 4368
Credit: 36,952,911
RAC: 22,330
United Kingdom
Message 1056132 - Posted: 14 Dec 2010, 23:33:19 UTC - in response to Message 1056130.

Thanks for the update Matt,


Profile [seti.international] Dirk Sadowski
Project donor
Volunteer tester
Send message
Joined: 6 Apr 07
Posts: 7172
Credit: 61,922,641
RAC: 1,850
Message 1056134 - Posted: 14 Dec 2010, 23:38:52 UTC - in response to Message 1056130.

Matt, thanks for the news!

[SETI@home Needs your Help ... $10 and you get a Star!] [Team seti.international] [Das Deutsche Cafe. The German Cafe.]

Profile perryjay
Volunteer tester
Send message
Joined: 20 Aug 02
Posts: 3377
Credit: 17,130,795
RAC: 9,729
United States
Message 1056137 - Posted: 14 Dec 2010, 23:55:48 UTC - in response to Message 1056130.

Thanks for the update Matt. Here's hoping we get a few days uptime to fill our caches. I only got one work unit last night. Needless to say it didn't last me long. Maybe this time I'll get enough to last me for awhile. :-)


Send message
Joined: 2 Nov 01
Posts: 24
Credit: 2,013,138
RAC: 0
United States
Message 1056140 - Posted: 15 Dec 2010, 0:24:27 UTC - in response to Message 1056137.

Thanks for taking the time to keep us in the loop. :)

Send message
Joined: 15 Apr 00
Posts: 3
Credit: 959,601
RAC: 0
United States
Message 1056216 - Posted: 15 Dec 2010, 8:52:03 UTC - in response to Message 1056130.

Same here Matt... thanks for taking the time to update us.

It was pretty clear there was an issue of some kind during the period you mention, because I was seeing some strange communications errors from my seti apps.

When you all first announced the "weekly outage" idea, I revised all my preferences to try to keep 10 full days of work on hand for each machine. [could be as much as 17 days worth, depending on how the settings are interpreted]

My thinking was to make everything as tolerant as possible of outages, while maximizing use of the machines.

From my perspective, these hiccups have really borne out that thinking and provided a good test at the same time. There was no dormant time due to the failures or the scheduled outage.

What I want to ask is, does this methodology work for you guys, or will it result in undue stress somewhere else in the chain?

I'm sure I'm not the only one who has made such adaptations.

Thanks again.

Profile Igogo
Project donor
Volunteer tester
Send message
Joined: 18 Dec 04
Posts: 104
Credit: 41,946,982
RAC: 27,642
Message 1056217 - Posted: 15 Dec 2010, 9:04:15 UTC - in response to Message 1056216.

Thanks for informing us, Matt

Profile Chris S
Project donor
Volunteer tester
Send message
Joined: 19 Nov 00
Posts: 34017
Credit: 15,811,785
RAC: 13,343
United Kingdom
Message 1056218 - Posted: 15 Dec 2010, 9:17:18 UTC

Thanks for the update. So much for some people saying that we're not kept informed - because we are, and it's appreciated. I never did fully understand RAID and striping; it's a bit of a black art to me!
Damsel Rescuer, Uli Devotee, Julie Supporter, ES99 survivor,
Raccoon Friend, Anniet fan, PETA, IFAW, Humane Soc.

Copyright © 2015 University of California