Raw Data (Dec 14 2010)


Message boards : Technical News : Raw Data (Dec 14 2010)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 1056130 - Posted: 14 Dec 2010, 23:19:15 UTC

So over the weekend we had a drive failure on our raw data storage server (where the data files first land after being shipped up from Arecibo). Normally a spare drive would have been pulled in automatically, but the array got into a state where the RAID was locked up, so the splitters in turn locked up, and we ran out of workunits. The state of the raw data storage server was such that the only solution was to hard power cycle the system.
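(For the curious: the kind of thing that catches this before the splitters starve is a watchdog roughly along these lines. This is just a simplified sketch, not our actual monitoring - it assumes Linux md software RAID and a made-up mount point - but it shows the two symptoms that matter: a degraded array in /proc/mdstat and a mount that hangs on reads.)

#!/usr/bin/env python
# Simplified sketch, not actual production monitoring. Assumes Linux md
# software RAID and a hypothetical mount point for the raw data filesystem.
import os
import signal

RAW_DATA_MOUNT = "/raw_data"   # hypothetical path, for illustration only
READ_TIMEOUT = 30              # seconds before we call the mount "hung"

def degraded_arrays():
    """Return md devices whose member-status string (e.g. [UU_U]) shows a hole."""
    bad = []
    with open("/proc/mdstat") as f:
        lines = f.read().splitlines()
    for i, line in enumerate(lines):
        if line.startswith("md"):
            device = line.split()[0]
            # the [UU_U]-style status is on the line after the "mdN :" line
            status = lines[i + 1] if i + 1 < len(lines) else ""
            if "_" in status[status.rfind("["):]:
                bad.append(device)
    return bad

def mount_is_responsive(path, timeout=READ_TIMEOUT):
    """Try to list the mount; a locked-up RAID usually makes this block forever."""
    def _alarm(signum, frame):
        raise OSError("read timed out")
    old = signal.signal(signal.SIGALRM, _alarm)
    signal.alarm(timeout)
    try:
        os.listdir(path)
        return True
    except OSError:
        return False
    finally:
        signal.alarm(0)
        signal.signal(signal.SIGALRM, old)

if __name__ == "__main__":
    bad = degraded_arrays()
    if bad:
        print("WARNING: degraded array(s): " + ", ".join(bad))
    if not mount_is_responsive(RAW_DATA_MOUNT):
        print("WARNING: " + RAW_DATA_MOUNT + " is not responding")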

Of course, the timing was impeccable. I was busy all weekend doing a bunch of time-pressure contract work (iPhone game stuff). Dude's gotta make a living. I did put in an hour or so on both Saturday night and Sunday afternoon trying to diagnose/fix the problem remotely, but didn't have the time to come into the lab. The only other people qualified to deal with this situation (Jeff and Eric) were likewise unable to do much. So it all waited until Monday morning when I got in.

I rebooted the system and sure enough it came back up okay, but it was automatically resyncing the RAID... using the failed drive! Actually it wasn't clear what it was doing, so I waited for the resync to finish (around 4pm yesterday, Pacific time) to see what it actually did. Yup - it pulled in the failed drive. I figured people were starved enough for work, so I fired up the splitters anyway, and we were filling the pipeline soon after that.

In fact, everything was working so smoothly that we ran out of raw data to process - or at least raw data to make multibeam workunits (we still had data to make astropulse workunits). Fine. Jeff and I took this opportunity to force-fail the questionable drive on that server, and a fresh spare was sync'ed up in only a couple of hours. Now we're trying our best to get more raw data onto the system (and radar blanked) and then served out to the people.
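(Side note for the RAID-curious: on Linux software RAID you can watch that kind of rebuild directly from /proc/mdstat. This little sketch - again just an illustration, and an assumption about the underlying RAID flavor - pulls out the percent done, estimated finish time, and speed for any array that's resyncing or recovering.)

#!/usr/bin/env python
# Illustration only; assumes Linux md software RAID (/proc/mdstat).
# Prints rebuild/resync progress until no rebuild is left running.
import re
import time

# /proc/mdstat progress lines look like:
#   [=>...........]  recovery = 12.6% (37043392/292945152) finish=127.5min speed=33440K/sec
PROGRESS = re.compile(
    r"(recovery|resync)\s*=\s*([\d.]+)%.*finish=([\d.]+)min.*speed=(\d+)K/sec")

def rebuilds():
    """Yield (kind, percent_done, minutes_left, kb_per_sec) for each active rebuild."""
    with open("/proc/mdstat") as f:
        for line in f:
            m = PROGRESS.search(line)
            if m:
                yield m.group(1), float(m.group(2)), float(m.group(3)), int(m.group(4))

if __name__ == "__main__":
    while True:
        active = list(rebuilds())
        if not active:
            print("no rebuild in progress")
            break
        for kind, pct, mins, speed in active:
            print("%s %.1f%% done, ~%.0f min left (%d KB/s)" % (kind, pct, mins, speed))
        time.sleep(60)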

Meanwhile the new servers, and the other old ones, are chugging along nicely. The downtime yesterday afforded us the opportunity to get the weekly mysql maintenance/backup out of the way early, and I also rigged up some tests on oscar/carolyn to see if I can indeed reset the stripe sizes of the large data partitions "live." The answer is: I *should* be able to, but there are several snags that make it impractical, the worst of which is that live migration takes 15 minutes per gigabyte - which in our case means about 41 days. So we'll do more tests once we're fully loaded again to see exactly what stripe size we'd prefer on oscar. Then we'll move all the data off (probably temporarily to carolyn), re-RAID the thing, and move all the data back - that should take less than a day (maybe during next Tuesday's outage?).
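(For anyone checking the arithmetic, those two quoted figures pin down the rest. The variable names and the assumption that the whole copy-off/re-RAID/copy-back shuffle fits in a single 24-hour window are mine for illustration; the 15 min/GB and 41 days are the numbers above.)

# Back-of-the-envelope check of the numbers quoted above.
MIN_PER_GB = 15.0        # live stripe-size migration rate, minutes per GB
MIGRATION_DAYS = 41.0    # estimated total time for a live migration

data_gb = MIGRATION_DAYS * 24 * 60 / MIN_PER_GB
print("implied data on the partition: ~%.0f GB (~%.1f TB)" % (data_gb, data_gb / 1024.0))
# -> ~3936 GB, a bit under 4 TB

# If the copy-off / re-RAID / copy-back shuffle has to fit in one day
# (an assumption on my part), the two copies together need to sustain roughly:
copy_window_s = 24 * 3600.0
rate_mb_s = 2 * data_gb * 1024.0 / copy_window_s
print("required sustained copy rate: ~%.0f MB/s total" % rate_mb_s)
# -> roughly 93 MB/s across both directions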

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Claggy
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4072
Credit: 32,909,918
RAC: 7,870
United Kingdom
Message 1056132 - Posted: 14 Dec 2010, 23:33:19 UTC - in response to Message 1056130.

Thanks for the update Matt,

Claggy

Profile [seti.international] Dirk Sadowski
Project donor
Volunteer tester
Joined: 6 Apr 07
Posts: 7062
Credit: 60,021,139
RAC: 21,405
Germany
Message 1056134 - Posted: 14 Dec 2010, 23:38:52 UTC - in response to Message 1056130.

Matt, thanks for the news!

____________
BR



>Das Deutsche Cafe. The German Cafe.<

Profile perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 15,333,340
RAC: 11,560
United States
Message 1056137 - Posted: 14 Dec 2010, 23:55:48 UTC - in response to Message 1056130.

Thanks for the update Matt. Here's hoping we get a few days' uptime to fill our caches. I only got one work unit last night, and needless to say it didn't last me long. Maybe this time I'll get enough to last me for a while. :-)
____________


PROUD MEMBER OF Team Starfire World BOINC

Aker
Joined: 2 Nov 01
Posts: 24
Credit: 2,013,138
RAC: 0
United States
Message 1056140 - Posted: 15 Dec 2010, 0:24:27 UTC - in response to Message 1056137.

Thanks for taking the time to keep us in the loop. :)
____________

cer
Joined: 15 Apr 00
Posts: 3
Credit: 959,601
RAC: 0
United States
Message 1056216 - Posted: 15 Dec 2010, 8:52:03 UTC - in response to Message 1056130.

Same here Matt... thanks for taking the time to update us.

It was pretty clear there was an issue of some kind during the period you mention, because I was seeing some strange communications errors from my seti apps.

When you all first announced the "weekly outage" idea, I revised all my preferences to try to keep 10 full days of work on hand for each machine. [could be as much as 17 days' worth, depending on how the settings are interpreted]

My thinking was to make everything as tolerant as possible of outages, while maximizing use of the machines.

From my perspective, these hiccups have really borne out that thinking and provided a good test at the same time. There was no dormant time due to the failures or the scheduled outage.

What I want to ask is, does this methodology work for you guys, or will it result in undue stress somewhere else in the chain?

I'm sure I'm not the only one who has made such adaptations.

Thanks again.

Profile Igogo
Project donor
Volunteer tester
Joined: 18 Dec 04
Posts: 100
Credit: 37,221,260
RAC: 29,860
Ukraine
Message 1056217 - Posted: 15 Dec 2010, 9:04:15 UTC - in response to Message 1056216.

Thanks for informing us, Matt.

Profile Chris S
Project donor
Volunteer tester
Joined: 19 Nov 00
Posts: 31456
Credit: 12,200,590
RAC: 27,954
United Kingdom
Message 1056218 - Posted: 15 Dec 2010, 9:17:18 UTC

Thanks for the update. So much for some people saying that we're not kept informed, because we are, and it's appreciated. I never did fully understand RAID and striping; it's a bit of a black art to me!
____________
Damsel Rescuer, Kitty Patron, Uli Devotee, Julie Supporter
ES99 Admirer, Raccoon Friend, Anniet fan, RJ45 rulz OK!

