Raw Data (Dec 14 2010)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 1056130 - Posted: 14 Dec 2010, 23:19:15 UTC

So over the weekend we had a drive failure on our raw data storage server (where the data files first land after being shipped up from Arecibo). Normally a spare drive should have been pulled in, but it got into a state where the RAID was locked up, so the splitters in turn got locked up, and we ran out of workunits. The state of the raw data storage server was such that the only solution was to hard power cycle the system.

Of course, the timing was impeccable. I was busy all weekend doing a bunch of time-pressure contract work (iPhone game stuff). Dude's gotta make a living. I did put in an hour or so on both Saturday night and Sunday afternoon trying to diagnose/fix the problem remotely, but didn't have the time to come into the lab. The only other people qualified to deal with this situation (Jeff and Eric) were likewise unable to do much. So it all waited until Monday morning when I got in.

I rebooted the system and sure enough it came back up okay, but was automatically resyncing the RAID... using the failed drive! Actually it wasn't clear what it was doing, so I waited for the resync to finish (around 4pm yesterday, Pacific time) to see what it actually did. Yup - it pulled in the failed drive. I figured people were starved enough for work that I fired up the splitters anyway and we were filling the pipeline soon after that.

In fact, everything was working so smoothly that we ran out of raw data to process - or at least to make multibeam workunits (we still had data to make astropulse workunits). Fine. Jeff and I took this opportunity to force fail the questionable drive on that server, and a fresh spare was sync'ed up in only a couple hours. Now we're trying our best to get more raw data onto the system (and radar blanked) and then served out to the people.

Meanwhile the new servers, and the other old ones, are chugging along nicely. The downtime yesterday afforded us the opportunity to get the weekly mysql maintenance/backup over early, and I also rigged up some tests on oscar/carolyn to see if I can indeed reset the stripe sizes of the large data partitions "live." The answer is: I *should* be able to, but there are several impossible snags, the worst of which is that live migration takes 15 minutes per gigabyte - which means in our case, about 41 days. So we'll do more tests once we're fully loaded again to see exactly what stripe size we'd prefer on oscar. Then we'll move all the data off (probably temporarily to carolyn), re-RAID the thing, then move all the data back - should take less than a day (maybe next Tuesday outage?).
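
[Editor's note: the "15 minutes per gigabyte, about 41 days" figure can be checked with a few lines of arithmetic. The sketch below is only an illustration; the function names are made up here, and the partition size is not stated in the post, so it is back-calculated from the two numbers quoted above rather than taken from any real measurement.]

```python
# Back-of-the-envelope check of the live-migration estimate.
# Only the rate (15 min/GB) and the total (about 41 days) come from the post;
# the partition size is inferred from them, not a figure from the post.

MINUTES_PER_GB = 15          # rate quoted in the post
MINUTES_PER_DAY = 24 * 60


def migration_days(size_gb, minutes_per_gb=MINUTES_PER_GB):
    """Days needed to migrate size_gb gigabytes at the given rate."""
    return size_gb * minutes_per_gb / MINUTES_PER_DAY


def implied_size_gb(days, minutes_per_gb=MINUTES_PER_GB):
    """Partition size (in GB) that would take `days` to migrate."""
    return days * MINUTES_PER_DAY / minutes_per_gb


if __name__ == "__main__":
    size = implied_size_gb(41)
    print(f"41 days at {MINUTES_PER_GB} min/GB implies about {size:.0f} GB")
    print(f"Migrating {size:.0f} GB would take {migration_days(size):.1f} days")
```

In other words, the two numbers are consistent with a data partition of roughly 4 TB, which is why copying everything off, re-RAIDing, and copying it back looks so much more attractive than a live change.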

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 1056130
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1056132 - Posted: 14 Dec 2010, 23:33:19 UTC - in response to Message 1056130.  

Thanks for the update Matt,

Claggy
ID: 1056132
Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1056134 - Posted: 14 Dec 2010, 23:38:52 UTC - in response to Message 1056130.  

Matt, thanks for the news!

ID: 1056134
perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 1056137 - Posted: 14 Dec 2010, 23:55:48 UTC - in response to Message 1056130.  

Thanks for the update, Matt. Here's hoping we get a few days of uptime to fill our caches. I only got one work unit last night. Needless to say, it didn't last me long. Maybe this time I'll get enough to last me for a while. :-)


PROUD MEMBER OF Team Starfire World BOINC
ID: 1056137
Aker

Joined: 2 Nov 01
Posts: 24
Credit: 2,030,727
RAC: 0
United States
Message 1056140 - Posted: 15 Dec 2010, 0:24:27 UTC - in response to Message 1056137.  

Thanks for taking the time to keep us in the loop. :)
ID: 1056140
cer
Joined: 15 Apr 00
Posts: 3
Credit: 959,601
RAC: 0
United States
Message 1056216 - Posted: 15 Dec 2010, 8:52:03 UTC - in response to Message 1056130.  

Same here Matt... thanks for taking the time to update us.

It was pretty clear there was an issue of some kind during the period you mention, because I was seeing some strange communications errors from my seti apps.

When you all first announced the "weekly outage" idea, I revised all my preferences to try to keep 10 full days of work on hand for each machine. [could be as much as 17 days' worth, depending on how the settings are interpreted]

My thinking was to make everything as tolerant as possible of outages, while maximizing use of the machines.

From my perspective, these hiccups have really borne out the thinking and provided a good test at the same time. There was no dormant time due to the failures or the scheduled outage.

What I want to ask is, does this methodology work for you guys, or will it result in undue stress somewhere else in the chain?

I'm sure I'm not the only one who has made such adaptations.

Thanks again.
ID: 1056216
Igogo Project Donor
Volunteer tester
Joined: 18 Dec 04
Posts: 125
Credit: 65,303,299
RAC: 44
Thailand
Message 1056217 - Posted: 15 Dec 2010, 9:04:15 UTC - in response to Message 1056216.  

Thanks for informing us, Matt
ID: 1056217

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.