Raw Data (Dec 14 2010)

Author	Message
Matt Lebofsky Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0	Message 1056130 - Posted: 14 Dec 2010, 23:19:15 UTC So over the weekend we had a drive failure on our raw data storage server (where the data files first land after being shipped up from Arecibo). Normally a spare drive should have been pulled in, but it got into a state where the RAID was locked up, so the splitters in turn got locked up, and we ran out of workunits. The state of the raw data storage server was such that the only solution was to hard power cycle the system. Of course, the timing was impeccable. I was busy all weekend doing a bunch of time-pressure contract work (iPhone game stuff). Dude's gotta make a living. I did put in an hour or so on both Saturday night and Sunday afternoon trying to diagnose/fix the problem remotely, but didn't have the time to come into the lab. The only other qualified people to deal with this situation (Jeff and Eric) were likewise unable to do much. So it all waited until Monday morning when I got in. I rebooted the system and sure enough it came back up okay, but was automatically resyncing the RAID... using the failed drive! Actually it wasn't clear what it was doing, so I waited for the resync to finish (around 4pm yesterday, Pacific time) to see what it actually did. Yup - it pulled in the failed drive. I figured people were starved enough for work that I fired up the splitters anyway and we were filling the pipeline soon after that. In fact, everything was working so smoothly that we ran out of raw data to process - or at least to make multibeam workunits (we still had data to make astropulse workunits). Fine. Jeff and I took this opportunity to force fail the questionable drive on that server, and a fresh spare was sync'ed up in only a couple hours.
Now we're trying our best to get more raw data onto the system (and radar blanked) and then served out to the people. Meanwhile the new servers, and the other old ones, are chugging along nicely. The downtime yesterday afforded us the opportunity to get the weekly mysql maintenance/backup over early, and I also rigged up some tests on oscar/carolyn to see if I can indeed reset the stripe sizes of the large data partitions "live." The answer is: I should be able to, but there are several impossible snags, the worst of which is that live migration takes 15 minutes per gigabyte - which means in our case, about 41 days. So we'll do more tests once we're fully loaded again to see exactly what stripe size we'd prefer on oscar. Then we'll move all the data off (probably temporarily to carolyn), re-RAID the thing, then move all the data back - should take less than a day (maybe next Tuesday outage?). - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
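
As a rough check on the 41-day figure above: at 15 minutes per gigabyte, 41 days corresponds to a little under 4 TB of data. The post does not state the partition size, so the sketch below back-computes it from the two quoted numbers; the function names are hypothetical and the constant per-gigabyte rate is an assumption.

```python
MINUTES_PER_DAY = 24 * 60


def migration_days(size_gb, minutes_per_gb=15.0):
    """Days needed to migrate size_gb at a constant per-gigabyte cost."""
    return size_gb * minutes_per_gb / MINUTES_PER_DAY


def implied_size_gb(days, minutes_per_gb=15.0):
    """Data size (GB) that would take `days` at the given per-gigabyte cost."""
    return days * MINUTES_PER_DAY / minutes_per_gb


if __name__ == "__main__":
    # Assumption: the quoted rate (15 min/GB) and duration (41 days) are exact.
    size = implied_size_gb(41)
    print(f"41 days at 15 min/GB implies about {size:.0f} GB")
    print(f"Migrating {size:.0f} GB would take {migration_days(size):.1f} days")
```

By the same arithmetic, an offline move-and-rebuild that finishes in under a day would need to average better than about 0.37 minutes per gigabyte for that much data.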

Claggy Volunteer tester Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4	Message 1056132 - Posted: 14 Dec 2010, 23:33:19 UTC - in response to Message 1056130. Thanks for the update Matt, Claggy

Sutaru Tsureku Volunteer tester Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5	Message 1056134 - Posted: 14 Dec 2010, 23:38:52 UTC - in response to Message 1056130. Matt, thanks for the news!

perryjay Volunteer tester Joined: 20 Aug 02 Posts: 3377 Credit: 20,676,751 RAC: 0	Message 1056137 - Posted: 14 Dec 2010, 23:55:48 UTC - in response to Message 1056130. Thanks for the update Matt. Here's hoping we get a few days uptime to fill our caches. I only got one work unit last night. Needless to say it didn't last me long. Maybe this time I'll get enough to last me for a while. :-) PROUD MEMBER OF Team Starfire World BOINC

Aker Joined: 2 Nov 01 Posts: 24 Credit: 2,030,727 RAC: 0	Message 1056140 - Posted: 15 Dec 2010, 0:24:27 UTC - in response to Message 1056137. Thanks for taking the time to keep us in the loop. :)

cer Joined: 15 Apr 00 Posts: 3 Credit: 959,601 RAC: 0	Message 1056216 - Posted: 15 Dec 2010, 8:52:03 UTC - in response to Message 1056130. Same here Matt... thanks for taking the time to update us. It was pretty clear there was an issue of some kind during the period you mention, because I was seeing some strange communications errors from my seti apps. When you all first announced the "weekly outage" idea, I revised all my preferences to try to keep 10 full days of work on hand for each machine. [could be as much as 17 days' worth, depending on how the settings are interpreted] My thinking was to make everything as tolerant as possible of outages, while maximizing use of the machines. From my perspective, these hiccups have really borne out the thinking and provided a good test at the same time. There was no dormant time due to the failures or the scheduled outage. What I want to ask is, does this methodology work for you guys, or will it result in undue stress somewhere else in the chain? I'm sure I'm not the only one who has made such adaptations. Thanks again.

Igogo Volunteer tester Joined: 18 Dec 04 Posts: 125 Credit: 65,303,299 RAC: 44	Message 1056217 - Posted: 15 Dec 2010, 9:04:15 UTC - in response to Message 1056216. Thanks for informing us, Matt
