Sitting Targets (Mar 22 2011)

Author	Message
Matt Lebofsky Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0	Message 1089429 - Posted: 22 Mar 2011, 22:41:34 UTC Last modified: 24 Mar 2011, 18:15:58 UTC
Our raw data storage server had a drive failure early over the weekend, which locked a bunch of stuff up including workunit production. Oh well. We were able to sort it out when we all got back in the lab on Monday, but it wasn't until late in the day that enough radar-clean data was created for the splitters to chew on and make more workunits.
At nearly the same time the above drive failed (during major thunderstorms here in Berkeley, which is probably just coincidental) the replica database on jocelyn crashed yet again. This system keeps losing the external storage (in the form of a Sun 3510) and mysql freaks out. We're not sure what the issue is but today we became fairly confident the problem is local to the 3510 (and not jocelyn itself). An amber light on the back of it means "RAID controller failure" which in this case means this box is pretty much useless. However, on a long shot Jeff suggested I reseat all the drives (most of which have been mounted in the system since we first got it roughly 8 years ago). I did, and the 3510 for the moment seemed willing to play nice. I started recovering the replica database one more time but the 3510 disappeared yet again. We're brainstorming where to move the database - it's not worth replacing that 3510, so we'd need other storage options... Or perhaps not have a replica but some other home-grown backup option.
Meanwhile we still have creepy rpc.idmapd problems. This daemon, only on a few select systems, keeps dying at random with an "I/O Possible" message. When it dies, some mounted file systems are suddenly full of files owned by "nobody." I have a workaround for the time being - a cron job that restarts rpc.idmapd every few minutes.
Had the usual Tuesday outage today. Spent that time messing with the above, dropping some unneeded science database indexes (maybe that'll speed things up as it'll free up buffer space?) and building a necessary index.
- Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude ID: 1089429 ·
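
For readers wondering what the rpc.idmapd workaround in the post above might look like, here is a minimal sketch. It is not Matt's actual script. It assumes a script that cron runs every few minutes and that restarts the daemon only when it is found dead, which may differ from what the post describes. The function names, the `pgrep` check, and the restart command are all assumptions chosen for illustration; the real restart command depends on the distribution and init system.

```python
import subprocess


def is_running(name):
    """Return True if a process with exactly this name exists (uses pgrep -x)."""
    result = subprocess.run(["pgrep", "-x", name], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def restart(command):
    """Run the restart command (an assumed, system-specific command list)."""
    return subprocess.run(command).returncode == 0


def ensure_running(name, restart_command, check=is_running, restarter=restart):
    """Return 'ok' if the daemon is up, 'restarted' if a restart succeeded,
    and 'failed' if the restart command reported an error."""
    if check(name):
        return "ok"
    return "restarted" if restarter(restart_command) else "failed"
```

A script run from cron could call `ensure_running("rpc.idmapd", ["service", "rpcidmapd", "restart"])`, where the service name is hypothetical. The `check` and `restarter` parameters exist only so the logic can be exercised without touching a real daemon.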

Claggy Volunteer tester Send message Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4	Message 1089432 - Posted: 22 Mar 2011, 22:44:37 UTC - in response to Message 1089429. Thanks for the update Matt, Claggy ID: 1089432 ·

Pascal Send message Joined: 22 Jan 00 Posts: 26 Credit: 3,624,307 RAC: 0	Message 1089443 - Posted: 22 Mar 2011, 23:15:41 UTC Last modified: 22 Mar 2011, 23:19:18 UTC Hi Matt, Thanks for the information. Question: Is that also the reason for the following? According to the webpages showing my running and finished WUs, my system should be crunching some WUs. WUs that were created and sent today, March 22 2011. But in reality there are no tasks running. Again, thanks for the info. grtz, Pascal ID: 1089443 ·

Black Squirrel Prime Send message Joined: 29 Jul 07 Posts: 8 Credit: 15,317,965 RAC: 0	Message 1089444 - Posted: 22 Mar 2011, 23:15:46 UTC Thanks Matt. Keep fighting the good fight!!!! ID: 1089444 ·

Claggy Volunteer tester Send message Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4	Message 1089445 - Posted: 22 Mar 2011, 23:24:41 UTC - in response to Message 1089443. Last modified: 23 Mar 2011, 0:22:41 UTC "Hi Matt, Thanks for the information. Question: Is that also the reason for the following? According to the webpages showing my running and finished WUs, my system should be crunching some WUs. WUs that were created and sent today, March 22 2011 15:09:13 UTC. But in reality there are no tasks running. Again, thanks for the info. grtz, Pascal" That was around the time when scheduler contact was flaky (to both the Main and Beta projects); expect them to be resent when you next contact the server. Edit: It looks like you generated a new host ID too. If the two hosts shown are the same, you might want to 'Merge duplicate records of this computer' on your host's details page, i.e. right at the bottom of this page: Computer 5823101. Claggy ID: 1089445 ·

Rainmaker* Send message Joined: 2 Mar 11 Posts: 3 Credit: 1,350,052 RAC: 0	Message 1089449 - Posted: 22 Mar 2011, 23:34:14 UTC Matt: Like most everyone else, thank you for the update. However, I'm not up on the lingo of the trade. My conception of networks pretty much ends after a P2P setup. I'll paraphrase: The RAID (familiar term) server that has all the raw data on it went kah-putz, and the replica of the RAID followed suit. The system that the RAID replica was attached to, a Sun 3510, keeps losing external storage - another RAID - but it seems to be an issue with the 3510, not the entire system itself, and we are in need of a different type of external storage system. I have no idea what rpc.idmapd is, but I think a daemon is a "handler" of sorts, and that "handler," when it dies, fills up other mounted systems with files of unknown ownership. A workaround you created restarts the daemon every so often to keep it running and not stuffing the other systems full of files. *end of paraphrasing* I understand what distributed computing is, but when I start seeing jocelyn and mysql, my brain turns into goo and I don't understand it any more. So I am guessing that it might be a while before things are back up and running for SETI@home. If I am incorrect in my last "guess," please let me know. Paul ID: 1089449 ·

Pascal Send message Joined: 22 Jan 00 Posts: 26 Credit: 3,624,307 RAC: 0	Message 1089450 - Posted: 22 Mar 2011, 23:35:01 UTC - in response to Message 1089445. Last modified: 22 Mar 2011, 23:54:10 UTC Hi Claggy, thnx! But.. I tried forcing it to contact the website. Work isn't downloaded.. You already noticed the IDs ;) I just noticed that the ID of the system may have changed?? Don't know why. I didn't make any changes to the system itself, though (no deinstall and reinstall of BOINC), and I can't, for some reason, merge the two on the website.. ?? grtz ID: 1089450 ·

Mithotar Send message Joined: 11 Apr 01 Posts: 88 Credit: 66,037,385 RAC: 50	Message 1089481 - Posted: 23 Mar 2011, 1:48:58 UTC - in response to Message 1089429. "Our raw data storage server had a drive failure early over the weekend" It would appear that perhaps the next target of a fund/donation drive has found us. ID: 1089481 ·

John William Gibson Send message Joined: 20 Sep 06 Posts: 4 Credit: 809,431 RAC: 1	Message 1089508 - Posted: 23 Mar 2011, 3:40:45 UTC Well, since I last looked it seems the only splitter went down, so no work units are being passed. One of my machines was able to get one, but my primary machine is sitting idle. Such is life. ID: 1089508 ·

Chris Hotte Send message Joined: 5 Aug 08 Posts: 2 Credit: 9,913,915 RAC: 2	Message 1089527 - Posted: 23 Mar 2011, 4:13:22 UTC - in response to Message 1089508. How many thousand hours on those drives? 8 years? Replace them, as the bearings are likely going.... We have some of the same going on in our shop, at 50K hours. Our XServe RAIDs have the drives sitting on edge, which is, IMHO, worse. Alternatively, we (you or I) could try re-racking our RAID modules the other way around to squeeze more life from the bearings. Back in the day the same principle worked for my 1541 floppy drive. ID: 1089527 ·

Michael Reaves Send message Joined: 4 Jun 99 Posts: 23 Credit: 22,801,220 RAC: 0	Message 1089664 - Posted: 23 Mar 2011, 16:34:55 UTC The drive failures are most probably caused by the Japan radiation cloud (That's what I'm telling my boss).... ID: 1089664 ·

KWSN THE Holy Hand Grenade! Volunteer tester Send message Joined: 20 Dec 05 Posts: 3187 Credit: 57,163,290 RAC: 0	Message 1089677 - Posted: 23 Mar 2011, 17:49:52 UTC Last modified: 23 Mar 2011, 17:51:04 UTC When you get production SETI sorted out, could someone start Beta? (...or, if it isn't affected by the production problems, just start Beta?...) . Hello, from Albany, CA!... ID: 1089677 ·

OzzFan Volunteer tester Send message Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28	Message 1089683 - Posted: 23 Mar 2011, 18:12:14 UTC - in response to Message 1089677. You have to ask in a language that Matt understands: if (seti) == sorted out then start (seti beta) elseif start (seti beta) :-D ID: 1089683 ·

Jeff Mercer Send message Joined: 14 Aug 08 Posts: 90 Credit: 162,139 RAC: 0	Message 1089693 - Posted: 23 Mar 2011, 18:28:42 UTC Hello... just wanted to say that everything is working for me again. Not sure if everyone else is having problems or not. I'm still able to send workunits in, and I'm getting workunits to complete. Just thought I'd let you all know that things are working... (For Me.) I have to shut my computer down for a while. There is a tornado watch in my area, and it's not looking good. Hope that everyone is keeping supplied with workunits, and to Matt and the crew.... Thanks for all the updates and your hard work. ID: 1089693 ·

KWSN THE Holy Hand Grenade! Volunteer tester Send message Joined: 20 Dec 05 Posts: 3187 Credit: 57,163,290 RAC: 0	Message 1089898 - Posted: 24 Mar 2011, 7:04:42 UTC Still no SETI beta, though... . Hello, from Albany, CA!... ID: 1089898 ·

Matt Lebofsky Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0	Message 1090041 - Posted: 24 Mar 2011, 18:00:06 UTC - in response to Message 1089898. "Still no SETI beta, though..." Eric's got the hood cracked open on that. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude ID: 1090041 ·

KWSN THE Holy Hand Grenade! Volunteer tester Send message Joined: 20 Dec 05 Posts: 3187 Credit: 57,163,290 RAC: 0	Message 1090051 - Posted: 24 Mar 2011, 19:11:55 UTC - in response to Message 1089683. Last modified: 24 Mar 2011, 19:12:43 UTC "You have to ask in a language that Matt understands: if (seti) == sorted out then start (seti beta) elseif start (seti beta) :-D" Close, but no cigar... What I meant was: If (SETI=sorted) then start beta else if not(beta_problems) then start beta. If (beta_problems) then (post message on beta site). (My logic is Fortran/Basic oriented...) . Hello, from Albany, CA!... ID: 1090051 ·
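
The Fortran/Basic-flavoured logic in the post above can be written out as a small runnable sketch. This is only one reading of the pseudocode; the function name, the boolean parameters, and the action strings are made up for illustration.

```python
def beta_actions(seti_sorted, beta_problems):
    """Return the list of actions implied by the pseudocode, in order."""
    actions = []
    # "If (SETI=sorted) then start beta else if not(beta_problems) then start beta"
    if seti_sorted or not beta_problems:
        actions.append("start beta")
    # "If (beta_problems) then (post message on beta site)"
    if beta_problems:
        actions.append("post message on beta site")
    return actions
```

Under this reading, Beta is started whenever production is sorted out or Beta has no problems of its own, and a notice is posted whenever Beta does have problems.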

OzzFan Volunteer tester Send message Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28	Message 1090054 - Posted: 24 Mar 2011, 19:23:56 UTC - in response to Message 1090051. Excellent! It even has logic to cover if there are Beta problems. :-) ID: 1090054 ·

.clair. Send message Joined: 4 Nov 04 Posts: 1300 Credit: 55,390,408 RAC: 69	Message 1090095 - Posted: 24 Mar 2011, 21:17:21 UTC My understanding of Algol was very Basic, So i had to GO TO job center and Find something i could Do ;) ID: 1090095 ·

David S Volunteer tester Send message Joined: 4 Oct 99 Posts: 18352 Credit: 27,761,924 RAC: 12	Message 1090285 - Posted: 25 Mar 2011, 14:28:08 UTC - in response to Message 1089429. Why is the upload server down? I didn't see that anywhere on here. My machine has been sitting on its last 6 WUs all week and started working on Einstein again. David David Sitting on my butt while others boldly go, Waiting for a message from a small furry creature from Alpha Centauri. ID: 1090285 ·

©2024 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.