Sitting Targets (Mar 22 2011)
Author | Message |
---|---|
Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
Our raw data storage server had a drive failure early over the weekend, which locked a bunch of stuff up including workunit production. Oh well. We were able to sort it out when we all got back in the lab on Monday, but it wasn't until late in the day that enough radar-clean data was created for the splitters to chew on and make more workunits. At nearly the same time the above drive failed (during major thunderstorms here in Berkeley, which is probably just coincidental) the replica database on jocelyn crashed yet again. This system keeps losing the external storage (in the form of a Sun 3510) and mysql freaks out. We're not sure what the issue is but today we became fairly confident the problem is local to the 3510 (and not jocelyn itself). An amber light on the back of it means "RAID controller failure" which in this case means this box is pretty much useless. However, on a long shot Jeff suggested I reseat all the drives (most of which have been mounted in the system since we first got it roughly 8 years ago). I did, and the 3510 for the moment seemed willing to play nice. I started recovering the replica database one more time but the 3510 disappeared yet again. We're brainstorming where to move the database - it's not worth replacing that 3510, so we'd need other storage options... Or perhaps not have a replica but some other home-grown backup option. Meanwhile we still have creepy rpc.idmapd problems. This daemon, only on a few select systems, keeps dying at random with an "I/O Possible" message. When it dies, some mounted file systems are suddenly full of files owned by "nobody." I have a workaround for the time being - a cron job that restarts rpc.idmapd every few minutes. Had the usual Tuesday outage today. Spent that time messing with the above, dropping some unneeded science database indexes (maybe that'll speed things up as it'll free up buffer space?) and building a necessary index. 
- Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
Thanks for the update Matt, Claggy |
Joined: 22 Jan 00 Posts: 26 Credit: 3,624,307 RAC: 0 |
Hi Matt, Thanks for the information. Question: is that also the reason for the following? According to the web pages showing my running and finished WUs, my system should be crunching some WUs that were created and sent today, March 22 2011. But in reality there are no tasks running. Again, thanks for the info. grtz, Pascal |
Joined: 29 Jul 07 Posts: 8 Credit: 15,317,965 RAC: 0 |
Thanks Matt. Keep fighting the good fight!!!! |
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
Hi Matt, That was around the time when scheduler contact was flaky (to both the Main and Beta projects); expect them to get resent when you next contact the server. Edit: Looks like you generated a new host ID too. If the two hosts shown are the same, you might want to 'Merge duplicate records of this computer' on your host's details page, i.e. right at the bottom of this page: Computer 5823101 Claggy |
Joined: 2 Mar 11 Posts: 3 Credit: 1,350,052 RAC: 0 |
Matt, like most everyone else, thank you for the update. However, I'm not up on the lingo of the trade. My conception of networks pretty much ends after a P2P setup. I'll paraphrase: The RAID (familiar term) server that has all the raw data on it went kah-putz, and the replica of the RAID followed suit. The system that the RAID replica was attached to, a Sun 3510, keeps losing external storage - another RAID - but it seems to be an issue with the 3510, not the entire system itself, and we are in need of a different type of external storage system. I have no idea what rpc.idmapd is, but I think a daemon is a "handler" of sorts, and that "handler," when it dies, fills up other mounted systems with files of unknown ownership. A workaround you created restarts the daemon every so often to keep it running and not stuffing the other systems full of files. ***end of paraphrasing*** I understand what distributed computing is, but when I start seeing jocelyn and mysql, my brain turns into goo and I don't understand it any more. So I am guessing that it might be a while before things are back up and running for SETI@home. If I am incorrect in my last "guess," please let me know. Paul |
Joined: 22 Jan 00 Posts: 26 Credit: 3,624,307 RAC: 0 |
Hi Claggy, thanks! But... I tried forcing it to contact the website. Work isn't downloaded. You already noticed the IDs ;) I just noticed that the ID of the system may have changed? Don't know why. I didn't make any changes to the system itself though (no uninstall and reinstall of BOINC), and I can't, for some reason, merge the two on the website. grtz |
Mithotar Joined: 11 Apr 01 Posts: 88 Credit: 66,037,385 RAC: 50 |
"Our raw data storage server had a drive failure early over the weekend" It would appear that perhaps the next target of a fund/donation drive has found us. |
Joined: 20 Sep 06 Posts: 4 Credit: 809,431 RAC: 1 |
Well, since I last looked it seems the only splitter went down, so no work units are being passed. One of my machines was able to get one, but my primary machine is sitting idle. Such is life. |
Chris Hotte Joined: 5 Aug 08 Posts: 2 Credit: 9,913,915 RAC: 2 |
How many thousand hours on those drives? 8 years? Replace them, as the bearings are likely going going.... We have some of the same going on in our shop, at 50K hours. Our XServe RAIDs have the drives sitting on edge, which is, IMHO, worse. Alternately we (you or I) could try re-racking our RAID modules the other way around to squeeze more life from the bearings. Back in the day the same principle worked for my 1541 floppy drive. |
Michael Reaves Joined: 4 Jun 99 Posts: 23 Credit: 22,801,220 RAC: 0 |
The drive failures are most probably caused by the Japan radiation cloud (That's what I'm telling my boss).... |
Joined: 20 Dec 05 Posts: 3187 Credit: 57,163,290 RAC: 0 |
When you get production SETI sorted out, could someone start Beta? (...or, if it isn't affected by the production problems, just start Beta?...) Hello, from Albany, CA!... |
OzzFan Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28 |
You have to ask in a language that Matt understands: if (seti) == sorted out then start (seti beta) elseif start (seti beta) :-D |
Joined: 14 Aug 08 Posts: 90 Credit: 162,139 RAC: 0 |
Hello... just wanted to say that everything is working for me again. Not sure if everyone else is having problems or not. I'm still able to send workunits in, and I'm getting workunits to complete. Just thought I'd let you all know that things are working... (For Me.) I have to shut my computer down for a while. There is a tornado watch in my area, and it's not looking good. Hope that everyone is keeping supplied with workunits, and to Matt and the crew.... Thanks for all the updates and your hard work. |
Joined: 20 Dec 05 Posts: 3187 Credit: 57,163,290 RAC: 0 |
Still no SETI beta, though... Hello, from Albany, CA!... |
Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
"Still no SETI beta, though..." Eric's got the hood cracked open on that. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
Joined: 20 Dec 05 Posts: 3187 Credit: 57,163,290 RAC: 0 |
"You have to ask in a language that Matt understands:" Close, but no cigar... What I meant was: If (SETI=sorted) then start beta, else if not(beta_problems) then start beta. If (beta_problems) then (post message on beta site). (My logic is Fortran/Basic oriented...) Hello, from Albany, CA!... |
OzzFan Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28 |
Excellent! It even has logic to cover if there are Beta problems. :-) |
.clair. Joined: 4 Nov 04 Posts: 1300 Credit: 55,390,408 RAC: 69 |
My understanding of Algol was very Basic, So i had to GO TO job center and Find something i could Do ;) |
David S Joined: 4 Oct 99 Posts: 18352 Credit: 27,761,924 RAC: 12 |
Why is the upload server down? I didn't see that anywhere on here. My machine has been sitting on its last 6 WUs all week and started working on Einstein again. David David Sitting on my butt while others boldly go, Waiting for a message from a small furry creature from Alpha Centauri. |
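
Editor's sketch: the opening post mentions a cron job that restarts rpc.idmapd every few minutes as a stopgap. The post does not show that job, so the following is only a rough illustration of one way such a watchdog might look, not what is actually running on the SETI@home servers. It differs from the described workaround in that it restarts the daemon only when it is found dead. The restart command, the `pgrep`-based check, and all function names are assumptions made for illustration.

```python
import re
import subprocess
from typing import Callable, Sequence

DAEMON = "rpc.idmapd"

# Hypothetical restart command. The original post does not say how the
# daemon is restarted; substitute whatever the local init system uses.
RESTART_CMD: Sequence[str] = ("/usr/sbin/rpc.idmapd",)


def is_running(name: str) -> bool:
    """Return True if a process with exactly this name exists.

    Relies on `pgrep -x`, which exits with status 0 when at least one
    process matches. The name is escaped because pgrep treats it as a regex.
    """
    result = subprocess.run(
        ["pgrep", "-x", re.escape(name)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def restart(cmd: Sequence[str] = RESTART_CMD) -> None:
    """Run the (assumed) restart command; raise if it exits non-zero."""
    subprocess.run(list(cmd), check=True)


def ensure_running(
    name: str,
    check: Callable[[str], bool] = is_running,
    start: Callable[[], None] = restart,
) -> bool:
    """Start the daemon if it is missing.

    Returns True when a restart was attempted, False when the daemon
    was already alive and nothing was done.
    """
    if check(name):
        return False
    start()
    return True


def main() -> None:
    """Entry point a cron-invoked wrapper could call."""
    if ensure_running(DAEMON):
        print(f"{DAEMON} was not running; restart attempted")
```

A crontab entry that calls `main()` every few minutes would complete the picture. Note that a restart of this kind only limits how long the daemon stays dead. Files that were already mapped to "nobody" during the gap may still need separate attention, and the underlying "I/O Possible" crash remains undiagnosed.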
©2023 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.