Sitting Targets (Mar 22 2011)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 1089429 - Posted: 22 Mar 2011, 22:41:34 UTC
Last modified: 24 Mar 2011, 18:15:58 UTC

Our raw data storage server had a drive failure early over the weekend, which locked a bunch of stuff up including workunit production. Oh well. We were able to sort it out when we all got back in the lab on Monday, but it wasn't until late in the day that enough radar-clean data was created for the splitters to chew on and make more workunits.

At nearly the same time the above drive failed (during major thunderstorms here in Berkeley, which is probably just coincidental) the replica database on jocelyn crashed yet again. This system keeps losing the external storage (in the form of a Sun 3510) and mysql freaks out. We're not sure what the issue is but today we became fairly confident the problem is local to the 3510 (and not jocelyn itself). An amber light on the back of it means "RAID controller failure" which in this case means this box is pretty much useless. However, on a long shot Jeff suggested I reseat all the drives (most of which have been mounted in the system since we first got it roughly 8 years ago). I did, and the 3510 for the moment seemed willing to play nice. I started recovering the replica database one more time but the 3510 disappeared yet again. We're brainstorming where to move the database - it's not worth replacing that 3510, so we'd need other storage options... Or perhaps not have a replica but some other home-grown backup option.

Meanwhile we still have creepy rpc.idmapd problems. This daemon, only on a few select systems, keeps dying at random with an "I/O Possible" message. When it dies, some mounted file systems are suddenly full of files owned by "nobody." I have a workaround for the time being - a cron job that restarts rpc.idmapd every few minutes.
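
The post does not show the actual cron job, so what follows is only a minimal sketch of how such a watchdog might look. It is written in Python and assumes that `pgrep` is available and that the daemon can be restarted through a `service` command. The restart command and the function names are hypothetical, not taken from the post. Unlike the unconditional restart described above, this variant only restarts the daemon when it is no longer running.

```python
import subprocess
from typing import Callable, Sequence

DAEMON = "rpc.idmapd"
# Hypothetical restart command: the real one depends on the distribution's
# init system and is not given in the post.
RESTART_CMD = ["service", "rpcidmapd", "restart"]


def daemon_is_running(name: str = DAEMON) -> bool:
    """Return True if a process whose name matches `name` exists.

    pgrep treats the name as a regular expression, so the "." matches any
    character; -x requires the whole process name to match.
    """
    result = subprocess.run(
        ["pgrep", "-x", name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def restart_daemon(cmd: Sequence[str] = RESTART_CMD) -> bool:
    """Run the restart command and report whether it exited successfully."""
    return subprocess.run(list(cmd)).returncode == 0


def ensure_running(
    is_running: Callable[[], bool] = daemon_is_running,
    restart: Callable[[], bool] = restart_daemon,
) -> str:
    """Restart the daemon only if it has died; return a short status string."""
    if is_running():
        return "ok"
    return "restarted" if restart() else "restart-failed"
```

A cron entry that runs every few minutes could call `ensure_running()` and log the returned status. The check and the restart are passed in as callables so the logic can be exercised without touching a real daemon.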

Had the usual Tuesday outage today. Spent that time messing with the above, dropping some unneeded science database indexes (maybe that'll speed things up as it'll free up buffer space?) and building a necessary index.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 1089429 · Report as offensive
Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1089432 - Posted: 22 Mar 2011, 22:44:37 UTC - in response to Message 1089429.  

Thanks for the update Matt,

Claggy
ID: 1089432 · Report as offensive
Pascal
Joined: 22 Jan 00
Posts: 26
Credit: 3,624,307
RAC: 0
Netherlands
Message 1089443 - Posted: 22 Mar 2011, 23:15:41 UTC
Last modified: 22 Mar 2011, 23:19:18 UTC

Hi Matt,

Thanks for the information..
Question: Is that also the reason for the following?

According to the web pages showing my running and finished WUs, my system should be crunching some WUs.
WUs that were created and sent today, March 22 2011.
But in reality there are no tasks running.

Again thanks for the info.

grtz,

Pascal
ID: 1089443 · Report as offensive
Black Squirrel Prime
Joined: 29 Jul 07
Posts: 8
Credit: 15,317,965
RAC: 0
United States
Message 1089444 - Posted: 22 Mar 2011, 23:15:46 UTC

Thanks Matt. Keep fighting the good fight!!!!
ID: 1089444 · Report as offensive
Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1089445 - Posted: 22 Mar 2011, 23:24:41 UTC - in response to Message 1089443.  
Last modified: 23 Mar 2011, 0:22:41 UTC

Hi Matt,

Thanks for the information..
Question: Is that also the reason for the following?

According to the web pages showing my running and finished WUs, my system should be crunching some WUs.
WUs that were created and sent today, March 22 2011, 15:09:13 UTC.
But in reality there are no tasks running.

Again thanks for the info.

grtz,

Pascal


That was around the time when scheduler contact was flaky (for both the Main and Beta projects); expect them to be resent the next time you contact the server.

Edit: Looks like you generated a new host ID too. If the two hosts shown are the same, you might want to 'Merge duplicate records of this computer' on your host's details page,

i.e. right at the bottom of this page: Computer 5823101

Claggy
ID: 1089445 · Report as offensive
Rainmaker*
Joined: 2 Mar 11
Posts: 3
Credit: 1,350,052
RAC: 0
United States
Message 1089449 - Posted: 22 Mar 2011, 23:34:14 UTC

Matt

Like most everyone else, thank you for the update. However, I'm not up on the lingo of the trade. My conception of networks pretty much ends after a P2P setup.

I'll paraphrase: The RAID (familiar term) server that has all the raw data on it went kah-putz, and the replica of the RAID followed suit. The system that the RAID replica was attached to, a Sun 3510, keeps losing external storage - another RAID - but it seems to be an issue with the 3510, not the entire system itself, and we are in need of a different type of external storage system.

I have no idea what rpc.idmapd is, but I think a daemon is a "handler" of sorts, and that "handler," when it dies, fills up other mounted systems with files of unknown ownership. A workaround you created restarts the daemon every so often to keep it running and not stuffing the other systems full of files.

***end of paraphrasing***

I understand what distributed computing is, but when I start seeing jocelyn and mysql, my brain turns into goo and I don't understand it any more.

So I am guessing that it might be a while before things are back up and running for SETI@home.

If I am incorrect in my last "guess," please let me know.

Paul
ID: 1089449 · Report as offensive
Pascal
Joined: 22 Jan 00
Posts: 26
Credit: 3,624,307
RAC: 0
Netherlands
Message 1089450 - Posted: 22 Mar 2011, 23:35:01 UTC - in response to Message 1089445.  
Last modified: 22 Mar 2011, 23:54:10 UTC

Hi Claggy,

thnx! but..
Tried forcing it to contact the website.
Work isn't downloaded..

You already noticed the IDs ;)
I just noticed that the ID of the system may have changed??
Don't know why.
I didn't make any changes to the system itself, though (no uninstall and reinstall of BOINC).

And for some reason I can't merge the two on the website..
??

grtz
ID: 1089450 · Report as offensive
Mithotar
Joined: 11 Apr 01
Posts: 88
Credit: 66,037,385
RAC: 50
United States
Message 1089481 - Posted: 23 Mar 2011, 1:48:58 UTC - in response to Message 1089429.  

"Our raw data storage server had a drive failure early over the weekend"

It would appear that perhaps the next target of a fund/donation drive
has found us.


ID: 1089481 · Report as offensive
John William Gibson
Joined: 20 Sep 06
Posts: 4
Credit: 809,431
RAC: 1
United States
Message 1089508 - Posted: 23 Mar 2011, 3:40:45 UTC

Well, since I last looked it seems the only splitter went down, so no work units are being passed. One of my machines was able to get one, but my primary machine is sitting idle. Such is life.
ID: 1089508 · Report as offensive
Chris Hotte
Joined: 5 Aug 08
Posts: 2
Credit: 9,913,915
RAC: 2
Canada
Message 1089527 - Posted: 23 Mar 2011, 4:13:22 UTC - in response to Message 1089508.  

How many thousand hours on those drives? 8 years? Replace them, as the bearings are likely going going....

We have some of the same going on in our shop, at 50K hours. Our XServe RAIDs have the drives sitting on edge, which is, IMHO, worse. Alternatively, we (you or I) could try re-racking our RAID modules the other way around to squeeze more life from the bearings. Back in the day the same principle worked for my 1541 floppy drive.
ID: 1089527 · Report as offensive
Michael Reaves
Joined: 4 Jun 99
Posts: 23
Credit: 22,801,220
RAC: 0
United States
Message 1089664 - Posted: 23 Mar 2011, 16:34:55 UTC

The drive failures are most probably caused by the Japan radiation cloud (That's what I'm telling my boss)....
ID: 1089664 · Report as offensive
KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 1089677 - Posted: 23 Mar 2011, 17:49:52 UTC
Last modified: 23 Mar 2011, 17:51:04 UTC

When you get production SETI sorted out, could someone start Beta? (...or, if it isn't affected by the production problems, just start Beta?...)
.

Hello, from Albany, CA!...
ID: 1089677 · Report as offensive
OzzFan (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 1089683 - Posted: 23 Mar 2011, 18:12:14 UTC - in response to Message 1089677.  

You have to ask in a language that Matt understands:

if (seti) == sorted out
then start (seti beta)
elseif start (seti beta)


:-D
ID: 1089683 · Report as offensive
Jeff Mercer
Joined: 14 Aug 08
Posts: 90
Credit: 162,139
RAC: 0
United States
Message 1089693 - Posted: 23 Mar 2011, 18:28:42 UTC

Hello... just wanted to say that everything is working for me again. Not sure if everyone else is having problems or not. I'm still able to send workunits in, and I'm getting workunits to complete. Just thought I'd let you all know that things are working... (For Me.) I have to shut my computer down for a while. There is a tornado watch in my area, and it's not looking good. Hope that everyone is keeping supplied with workunits, and to Matt and the crew.... Thanks for all the updates and your hard work.
ID: 1089693 · Report as offensive
KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 1089898 - Posted: 24 Mar 2011, 7:04:42 UTC

Still no SETI beta, though...
.

Hello, from Albany, CA!...
ID: 1089898 · Report as offensive
Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 1090041 - Posted: 24 Mar 2011, 18:00:06 UTC - in response to Message 1089898.  

Still no SETI beta, though...


Eric's got the hood cracked open on that.

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 1090041 · Report as offensive
KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 1090051 - Posted: 24 Mar 2011, 19:11:55 UTC - in response to Message 1089683.  
Last modified: 24 Mar 2011, 19:12:43 UTC

You have to ask in a language that Matt understands:

if (seti) == sorted out
then start (seti beta)
elseif start (seti beta)


:-D


close, but no cigar... what I meant was

If (SETI=sorted) then start beta
else if not(beta_problems) then start beta

If (beta_problems) then (post message on beta site).

(my logic is Fortran/Basic oriented...)
.

Hello, from Albany, CA!...
ID: 1090051 · Report as offensive
OzzFan (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 1090054 - Posted: 24 Mar 2011, 19:23:56 UTC - in response to Message 1090051.  

Excellent! It even has logic to cover if there are Beta problems. :-)
ID: 1090054 · Report as offensive
.clair.
Joined: 4 Nov 04
Posts: 1300
Credit: 55,390,408
RAC: 69
United Kingdom
Message 1090095 - Posted: 24 Mar 2011, 21:17:21 UTC

My understanding of Algol was very Basic,
So i had to GO TO job center and Find something i could Do ;)
ID: 1090095 · Report as offensive
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1090285 - Posted: 25 Mar 2011, 14:28:08 UTC - in response to Message 1089429.  

Why is the upload server down? I didn't see that anywhere on here. My machine has been sitting on its last 6 WUs all week and started working on Einstein again.

David
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1090285 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.