Sitting Targets (Mar 22 2011)


Message boards : Technical News : Sitting Targets (Mar 22 2011)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1384
Credit: 74,079
RAC: 0
United States
Message 1089429 - Posted: 22 Mar 2011, 22:41:34 UTC
Last modified: 24 Mar 2011, 18:15:58 UTC

Our raw data storage server had a drive failure early over the weekend, which locked a bunch of stuff up including workunit production. Oh well. We were able to sort it out when we all got back in the lab on Monday, but it wasn't until late in the day that enough radar-clean data was created for the splitters to chew on and make more workunits.

At nearly the same time the above drive failed (during major thunderstorms here in Berkeley, which is probably just coincidental) the replica database on jocelyn crashed yet again. This system keeps losing the external storage (in the form of a Sun 3510) and mysql freaks out. We're not sure what the issue is but today we became fairly confident the problem is local to the 3510 (and not jocelyn itself). An amber light on the back of it means "RAID controller failure" which in this case means this box is pretty much useless. However, on a long shot Jeff suggested I reseat all the drives (most of which have been mounted in the system since we first got it roughly 8 years ago). I did, and the 3510 for the moment seemed willing to play nice. I started recovering the replica database one more time but the 3510 disappeared yet again. We're brainstorming where to move the database - it's not worth replacing that 3510, so we'd need other storage options... Or perhaps not have a replica but some other home-grown backup option.

Meanwhile we still have creepy rpc.idmapd problems. This daemon, only on a few select systems, keeps dying at random with an "I/O Possible" message. When it dies, some mounted file systems are suddenly full of files owned by "nobody." I have a workaround for the time being - a cron job that restarts rpc.idmapd every few minutes.
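
A minimal sketch of what a watchdog like that could look like, in Python. This is not the actual cron job (its contents aren't shown here), and unlike a blind periodic restart it only restarts the daemon when it finds it dead. The restart command (`service rpcidmapd restart`) and the function names are assumptions for illustration; the service name differs between distributions.

```python
"""Watchdog sketch: restart rpc.idmapd when it is no longer running."""
import subprocess

DAEMON = "rpc.idmapd"
# Assumption: placeholder restart command; adjust for the local init system.
RESTART_CMD = ["service", "rpcidmapd", "restart"]


def is_running(name, run=subprocess.run):
    """Return True if pgrep finds a process with exactly this name."""
    # pgrep exits with status 0 when at least one process matches.
    result = run(["pgrep", "-x", name], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def ensure_running(name=DAEMON, restart_cmd=RESTART_CMD, run=subprocess.run):
    """Restart the daemon only if it is missing.

    Returns True when a restart was issued, False when nothing was done.
    The `run` parameter exists so the logic can be exercised without
    touching real processes.
    """
    if is_running(name, run):
        return False
    run(restart_cmd, check=True)  # raises CalledProcessError on failure
    return True
```

A small script that calls `ensure_running()` and is scheduled from cron every few minutes would bring the daemon back shortly after it dies. It does nothing about the root cause of the crashes.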

Had the usual Tuesday outage today. Spent that time messing with the above, dropping some unneeded science database indexes (maybe that'll speed things up as it'll free up buffer space?) and building a necessary index.

- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 3964
Credit: 31,878,964
RAC: 10,964
United Kingdom
Message 1089432 - Posted: 22 Mar 2011, 22:44:37 UTC - in response to Message 1089429.

Thanks for the update Matt,

Claggy

Chris S
Volunteer tester
Joined: 19 Nov 00
Posts: 29565
Credit: 9,000,009
RAC: 27,884
United Kingdom
Message 1089434 - Posted: 22 Mar 2011, 22:49:52 UTC

As always Matt, thanks for your time in updating us.

____________
Damsel Rescuer, Kitty Patron, Uli Fan, Julie Supporter, CAMRA
ES99 Admirer, Raccoon Friend, IFAW, PETA, 5% Badge


Pascal
Joined: 22 Jan 00
Posts: 23
Credit: 3,535,113
RAC: 393
Netherlands
Message 1089443 - Posted: 22 Mar 2011, 23:15:41 UTC
Last modified: 22 Mar 2011, 23:19:18 UTC

Hi Matt,

Thanks for the information.
Question: Is that also the reason for the following?

According to the web pages showing my running and finished WUs, my system should be crunching some WUs,
WUs that were created and sent today, March 22 2011.
But in reality there are no tasks running.

Again thanks for the info.

grtz,

Pascal

Black Squirrel Prime
Joined: 29 Jul 07
Posts: 8
Credit: 10,014,549
RAC: 9,583
United States
Message 1089444 - Posted: 22 Mar 2011, 23:15:46 UTC

Thanks Matt. Keep fighting the good fight!!!!

Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 3964
Credit: 31,878,964
RAC: 10,964
United Kingdom
Message 1089445 - Posted: 22 Mar 2011, 23:24:41 UTC - in response to Message 1089443.
Last modified: 23 Mar 2011, 0:22:41 UTC

Hi Matt,

Thanks for the information.
Question: Is that also the reason for the following?

According to the web pages showing my running and finished WUs, my system should be crunching some WUs,
WUs that were created and sent today, March 22 2011 15:09:13 UTC.
But in reality there are no tasks running.

Again thanks for the info.

grtz,

Pascal


That was around the time when scheduler contact was flaky (for both the Main and Beta projects). Expect them to be resent the next time you contact the server.

Edit: Looks like you generated a new host ID too. If the two hosts shown are the same machine, you might want to 'Merge duplicate records of this computer' on your host's details page,

i.e. right at the bottom of this page: Computer 5823101

Claggy

Rainmaker*
Joined: 2 Mar 11
Posts: 3
Credit: 300,863
RAC: 218
United States
Message 1089449 - Posted: 22 Mar 2011, 23:34:14 UTC

Matt

Like most everyone else, thank you for the update. However, I'm not up on the lingo of the trade. My conception of networks pretty much ends after a P2P setup.

I'll paraphrase: The RAID (familiar term) server that has all the raw data on it went kaput, and the replica of the RAID followed suit. The system that the RAID replica was attached to, a Sun 3510, keeps losing external storage - another RAID - but it seems to be an issue with the 3510, not the entire system itself, and we are in need of a different type of external storage system.

I have no idea what rpc.idmapd is, but I think a daemon is a "handler" of sorts, and that "handler," when it dies, fills up other mounted systems with files of unknown ownership. A workaround you created restarts the daemon every so often to keep it running and not stuffing the other systems full of files.

***end of paraphrasing***

I understand what distributed computing is, but when I start seeing jocelyn and mysql, my brain turns into goo and I don't understand it any more.

So I am guessing that it might be a while before things are back up and running for SETI@home.

If I am incorrect in my last "guess," please let me know.

Paul
____________

Pascal
Joined: 22 Jan 00
Posts: 23
Credit: 3,535,113
RAC: 393
Netherlands
Message 1089450 - Posted: 22 Mar 2011, 23:35:01 UTC - in response to Message 1089445.
Last modified: 22 Mar 2011, 23:54:10 UTC

Hi Claggy,

Thanks! But...
I tried forcing it to contact the website.
Work isn't downloaded.

You already noticed the IDs ;)
I just noticed that the ID of the system may have changed.
Don't know why.
I didn't make any changes to the system itself though (no uninstall and reinstall of BOINC).

And for some reason I can't merge the two on the website.
??

grtz

Mithotar
Joined: 11 Apr 01
Posts: 38
Credit: 14,316,985
RAC: 7,558
United States
Message 1089481 - Posted: 23 Mar 2011, 1:48:58 UTC - in response to Message 1089429.

"Our raw data storage server had a drive failure early over the weekend"

It would appear that perhaps the next target of a fund/donation drive
has found us.


____________

John William Gibson
Joined: 20 Sep 06
Posts: 4
Credit: 613,084
RAC: 274
United States
Message 1089508 - Posted: 23 Mar 2011, 3:40:45 UTC

Well, since I last looked it seems the only splitter went down, so no work units are being handed out. One of my machines was able to get one, but my primary machine is sitting idle. Such is life.

Chris Hotte
Joined: 5 Aug 08
Posts: 2
Credit: 7,549,191
RAC: 1,013
Canada
Message 1089527 - Posted: 23 Mar 2011, 4:13:22 UTC - in response to Message 1089508.

How many thousand hours on those drives? 8 years? Replace them, as the bearings are likely going going....

We have some of the same going on in our shop, at 50K hours. Our XServe RAIDs have the drives sitting on edge, which is, IMHO, worse. Alternatively we (you or I) could try re-racking our RAID modules the other way around to squeeze more life from the bearings. Back in the day the same principle worked for my 1541 floppy drive.

Michael Reaves
Joined: 4 Jun 99
Posts: 21
Credit: 14,369,981
RAC: 15,480
United States
Message 1089664 - Posted: 23 Mar 2011, 16:34:55 UTC

The drive failures are most probably caused by the Japan radiation cloud (That's what I'm telling my boss)....
____________

KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 1831
Credit: 7,565,309
RAC: 20,566
United States
Message 1089677 - Posted: 23 Mar 2011, 17:49:52 UTC
Last modified: 23 Mar 2011, 17:51:04 UTC

When you get production SETI sorted out, could someone start Beta? (...or, if it isn't affected by the production problems, just start Beta?...)
____________
.

OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 13307
Credit: 27,895,730
RAC: 16,348
United States
Message 1089683 - Posted: 23 Mar 2011, 18:12:14 UTC - in response to Message 1089677.

You have to ask in a language that Matt understands:

if (seti) == sorted out
then start (seti beta)
elseif start (seti beta)


:-D

Jeff Mercer
Joined: 14 Aug 08
Posts: 90
Credit: 154,458
RAC: 727
United States
Message 1089693 - Posted: 23 Mar 2011, 18:28:42 UTC

Hello... just wanted to say that everything is working for me again. Not sure if everyone else is having problems or not. I'm still able to send workunits in, and I'm getting workunits to complete. Just thought I'd let you all know that things are working... (For Me.) I have to shut my computer down for a while. There is a tornado watch in my area, and it's not looking good. Hope that everyone is keeping supplied with workunits, and to Matt and the crew.... Thanks for all the updates and your hard work.

KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 1831
Credit: 7,565,309
RAC: 20,566
United States
Message 1089898 - Posted: 24 Mar 2011, 7:04:42 UTC

Still no SETI beta, though...
____________
.

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1384
Credit: 74,079
RAC: 0
United States
Message 1090041 - Posted: 24 Mar 2011, 18:00:06 UTC - in response to Message 1089898.

Still no SETI beta, though...


Eric's got the hood cracked open on that.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 1831
Credit: 7,565,309
RAC: 20,566
United States
Message 1090051 - Posted: 24 Mar 2011, 19:11:55 UTC - in response to Message 1089683.
Last modified: 24 Mar 2011, 19:12:43 UTC

You have to ask in a language that Matt understands:

if (seti) == sorted out
then start (seti beta)
elseif start (seti beta)


:-D


close, but no cigar... what I meant was

If (SETI=sorted) then start beta
else if not(beta_problems) then start beta

If (beta_problems) then (post message on beta site).

(my logic is Fortran/Basic oriented...)
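
For anyone who doesn't read Fortran or BASIC, the rules above can be sketched in Python. The function name and the action strings are made up for illustration; the branching follows the three lines above as written.

```python
def beta_actions(seti_sorted, beta_problems):
    """Return the list of actions implied by the rules in this post."""
    actions = []
    # If (SETI=sorted) then start beta
    if seti_sorted:
        actions.append("start beta")
    # else if not(beta_problems) then start beta
    elif not beta_problems:
        actions.append("start beta")
    # If (beta_problems) then (post message on beta site)
    if beta_problems:
        actions.append("post message on beta site")
    return actions
```

Read literally, Beta gets started whenever production is sorted out, even if Beta has problems of its own; in that case a message is posted as well.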
____________
.

OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 13307
Credit: 27,895,730
RAC: 16,348
United States
Message 1090054 - Posted: 24 Mar 2011, 19:23:56 UTC - in response to Message 1090051.

Excellent! It even has logic to cover if there are Beta problems. :-)

clive G1FYE
Volunteer moderator
Joined: 4 Nov 04
Posts: 1300
Credit: 23,054,144
RAC: 5
United Kingdom
Message 1090095 - Posted: 24 Mar 2011, 21:17:21 UTC

My understanding of Algol was very Basic,
So I had to GO TO job center and Find something I could Do ;)


Copyright © 2014 University of California