Get Out of My House (Jan 18 2011)


Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 1068032 - Posted: 18 Jan 2011, 22:02:28 UTC

Nothing like coming back from a long holiday weekend and having one of your main production servers croak as soon as you arrive. It's a sunny day outside and I was stuck wearing my fleece jacket and fingerless gloves inside a well air-conditioned server closet.

So what happened? Not sure exactly, but bruno (the upload server, as well as the main boincadm administrative server) was all hung up as soon as we started the normal Tuesday outage. I had to reboot it, and that was that - it wouldn't come up properly again.

It seems to be a multiple-part problem. There was a disk failure, and the 3ware card in this system has always given us trouble. What kind of trouble? Well, if you reboot the system (without a full power cycle) random drives go missing. That's kind of a problem, no? I don't think this is a single broken card - a labmate has similar problems with the same model in his system (I forget the model #, but it's a 24-channel card). Anyway, the big RAID10 holding all the results was tagged as degraded and is rebuilding now.

That's fine, except the OS (which is on separate partitions and not under the jurisdiction of the 3ware card) isn't booting either. Jeez! The good news is I can boot off a Fedora live CD and see both the root and upload storage drives, so there's no data loss. It just won't boot!

The other good news is that, if we need it, we have a backup system already: synergy! It might be getting pulled into prime time sooner than expected. It doesn't have nearly as many disk spindles as bruno, but this might not be an issue - there's still plenty of disk space on it, and a lot of memory for potential file system caching. It's still undecided if we're going to make synergy the new bruno, but I'm at least copying everything there now just to be safe.

I might still be able to get bruno up this afternoon, but if not, looks like we're down for the evening (it'll take that long to copy everything over to synergy).

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 14,890,454
RAC: 11,549
United States
Message 1068038 - Posted: 18 Jan 2011, 22:13:58 UTC - in response to Message 1068032.

Ain't it grand knowing you have a spare, brand new computer just waiting to show what it can do? :-) Good luck with Bruno but if you can't get it up we'll see you tomorrow!
____________


PROUD MEMBER OF Team Starfire World BOINC

QSilver
Joined: 26 May 99
Posts: 227
Credit: 4,485,203
RAC: 3,336
United States
Message 1068040 - Posted: 18 Jan 2011, 22:17:54 UTC

Thanks for the update, Matt. And good luck!
____________

Jim_S
Joined: 23 Feb 00
Posts: 4472
Credit: 18,296,528
RAC: 5,596
United States
Message 1068042 - Posted: 18 Jan 2011, 22:23:31 UTC

Thanks for the update Matt. And I'll second that good luck!!!
____________

I Desire Peace and Justice, Jim Scott

Todd Hebert
Volunteer tester
Joined: 16 Jun 00
Posts: 647
Credit: 217,127,962
RAC: 0
United States
Message 1068043 - Posted: 18 Jan 2011, 22:25:21 UTC

Never been a big fan of the 3Ware cards of days past - they were bought by LSI to capture market share and are partnered with Intel so there is plenty of development budget behind them now.

It is good that there is a "spare" around that would be up to the task to take over where bruno left off.

I sense another fundraiser in our future - I still have access to another barebones server from Intel at a discounted price.

Todd
____________

[seti.international] Dirk Sadowski
Volunteer tester
Joined: 6 Apr 07
Posts: 7022
Credit: 59,225,151
RAC: 20,573
Germany
Message 1068046 - Posted: 18 Jan 2011, 22:44:25 UTC

Matt, Todd ;-), thanks for the news!

____________
BR



>Das Deutsche Cafe. The German Cafe.<

Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4046
Credit: 32,693,028
RAC: 611
United Kingdom
Message 1068051 - Posted: 18 Jan 2011, 23:11:41 UTC - in response to Message 1068032.

Thanks for the update Matt, good luck with Bruno,

Claggy

Gary Charpentier
Volunteer tester
Joined: 25 Dec 00
Posts: 12115
Credit: 6,401,332
RAC: 8,092
United States
Message 1068056 - Posted: 18 Jan 2011, 23:29:44 UTC

Good luck on Bruno and thanks for keeping us informed.

____________

Zapped Sparky
Volunteer moderator
Volunteer tester
Joined: 30 Aug 08
Posts: 6637
Credit: 1,200,844
RAC: 77
United Kingdom
Message 1068058 - Posted: 18 Jan 2011, 23:43:58 UTC

That's some trouble, good luck, hope you can get Bruno going again.

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 1068065 - Posted: 19 Jan 2011, 0:01:52 UTC

By the way, no dice on bruno. I tried a bunch of things, it still won't boot - even though I can see all the drives/data when I boot from CD. And the RAID card refuses to NOT tag the RAID as degraded.

We're probably going to be down for at least another day or two. I'll post something to the front page shortly.

- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 14,890,454
RAC: 11,549
United States
Message 1068071 - Posted: 19 Jan 2011, 0:27:07 UTC - in response to Message 1068065.

Well. I guess that means Hello Synergy!! If it's half as good as it appears it should be able to handle the new assignment with no problems. After checking my tasks on my BOINC Manager I see I can easily handle a couple of days downtime.

Even if I don't make it until you get back up I have some new toys coming in that I can install during the down time. I'll be here ready to go whenever you get back.
____________


PROUD MEMBER OF Team Starfire World BOINC

Saaby900T
Joined: 24 Dec 10
Posts: 76
Credit: 4,971,171
RAC: 0
United States
Message 1068081 - Posted: 19 Jan 2011, 1:15:34 UTC

Can we Still get new tasks?

Gary Charpentier
Volunteer tester
Joined: 25 Dec 00
Posts: 12115
Credit: 6,401,332
RAC: 8,092
United States
Message 1068095 - Posted: 19 Jan 2011, 2:14:06 UTC
Last modified: 19 Jan 2011, 2:15:10 UTC

Time for the spring cleaning. Reseat all the connectors from head to tail. I know that will take a couple of hours, especially with the connectors in the drives. Don't forget the ground strap.

If that fails, I guess Synergy is Bruno.

In any case it sounds like you have the data and that is the important thing.


Or is Bruno jealous of Synergy?
____________

Bill Walker
Joined: 4 Sep 99
Posts: 3330
Credit: 1,963,719
RAC: 2,085
Canada
Message 1068100 - Posted: 19 Jan 2011, 2:24:53 UTC

Alas, poor Bruno! I knew him, Horatio, a server of infinite jest, of most excellent fancy. He hath bore me on his back a thousand times, and now he done broke.

Hey Matt, any hints on how you are picking your thread titles these days?

____________

soft^spirit
Joined: 18 May 99
Posts: 6374
Credit: 28,628,617
RAC: 801
United States
Message 1068101 - Posted: 19 Jan 2011, 2:26:41 UTC

so...

Bruno just...

....

....Froze up? *ducks and runs*
____________

Janice

Swibby Bear
Joined: 1 Aug 01
Posts: 236
Credit: 7,250,427
RAC: 2,007
United States
Message 1068110 - Posted: 19 Jan 2011, 3:01:54 UTC

If I recall, the current Bruno is the former Bambi, and the old Bruno was ditched when Oscar and Carolyn came in. Our heads are in a swirl!

Good luck, Matt

RottenMutt
Joined: 15 Mar 01
Posts: 992
Credit: 207,654,623
RAC: 2
United States
Message 1068115 - Posted: 19 Jan 2011, 3:22:10 UTC - in response to Message 1068065.
Last modified: 19 Jan 2011, 3:23:46 UTC

By the way, no dice on bruno. I tried a bunch of things, it still won't boot - even though I can see all the drives/data when I boot from CD. And the RAID card refuses to NOT tag the RAID as degraded.
...
- Matt


Is the boot loader working?
How far into boot do you get?
____________

zoom314
Joined: 30 Nov 03
Posts: 45781
Credit: 36,405,058
RAC: 7,398
Message 1068123 - Posted: 19 Jan 2011, 3:59:51 UTC

Matt, Chin Up, You'll figure out what Gremlin is bugging Ya and squash It flat in no time once Ya do figure It all out. Me, I just have a learning curve with Vista Business x64 and with RealTemp 3.58. It isn't pretty, as I had to start It up with Task Scheduler instead of the Startup folder, but It should start when the PC does now. I may upgrade to 7 Pro sooner than I thought; at least I have the DVD already. Good Luck.
____________

-BeNt-
Joined: 17 Oct 99
Posts: 1234
Credit: 10,116,112
RAC: 0
United States
Message 1068131 - Posted: 19 Jan 2011, 4:34:00 UTC

Thanks for the heads up Matt, explains why all my uploads have been backed off 3 hours. Kind of figured it was just an extended outage on the upload server. Oh well thank goodness you have another server to throw in there.
____________
Traveling through space at ~67,000mph!

Harri Liljeroos
Joined: 29 May 99
Posts: 46
Credit: 18,385,397
RAC: 10,362
Finland
Message 1068183 - Posted: 19 Jan 2011, 7:37:04 UTC - in response to Message 1068100.
Last modified: 19 Jan 2011, 7:37:29 UTC

Hey Matt, any hints on how you are picking your thread titles these days?


The titles seem to be names of Kate Bush songs.
____________


Copyright © 2014 University of California