Thumper (Feb 07 2011)


Message boards : Technical News : Thumper (Feb 07 2011)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1391
Credit: 74,079
RAC: 10
United States
Message 1075188 - Posted: 7 Feb 2011, 22:26:46 UTC

Wow, what a mess this thumper OS install has been. I really don't want to go into details, except that I've probably installed the OS at least twenty times over the past week, and that I'm reconsidering my career path (just kidding). It's amazing how stupidly complicated this has been. I'm just trying to get it configured the way that makes the most sense, but it looks like we're going to have to stick with what works instead. There has been little pressure to rush this, as nothing directly depends on this system, but given how much of a time sink it has been, and given that the need for its disk space is growing, we need to get something going. Also, ptolemy rebooted itself last night as a reminder that we really do need to start wrapping things up on this front.

Meanwhile on Friday gowron (the workunit storage server) had a drive failure that locked up the whole system until I came in (on my off day) and forced a reboot. This inspired the whole RAID to resync, which takes at least a day. Fine. We came in this morning and started the projects up and replaced the failed drive... only to have ANOTHER drive fail on the system, locking it up, etc. etc. etc. So the current resync will happen all night, leading us into our regular weekly outage tomorrow.

Oh yeah, during all this I had to force reboot bruno (the general BOINC administrative and upload server) which apparently spiralled out of control last night due to gowron's missing mount.

All the newer systems are still working great, and science database tests and improvements continue along as planned.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Claggy
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4249
Credit: 34,988,188
RAC: 21,132
United Kingdom
Message 1075191 - Posted: 7 Feb 2011, 22:37:01 UTC - in response to Message 1075188.

Thanks for the update Matt, and for all your efforts, and for coming in on your day off,

Claggy

Profile Jeff Mercer
Joined: 14 Aug 08
Posts: 90
Credit: 162,139
RAC: 0
United States
Message 1075197 - Posted: 7 Feb 2011, 22:47:08 UTC

Hello, and once again, THANKS for the update. Sorry that the people of PROJECT SETI are having all these problems. Oh well, whatchagonnado? Patch it all up and get it running again!!! In the meantime, since I'm out of work now, I guess I'll just send out some email to some friends or maybe play a game or two. I'll be here, ready and waiting for more work, whenever you get things running again. Thanks for all your hard work.

Profile [seti.international] Dirk Sadowski
Volunteer tester
Joined: 6 Apr 07
Posts: 7124
Credit: 61,641,412
RAC: 15,681
Germany
Message 1075203 - Posted: 7 Feb 2011, 23:11:15 UTC - in response to Message 1075188.

Matt, thanks for the news!

____________
BR

SETI@home Needs your Help ... $10 & U get a Star!

Team seti.international

Das Deutsche Cafe. The German Cafe.

Swibby Bear
Joined: 1 Aug 01
Posts: 236
Credit: 7,276,504
RAC: 1
United States
Message 1075210 - Posted: 7 Feb 2011, 23:39:49 UTC

Matt, you just can't seem to catch a break.

Are all (many) of the hard drives of a similar age, where all the "baby-boomers" are going to "retire" at nearly the same time? Maybe we should try to update dozens of drives?

(Now I'm about to tread where I don't know much about the topic) I thought so-called "hot spares" would kick in automagically to keep things going while a replacement drive was being installed. This didn't happen, I guess?

Thanks for all you do.

Whit

Profile Zapped Sparky
Project donor
Volunteer moderator
Volunteer tester
Joined: 30 Aug 08
Posts: 9423
Credit: 1,337,275
RAC: 789
United Kingdom
Message 1075211 - Posted: 7 Feb 2011, 23:40:05 UTC

Thanks for the news, coming in on your day off and for all the stressful work you've done this week.

Ham Todd
Joined: 25 Apr 03
Posts: 11
Credit: 13,758,515
RAC: 5,209
United States
Message 1075227 - Posted: 8 Feb 2011, 1:09:59 UTC

Nice job Matt and thanks!

Kafo
Joined: 17 Dec 00
Posts: 15
Credit: 4,095,901
RAC: 2,813
Canada
Message 1075237 - Posted: 8 Feb 2011, 1:40:08 UTC

I have been looking at all the issues the servers have been having the last couple of months and think I have figured out what is causing all the problems. There is a work unit that has picked up an alien signal. Since these aliens don't want to be found, they have been zapping the servers every time it is sent out. All we need to do is find and delete this work unit and we will be able to get back to crunching without all these interruptions.

Profile Uli
Project donor
Volunteer tester
Joined: 6 Feb 00
Posts: 10141
Credit: 5,475,477
RAC: 226
Germany
Message 1075263 - Posted: 8 Feb 2011, 3:45:30 UTC - in response to Message 1075237.

Snicker, with your sense of humor, I hope to see you in the Cafe sometime.

A bit of money should be rolling in, because of the Superbowl challenge by the GPU User group and Seti Netherland donations drive.

____________
Pluto will always be a planet to me.
Only 8 Anni Shirts Size XL left. Just PM me for details.
Cash Donation Specialist

Seti Ambassador

Jesse Viviano
Joined: 27 Feb 00
Posts: 95
Credit: 492,386
RAC: 1,235
United States
Message 1075272 - Posted: 8 Feb 2011, 4:13:50 UTC - in response to Message 1075210.

There are several problems with a RAID with failed drives. First, a RAID with a failed disk runs really slowly. Second, a RAID probably must be taken offline before it can rebuild onto the hot spare; otherwise the rebuild has to compete with the traffic going into and out of the array and could take a very long time to finish. Third, a RAID with a failed disk (or two failed disks in the case of RAID 6) is less reliable than a single disk: the failure of one more disk destroys the array, because there is no longer enough data and parity left to reconstruct what was lost onto the spare disks when they are made available.

Therefore, hot spares do not instantly work. They work only as long as there are enough good disks around to allow the lost data and parity to be recalculated.
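
As an illustration of the parity recalculation described above, here is a minimal Python sketch. It assumes a single-parity (RAID 5 style) stripe in which the parity block is simply the byte-wise XOR of the data blocks. The function names (`xor_blocks`, `rebuild_missing`) and the sample blocks are made up for the example and are not taken from any real RAID implementation; real arrays rotate parity across disks, and RAID 6 adds a second, different syndrome that this sketch does not model.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte column at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(stripe):
    """Rebuild the one missing block (marked None) from the surviving blocks.

    `stripe` holds the data blocks plus the parity block. Single parity can
    recover exactly one lost block; with two or more missing there is not
    enough information left, so this raises ValueError.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) != 1:
        raise ValueError(
            f"cannot rebuild: {len(missing)} blocks missing, single parity covers exactly 1"
        )
    survivors = [block for block in stripe if block is not None]
    return xor_blocks(survivors)


# Hypothetical stripe: three data "disks" plus one parity "disk".
data = [b"SETI", b"HOME", b"DATA"]
parity = xor_blocks(data)
stripe = data + [parity]

# Simulate losing the second disk, then rebuild it onto a "spare".
degraded = list(stripe)
degraded[1] = None
rebuilt = rebuild_missing(degraded)
print(rebuilt)  # b'HOME'
```

With one block missing, the survivors XOR back to the lost data. With two missing, single parity has nothing left to work from, which is the "failure of one more disk" case.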

Replacing all of the disks would take a very long time that SETI@home users will not put up with, and would suck up funds that SETI@home (and almost everybody else) always seems to be short of. (If SETI@home had enough funds, it would, among other things, already have replaced that 100 Mbps line to Hurricane Electric with a gigabit line. That would let it fully utilize the gigabit port it is already paying for and do away with the ninety-something megabits per second cap it sometimes runs into.)

Profile Mad Max
Volunteer tester
Joined: 16 Mar 00
Posts: 474
Credit: 86,371,705
RAC: 33,100
United States
Message 1075662 - Posted: 9 Feb 2011, 22:46:10 UTC - in response to Message 1075237.

I want that WU.
____________
IAS - Where Space Is Golden!

Swibby Bear
Joined: 1 Aug 01
Posts: 236
Credit: 7,276,504
RAC: 1
United States
Message 1075696 - Posted: 10 Feb 2011, 1:20:22 UTC - in response to Message 1075272.

Jesse -- Thanks for the in-depth explanation. I guess my idea of "automagical" isn't quite so magical. Still, I guess RAID is a magical concept for data preservation. Thanks again.

Richard Plumley
Volunteer tester
Joined: 21 Jan 02
Posts: 9
Credit: 3,273,909
RAC: 0
United States
Message 1075730 - Posted: 10 Feb 2011, 4:44:38 UTC

Sounds like Matt needs a hug, some Tylenol and a couple of cold ones. I could almost feel his headache as I read his post.

Profile [AF>FRANCE]peronik
Volunteer tester
Joined: 5 Jan 03
Posts: 6
Credit: 700,907
RAC: 50
France
Message 1075979 - Posted: 11 Feb 2011, 0:15:53 UTC

If you don't want any more bugs with your RAID configuration, why not have a look at the ZFS file system? http://en.wikipedia.org/wiki/ZFS
It's a file system that builds in the RAID concepts directly, without any other software or hardware.
This file system is included in Sun-like distros and FreeBSD-like distros.

If you can't switch to FreeBSD, you could try Debian 6.0 "Squeeze" (the latest stable version).
Debian 6.0 has two kernels: the standard Linux one and the new kFreeBSD, the kernel of FreeBSD. http://www.debian.org/ports/kfreebsd-gnu/index.en.html

The kFreeBSD-amd64 version (the 64-bit one) has ZFS as a standard file system.

RadioFree 102.3 FM
Joined: 29 Jul 99
Posts: 1
Credit: 114,227,322
RAC: 33,031
United States
Message 1077672 - Posted: 15 Feb 2011, 23:40:18 UTC

RE: Mysterious Synergy Reboots & U.P.S.,

Matt, I think that you may be on the right track. It sounds like the U.P.S. is doing its semi-weekly self test... I have loads of APC UPSes, and sometimes very sensitive loads combined with possibly "iffy" batteries or a touchy U.P.S. can cause enough of a glitch on the AC line to cause a "hiccup" on the powered systems. If it is an APC, you should be able to disable the tests within its configuration interface.

I hope that it's this simple!!! Good Luck!!!

BTW, if you don't already have any, you might want to consider getting some APC managed PDUs (model AP7930 or AP7900). That way, you'll be able to hard power off and re-power any load from a browser (or Telnet) within your LAN or VPN. I have had many of these for over 10 years with no problems or failures whatsoever on all kinds of "mission critical" equipment in my 24/7 broadcast plant...

Best to all of you!,

Barry R. Clark
Director of Engineering,
Information Technologies
and Communications.

Taxi Productions Inc.
KJLH Radio Los Angeles
161 North La Brea Avenue
Inglewood, CA 90301-1707
(310) 330-2222 Office.
mailto:bclark@kjlhradio.com
www.radiofree.fm
www.kjlh.fm

You can follow RadioFree 102.3 on Twitter @RadioFreeKJLH...

Copyright © 2014 University of California