Thumper (Feb 07 2011)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 1075188 - Posted: 7 Feb 2011, 22:26:46 UTC

Wow, what a mess this thumper OS install has been. I really don't want to go into details, except that I've probably installed the OS at least twenty times over the past week, and that I'm reconsidering my career path (just kidding). It's amazing how stupidly complicated this has been - I'm just trying to get it configured the way that makes the most sense, but it looks like we're going to have to stick with what works instead. There has been little pressure to rush this, as nothing is directly depending on this system, but given how much of a time sink it has been, and that the need for its disk space is growing, we need to get something going. Also, ptolemy rebooted itself last night as a reminder that we really do need to start wrapping things up on this front.

Meanwhile on Friday gowron (the workunit storage server) had a drive failure that locked up the whole system until I came in (on my off day) and forced a reboot. This inspired the whole RAID to resync, which takes at least a day. Fine. We came in this morning and started the projects up and replaced the failed drive... only to have ANOTHER drive fail on the system, locking it up, etc. etc. etc. So the current resync will happen all night, leading us into our regular weekly outage tomorrow.

Oh yeah, during all this I had to force reboot bruno (the general BOINC administrative and upload server) which apparently spiralled out of control last night due to gowron's missing mount.

All the newer systems are still working great, and science database tests and improvements continue along as planned.

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 1075188 · Report as offensive
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1075191 - Posted: 7 Feb 2011, 22:37:01 UTC - in response to Message 1075188.  

Thanks for the update Matt, and for all your efforts, and for coming in on your day off,

Claggy
ID: 1075191 · Report as offensive
Profile Jeff Mercer

Joined: 14 Aug 08
Posts: 90
Credit: 162,139
RAC: 0
United States
Message 1075197 - Posted: 7 Feb 2011, 22:47:08 UTC

Hello, and once again, THANKS for the update. Sorry that the people of PROJECT SETI are having all these problems. Oh well, whatchagonnado ? Patch it all up and get it running again !!! In the meantime, since I'm out of work now, I guess I'll just send out some email to some friends or maybe play a game or two. I'll be here, ready and waiting, for more work, whenever you get things running again. Thanks for all your hard work.
ID: 1075197 · Report as offensive
Profile Sutaru Tsureku
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1075203 - Posted: 7 Feb 2011, 23:11:15 UTC - in response to Message 1075188.  

Matt, thanks for the news!

ID: 1075203 · Report as offensive
Swibby Bear

Joined: 1 Aug 01
Posts: 246
Credit: 7,945,093
RAC: 0
United States
Message 1075210 - Posted: 7 Feb 2011, 23:39:49 UTC

Matt, you just can't seem to catch a break.

Are all (many) of the hard drives of a similar age where all the "baby-boomers" are going to "retire" at nearly the same time? Maybe we should try to update dozens of drives?

(Now I'm about to tread where I don't know much about the topic) I thought so-called "hot spares" would kick in automagically to keep things going while a replacement drive was being installed. This didn't happen, I guess?

Thanks for all you do.

Whit
ID: 1075210 · Report as offensive
Profile Dimly Lit Lightbulb 😀
Volunteer tester
Joined: 30 Aug 08
Posts: 15399
Credit: 7,423,413
RAC: 1
United Kingdom
Message 1075211 - Posted: 7 Feb 2011, 23:40:05 UTC

Thanks for the news, coming in on your day off and for all the stressful work you've done this week.
ID: 1075211 · Report as offensive
Ham Todd

Joined: 25 Apr 03
Posts: 11
Credit: 29,459,893
RAC: 45
United States
Message 1075227 - Posted: 8 Feb 2011, 1:09:59 UTC

Nice job Matt and thanks!
ID: 1075227 · Report as offensive
Kafo

Joined: 17 Dec 00
Posts: 19
Credit: 15,432,367
RAC: 18
Canada
Message 1075237 - Posted: 8 Feb 2011, 1:40:08 UTC

I have been looking at all the issues the servers have been having the last couple of months and think I have figured out what is causing all the problems. There is a work unit that has picked up an alien signal. Since these aliens don't want to be found, they have been zapping the servers every time it is sent out. All we need to do is find and delete this work unit and we will be able to get back to crunching without all these interruptions.
ID: 1075237 · Report as offensive
Profile Uli
Volunteer tester
Joined: 6 Feb 00
Posts: 10923
Credit: 5,996,015
RAC: 1
Germany
Message 1075263 - Posted: 8 Feb 2011, 3:45:30 UTC - in response to Message 1075237.  

Snicker, with your sense of humor, I hope to see you in the Cafe sometime.

A bit of money should be rolling in, because of the Superbowl challenge by the GPU User group and Seti Netherland donations drive.

Pluto will always be a planet to me.

Seti Ambassador
Not too late to order an Anni Shirt
ID: 1075263 · Report as offensive
Jesse Viviano

Joined: 27 Feb 00
Posts: 100
Credit: 3,949,583
RAC: 0
United States
Message 1075272 - Posted: 8 Feb 2011, 4:13:50 UTC - in response to Message 1075210.  

There are several problems with a RAID with failed drives. First, a RAID with a failed disk runs really slowly, because reads of the missing data have to be reconstructed from the remaining data and parity. Second, a busy RAID probably has to be taken offline (or at least have its load reduced) before it rebuilds onto the hot spare; otherwise the rebuild has to compete with the traffic coming into and out of the array and could take a very long time to finish. Third, a RAID with a failed disk (or two failed disks in the case of RAID 6) is less reliable than a single disk: the failure of one more disk destroys the array, because there is no longer enough data and parity to reconstruct the missing contents onto the spare disks when they become available.
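
To put rough numbers on the third point, here is a back-of-the-envelope sketch in Python. It assumes disks fail independently and uses a made-up per-disk failure chance and array size; neither figure comes from the actual gowron hardware.

```python
def failure_probability(disks_at_risk: int, p_disk: float) -> float:
    """Chance that at least one of `disks_at_risk` independent disks fails,
    given each fails with probability `p_disk` over the same period."""
    return 1.0 - (1.0 - p_disk) ** disks_at_risk

# Hypothetical numbers, chosen only for illustration.
P_DISK = 0.02      # assumed per-disk failure chance during the rebuild window
ARRAY_DISKS = 8    # assumed size of a RAID 5 array before the first failure

# A lone disk loses data if that one disk fails.
single_disk = failure_probability(1, P_DISK)

# A degraded RAID 5 has no redundancy left: losing ANY survivor loses the array.
degraded_raid5 = failure_probability(ARRAY_DISKS - 1, P_DISK)

print(f"single disk:     {single_disk:.3f}")
print(f"degraded RAID 5: {degraded_raid5:.3f}")
```

With more than one surviving disk the second figure is always the larger one, which is all "less reliable than a single disk" means here.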

Therefore, hot spares do not instantly work. They work as long as there are enough good disks around to allow the lost data and parity to be recalculated.
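
For anyone curious how the recalculation works, here is a minimal sketch, assuming simple single-parity (RAID 5 style) XOR. The block contents and the helper name `xor_blocks` are made up for illustration; real controllers work on whole stripes and rotate parity across disks.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks of one hypothetical stripe, plus their parity block.
data = [b"SETI", b"@hom", b"e!!!"]
parity = xor_blocks(data)

# Lose one disk: XOR of the survivors and the parity gives its block back.
lost = 1
survivors = [blk for i, blk in enumerate(data) if i != lost]
rebuilt = xor_blocks(survivors + [parity])
assert rebuilt == data[lost]

# Lose a second disk as well: the same XOR only yields the two missing
# blocks mixed together, so neither one can be recovered.
still_alive = [data[0]]
attempt = xor_blocks(still_alive + [parity])  # equals data[1] XOR data[2]
assert attempt != data[1] and attempt != data[2]
```

The hot spare is just the place where `rebuilt` gets written; it only helps while the surviving disks can still supply everything the XOR needs.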

Replacing all of the disks would take a very long time, which SETI@home users would not put up with, and would eat up funds that SETI@home (like almost everybody else) always seems to be short of. (If SETI@home had enough funds, it would already have replaced the 100 Mbps line to Hurricane Electric with a gigabit line, so that it could fully use the gigabit port it is paying for and do away with the ninety-something megabits per second cap it sometimes runs into, among other things.)
ID: 1075272 · Report as offensive
Profile Mad Max
Volunteer tester
Joined: 16 Mar 00
Posts: 475
Credit: 213,231,775
RAC: 407
United States
Message 1075662 - Posted: 9 Feb 2011, 22:46:10 UTC - in response to Message 1075237.  

I want that WU.
IAS - Where Space Is Golden!
ID: 1075662 · Report as offensive
Swibby Bear

Joined: 1 Aug 01
Posts: 246
Credit: 7,945,093
RAC: 0
United States
Message 1075696 - Posted: 10 Feb 2011, 1:20:22 UTC - in response to Message 1075272.  

Jesse -- Thanks for the in-depth explanation. I guess my idea of "automagical" isn't quite so magical. Still, I guess RAID is a magical concept for data preservation. Thanks again.
ID: 1075696 · Report as offensive
Richard Plumley
Volunteer tester
Joined: 21 Jan 02
Posts: 9
Credit: 3,273,909
RAC: 0
United States
Message 1075730 - Posted: 10 Feb 2011, 4:44:38 UTC

Sounds like Matt needs a hug, some Tylenol and a couple of cold ones. I could almost feel his headache as I read his post.
ID: 1075730 · Report as offensive
Profile [AF>FRANCE]peronik
Volunteer tester

Joined: 5 Jan 03
Posts: 6
Credit: 700,907
RAC: 0
France
Message 1075979 - Posted: 11 Feb 2011, 0:15:53 UTC

If you don't want any more bugs with the RAID configuration, why not have a look at the ZFS file system? http://en.wikipedia.org/wiki/ZFS
It's a file system that includes the RAID concepts directly, without any other software or hardware.
This file system is included in Sun-like (Solaris) and FreeBSD-like distros.

If you can't switch to FreeBSD, you could try Debian 6.0 "Squeeze" (the latest stable version).
Debian 6.0 has two kernels: the standard Linux one and the new kFreeBSD, the kernel of FreeBSD. http://www.debian.org/ports/kfreebsd-gnu/index.en.html

The kfreebsd-amd64 version (the 64-bit one) has ZFS as a standard file system.
ID: 1075979 · Report as offensive
FreeNeo

Joined: 29 Jul 99
Posts: 1
Credit: 123,153,938
RAC: 0
United States
Message 1077672 - Posted: 15 Feb 2011, 23:40:18 UTC

RE: Mysterious Synergy Reboots & U.P.S.,

Matt, I think that you may be on the right track. It sounds like the U.P.S. is doing its Semi-Weekly Self Test... I have loads of APC UPSes, and sometimes very sensitive loads, combined with possibly "iffy" batteries or a touchy U.P.S., may cause enough of a glitch on the AC line to cause a "Hiccup" on the powered Systems. If it is an APC, you should be able to disable the tests within its configuration interface.

I hope that it's this simple!!! Good Luck!!!

BTW, if you don't already have any, you might want to consider getting some A.P.C. Managed P.D.U.s (model AP7930 or AP7900). That way, you'll be able to hard Power off and Re-power any load from a Browser (or Telnet), within your LAN or VPN. I have had many of these for over 10 years with no problems or failures whatsoever on all kinds of "Mission Critical" Equipment in my 24/7 Broadcast Plant...

Best to all of you!,

Barry R. Clark
Director of Engineering,
Information Technologies
and Communications.

Taxi Productions Inc.
KJLH Radio Los Angeles
161 North La Brea Avenue
Inglewood, CA 90301-1707
(310) 330-2222 Office.
mailto:bclark@kjlhradio.com
www.radiofree.fm
www.kjlh.fm

You can follow RadioFree 102.3 on Twitter @RadioFreeKJLH...
ID: 1077672 · Report as offensive
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1077924 - Posted: 16 Feb 2011, 15:57:21 UTC - in response to Message 1077672.  

Very nice and informative post, sir.

Might consider tagging along with the Avenue, 91.1.......

Rather new station......couple of years or so.

Independent, non profit.

Play some gawd awful good tunes.





"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1077924 · Report as offensive



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.