Thumper (Feb 07 2011)

Matt Lebofsky Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0	Message 1075188 - Posted: 7 Feb 2011, 22:26:46 UTC Wow, what a mess this thumper OS install has been. I really don't want to go into details except that I've probably installed the OS at least twenty times over the past week, and that I'm reconsidering my career path (just kidding). It's amazing how stupidly complicated this has been - I'm just trying to get it configured the way that makes the most sense, but it looks like we're going to have to stick with what works instead. There has been little pressure to rush this as nothing is directly depending on this system, but given how much of a time sink it has been, and that the need for its disk space is growing, we need to get something going. Also, ptolemy rebooted itself last night as a reminder that we really do need to start wrapping things up on this front.
Meanwhile, on Friday gowron (the workunit storage server) had a drive failure that locked up the whole system until I came in (on my off day) and forced a reboot. This inspired the whole RAID to resync, which takes at least a day. Fine. We came in this morning, started the projects up, and replaced the failed drive... only to have ANOTHER drive fail on the system, locking it up, etc. etc. etc. So the current resync will run all night, leading us into our regular weekly outage tomorrow.
Oh yeah, during all this I had to force reboot bruno (the general BOINC administrative and upload server), which apparently spiralled out of control last night due to gowron's missing mount.
All the newer systems are still working great, and science database tests and improvements continue along as planned.
- Matt
-- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Claggy Volunteer tester Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4	Message 1075191 - Posted: 7 Feb 2011, 22:37:01 UTC - in response to Message 1075188. Thanks for the update Matt, and for all your efforts, and for coming in on your day off. Claggy

Jeff Mercer Joined: 14 Aug 08 Posts: 90 Credit: 162,139 RAC: 0	Message 1075197 - Posted: 7 Feb 2011, 22:47:08 UTC Hello, and once again, THANKS for the update. Sorry that the people of PROJECT SETI are having all these problems. Oh well, whatchagonnado? Patch it all up and get it running again!!! In the meantime, since I'm out of work now, I guess I'll just send out some email to some friends or maybe play a game or two. I'll be here, ready and waiting for more work, whenever you get things running again. Thanks for all your hard work.

Dirk Sadowski Volunteer tester Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5	Message 1075203 - Posted: 7 Feb 2011, 23:11:15 UTC - in response to Message 1075188. Matt, thanks for the news!

Swibby Bear Joined: 1 Aug 01 Posts: 246 Credit: 7,945,093 RAC: 0	Message 1075210 - Posted: 7 Feb 2011, 23:39:49 UTC Matt, you just can't seem to catch a break. Are all (or many) of the hard drives of a similar age, so that all the "baby-boomers" are going to "retire" at nearly the same time? Maybe we should try to update dozens of drives? (Now I'm about to tread where I don't know much about the topic.) I thought so-called "hot spares" would kick in automagically to keep things going while a replacement drive was being installed. This didn't happen, I guess? Thanks for all you do. Whit

Dimly Lit Lightbulb 😀 Volunteer tester Joined: 30 Aug 08 Posts: 15401 Credit: 7,423,413 RAC: 1	Message 1075211 - Posted: 7 Feb 2011, 23:40:05 UTC Thanks for the news, coming in on your day off and for all the stressful work you've done this week.

Ham Todd Joined: 25 Apr 03 Posts: 11 Credit: 29,459,893 RAC: 45	Message 1075227 - Posted: 8 Feb 2011, 1:09:59 UTC Nice job Matt and thanks!

Kafo Joined: 17 Dec 00 Posts: 19 Credit: 15,432,367 RAC: 18	Message 1075237 - Posted: 8 Feb 2011, 1:40:08 UTC I have been looking at all the issues the servers have been having the last couple of months and think I have figured out what is causing all the problems. There is a work unit that has picked up an alien signal. Since these aliens don't want to be found, they have been zapping the servers every time it is sent out. All we need do is find and delete this work unit and we will be able to get back to crunching without all these interruptions.

Uli Volunteer tester Joined: 6 Feb 00 Posts: 10923 Credit: 5,996,015 RAC: 1	Message 1075263 - Posted: 8 Feb 2011, 3:45:30 UTC - in response to Message 1075237. Snicker, with your sense of humor, I hope to see you in the Cafe sometime. A bit of money should be rolling in because of the Superbowl challenge by the GPU User group and the Seti Netherland donations drive.
Pluto will always be a planet to me. Seti Ambassador. Not too late to order an Anni Shirt

Jesse Viviano Joined: 27 Feb 00 Posts: 100 Credit: 3,949,583 RAC: 0	Message 1075272 - Posted: 8 Feb 2011, 4:13:50 UTC - in response to Message 1075210. There are several problems with a RAID with failed drives. First, a RAID with a failed disk runs really slowly. Second, a RAID probably must be taken offline before it can try to rebuild onto the hot spare; otherwise the rebuild would have to compete with the traffic going into and out of the array and could take very long to finish. Third, a RAID with a failed disk (or two failed disks in the case of RAID 6) is less reliable than a single disk, because the failure of one more disk will destroy the RAID: there will not be enough data and parity left to reconstruct the missing data and parity onto the spare disks when they are made available. Therefore, hot spares do not work instantly. They work only as long as there are enough good disks around to allow the lost data and parity to be recalculated.
Replacing all of the disks would take a very long time that SETI@home users will not put up with, and would suck up funds that SETI@home (and almost everybody else) always seems to be short of.
(If SETI@home had enough funds, it would already have replaced that 100 Mbps line to Hurricane Electric with a gigabit line, so that it could fully utilize the gigabit port it is paying for and do away with the ninety-something megabits per second cap it is sometimes running into, among other things.)
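The point above about needing "enough data and parity" can be shown with a small sketch. This is only an illustration, assuming a single-parity (RAID 5-style) stripe where the parity block is the XOR of the data blocks; the block contents and the function names `xor_blocks` and `rebuild` are made up for the example and have nothing to do with how gowron's RAID is actually configured.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild(stripe, lost):
    """Recompute the one missing block of a single-parity stripe.

    stripe: list of blocks (data blocks followed by one parity block)
    lost:   set of indexes of failed disks
    """
    if len(lost) != 1:
        # With single parity, a second failure leaves too little
        # information to recover anything (RAID 6 tolerates two).
        raise ValueError("single parity can rebuild exactly one lost block")
    survivors = [block for i, block in enumerate(stripe) if i not in lost]
    return xor_blocks(survivors)

# Hypothetical three-disk data stripe plus one parity disk.
data = [b"SETI", b"HOME", b"WORK"]
parity = xor_blocks(data)
stripe = data + [parity]

# Disk 1 fails: its contents come back from the survivors plus parity.
recovered = rebuild(stripe, {1})
```

Every surviving block has to be read to recompute the lost one, which is one reason a rebuild competes so heavily with normal traffic, and why a further failure during the rebuild is fatal in this single-parity case.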

Mad Max Volunteer tester Joined: 16 Mar 00 Posts: 475 Credit: 213,231,775 RAC: 407	Message 1075662 - Posted: 9 Feb 2011, 22:46:10 UTC - in response to Message 1075237. I want that WU.
*IAS - Where Space Is Golden!*

Swibby Bear Joined: 1 Aug 01 Posts: 246 Credit: 7,945,093 RAC: 0	Message 1075696 - Posted: 10 Feb 2011, 1:20:22 UTC - in response to Message 1075272. Jesse -- Thanks for the in-depth explanation. I guess my idea of "automagical" isn't quite so magical. Still, I guess RAID is a magical concept for data preservation. Thanks again.

Richard Plumley Volunteer tester Joined: 21 Jan 02 Posts: 9 Credit: 3,273,909 RAC: 0	Message 1075730 - Posted: 10 Feb 2011, 4:44:38 UTC Sounds like Matt needs a hug, some Tylenol and a couple of cold ones. I could almost feel his headache as I read his post.

[AF>FRANCE]peronik Volunteer tester Joined: 5 Jan 03 Posts: 6 Credit: 700,907 RAC: 0	Message 1075979 - Posted: 11 Feb 2011, 0:15:53 UTC If you don't want any more bugs with the RAID configuration, why not have a look at the ZFS file system? http://en.wikipedia.org/wiki/ZFS
It's a file system that includes the RAID concepts directly, without other software or hardware. This file system is included in Sun-like distros and FreeBSD-like distros.
If you can't switch to FreeBSD, you could try Debian 6.0 "Squeeze" (the latest stable version). Debian 6.0 has two kernels: the standard Linux one and the new kFreeBSD, the kernel of FreeBSD. http://www.debian.org/ports/kfreebsd-gnu/index.en.html
The kfreebsd-amd64 version (the 64-bit one) has ZFS as a standard file system.

FreeNeo Joined: 29 Jul 99 Posts: 1 Credit: 123,153,938 RAC: 0	Message 1077672 - Posted: 15 Feb 2011, 23:40:18 UTC RE: Mysterious Synergy Reboots & U.P.S.
Matt, I think that you may be on the right track. It sounds like the U.P.S. is doing its Semi-Weekly Self Test... I have loads of APC UPSs, and sometimes very sensitive loads combined with possibly "iffy" batteries or a touchy U.P.S. may cause enough of a glitch on the AC line to cause a "Hiccup" on the powered Systems. If it is an APC, you should be able to disable the tests within its configuration interface. I hope that it's this simple!!! Good Luck!!!
BTW, if you don't already have any, you might want to consider getting some A.P.C. Managed P.D.U.s (model AP7930 or AP7900). That way, you'll be able to hard Power off and Re-power any load from a Browser (or Telnet) within your LAN or VPN. I have had many of these for over 10 years with no problems or failures whatsoever on all kinds of "Mission Critical" Equipment in my 24/7 Broadcast Plant...
Best to all of you!
Barry R. Clark Director of Engineering, Information Technologies and Communications. Taxi Productions Inc. KJLH Radio Los Angeles 161 North La Brea Avenue Inglewood, CA 90301-1707 (310) 330-2222 Office. mailto:bclark@kjlhradio.com www.radiofree.fm www.kjlh.fm You can follow RadioFree 102.3 on Twitter @RadioFreeKJLH...

kittyman Volunteer tester Joined: 9 Jul 00 Posts: 51492 Credit: 1,018,363,574 RAC: 1,004	Message 1077924 - Posted: 16 Feb 2011, 15:57:21 UTC - in response to Message 1077672. Very nice and informative post, sir. Might consider tagging along with the Avenue, 91.1... Rather new station... couple of years or so. Independent, non-profit. Play some gawd awful good tunes.
"Time is simply the mechanism that keeps everything from happening all at once."


SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.