Looking for the New Sound (Feb 26 2008)

Matt Lebofsky Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0	Message 718843 - Posted: 27 Feb 2008, 0:09:25 UTC
Let's see.. it's been a bit since I last wrote. I've been mostly working on code to pull pulses out of the database, which uncovered a couple general minor bugs that had to be fixed. These were successfully dumped and handed off to Josh to find good candidates for initial Astropulse analysis.
Not much going on over the weekend but the science database server (thumper) is not performing. Jeff and I scanned all kinds of data during different tests and we're convinced it's the RAID configuration more than anything else. We're going to have to reconfigure all the file systems on that at some point. Painful, but we may be able to do it piece by piece without too much disruption.
Today we actually upgraded the way-out-of-date OS on thumper, which was also a bit painful, but ultimately successful. It should have been up and running by now, but thanks to an 8 Terabyte ext3 filesystem that hasn't been checked in over 180 days, a forced check is running and will probably be running all night. Not sure if we'll implement the secondary server (bambi) in the meantime - it may be too late in the day to attempt that. We'll let the project run as best it can until we run out of work (we'll probably keep a buffer of work just so the recovery later isn't as painful). Meanwhile, the assimilator queue is growing and growing until we either let it drain, or we reconfigure thumper.
Oh yeah.. bane (one of the download servers) just went kaput. Spent 20 minutes trying to figure out what went wrong with its network. Oh - the cable came out of the switch. Click. Voila!
In good news, Jeff has been hammering on the new router today, and we got over a major hurdle of getting IOS installed on it. Only thing left now is configuration. It might be ready tomorrow! Buckle your seatbelts.
- Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude ID: 718843 ·

Jesse Viviano Joined: 27 Feb 00 Posts: 100 Credit: 3,949,583 RAC: 0	Message 718898 - Posted: 27 Feb 2008, 1:24:19 UTC - in response to Message 718843.
I once questioned why you would have CatOS on a router. Seeing the photo of the router in the switch closet showed me that it is not really a pure router, but a Catalyst 6504-E multilayer switch that can act as a router as well, which means that it was running CatOS on the switch portion of the multilayer switch, and IOS on the router portion of the switch. I wonder if you will use that multilayer switch's switching capabilities. If you used it as both a switch and a router by having your Internet-facing servers plugged directly into it, then you could have less latency between these servers and the SETI@home clients. This could reduce the number of connections they have to have open at once because less latency means that the connections can close sooner. By the way, on a separate note that still concerns the router, do you have any plans for the big IPv6 switchover whenever we finally run out of IPv4 addresses? ID: 718898 ·

DJStarfox Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2	Message 718918 - Posted: 27 Feb 2008, 2:24:06 UTC - in response to Message 718843. Last modified: 27 Feb 2008, 2:38:51 UTC
[quote]Not much going on over the weekend but the science database server (thumper) is not performing. Jeff and I scanned all kinds of data during different tests and we're convinced it's the RAID configuration more than anything else. We're going to have to reconfigure all the file systems on that at some point. Painful, but we may be able to do it piece by piece without too much disruption.[/quote]
You could flip Bambi to primary, reconfigure the drives, and then replicate the database again to Thumper. Hope you have a backup just before doing this. :) Bambi seems to do well, right? What's her RAID configuration? Can you do the same with Thumper? Do the database logs have their own array now? Also, if it's running on any flavor of Linux, I've heard it helps to put elevator=deadline in the kernel parameters at boot. If it's SunOS, then never mind. ID: 718918 ·
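Before changing the elevator setting it helps to see which I/O scheduler a disk is using now. A minimal shell sketch, assuming a Linux kernel that exposes /sys/block/<dev>/queue/scheduler and marks the active scheduler with square brackets. The device name sda and the helper name active_scheduler are placeholders for illustration, not details of thumper's actual setup.

```shell
#!/bin/sh
# Print the active I/O scheduler from a sysfs scheduler line.
# The kernel marks the active one with square brackets,
# e.g. "noop anticipatory deadline [cfq]".
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Hypothetical device name; adjust for the real disk.
dev=sda
sched_file="/sys/block/$dev/queue/scheduler"
if [ -r "$sched_file" ]; then
    echo "$dev: $(active_scheduler "$(cat "$sched_file")")"
fi
```

The elevator= boot parameter sets the default for every disk. Checking per device first shows whether a given array is already on deadline.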

David Volunteer tester Joined: 19 May 99 Posts: 411 Credit: 1,426,457 RAC: 0	Message 718998 - Posted: 27 Feb 2008, 7:16:12 UTC - in response to Message 718843.
[quote]Oh yeah.. bane (one of the download servers) just went kaput. Spent 20 minutes trying to figure out what went wrong with its network. Oh - the cable came out of the switch. Click. Voila![/quote]
Thanks for the updates, Matt. Next time don't forget to look for the easy fixes first lol ID: 718998 ·

speedimic Volunteer tester Joined: 28 Sep 02 Posts: 362 Credit: 16,590,653 RAC: 0	Message 719087 - Posted: 27 Feb 2008, 14:36:00 UTC
Maybe now is the time to start the FS discussion (--> Sleepless in Oakland) if you reconfigure the file systems anyway.
[quote]Not much going on over the weekend but the science database server (thumper) is not performing. Jeff and I scanned all kinds of data during different tests and we're convinced it's the RAID configuration more than anything else. We're going to have to reconfigure all the file systems on that at some point. Painful, but we may be able to do it piece by piece without too much disruption. Today we actually upgraded the way-out-of-date OS on thumper, which was also a bit painful, but ultimately successful. It should have been up and running by now, but thanks to an 8 Terabyte ext3 filesystem that hasn't been checked in over 180 days, a forced check is running and will probably be running all night. Not sure if we'll implement the secondary server (bambi) in the meantime - it may be too late in the day to attempt that. We'll let the project run as best it can until we run out of work (we'll probably keep a buffer of work just so the recovery later isn't as painful).[/quote]
mic. ID: 719087 ·

DJStarfox Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2	Message 719123 - Posted: 27 Feb 2008, 16:55:01 UTC - in response to Message 719087.
[quote]Maybe now is the time to start th FS - discussion (--> Sleepless in Oakland) if you reconfigure the file systems anyway.[/quote]
You're not going to convince him to get rid of ext3. However, for a database, ext2 makes a little more sense. There's no reason to have a filesystem journal when the database has its own journaling mechanism (log files). That may be a performance boost (except for running fsck -f). Using any other filesystem brings greater risk. Of course, they had better go RAID 10 or else. :) ID: 719123 ·

sjf Volunteer tester Joined: 17 Aug 99 Posts: 5 Credit: 10,617,892 RAC: 0	Message 719127 - Posted: 27 Feb 2008, 17:12:07 UTC
You can just disable periodic checking, assuming you trust your storage hardware. Check out man tune2fs ... or:
for i in `mount -t ext3 | awk '{print $1}'`; do tune2fs -C 0 -i 0 $i; done
A lot of better storage hardware can validate media and parity data on the fly. ID: 719127 ·
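A dry-run variant of the loop above, as a sketch rather than a tested procedure for thumper. It prints the tune2fs commands instead of running them, so they can be reviewed first. It assumes Linux-style `mount` output ("device on mountpoint type fstype (options)") and mount points without spaces; plan_disable_checks is a made-up helper name. It uses lowercase -c, which sets the maximum mount count (0 turns the mount-count check off). The uppercase -C in the loop above sets the current mount count, so it resets the counter rather than disabling that check.

```shell
#!/bin/sh
# Read `mount`-style lines on stdin and print (not run) one tune2fs
# command per ext3 device. Linux mount prints lines like:
#   /dev/sda1 on /data type ext3 (rw)
plan_disable_checks() {
    # -c 0: no check based on mount count; -i 0: no check based on elapsed time
    awk '$5 == "ext3" { print "tune2fs -c 0 -i 0 " $1 }'
}

# Dry run against the live mount table; review the output,
# then pipe it to `sh` as root to apply.
mount | plan_disable_checks
```

Running `tune2fs -l` on a device afterwards shows the resulting "Maximum mount count" and "Check interval" values.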

KWSN THE Holy Hand Grenade! Volunteer tester Joined: 20 Dec 05 Posts: 3187 Credit: 57,163,290 RAC: 0	Message 719144 - Posted: 27 Feb 2008, 17:47:24 UTC - in response to Message 719127.
[quote]You can just disable periodic checking, assuming you trust your storage hardware. Check out man tune2fs ... or: for i in `mount -t ext3 | awk '{print $1}'`; do tune2fs -C 0 -i 0 $i; done A lot of better storage hardware can validate media and parity data on the fly.[/quote]
Given the number of disk failures in the recent past, I doubt that the Berkeley staff "trusts their storage hardware"!
Hello, from Albany, CA!... ID: 719144 ·

speedimic Volunteer tester Joined: 28 Sep 02 Posts: 362 Credit: 16,590,653 RAC: 0	Message 719148 - Posted: 27 Feb 2008, 17:54:40 UTC
From what I read, that 8 TB partition is for storing WUs/Results - not for the database.
[quote]You're not going to convince him to get rid of ext3. However, for a database, ext2 makes a little more sense. There's no reason to have a filesystem journal when the database has its own journaling mechanism (log files). That may be a performance boost (except for running fsck -f). Using any other filesystem brings greater risk. Of course, they had better go RAID 10 or else. :)[/quote]
mic. ID: 719148 ·

Neil Walker Volunteer tester Joined: 23 May 99 Posts: 288 Credit: 18,101,056 RAC: 0	Message 719239 - Posted: 27 Feb 2008, 21:49:25 UTC - in response to Message 719123.
[quote]Of course, they had better go RAID 10 or else. :)[/quote]
Either you are kidding or you don't know what you are talking about. :P RAID 10 is shorthand for RAID 1 + 0. AFAIK, The S@H team have always used RAID 5. That is the minimum for an application of this kind.
Be lucky Neil ID: 719239 ·

DJStarfox Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2	Message 719362 - Posted: 28 Feb 2008, 5:07:40 UTC - in response to Message 719239.
[quote][quote]Of course, they had better go RAID 10 or else. :)[/quote] Either you are kidding or you don't know what you are talking about. :P RAID 10 is shorthand for RAID 1 + 0. AFAIK, The S@H team have always used RAID 5. That is the minimum for an application of this kind.[/quote]
I was kidding about the "they'd better...or else" part. However, I am not as much kidding about RAID 1+0. Especially with database loads, RAID 10 is superior in performance and redundancy. See this for an explanation if you need it: http://www.bytepile.com/raid_class.php ID: 719362 ·

Neil Walker Volunteer tester Joined: 23 May 99 Posts: 288 Credit: 18,101,056 RAC: 0	Message 719410 - Posted: 28 Feb 2008, 8:29:43 UTC - in response to Message 719362.
[quote]See this for an explanation if you need it: http://www.bytepile.com/raid_class.php[/quote]
I don't. ;) Maybe you should read it again in the context of the needs of S@H and the resources available. ;)
Be lucky Neil ID: 719410 ·

DJStarfox Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2	Message 719478 - Posted: 28 Feb 2008, 15:01:57 UTC - in response to Message 719410.
[quote]I don't. ;) Maybe you should read it again in the context of the needs of S@H and the resources available. ;)[/quote]
I did read that with SETI in mind. Unless there's an unsolvable reason why they can't build a RAID 10, that is what I recommend. For all the non-database servers, RAID 5 is fine because it gives them a lower cost per GB. Besides, Matt has already made his decision. ID: 719478 ·
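The cost-per-GB trade-off in this exchange comes down to simple arithmetic. A small shell sketch, assuming n identical disks: RAID 5 gives up one disk's worth of space to parity, while RAID 10 mirrors every disk and so gives up half. The 8 x 500 GB array is a made-up example, not the layout of thumper or bambi, and the function names are illustrative.

```shell
#!/bin/sh
# Usable capacity in GB for n identical disks of size_gb each.
# Arguments: $1 = number of disks, $2 = size of one disk in GB.
raid5_usable()  { echo $(( ($1 - 1) * $2 )); }   # one disk's worth of parity
raid10_usable() { echo $(( $1 / 2 * $2 )); }     # every disk has a mirror

# Hypothetical array: 8 disks of 500 GB.
n=8
size_gb=500
echo "RAID 5:  $(raid5_usable "$n" "$size_gb") GB usable"
echo "RAID 10: $(raid10_usable "$n" "$size_gb") GB usable"
```

For the same disks, RAID 5 leaves 3500 GB usable against 2000 GB for RAID 10. That is the capacity side of the argument; it says nothing about write performance or rebuild behaviour under a database load.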

sjf Volunteer tester Joined: 17 Aug 99 Posts: 5 Credit: 10,617,892 RAC: 0	Message 719495 - Posted: 28 Feb 2008, 16:41:39 UTC - in response to Message 719144.
[quote][quote]You can just disable periodic checking, assuming you trust your storage hardware. Check out man tune2fs ... or: for i in `mount -t ext3 | awk '{print $1}'`; do tune2fs -C 0 -i 0 $i; done A lot of better storage hardware can validate media and parity data on the fly.[/quote] Given the number of disk failures in the recent past, I doubt that the Berkeley staff "trusts their storage hardware"![/quote]
Disk failures are irrelevant if you're using reliable RAID controllers and you're replacing disks in a timely manner. Both of which they seem to be doing. ID: 719495 ·

KWSN THE Holy Hand Grenade! Volunteer tester Joined: 20 Dec 05 Posts: 3187 Credit: 57,163,290 RAC: 0	Message 719502 - Posted: 28 Feb 2008, 17:30:00 UTC - in response to Message 719495. Last modified: 28 Feb 2008, 17:30:51 UTC
[quote][quote][quote]You can just disable periodic checking, assuming you trust your storage hardware. Check out man tune2fs ... or: for i in `mount -t ext3 | awk '{print $1}'`; do tune2fs -C 0 -i 0 $i; done A lot of better storage hardware can validate media and parity data on the fly.[/quote] Given the number of disk failures in the recent past, I doubt that the Berkeley staff "trusts their storage hardware"![/quote] Disk failures are irrelevant if you're using reliable RAID controllers and you're replacing disks in a timely manner. Both of which they seem to be doing.[/quote]
I was replying to the "assuming you trust your storage hardware"... AND IIRC, they've also had some RAID controller failures... [irony intended] what are you, a member of the Borg? [/irony] ;-)
Hello, from Albany, CA!... ID: 719502 ·

©2024 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.