Looking for the New Sound (Feb 26 2008)

Message boards : Technical News : Looking for the New Sound (Feb 26 2008)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 718843 - Posted: 27 Feb 2008, 0:09:25 UTC

Let's see.. it's been a bit since I last wrote. I've been mostly working on code to pull pulses out of the database, which uncovered a couple of general minor bugs that had to be fixed. The pulses were successfully dumped and handed off to Josh to find good candidates for initial Astropulse analysis.

Not much going on over the weekend but the science database server (thumper) is not performing. Jeff and I scanned all kinds of data during different tests and we're convinced it's the RAID configuration more than anything else. We're going to have to reconfigure all the file systems on that at some point. Painful, but we may be able to do it piece by piece without too much disruption.

Today we actually upgraded the way-out-of-date OS on thumper, which was also a bit painful, but ultimately successful. It should have been up and running by now, but thanks to an 8 Terabyte ext3 filesystem that hasn't been checked in over 180 days, a forced check is running and will probably be running all night. Not sure if we'll implement the secondary server (bambi) in the meantime - it may be too late in the day to attempt that. We'll let the project run as best it can until we run out of work (we'll probably keep a buffer of work just so the recovery later isn't as painful).

Meanwhile, the assimilator queue is growing and growing until we either let it drain, or we reconfigure thumper.

Oh yeah.. bane (one of the download servers) just went kaput. Spent 20 minutes trying to figure out what went wrong with its network. Oh - the cable came out of the switch. Click. Voila!

In good news, Jeff has been hammering on the new router today, and we got over a major hurdle of getting IOS installed on it. Only thing left now is configuration. It might be ready tomorrow!

Buckle your seatbelts.

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 718843
Jesse Viviano
Joined: 27 Feb 00
Posts: 100
Credit: 3,949,583
RAC: 0
United States
Message 718898 - Posted: 27 Feb 2008, 1:24:19 UTC - in response to Message 718843.  

In good news, Jeff has been hammering on the new router today, and we got over a major hurdle of getting IOS installed on it. Only thing left now is configuration. It might be ready tomorrow!


I once questioned why you would have CatOS on a router. Seeing the photo of the router in the switch closet showed me that it is not really a pure router, but a Catalyst 6504-E multilayer switch that can act as a router as well, which means that it was running CatOS on the switch portion of the multilayer switch, and IOS on the router portion of the switch. I wonder if you will use that multilayer switch's switching capabilities. If you used it as a switch and as a router by having your Internet-facing servers plugged directly into it, then you could have less latency between these servers and the SETI@home clients. This could reduce the number of connections they have to have open at once because less latency means that the connections can close sooner.

By the way, on an unrelated note but related to the router, do you have any plans for the big IPv6 switchover whenever we finally run out of IPv4 addresses?
ID: 718898
DJStarfox
Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 718918 - Posted: 27 Feb 2008, 2:24:06 UTC - in response to Message 718843.  
Last modified: 27 Feb 2008, 2:38:51 UTC

Not much going on over the weekend but the science database server (thumper) is not performing. Jeff and I scanned all kinds of data during different tests and we're convinced it's the RAID configuration more than anything else. We're going to have to reconfigure all the file systems on that at some point. Painful, but we may be able to do it piece by piece without too much disruption.

You could flip Bambi as primary, reconfigure the drives, and then replicate the database again to Thumper. Hope you have a backup just before doing this. :)

Bambi seems to do well, right? What's her RAID configuration? Can you do the same with Thumper? Do the database logs have their own array now?

Also, if it's running on any flavor of Linux, I've heard it helps to put elevator=deadline in the kernel parameters upon boot. If it's SunOS, then never mind.
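
For anyone who wants to see which I/O scheduler a Linux box is actually using before touching the boot parameters, here is a minimal sketch. It assumes a 2.6-era or later kernel that exposes /sys/block/<device>/queue/scheduler, where the active scheduler is the name in square brackets. The helper name active_scheduler is made up for illustration, and the scheduler names you see will vary by kernel version.

```shell
#!/bin/sh
# Hypothetical helper: pull the active scheduler out of a line in the
# sysfs format, e.g. "noop anticipatory deadline [cfq]" -> "cfq".
# Prints nothing if the line has no bracketed entry.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Report the scheduler for every block device that exposes one.
# Devices without a readable scheduler file are skipped.
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    dev=${f#/sys/block/}
    dev=${dev%%/*}
    printf '%s: %s\n' "$dev" "$(active_scheduler "$(cat "$f")")"
done
```

On kernels that support it, writing a scheduler name into that same file as root switches it at runtime, which may be a cheaper way to try deadline than a reboot.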
ID: 718918
David
Volunteer tester
Joined: 19 May 99
Posts: 411
Credit: 1,426,457
RAC: 0
Australia
Message 718998 - Posted: 27 Feb 2008, 7:16:12 UTC - in response to Message 718843.  

Oh yeah.. bane (one of the download servers) just went kaput. Spent 20 minutes trying to figure out what went wrong with its network. Oh - the cable came out of the switch. Click. Voila!


Thanks for the updates, Matt. Next time don't forget to look for the easy fixes first, lol.
ID: 718998
speedimic
Volunteer tester
Joined: 28 Sep 02
Posts: 362
Credit: 16,590,653
RAC: 0
Germany
Message 719087 - Posted: 27 Feb 2008, 14:36:00 UTC

Maybe now is the time to start the FS discussion (--> Sleepless in Oakland) if you reconfigure the file systems anyway.



Not much going on over the weekend but the science database server (thumper) is not performing. Jeff and I scanned all kinds of data during different tests and we're convinced it's the RAID configuration more than anything else. We're going to have to reconfigure all the file systems on that at some point. Painful, but we may be able to do it piece by piece without too much disruption.

Today we actually upgraded the way-out-of-date OS on thumper, which was also a bit painful, but ultimately successful. It should have been up and running by now, but thanks to an 8 Terabyte ext3 filesystem that hasn't been checked in over 180 days, a forced check is running and will probably be running all night. Not sure if we'll implement the secondary server (bambi) in the meantime - it may be too late in the day to attempt that. We'll let the project run as best it can until we run out of work (we'll probably keep a buffer of work just so the recovery later isn't as painful).

mic.


ID: 719087
DJStarfox
Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 719123 - Posted: 27 Feb 2008, 16:55:01 UTC - in response to Message 719087.  

Maybe now is the time to start the FS discussion (--> Sleepless in Oakland) if you reconfigure the file systems anyway.


You're not going to convince him to get rid of ext3. However, for a database, ext2 makes a little more sense. There's no reason to have a filesystem journal when the database has its own journaling mechanism (log files). That may be a performance boost (except for running fsck -f). Using any other filesystem brings greater risk. Of course, they had better go RAID 10 or else. :)
ID: 719123
sjf
Volunteer tester
Joined: 17 Aug 99
Posts: 5
Credit: 10,617,892
RAC: 0
United States
Message 719127 - Posted: 27 Feb 2008, 17:12:07 UTC

You can just disable periodic checking, assuming you trust your storage hardware. Check out man tune2fs ... or:
for i in $(mount -t ext3 | awk '{print $1}'); do tune2fs -c 0 -i 0 "$i"; done

A lot of better storage hardware can validate media and parity data on the fly.
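
Before changing anything, it may help to see what a filesystem is currently set to. A minimal sketch, assuming e2fsprogs is installed: tune2fs -l prints the superblock fields, and the filter below keeps only the ones that govern periodic checks. The helper name check_settings and the device /dev/sdX1 are placeholders, not anything from the SETI@home servers.

```shell
#!/bin/sh
# Hypothetical helper: filter `tune2fs -l` output (read from stdin) down
# to the fields that control mount-count and time-based forced checks.
check_settings() {
    grep -E '^(Mount count|Maximum mount count|Last checked|Check interval):'
}

# Usage (needs read access to the device; /dev/sdX1 is a placeholder):
#   tune2fs -l /dev/sdX1 | check_settings
```

A check interval of 15552000 seconds is the 180 days Matt ran into; a value of 0 there means time-based checks are off.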
ID: 719127
KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 719144 - Posted: 27 Feb 2008, 17:47:24 UTC - in response to Message 719127.  

You can just disable periodic checking, assuming you trust your storage hardware. Check out man tune2fs ... or:
for i in $(mount -t ext3 | awk '{print $1}'); do tune2fs -c 0 -i 0 "$i"; done

A lot of better storage hardware can validate media and parity data on the fly.


Given the number of disk failures in the recent past, I doubt that the Berkeley staff "trusts their storage hardware"!

.

Hello, from Albany, CA!...
ID: 719144
speedimic
Volunteer tester
Joined: 28 Sep 02
Posts: 362
Credit: 16,590,653
RAC: 0
Germany
Message 719148 - Posted: 27 Feb 2008, 17:54:40 UTC

From what I read, that 8 TB partition is for storing WUs/Results - not for the database.

You're not going to convince him to get rid of ext3. However, for a database, ext2 makes a little more sense. There's no reason to have a filesystem journal when the database has its own journaling mechanism (log files). That may be a performance boost (except for running fsck -f). Using any other filesystem brings greater risk. Of course, they had better go RAID 10 or else. :)

mic.


ID: 719148
Neil Walker
Volunteer tester
Joined: 23 May 99
Posts: 288
Credit: 18,101,056
RAC: 0
United Kingdom
Message 719239 - Posted: 27 Feb 2008, 21:49:25 UTC - in response to Message 719123.  

Of course, they had better go RAID 10 or else. :)


Either you are kidding or you don't know what you are talking about. :P RAID 10 is shorthand for RAID 1 + 0. AFAIK, The S@H team have always used RAID 5. That is the minimum for an application of this kind.



Be lucky

Neil



ID: 719239
DJStarfox
Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 719362 - Posted: 28 Feb 2008, 5:07:40 UTC - in response to Message 719239.  

Of course, they had better go RAID 10 or else. :)


Either you are kidding or you don't know what you are talking about. :P RAID 10 is shorthand for RAID 1 + 0. AFAIK, The S@H team have always used RAID 5. That is the minimum for an application of this kind.


I was kidding about the "they'd better...or else" part.

However, I am not as much kidding about RAID 1+0. Especially with database loads, RAID 10 is superior in performance and redundancy.

See this for an explanation if you need it:
http://www.bytepile.com/raid_class.php
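
To put a rough number on the performance side, here is a small sketch using the textbook write penalties: each small random write costs 2 back-end I/Os on RAID 10 (one per mirror) and 4 on RAID 5 (read data, read parity, write data, write parity). It ignores controller caches and full-stripe writes, the function name backend_ios is hypothetical, and the 1000-write workload is invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper: back-end disk I/Os needed for a given number of
# small random writes, using the textbook write penalty per RAID level.
backend_ios() {
    level=$1
    writes=$2
    case $level in
        10) echo $((writes * 2)) ;;   # write to both halves of the mirror
        5)  echo $((writes * 4)) ;;   # read data+parity, write data+parity
        *)  echo "unsupported RAID level: $level" >&2; return 1 ;;
    esac
}

backend_ios 10 1000   # prints 2000
backend_ios 5 1000    # prints 4000
```

On those assumptions the same write-heavy database load generates about twice the disk traffic on RAID 5.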
ID: 719362
Neil Walker
Volunteer tester
Joined: 23 May 99
Posts: 288
Credit: 18,101,056
RAC: 0
United Kingdom
Message 719410 - Posted: 28 Feb 2008, 8:29:43 UTC - in response to Message 719362.  

See this for an explanation if you need it:
http://www.bytepile.com/raid_class.php


I don't. ;) Maybe you should read it again in the context of the needs of S@H and the resources available. ;)



Be lucky

Neil



ID: 719410
DJStarfox
Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 719478 - Posted: 28 Feb 2008, 15:01:57 UTC - in response to Message 719410.  

I don't. ;) Maybe you should read it again in the context of the needs of S@H and the resources available. ;)


I did read that with SETI in mind. Unless there's an unsolvable reason why they can't make a RAID10, then that is what I recommend. For all the non-database servers, RAID5 is fine because it allows them lower cost per GB.
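
The cost-per-GB point is easy to work out. A minimal sketch, assuming equal-sized disks: RAID 5 gives up one disk's worth of space to parity, RAID 10 gives up half to mirroring. The function name usable_gb and the 8 x 500 GB array are made up for illustration and are not thumper's actual layout.

```shell
#!/bin/sh
# Hypothetical helper: usable capacity in GB for a number of equal disks.
usable_gb() {
    level=$1
    disks=$2
    size_gb=$3
    case $level in
        5)  echo $(( (disks - 1) * size_gb )) ;;   # one disk of parity
        10) echo $(( disks / 2 * size_gb )) ;;     # half the disks are mirrors
        *)  echo "unsupported RAID level: $level" >&2; return 1 ;;
    esac
}

usable_gb 5 8 500    # prints 3500
usable_gb 10 8 500   # prints 2000
```

Same eight disks, but RAID 5 leaves 3500 GB usable against 2000 GB for RAID 10, which is where the lower cost per GB comes from.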

Besides, Matt has already made his decision.
ID: 719478
sjf
Volunteer tester
Joined: 17 Aug 99
Posts: 5
Credit: 10,617,892
RAC: 0
United States
Message 719495 - Posted: 28 Feb 2008, 16:41:39 UTC - in response to Message 719144.  



Given the number of disk failures in the recent past, I doubt that the Berkeley staff "trusts their storage hardware"!


Disk failures are irrelevant if you're using reliable RAID controllers and you're replacing disks in a timely manner. Both of which they seem to be doing.
ID: 719495
KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 719502 - Posted: 28 Feb 2008, 17:30:00 UTC - in response to Message 719495.  
Last modified: 28 Feb 2008, 17:30:51 UTC

You can just disable periodic checking, assuming you trust your storage hardware. Check out man tune2fs ... or:
for i in $(mount -t ext3 | awk '{print $1}'); do tune2fs -c 0 -i 0 "$i"; done

A lot of better storage hardware can validate media and parity data on the fly.


Given the number of disk failures in the recent past, I doubt that the Berkeley staff "trusts their storage hardware"!


Disk failures are irrelevant if you're using reliable RAID controllers and you're replacing disks in a timely manner. Both of which they seem to be doing.


I was replying to the "assuming you trust your storage hardware"... AND IIRC, they've also had some RAID controller failures...

[irony intended] what are you, a member of the Borg? [/irony] ;-)
.

Hello, from Albany, CA!...
ID: 719502



 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.