Unsolved Mysteries (Sep 16 2008)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 808898 - Posted: 16 Sep 2008, 22:25:07 UTC

Another week, another database maintenance outage. This one was short but busy. We actually had major upgrade plans for one server, but feared this would take all day and lock out the servers, so we postponed it until next week, which may be less stressful.

Eric cleared a bunch of space on the workunit storage, so that bottleneck has been alleviated for now, i.e. we have elbow room to create enough workunits to keep up with demand. However, this leads us to the first of two mysteries today. You see, he's moving all the beta workunits to our new homemade NAS box (ptolemy). While this move has already been helpful, it's taking forever to complete. Why are the disks pegged at 100% utilization? Lack of spindles? PCI bus traffic? Old/slow controller cards? RAID5 biting us again? We'll either sort that out or eventually give up on this machine as anything more than archival storage.
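For anyone wanting to watch that "100% utilization" figure directly, here is a minimal sketch, assuming a Linux host that exposes /proc/diskstats (the post does not say which OS ptolemy runs). The function names are made up for illustration. It compares two snapshots of the per-device "time spent doing I/O" counter and reports how much of the interval each device was busy. It only measures the symptom; it cannot say whether spindles, bus, controller, or RAID5 is to blame.

```python
def parse_io_ticks(diskstats_text):
    """Map device name -> cumulative milliseconds spent doing I/O."""
    ticks = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) < 13:
            continue  # skip short or malformed lines
        # fields[2] is the device name, fields[12] is "time spent doing I/Os (ms)"
        ticks[fields[2]] = int(fields[12])
    return ticks


def utilization_percent(before, after, interval_seconds):
    """Percent of the interval each device was busy, capped at 100."""
    interval_ms = interval_seconds * 1000.0
    return {
        name: min(100.0, 100.0 * (after[name] - before[name]) / interval_ms)
        for name in after
        if name in before
    }
```

To use it, read /proc/diskstats, wait a few seconds, read it again, and pass both parsed snapshots plus the wait time to utilization_percent.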

The other mystery has been a known issue for some time, but with the down time we revisited the problem: our secondary science database server, bambi, works great except for the fact that upon reboot there's a random chance one or two (or three) drives simply don't show up on the 3ware controller, causing all kinds of RAID panics/rebuilds. It's never clear why this happens, or when it will happen, and when it does it's not always the same drives that disappear.

However, a full power cycle always works. The only difference really is that the drives have to spin up on power cycle, but not on reboot. So we've been assuming there's some spin-up settings that need to be tweaked. There's been talk of making bambi the primary database server, so today we looked for those settings. Couldn't find them - nothing in the regular motherboard BIOS, and nothing useful in the 3ware BIOS - and the latter was moot because the drives would have already disappeared according to the 3ware BIOS, so all the spin-up problems are happening before the 3ware is aware. I find nothing about this in any documentation or on the web. It's not a showstopper, we can still use bambi as the backup that it is, but this pretty much means we'll never be able to fully trust bambi as a "main" server.

Oh yeah.. other stuff. The mysql replica croaked this morning just before we arrived - a partition on the server filled up. Apparently when upgrading the OS we missed a symlink somewhere. So the replica is resync'ing yet again. We're also messing around getting the CUDA development/testing server up and running.
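A full partition is the kind of thing a tiny check can flag before the replica falls over. This is only a sketch, not what the project actually runs: the paths and the threshold are hypothetical, and the `usage` parameter exists just so the check can be exercised without touching a real filesystem.

```python
import shutil


def percent_used(path, usage=shutil.disk_usage):
    """Return how full the filesystem holding `path` is, in percent."""
    total, used, _free = usage(path)
    return 100.0 * used / total


def partitions_over(paths, threshold_percent, usage=shutil.disk_usage):
    """Return the paths whose filesystems are at or above the threshold."""
    return [p for p in paths if percent_used(p, usage) >= threshold_percent]
```

Run from cron with something like partitions_over(["/", "/var/lib/mysql"], 90) (hypothetical paths) and send mail if the list is not empty.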

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 808898
ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 20443
Credit: 7,508,002
RAC: 20
United Kingdom
Message 808902 - Posted: 16 Sep 2008, 22:46:16 UTC - in response to Message 808898.  
Last modified: 16 Sep 2008, 22:46:41 UTC

... a random chance one or two (or three) drives simply don't show up on the 3ware controller, causing all kinds of RAID panics/rebuilds. It's never clear why this happens, or when it will happen, and when it does it's not always the same drives that disappear.

However, a full power cycle always works. The only difference really is that the drives have to spin up on power cycle, but not on reboot. So we've been assuming there's some spin-up settings that need to be tweaked...

Err... Sorry but does not compute...

"A full power cycle always works..." That means drives spin-up, so why the spin-up settings search?

Or if you mean the random failures are on power-up (and spin-up), then...

PSU overload during spin-up?
Are the spin-ups staggered?
Can the drives themselves be set to delay/stagger their spin-up? (There's an extra SATA pin for just that.)


"Random" suggests hardware or timing races...

Good luck,

Regards,
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 808902
Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 808908 - Posted: 16 Sep 2008, 23:00:23 UTC - in response to Message 808902.  

"A full power cycle always works..." That means drives spin-up, so why the spin-up settings search?


Power cycle (remove all power from system, then turn back on): no problem.

Reboot (type "reboot" or ctrl-alt-del, etc.): problem.

Drives go through full on spin up during power cycle (i.e. staggered, with set delays, etc.).

Drives go through "quick" spin up during reboot, and there is seemingly no way to change that.

This is the only difference as far as we can tell, and this is why drive spin-up issues are the leading suspect.

- Matt
ID: 808908
Urs Echternacht
Volunteer tester
Joined: 15 May 99
Posts: 692
Credit: 135,197,781
RAC: 211
Germany
Message 808912 - Posted: 16 Sep 2008, 23:15:48 UTC
Last modified: 16 Sep 2008, 23:16:45 UTC

Matt, of course you have checked that none of the drives have noise reduction active (if that is a feature of those HDDs)? (...just a wild guess...)
_\|/_
U r s
ID: 808912
Gary Charpentier
Volunteer tester
Joined: 25 Dec 00
Posts: 30720
Credit: 53,134,872
RAC: 32
United States
Message 808924 - Posted: 16 Sep 2008, 23:45:46 UTC - in response to Message 808908.  

"A full power cycle always works..." That means drives spin-up, so why the spin-up settings search?


Power cycle (remove all power from system, then turn back on): no problem.

Reboot (type "reboot" or ctrl-alt-del, etc.): problem.

Drives go through full on spin up during power cycle (i.e. staggered, with set delays, etc.).

Drives go through "quick" spin up during reboot, and there is seemingly no way to change that.

This is the only difference as far as we can tell, and this is why drive spin-up issues are the leading suspect.

- Matt


Ah, I bet what is going on in the reboot is some of the drives have gone to sleep. When they get the reboot they all spin up at once and the power supply can't keep up. Might check the sleep settings and disable.

Gary

ID: 808924
zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65825
Credit: 55,293,173
RAC: 49
United States
Message 808966 - Posted: 17 Sep 2008, 1:51:06 UTC - in response to Message 808908.  

"A full power cycle always works..." That means drives spin-up, so why the spin-up settings search?


Power cycle (remove all power from system, then turn back on): no problem.

Reboot (type "reboot" or ctrl-alt-del, etc.): problem.

Drives go through full on spin up during power cycle (i.e. staggered, with set delays, etc.).

Drives go through "quick" spin up during reboot, and there is seemingly no way to change that.

This is the only difference as far as we can tell, and this is why drive spin-up issues are the leading suspect.

- Matt

Are you sure the delay for the HDDs isn't too tight? I mean, if it's set at 5 seconds or so, this could be the problem. To fix it I'd change that BIOS setting to something bigger. Good luck.
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 808966
DJStarfox
Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 809003 - Posted: 17 Sep 2008, 3:43:56 UTC - in response to Message 808898.  
Last modified: 17 Sep 2008, 3:49:24 UTC

What is the model number of your 3ware controller in Bambi? Is it PCI, PCI-X, or PCI-Express? How many of those cards are installed?

I read about 3ware PCI-X cards having a frequency setting issue on warm boots.... And if your motherboard doesn't do a full PCI bus reset on reboot, this could be the issue. Does pressing the reset button on the server make a difference on this issue?

Also, has anyone ever tried using PostgreSQL in place of MySQL for BOINC? I wonder if there would be any benefit and what the time/effort cost would be.
ID: 809003
KyleFL
Joined: 20 May 99
Posts: 17
Credit: 2,332,249
RAC: 0
Germany
Message 809075 - Posted: 17 Sep 2008, 6:05:44 UTC

I agree.
It doesn't seem to be a problem with the drives, but with the RAID controller.
Or maybe the combination of controller and HDDs is just unlucky.

We had some serious trouble with an Intel RAID controller in our domain controller. It lost drives randomly and crashed our system. After switching to another controller everything went back to stable, and the server has been running without problems since.


*snip*
Why are the disks pegged at 100% utilization? Lack of spindles? PCI bus traffic? Old/slow controller cards? RAID5 biting us again? We'll either sort that out or eventually give up on this machine as anything more than archival storage.
*snip*

Probably the controller card. RAID5 demands a lot of crunching power for the parity data. If the controller is relying on the CPU (most cheap controllers do that), then the CPU gets really busy.
Do you have a virus scanner running on that box? That could be a reason, too, because the scanner would check every file that is copied onto its HDDs. I had this issue on a PIII 1 GHz - after killing Norton AV my transfer speeds went up to ~40MB/s -- before, I couldn't get over 10MB/s.
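To make the "crunching power for the parity data" point concrete, here is a minimal sketch of the standard RAID5 idea: the parity block is the byte-wise XOR of the data blocks in a stripe, so any one lost block can be rebuilt by XOR-ing the survivors with the parity. The block contents are made-up toy values, and a real controller works on much larger blocks with rotating parity, which this does not model.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three toy data blocks in one stripe (illustrative values only).
data = [b"\x0f\xf0", b"\x33\x33", b"\x55\xaa"]
parity = xor_blocks(data)

# Pretend the drive holding data[1] vanished: rebuild it from the rest.
rebuilt = xor_blocks([data[0], data[2], parity])
```

Every write has to keep that parity up to date, which is the extra work that lands on the CPU when the card does not do it in hardware.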


Regards, KyleFL

ID: 809075
KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 809175 - Posted: 17 Sep 2008, 16:33:52 UTC

Matt,

The "signals already found" portion of the Science Status page has been frozen since at least last Friday...

Give the script a kick when ya gotta chance, OK?
.

Hello, from Albany, CA!...
ID: 809175
Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 809176 - Posted: 17 Sep 2008, 16:39:07 UTC

Thanks for all the tips/suggestions. There are some helpful thoughts in there though some pertain to our situation more than others - I don't have the time to address them individually.

Bear in mind, given the size of our "sysadmin team" (me, Jeff, Eric, and Bob) and the amount of time we all get to spend on sysadmin adds up to about one and a half FTE's (tops) running the databases, websites, and the entire backend of one of the world's biggest supercomputing projects, more often than not our best answer to any such mystery is to move on to something more important - maybe we'll figure it out later if we *really* need to. Frustrating, but we just don't have the manpower.

- Matt
ID: 809176
PhonAcq
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 809182 - Posted: 17 Sep 2008, 17:11:05 UTC

I think we all appreciate your points below, and your service, Matt.
ID: 809182
Dr. C.E.T.I.
Joined: 29 Feb 00
Posts: 16019
Credit: 794,685
RAC: 0
United States
Message 809254 - Posted: 17 Sep 2008, 23:02:16 UTC


. . . Indeed that is true PA - and Matt, Jeff, Eric & Bob Deserve Accolades for All that they do @ Berkeley - Thanks 'TeAm'

and Thanks for the Post Updates Matt . . .




ID: 809254
DJStarfox
Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 809329 - Posted: 18 Sep 2008, 1:02:18 UTC - in response to Message 809176.  

Thanks for all the tips/suggestions. There are some helpful thoughts in there though some pertain to our situation more than others - I don't have the time to address them individually.

Bear in mind, given the size of our "sysadmin team" (me, Jeff, Eric, and Bob) and the amount of time we all get to spend on sysadmin adds up to about one and a half FTE's (tops) running the databases, websites, and the entire backend of one of the world's biggest supercomputing projects, more often than not our best answer to any such mystery is to move on to something more important - maybe we'll figure it out later if we *really* need to. Frustrating, but we just don't have the manpower.

- Matt


OK, if your 3ware RAID is PCI-X, someone on another forum said they fixed the bus speed to 133MHz with a jumper on the motherboard. This solved the warm boot issue. If you can locate the motherboard manual for Bambi, you could determine pretty quickly whether this may help.

If it's PCI-express or PCI, you're batting in the dark with BIOS settings and firmware updates....

I know you're too busy to respond, but I wish you luck in getting this solved!
ID: 809329
Bad Spartan
Volunteer tester
Joined: 14 Mar 03
Posts: 6
Credit: 47,663,261
RAC: 117
Puerto Rico
Message 809350 - Posted: 18 Sep 2008, 2:22:30 UTC

Have you verified any compatibility issues between the hard drive manufacturer and the RAID card? It could be a firmware issue between them...
ID: 809350
Neil Blaikie
Volunteer tester
Joined: 17 May 99
Posts: 143
Credit: 6,652,341
RAC: 0
Canada
Message 809357 - Posted: 18 Sep 2008, 2:52:26 UTC

Again, I know you guys don't have the manpower to be able to tackle such an issue "head-on" so to speak. I do however agree with the other comments posted that it is more than likely a raid controller problem / settings issue.

As mentioned as well it could be a hard drive compatibility issue with the raid controller, have had that problem several times and a simple new raid card has nearly always solved the problem (cheaper than replacing a massive amount of drives for sure!)

Might be worth getting an electro-mechanical student (or you guys) to do a voltage test on the PSUs during an outage and see if they could be causing the problem. It is feasible that a PSU overload spike might cause the drives to spin up at different speeds during a power / reboot cycle. There is also the possibility that you have a "dirty" power supply to the server room. Servers are very sensitive to even small changes in supply voltage. 3.2 volts in a system I fixed recently was enough to "lose" 6 out of 48 drives during a power cycle.

You guys do a fantastic job with the limited resources you have and all of you work very hard to keep "joe public" happy and content.

Thanks for the info as always and hope it gets sorted soon.
ID: 809357
KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 809514 - Posted: 18 Sep 2008, 16:56:13 UTC - in response to Message 809357.  

Again, I know you guys don't have the manpower to be able to tackle such an issue "head-on" so to speak. I do however agree with the other comments posted that it is more than likely a raid controller problem / settings issue.

As mentioned as well it could be a hard drive compatibility issue with the raid controller, have had that problem several times and a simple new raid card has nearly always solved the problem (cheaper than replacing a massive amount of drives for sure!)

Might be worth getting an electro-mechanical student (or you guys) to do a voltage test on the PSUs during an outage and see if they could be causing the problem. It is feasible that a PSU overload spike might cause the drives to spin up at different speeds during a power / reboot cycle. There is also the possibility that you have a "dirty" power supply to the server room. Servers are very sensitive to even small changes in supply voltage. 3.2 volts in a system I fixed recently was enough to "lose" 6 out of 48 drives during a power cycle.

You guys do a fantastic job with the limited resources you have and all of you work very hard to keep "Joe public" happy and content.

Thanks for the info as always and hope it gets sorted soon.


As most, if not all, critical systems at the lab are on UPS, I doubt that voltage would be an issue - but a UPS could have developed a problem and be a few volts low...
.

Hello, from Albany, CA!...
ID: 809514
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 809563 - Posted: 18 Sep 2008, 21:25:12 UTC - in response to Message 809514.  

Again, I know you guys don't have the manpower to be able to tackle such an issue "head-on" so to speak. I do however agree with the other comments posted that it is more than likely a raid controller problem / settings issue.

As mentioned as well it could be a hard drive compatibility issue with the raid controller, have had that problem several times and a simple new raid card has nearly always solved the problem (cheaper than replacing a massive amount of drives for sure!)

Might be worth getting an electro-mechanical student (or you guys) to do a voltage test on the PSUs during an outage and see if they could be causing the problem. It is feasible that a PSU overload spike might cause the drives to spin up at different speeds during a power / reboot cycle. There is also the possibility that you have a "dirty" power supply to the server room. Servers are very sensitive to even small changes in supply voltage. 3.2 volts in a system I fixed recently was enough to "lose" 6 out of 48 drives during a power cycle.

You guys do a fantastic job with the limited resources you have and all of you work very hard to keep "Joe public" happy and content.

Thanks for the info as always and hope it gets sorted soon.


As most, if not all, critical systems at the lab are on UPS, I doubt that voltage would be an issue - but a UPS could have developed a problem and be a few volts low...

Remember that the UPS converts the battery voltage to AC and then the PC power supply converts the AC to the relevant voltages for the motherboard.

A power supply could be failing even if the line voltage is good.

But don't forget: Matt says the drives always seem to come up if you power cycle the system, but not if you reboot without cycling power.

Drives need the most power on start-up because they aren't spinning -- it takes more torque (and more energy) to bring a drive up to speed than it does to keep it spinning at 7200 (or 10,000, or whatever).

That maximum load happens right after the system is powered on.

On a reboot, the drives should already be spinning, and there was enough power just before rebooting, but not after?

Doesn't make sense.

Same with the staggered start-up settings. It'd make sense if they were having trouble on power up but not on a reboot -- not the other way 'round.

That said, I don't have any other ideas. Would fiddling with the power up settings in the RAID card BIOS help? Don't think so, but if I was there, I'd sure give it a try -- it's easy to try and you might be pleasantly surprised. If not, you only wasted a few minutes.

Luck is often better than skill.

-- Ned
ID: 809563
Gary Charpentier
Volunteer tester
Joined: 25 Dec 00
Posts: 30720
Credit: 53,134,872
RAC: 32
United States
Message 809568 - Posted: 18 Sep 2008, 22:01:58 UTC - in response to Message 809563.  

Again, I know you guys don't have the manpower to be able to tackle such an issue "head-on" so to speak. I do however agree with the other comments posted that it is more than likely a raid controller problem / settings issue.

As mentioned as well it could be a hard drive compatibility issue with the raid controller, have had that problem several times and a simple new raid card has nearly always solved the problem (cheaper than replacing a massive amount of drives for sure!)

Might be worth getting an electro-mechanical student (or you guys) to do a voltage test on the PSUs during an outage and see if they could be causing the problem. It is feasible that a PSU overload spike might cause the drives to spin up at different speeds during a power / reboot cycle. There is also the possibility that you have a "dirty" power supply to the server room. Servers are very sensitive to even small changes in supply voltage. 3.2 volts in a system I fixed recently was enough to "lose" 6 out of 48 drives during a power cycle.

You guys do a fantastic job with the limited resources you have and all of you work very hard to keep "Joe public" happy and content.

Thanks for the info as always and hope it gets sorted soon.


As most, if not all, critical systems at the lab are on UPS, I doubt that voltage would be an issue - but a UPS could have developed a problem and be a few volts low...

Remember that the UPS converts the battery voltage to AC and then the PC power supply converts the AC to the relevant voltages for the motherboard.

A power supply could be failing even if the line voltage is good.

But don't forget: Matt says the drives always seem to come up if you power cycle the system, but not if you reboot without cycling power.

Drives need the most power on start-up because they aren't spinning -- it takes more torque (and more energy) to bring a drive up to speed than it does to keep it spinning at 7200 (or 10,000, or whatever).

That maximum load happens right after the system is powered on.

On a reboot, the drives should already be spinning, and there was enough power just before rebooting, but not after?

Doesn't make sense.

Same with the staggered start-up settings. It'd make sense if they were having trouble on power up but not on a reboot -- not the other way 'round.

That said, I don't have any other ideas. Would fiddling with the power up settings in the RAID card BIOS help? Don't think so, but if I was there, I'd sure give it a try -- it's easy to try and you might be pleasantly surprised. If not, you only wasted a few minutes.

Luck is often better than skill.

-- Ned


I suspect it is a spin-down problem, not a spin-up problem. The I/O pattern may keep only some of the drives active. The rest may not see a lot of accesses and go to sleep. This isn't a problem in normal access to the array, as only a couple of drives at a time should need to spin up. However, when the controller gets the reset signal it tells all the drives to spin up and doesn't do a staggered start, because it isn't a cold start.

Knowing this you just do a cold start and not a reboot. Not a big issue. Might even replace the reboot command with one giving a warning message to remind the operators of the problem.

If you have time you could make sure the drive sleep is disabled on each drive but that uses more electricity. Last option would be to write a script that accesses a block on each drive a couple of seconds apart and when done does the reboot so all the drives are spun up at reboot time.
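That last option could look roughly like the sketch below. It is only an illustration of the idea, with plenty of assumptions: the function name is made up, the device paths you pass in are whatever your system exposes (behind a hardware RAID card the individual drives may not be visible at all), and a read can be answered from the OS cache without ever reaching the platters. The reboot itself is deliberately left out.

```python
import time


def touch_drives(device_paths, delay_seconds=2.0, block_size=4096, sleep=time.sleep):
    """Read one block from each path in turn, pausing between reads.

    Returns the list of paths that were read, in order.
    """
    touched = []
    for index, path in enumerate(device_paths):
        if index > 0:
            sleep(delay_seconds)  # stagger the wake-ups
        # Unbuffered binary read of the first block of the device or file.
        with open(path, "rb", buffering=0) as device:
            device.read(block_size)
        touched.append(path)
    return touched
```

Call it with the list of drives just before issuing the reboot, so that any sleeping drive is woken one at a time rather than all together.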

Gary


ID: 809568
KyleFL
Joined: 20 May 99
Posts: 17
Credit: 2,332,249
RAC: 0
Germany
Message 809569 - Posted: 18 Sep 2008, 22:03:19 UTC

After having too much trouble with hardware RAID controllers, we configured the drives as software RAID 1 (Windows 2003 Server).

This way, if the controller fails, we won't have any problems moving them to another controller.
For some time we ran a RAID combination of an IDE/SATA and a SCSI drive -- no problems, even though the performance was limited to that of the slower IDE drive.

After that it was really easy to kick the IDE drive out of the RAID in the Computer Manager and put a new SCSI drive into it (6h synchronizing the data and everything was back at 100%).

The performance should be much better than a RAID5 system.


If we ever needed a server with MUCH HDD space like the guys at SETI do, I would definitely go with a software RAID 1 IDE/SATA system. That way you would get 3TB with 4 drives (using the new Seagate 1.5TB drives) on a run-of-the-mill mainboard with a RAID 1 setting.
Adding an additional cheap (~20€) SATA controller would add another 3TB - that way a server with ~12TB of RAID 1 HDD space should be possible at really low cost (the most expensive part would be the W2k3 license - maybe MS will give a special discount to SETI :)

Hardware: ~1200€ for System (Intel Q6600 with Board+4GB non ECC memory) + 4HDDs
~700€ for every additional 3TB (4 drives+controller)
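As a rough check of those numbers, here is a small sketch of the arithmetic, assuming plain two-way mirrors (half the raw capacity is usable) and taking the ~1200€ and ~700€ figures above at face value. The function names and defaults are just for illustration.

```python
import math


def raid1_usable_tb(drive_count, drive_tb):
    """Usable capacity when drives are paired into two-way mirrors."""
    return (drive_count // 2) * drive_tb


def estimated_cost_eur(target_tb, base_eur=1200, extra_eur=700, tb_per_group=3.0):
    """Base system covers the first 3TB group; each further group costs extra."""
    groups = math.ceil(target_tb / tb_per_group)
    return base_eur + max(0, groups - 1) * extra_eur
```

With these assumptions, 4 drives of 1.5TB give 3TB usable, and ~12TB works out to the base system plus three more groups, about 3300€ before the OS license.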



Regards, KyleFL


PS: I don't believe it's a power supply related problem in this case.
ID: 809569

©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.