The Gates of Delirium (Dec 06 2012)

Message boards : Technical News : The Gates of Delirium (Dec 06 2012)

Jeff Cobb Project Donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 122
Credit: 40,367
RAC: 0
United States
Message 1311865 - Posted: 6 Dec 2012, 18:50:30 UTC

We have recently come out of a painful outage. Last Thursday, 11/29, there was an unexpected power outage at Space Sciences Lab. It lasted some 20 minutes. Eric came over as quickly as he could to shut machines down, but he works in a different building from our machine room, so the UPSes had run out their fairly short on-battery time by the time he got there. It was a perfect storm in that Matt and I (who work a few feet from the machine room) were both out.

Most machines came through OK, but three did not. Lando, an older administrative workhorse (and splitter machine), appears to be dead. We have some spares from which to choose its replacement. More tragic was the fact that the master BOINC database and its replica suffered irreparable corruption. This was an astonishing bit of bad luck. Both machines are on UPS and both machines have battery-backed RAID controllers. One would think that all database logging would have at least made it to the RAID controller, but it obviously did not.

In order to recover the master database, we had to actually delete all of the underlying files and then recreate all of the databases from scratch before recovering from backup. A simple recovery from the backup did not work. After recreating the databases and then recovering from the backup, we ran all of the MySQL binary logs to recover up to a point in time just before the outage. Then we took a fresh backup of the database in case the next step did more harm than good. The next step was to run an extensive table check/repair on all tables in both the production and beta databases. All tables reported OK. Good! We then brought the projects up and used the fresh backup to restore the replica.
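The recovery sequence described above can be written down as an ordered plan. The sketch below only builds and prints the steps; it never runs them. The file names, the stop timestamp, and the exact command lines are illustrative assumptions, not the commands that were actually run, and the steps marked "site-specific" are left as placeholders because the post does not spell them out.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    name: str
    command: str  # illustrative only; this sketch never executes it


def recovery_plan(backup: str, binlogs: List[str], stop_datetime: str) -> List[Step]:
    """Return the recovery steps in the order described in the post."""
    # Replay the binary logs up to a point in time just before the outage.
    replay = 'mysqlbinlog --stop-datetime="{}" {} | mysql'.format(
        stop_datetime, " ".join(binlogs)
    )
    return [
        Step("recreate", "# site-specific: delete the data files, recreate empty databases"),
        Step("restore", "mysql < {}".format(backup)),  # assumes a logical SQL dump
        Step("replay", replay),
        Step("fresh_backup", "mysqldump --all-databases > fresh_backup.sql"),
        Step("check_repair", "mysqlcheck --all-databases --check --auto-repair"),
        Step("restart", "# site-specific: bring the projects back up"),
        Step("replica", "# site-specific: restore the replica from fresh_backup.sql"),
    ]


if __name__ == "__main__":
    plan = recovery_plan(
        backup="backup.sql",  # hypothetical file name
        binlogs=["binlog.000001", "binlog.000002"],  # hypothetical file names
        stop_datetime="2012-11-29 00:00:00",  # placeholder, not the real outage time
    )
    for number, step in enumerate(plan, start=1):
        print("{}. {:<13} {}".format(number, step.name, step.command))
```

The point of the ordering is the one made in the post: the fresh backup is taken after the binary logs are replayed but before the table check/repair, so a repair that does more harm than good can be undone.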

One might ask why we don't have machines automatically shut down in an on-battery situation. A good question with a lot of history. To make a long story short, our server complex has enough cross-dependencies that if machines come down in the "wrong" order, other machines can hang. Plus, some of our old UPSes would hiccup and cause a spurious shutdown (I'm not sure if our current crop have this problem). This was enough of a headache that we went with a very simple design. Our database machines would have battery-backed RAID and be on UPS with no automatic shutdown. The theory was that the UPS would hold the machines for the duration of very short (one or two minute) power outages and, beyond that, the RAID controllers would save any pending IO. This very simple design has served us well but, as we see, not in all cases.

Eric came up with a good compromise. We will configure the BOINC replica database machine to immediately shut down (after stopping the database and unmounting its file system in case the shutdown hangs) upon detecting an on-battery condition. Nothing is dependent on this machine, so a spurious shutdown would not be a disaster. This should prevent a disaster of this magnitude from recurring.
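A minimal sketch of that on-battery handler, assuming the UPS monitoring software can call a script with an on-battery flag. The service name, mount point, and command lines are hypothetical (the post does not give them), and the commands are passed to an injected `run` callable so the logic can be tried without touching a real machine.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical commands: the real service name and mount point are not in the post.
SHUTDOWN_SEQUENCE = (
    ("stop database", ["service", "mysql", "stop"]),
    ("unmount data file system", ["umount", "/mnt/dbdata"]),
    ("power off", ["shutdown", "-h", "now"]),
)


def handle_ups_status(
    on_battery: bool, run: Callable[[Sequence[str]], int]
) -> List[Tuple[str, int]]:
    """Run the shutdown sequence when on battery; return (step, exit code) pairs."""
    results: List[Tuple[str, int]] = []
    if not on_battery:
        return results  # on line power: nothing to do
    for label, command in SHUTDOWN_SEQUENCE:
        # The database is stopped and the file system unmounted first, so the
        # data is already safe even if the final power-off step hangs.
        results.append((label, run(command)))
    return results


if __name__ == "__main__":
    def dry_run(command: Sequence[str]) -> int:
        print("would run:", " ".join(command))
        return 0

    handle_ups_status(on_battery=True, run=dry_run)
```

In a real deployment `run` could be something like `subprocess.call`; the dry-run version above only prints what it would do.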

ID: 1311865
Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1311875 - Posted: 6 Dec 2012, 19:21:34 UTC - in response to Message 1311865.  

Thanks for the update, Jeff.

Claggy
ID: 1311875
kittyman
Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1311877 - Posted: 6 Dec 2012, 19:26:11 UTC

Thank you for the news, Jeff!
Soooooo glad the DB was recoverable.

Some better UPS solutions needed for the servers?

Meow.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1311877
Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9958
Credit: 103,452,613
RAC: 328
United Kingdom
Message 1311888 - Posted: 6 Dec 2012, 19:46:10 UTC

Jeff, Thank you for taking the time to explain.

It is appreciated.
ID: 1311888
arkayn
Volunteer tester
Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 1311890 - Posted: 6 Dec 2012, 19:48:09 UTC

It also looks like Matt might be back on Monday as his last show was on December 2nd.

ID: 1311890
Gary Charpentier
Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer tester
Joined: 25 Dec 00
Posts: 31005
Credit: 53,134,872
RAC: 32
United States
Message 1311893 - Posted: 6 Dec 2012, 20:04:30 UTC

Thank you for the update. We all wondered how the damage got so extensive.

ID: 1311893
Dimly Lit Lightbulb 😀
Volunteer tester
Joined: 30 Aug 08
Posts: 15399
Credit: 7,423,413
RAC: 1
United Kingdom
Message 1311905 - Posted: 6 Dec 2012, 20:25:46 UTC

Thanks for the news Jeff.

Member of the People Encouraging Niceness In Society club.

ID: 1311905
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1311918 - Posted: 6 Dec 2012, 20:53:52 UTC
Last modified: 6 Dec 2012, 21:10:38 UTC

Thanks so much for the explanation. Knowing helps us live with it.

[edit] Would it be possible to write a script, living on the machine that needs to be shut down last, that sends instructions to the others to shut down in the proper order? Maybe it could even get feedback when each one is down, so it knows when to send the next command.
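A minimal sketch of that idea, using a topological sort so that every machine is shut down before the machines it depends on. The host names and the dependency map are hypothetical (the real cross-dependencies are not given in this thread), and `shut_down` stands in for whatever would actually send the command and wait for confirmation that the host is down.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List, Set


def shutdown_in_order(
    depends_on: Dict[str, Set[str]], shut_down: Callable[[str], bool]
) -> List[str]:
    """Shut hosts down so that each host stops before anything it depends on.

    depends_on maps a host to the hosts it needs while it is running.
    shut_down(host) must return True once the host is confirmed down.
    """
    # Invert the edges: a host may stop only after all of its dependents have.
    must_stop_first: Dict[str, Set[str]] = {host: set() for host in depends_on}
    for host, needs in depends_on.items():
        for needed in needs:
            must_stop_first.setdefault(needed, set()).add(host)

    sorter = TopologicalSorter(must_stop_first)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    stopped: List[str] = []
    while sorter.is_active():
        for host in sorted(sorter.get_ready()):
            if not shut_down(host):
                raise RuntimeError("no confirmation that {} is down".format(host))
            stopped.append(host)
            sorter.done(host)  # the feedback step: this unlocks the next hosts
    return stopped


if __name__ == "__main__":
    # Hypothetical dependency map, for illustration only.
    hosts = {"web": {"scheduler"}, "scheduler": {"db"}, "db": set()}
    print(shutdown_in_order(hosts, shut_down=lambda host: True))
```

With the map above, the database host comes last, which matches the suggestion that the controlling script live on the machine that is shut down last.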
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1311918
BarryAZ
Joined: 1 Apr 01
Posts: 2580
Credit: 16,982,517
RAC: 0
United States
Message 1311926 - Posted: 6 Dec 2012, 21:20:06 UTC

Let me add my thanks for the detailed explanation -- and knowing that you've reconfigured the on-battery shutdown process to reduce database rebuild scenarios is nice to see. It seems that power outages on the campus are like 100-year storms -- the years are not 'human years' but rather equivalent years for some much shorter-lived entity.
ID: 1311926
rob smith
Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22526
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1311928 - Posted: 6 Dec 2012, 21:32:48 UTC

Well done gents.
I dare say there was much midnight oil, expletives and coffee involved in the process.

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1311928
Richard J. Wyatt
Volunteer tester
Joined: 22 Aug 99
Posts: 5
Credit: 6,792,698
RAC: 28
Canada
Message 1311935 - Posted: 6 Dec 2012, 21:46:12 UTC

My most heart-felt thanks to the Boys With The Baling Wire, once again.

While some scientists may be doing frivolous work on multi-million dollar equipment, the SETI team continues to attempt to answer an age-old question on equipment (in some cases) that other teams would throw out as obsolete.

It just goes to show that it's not the toys you have to do it with, it's the spirit of the adventurers.

You guys always have amazed me, and probably always will.

Regards,

Richard
ID: 1311935
dancer42
Volunteer tester
Joined: 2 Jun 02
Posts: 455
Credit: 2,422,890
RAC: 1
United States
Message 1311971 - Posted: 6 Dec 2012, 23:01:13 UTC

The top-off cycle of gel cells will degrade them over time. It may be wise to replace them every 3 to 5 years to ensure that you have the capacity to do what you need. And in this case more is always better.

Also, a monitor plugged into a UPS is a waste of backup time until help arrives. A conveniently placed power strip plugged into the UPS can be used to re-power anything needed for shutdown once help arrives. A turned-on monitor can kill a UPS in minutes when just the computer may have stayed up for an hour or more.
ID: 1311971
QSilver
Joined: 26 May 99
Posts: 232
Credit: 6,452,764
RAC: 0
United States
Message 1311979 - Posted: 6 Dec 2012, 23:18:55 UTC

Jeff, I'm very appreciative that you took the time to describe the travails of the last week.
ID: 1311979
Swibby Bear
Joined: 1 Aug 01
Posts: 246
Credit: 7,945,093
RAC: 0
United States
Message 1312043 - Posted: 7 Dec 2012, 5:12:11 UTC

To echo others, THANKS, Jeff, for the extensive write-up of the problem. It is a very well-written detailed explanation of the situation that should satiate (almost) all of us geeks out here. Thanks to you and Eric for the careful troubleshooting to get the databases back on track. Matt must be crazy with envy that he missed all the excitement! (NOT)
Whit
ID: 1312043
Thomas
Volunteer tester
Joined: 9 Dec 11
Posts: 1499
Credit: 1,345,576
RAC: 0
France
Message 1312116 - Posted: 7 Dec 2012, 11:30:02 UTC - in response to Message 1311865.  
Last modified: 7 Dec 2012, 11:30:34 UTC

Thanks very much, Jeff, for this detailed report.
Well done!
It was a run of bad luck that hit the lab.
I think many members should read this post to judge the extent of the problem and stop extrapolating without knowing the facts.
ID: 1312116
TwiztedDreamz
Joined: 18 Jan 06
Posts: 1
Credit: 31,769
RAC: 0
United States
Message 1312135 - Posted: 7 Dec 2012, 13:34:16 UTC

Maybe there is a way to do something like a solar-powered battery backup system. I know there is not much funding for the project, and I am sure the solution would cost a pretty penny. But I was thinking that with all the volunteers, there might be ways to do a separate fundraiser for this. A friend of mine lives far out from the city, in an area where the power tends to go out even in a sprinkle of rain. We ended up installing a system like this to power his entire house in case the power goes out again, and sure enough it has many times. The longest he had to run his house on solar battery backup was about 5 to 6 days. I know a house's power demand is nothing near the demand of all the computer systems. I just thought I would suggest something that might be possible, and maybe even help spark ideas from others as well. ;)
ID: 1312135
{+BDC} djarril
Joined: 6 Jun 11
Posts: 1
Credit: 7,417,080
RAC: 0
France
Message 1312138 - Posted: 7 Dec 2012, 13:53:52 UTC - in response to Message 1311865.  

Thanks, Jeff!

That was a bad situation you went through. I hope you still got some sleep :)

Just a suggestion for you. Because all the servers are dependent on each other, why don't you use a "UPS monitor client/server"? The best approach is to use a laptop to monitor all the UPSes involved; if one fails, the software orders all the other servers to shut down in the right order. The downside is that all servers then have to be restarted in the right order, either manually or via WOL (Wake-on-LAN) if you can script it.
Hope it helps !
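For the Wake-on-LAN part, a minimal sketch of building and sending the "magic packet": 6 bytes of 0xFF followed by the target MAC address repeated 16 times, sent as a UDP broadcast. The MAC address below is made up, and the broadcast address and port 9 are common defaults assumed here, not values taken from this thread.

```python
import socket


def magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet for a MAC like '00:11:22:33:44:55'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError("expected a 6-byte MAC address, got {!r}".format(mac))
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    return b"\xff" * 6 + mac_bytes * 16


def wake(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Send the magic packet as a UDP broadcast (address and port are assumptions)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(magic_packet(mac), (broadcast, port))


if __name__ == "__main__":
    packet = magic_packet("00:11:22:33:44:55")  # hypothetical MAC address
    print(len(packet), "bytes ready to send")
```

A restart script would call `wake()` for each server in the reverse of the shutdown order, waiting for each machine to come up before waking the next.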

Best regards
Grade : Major Orpailleur
ID: 1312138
Draconian
Volunteer tester
Joined: 16 Mar 03
Posts: 21
Credit: 1,809,058
RAC: 0
United States
Message 1312140 - Posted: 7 Dec 2012, 13:58:30 UTC

I appreciate the information as to what happened with the outage; however, proper system design and architecture should never allow a system to be brought to its knees.
I understand that this is pretty much a volunteer operation and such - but, imagine - if I told my boss that a 20 minute power outage would result in approx a week of downtime - well - I'm fired. Volunteer or not - it is a lot of downtime for a simple failure. Is it a matter of being able to dedicate time or lack of funding? Time? Get more / another volunteer in charge - funding / equipment - say the word and we fund it.

This is not a criticism of you - only a criticism that it wasn't prevented. Identify what you need to prevent this type of issue in the future and let us know. We pay your bills, so to speak, with our computers - and, when needed, our checkbooks. As a systems admin, I have a hard time with a 20 minute power outage causing this disruption in service - my boss would kill me.
Let us know WHAT you need - and don't be shy. The first obvious need - is a reliable UPS - let's say - at least an hour battery and safe shutdown. How much?
ID: 1312140
dancer42
Volunteer tester
Joined: 2 Jun 02
Posts: 455
Credit: 2,422,890
RAC: 1
United States
Message 1312156 - Posted: 7 Dec 2012, 14:54:50 UTC
Last modified: 7 Dec 2012, 14:57:30 UTC

UPSes are expensive and add nothing until you need them, so most people only buy the minimum they think they need. Due to bad timing they could not hold long enough, and this had unfortunate consequences.

The question is what to do now?

The suggestions I made earlier to maximize up time before help arrives could help.

And while modifying the UPSes SETI already has by adding more batteries might not be pretty, it is a lot cheaper than getting larger UPSes.

Another concern that maybe should be addressed here is whether the power for the computers is run through a line isolation transformer.

Even the idea about solar power is not necessarily bad; in daylight it would up the hold time.

What is the answer to this problem? I think all of the above. Perhaps the GPU users group could make a fundraiser list broken down into units that one person could donate for or buy.

PS: They used up the baling wire I sent a long time ago. Can someone send some duct tape!

LOL
ID: 1312156
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1312165 - Posted: 7 Dec 2012, 15:29:08 UTC - in response to Message 1311971.  

The top-off cycle of gel cells will degrade them over time. It may be wise to replace them every 3 to 5 years to ensure that you have the capacity to do what you need. And in this case more is always better.

Also, a monitor plugged into a UPS is a waste of backup time until help arrives. A conveniently placed power strip plugged into the UPS can be used to re-power anything needed for shutdown once help arrives. A turned-on monitor can kill a UPS in minutes when just the computer may have stayed up for an hour or more.

I'd say, once they've run themselves down once, replace them -- they won't give anything like the same amount of run time for the next outage. My UPS lasted about 2 hours running one computer (which has BOINC set to stop when on battery, I think), my DSL modem and router, and a couple of radio scanners before it shut down the computer. The next time my power went out, just a few months later, it only lasted 20 minutes.

However, I have yet to take my own advice. ;-)

David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1312165


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.