The Gates of Delirium (Dec 06 2012)


Message boards : Technical News : The Gates of Delirium (Dec 06 2012)

Jeff Cobb
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 110
Credit: 40,367
RAC: 0
United States
Message 1311865 - Posted: 6 Dec 2012, 18:50:30 UTC

We have recently come out of a painful outage. Last Thursday, 11/29, there was an unexpected power outage at Space Sciences Lab. It lasted some 20 minutes. Eric came over as quickly as he could to shut machines down, but he works in a different building from our machine room, so the UPSes had run out their fairly short on-battery time by the time he got there. It was a perfect storm in that Matt and I (who work a few feet from the machine room) were both out.

Most machines came through OK, but three did not. Lando, an older administrative workhorse (and splitter machine), appears to be dead. We have some spares from which to choose its replacement. More tragic was the fact that the master BOINC database and its replica suffered irreparable corruption. This was an astonishing bit of bad luck. Both machines are on UPS and both machines have battery-backed RAID controllers. One would think that all database logging would have at least made it to the RAID controller, but it obviously did not.

In order to recover the master database, we had to actually delete all of the underlying files and then recreate all of the databases from scratch before recovering from backup. A simple recovery from the backup did not work. After recreating the databases and then recovering from the backup, we ran all of the MySQL binary logs to recover up to a point in time just before the outage. Then we took a fresh backup of the database in case the next step did more harm than good. The next step was to run an extensive table check/repair on all tables in both the production and beta databases. All tables reported OK. Good! We then brought the projects up and used the fresh backup to restore the replica.
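The recovery sequence described above can be sketched roughly as follows. This is only an illustration, not the commands the lab actually ran: the helper name `recovery_plan`, the file names, and the stop time are all made up for the example, and it assumes the backup is a plain SQL dump. The function only builds the command strings so they can be reviewed before anyone runs them.

```python
import shlex


def recovery_plan(backup_file, binlogs, stop_time, fresh_backup):
    """Return the shell commands, in order, for a point-in-time recovery.

    Nothing is executed here; each step is meant to be reviewed and run by hand
    after the databases have been recreated from scratch.
    """
    q = shlex.quote
    binlog_args = " ".join(q(name) for name in binlogs)
    return [
        # 1. Reload the last good backup (assumed here to be a SQL dump).
        "mysql < {}".format(q(backup_file)),
        # 2. Replay the binary logs up to a point just before the outage.
        "mysqlbinlog --stop-datetime={} {} | mysql".format(q(stop_time), binlog_args),
        # 3. Take a fresh backup in case the next step does more harm than good.
        "mysqldump --all-databases > {}".format(q(fresh_backup)),
        # 4. Check every table in every database.
        "mysqlcheck --all-databases --check",
    ]


if __name__ == "__main__":
    # All arguments below are placeholders, not values from the actual outage.
    for step in recovery_plan("backup.sql", ["binlog.000001", "binlog.000002"],
                              "2012-11-29 00:00:00", "fresh_backup.sql"):
        print(step)
```

Keeping the fresh backup as its own step before the table check mirrors the caution in the post: if the check or repair makes things worse, there is still a known state to go back to.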

One might ask why we don't have machines automatically shut down in an on-battery situation. A good question with a lot of history. To make a long story short, our server complex has enough cross-dependencies that if machines come down in the "wrong" order, other machines can hang. Plus, some of our old UPSes would hiccup and cause a spurious shutdown (I'm not sure if our current crop has this problem). This was enough of a headache that we went with a very simple design. Our database machines would have battery-backed RAID and be on UPS with no automatic shutdown. The theory was that the UPS would hold the machines for the duration of very short (one or two minute) power outages and, beyond that, the RAID controllers would save any pending IO. This very simple design has served us well but, as we see, not in all cases.

Eric came up with a good compromise. We will configure the BOINC replica database machine to immediately shut down (after stopping the database and unmounting its file system in case the shutdown hangs) upon detecting an on-battery condition. Nothing is dependent on this machine, so a spurious shutdown would not be a disaster. This should prevent a disaster of this magnitude from recurring.
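A minimal sketch of that on-battery handler might look like the following. The service name, the mount point `/mydata`, and the function name `handle_power_event` are assumptions for illustration; the post does not say how the replica is actually configured, and in practice a UPS monitoring daemon would be what calls something like this.

```python
import subprocess

# Hypothetical commands: the real service name and mount point are not given in the post.
SHUTDOWN_STEPS = [
    ["service", "mysql", "stop"],  # stop the database first
    ["umount", "/mydata"],         # then unmount its file system
    ["shutdown", "-h", "now"],     # finally halt the machine
]


def handle_power_event(on_battery, run=subprocess.call, steps=SHUTDOWN_STEPS):
    """Issue the shutdown steps in order if the UPS reports it is on battery.

    `run` is the function used to execute each command; it can be swapped for
    a fake in tests. Returns the list of commands that were issued.
    """
    issued = []
    if not on_battery:
        return issued
    for cmd in steps:
        issued.append(cmd)
        # Keep going even if one step fails: the database and file system
        # should be quiet before the final shutdown, which may itself hang.
        run(cmd)
    return issued
```

Stopping the database and unmounting before the halt means that even if the shutdown itself hangs, the data files are already closed when the battery runs out.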

____________

Claggy
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4067
Credit: 32,888,891
RAC: 7,060
United Kingdom
Message 1311875 - Posted: 6 Dec 2012, 19:21:34 UTC - in response to Message 1311865.

Thanks for the update Jeff,

Claggy

msattler
Project donor
Volunteer tester
Joined: 9 Jul 00
Posts: 38910
Credit: 578,315,839
RAC: 517,516
United States
Message 1311877 - Posted: 6 Dec 2012, 19:26:11 UTC

Thank you for the news, Jeff!
Soooooo glad the DB was recoverable.

Some better UPS solutions needed for the servers?

Meow.
____________
*********************************************
Embrace your inner kitty...ya know ya wanna!

I have met a few friends in my life.
Most were cats.

Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 6904
Credit: 25,731,432
RAC: 39,211
United Kingdom
Message 1311888 - Posted: 6 Dec 2012, 19:46:10 UTC

Jeff, Thank you for taking the time to explain.

It is appreciated.
____________


Today is life, the only life we're sure of. Make the most of today.

arkayn
Project donor
Volunteer tester
Joined: 14 May 99
Posts: 3622
Credit: 48,544,119
RAC: 32,087
United States
Message 1311890 - Posted: 6 Dec 2012, 19:48:09 UTC

It also looks like Matt might be back on Monday as his last show was on December 2nd.
____________

Gary Charpentier
Project donor
Volunteer tester
Joined: 25 Dec 00
Posts: 12395
Credit: 6,703,853
RAC: 8,733
United States
Message 1311893 - Posted: 6 Dec 2012, 20:04:30 UTC

Thank you for the update. We all wondered how the damage got so extensive.

____________

Zapped Sparky
Project donor
Volunteer moderator
Volunteer tester
Joined: 30 Aug 08
Posts: 7330
Credit: 1,237,633
RAC: 1,298
United Kingdom
Message 1311905 - Posted: 6 Dec 2012, 20:25:46 UTC

Thanks for the news Jeff.
____________
In an alternate universe, it was a ZX81 that asked for clothes, boots and motorcycle.

Client error 418: I'm a teapot

Tropical Goldfish Fish 15: Squeaky bras 'R us

Illusions of normality sufferer

N9JFE David S
Project donor
Volunteer tester
Joined: 4 Oct 99
Posts: 11162
Credit: 13,943,404
RAC: 12,346
United States
Message 1311918 - Posted: 6 Dec 2012, 20:53:52 UTC
Last modified: 6 Dec 2012, 21:10:38 UTC

Thanks so much for the explanation. Knowing helps us live with it.

[edit] Would it be possible to write a script that would live on the machine that needs to be shut down last and would send instructions to the others to shut down in the proper order? Maybe even get feedback when they are down, so it knows when to send the next command?
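One way to sketch that idea is to describe which machine depends on which and let a topological sort work out the order. Everything below is hypothetical: the host names are invented, and `stop_and_confirm` stands in for whatever would really send the shutdown command and wait for the machine to report that it is down. It uses `graphlib`, which is in the Python standard library from 3.9 onward.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def shutdown_order(depends_on):
    """Given {machine: set of machines it depends on}, return a safe shutdown order.

    A machine is shut down before anything it depends on, so the result is
    the reverse of a dependencies-first (start-up) ordering.
    """
    startup = list(TopologicalSorter(depends_on).static_order())
    return list(reversed(startup))


def shut_down_all(depends_on, stop_and_confirm):
    """Shut machines down one at a time, waiting for feedback from each.

    stop_and_confirm(host) must return True once the host is confirmed down.
    Returns the hosts confirmed down, stopping at the first one that fails.
    """
    done = []
    for host in shutdown_order(depends_on):
        if not stop_and_confirm(host):
            break  # do not pull the rug out from under a machine that is still up
        done.append(host)
    return done


if __name__ == "__main__":
    # Invented example: two front ends need the database, which needs storage.
    deps = {"frontend1": {"database"}, "frontend2": {"database"}, "database": {"storage"}}
    print(shutdown_order(deps))
```

In this layout the controlling script would live on the machine that sorts last, matching the suggestion that it run on the box that has to go down last.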
____________
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


BarryAZ
Joined: 1 Apr 01
Posts: 2580
Credit: 12,052,651
RAC: 4,561
United States
Message 1311926 - Posted: 6 Dec 2012, 21:20:06 UTC

Let me add my thanks for the detailed explanation. Knowing that you've reconfigured the on-battery shutdown process to reduce database rebuild scenarios is nice to see. It seems that power outages on the campus are like 100-year storms: the years are not 'human years' but rather equivalent years for some much shorter-lived entity.

rob smith
Project donor
Volunteer tester
Joined: 7 Mar 03
Posts: 8302
Credit: 55,196,629
RAC: 75,511
United Kingdom
Message 1311928 - Posted: 6 Dec 2012, 21:32:48 UTC

Well done gents.
I dare say there was much midnight oil, expletives and coffee involved in the process.

____________
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?

Richard J. Wyatt
Volunteer tester
Joined: 22 Aug 99
Posts: 5
Credit: 3,435,001
RAC: 2,130
Canada
Message 1311935 - Posted: 6 Dec 2012, 21:46:12 UTC

My most heart-felt thanks to the Boys With The Baling Wire, once again.

While some scientists may be doing frivolous work on multi-million dollar equipment, the SETI team continues to attempt to answer an age-old question on equipment (in some cases) that other teams would throw out as obsolete.

It just goes to show that it's not the toys you have to do it with, it's the spirit of the adventurers.

You guys have amazed me, and probably always will.

Regards,

Richard

dancer42
Volunteer tester
Joined: 2 Jun 02
Posts: 436
Credit: 1,093,724
RAC: 852
United States
Message 1311971 - Posted: 6 Dec 2012, 23:01:13 UTC

The top-off cycle of the gel cells will degrade them over time. It may be wise to replace them every 3 to 5 years to ensure that you have the capacity to do what you need. And in this case, more is always better.

Also, a monitor plugged into a UPS is a waste of backup time until help arrives. A conveniently placed power strip plugged into the UPS can be used to re-power anything needed for shutdown once help arrives.

A turned-on monitor can kill a UPS in minutes when the computer alone might have stayed up for an hour or more.
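As a rough illustration of why the attached load matters, here is a back-of-the-envelope runtime estimate. The function name, the default efficiency figure, and the numbers in the example are all made up rather than measured on any real UPS, and the formula ignores the fact that real batteries deliver less energy at high discharge rates, so treat the result as optimistic.

```python
def estimated_runtime_minutes(battery_wh, load_watts, inverter_efficiency=0.9):
    """Very rough UPS runtime in minutes: usable stored energy divided by load.

    battery_wh          -- nominal battery energy in watt-hours
    load_watts          -- total power drawn by everything plugged in
    inverter_efficiency -- assumed fraction of stored energy that reaches the load
    """
    if load_watts <= 0:
        raise ValueError("load must be positive")
    return battery_wh * inverter_efficiency / load_watts * 60


if __name__ == "__main__":
    # Invented figures: the same battery with and without an extra 100 W device.
    print(estimated_runtime_minutes(200, 150))
    print(estimated_runtime_minutes(200, 250))
```

The estimate scales inversely with load, so anything that can be left unplugged until someone arrives buys extra minutes.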
____________

QSilver
Joined: 26 May 99
Posts: 228
Credit: 4,587,292
RAC: 3,003
United States
Message 1311979 - Posted: 6 Dec 2012, 23:18:55 UTC

Jeff, very appreciative that you took the time to describe the travails of the last week.
____________

Swibby Bear
Joined: 1 Aug 01
Posts: 236
Credit: 7,276,138
RAC: 415
United States
Message 1312043 - Posted: 7 Dec 2012, 5:12:11 UTC

To echo others, THANKS, Jeff, for the extensive write-up of the problem. It is a very well-written detailed explanation of the situation that should satiate (almost) all of us geeks out here. Thanks to you and Eric for the careful troubleshooting to get the databases back on track. Matt must be crazy with envy that he missed all the excitement! (NOT)
Whit

{BDC} Thomas Dupont
Project donor
Volunteer tester
Joined: 9 Dec 11
Posts: 3726
Credit: 1,310,582
RAC: 780
France
Message 1312116 - Posted: 7 Dec 2012, 11:30:02 UTC - in response to Message 1311865.
Last modified: 7 Dec 2012, 11:30:34 UTC

Thanks very much Jeff for this detailed report.
Well done !
It was a run of bad luck that hit the lab.
I think many members should read this post to judge the extent of the problem and stop extrapolating without knowing the facts.
____________
Team Founder BRIGADE DU COSMOS




BRIGADE DU COSMOS is proudly sponsored by Zenovia Digital Exchange

TwiztedDreamz
Joined: 18 Jan 06
Posts: 1
Credit: 30,107
RAC: 0
United States
Message 1312135 - Posted: 7 Dec 2012, 13:34:16 UTC

Maybe there is a way to do something like a solar-powered battery backup system. I know there is not much funding for the project, and I am sure the solution would cost a pretty penny. But I was thinking that, with all the volunteers, there might be ways to do a separate fundraiser for this. A friend of mine lives far out from the city, in an area where the power tends to go out even in a sprinkle of rain. We ended up installing a system like this to power his entire house in case the power goes out again, and sure enough it has, many times. The longest he had to run his house on solar battery backup was about 5 to 6 days. I know a house's power demand is nothing near the demand of all the computer systems. I just thought I would try to suggest something that might be possible, and maybe even help spark ideas from others as well. ;)
____________

{+BDC} djarril
Joined: 6 Jun 11
Posts: 2
Credit: 6,789,735
RAC: 4,089
France
Message 1312138 - Posted: 7 Dec 2012, 13:53:52 UTC - in response to Message 1311865.

Thanks Jeff !

That was a bad situation to live through. I hope you still got some sleep :)

Just a suggestion for you. Because all the servers depend on one another, why don't you use a "UPS monitor client/server"? The best approach is to use a laptop to monitor all the UPSes involved; if one fails, the software orders all the other servers to shut down in the right order. The downside is that all servers have to be restarted manually in the right order, or via WOL (Wake-on-LAN) if you can script it.
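For the restart side, scripting Wake-on-LAN is fairly simple: a "magic packet" is 6 bytes of 0xFF followed by the target's MAC address repeated 16 times, usually sent as a UDP broadcast. The sketch below is only an illustration; the function names are invented, the MAC address in the example is a placeholder, and port 9 is a common convention rather than a requirement. The target machines also need Wake-on-LAN enabled in their firmware and network card settings.

```python
import socket


def magic_packet(mac):
    """Build a Wake-on-LAN magic packet for a MAC like '00:11:22:33:44:55'."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError("MAC address must be exactly 6 bytes")
    # 6 bytes of 0xFF, then the MAC address repeated 16 times.
    return b"\xff" * 6 + raw * 16


def wake(mac, broadcast="255.255.255.255", port=9):
    """Send the magic packet as a UDP broadcast on the local network."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(magic_packet(mac), (broadcast, port))


if __name__ == "__main__":
    # Placeholder address; build the packet without sending anything.
    print(len(magic_packet("00:11:22:33:44:55")))
```

A restart script could call `wake()` for each machine in start-up order, pausing until each one answers before waking the next.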
Hope it helps !

Best regards
____________
Grade : Prestige Kilo

Draconian
Volunteer tester
Joined: 16 Mar 03
Posts: 21
Credit: 1,809,058
RAC: 0
United States
Message 1312140 - Posted: 7 Dec 2012, 13:58:30 UTC

I appreciate the information as to what happened with the outage. However, proper system design and architecture should never allow a system to be brought to its knees.
I understand that this is pretty much a volunteer operation and such - but, imagine - if I told my boss that a 20 minute power outage would result in approx a week of downtime - well - I'm fired. Volunteer or not - it is a lot of downtime for a simple failure. Is it a matter of being able to dedicate time or lack of funding? Time? Get more / another volunteer in charge - funding / equipment - say the word and we fund it.

This is not a criticism of you - only a criticism that it wasn't prevented. Identify what you need to prevent this type of issue in the future and let us know. We pay your bills, so to speak, with our computers - and, when needed, our checkbooks. As a systems admin, I have a hard time with a 20 minute power outage causing this disruption in service - my boss would kill me.
Let us know WHAT you need - and don't be shy. The first obvious need - is a reliable UPS - let's say - at least an hour battery and safe shutdown. How much?
____________

dancer42
Volunteer tester
Joined: 2 Jun 02
Posts: 436
Credit: 1,093,724
RAC: 852
United States
Message 1312156 - Posted: 7 Dec 2012, 14:54:50 UTC
Last modified: 7 Dec 2012, 14:57:30 UTC

UPSes are expensive and add nothing until you need them, so most people only buy the minimum they think they need. Due to bad timing, they could not hold long enough, and this had unfortunate consequences.

The question is what to do now.

The suggestions I made earlier to maximize uptime before help arrives could help.

And while modifying the UPSes SETI already has by adding more batteries might not be pretty, it is a lot cheaper than getting larger UPSes.

Another concern that maybe should be addressed here is whether the power for the computers is run through a line isolation transformer.

Even the idea about solar power is not necessarily bad; in daylight it would increase the hold time.

What is the answer to this problem? I think all of the above. Perhaps the gpuusers group could make a fundraiser list broken down into units that one person could donate for or buy.

P.S. They used up the baling wire I sent a long time ago. Can someone send some duct tape!

LOL
____________

N9JFE David S
Project donor
Volunteer tester
Joined: 4 Oct 99
Posts: 11162
Credit: 13,943,404
RAC: 12,346
United States
Message 1312165 - Posted: 7 Dec 2012, 15:29:08 UTC - in response to Message 1311971.

dancer42 wrote:

The top-off cycle of the gel cells will degrade them over time. It may be wise to replace them every 3 to 5 years to ensure that you have the capacity to do what you need. And in this case, more is always better.

Also, a monitor plugged into a UPS is a waste of backup time until help arrives. A conveniently placed power strip plugged into the UPS can be used to re-power anything needed for shutdown once help arrives.

A turned-on monitor can kill a UPS in minutes when the computer alone might have stayed up for an hour or more.

I'd say, once they've run themselves down once, replace them; they won't give anything like the same amount of run time for the next outage. My UPS lasted about 2 hours running one computer (which, I think, has BOINC set to stop when on battery), my DSL modem and router, and a couple of radio scanners before it shut down the computer. The next time my power went out, just a few months later, it only lasted 20 minutes.

However, I have yet to take my own advice. ;-)

____________
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


Copyright © 2014 University of California