The Gates of Delirium (Dec 06 2012)
Author | Message |
---|---|
Jeff Cobb Send message Joined: 1 Mar 99 Posts: 122 Credit: 40,367 RAC: 0 |
We have recently come out of a painful outage. Last Thursday, 11/29, there was an unexpected power outage at Space Sciences Lab. It lasted some 20 minutes. Eric came over as quickly as he could to shut machines down, but he works in a different building from the one our machine room is in, so the UPSes had run out their fairly short on-battery time by the time he got there. It was a perfect storm in that Matt and I (who work a few feet from the machine room) were both out. Most machines came through OK, but three did not. Lando, an older administrative workhorse (and splitter machine), appears to be dead. We have some spares from which to choose its replacement. More tragic was the fact that the master BOINC database, and its replica, suffered irreparable corruption. This was an astonishing bit of bad luck. Both machines are on UPS and both machines have battery-backed RAID controllers. One would think that all database logging would have at least made it to the RAID controller, but it obviously did not. In order to recover the master database, we had to actually delete all of the underlying files and then recreate all of the databases from scratch before recovering from backup. A simple recovery from the backup did not work. After recreating the databases and then recovering from the backup, we ran all of the MySQL binary logs to recover up to a point in time just before the outage. Then we took a fresh backup of the database in case the next step did more harm than good. The next step was to run an extensive table check/repair on all tables in both the production and beta databases. All tables reported OK. Good! We then brought the projects up and used the fresh backup to restore the replica. One might ask why we don't have machines automatically shut down in an on-battery situation. A good question with a lot of history. 
To make a long story short, our server complex has enough cross-dependencies that if machines come down in the "wrong" order, other machines can hang. Plus, some of our old UPSes would hiccup and cause a spurious shutdown (I'm not sure if our current crop have this problem). This was enough of a headache that we went with a very simple design. Our database machines would have battery-backed RAID and be on UPS with no automatic shutdown. The theory was that the UPS would hold the machines for the duration of very short (one or two minute) power outages and, beyond that, the RAID controllers would save any pending I/O. This very simple design has served us well but, as we see, not in all cases. Eric came up with a good compromise. We will configure the BOINC replica database machine to immediately shut down (after stopping the database and unmounting its file system in case the shutdown hangs) upon detecting an on-battery condition. Nothing is dependent on this machine, so a spurious shutdown would not be a disaster. This should prevent a disaster of this magnitude from recurring. |
Claggy Send message Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
Thanks for the update Jeff, Claggy |
kittyman Send message Joined: 9 Jul 00 Posts: 51478 Credit: 1,018,363,574 RAC: 1,004 |
Thank you for the news, Jeff! Soooooo glad the DB was recoverable. Some better UPS solutions needed for the servers? Meow. "Time is simply the mechanism that keeps everything from happening all at once." |
Bernie Vine Send message Joined: 26 May 99 Posts: 9958 Credit: 103,452,613 RAC: 328 |
Jeff, Thank you for taking the time to explain. It is appreciated. |
arkayn Send message Joined: 14 May 99 Posts: 4438 Credit: 55,006,323 RAC: 0 |
|
Gary Charpentier Send message Joined: 25 Dec 00 Posts: 31005 Credit: 53,134,872 RAC: 32 |
Thank you for the update. We all wondered how the damage got so extensive. |
Dimly Lit Lightbulb 😀 Send message Joined: 30 Aug 08 Posts: 15399 Credit: 7,423,413 RAC: 1 |
Thanks for the news Jeff. Member of the People Encouraging Niceness In Society club. |
David S Send message Joined: 4 Oct 99 Posts: 18352 Credit: 27,761,924 RAC: 12 |
Thanks so much for the explanation. Knowing helps us live with it. [edit] Would it be possible to write a script, living on the machine that needs to be shut down last, that would send instructions to the others to shut down in the proper order? Maybe even get feedback when they are down so it knows when to send the next command? David Sitting on my butt while others boldly go, Waiting for a message from a small furry creature from Alpha Centauri. |
BarryAZ Send message Joined: 1 Apr 01 Posts: 2580 Credit: 16,982,517 RAC: 0 |
Let me add my thanks to the detailed explanation -- and knowing that you've reconfigured a battery shut down process to reduce database rebuild scenarios is nice to see. It seems that power outages on the campus are like 100 year storms -- the years are not 'human years' but rather equivalent years for some much shorter lived entity. |
rob smith Send message Joined: 7 Mar 03 Posts: 22526 Credit: 416,307,556 RAC: 380 |
Well done gents. I dare say there was much midnight oil, expletives and coffee involved in the process. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Richard J. Wyatt Send message Joined: 22 Aug 99 Posts: 5 Credit: 6,792,698 RAC: 28 |
My most heart-felt thanks to the Boys With The Baling Wire, once again. While some scientists may be doing frivolous work on multi-million dollar equipment, the SETI team continues to attempt to answer an age-old question on equipment (in some cases) that other teams would throw out as obsolete. It just goes to show that it's not the toys you have to do it with, it's the spirit of the adventurers. You guys have, and probably always will, amaze me. Regards, Richard |
dancer42 Send message Joined: 2 Jun 02 Posts: 455 Credit: 2,422,890 RAC: 1 |
The top-off cycle of gel cells will degrade them over time. It may be wise to replace them every 3 to 5 years to ensure that you have the capacity to do what you need, and in this case more is always better. Also, a monitor plugged into a UPS is a waste of backup time until help arrives. A conveniently placed power strip plugged into the UPS can be used to re-power anything needed for shutdown once help arrives. A turned-on monitor can kill a UPS in minutes when the computer alone may have stayed up for an hour or more. |
QSilver Send message Joined: 26 May 99 Posts: 232 Credit: 6,452,764 RAC: 0 |
Jeff, very appreciative that you took the time to describe the travails of the last week. |
Swibby Bear Send message Joined: 1 Aug 01 Posts: 246 Credit: 7,945,093 RAC: 0 |
To echo others, THANKS, Jeff, for the extensive write-up of the problem. It is a very well-written detailed explanation of the situation that should satiate (almost) all of us geeks out here. Thanks to you and Eric for the careful troubleshooting to get the databases back on track. Matt must be crazy with envy that he missed all the excitement! (NOT) Whit |
Thomas Send message Joined: 9 Dec 11 Posts: 1499 Credit: 1,345,576 RAC: 0 |
Thanks very much, Jeff, for this detailed report. Well done! It's a series of bad luck that hit the lab. I think many members should be aware of this post to judge the extent of the problem and stop extrapolating without knowing. |
TwiztedDreamz Send message Joined: 18 Jan 06 Posts: 1 Credit: 31,769 RAC: 0 |
Maybe there is a way to do something like a solar-powered battery backup system. I know there is not much funding for the project, and I am sure the solution would cost a pretty penny, but I was thinking that with all the volunteers, there might be ways to do a separate fundraiser for this. A friend of mine lives far out from the city, where the power tends to go out even in a sprinkle of rain. We ended up installing a system like this to power his entire house in case the power goes out again, and sure enough it has, many times. The longest he had to run his house on solar battery backup was about 5 to 6 days. I know a house's power demand is nothing near the demand of all the computer systems. I just thought I would try to suggest something that might be possible, and maybe even help spark ideas from others as well. ;) |
{+BDC} djarril Send message Joined: 6 Jun 11 Posts: 1 Credit: 7,417,080 RAC: 0 |
Thanks Jeff! That was a bad situation to live through. I hope you still got some sleep :) Just a suggestion for you: because all the servers are interdependent, why don't you use a "UPS monitor client/server"? The best approach is to use a laptop to monitor all the UPSes involved; if one fails, it orders all the other servers, via the software, to shut down in the right order. The downside is that all the servers have to be restarted manually in the right order, or via WOL (Wake-on-LAN) if you can script it. Hope it helps! Best regards Grade : Major Orpailleur |
Draconian Send message Joined: 16 Mar 03 Posts: 21 Credit: 1,809,058 RAC: 0 |
I appreciate the information as to what happened with the outage, however - proper system design and architecture should never allow a system to be brought to its knees. I understand that this is pretty much a volunteer operation and such - but, imagine - if I told my boss that a 20 minute power outage would result in approx a week of downtime - well - I'm fired. Volunteer or not - it is a lot of downtime for a simple failure. Is it a matter of being able to dedicate time or lack of funding? Time? Get more / another volunteer in charge - funding / equipment - say the word and we fund it. This is not a criticism of you - only a criticism that it wasn't prevented. Identify what you need to prevent this type of issue in the future and let us know. We pay your bills, so to speak, with our computers - and, when needed, our checkbooks. As a systems admin, I have a hard time with a 20 minute power outage causing this disruption in service - my boss would kill me. Let us know WHAT you need - and don't be shy. The first obvious need - is a reliable UPS - let's say - at least an hour battery and safe shutdown. How much? |
dancer42 Send message Joined: 2 Jun 02 Posts: 455 Credit: 2,422,890 RAC: 1 |
UPSes are expensive and add nothing until you need them, so most people only buy the minimum they think they need. Due to bad timing they could not hold long enough, and this had unfortunate consequences. The question is what to do now. The suggestions I made earlier to maximize uptime before help arrives could help. And while modifying the UPSes SETI already has by adding more batteries might not be pretty, it is a lot cheaper than getting larger UPSes. Another concern that maybe should be addressed here is whether the power for the computers is run through a line isolation transformer. Even the idea about solar power is not necessarily bad; in daylight it would extend the hold time. What is the answer to this problem? I think all of the above. Perhaps the gpuusers group could make a fundraiser list broken down into units one person could donate for or buy. P.S. They used up the baling wire I sent a long time ago. Can someone send some duck tape? LOL |
David S Send message Joined: 4 Oct 99 Posts: 18352 Credit: 27,761,924 RAC: 12 |
dancer42 wrote: "The top-off cycle of gel cells will degrade them over time." I'd say, once they've run themselves down once, replace them -- they won't give anything like the same amount of run time for the next outage. My UPS lasted about 2 hours running one computer (which has BOINC set to stop when on battery, I think), my DSL modem and router, and a couple of radio scanners before it shut down the computer. The next time my power went out, just a few months later, it only lasted 20 minutes. However, I have yet to take my own advice. ;-) David Sitting on my butt while others boldly go, Waiting for a message from a small furry creature from Alpha Centauri. |
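
Several posts in this thread (David S, djarril) suggest a script that shuts machines down in dependency order and waits for each one to confirm it is down before sending the next command. A minimal Python sketch of that idea follows. The host names, the `depends_on` map, and the `stop_host` / `is_down` callables are hypothetical placeholders, not a description of the actual SETI@home server layout or tooling.

```python
import time
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def startup_order(depends_on):
    """Return hosts so that every host comes after the hosts it depends on.

    depends_on maps each host to the set of hosts it needs in order to run.
    """
    return list(TopologicalSorter(depends_on).static_order())


def shutdown_order(depends_on):
    """Return hosts so that dependents go down before what they depend on."""
    return startup_order(depends_on)[::-1]


def shut_down_all(depends_on, stop_host, is_down, timeout=120.0, poll=1.0):
    """Stop hosts one at a time, waiting for each to report down.

    stop_host(host) sends the shutdown command; is_down(host) returns True
    once the host is confirmed down. Both are supplied by the caller.
    Returns the hosts that did not confirm within the timeout.
    """
    stuck = []
    for host in shutdown_order(depends_on):
        stop_host(host)
        deadline = time.monotonic() + timeout
        while not is_down(host):
            if time.monotonic() >= deadline:
                # Give up on this host and move on rather than hang forever.
                stuck.append(host)
                break
            time.sleep(poll)
    return stuck


# Hypothetical layout, for illustration only.
EXAMPLE_DEPENDS_ON = {
    "web": {"master_db"},
    "replica_db": {"master_db"},
    "master_db": {"storage"},
    "storage": set(),
}
```

With the example map, `shutdown_order(EXAMPLE_DEPENDS_ON)` places `web` and `replica_db` before `master_db`, and `master_db` before `storage`; `startup_order` gives the reverse sequence for bringing machines back up. If the map contains a circular dependency, `static_order()` raises `graphlib.CycleError`, which is a useful early warning that no safe order exists.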
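
Jeff's post describes the planned fix for the replica machine: on detecting an on-battery condition, stop the database, unmount its file system, and then shut down, so that the data is already safe if the shutdown itself hangs. That sequence could be sketched in Python as below. How the on-battery event is detected is left out, and the service name `mysql`, the mount point `/data`, and the per-step timeout are assumptions for illustration only; the thread does not give them.

```python
import subprocess

# Hypothetical commands: the real service name and mount point are not
# stated in the thread.
REPLICA_STEPS = [
    ("stop database", ["systemctl", "stop", "mysql"]),
    ("unmount data", ["umount", "/data"]),
    ("power off", ["shutdown", "-h", "now"]),
]


def run_steps(steps, run=subprocess.run, timeout=60):
    """Run each step in order, never letting one hung step block the rest.

    Returns a list of (label, succeeded) pairs. A step that exceeds the
    timeout or exits non-zero is recorded as failed, and the sequence
    continues so the machine still attempts to power off.
    """
    results = []
    for label, argv in steps:
        try:
            completed = run(argv, timeout=timeout)
            results.append((label, completed.returncode == 0))
        except subprocess.TimeoutExpired:
            results.append((label, False))
    return results
```

The `run` parameter exists so the sequence can be rehearsed with a stand-in function before it is ever pointed at a real machine. Since nothing depends on the replica, a spurious trigger costs only a restart of that one host.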