Major Power Outage at SSL


Message boards : News : Major Power Outage at SSL

Jeffrey Petro
Joined: 24 Apr 12
Posts: 2
Credit: 41,248
RAC: 0
United States
Message 1232863 - Posted: 18 May 2012, 17:30:24 UTC

Am I reading this correctly that when the power went out the servers all
insta-shut?

I am shocked to learn that a facility like this has no backup power, even to do a clean shutdown.

zoom314
Joined: 30 Nov 03
Posts: 45737
Credit: 36,373,363
RAC: 8,374
Message 1232866 - Posted: 18 May 2012, 17:36:10 UTC

Great job getting everything back up guys!

So say We all?
____________

Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4039
Credit: 32,691,120
RAC: 762
United Kingdom
Message 1232871 - Posted: 18 May 2012, 17:42:41 UTC - in response to Message 1232863.

Am I reading this correctly that when the power went out the servers all
insta-shut?

I am shocked to learn that a facility like this has no backup...even to do a clean shut down.

No, they have UPSes, so the servers should shut down gracefully.

Claggy

Jeffrey Petro
Joined: 24 Apr 12
Posts: 2
Credit: 41,248
RAC: 0
United States
Message 1232877 - Posted: 18 May 2012, 17:49:46 UTC

Claggy,

That's fair...I guess you and I just read certain things differently...

like when I read... A mixture of "all hands on deck" and incredible luck that nothing really got corrupted/fried when the power suddenly disappeared. There are some RAID resyncs happening at the moment, but looking good thus far...

for example, I do not get a warm fuzzy feeling that servers shut down 'gracefully'...lol

tullio
Joined: 9 Apr 04
Posts: 3564
Credit: 361,349
RAC: 216
Italy
Message 1232891 - Posted: 18 May 2012, 18:12:48 UTC
Last modified: 18 May 2012, 18:13:38 UTC

A UPS should allow a system to make a regular shutdown. I had a power failure just now and my system went down ungracefully, UPS notwithstanding. When power returned I restarted only the router, not the system, since I know that power outages tend to repeat here. So when the power failed a second time, the router stayed alive. Evidently the battery in my UPS can only carry the router, not my system. Anyway, I restarted the system, which ran a full filesystem check (Linux), and it is working again.
Tullio
____________

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2233
Credit: 8,421,408
RAC: 4,024
United States
Message 1232899 - Posted: 18 May 2012, 18:26:10 UTC

They do have UPSes and remote-controlled power switches (or those are part of a UPS or two; they allow them to SSH in and power-cycle a machine). I remember reading a few years ago that at one point all the UPSes were basically just power strips, as the battery capacity in them was basically zero.

Even with a large 3000VA UPS, having 4-6 servers on it, some of which have 30+ HDDs, you're talking 5 minutes on brand new batteries for a graceful shutdown, and you can't just tell them all to shut down all at once. Some have to go down before others.
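The "5 minutes on brand new batteries" figure can be sanity-checked with a back-of-the-envelope calculation: usable battery energy divided by the total load. The sketch below is only that, a rough estimate. The 200 Wh battery figure, the 400 W per-server draw, and the 85% inverter efficiency are illustrative assumptions, not measurements from the SSL machine room.

```python
def estimated_runtime_minutes(battery_wh, load_watts, inverter_efficiency=0.85):
    """Rough UPS run time: usable battery energy divided by the load drawn."""
    if load_watts <= 0:
        raise ValueError("load_watts must be positive")
    usable_wh = battery_wh * inverter_efficiency
    return usable_wh / load_watts * 60


# Hypothetical figures: a battery pack storing 200 Wh
# feeding five servers that draw 400 W each.
server_loads_watts = [400, 400, 400, 400, 400]
minutes = estimated_runtime_minutes(200, sum(server_loads_watts))
print(f"about {minutes:.1f} minutes of run time")
```

Real run time also depends on battery age and temperature, so a measured figure from the UPS itself is more trustworthy than this estimate.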
____________

Linux laptop uptime: 1484d 22h 42m
Ended due to UPS failure, found 14 hours after the fact

W5DMG - Dave
Joined: 19 May 99
Posts: 155
Credit: 32,057,292
RAC: 15,972
United States
Message 1232919 - Posted: 18 May 2012, 18:52:01 UTC

Tis nice to see this back online, thanks for all the hard work guys.
Also thanks to all at Lunatics for keeping us informed.

My uploads are working, no reporting as of yet.
But I know everything will be back to normal shortly.

RTFishall
Joined: 4 Jan 00
Posts: 7
Credit: 947,588
RAC: 1,420
United Kingdom
Message 1232936 - Posted: 18 May 2012, 19:13:15 UTC
Last modified: 18 May 2012, 19:25:40 UTC

Thanks guys for all the work in getting it back up and running again. However, it is a shame the Berkeley Uni web site could not find space in its news to say what had happened to seti@home!
Does not bode well for interstellar contact if we can't get the communications right here, does it?

John Clark
Volunteer tester
Joined: 29 Sep 99
Posts: 16515
Credit: 4,418,829
RAC: 0
United Kingdom
Message 1232945 - Posted: 18 May 2012, 19:25:10 UTC

Good to see you all back online, and, so far nothing corrupted. As soon as I get home next week, I hope, I can get my rigs rebooted.
____________
It's good to be back amongst friends and colleagues



Zapped Sparky
Volunteer moderator
Volunteer tester
Joined: 30 Aug 08
Posts: 6554
Credit: 1,200,844
RAC: 93
United Kingdom
Message 1232952 - Posted: 18 May 2012, 19:32:32 UTC - in response to Message 1232810.

To get it all up and back on line from scratch in two hours was a major feat of teamwork, I take my hat off to you and the lads, well done!


Thanks. A mixture of "all hands on deck" and incredible luck that nothing really got corrupted/fried when the power suddenly disappeared. There are some RAID resyncs happening at the moment, but looking good thus far...

- Matt

When I got the GPUUG newsletter about the powerline short my first thought was: uh-oh. And from power restored to Seti being online in two hours? Wow. I'll be keeping my fingers crossed for the resyncs. Well, until my fingers hurt at least :)
____________
In an alternate universe, it was a ZX81 that asked for clothes, boots and motorcycle.

Client error 418: I'm a teapot

Tropical Goldfish Fish 13: You're not crazy if you crunch for Seti :)

Dave
Joined: 29 Mar 02
Posts: 774
Credit: 23,193,139
RAC: 0
United Kingdom
Message 1232968 - Posted: 18 May 2012, 19:51:41 UTC - in response to Message 1232936.

Thanks guys for all the work in getting it back up and running again. However, it is a shame the Berkeley Uni web site could not find space in its news to say what had happened to seti@home!
Does not bode well for interstellar contact if we can't get the communications right here, does it?


They did find space - it was here: http://ucbsystems.org/category/active/unscheduled-outage/

Jimmy Gondek
Joined: 1 Oct 06
Posts: 20
Credit: 715,874
RAC: 0
Message 1233036 - Posted: 18 May 2012, 21:40:50 UTC

...being a retired telecom worker, I know you folks had your hands full getting things back up and running! Kudos to everyone for a fine, fine job!... :)

TPCBF
Joined: 18 May 99
Posts: 50
Credit: 918,002
RAC: 1,974
United States
Message 1233054 - Posted: 18 May 2012, 21:55:10 UTC

Well, after some initial problems uploading finished WUs (ok, only 5 of them), they are now gone and reported too, just sitting in PV jail now, as usual.

Got one new one as well, so it looks from here as if things are back to normal... ;-)

Ralf

Al
Joined: 3 Apr 99
Posts: 481
Credit: 51,118,821
RAC: 25,173
United States
Message 1233065 - Posted: 18 May 2012, 22:07:03 UTC - in response to Message 1232899.

They do have UPSes and remote-controlled power switches (or those are part of a UPS or two; they allow them to SSH in and power-cycle a machine). I remember reading a few years ago that at one point all the UPSes were basically just power strips, as the battery capacity in them was basically zero.

Even with a large 3000VA UPS, having 4-6 servers on it, some of which have 30+ HDDs, you're talking 5 minutes on brand new batteries for a graceful shutdown, and you can't just tell them all to shut down all at once. Some have to go down before others.

I know that this was kind of a freak occurrence, but it does appear to show a weakness in the systems. Would it possibly make sense to direct some of our fundraising contributions towards an even more robust UPS setup? Most UPSes of this size allow you to daisy-chain multiple batteries together to allow more time to shut things down before they run out of juice. This makes sense to me; what do you guys there in the thick of it think?
____________

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5683
Credit: 56,060,206
RAC: 49,925
Australia
Message 1233085 - Posted: 18 May 2012, 22:36:25 UTC - in response to Message 1233065.



Attn Matt
There are thousands of WUs that have "Too many errors (may have bug) WU cancelled" from late on the 15th, early on the 16th of May as a result of download errors, but of those that haven't been cancelled many have been downloaded OK today.
We suspect it's a result of the power failure: the Scheduler & download servers were still up, but the WU storage wasn't (or at least wasn't accessible).
____________
Grant
Darwin NT.

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5683
Credit: 56,060,206
RAC: 49,925
Australia
Message 1233088 - Posted: 18 May 2012, 22:39:31 UTC - in response to Message 1233065.

Most UPS's of this size allow you to daisy chain multiple batteries together to allow more time to shut things down before they run out of juice. This makes sense to me, what do you guys there in the thick of it think?

It would be good if they had enough UPS capacity to keep all systems up (inc routers etc) and enable a controlled shutdown.
It would probably require the purchase of more UPSs, as well as batteries.
In my case I just replaced my 7AH UPS batteries with a couple of cheap car batteries. Run time at full load went from a few minutes to about 6 hours.

____________
Grant
Darwin NT.

mg_man1
Joined: 3 Apr 99
Posts: 5
Credit: 13,492,067
RAC: 19,090
United States
Message 1233108 - Posted: 18 May 2012, 23:00:13 UTC

I'm still trying to get all the results uploaded to you guys, and I need new work as well, since my PC has finished everything it had.

____________

Al
Joined: 3 Apr 99
Posts: 481
Credit: 51,118,821
RAC: 25,173
United States
Message 1233109 - Posted: 18 May 2012, 23:00:23 UTC - in response to Message 1233088.

Lol kind of bush, but very effective! :D I have a couple APC 1100 UPS's that I have been looking at replacing batteries in, for a hundred + a crack, and you could get a decent deep cycle battery for that kind of ching. If it wasn't in my house, like in my workshop or something, I might just consider it. But back on topic, I doubt they'd do that, even being quite cost effective. Maybe Matt will chime in and let us know if he feels the upgrade is a good use of funds at this time.
____________

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2233
Credit: 8,421,408
RAC: 4,024
United States
Message 1233175 - Posted: 19 May 2012, 1:51:01 UTC

[very long UPS-related thoughts]

You can go with larger AH batteries, but more modern units have a limit to how much they'll charge, and for how long they'll charge. For example, the APC 1400 that I'm using had two 6AH batteries in it, and I replaced them with 9AH batteries that I had left over from ordering the wrong kit for two Tripp-Lite 1400s a few years ago. 50% more run time.
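The "50% more run time" figure follows from simple proportionality. As a quick worked check, assuming the same battery voltage, the same load, and run time that scales linearly with amp-hours (a simplification, since real lead-acid batteries do not behave quite linearly), the function name below is just illustrative:

```python
def runtime_scale(old_ah, new_ah):
    """Ratio of new to old run time, assuming run time is proportional to amp-hours."""
    if old_ah <= 0 or new_ah <= 0:
        raise ValueError("amp-hour ratings must be positive")
    return new_ah / old_ah


# The swap described above: 6AH batteries replaced with 9AH batteries.
scale = runtime_scale(6, 9)
print(f"{(scale - 1) * 100:.0f}% more run time")
```

This only holds if the charger in the unit can actually bring the larger batteries to full charge, which is the limit mentioned above.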

I have heard of people using car batteries, and it does work most of the time (some of those batteries are 60-150AH), but the problem with those is unless they are sealed/maintenance free, you are very very strongly advised against using them indoors, due to the acid fumes during discharge and charge cycles. Also, you can't keep rack-mount stuff neat and tidy if you are using batteries that won't fit within the chassis of the particular unit.

Regarding being able to remotely shut down in that situation.. I don't think they could anyway. Network connectivity went down as well, not just the servers themselves. Network went down and there was no way to tell the servers to shut down unless someone was already there in the lab when it happened.

However, most OSes have the ability to plug into a UPS and monitor the condition of it. For example, Windows 7 sees mine just fine and I have it set to shut down when the battery reaches 50%. That gives me about 10 minutes from the power going out to the OS doing a graceful shutdown. For servers, I would set them to somewhere around 30 seconds from when AC is lost to beginning the shutdown routine. Only problem with that is if you have 5 servers hooked up to one unit, they can't all plug into the status port. You could, though, hook the server up that is last to go down and set up a script on it that when AC is lost, ssh to the other machines and tell them 'init 0' and whatever else you have to do ('umount /all/remote/mounts', etc).
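A minimal sketch of that "last server tells the others to go down" idea might look like the following. Everything specific here is an assumption: the host names are made up, root SSH access with key-based login is assumed, and whatever UPS monitoring software is in use would have to be configured to call the script when AC is lost. By default it only prints the commands (dry run) so nothing is shut down by accident.

```python
import subprocess

# Hypothetical host names, listed in the order they must go down:
# machines that mount remote storage first, the storage server last.
SHUTDOWN_ORDER = ["compute1", "compute2", "storage1"]


def build_shutdown_commands(hosts, remote_command="init 0"):
    """Return one ssh argument list per host, preserving the given order."""
    return [["ssh", f"root@{host}", remote_command] for host in hosts]


def shut_down_all(hosts, runner=subprocess.run, dry_run=True):
    """Issue the shutdown commands one host at a time.

    With dry_run=True (the default) nothing is executed; the commands
    are only returned so they can be inspected first.
    """
    commands = build_shutdown_commands(hosts)
    if not dry_run:
        for argv in commands:
            try:
                # check=False: keep going even if one host returns an error.
                runner(argv, check=False, timeout=30)
            except subprocess.TimeoutExpired:
                # An unreachable host must not block the remaining shutdowns.
                continue
    return commands


if __name__ == "__main__":
    for argv in shut_down_all(SHUTDOWN_ORDER):
        print(" ".join(argv))
```

Unmounting remote filesystems before 'init 0' could be handled by passing a different remote_command; the right command depends on how each machine is set up.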

Even then, there are still two things to find out. How long does that particular unit last when all the servers are on and consuming power, and how long does it take them to do a graceful shutdown? As you start shutting them down, the run time will increase, but then you need to find out by how much.
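One way to reason about those two unknowns is to add up the energy drawn while the servers go down one after another, then compare it with what the batteries can deliver. The sketch below assumes each server draws a constant wattage until its shutdown completes and that shutdowns run strictly in sequence; the wattages, shutdown times, and usable battery energy are all hypothetical.

```python
def staged_shutdown_energy_wh(servers):
    """Energy drawn while servers shut down one after another.

    servers: list of (watts, shutdown_seconds) in shutdown order.
    While one server is shutting down, it and every server after it
    in the list are still drawing power.
    """
    remaining_watts = sum(watts for watts, _ in servers)
    total_wh = 0.0
    for watts, seconds in servers:
        total_wh += remaining_watts * seconds / 3600
        remaining_watts -= watts
    return total_wh


# Hypothetical numbers: three 400 W servers, each taking 60 s to go down.
plan = [(400, 60), (400, 60), (400, 60)]
needed_wh = staged_shutdown_energy_wh(plan)
usable_wh = 170  # hypothetical usable battery energy
verdict = "fits" if needed_wh <= usable_wh else "does not fit"
print(f"staged shutdown needs {needed_wh:.1f} Wh, {usable_wh} Wh usable: {verdict}")
```

Measured shutdown times and measured power draw would replace the made-up numbers; the structure of the calculation stays the same.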

[/very long UPS-related thoughts]
____________

Linux laptop uptime: 1484d 22h 42m
Ended due to UPS failure, found 14 hours after the fact

soft^spirit
Joined: 18 May 99
Posts: 6374
Credit: 28,628,452
RAC: 1,233
United States
Message 1233186 - Posted: 19 May 2012, 2:18:46 UTC - in response to Message 1233175.

The basic premise of a commercial UPS system is to hold the equipment online long enough for backup generators to come online. They are surge/spike/brownout resistant as well.

In smaller or home use, "long enough to turn things off" is the primary selling point.

Side note: I see the AP data is showing as pretty much nonexistent. Is this a residual problem or just a side effect of Jocelyn apparently taking some time off?
____________

Janice


Copyright © 2014 University of California