Major Power Outage at SSL

Message boards : News : Major Power Outage at SSL
Profile zoom3+1=4
Volunteer tester
Avatar

Send message
Joined: 30 Nov 03
Posts: 66299
Credit: 55,293,173
RAC: 49
United States
Message 1232866 - Posted: 18 May 2012, 17:36:10 UTC

Great job getting everything back up guys!

So say We all?
Savoir-Faire is everywhere!
The T1 Trust, T1 Class 4-4-4-4 #5550, America's First HST

ID: 1232866 · Report as offensive
Claggy
Volunteer tester

Send message
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1232871 - Posted: 18 May 2012, 17:42:41 UTC - in response to Message 1232863.  

Am I reading this correctly that when the power went out the servers all insta-shut?

I am shocked to learn that a facility like this has no backup...even to do a clean shut down.

No, they have UPSes, so the servers should shut down gracefully.

Claggy
ID: 1232871 · Report as offensive
Jeffrey Petro

Send message
Joined: 24 Apr 12
Posts: 2
Credit: 41,248
RAC: 0
United States
Message 1232877 - Posted: 18 May 2012, 17:49:46 UTC

Claggy,

That's fair...I guess you and I just read certain things differently...

like when I read... A mixture of "all hands on deck" and incredible luck that nothing really got corrupted/fried when the power suddenly disappeared. There are some RAID resyncs happening at the moment, but looking good thus far...

for example, I do not get a warm fuzzy feeling that servers shut down 'gracefully'...lol

ID: 1232877 · Report as offensive
Profile tullio
Volunteer tester

Send message
Joined: 9 Apr 04
Posts: 8797
Credit: 2,930,782
RAC: 1
Italy
Message 1232891 - Posted: 18 May 2012, 18:12:48 UTC
Last modified: 18 May 2012, 18:13:38 UTC

A UPS should allow a system to make a regular shutdown. I had a power failure just now and my system went down ungracefully, UPS notwithstanding. So when power came back I restarted only the router, not the system, since I know that power outages tend to repeat here. When the power failed a second time, the router stayed alive. Evidently the battery in my UPS is not capable of carrying my system, only the router. Anyway, I restarted the system, which ran a full filesystem check (Linux) and is working again.
Tollio
ID: 1232891 · Report as offensive
Cosmic_Ocean
Avatar

Send message
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1232899 - Posted: 18 May 2012, 18:26:10 UTC

They do have UPSes and remote control power switches (or those are part of a UPS or two.. it allows them to SSH in and power cycle a machine). I remember reading a few years ago that all the UPSes were basically just power strips at one point as the battery capacity in them was basically zero.

Even with a large 3000VA UPS, having 4-6 servers on it, some of which have 30+ HDDs, you're talking 5 minutes on brand new batteries for a graceful shutdown, and you can't just tell them all to shut down all at once. Some have to go down before others.
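As a rough back-of-the-envelope check on the "5 minutes" figure, here is a minimal sketch of the run-time arithmetic. The usable battery energy (200 Wh), the per-server draw (400 W) and the inverter efficiency (0.9) are illustrative assumptions, not measured values for any real unit, and the function name is made up for this example.

```python
def estimate_runtime_minutes(battery_wh, load_watts, inverter_efficiency=0.9):
    """Rough UPS run time in minutes.

    Ignores battery ageing and the capacity lost at high discharge
    rates, so real run times will usually be shorter than this.
    """
    if load_watts <= 0:
        raise ValueError("load_watts must be positive")
    usable_wh = battery_wh * inverter_efficiency
    return usable_wh / load_watts * 60.0


# Assumed figures: about 200 Wh of usable battery energy,
# five servers drawing roughly 400 W each.
battery_wh = 200.0
load_watts = 5 * 400.0
print(round(estimate_runtime_minutes(battery_wh, load_watts), 1))
```

With those assumed numbers the estimate comes out at a little over five minutes, which is in line with the figure quoted above; plugging in the real battery rating and a measured load would give a more useful answer.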
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1232899 · Report as offensive
W5DMG - Dave

Send message
Joined: 19 May 99
Posts: 155
Credit: 33,162,251
RAC: 0
United States
Message 1232919 - Posted: 18 May 2012, 18:52:01 UTC

Tis nice to see this back online, thanks for all the hard work guys.
Also thanks to all at Lunatics for keeping us informed.

My uploads are working, no reporting as of yet.
But I know everything will be back to normal shortly.
ID: 1232919 · Report as offensive
Profile Terry Byatt (R.T.Fishall)
Avatar

Send message
Joined: 4 Jan 00
Posts: 19
Credit: 2,262,059
RAC: 2
United Kingdom
Message 1232936 - Posted: 18 May 2012, 19:13:15 UTC
Last modified: 18 May 2012, 19:25:40 UTC

Thanks guys for all the work in getting it back up and running again. However, it is a shame the Berkeley Uni web site could not find space in its news to say what had happened to seti@home!
Does not bode well for inter-stellar contact if we can't get the communications right here, does it?
ID: 1232936 · Report as offensive
Profile John Clark
Volunteer tester
Avatar

Send message
Joined: 29 Sep 99
Posts: 16515
Credit: 4,418,829
RAC: 0
United Kingdom
Message 1232945 - Posted: 18 May 2012, 19:25:10 UTC

Good to see you all back online, and, so far nothing corrupted. As soon as I get home next week, I hope, I can get my rigs rebooted.
It's good to be back amongst friends and colleagues



ID: 1232945 · Report as offensive
Profile Dimly Lit Lightbulb 😀
Volunteer tester
Avatar

Send message
Joined: 30 Aug 08
Posts: 15399
Credit: 7,423,413
RAC: 1
United Kingdom
Message 1232952 - Posted: 18 May 2012, 19:32:32 UTC - in response to Message 1232810.  

To get it all up and back on line from scratch in two hours was a major feat of teamwork, I take my hat off to you and the lads, well done!


Thanks. A mixture of "all hands on deck" and incredible luck that nothing really got corrupted/fried when the power suddenly disappeared. There are some RAID resyncs happening at the moment, but looking good thus far...

- Matt

When I got the GPUUG newsletter about the powerline short my first thought was: uh-oh. And from power restored to Seti being online in two hours? Wow. I'll be keeping my fingers crossed for the resyncs. Well, until my fingers hurt at least :)

Member of the People Encouraging Niceness In Society club.

ID: 1232952 · Report as offensive
Dave

Send message
Joined: 29 Mar 02
Posts: 778
Credit: 25,001,396
RAC: 0
United Kingdom
Message 1232968 - Posted: 18 May 2012, 19:51:41 UTC - in response to Message 1232936.  

Thanks guys for all the work in getting it back up and running again. However, it is a shame the Berkeley Uni web site could not find space in its news to say what had happened to seti@home!
Does not bode well for inter-stellar contact if we can't get the communications right here, does it?


They did find space - it was here: http://ucbsystems.org/category/active/unscheduled-outage/
ID: 1232968 · Report as offensive
Jimmy Gondek

Send message
Joined: 1 Oct 06
Posts: 20
Credit: 715,874
RAC: 0
Message 1233036 - Posted: 18 May 2012, 21:40:50 UTC

...being a retired telecom I know you folks had your hands full getting things back up and running! Kudos to everyone for a fine, fine job!... :)
ID: 1233036 · Report as offensive
TPCBF

Send message
Joined: 18 May 99
Posts: 54
Credit: 4,594,980
RAC: 0
United States
Message 1233054 - Posted: 18 May 2012, 21:55:10 UTC

Well, after some initial problems uploading finished WUs (ok, only 5 of them), they are now gone and reported too, just sitting in PV jail now, as usual.

Got one new one as well, so it looks from here as if things are back to normal... ;-)

Ralf
ID: 1233054 · Report as offensive
Al
Avatar

Send message
Joined: 3 Apr 99
Posts: 1682
Credit: 477,343,364
RAC: 482
United States
Message 1233065 - Posted: 18 May 2012, 22:07:03 UTC - in response to Message 1232899.  

They do have UPSes and remote control power switches (or those are part of a UPS or two.. it allows them to SSH in and power cycle a machine). I remember reading a few years ago that all the UPSes were basically just power strips at one point as the battery capacity in them was basically zero.

Even with a large 3000VA UPS, having 4-6 servers on it, some of which have 30+ HDDs, you're talking 5 minutes on brand new batteries for a graceful shutdown, and you can't just tell them all to shut down all at once. Some have to go down before others.

I know that this was kind of a freak occurrence, but it does appear to show a weakness in the systems. Would it possibly make sense to direct some of our fundraising contributions towards an even more robust UPS setup? Most UPSes of this size allow you to daisy-chain multiple batteries together to allow more time to shut things down before they run out of juice. This makes sense to me; what do you guys there in the thick of it think?

ID: 1233065 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13848
Credit: 208,696,464
RAC: 304
Australia
Message 1233085 - Posted: 18 May 2012, 22:36:25 UTC - in response to Message 1233065.  



Attn Matt
There are thousands of WUs that have "Too many errors (may have bug) WU cancelled" from late on the 15th and early on the 16th of May as a result of download errors, but of those that haven't been cancelled, many have been downloaded OK today.
We suspect it's a result of the power failure: the Scheduler & download servers were still up, but the WU storage wasn't (or at least wasn't accessible).
Grant
Darwin NT
ID: 1233085 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13848
Credit: 208,696,464
RAC: 304
Australia
Message 1233088 - Posted: 18 May 2012, 22:39:31 UTC - in response to Message 1233065.  

Most UPS's of this size allow you to daisy chain multiple batteries together to allow more time to shut things down before they run out of juice. This makes sense to me, what do you guys there in the thick of it think?

It would be good if they had enough UPS capacity to keep all systems up (inc routers etc) and enable a controlled shutdown.
It would probably require the purchase of more UPSs, as well as batteries.
In my case I just replaced my 7AH UPS batteries with a couple of cheap car batteries. Run time at full load went from a few minutes to about 6 hours.

Grant
Darwin NT
ID: 1233088 · Report as offensive
mg_man1

Send message
Joined: 3 Apr 99
Posts: 5
Credit: 41,714,879
RAC: 0
United States
Message 1233108 - Posted: 18 May 2012, 23:00:13 UTC

I'm still trying to get all the results uploaded to you guys, and I need new work as well, as my PC has finished everything it had.

ID: 1233108 · Report as offensive
Al
Avatar

Send message
Joined: 3 Apr 99
Posts: 1682
Credit: 477,343,364
RAC: 482
United States
Message 1233109 - Posted: 18 May 2012, 23:00:23 UTC - in response to Message 1233088.  

Lol kind of bush, but very effective! :D I have a couple APC 1100 UPS's that I have been looking at replacing batteries in, for a hundred + a crack, and you could get a decent deep cycle battery for that kind of ching. If it wasn't in my house, like in my workshop or something, I might just consider it. But back on topic, I doubt they'd do that, even being quite cost effective. Maybe Matt will chime in and let us know if he feels the upgrade is a good use of funds at this time.

ID: 1233109 · Report as offensive
Cosmic_Ocean
Avatar

Send message
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1233175 - Posted: 19 May 2012, 1:51:01 UTC

[very long UPS-related thoughts]

You can go with larger AH batteries, but more modern units have a limit to how much they'll charge, and for how long they'll charge. For example, the APC 1400 that I'm using had two 6AH batteries in it, and I replaced them with 9AH batteries that I had left over from ordering the wrong kit for two Tripp-Lite 1400s a few years ago. 50% more run time.

I have heard of people using car batteries, and it does work most of the time (some of those batteries are 60-150AH), but the problem with those is unless they are sealed/maintenance free, you are very very strongly advised against using them indoors, due to the acid fumes during discharge and charge cycles. Also, you can't keep rack-mount stuff neat and tidy if you are using batteries that won't fit within the chassis of the particular unit.

Regarding being able to remotely shut down in that situation.. I don't think they could anyway. Network connectivity went down as well, not just the servers themselves. Network went down and there was no way to tell the servers to shut down unless someone was already there in the lab when it happened.

However, most OSes have the ability to plug into a UPS and monitor the condition of it. For example, Windows 7 sees mine just fine and I have it set to shut down when the battery reaches 50%. That gives me about 10 minutes from the power going out to the OS doing a graceful shutdown. For servers, I would set them to somewhere around 30 seconds from when AC is lost to beginning the shutdown routine. Only problem with that is if you have 5 servers hooked up to one unit, they can't all plug into the status port. You could, though, hook the server up that is last to go down and set up a script on it that when AC is lost, ssh to the other machines and tell them 'init 0' and whatever else you have to do ('umount /all/remote/mounts', etc).
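The "last server tells the others to shut down" idea could be sketched roughly as below. This is only an illustration under stated assumptions: the host names are hypothetical, the script assumes key-based root SSH access is already set up, and it defaults to a dry run that prints the commands instead of executing them. How the script gets triggered when AC is lost depends on the UPS monitoring software and is not shown.

```python
import subprocess

# Hypothetical host names, listed in the order they must go down.
# The machine running this script would shut itself down last.
SHUTDOWN_ORDER = ["compute1", "compute2", "storage1"]
REMOTE_COMMAND = "init 0"  # the command suggested in the post above


def build_ssh_command(host, remote_command=REMOTE_COMMAND):
    """Build the ssh argument list for one host (no shell involved)."""
    return [
        "ssh",
        "-o", "BatchMode=yes",      # never stop to ask for a password
        "-o", "ConnectTimeout=5",   # give up quickly on a dead host
        f"root@{host}",
        remote_command,
    ]


def shut_down_all(hosts, dry_run=True):
    """Send the shutdown command to each host in order.

    Returns a list of (host, ok) pairs. In dry-run mode the commands
    are only printed and every host is reported as ok.
    """
    results = []
    for host in hosts:
        cmd = build_ssh_command(host)
        if dry_run:
            print(" ".join(cmd))
            results.append((host, True))
            continue
        try:
            completed = subprocess.run(cmd, timeout=30)
            # A host that drops the connection while halting may
            # return a non-zero code even though it is going down.
            results.append((host, completed.returncode == 0))
        except (subprocess.TimeoutExpired, OSError):
            results.append((host, False))
    return results


if __name__ == "__main__":
    shut_down_all(SHUTDOWN_ORDER, dry_run=True)
```

Unmounting remote filesystems first, as mentioned above, would just be an extra remote command sent to the relevant hosts before the final `init 0`.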

Even then, there are still two things to find out. How long does that particular unit last when all the servers are on and consuming power, and how long does it take them to do a graceful shutdown? As you start shutting them down, the run time will increase, but then you need to find out by how much.
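Once those two numbers have been measured, the energy a staged shutdown needs can be added up and compared with what the batteries hold. The sketch below assumes the servers go down strictly one after another and uses made-up wattages and shutdown times; only the 30-second delay comes from the suggestion above.

```python
def staged_shutdown_energy_wh(servers, delay_seconds=30.0):
    """Energy in watt-hours used by a one-at-a-time shutdown.

    servers: list of (name, watts, shutdown_seconds) in shutdown order.
    Every server draws its full load until its own shutdown finishes.
    """
    remaining_watts = sum(watts for _, watts, _ in servers)
    # Everything keeps running during the initial wait after AC is lost.
    energy_ws = remaining_watts * delay_seconds
    for _, watts, seconds in servers:
        energy_ws += remaining_watts * seconds
        remaining_watts -= watts  # this server is now off
    return energy_ws / 3600.0


# Assumed figures for illustration only.
servers = [
    ("compute1", 400.0, 60.0),
    ("compute2", 400.0, 60.0),
    ("storage1", 400.0, 120.0),
]
needed_wh = staged_shutdown_energy_wh(servers)
usable_wh = 180.0  # assumed usable battery energy
print(f"needed {needed_wh:.1f} Wh, fits: {needed_wh <= usable_wh}")
```

If the needed energy comes out close to or above the usable figure, that points to either more battery or shutting down some machines in parallel.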

[/very long UPS-related thoughts]
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1233175 · Report as offensive
Profile soft^spirit
Avatar

Send message
Joined: 18 May 99
Posts: 6497
Credit: 34,134,168
RAC: 0
United States
Message 1233186 - Posted: 19 May 2012, 2:18:46 UTC - in response to Message 1233175.  

The basic premise of a commercial UPS system is to hold the equipment online long enough for backup generators to come online. They are surge/spike/brownout resistant as well.

In smaller or home use, "long enough to turn things off" is the primary selling point.

Side note: I see the AP data is showing as pretty much non-existent. Is this a residual problem or just a side effect of Jocelyn apparently taking some time off?
Janice
ID: 1233186 · Report as offensive
kittyman
Volunteer tester
Avatar

Send message
Joined: 9 Jul 00
Posts: 51477
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1233192 - Posted: 19 May 2012, 2:33:09 UTC - in response to Message 1233186.  
Last modified: 19 May 2012, 2:33:26 UTC


Side note: I see the AP data is showing as pretty much non-existent. Is this a residual problem or just a side effect of Jocelyn apparently taking some time off?

I think it's because the AP data shown is still for the little bit of v505 still in the wild.
It does not reflect the v6 info yet.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1233192 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.