Power (May 22 2012)


Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 1235167 - Posted: 22 May 2012, 22:45:02 UTC

During the normal weekly outage last week I took the opportunity to convert georgem into not only the workunit storage server but also the single workunit download server (as opposed to using vader and anakin, which mount georgem's disks over the network). This was a bust. I believe I had Apache cranked way too high and the kernel crashed. Before it completely went down for the count there were some NFS inconsistencies that caused corrupt workunits to be generated on georgem. This only happened for a short time, and we didn't notice until they had already been sent out.

In any case the crash definitely seemed like an OS/software problem and not due to struggling hardware. Nevertheless I felt pretty heroic about being able to completely stop everything and revert back to using vader and anakin as download servers before I left the lab for the day. But that heroism got lost...

Because that night (Tuesday, a week ago) the lab had a sudden, unexpected major power outage. In fact, all the buildings that make up the Space Lab went dark, as well as the nearby Math Sciences Research Institute and the Lawrence Hall of Science down the hill. Of course lots of our systems went down in an instant, others after the UPS batteries drained, and none of it graceful. Even worse: an hour or two after the outage power came back up for only a split second, jolting everything before we had the chance to reach the lab and unplug everything.

Without any known cause there wasn't much we could do. Jeff did come up early the next day and unplugged everything to prevent further power surges. I came up the following day to check in on progress, clean things up, etc., but as I left, the campus electricians were still popping down every manhole and doing laborious tests to find the short, and it seemed like we wouldn't be back up until Monday.

But luckily they soon found the short, and it was in a part of the loop with a spare cable in the same conduit, which made replacement far easier. Power came on and stabilized early Friday morning. Jeff, Eric, and I all worked together to power everything back up safely and start the projects. We were very lucky: thus far it seems like we escaped with no hardware damage and no data corruption. Some RAID sets had to resync - no big deal. Phew.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Claggy
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4067
Credit: 32,899,873
RAC: 7,643
United Kingdom
Message 1235172 - Posted: 22 May 2012, 22:57:39 UTC - in response to Message 1235167.

Thanks for all your efforts getting everything back up,

Claggy

B-Man
Volunteer tester
Joined: 11 Feb 01
Posts: 253
Credit: 147,366
RAC: 0
United States
Message 1235173 - Posted: 22 May 2012, 22:58:02 UTC - in response to Message 1235167.

Thanks for the update Matt. I hope the server switching is figured out and works the next time you try.

____________

Profile soft^spirit
Joined: 18 May 99
Posts: 6374
Credit: 28,631,059
RAC: 94
United States
Message 1235239 - Posted: 23 May 2012, 1:45:42 UTC

You can go back to feeling Heroic Matt.
____________

Janice

Profile Zeus Fab3r
Joined: 17 Jan 01
Posts: 642
Credit: 92,933,455
RAC: 108,071
Serbia
Message 1235252 - Posted: 23 May 2012, 2:29:23 UTC

Thanks for everything Matt.

Nonetheless, there seems to be a little annoyance regarding Astropulse validation. This AP WU is stuck in validation limbo, and there is a whole bunch more where this one came from. They all belong to a batch that was uploaded and reported right after you guys brought the project back online. I don't know if it is outage related, but we saw similar things in the past with v505.
____________

Who the hell is General Failure and why is he reading my harddisk?¿

Swibby Bear
Joined: 1 Aug 01
Posts: 236
Credit: 7,276,138
RAC: 415
United States
Message 1235299 - Posted: 23 May 2012, 5:11:05 UTC
Last modified: 23 May 2012, 5:12:30 UTC

I don't suppose that the electricians pulled a gigabit ethernet cable up the conduit for Seti, since they were mucking around there anyway? I would have been happy to run over to Fry's to get the wire for them! (Well, maybe not, since I live in Pennsylvania.)

N9JFE David S
Project donor
Volunteer tester
Joined: 4 Oct 99
Posts: 11162
Credit: 13,952,784
RAC: 12,424
United States
Message 1235425 - Posted: 23 May 2012, 13:28:06 UTC - in response to Message 1235167.

Thanks not only for briefing us on the power outage, but also for the explanation of the completely separate and unrelated to the outage batch of bad WUs.

[Emphasis not aimed at Matt]

____________
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


BONNSaR
Joined: 9 Nov 04
Posts: 9
Credit: 4,664,441
RAC: 20,707
Australia
Message 1235668 - Posted: 24 May 2012, 3:26:08 UTC - in response to Message 1235167.

Here is one thing you might want to consider to protect the servers in a situation like this, where the power goes out and then comes back on momentarily, which is usually what does the damage. In laboratories where this has occurred in the past I have recommended installing a self-latching relay on the main power line to the instruments. The relay has a Set and a Reset switch on it; any electrician can make one up, and it only requires a contactor and a couple of switches. When the power goes off, the contactor drops out, and if the power comes back on the contactor stays dropped out. A person then has to press the switch to energise the contactor and put power back onto the circuits, so a person decides when it is appropriate to reconnect power to the instruments / servers. This takes all the worry out of the situation if a power failure occurs when the lab is not occupied. I had another customer take it one step further and include a temperature control, so that if the aircon fails the power disconnects before everything overheats. That's my 2 cents' worth, keep up the good work... Cheers from Aus
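
The latching behaviour described above can be sketched as a tiny state machine. This is only an illustrative model, not a wiring diagram: the class name `LatchingContactor`, its `step` method, and the sample timeline are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class LatchingContactor:
    """Toy model of a self-latching contactor with Set and Reset buttons (hypothetical)."""
    closed: bool = False  # True while the contactor is pulled in

    def step(self, mains_on: bool, set_pressed: bool = False,
             reset_pressed: bool = False) -> bool:
        """Advance one time step and return whether the load is powered."""
        if not mains_on or reset_pressed:
            # Coil loses power (or the operator presses Reset): drop out.
            self.closed = False
        elif set_pressed:
            # Mains is present and a person pressed Set: pull in and hold.
            self.closed = True
        # In every other case the previous state is kept: that is the latch.
        return self.closed and mains_on


relay = LatchingContactor()
# Each entry is (mains_on, set_pressed).
timeline = [
    (True, True),    # operator presses Set while mains is healthy
    (True, False),   # normal running
    (False, False),  # outage
    (True, False),   # mains returns briefly
    (False, False),  # mains drops again
    (True, False),   # mains back, nobody on site yet
    (True, True),    # operator arrives and presses Set
]
load = [relay.step(mains, set_pressed) for mains, set_pressed in timeline]
print(load)
```

In this model the brief return of mains never reaches the load, because nobody pressed Set in between.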
____________

Profile Gary Charpentier
Project donor
Volunteer tester
Joined: 25 Dec 00
Posts: 12402
Credit: 6,710,870
RAC: 8,920
United States
Message 1235675 - Posted: 24 May 2012, 3:51:17 UTC - in response to Message 1235668.

Sounds like a good idea but you don't want a short interruption that the UPS covers to mean someone has to drive in from 30 miles away to press a reset button.

What you likely want is the sense on the UPS side, so the servers don't go down for short drops, as the UPS should handle those. Then add another sense that auto-reconnects when the UPS batteries reach, say, 90% of charge, indicating that the power has been on for a while and is hence hopefully stable. Of course, connecting the UPS to the systems so they do a graceful shutdown before the batteries run out is another necessary step. Unfortunately, I don't think you can get a remote power-on in that situation once juice is available. That is something someone needs to work on for the kernel/BIOS.
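
A rough sketch of that reconnect rule, sensing on the UPS side, might look like the following. The function name and the `cutoff_pct` value are assumptions made for the example; only the 90% reconnect figure comes from the suggestion above.

```python
def load_should_be_connected(currently_connected: bool,
                             mains_on: bool,
                             battery_pct: float,
                             reconnect_pct: float = 90.0,
                             cutoff_pct: float = 10.0) -> bool:
    """Decide whether the servers should be fed from the UPS output.

    cutoff_pct is a hypothetical "batteries nearly empty" level; the
    servers are assumed to have shut down gracefully before it is reached.
    """
    if currently_connected:
        # Ride through short drops: stay connected while the UPS can carry us.
        return mains_on or battery_pct > cutoff_pct
    # Once disconnected, wait for mains AND a well-recharged battery,
    # which suggests the power has been back for a while.
    return mains_on and battery_pct >= reconnect_pct


# Each sample is (mains_on, battery charge in percent).
samples = [(True, 100), (False, 80), (True, 82), (False, 40),
           (False, 10), (True, 15), (True, 60), (True, 90)]
connected = True
states = []
for mains_on, battery_pct in samples:
    connected = load_should_be_connected(connected, mains_on, battery_pct)
    states.append(connected)
print(states)
```

The gap between the two thresholds acts as hysteresis, so a split-second return of mains with a flat battery does not re-energise anything.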

____________

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5791
Credit: 58,023,129
RAC: 48,024
Australia
Message 1235710 - Posted: 24 May 2012, 6:26:08 UTC - in response to Message 1235675.

Sounds like a good idea but you don't want a short interruption that the UPS covers to mean someone has to drive in from 30 miles away to press a reset button.

Lots of USB controlled switches available these days.
Login & use switch remotely.

____________
Grant
Darwin NT.

BONNSaR
Joined: 9 Nov 04
Posts: 9
Credit: 4,664,441
RAC: 20,707
Australia
Message 1236252 - Posted: 25 May 2012, 3:38:07 UTC - in response to Message 1235167.

It was the comment

"Even worse: an hour or two after the outage power came back up for only a split second, jolting everything before we had the chance to reach the lab and unplug everything."

which prompted my suggestion as this is the scenario where the relay works best at protecting equipment.
____________

Profile ML1
Volunteer tester
Joined: 25 Nov 01
Posts: 8377
Credit: 4,106,343
RAC: 1,047
United Kingdom
Message 1236515 - Posted: 25 May 2012, 14:03:12 UTC
Last modified: 25 May 2012, 14:05:11 UTC

For the various UPSes:

Just use NUT (Network UPS Tools), along with scripting to do a clean shutdown if the power stays off for more than one or two minutes, or if the battery drops to 75% capacity.

If the power is not restored after a few seconds, usually that means it isn't going to come back on! Meanwhile for other occasions, setting one minute is convenient to let you rearrange power sockets without shutting everything down.

Don't rely on being able to run the batteries down to zero. No more than 25% utilisation is far enough...


(And don't forget to include any essential network switches on a UPS.)
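
As a minimal sketch of the shutdown rule described in this post, assuming the values come from NUT's `ups.status` and `battery.charge` variables (for example as printed by `upsc`), the decision could be written like this. The function name and defaults are hypothetical, and tracking how long the UPS has been on battery is left to the calling script.

```python
def should_shut_down(ups_status: str,
                     battery_charge: float,
                     seconds_on_battery: float,
                     max_seconds_on_battery: float = 60.0,
                     min_charge_pct: float = 75.0) -> bool:
    """Return True when a clean shutdown should be started.

    ups_status is a NUT-style status string such as "OL CHRG" (on line)
    or "OB DISCHRG" (on battery). The thresholds mirror the post: about
    one minute without mains, or the battery down to 75%.
    """
    flags = ups_status.split()
    if "OB" not in flags:
        # Mains is present, so there is nothing to do.
        return False
    too_long = seconds_on_battery > max_seconds_on_battery
    too_low = battery_charge <= min_charge_pct
    return too_long or too_low


print(should_shut_down("OL CHRG", 50.0, 0.0))
print(should_shut_down("OB DISCHRG", 95.0, 30.0))
print(should_shut_down("OB DISCHRG", 95.0, 61.0))
print(should_shut_down("OB DISCHRG", 75.0, 10.0))
```

Keeping the decision in a pure function like this makes the thresholds easy to test without a UPS attached.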

Happy crunchin',
Martin
____________
See new freedom: Mageia4
Linux Voice See & try out your OS Freedom!
The Future is what We make IT (GPLv3)

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2245
Credit: 8,596,723
RAC: 4,305
United States
Message 1236825 - Posted: 25 May 2012, 23:46:36 UTC

^ That was along the same lines as my suggestion over in the news thread regarding the power failure. Someone asked about UPSes and I explained it fairly well, I believe (seen here).
____________

Linux laptop uptime: 1484d 22h 42m
Ended due to UPS failure, found 14 hours after the fact

Profile ML1
Volunteer tester
Joined: 25 Nov 01
Posts: 8377
Credit: 4,106,343
RAC: 1,047
United Kingdom
Message 1236837 - Posted: 26 May 2012, 0:14:07 UTC - in response to Message 1236825.

Ahhhh... But does Matt ever get around to reading the replies?...


Very good for the comments nonetheless.

Happy crunchin',
Martin

____________
See new freedom: Mageia4
Linux Voice See & try out your OS Freedom!
The Future is what We make IT (GPLv3)

msattler
Project donor
Volunteer tester
Joined: 9 Jul 00
Posts: 38922
Credit: 578,653,897
RAC: 515,673
United States
Message 1236839 - Posted: 26 May 2012, 0:19:47 UTC

As far as a hardwired latching power lockout upon fail......
It would also be quite easy for it to be built incorporating a time delay relay that would keep it locked out for a certain period of time before dropping back in.
Say, perhaps 5 minutes. If the power comes back on and stays on for 5 minutes, it is assumed the coast is clear, and it's OK to power back up.

Similar relay modules are commonly sold for refrigeration and air conditioning compressors, to prevent them from trying to restart before the system pressure has equalized, which would stall the compressor.
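
The time-delay lockout can be illustrated with a short simulation. This is a sketch under simple assumptions: mains is sampled once a minute, the load starts disconnected, and the function name `delayed_reconnect` is invented for the example.

```python
def delayed_reconnect(mains_samples, hold_off_steps):
    """Return the load state for each mains sample.

    The load is energised only once mains has been seen on for
    hold_off_steps consecutive samples; any drop restarts the count.
    """
    consecutive_on = 0
    load_states = []
    for mains_on in mains_samples:
        consecutive_on = consecutive_on + 1 if mains_on else 0
        load_states.append(mains_on and consecutive_on >= hold_off_steps)
    return load_states


# One sample per minute, with a 5-sample hold-off (roughly 5 minutes).
mains = [False, True, False, True, True, True, True, True, True]
result = delayed_reconnect(mains, hold_off_steps=5)
print(result)
```

The single "on" sample in the second minute resets to zero on the following drop, so a momentary jolt never reaches the load.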
____________
*********************************************
Embrace your inner kitty...ya know ya wanna!

I have met a few friends in my life.
Most were cats.

Profile KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 1923
Credit: 9,755,005
RAC: 16,976
United States
Message 1237389 - Posted: 26 May 2012, 15:50:21 UTC

Having trouble with uploads, regardless of what the "system Status" says about the upload server being "online": all my uploads are quitting after about 40 seconds with an "HTTP error" - both production and Beta...
____________
.

Ronsa
Joined: 2 Sep 99
Posts: 7
Credit: 727,680
RAC: 0
United States
Message 1237398 - Posted: 26 May 2012, 16:16:04 UTC

Yes, I noticed it about 4 hours ago, I have several machines that are not uploading. I'm guessing it's a problem with the upload servers and of course since it is the weekend it is not noticed yet. lol Figure I'll just keep crunching and wait

Profile Bill Walker
Joined: 4 Sep 99
Posts: 3352
Credit: 2,041,137
RAC: 2,080
Canada
Message 1237450 - Posted: 26 May 2012, 17:41:06 UTC - in response to Message 1237398.

Yes, I noticed it about 4 hours ago, I have several machines that are not uploading. I'm guessing it's a problem with the upload servers and of course since it is the weekend it is not noticed yet. lol Figure I'll just keep crunching and wait


This is being discussed over in Number Crunching.
____________

Profile perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 15,322,223
RAC: 11,632
United States
Message 1237456 - Posted: 26 May 2012, 17:48:07 UTC - in response to Message 1237450.

It's been kicked. Thanks to whoever did this on a weekend.

____________


PROUD MEMBER OF Team Starfire World BOINC

Profile Chris S
Project donor
Volunteer tester
Joined: 19 Nov 00
Posts: 31452
Credit: 12,178,942
RAC: 28,729
United Kingdom
Message 1237498 - Posted: 26 May 2012, 18:40:47 UTC

Ahhhh... But does Matt ever get around to reading the replies?...

Knowing Matt's total dedication to this project over many years, I am quite sure that he does, and I am also sure that he is grateful for any suggestions that are made.


Copyright © 2014 University of California