Power (May 22 2012)

Message boards : Technical News : Power (May 22 2012)
Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 1235167 - Posted: 22 May 2012, 22:45:02 UTC

During the normal weekly outage last week I took the opportunity to convert georgem into not only the workunit storage server but also the single workunit download server (as opposed to using vader and anakin, which mount georgem's disks over the network). This was a bust. I believe I had Apache cranked way too high and the kernel crashed. Before it went down for the count, some NFS inconsistencies caused corrupt workunits to be generated on georgem. This only happened for a short time, and we didn't notice until they had already been sent out.

In any case, the crash definitely seemed like an OS/software problem and not due to struggling hardware. Nevertheless, I felt pretty heroic about being able to completely stop everything and revert to using vader and anakin as download servers before I left the lab for the day. But that heroism got lost...

Because that night (Tuesday, a week ago) the lab had a sudden, unexpected major power outage. In fact, all the buildings that make up the Space Lab went dark, as well as the nearby Math Sciences Research Institute and the Lawrence Hall of Science down the hill. Of course lots of our systems went down in an instant, others after the UPS batteries drained, and none of it graceful. Even worse: an hour or two after the outage power came back up for only a split second, jolting everything before we had the chance to reach the lab and unplug everything.

Without any known cause there wasn't much we could do. Jeff did come up early the next day and unplugged everything to prevent further power surges. I came up the following day to check in on progress, clean things up, etc., but as I left, the campus electricians were still popping down every manhole and doing laborious tests to find the short, and it seemed like we wouldn't be back up until Monday.

But luckily they soon found the short, and it was in a part of the loop with a spare cable in the same conduit which made replacement far easier. Power came on and stabilized early Friday morning. Jeff, Eric, and I all worked together to power everything back up safely and start the projects. We were very lucky: thus far it seems like we escaped with no hardware damage, nor any data corruption. Some RAID sets had to resync - no big deal. Phew.

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 1235167 · Report as offensive
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1235172 - Posted: 22 May 2012, 22:57:39 UTC - in response to Message 1235167.  

Thanks for all your efforts getting everything back up,

Claggy
ID: 1235172 · Report as offensive
B-Man
Volunteer tester

Joined: 11 Feb 01
Posts: 253
Credit: 147,366
RAC: 0
United States
Message 1235173 - Posted: 22 May 2012, 22:58:02 UTC - in response to Message 1235167.  

Thanks for the update, Matt. I hope the server switching is figured out and works the next time you try.

ID: 1235173 · Report as offensive
soft^spirit
Joined: 18 May 99
Posts: 6497
Credit: 34,134,168
RAC: 0
United States
Message 1235239 - Posted: 23 May 2012, 1:45:42 UTC

You can go back to feeling heroic, Matt.
Janice
ID: 1235239 · Report as offensive
Zeus Fab3r
Joined: 17 Jan 01
Posts: 649
Credit: 275,335,635
RAC: 597
Serbia
Message 1235252 - Posted: 23 May 2012, 2:29:23 UTC

Thanks for everything, Matt.

Nonetheless, there seems to be a little annoyance regarding Astropulse validation. This AP WU is stuck in validation limbo, and there is a whole bunch of them where this one came from. They all belong to a batch that was uploaded and reported right after you guys brought the project back online. I don't know if it is outage related, but we saw similar things in the past with v505.

Who the hell is General Failure and why is he reading my harddisk?¿
ID: 1235252 · Report as offensive
Swibby Bear

Joined: 1 Aug 01
Posts: 246
Credit: 7,945,093
RAC: 0
United States
Message 1235299 - Posted: 23 May 2012, 5:11:05 UTC
Last modified: 23 May 2012, 5:12:30 UTC

I don't suppose that the electricians pulled a gigabit ethernet cable up the conduit for Seti, since they were mucking around there anyway? I would have been happy to run over to Fry's to get the wire for them! (Well, maybe not, since I live in Pennsylvania.)
ID: 1235299 · Report as offensive
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1235425 - Posted: 23 May 2012, 13:28:06 UTC - in response to Message 1235167.  

Thanks not only for briefing us on the power outage, but also for the explanation of the completely separate and unrelated to the outage batch of bad WUs.

[Emphasis not aimed at Matt]

David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1235425 · Report as offensive
BONNSaR

Joined: 9 Nov 04
Posts: 38
Credit: 21,538,589
RAC: 9
Australia
Message 1235668 - Posted: 24 May 2012, 3:26:08 UTC - in response to Message 1235167.  

One thing you might want to consider, to protect the servers when power goes out and then comes back on momentarily (which is usually what does the damage): in laboratories where this had occurred in the past, I have recommended installing a self-latching relay on the main power line to the instruments. The relay has a Set and a Reset switch on it; any electrician can make one up, and it only requires a contactor and a couple of switches. When the power goes off, the contactor drops out, and if power comes back on the contactor stays dropped out. A person then has to press the switch to energise the contactor and put power back onto the circuits, so a person decides when it is appropriate to reconnect the instruments / servers. It takes all the worry out of the situation if a power failure occurs when the lab is not occupied. I had another customer take it one step further and include a temperature control, so that if the aircon fails the power disconnects before everything overheats. That's my 2 cents worth, keep up the good work.......Cheers from Aus
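
The latching behaviour described above can be sketched as a small software model. This is only an illustration: the real thing is a hard-wired contactor, and the class name, the event format, and the 35 °C trip point are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class LatchingContactor:
    """Software model of a self-latching contactor (illustrative only)."""
    max_temp_c: float = 35.0   # hypothetical over-temperature trip point
    closed: bool = False       # True while the load is energised

    def update(self, mains_on: bool, temp_c: float) -> bool:
        """Drop out on power loss or overheating; never re-close by itself."""
        if not mains_on or temp_c > self.max_temp_c:
            self.closed = False
        return self.closed

    def press_set(self, mains_on: bool, temp_c: float) -> bool:
        """A person presses Set: latch in only if conditions are healthy."""
        if mains_on and temp_c <= self.max_temp_c:
            self.closed = True
        return self.closed


def replay(events):
    """Feed (kind, mains_on, temp_c) events through a fresh contactor."""
    relay = LatchingContactor()
    states = []
    for kind, mains_on, temp_c in events:
        if kind == "set":
            relay.press_set(mains_on, temp_c)
        else:
            relay.update(mains_on, temp_c)
        states.append(relay.closed)
    return states


# Made-up event sequence loosely following the outage in this thread.
OUTAGE = [
    ("set", True, 22.0),    # operator energises the rack
    ("tick", False, 22.0),  # mains fail: contactor drops out
    ("tick", True, 22.0),   # split-second return: stays dropped out
    ("tick", False, 22.0),  # mains gone again
    ("tick", True, 22.0),   # power is back, but nobody has pressed Set
    ("set", True, 22.0),    # operator decides it is safe to reconnect
    ("tick", True, 40.0),   # aircon fails: over-temperature trip
]
STATES = replay(OUTAGE)
```

The point of the design is that `update` can only open the contactor; closing it always takes a deliberate `press_set`, which is what protects the equipment from the momentary jolt.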
ID: 1235668 · Report as offensive
Gary Charpentier (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 25 Dec 00
Posts: 31006
Credit: 53,134,872
RAC: 32
United States
Message 1235675 - Posted: 24 May 2012, 3:51:17 UTC - in response to Message 1235668.  

Sounds like a good idea but you don't want a short interruption that the UPS covers to mean someone has to drive in from 30 miles away to press a reset button.

What you likely want is the sense on the UPS side, so the systems don't go down for short drops that the UPS should handle. Then add another sense that auto-reconnects when the UPS batteries reach, say, 90% of charge, indicating that the power has been on for a while and is hopefully stable. Of course, connecting the UPS to the systems so they do a graceful shutdown before the batteries run out is another necessary step. Unfortunately, I don't think you can get a remote power-on in that situation once juice is available. That's something someone needs to work on for the kernel/BIOS.
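
As a rough sketch of that rule, assuming the sensing sits on the UPS output and that the battery charge is readable as a percentage: the function name and parameters below are hypothetical, and the 90% threshold is just the figure suggested above.

```python
def next_load_state(load_on: bool,
                    ups_output_ok: bool,
                    mains_on: bool,
                    battery_pct: float,
                    reconnect_pct: float = 90.0) -> bool:
    """Decide whether the load should be connected on the next step.

    load_on        -- whether the load is connected right now
    ups_output_ok  -- the UPS is still delivering power on its output side
    mains_on       -- utility power is currently present
    battery_pct    -- UPS battery charge, 0-100
    reconnect_pct  -- charge level taken as evidence of stable mains
    """
    if not ups_output_ok:
        # The UPS itself has given up, so the load drops out.
        return False
    if load_on:
        # Short mains drops are ridden through on battery.
        return True
    # After a drop-out, wait until mains is back and the battery has
    # recharged enough to suggest the power has been on for a while.
    return mains_on and battery_pct >= reconnect_pct
```

For example, `next_load_state(True, True, False, 60.0)` keeps the load connected during a short drop, while `next_load_state(False, True, True, 40.0)` holds it off until the battery has recharged.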

ID: 1235675 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13854
Credit: 208,696,464
RAC: 304
Australia
Message 1235710 - Posted: 24 May 2012, 6:26:08 UTC - in response to Message 1235675.  

"Sounds like a good idea but you don't want a short interruption that the UPS covers to mean someone has to drive in from 30 miles away to press a reset button."

Lots of USB controlled switches available these days.
Login & use switch remotely.

Grant
Darwin NT
ID: 1235710 · Report as offensive
BONNSaR

Joined: 9 Nov 04
Posts: 38
Credit: 21,538,589
RAC: 9
Australia
Message 1236252 - Posted: 25 May 2012, 3:38:07 UTC - in response to Message 1235167.  

It was the comment

"Even worse: an hour or two after the outage power came back up for only a split second, jolting everything before we had the chance to reach the lab and unplug everything."

which prompted my suggestion as this is the scenario where the relay works best at protecting equipment.
ID: 1236252 · Report as offensive
ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 21212
Credit: 7,508,002
RAC: 20
United Kingdom
Message 1236515 - Posted: 25 May 2012, 14:03:12 UTC
Last modified: 25 May 2012, 14:05:11 UTC

For the various UPSes:

Just use NUT (Network UPS Tools), along with scripting to do a clean shutdown if the power stays off for more than one or two minutes, or the battery drops to 75% capacity.

If the power is not restored after a few seconds, usually that means it isn't going to come back on! Meanwhile for other occasions, setting one minute is convenient to let you rearrange power sockets without shutting everything down.

Don't rely on being able to run the batteries down to zero. No more than 25% utilisation is far enough...


(And don't forget to include any essential network switches on a UPS.)
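
A minimal sketch of the shutdown rule above, assuming the UPS state is read with NUT's `upsc` client, which prints one `name: value` pair per line (for example `ups.status` and `battery.charge`). The function names and default thresholds are made up for illustration; in practice NUT's own `upsmon` and `upssched` are the usual place for this logic, and this only shows the decision itself.

```python
def parse_upsc(text: str) -> dict:
    """Parse `upsc`-style output ("name: value" per line) into a dict."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            values[key.strip()] = value.strip()
    return values


def should_shut_down(values: dict,
                     seconds_on_battery: float,
                     max_seconds: float = 120.0,
                     min_charge: float = 75.0) -> bool:
    """Return True when a clean shutdown should start.

    Shut down if the UPS has been on battery for max_seconds or longer,
    or if the charge has fallen to min_charge percent or below.
    """
    # ups.status is a space-separated list of flags; "OB" means on battery.
    on_battery = "OB" in values.get("ups.status", "").split()
    if not on_battery:
        return False
    # A missing charge reading is treated as 0, which fails safe.
    charge = float(values.get("battery.charge", "0"))
    return seconds_on_battery >= max_seconds or charge <= min_charge


# Hand-written sample in upsc's output format, not a real log.
SAMPLE = "battery.charge: 80\nups.status: OB\n"
DECISION = should_shut_down(parse_upsc(SAMPLE), seconds_on_battery=30)
```

The one-to-two-minute timer covers the "it isn't going to come back on" case, and the charge floor keeps the batteries from being run down if the timer is set generously.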

Happy crunchin',
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 1236515 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1236825 - Posted: 25 May 2012, 23:46:36 UTC

^ that was along the same lines as my suggestion over in the news thread regarding the power failure. Someone asked about UPSes and I explained it fairly well, I believe.. (seen here)..
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1236825 · Report as offensive
ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 21212
Credit: 7,508,002
RAC: 20
United Kingdom
Message 1236837 - Posted: 26 May 2012, 0:14:07 UTC - in response to Message 1236825.  

Ahhhh... But does Matt ever get around to reading the replies?...


Very good for the comments nonetheless.

Happy crunchin',
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 1236837 · Report as offensive
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1236839 - Posted: 26 May 2012, 0:19:47 UTC

As far as a hardwired latching power lockout upon fail......
It would also be quite easy for it to be built incorporating a time delay relay that would keep it locked out for a certain period of time before dropping back in.
Say, perhaps 5 minutes. If the power comes back on and stays on for 5 minutes, it is assumed the coast is clear, and it's OK to power back up.

Similar relay modules are commonly sold for refrigeration and air conditioning compressors, to prevent them from trying to restart before the system pressure has equalized, which would stall the compressor.
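
The time-delay lockout can be sketched the same way. This assumes the mains state is sampled at known times; the function name and sample format are hypothetical, and 300 seconds is the 5 minutes suggested above.

```python
def reconnect_times(samples, hold_seconds: float = 300.0):
    """Return the times at which the load would be switched back on.

    samples is a time-ordered iterable of (t_seconds, mains_on) pairs.
    The load reconnects only after mains has been continuously present
    for hold_seconds; any dropout restarts the timer and opens the relay.
    """
    stable_since = None   # time at which the current "mains on" run began
    connected = False
    times = []
    for t, mains_on in samples:
        if not mains_on:
            stable_since = None
            connected = False
            continue
        if stable_since is None:
            stable_since = t
        if not connected and t - stable_since >= hold_seconds:
            connected = True
            times.append(t)
    return times


# Made-up timeline: a one-second blip at t=60, then power for good at t=120.
TIMELINE = [(0, False), (60, True), (61, False),
            (120, True), (300, True), (420, True), (480, True)]
RECONNECTS = reconnect_times(TIMELINE)
```

In this timeline the blip at t=60 never reaches the load, and the reconnect happens at t=420, five minutes after power returned for good.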
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1236839 · Report as offensive
KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 1237389 - Posted: 26 May 2012, 15:50:21 UTC

Having trouble with uploads, regardless of what the "system Status" says about the upload server being "online": all my uploads are quitting after about 40 seconds with an "HTTP error" - both production and Beta...
.

Hello, from Albany, CA!...
ID: 1237389 · Report as offensive
Ronsa

Joined: 2 Sep 99
Posts: 7
Credit: 727,680
RAC: 0
United States
Message 1237398 - Posted: 26 May 2012, 16:16:04 UTC

Yes, I noticed it about 4 hours ago, I have several machines that are not uploading. I'm guessing it's a problem with the upload servers and of course since it is the weekend it is not noticed yet. lol Figure I'll just keep crunching and wait
ID: 1237398 · Report as offensive
Bill Walker
Joined: 4 Sep 99
Posts: 3868
Credit: 2,697,267
RAC: 0
Canada
Message 1237450 - Posted: 26 May 2012, 17:41:06 UTC - in response to Message 1237398.  

This is being discussed over in Number Crunching.

ID: 1237450 · Report as offensive
perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 1237456 - Posted: 26 May 2012, 17:48:07 UTC - in response to Message 1237450.  

It's been kicked. Thanks to whoever did this on a weekend.



PROUD MEMBER OF Team Starfire World BOINC
ID: 1237456 · Report as offensive
soft^spirit
Joined: 18 May 99
Posts: 6497
Credit: 34,134,168
RAC: 0
United States
Message 1237886 - Posted: 27 May 2012, 7:28:31 UTC - in response to Message 1237498.  

"Ahhhh... But does Matt ever get around to reading the replies?..."

"Knowing Matt's total dedication to this project over many years, I am quite sure that he does, and I am also sure that he is grateful for any suggestions that are made."

The pattern I have seen is that Matt reads, but seldom (though occasionally) replies. The fact that there are some replies tells me: yes, he reads.



Janice
ID: 1237886 · Report as offensive

©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.