Power (May 22 2012)
Author | Message |
---|---|
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
During the normal weekly outage last week I took the opportunity to convert georgem not only into the workunit storage server, but a single workunit download server (as opposed to using vader and anakin, which are mounting georgem's disks over the network). This was a bust. I believe I had apache cranked way too high and the kernel crashed. Before it completely went down for the count there were some NFS inconsistencies causing corrupt workunits to be generated on georgem, which only happened for a short time and we didn't notice until they were already sent out. In any case the crash definitely seemed like an OS/software problem and not due to struggling hardware. Nevertheless I felt pretty heroic about being able to completely stop everything and revert back to using vader and anakin as download servers before I left the lab for the day. But that heroism got lost... Because that night (Tuesday, a week ago) the lab had a sudden, unexpected major power outage. In fact, all the buildings that make up the Space Lab went dark, as well as the nearby Math Sciences Research Institute and the Lawrence Hall of Science down the hill. Of course lots of our systems went down in an instant, others after the UPS batteries drained, and none of it graceful. Even worse: an hour or two after the outage power came back up for only a split second, jolting everything before we had the chance to reach the lab and unplug everything. Without any known cause there wasn't much we could do. Jeff did come up early the next day and unplugged everything to prevent further power surges. I came up the following day to check in on progress, clean things up, etc. but as I left the campus electricians were still popping down every manhole and doing laborious tests to find the short, and it seemed like we wouldn't be back up until Monday. But luckily they soon found the short, and it was in a part of the loop with a spare cable in the same conduit which made replacement far easier. 
Power came on and stabilized early Friday morning. Jeff, Eric, and I all worked together to power everything back up safely and start the projects. We were very lucky: thus far it seems like we escaped with no hardware damage, nor any data corruption. Some RAID sets had to resync - no big deal. Phew. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
Thanks for all your efforts getting everything back up, Claggy |
B-Man Joined: 11 Feb 01 Posts: 253 Credit: 147,366 RAC: 0 |
Thanks for the update Matt. I hope the server switching is figured out and works the next time you try. |
soft^spirit Joined: 18 May 99 Posts: 6497 Credit: 34,134,168 RAC: 0 |
You can go back to feeling Heroic Matt. Janice |
Zeus Fab3r Joined: 17 Jan 01 Posts: 649 Credit: 275,335,635 RAC: 597 |
Thanks for everything Matt. Nonetheless, there seems to be a little annoyance regarding Astropulse validation. This AP WU is stuck in validating limbo, and there are a whole bunch of them where this one came from. They all belong to a batch that was uploaded and reported right after you guys brought the project back online. I don't know if it is outage related, but we saw similar things in the past with v505. Who the hell is General Failure and why is he reading my harddisk?¿ |
Swibby Bear Joined: 1 Aug 01 Posts: 246 Credit: 7,945,093 RAC: 0 |
I don't suppose that the electricians pulled a gigabit ethernet cable up the conduit for Seti, since they were mucking around there anyway? I would have been happy to run over to Fry's to get the wire for them! (Well, maybe not, since I live in Pennsylvania.) |
David S Joined: 4 Oct 99 Posts: 18352 Credit: 27,761,924 RAC: 12 |
Thanks not only for briefing us on the power outage, but also for the explanation of the completely separate and unrelated to the outage batch of bad WUs. [Emphasis not aimed at Matt] David Sitting on my butt while others boldly go, Waiting for a message from a small furry creature from Alpha Centauri. |
BONNSaR Joined: 9 Nov 04 Posts: 38 Credit: 21,538,589 RAC: 9 |
Here is one thing you might want to consider to protect the servers when the power goes out and then comes back on momentarily, which is usually what does the damage. In laboratories where this had occurred in the past I have recommended installing a self-latching relay on the main power line to the instruments. The relay has a Set and a Reset switch on it; any electrician can make this up, and it only requires a contactor and a couple of switches. When the power goes off, the contactor drops out; if power comes back on, the contactor stays dropped out. This way a person has to press the switch to energise the contactor and put power back onto the circuits, so a person decides when it is appropriate to connect power back to the instruments / servers. It takes all the worry out of the situation if a power failure occurs when the lab is not occupied. I had another customer take it one step further and include a temperature control, so if the aircon fails the power disconnects before everything overheats. That's my 2 cents worth, keep up the good work.......Cheers from Aus |
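The latching behaviour described in the post above can be sketched as a tiny state machine. This is only an illustrative Python model, not control software for a real contactor; the class name `LatchingContactor` and its method names are invented for the example.

```python
class LatchingContactor:
    """Toy model of a self-latching (no-volt release) contactor.

    Hypothetical names throughout; this only mirrors the logic in the post.
    """

    def __init__(self):
        self.mains_on = False  # is supply voltage present?
        self.closed = False    # is the contactor latched in?

    def set_mains(self, on: bool) -> None:
        self.mains_on = on
        if not on:
            # Coil loses power, so the contactor drops out and stays out.
            self.closed = False

    def press_set(self) -> None:
        # The Set switch can only latch the contactor while mains is present.
        if self.mains_on:
            self.closed = True

    def press_reset(self) -> None:
        # The Reset switch drops the contactor out manually.
        self.closed = False

    @property
    def load_powered(self) -> bool:
        return self.mains_on and self.closed
```

With this model, a split-second return of mains after an outage never reaches the load, because nobody has pressed Set since the contactor dropped out.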
Gary Charpentier Joined: 25 Dec 00 Posts: 31006 Credit: 53,134,872 RAC: 32 |
Sounds like a good idea, but you don't want a short interruption that the UPS covers to mean someone has to drive in from 30 miles away to press a reset button. What you likely want is the sense on the UPS side, so the systems don't go down for short drops that the UPS should handle. Then add another sense that auto-reconnects when the UPS batteries reach, say, 90% of charge, indicating that the power has been on for a while and is hence hopefully stable. Of course, connecting the UPS to the systems to do a graceful shutdown before they run out of battery is another necessary step. Unfortunately I don't think in that situation that you can get a remote power-on when juice is available. Something someone needs to work on for the kernel/bios. |
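The two senses suggested in this reply amount to a simple rule with hysteresis. The sketch below is an assumption-laden illustration in Python: the function name `next_state`, its parameters, and the idea of feeding it a boolean "UPS output OK" signal are all hypothetical, and the 90% figure is just the value floated in the post.

```python
def next_state(connected: bool, ups_output_ok: bool,
               battery_pct: float, reconnect_pct: float = 90.0) -> bool:
    """Return whether the load should be connected after this reading.

    Hypothetical rule: sense the UPS output (not the mains), and only
    reconnect once the batteries have recharged past reconnect_pct.
    """
    if not ups_output_ok:
        # The UPS itself has given up: the load is down regardless.
        return False
    if connected:
        # Short mains drops are covered by the UPS, so stay connected.
        return True
    # Disconnected: wait until the batteries show mains has been back a while.
    return battery_pct >= reconnect_pct
```

For example, `next_state(True, True, 40.0)` keeps the load connected during a brief drop, while `next_state(False, True, 50.0)` holds it off until the charge recovers.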
Grant (SSSF) Joined: 19 Aug 99 Posts: 13854 Credit: 208,696,464 RAC: 304 |
Lots of USB controlled switches available these days. Login & use switch remotely. Grant Darwin NT |
BONNSaR Joined: 9 Nov 04 Posts: 38 Credit: 21,538,589 RAC: 9 |
It was the comment "Even worse: an hour or two after the outage power came back up for only a split second, jolting everything before we had the chance to reach the lab and unplug everything." which prompted my suggestion as this is the scenario where the relay works best at protecting equipment. |
ML1 Joined: 25 Nov 01 Posts: 21212 Credit: 7,508,002 RAC: 20 |
For the various UPSes: Just use NUT, along with scripting for doing a clean shutdown if the power stays off for more than one or two minutes, or you hit 75% battery capacity. If the power is not restored after a few seconds, usually that means it isn't going to come back on! Meanwhile for other occasions, setting one minute is convenient to let you rearrange power sockets without shutting everything down. Don't rely on being able to run the batteries down to zero. No more than 25% utilisation is far enough... (And don't forget to include any essential network switches on a UPS.) Happy crunchin', Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) |
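The shutdown rule in this post (power off for more than a minute or two, or battery down to 75%) could be scripted roughly as follows. This is a minimal sketch, not NUT's own configuration: it only assumes the `key: value` text that NUT's `upsc` command prints, with the `ups.status` and `battery.charge` variables. The function names, the 120-second default, and the choice to treat a missing charge reading as empty are assumptions made for the example.

```python
def parse_upsc(text: str) -> dict:
    """Parse 'key: value' lines as printed by NUT's `upsc` command."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            values[key.strip()] = value.strip()
    return values


def should_shut_down(status: dict, seconds_on_battery: float,
                     grace_seconds: float = 120.0,
                     min_charge: float = 75.0) -> bool:
    """Decide on a clean shutdown using the thresholds from the post."""
    # "OB" in ups.status means the UPS is running on battery.
    on_battery = "OB" in status.get("ups.status", "").split()
    if not on_battery:
        return False
    # Assumption: an unknown charge is treated as empty, to stay on the safe side.
    charge = float(status.get("battery.charge", "0"))
    return seconds_on_battery > grace_seconds or charge <= min_charge
```

A wrapper script would poll `upsc`, track how long the status has included `OB`, and call the system's shutdown command once `should_shut_down` returns true; that part is left out here because it depends on the local setup.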
Cosmic_Ocean Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13 |
^ that was along the same lines of my suggestion over in the news thread regarding the power failure. Someone asked about UPSes and I explained it fairly well, I believe.. (seen here).. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving-up) |
ML1 Joined: 25 Nov 01 Posts: 21212 Credit: 7,508,002 RAC: 20 |
Ahhhh... But does Matt ever get around to reading the replies?... Very good comments nonetheless. Happy crunchin', Martin See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) |
kittyman Joined: 9 Jul 00 Posts: 51478 Credit: 1,018,363,574 RAC: 1,004 |
As far as a hardwired latching power lockout upon fail...... It would also be quite easy for it to be built incorporating a time delay relay that would keep it locked out for a certain period of time before dropping back in. Say, perhaps 5 minutes. If the power comes back on and stays on for 5 minutes, it is assumed the coast is clear, and it's OK to power back up. Similar relay modules are commonly sold for refrigeration and air conditioning compressors to prevent them from trying to restart before the system pressure has equalized resulting in stalling of the compressor. "Time is simply the mechanism that keeps everything from happening all at once." |
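The time-delay variant works like a restart-inhibit timer: power must stay on continuously for the whole delay before the load is reconnected. Below is a small Python sketch of that logic only. The class name `DelayedReconnect` and the `update` method are invented for illustration, and the 300-second default is the 5 minutes suggested in the post.

```python
class DelayedReconnect:
    """Toy model of a lockout relay with a reconnect time delay."""

    def __init__(self, delay_s: float = 300.0):
        self.delay_s = delay_s
        # Time at which mains last came on; None while mains is off.
        self._mains_since = None

    def update(self, mains_on: bool, now: float) -> bool:
        """Feed one reading; return True if the load should be connected."""
        if not mains_on:
            # Any dropout resets the timer and disconnects the load.
            self._mains_since = None
            return False
        if self._mains_since is None:
            self._mains_since = now
        # Reconnect only after mains has been on without a break for delay_s.
        return now - self._mains_since >= self.delay_s
```

A one-second blip of power, like the one described in the original post, restarts the timer and never reaches the servers.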
KWSN THE Holy Hand Grenade! Joined: 20 Dec 05 Posts: 3187 Credit: 57,163,290 RAC: 0 |
Having trouble with uploads, regardless of what the "system Status" says about the upload server being "online": all my uploads are quitting after about 40 seconds with an "HTTP error" - both production and Beta... . Hello, from Albany, CA!... |
Ronsa Joined: 2 Sep 99 Posts: 7 Credit: 727,680 RAC: 0 |
Yes, I noticed it about 4 hours ago, I have several machines that are not uploading. I'm guessing it's a problem with the upload servers and of course since it is the weekend it is not noticed yet. lol Figure I'll just keep crunching and wait |
Bill Walker Joined: 4 Sep 99 Posts: 3868 Credit: 2,697,267 RAC: 0 |
This is being discussed over in Number Crunching. |
perryjay Joined: 20 Aug 02 Posts: 3377 Credit: 20,676,751 RAC: 0 |
It's been kicked, Thanks whoever did this on a weekend. PROUD MEMBER OF Team Starfire World BOINC |
soft^spirit Joined: 18 May 99 Posts: 6497 Credit: 34,134,168 RAC: 0 |
"Ahhhh... But does Matt ever get around to reading the replies?..." The pattern I have seen is that Matt reads, but seldom (though occasionally) replies. The fact that there are some replies tells me: yes, he reads. Janice |