Out of the Frying Pan (Feb 17 2010)
Author | Message |
---|---|
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
Well, shoot. Right at the end of the work day yesterday the air conditioning unit failed. What's worse is that the cause is still a complete mystery. When the campus A/C techs came up in the early evening they just pressed the reset button and it came back to life. But that was after a panicked fury of shutting down every server possible to save their lives. Eric was the first on the scene and smelled burned plastic, heard broken fans, and quickly started unplugging everything he could. I came up later after the A/C was on to get the web servers going again (so people could at least see we were still alive). This morning we rolled up our sleeves and surveyed the damage, which actually wasn't too bad. We definitely lost one UPS, and possibly a power supply in one of our file servers (though it seems okay for now). Eric's hydrogen survey server seemed to take the brunt of the damage, and he was ready to reinstall the OS on what disks remained visible to the system, when suddenly after the nth reboot all drives were visible again and all data was still intact. Well, that was a pleasant surprise. Still, there was a bit of RAID and database recovery on various servers, which is why the project largely remained offline until the end of the day today. This is still going on, so we probably won't be fully back to normal until tomorrow morning at the earliest. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
Thanks for the update Matt. Claggy |
perryjay Joined: 20 Aug 02 Posts: 3377 Credit: 20,676,751 RAC: 0 |
Thanks Matt and all the rest of the crew too. PROUD MEMBER OF Team Starfire World BOINC |
Radford Bunker Joined: 12 Mar 09 Posts: 8 Credit: 6,073,787 RAC: 0 |
Thanks Matt. Sounds like a Murphy Strike. Rad |
Link Joined: 18 Sep 03 Posts: 834 Credit: 1,807,369 RAC: 0 |
"smelled burned plastic, heard broken fans" How hot was it in there? Don't the systems shut down automatically when overheating? |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14680 Credit: 200,643,578 RAC: 874 |
Matt, Uploads have been problematic since well before the air conditioning failure, and well before the Tuesday maintenance window, too - since around 09:00 PST Monday morning, judging by the first post in Number Crunching. I'm currently getting: 17/02/2010 23:09:28|SETI@home|[file_xfer] Started upload of file 25fe07ac.28421.12751.16.10.119_1_0 17/02/2010 23:09:29||[http_debug] [ID#14] info: About to connect() to setiboincdata.ssl.berkeley.edu port 80 (#0) 17/02/2010 23:09:29||[http_debug] [ID#14] info: Trying 208.68.240.16... 17/02/2010 23:09:29||[http_debug] [ID#14] info: Connected to setiboincdata.ssl.berkeley.edu (208.68.240.16) port 80 (#0) 17/02/2010 23:09:29||[http_debug] [ID#14] Sent header to server: POST /sah_cgi/file_upload_handler HTTP/1.1 User-Agent: BOINC client (windows_intelx86 5.10.13) Host: setiboincdata.ssl.berkeley.edu Accept: */* Accept-Encoding: deflate, gzip Content-Type: application/x-www-form-urlencoded Content-Length: 288 17/02/2010 23:09:29||[http_debug] [ID#14] Received header from server: HTTP/1.0 503 Service Unavailable 17/02/2010 23:09:29||[http_debug] [ID#14] Received header from server: Content-Type: text/html 17/02/2010 23:09:29||[http_debug] [ID#14] Received header from server: Content-Length: 53 17/02/2010 23:09:29||[http_xfer_debug] HTTP: wrote 53 bytes 17/02/2010 23:09:29||[http_debug] [ID#14] info: Expire cleared 17/02/2010 23:09:29||[http_debug] [ID#14] info: Closing connection #0 17/02/2010 23:09:30|SETI@home|[file_xfer] Temporarily failed upload of 25fe07ac.28421.12751.16.10.119_1_0: http error That HTTP/1.0 503 Service Unavailable suggests something might still need kicking. |
Dena Wiltsie Joined: 19 Apr 01 Posts: 1628 Credit: 24,230,968 RAC: 26 |
For a while my job depended on a system cooled by an air conditioner that I could not depend on. My solution was to get one of these and wire it into an extension cord so I could connect all the non-replaceable equipment to it. I then set it to about 80 F and had no worries about failed hardware. The catch is you must make sure your backups are up to date as the power down will be very hard and in my case the raid lost a drive often when it was powered down (very old drives). |
Rick Joined: 3 Dec 99 Posts: 79 Credit: 11,486,227 RAC: 0 |
"For a while my job depended on a system cooled by an air conditioner that I could not depend on. My solution was to get one of these and wire it into an extension cord so I could connect all the non-replaceable equipment to it. I then set it to about 80 F and had no worries about failed hardware. The catch is you must make sure your backups are up to date as the power down will be very hard and in my case the raid lost a drive often when it was powered down (very old drives)." Plug a UPS into it that has the ability to trigger a graceful shutdown of the systems when the power fails. So long as the UPS has the capacity to keep power to the systems during the shutdown you should be in good shape. |
S@NL - Eesger - www.knoop.nl Joined: 7 Oct 01 Posts: 385 Credit: 50,200,038 RAC: 0 |
"... they just pressed the reset button..." It's the Microsoft way... and heck, it works more often than one would think ;) The SETI@Home Gauntlet 2012 april 16 - 30 |
Link Joined: 18 Sep 03 Posts: 834 Credit: 1,807,369 RAC: 0 |
"For a while my job depended on a system cooled by an air conditioner that I could not depend on. My solution was to get one of these (...)" I think there are more than enough software-based solutions that will nicely power down the system if something is overheating. Alternatively, if software is not possible, one could try to simulate pressing the power button. That will also gracefully shut down the system. |
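A minimal sketch of the software-based approach described above, in Python. It assumes a Linux host that exposes a temperature in millidegrees Celsius at `/sys/class/thermal/thermal_zone0/temp`; the 70 C limit, the 10 second poll interval, and the three-readings rule are illustrative values, not anything the project is known to use. Requiring several consecutive hot readings keeps one noisy sample from powering the machine off.

```python
import subprocess
import time
from pathlib import Path
from typing import Iterable

# Assumption: Linux sysfs thermal zone, value in millidegrees Celsius.
SENSOR = Path("/sys/class/thermal/thermal_zone0/temp")
LIMIT_C = 70.0  # illustrative threshold, tune for the actual hardware


def read_temp_c(path: Path = SENSOR) -> float:
    """Read the sensor file and convert millidegrees to degrees Celsius."""
    return int(path.read_text().strip()) / 1000.0


def should_shut_down(readings: Iterable[float], limit_c: float, needed: int = 3) -> bool:
    """Return True once `needed` consecutive readings exceed `limit_c`."""
    streak = 0
    for temp in readings:
        streak = streak + 1 if temp > limit_c else 0
        if streak >= needed:
            return True
    return False


def watch(poll_seconds: float = 10.0) -> None:
    """Poll the sensor and request a graceful shutdown when it stays hot."""
    recent: list[float] = []
    while True:
        recent = (recent + [read_temp_c()])[-3:]  # keep the last three readings
        if should_shut_down(recent, LIMIT_C):
            # Needs root privileges; halts the machine cleanly.
            subprocess.run(["shutdown", "-h", "now"], check=False)
            return
        time.sleep(poll_seconds)
```

Calling `watch()` from a service started at boot would be one way to run it; the decision logic is kept in `should_shut_down` so it can be checked without touching real hardware.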
Dena Wiltsie Joined: 19 Apr 01 Posts: 1628 Credit: 24,230,968 RAC: 26 |
"For a while my job depended on a system cooled by an air conditioner that I could not depend on. My solution was to get one of these and wire it into an extension cord so I could connect all the non-replaceable equipment to it. I then set it to about 80 F and had no worries about failed hardware. The catch is you must make sure your backups are up to date as the power down will be very hard and in my case the raid lost a drive often when it was powered down (very old drives)." It was a P390 running OS2 Warp and VM/ESA. It was so old it didn't have any idea what a smart UPS was. The hard drive failure would happen just because it stopped turning. On the other hand, I would have to do a cold start on VM/ESA but we never lost a byte of data with that setup. I am not sure other operating systems would be as forgiving so I provided a warning. We did have a UPS but its main function was to filter power glitches. One danger of putting the switch on the UPS is that additional heat will be generated while the UPS reaches its shutdown point. My room was not much larger than a closet so when things overheated, they needed to be shut down fast. The system was up 24 hours a day and often would be unattended so the failure would most likely happen when no one was around to lay hands on the system. |
RottenMutt Joined: 15 Mar 01 Posts: 1011 Credit: 230,314,058 RAC: 0 |
"When the campus A/C techs came up in the early evening they just pressed the reset button and it came back to life." I sure hope you took note as to where the reset switch is... Cricket still shows little activity, so you must still be down. Some of my rigs are out of work for the GPUs and others will be out soon (hours)... |
dbryce Joined: 23 Dec 99 Posts: 4 Credit: 906,647 RAC: 0 |
I remember that box!! <g> In my 'previous life' we were running one of those and we had a 'UPS on steroids' that would power the machine for, I think, 2 hours. It might even have powered our 'server farm', but that was 6.5 years ago and my memory is iffy. Doug |
frank Joined: 14 May 99 Posts: 1 Credit: 362,082 RAC: 0 |
Thanks Matt. Was sure worried why it's been off so long. Thank you for your work there. |
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
"For a while my job depended on a system cooled by an air conditioner that I could not depend on. My solution was to get one of these and wire it into an extension cord so I could connect all the non-replaceable equipment to it. I then set it to about 80 F and had no worries about failed hardware. The catch is you must make sure your backups are up to date as the power down will be very hard and in my case the raid lost a drive often when it was powered down (very old drives)." Most anything semi-modern supports some sort of "dumb" signaling from a UPS. It uses a normal serial port, and only the handshake lines. A line goes "low" to signal "low battery" and the UPS waits for the system to drop a handshake line back when it is safe for the UPS to turn off. One could build a "UPS" whose only job was to signal low battery when the temperature rose above a certain threshold, and kill power when the system said "okay." Power would be restored when it got cold enough. Or not. |
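The thermal "UPS" idea above can be sketched as a small state machine. This is only an illustration in Python: the state names, the `step` function, and the 70 F restore point are hypothetical, and the 80 F trip point is borrowed from the set point mentioned earlier in the thread. A real device would drive and read actual serial handshake lines; here they are reduced to a boolean `host_ok`.

```python
from enum import Enum


class State(Enum):
    NORMAL = "normal"          # power on, no alarm raised
    SIGNALLING = "signalling"  # "low battery" line asserted, waiting for the host
    OFF = "off"                # power cut after the host said it was safe


def step(state: State, temp_f: float, host_ok: bool,
         trip_f: float = 80.0, restore_f: float = 70.0) -> State:
    """Advance the controller by one reading.

    temp_f  : current room temperature in degrees Fahrenheit
    host_ok : True once the host has dropped its handshake line,
              meaning it has shut down and power may be removed
    """
    if state is State.NORMAL and temp_f >= trip_f:
        return State.SIGNALLING
    if state is State.SIGNALLING and host_ok:
        return State.OFF
    if state is State.OFF and temp_f <= restore_f:
        return State.NORMAL
    # Otherwise hold the current state. Once signalling has started it is
    # not cancelled, because the host may already be shutting down.
    return state
```

The gap between the trip and restore temperatures is deliberate: without it, a room hovering around 80 F would cycle the power repeatedly.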
Bob1701a Joined: 11 Apr 00 Posts: 9 Credit: 6,933,206 RAC: 0 |
The smart-ass in me made me write this..... The A/C died and it's too hot? It's winter, it's 25 degrees and snowing...open the windows. That'll cool you off. |
gizbar Joined: 7 Jan 01 Posts: 586 Credit: 21,087,774 RAC: 0 |
Blasted A/C! Now we're out of the frying pan, can we just avoid the fire this time? ;-) Good job guys. Trying to recover some data off a laptop drive for someone at the moment. Of course, there isn't a backup, and this is the 4th system I'm trying to recover just recently. The battery has gone in the laptop and seeing as it is a normal P4@3.00GHz, the PSU is struggling to supply everything now too. I've had to take the HDD out and attach it to a desktop. After the usual virus checks etc, I started a chkdsk over 10 hours ago and it's less than halfway through! Oh well. At least it keeps me busy while you guys were up to your eyeballs in it. Gizbar. A proud GPU User Server Donor! |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13866 Credit: 208,696,464 RAC: 304 |
"The smart-ass in me made me write this....." Not from that part of the world, but I don't think it snows too often at Berkeley. And most server rooms don't have windows, let alone ones that open. Grant Darwin NT |
Joori Joined: 2 Nov 07 Posts: 1 Credit: 3,578,735 RAC: 0 |
Nice to hear everything is almost back to normal. Unfortunate that a lot of work units were aborted while trying to upload them, as their deadline had passed during the downtime. I have a feeling more will be aborted as they still can't be uploaded. Kinda disappointed, but what can ya do, aye? You win some, you lose some - gotta keep on truckin'! :) |
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
"The A/C died and it's too hot? It's winter, it's 25 degrees and snowing...open the windows. That'll cool you off." Assuming the server room is near an outside wall and has openable windows, of course. |
©2025 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.