Out of the Frying Pan (Feb 17 2010)



Message boards : Technical News : Out of the Frying Pan (Feb 17 2010)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 970983 - Posted: 17 Feb 2010, 22:51:35 UTC

Well, shoot. Right at the end of the work day yesterday the air conditioning unit failed. What's worse is that the cause is still a complete mystery. When the campus A/C techs came up in the early evening they just pressed the reset button and it came back to life.

But that was after a panicked fury of shutting down every server possible to save their lives. Eric was the first on the scene and smelled burned plastic, heard broken fans, and quickly started unplugging everything he could. I came up later after the A/C was on to get the web servers going again (so people could at least see we were still alive).

This morning we rolled up our sleeves and surveyed the damage, which actually wasn't too bad. We definitely lost one UPS, and possibly a power supply in one of our file servers (though it seems okay for now). Eric's hydrogen survey server seemed to take the brunt of the damage, and he was ready to reinstall the OS on what disks remained visible to the system, when suddenly after the nth reboot all drives were visible again and all data was still intact. Well, that was a pleasant surprise.

Still, there was a bit of RAID and database recovery on various servers, which is why the project largely remained offline until the end of the day today. This is still going on, so we probably won't be fully back to normal until tomorrow morning at the earliest.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Claggy
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4097
Credit: 33,047,134
RAC: 7,997
United Kingdom
Message 970987 - Posted: 17 Feb 2010, 22:56:01 UTC - in response to Message 970983.

Thanks for the update Matt.

Claggy

perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 15,546,182
RAC: 11,462
United States
Message 970989 - Posted: 17 Feb 2010, 22:57:48 UTC - in response to Message 970983.

Thanks Matt and all the rest of the crew too.
____________


PROUD MEMBER OF Team Starfire World BOINC

Radford Bunker
Joined: 12 Mar 09
Posts: 8
Credit: 3,140,339
RAC: 1,082
United States
Message 970996 - Posted: 17 Feb 2010, 23:13:26 UTC

Thanks Matt.

Sounds like a Murphy Strike.

Rad

Link
Joined: 18 Sep 03
Posts: 828
Credit: 1,564,170
RAC: 263
Germany
Message 970998 - Posted: 17 Feb 2010, 23:13:51 UTC - in response to Message 970983.
Last modified: 17 Feb 2010, 23:15:14 UTC

smelled burned plastic, heard broken fans

How hot was it in there? Don't the systems shut down automatically when they overheat?
____________
.

Richard Haselgrove
Project donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8499
Credit: 49,919,657
RAC: 50,972
United Kingdom
Message 971000 - Posted: 17 Feb 2010, 23:21:19 UTC

Matt,

Uploads have been problematic since well before the air conditioning failure, and well before the Tuesday maintenance window, too: since around 09:00 PST Monday morning, judging by the first post in Number Crunching.

I'm currently getting:

17/02/2010 23:09:28|SETI@home|[file_xfer] Started upload of file 25fe07ac.28421.12751.16.10.119_1_0
17/02/2010 23:09:29||[http_debug] [ID#14] info: About to connect() to setiboincdata.ssl.berkeley.edu port 80 (#0)
17/02/2010 23:09:29||[http_debug] [ID#14] info: Trying 208.68.240.16...
17/02/2010 23:09:29||[http_debug] [ID#14] info: Connected to setiboincdata.ssl.berkeley.edu (208.68.240.16) port 80 (#0)
17/02/2010 23:09:29||[http_debug] [ID#14] Sent header to server: POST /sah_cgi/file_upload_handler HTTP/1.1
User-Agent: BOINC client (windows_intelx86 5.10.13)
Host: setiboincdata.ssl.berkeley.edu
Accept: */*
Accept-Encoding: deflate, gzip
Content-Type: application/x-www-form-urlencoded
Content-Length: 288


17/02/2010 23:09:29||[http_debug] [ID#14] Received header from server: HTTP/1.0 503 Service Unavailable

17/02/2010 23:09:29||[http_debug] [ID#14] Received header from server: Content-Type: text/html

17/02/2010 23:09:29||[http_debug] [ID#14] Received header from server: Content-Length: 53

17/02/2010 23:09:29||[http_xfer_debug] HTTP: wrote 53 bytes
17/02/2010 23:09:29||[http_debug] [ID#14] info: Expire cleared
17/02/2010 23:09:29||[http_debug] [ID#14] info: Closing connection #0
17/02/2010 23:09:30|SETI@home|[file_xfer] Temporarily failed upload of 25fe07ac.28421.12751.16.10.119_1_0: http error

That HTTP/1.0 503 Service Unavailable suggests something might still need kicking.
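For anyone who wants to check their own client log for the same symptom, here is a minimal sketch that pulls the HTTP status lines out of `[http_debug]` output like the excerpt above. The function names are made up for illustration, and the pattern only assumes the "Received header from server: HTTP/..." wording shown in that excerpt; other client versions may log it differently.

```python
import re

# Matches status lines as they appear in the excerpt above, e.g.
# "... Received header from server: HTTP/1.0 503 Service Unavailable"
STATUS_RE = re.compile(
    r"Received header from server: HTTP/(?P<version>\d\.\d) "
    r"(?P<code>\d{3}) (?P<reason>.*)$"
)


def find_http_statuses(log_text):
    """Return a (code, reason) tuple for every HTTP status line in the log."""
    results = []
    for line in log_text.splitlines():
        match = STATUS_RE.search(line.strip())
        if match:
            results.append((int(match.group("code")), match.group("reason").strip()))
    return results


def is_server_side_failure(code):
    """5xx status codes indicate the server, not the client, failed the request."""
    return 500 <= code <= 599
```

A 5xx result, as here, points at the server end rather than at anything in the local setup.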

Dena Wiltsie
Joined: 19 Apr 01
Posts: 1147
Credit: 547,331
RAC: 284
United States
Message 971002 - Posted: 17 Feb 2010, 23:38:21 UTC

For a while my job depended on a system cooled by an air conditioner that I could not depend on. My solution was to get one of these and wire it into an extension cord so I could connect all the non-replaceable equipment to it. I then set it to about 80 F and had no worries about failed hardware. The catch is you must make sure your backups are up to date as the power down will be very hard and in my case the raid lost a drive often when it was powered down (very old drives).
____________

Rick
Joined: 3 Dec 99
Posts: 79
Credit: 11,486,227
RAC: 0
United States
Message 971007 - Posted: 17 Feb 2010, 23:50:07 UTC - in response to Message 971002.

For a while my job depended on a system cooled by an air conditioner that I could not depend on. My solution was to get one of these and wire it into an extension cord so I could connect all the non-replaceable equipment to it. I then set it to about 80 F and had no worries about failed hardware. The catch is you must make sure your backups are up to date as the power down will be very hard and in my case the raid lost a drive often when it was powered down (very old drives).


Plug a UPS into it that has the ability to trigger a graceful shutdown of the systems when the power fails. So long as the UPS has the capacity to keep power to the systems during the shutdown you should be in good shape.
____________

S@NL - Eesger - www.knoop.nl
Joined: 7 Oct 01
Posts: 384
Credit: 36,652,491
RAC: 17,840
Netherlands
Message 971008 - Posted: 17 Feb 2010, 23:50:10 UTC - in response to Message 970983.

... they just pressed the reset button...


It's the Microsoft way... and heck, it works more often than one would think ;)
____________
The SETI@Home Gauntlet 2012 april 16 - 30| info / chat | STATS

Link
Joined: 18 Sep 03
Posts: 828
Credit: 1,564,170
RAC: 263
Germany
Message 971011 - Posted: 17 Feb 2010, 23:55:55 UTC - in response to Message 971002.

For a while my job depended on a system cooled by an air conditioner that I could not depend on. My solution was to get one of these (...)

I think there are more than enough software based solutions, which will nicely power down the system, if something is overheating.

Alternatively, if software is not possible, one could try to simulate pressing the power button. That will also gracefully shut down the system.
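A rough sketch of what such a software watchdog could look like, purely as an illustration. Everything here is an assumption: the threshold is just the 80 F set point mentioned earlier in the thread converted to Celsius and rounded, the sensor reader is whatever the caller supplies, and the `shutdown -h now` call assumes a Linux host with sufficient privileges.

```python
import subprocess
import time

# Roughly the 80 F set point from earlier in the thread (80 F is about 26.7 C).
SHUTDOWN_THRESHOLD_C = 27.0


def fahrenheit_to_celsius(temp_f):
    """Convert a Fahrenheit reading to Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def should_shut_down(readings_c, threshold_c=SHUTDOWN_THRESHOLD_C, required=3):
    """True when the last `required` readings are all above the threshold.

    Requiring several consecutive readings avoids acting on one bad sample.
    """
    if len(readings_c) < required:
        return False
    return all(r > threshold_c for r in readings_c[-required:])


def watch(read_temperature_c, shutdown, interval_s=30.0, sleep=time.sleep):
    """Poll a sensor and call `shutdown` once the room stays too hot.

    Both `read_temperature_c` and `shutdown` are supplied by the caller;
    how the temperature is actually read depends on the hardware.
    """
    readings = []
    while True:
        readings.append(read_temperature_c())
        readings = readings[-10:]  # keep only a short history
        if should_shut_down(readings):
            shutdown()
            return
        sleep(interval_s)


def linux_shutdown():
    """Graceful halt; assumes a Linux host and root privileges."""
    subprocess.run(["shutdown", "-h", "now"], check=True)
```

Passing the sensor reader and the shutdown action in as arguments keeps the loop testable without touching real hardware.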
____________
.

Dena Wiltsie
Joined: 19 Apr 01
Posts: 1147
Credit: 547,331
RAC: 284
United States
Message 971012 - Posted: 18 Feb 2010, 0:01:23 UTC - in response to Message 971007.
Last modified: 18 Feb 2010, 0:04:17 UTC

For a while my job depended on a system cooled by an air conditioner that I could not depend on. My solution was to get one of these and wire it into an extension cord so I could connect all the non-replaceable equipment to it. I then set it to about 80 F and had no worries about failed hardware. The catch is you must make sure your backups are up to date as the power down will be very hard and in my case the raid lost a drive often when it was powered down (very old drives).


Plug a UPS into it that has the ability to trigger a graceful shutdown of the systems when the power fails. So long as the UPS has the capacity to keep power to the systems during the shutdown you should be in good shape.

It was a P390 running OS/2 Warp and VM/ESA. It was so old it didn't have any idea what a smart UPS was. The hard drive failure would happen just because it stopped turning. On the other hand, I would have to do a cold start on VM/ESA but we never lost a byte of data with that setup. I am not sure other operating systems would be as forgiving so I provided a warning.
We did have a UPS but its main function was to filter power glitches. One danger of putting the switch on the UPS is additional heat will be generated while the UPS reaches its shutdown point. My room was not much larger than a closet so when things overheated, they needed to be shut down fast.
The system was up 24 hours a day and often would be unattended so the failure would most likely happen when no one was around to lay hands on the system.
____________

RottenMutt
Joined: 15 Mar 01
Posts: 992
Credit: 207,654,737
RAC: 0
United States
Message 971015 - Posted: 18 Feb 2010, 0:20:19 UTC - in response to Message 970983.
Last modified: 18 Feb 2010, 0:22:13 UTC

When the campus A/C techs came up in the early evening they just pressed the reset button and it came back to life.


I sure hope you took note of where the reset switch is...

Cricket still shows little activity, so you must still be down. Some of my rigs are out of work for the GPUs and others will be out soon (hours)...
____________

dbryce
Joined: 23 Dec 99
Posts: 4
Credit: 906,647
RAC: 0
Canada
Message 971017 - Posted: 18 Feb 2010, 0:31:48 UTC - in response to Message 971012.
Last modified: 18 Feb 2010, 0:32:29 UTC


It was a P390 running OS/2 Warp and VM/ESA. It was so old it didn't have any idea what a smart UPS was. The hard drive failure would happen just because it stopped turning. On the other hand, I would have to do a cold start on VM/ESA but we never lost a byte of data with that setup. I am not sure other operating systems would be as forgiving so I provided a warning.
We did have a UPS but its main function was to filter power glitches. One danger of putting the switch on the UPS is additional heat will be generated while the UPS reaches its shutdown point. My room was not much larger than a closet so when things overheated, they needed to be shut down fast.
The system was up 24 hours a day and often would be unattended so the failure would most likely happen when no one was around to lay hands on the system.


I remember that box!! <g> In my 'previous life' we were running one of those and we had a 'UPS on steroids' that would power the machine for, I think, 2 hours. It might even have powered our 'server farm', but that was 6.5 years ago and my memory is iffy.

Doug

frank
Joined: 14 May 99
Posts: 1
Credit: 341,345
RAC: 0
United States
Message 971053 - Posted: 18 Feb 2010, 2:33:13 UTC

Thanks, Matt. I was sure worried about why it's been off so long. Thank you for your work there.
____________

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 971079 - Posted: 18 Feb 2010, 4:20:57 UTC - in response to Message 971002.

For a while my job depended on a system cooled by an air conditioner that I could not depend on. My solution was to get one of these and wire it into an extension cord so I could connect all the non-replaceable equipment to it. I then set it to about 80 F and had no worries about failed hardware. The catch is you must make sure your backups are up to date as the power down will be very hard and in my case the raid lost a drive often when it was powered down (very old drives).

Most anything semi-modern supports some sort of "dumb" signaling from a UPS.

It uses a normal serial port, and only the handshake lines. A line goes "low" to signal "low battery" and the UPS waits for the system to drop a handshake line back when it is safe for the UPS to turn off.

One could build a "UPS" whose only job was to signal low battery when the temperature got above a certain temperature, and kill power when the system said "okay."

Power would be restored when it got cold enough. Or not.
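To make the idea concrete, here is a small state-machine sketch of that temperature-driven "UPS". It is only an illustration under stated assumptions: the class and state names are invented, the handshake lines are modelled as plain booleans rather than real serial-port pins, and the trip and restore temperatures (with a gap between them so the power does not flap on and off) are placeholder values.

```python
from enum import Enum


class State(Enum):
    POWERED = "powered"          # normal operation
    LOW_BATTERY = "low_battery"  # "low battery" signalled, waiting for the host
    OFF = "off"                  # power cut


class ThermalUps:
    """Pretend UPS that reports "low battery" when the room gets too hot."""

    def __init__(self, trip_c=27.0, restore_c=22.0):
        # The restore point sits below the trip point (hysteresis).
        if restore_c >= trip_c:
            raise ValueError("restore_c must be below trip_c")
        self.trip_c = trip_c
        self.restore_c = restore_c
        self.state = State.POWERED

    def step(self, temperature_c, host_says_ok):
        """Advance one tick given a temperature and the host's handshake line."""
        if self.state is State.POWERED and temperature_c > self.trip_c:
            self.state = State.LOW_BATTERY
        elif self.state is State.LOW_BATTERY and host_says_ok:
            self.state = State.OFF
        elif self.state is State.OFF and temperature_c < self.restore_c:
            self.state = State.POWERED
        return self.state

    @property
    def low_battery_line(self):
        """True while the "low battery" signal should be asserted."""
        return self.state is State.LOW_BATTERY

    @property
    def power_on(self):
        """True while the outlet should stay energised."""
        return self.state is not State.OFF
```

In this sketch power is only cut after the host answers, so a host that never drops its handshake line would keep running hot; a real build would probably want a timeout as well.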
____________

Bob1701a
Joined: 11 Apr 00
Posts: 9
Credit: 6,933,206
RAC: 0
United States
Message 971080 - Posted: 18 Feb 2010, 4:22:19 UTC - in response to Message 970983.

The smart-ass in me made me write this.....

The A/C died and it's too hot? It's winter, it's 25 degrees and snowing...open the windows. That'll cool you off.
____________

gizbar
Joined: 7 Jan 01
Posts: 586
Credit: 21,087,774
RAC: 0
United Kingdom
Message 971082 - Posted: 18 Feb 2010, 4:25:01 UTC

Blasted A/C! Now we're out of the frying pan, can we just avoid the fire this time? ;-)

Good job guys.

Trying to recover some data off a laptop drive for someone at the moment. Of course, there isn't a backup, and this is the 4th system I've tried to recover just recently. The battery has gone in the laptop and, seeing as it is a normal P4 @ 3.00 GHz, the PSU is struggling to supply everything now too.

I've had to take the hdd out and attach it to a desktop. After the usual virus checks etc, I started a chkdsk over 10 hours ago and it's less than halfway through!

Oh well. At least it keeps me busy while you guys were up to your eyeballs in it.

Gizbar.
____________


A proud GPU User Server Donor!

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5819
Credit: 58,983,376
RAC: 48,186
Australia
Message 971089 - Posted: 18 Feb 2010, 4:43:47 UTC - in response to Message 971080.

The smart-ass in me made me write this.....

The A/C died and it's too hot? It's winter, it's 25 degrees and snowing...open the windows. That'll cool you off.

Not from that part of the world, but I don't think it snows too often at Berkeley. And most server rooms don't have windows, let alone ones that open.
____________
Grant
Darwin NT.

Joori
Joined: 2 Nov 07
Posts: 1
Credit: 3,436,062
RAC: 0
Australia
Message 971093 - Posted: 18 Feb 2010, 4:52:49 UTC

Nice to hear everything is almost back to normal. Unfortunate that a lot of work units were aborted while trying to upload them, as their deadline had passed during the downtime. I have a feeling more will be aborted as they still can't be uploaded.

Kinda disappointed but what can ya do aye? You win some, you lose some - gotta keep on truckin' ! :)
____________

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 971095 - Posted: 18 Feb 2010, 5:05:40 UTC - in response to Message 971080.

The A/C died and it's too hot? It's winter, it's 25 degrees and snowing...open the windows. That'll cool you off.

Assuming the server room is near an outside wall and has openable windows, of course.
____________


Copyright © 2014 University of California