Out of the Frying Pan (Feb 17 2010)

Message boards : Technical News : Out of the Frying Pan (Feb 17 2010)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 970983 - Posted: 17 Feb 2010, 22:51:35 UTC

Well, shoot. Right at the end of the work day yesterday the air conditioning unit failed. What's worse is that the cause is still a complete mystery. When the campus A/C techs came up in the early evening they just pressed the reset button and it came back to life.

But that was after a panicked fury of shutting down every server possible to save their lives. Eric was the first on the scene and smelled burned plastic, heard broken fans, and quickly started unplugging everything he could. I came up later after the A/C was on to get the web servers going again (so people could at least see we were still alive).

This morning we rolled up our sleeves and surveyed the damage, which actually wasn't too bad. We definitely lost one UPS, and possibly a power supply in one of our file servers (though it seems okay for now). Eric's hydrogen survey server seemed to take the brunt of the damage, and he was ready to reinstall the OS on what disks remained visible to the system, when suddenly after the nth reboot all drives were visible again and all data was still intact. Well, that was a pleasant surprise.

Still, there was a bit of RAID and database recovery on various servers, which is why the project largely remained offline until the end of the day today. This is still going on, so we probably won't be fully back to normal until tomorrow morning at the earliest.

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 970983
Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 970987 - Posted: 17 Feb 2010, 22:56:01 UTC - in response to Message 970983.  

Thanks for the update Matt.

Claggy
ID: 970987
Profile perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 970989 - Posted: 17 Feb 2010, 22:57:48 UTC - in response to Message 970983.  

Thanks Matt and all the rest of the crew too.


PROUD MEMBER OF Team Starfire World BOINC
ID: 970989
Radford Bunker
Joined: 12 Mar 09
Posts: 8
Credit: 6,073,787
RAC: 0
United States
Message 970996 - Posted: 17 Feb 2010, 23:13:26 UTC

Thanks Matt.

Sounds like a Murphy Strike.

Rad
ID: 970996
Profile Link
Joined: 18 Sep 03
Posts: 834
Credit: 1,807,369
RAC: 0
Germany
Message 970998 - Posted: 17 Feb 2010, 23:13:51 UTC - in response to Message 970983.  
Last modified: 17 Feb 2010, 23:15:14 UTC

smelled burned plastic, heard broken fans

How hot was it in there? Are the systems not automatically shutting down when overheating?
ID: 970998
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14644
Credit: 200,643,578
RAC: 874
United Kingdom
Message 971000 - Posted: 17 Feb 2010, 23:21:19 UTC

Matt,

Uploads have been problematic since well before the air conditioning failure, and well before the Tuesday maintenance window, too - since around 09:00 PST Monday morning, judging by the first post in Number Crunching.

I'm currently getting:

17/02/2010 23:09:28|SETI@home|[file_xfer] Started upload of file 25fe07ac.28421.12751.16.10.119_1_0
17/02/2010 23:09:29||[http_debug] [ID#14] info: About to connect() to setiboincdata.ssl.berkeley.edu port 80 (#0)
17/02/2010 23:09:29||[http_debug] [ID#14] info:   Trying 208.68.240.16... 
17/02/2010 23:09:29||[http_debug] [ID#14] info: Connected to setiboincdata.ssl.berkeley.edu (208.68.240.16) port 80 (#0)
17/02/2010 23:09:29||[http_debug] [ID#14] Sent header to server: POST /sah_cgi/file_upload_handler HTTP/1.1
User-Agent: BOINC client (windows_intelx86 5.10.13)
Host: setiboincdata.ssl.berkeley.edu
Accept: */*
Accept-Encoding: deflate, gzip
Content-Type: application/x-www-form-urlencoded
Content-Length: 288


17/02/2010 23:09:29||[http_debug] [ID#14] Received header from server: HTTP/1.0 503 Service Unavailable

17/02/2010 23:09:29||[http_debug] [ID#14] Received header from server: Content-Type: text/html

17/02/2010 23:09:29||[http_debug] [ID#14] Received header from server: Content-Length: 53

17/02/2010 23:09:29||[http_xfer_debug] HTTP: wrote 53 bytes
17/02/2010 23:09:29||[http_debug] [ID#14] info: Expire cleared
17/02/2010 23:09:29||[http_debug] [ID#14] info: Closing connection #0
17/02/2010 23:09:30|SETI@home|[file_xfer] Temporarily failed upload of 25fe07ac.28421.12751.16.10.119_1_0: http error

That HTTP/1.0 503 Service Unavailable suggests something might still need kicking.
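
For anyone who wants to pick that status out of their own logs, here is a minimal sketch. It assumes the `[http_debug]` line format shown above; the function names `parse_status` and `is_server_side` are made up for illustration and are not part of BOINC.

```python
import re

# Matches the status line as it appears in the [http_debug] output above.
STATUS_RE = re.compile(r"Received header from server: HTTP/(\d\.\d) (\d{3}) (.*)")


def parse_status(log_line):
    """Return (status_code, reason) from an http_debug line, or None if absent."""
    match = STATUS_RE.search(log_line)
    if match is None:
        return None
    return int(match.group(2)), match.group(3).strip()


def is_server_side(status_code):
    """5xx codes mean the server, not the client, failed to handle the request."""
    return 500 <= status_code <= 599


line = ("17/02/2010 23:09:29||[http_debug] [ID#14] Received header from server: "
        "HTTP/1.0 503 Service Unavailable")
code, reason = parse_status(line)
print(code, reason, is_server_side(code))
```

A 5xx code points at the server end rather than at the client, which is why the retry is logged as a temporary failure.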
ID: 971000
Dena Wiltsie
Volunteer tester
Joined: 19 Apr 01
Posts: 1628
Credit: 24,230,968
RAC: 26
United States
Message 971002 - Posted: 17 Feb 2010, 23:38:21 UTC

For a while my job depended on a system cooled by an air conditioner that I could not depend on. My solution was to get one of these and wire it into an extension cord so I could connect all the non-replaceable equipment to it. I then set it to about 80 F and had no worries about failed hardware. The catch is you must make sure your backups are up to date as the power down will be very hard and in my case the raid lost a drive often when it was powered down (very old drives).
ID: 971002
Rick
Joined: 3 Dec 99
Posts: 79
Credit: 11,486,227
RAC: 0
United States
Message 971007 - Posted: 17 Feb 2010, 23:50:07 UTC - in response to Message 971002.  

For a while my job depended on a system cooled by an air conditioner that I could not depend on. My solution was to get one of these and wire it into an extension cord so I could connect all the non-replaceable equipment to it. I then set it to about 80 F and had no worries about failed hardware. The catch is you must make sure your backups are up to date as the power down will be very hard and in my case the raid lost a drive often when it was powered down (very old drives).


Plug a UPS into it that has the ability to trigger a graceful shutdown of the systems when the power fails. So long as the UPS has the capacity to keep power to the systems during the shutdown you should be in good shape.
ID: 971007
Profile S@NL - Eesger - www.knoop.nl
Joined: 7 Oct 01
Posts: 385
Credit: 50,200,038
RAC: 0
Netherlands
Message 971008 - Posted: 17 Feb 2010, 23:50:10 UTC - in response to Message 970983.  

... they just pressed the reset button...


It's the Microsoft way... and heck, it works more often than one would think ;)
The SETI@Home Gauntlet 2012 april 16 - 30| info / chat | STATS
ID: 971008
Profile Link
Joined: 18 Sep 03
Posts: 834
Credit: 1,807,369
RAC: 0
Germany
Message 971011 - Posted: 17 Feb 2010, 23:55:55 UTC - in response to Message 971002.  

For a while my job depended on a system cooled by an air conditioner that I could not depend on. My solution was to get one of these (...)

I think there are more than enough software based solutions, which will nicely power down the system, if something is overheating.

Alternatively, if software is not possible, one could try to simulate pressing the power button. That will also gracefully shut down the system.
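
A rough sketch of the software approach, just to show the idea. Everything here is an assumption for illustration: the 40 C threshold, the three-readings rule, and the `read_temp_c` callable (how you actually read a sensor depends on the hardware). The only real command used is `shutdown -h now`, which exists on Linux.

```python
import subprocess
import time
from collections import deque

SHUTDOWN_AT_C = 40.0   # hypothetical threshold, tune for the room
CHECKS_REQUIRED = 3    # consecutive hot readings needed before acting


def should_shut_down(recent, threshold, required):
    """True when the last `required` readings are all at or above threshold."""
    return len(recent) == required and all(t >= threshold for t in recent)


def watch(read_temp_c, shut_down, threshold=SHUTDOWN_AT_C,
          required=CHECKS_REQUIRED, interval_s=30.0, sleep=time.sleep):
    """Poll the sensor until it stays hot, then call shut_down() once."""
    recent = deque(maxlen=required)  # only the newest readings are kept
    while True:
        recent.append(read_temp_c())
        if should_shut_down(recent, threshold, required):
            shut_down()
            return list(recent)
        sleep(interval_s)


def linux_shutdown():
    """Ask the OS for a clean halt (Linux; needs root privileges)."""
    subprocess.run(["shutdown", "-h", "now"], check=True)
```

You would run it as `watch(my_sensor_reader, linux_shutdown)`, where `my_sensor_reader` is whatever function returns the temperature in Celsius on your box. Requiring several hot readings in a row is a design choice to avoid shutting down on a single bad sample.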
ID: 971011
Dena Wiltsie
Volunteer tester
Joined: 19 Apr 01
Posts: 1628
Credit: 24,230,968
RAC: 26
United States
Message 971012 - Posted: 18 Feb 2010, 0:01:23 UTC - in response to Message 971007.  
Last modified: 18 Feb 2010, 0:04:17 UTC

For a while my job depended on a system cooled by an air conditioner that I could not depend on. My solution was to get one of these and wire it into an extension cord so I could connect all the non-replaceable equipment to it. I then set it to about 80 F and had no worries about failed hardware. The catch is you must make sure your backups are up to date as the power down will be very hard and in my case the raid lost a drive often when it was powered down (very old drives).


Plug a UPS into it that has the ability to trigger a graceful shutdown of the systems when the power fails. So long as the UPS has the capacity to keep power to the systems during the shutdown you should be in good shape.

It was a P390 running OS/2 Warp and VM/ESA. It was so old it didn't have any idea what a smart UPS was. The hard drive failure would happen just because it stopped turning. On the other hand, I would have to do a cold start on VM/ESA but we never lost a byte of data with that setup. I am not sure other operating systems would be as forgiving so I provided a warning.
We did have a UPS but its main function was to filter power glitches. One danger of putting the switch on the UPS is additional heat will be generated while the UPS reaches its shutdown point. My room was not much larger than a closet so when things overheated, they needed to be shut down fast.
The system was up 24 hours a day and often would be unattended so the failure would most likely happen when no one was around to lay hands on the system.
ID: 971012
Profile RottenMutt
Joined: 15 Mar 01
Posts: 1011
Credit: 230,314,058
RAC: 0
United States
Message 971015 - Posted: 18 Feb 2010, 0:20:19 UTC - in response to Message 970983.  
Last modified: 18 Feb 2010, 0:22:13 UTC

When the campus A/C techs came up in the early evening they just pressed the reset button and it came back to life.


I sure hope you took note of where the reset switch is...

Cricket still shows little activity, so you must still be down. Some of my rigs are out of work for the GPUs, and others will be out soon (hours)...
ID: 971015
dbryce
Joined: 23 Dec 99
Posts: 4
Credit: 906,647
RAC: 0
Canada
Message 971017 - Posted: 18 Feb 2010, 0:31:48 UTC - in response to Message 971012.  
Last modified: 18 Feb 2010, 0:32:29 UTC


It was a P390 running OS/2 Warp and VM/ESA. It was so old it didn't have any idea what a smart UPS was. The hard drive failure would happen just because it stopped turning. On the other hand, I would have to do a cold start on VM/ESA but we never lost a byte of data with that setup. I am not sure other operating systems would be as forgiving so I provided a warning.
We did have a UPS but its main function was to filter power glitches. One danger of putting the switch on the UPS is additional heat will be generated while the UPS reaches its shutdown point. My room was not much larger than a closet so when things overheated, they needed to be shut down fast.
The system was up 24 hours a day and often would be unattended so the failure would most likely happen when no one was around to lay hands on the system.


I remember that box!! <g> In my 'previous life' we were running one of those and we had a 'UPS on steroids' that would power the machine for, I think, 2 hours. It might even have powered our 'server farm', but that was 6.5 years ago and my memory is iffy.

Doug
ID: 971017
frank
Joined: 14 May 99
Posts: 1
Credit: 362,082
RAC: 0
United States
Message 971053 - Posted: 18 Feb 2010, 2:33:13 UTC

thanks matt. was sure worried why it's been off so long. thank you for your work there.
ID: 971053
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 971079 - Posted: 18 Feb 2010, 4:20:57 UTC - in response to Message 971002.  

For a while my job depended on a system cooled by an air conditioner that I could not depend on. My solution was to get one of these and wire it into an extension cord so I could connect all the non-replaceable equipment to it. I then set it to about 80 F and had no worries about failed hardware. The catch is you must make sure your backups are up to date as the power down will be very hard and in my case the raid lost a drive often when it was powered down (very old drives).

Most anything semi-modern supports some sort of "dumb" signaling from a UPS.

It uses a normal serial port, and only the handshake lines. A line goes "low" to signal "low battery" and the UPS waits for the system to drop a handshake line back when it is safe for the UPS to turn off.

One could build a "UPS" whose only job was to signal low battery when the temperature rose above a certain threshold, and kill power when the system said "okay."

Power would be restored when it got cold enough. Or not.
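
To make the idea concrete, here is a small simulation of that temperature-triggered "UPS" as a state machine. It is only a sketch: the class name `ThermalUps`, the 27 C trip point (roughly the 80 F mentioned earlier in the thread), and the lower restore temperature are all assumptions, and the handshake lines are modelled as plain booleans rather than real serial pins.

```python
from enum import Enum


class State(Enum):
    POWERED = "powered"        # normal operation, no signal to the host
    SIGNALLING = "signalling"  # "low battery" asserted, waiting for the host
    OFF = "off"                # power to the host has been cut


class ThermalUps:
    """Fake UPS that reports 'low battery' when the room gets too hot."""

    def __init__(self, trip_c=27.0, restore_c=22.0):
        self.trip_c = trip_c        # hypothetical trip temperature
        self.restore_c = restore_c  # hypothetical "cold enough" temperature
        self.state = State.POWERED

    @property
    def low_battery_line(self):
        """True while the host is being told to shut itself down."""
        return self.state is State.SIGNALLING

    def step(self, temp_c, host_ok_to_cut):
        """Advance one tick and return the new state."""
        if self.state is State.POWERED and temp_c >= self.trip_c:
            self.state = State.SIGNALLING
        elif self.state is State.SIGNALLING and host_ok_to_cut:
            # Host dropped its handshake line: safe to kill power.
            self.state = State.OFF
        elif self.state is State.OFF and temp_c <= self.restore_c:
            self.state = State.POWERED
        return self.state
```

Once the signal is raised, this sketch keeps waiting for the host even if the room cools, on the assumption that the host is already partway through its shutdown. Leaving out the restore step gives you the "or not" version.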
ID: 971079
Profile Bob1701a
Joined: 11 Apr 00
Posts: 9
Credit: 6,933,206
RAC: 0
United States
Message 971080 - Posted: 18 Feb 2010, 4:22:19 UTC - in response to Message 970983.  

The smart-ass in me made me write this.....

The A/C died and it's too hot? It's winter, it's 25 degrees and snowing...open the windows. That'll cool you off.
ID: 971080
Profile gizbar
Joined: 7 Jan 01
Posts: 586
Credit: 21,087,774
RAC: 0
United Kingdom
Message 971082 - Posted: 18 Feb 2010, 4:25:01 UTC

Blasted A/C! Now we're out of the frying pan, can we just avoid the fire this time? ;-)

Good job guys.

Trying to recover some data off a laptop drive for someone at the moment. Of course, there isn't a backup, and this is the 4th system I'm trying to recover just recently. The battery has gone in the laptop and, seeing as it is a normal P4 @ 3.00 GHz, the PSU is struggling to supply everything now too.

I've had to take the hdd out and attach it to a desktop. After the usual virus checks etc, I started a chkdsk over 10 hours ago and it's less than halfway through!

Oh well. At least it keeps me busy while you guys were up to your eyeballs in it.

Gizbar.


A proud GPU User Server Donor!
ID: 971082
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13715
Credit: 208,696,464
RAC: 304
Australia
Message 971089 - Posted: 18 Feb 2010, 4:43:47 UTC - in response to Message 971080.  

The smart-ass in me made me write this.....

The A/C died and it's too hot? It's winter, it's 25 degrees and snowing...open the windows. That'll cool you off.

Not from that part of the world, but I don't think it snows too often at Berkeley. And most server rooms don't have windows, let alone ones that open.
Grant
Darwin NT
ID: 971089
Joori
Joined: 2 Nov 07
Posts: 1
Credit: 3,578,735
RAC: 0
Australia
Message 971093 - Posted: 18 Feb 2010, 4:52:49 UTC

Nice to hear everything is almost back to normal. Unfortunate that a lot of work units were aborted while trying to upload them, as their deadline had passed during the downtime. I have a feeling more will be aborted as they are still unable to be uploaded.

Kinda disappointed but what can ya do aye? You win some, you lose some - gotta keep on truckin' ! :)
ID: 971093
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 971095 - Posted: 18 Feb 2010, 5:05:40 UTC - in response to Message 971080.  

The A/C died and it's too hot? It's winter, it's 25 degrees and snowing...open the windows. That'll cool you off.

Assuming the server room is near an outside wall and has openable windows, of course.
ID: 971095

 