Heat Wave (Jun 09 2008)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
United States
Message 765412 - Posted: 9 Jun 2008, 20:52:35 UTC

Over the weekend the scheduler ceased operations on its own again. I was able to fix this remotely on Saturday morning, and recovery was swift. This was the same problem as earlier in the week, but this time we had a smoking gun: the CGI output log file had maxed out at 2GB (this is running on a 32-bit system, where a program built without large-file support can't grow a file past 2GB). Cleaning out the logs solved the problem. The thing is, we've been letting these logs grow to 2GB for months without any issue, so why is this a problem all of a sudden? Strange as that is, I put a log rotation script in place to prevent this from happening again any time soon. Funny side note: I would have gotten the alerts faster, but coincidentally the lab-wide mail servers also conked out on Saturday morning. Other than that, nothing much to report for the past couple of days.
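For readers curious what a size-based log rotation involves, here is a minimal sketch. It is a hypothetical illustration, not the script actually put in place: the function name `rotate_if_needed`, the 1 GiB threshold, and the number of kept copies are all assumptions. The threshold sits well below 2^31 - 1 bytes, the largest offset a signed 32-bit `off_t` can hold, which is where writes from a 32-bit program without large-file support start failing.

```python
import os

# Largest file offset a signed 32-bit off_t can represent (2 GiB minus 1 byte).
MAX_32BIT_OFFSET = 2**31 - 1

# Assumed rotation threshold: 1 GiB, comfortably below the 32-bit limit.
DEFAULT_MAX_BYTES = 2**30


def rotate_if_needed(path, max_bytes=DEFAULT_MAX_BYTES, keep=5):
    """Hypothetical helper: if `path` has reached `max_bytes`, rename it to
    path.1, shifting older copies to path.2 ... path.<keep> and discarding
    anything older. Returns True if a rotation happened."""
    if not os.path.exists(path) or os.path.getsize(path) < max_bytes:
        return False

    # Shift older copies up by one, newest last so nothing is clobbered early:
    # path.(keep-1) -> path.keep (overwriting the oldest), ..., path.1 -> path.2.
    for i in range(keep - 1, 0, -1):
        src = f"{path}.{i}"
        if os.path.exists(src):
            os.replace(src, f"{path}.{i + 1}")

    # Move the live log aside. A writer that opens the log in append mode
    # will create a fresh, empty file on its next open.
    os.replace(path, f"{path}.1")
    return True
```

Run periodically (for example from cron) as `rotate_if_needed("/path/to/cgi_output.log")`, where the path is a placeholder. Renaming only helps if the writer reopens the file; a process that holds the log open keeps writing to the renamed copy, in which case copy-and-truncate is the usual alternative. On Linux, the stock `logrotate` utility's `size`, `rotate`, and `copytruncate` directives cover the same ground without custom code.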

Which brings us to today. Around 12:30 our server closet air conditioning unit died. Within 30 minutes all the servers had warmed up by over 5 degrees Celsius and I started getting alerts. This may be a significant problem (i.e. we may need more than just a coolant refill). So depending on how fast we can get the maintenance people up here, I might have to shut down parts or all of the project to prevent server burnout. Meanwhile, I have the server closet doors open to help cool things down, much to the annoyance of all the projects on this floor (the fan noise is about 20-30 decibels louder with the doors open). The poor people across the hall from the closet are being deafened - my desk is a few doors down.
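The post doesn't say how the temperature alerts are generated. Purely as an illustration, a check that flags a server whose temperature climbs more than a set amount above its recent low within a time window could look like this; the `Reading` type, the `rise_alert` function, and the 5 °C / 30-minute defaults (borrowed from the numbers above) are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    minute: int      # minutes since monitoring started (hypothetical unit)
    celsius: float   # measured server temperature


def rise_alert(readings, max_rise=5.0, window_minutes=30):
    """Hypothetical check: return the first reading that is more than
    `max_rise` degrees C above the lowest earlier reading taken within the
    preceding `window_minutes`, or None. `readings` must be sorted by time."""
    for i, current in enumerate(readings):
        # Earlier readings that fall inside the look-back window.
        window = [r for r in readings[:i]
                  if current.minute - r.minute <= window_minutes]
        if window:
            baseline = min(r.celsius for r in window)
            if current.celsius - baseline > max_rise:
                return current
    return None
```

For example, readings of 24.0, 25.5, 27.0 and 29.5 °C taken at minutes 0, 10, 20 and 30 trigger on the last sample, a 5.5 °C rise inside the window. Comparing against the window's minimum rather than a fixed setpoint catches a failing cooler early, even before absolute temperatures look alarming.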

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
Dr. C.E.T.I.
United States
Message 765419 - Posted: 9 Jun 2008, 21:14:51 UTC


. . . oh mi lord - hopefully NOTHIN' else shall go wrong eh

> you're doin' a great job Matt - iT is appreciated . . .

< goes for all of you @ Berkeley btw - Thanks to each of you


Urs Echternacht
Volunteer tester
Germany
Message 765429 - Posted: 9 Jun 2008, 21:29:44 UTC
Last modified: 9 Jun 2008, 21:30:12 UTC

Give out a round of earplugs to the folks near that open door. The gesture alone should calm them down, hopefully.
_\|/_
U r s
Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
United States
Message 765452 - Posted: 9 Jun 2008, 22:25:37 UTC

Update: The air conditioner people came up from campus and inspected everything. Long story short, a faulty switch caused the outside fans to turn off. This switch is now temporarily bypassed until they can replace it. Meanwhile, the unit is running, cold air is coming in, the doors are closed, the hall is quiet again, and everybody is happy.

- Matt
Neil Blaikie
Volunteer tester
Canada
Message 765456 - Posted: 9 Jun 2008, 22:38:34 UTC

Good to see the technicians got a temporary fix working for you guys. Here in Montreal we're having problems with the heat as well: it has been warm all weekend, with high humidity!

While looking at the weather before checking for severe thunderstorm warnings, I noticed snow is forecast for the Cascades over the next few days, down to 2,000 ft. That is messed up for June! Maybe it's time to add climate prediction as a backup project.

Keep up the good work and thanks for the updates Matt.
Mad Max
Volunteer tester
United States
Message 765473 - Posted: 9 Jun 2008, 23:17:35 UTC - in response to Message 765412.  



Which brings us to today. Around 12:30 our server closet air conditioning unit died. Within 30 minutes all the servers had warmed up by over 5 degrees Celsius and I started getting alerts. This may be a significant problem (i.e. we may need more than just a coolant refill). So depending on how fast we can get the maintenance people up here, I might have to shut down parts or all of the project to prevent server burnout. Meanwhile, I have the server closet doors open to help cool things down, much to the annoyance of all the projects on this floor (the fan noise is about 20-30 decibels louder with the doors open). The poor people across the hall from the closet are being deafened - my desk is a few doors down.

- Matt


What I find humorous about this is how much it mirrors my own work life.
IAS - Where Space Is Golden!
Steve Dodd
United States
Message 765492 - Posted: 10 Jun 2008, 0:33:36 UTC

Aren't coincidences amazing? On the 1st, Lattice was down for cooling problems. Then last Saturday, Docking was down for cooling, too. Methinks there is some code lurking deep down in BOINC that gets triggered somehow and does this to randomly selected projects :)


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.