Power - the Reunion Tour (Jun 11 2012)

Message boards : Technical News : Power - the Reunion Tour (Jun 11 2012)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1441
Credit: 213,689
RAC: 0
United States
Message 1244754 - Posted: 11 Jun 2012, 22:17:30 UTC

Kind of a bumpy weekend. So we moved that database (which handles the seti.berkeley.edu website) from Dan's new but oddly crashy desktop onto my new desktop. Then over the weekend MY new desktop started crashing at random. You'd think this is now clearly related to the database, but Dan's desktop continued to crash after we moved the MySQL database off of it. And upon further inspection both systems sometimes crash before the OS is even loaded.

So this looks like a hardware problem after all. Funny how both of these new systems are failing in the same manner. We think it has to do with the power outages from a couple weeks ago sending some jolts into these perhaps more sensitive systems.

But speaking of outages, completely separate from those previous power issues which have since been fixed, there was a brand new problem affecting just this building (and all the projects within it, including SETI@home/BOINC). This one was worse: it started in the middle of the night, and by the time anybody could do anything power had gone up and down several times, some outlets were delivering half power, etc.

The repairs were much faster, and we were stable again around noon, but upon turning everything back on we found we completely lost thinman, the main web server. Totally dead. However, quite luckily, we happened to have a spare old frankenstein machine kicking around, and I was able to do a "brain transplant" i.e. swap the drives from thinman to this other machine. Now this other machine thinks it is thinman and is working quite well as a web server. Dodged a major bullet there.

I also happened to have my old desktop nearby, so I'm using that as I diagnose the new crashy one. Not sure who is responsible for all these damages and lost time, but it definitely shouldn't be us.

- Matt


-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

ID: 1244754
Claggy
Project Donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4623
Credit: 46,348,550
RAC: 2,965
United Kingdom
Message 1244764 - Posted: 11 Jun 2012, 22:28:48 UTC - in response to Message 1244754.  

Thanks for the update Matt,

Claggy

ID: 1244764
Chris S
Crowdfunding Project Donor
Volunteer tester
Joined: 19 Nov 00
Posts: 38182
Credit: 21,365,988
RAC: 27,736
United Kingdom
Message 1244765 - Posted: 11 Jun 2012, 22:31:04 UTC

Thanks for the update Matt.

We think it has to do with the power outages from a couple weeks ago sending some jolts into these perhaps more sensitive systems.

Does newer kit need to be run off UPSs to regularise dirty mains?

Hope Thinman might be recoverable, but if not can it be used for spares?

Not sure who is responsible for all these damages and lost time, but it definitely shouldn't be us.

Quite right, no it shouldn't. But as usual I expect politics will play a part in it. Still, not the best way to start the week.

ID: 1244765
Gary Charpentier
Crowdfunding Project Donor
Volunteer tester
Joined: 25 Dec 00
Posts: 18641
Credit: 21,462,042
RAC: 20,023
United States
Message 1244769 - Posted: 11 Jun 2012, 22:41:08 UTC

Thanks for the update, and let us know if you need a petition drive to have the powers that be held responsible for the damage.

Actually, it wouldn't surprise me if the first outage stressed something in the building and it went.


ID: 1244769
DJStarfox
Joined: 23 May 01
Posts: 1057
Credit: 802,388
RAC: 86
United States
Message 1244805 - Posted: 12 Jun 2012, 0:01:31 UTC - in response to Message 1244754.  

Now would be a great time to get the funds for those whole-closet UPS devices. How much could that possibly cost the school? ;)

ID: 1244805
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2871
Credit: 10,621,745
RAC: 322
United States
Message 1244824 - Posted: 12 Jun 2012, 1:11:20 UTC - in response to Message 1244805.  
Last modified: 12 Jun 2012, 1:12:51 UTC

Now would be a great time to get the funds for those whole-closet UPS devices. How much could that possibly cost the school? ;)

Or at the very least, some line conditioners, which are usually built into UPS units. Line conditioners will clean up noisy power, and most of the time also handle very strong surges just fine. They may help keep weird power scenarios from taking out machines... or dirty/noisy power may be what is causing those strange and random crashes.

One of my long-since retired crunchers continues to do other things for me around the house and it was acting weird and would randomly crash. Sometimes it would be weeks before it did it, other times it would be repeatedly for an hour or so. I ran memtest on it and discovered the RAM needed more voltage. Instead of the 2.6 that it wanted, I already had the board set for 2.8, so I had to crank it to 2.9, and that fixed it.

Might just be a power issue, either internal or external.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)

ID: 1244824
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7243
Credit: 87,244,366
RAC: 5,468
Australia
Message 1244871 - Posted: 12 Jun 2012, 3:08:23 UTC - in response to Message 1244754.  
Last modified: 12 Jun 2012, 3:10:24 UTC

... Dan's new but oddly crashy desktop onto my new desktop. Then over the weekend MY new desktop started crashing at random. You'd think this is now clearly related to the database, but Dan's desktop continued to crash after we moved the MySQL database off of it. And upon further inspection both systems sometimes crash before the OS is even loaded.

So this looks like a hardware problem after all. Funny how both of these new systems are failing in the same manner. ...


One relatively new possibility, in addition to the usual checks, is quick & easy to eliminate. There's been a general trend lately to supply XMP profile (or other high-frequency, tight-latency) memory that defaults to 'normal undervolts'.

After a typical burn-in period of 14 hours or so the crashy symptoms appear, & gradually worsen over time. Heavy RAM usage patterns in particular then throw either the controller or the RAM modules over the edge, while memtests often come up clear.

The quick check is to make sure the DIMM voltage matches the XMP profile spec, and that VID (memory controller in the CPU) is set to about 70% of that (which is for impedance matching purposes, maximising signal integrity & stopping the memory controller sinking excessive current).

Jason
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin

ID: 1244871
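The quick check Jason describes comes down to two comparisons, which the following Python sketch spells out. The 70% ratio is taken from his post; the function names, the 0.01 V tolerance, and the 1.65 V / 1.50 V example values are assumptions made only for illustration, not settings for any particular board or memory kit.

```python
def controller_voltage_target(dimm_voltage: float, ratio: float = 0.70) -> float:
    """Approximate memory-controller voltage for a given DIMM voltage.

    The default ratio of 0.70 is the "about 70%" rule of thumb from the post.
    """
    if dimm_voltage <= 0:
        raise ValueError("dimm_voltage must be positive")
    return round(dimm_voltage * ratio, 3)


def dimm_voltage_matches(configured: float, xmp_spec: float,
                         tolerance: float = 0.01) -> bool:
    """True if the configured DIMM voltage is within `tolerance` volts of the spec.

    The tolerance is an assumed value, not something stated in the thread.
    """
    return abs(configured - xmp_spec) <= tolerance


if __name__ == "__main__":
    xmp_spec = 1.65      # hypothetical XMP profile voltage, in volts
    configured = 1.50    # hypothetical value found in the BIOS, in volts

    if not dimm_voltage_matches(configured, xmp_spec):
        print(f"DIMM voltage {configured} V does not match XMP spec {xmp_spec} V")
    print(f"Controller voltage to aim for: {controller_voltage_target(xmp_spec)} V")
```

The actual values have to be read from the memory's XMP profile and the board's BIOS; the sketch only shows the arithmetic.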
kittyman
Project Donor
Volunteer tester
Joined: 9 Jul 00
Posts: 45916
Credit: 815,217,969
RAC: 124,954
United States
Message 1244955 - Posted: 12 Jun 2012, 7:05:42 UTC

I have wondered this out loud before, but doesn't the campus have some kind of comprehensive insurance coverage that might cover the loss of equipment in cases like this?
I find it hard to believe that lab and computer equipment would not be covered.
Even most basic homeowner's insurance covers this kind of thing, for example in the case of a lightning strike.

It might be worthwhile to ask some serious questions of the proper authorities.....

Just sayin'.


Cats.....what more does one need?

Have made friends in this life.
Most were cats.

ID: 1244955
Chris S
Crowdfunding Project Donor
Volunteer tester
Joined: 19 Nov 00
Posts: 38182
Credit: 21,365,988
RAC: 27,736
United Kingdom
Message 1244978 - Posted: 12 Jun 2012, 9:25:25 UTC

Now would be a great time to get the funds for those whole-closet UPS devices. How much could that possibly cost the school?

I agree. I don't know whether the Seti server closet and other kit has rack mounted UPS's, but if not then they really should have. No UPS will last for a 5 or 6 hour outage, but they will shut down kit gracefully much earlier without any damage, and they protect against the brownouts mentioned by Matt. Seti having its own automatic backup diesel generator would probably be unrealistic.

But if these power problems and outages are likely to continue over the summer then the project has to take steps to protect its kit. If UPS's are needed then I am sure we could start an emergency fund raising drive once we know what is needed and the cost. I'll most certainly chip in what I can afford.

ID: 1244978
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 7486
Credit: 91,096,297
RAC: 46,429
Australia
Message 1244988 - Posted: 12 Jun 2012, 10:02:56 UTC - in response to Message 1244978.  

No UPS will last for a 5 or 6 hour outage,

They can, but it takes big batteries.
The main use for UPSs is protection from surges, brownouts & power failures. If the failure is long enough, then it allows the hardware to be shut down normally.
Larger UPS units are designed to keep systems up till such time as a backup generator can come online, and then keep things up when that shuts down & the system switches back to mains power.
Grant
Darwin NT

ID: 1244988
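To put rough numbers on "it takes big batteries", here is a back-of-the-envelope runtime estimate in Python. It is a sketch under loudly stated assumptions: the 1000 W load, the 500 Wh battery, and the 90% inverter efficiency are hypothetical figures, not measurements from the lab, and the formula ignores battery aging and the capacity lost at high discharge rates, so real runtimes would be shorter.

```python
def ups_runtime_hours(battery_wh: float, load_w: float,
                      inverter_efficiency: float = 0.9) -> float:
    """Idealized runtime: usable battery energy divided by the load."""
    if battery_wh <= 0 or load_w <= 0:
        raise ValueError("battery_wh and load_w must be positive")
    if not 0 < inverter_efficiency <= 1:
        raise ValueError("inverter_efficiency must be in (0, 1]")
    return battery_wh * inverter_efficiency / load_w


def battery_wh_needed(load_w: float, hours: float,
                      inverter_efficiency: float = 0.9) -> float:
    """Battery energy needed to carry `load_w` for `hours` (same idealization)."""
    if load_w <= 0 or hours <= 0:
        raise ValueError("load_w and hours must be positive")
    if not 0 < inverter_efficiency <= 1:
        raise ValueError("inverter_efficiency must be in (0, 1]")
    return load_w * hours / inverter_efficiency


if __name__ == "__main__":
    load_w = 1000.0  # hypothetical load for a small rack, in watts

    minutes = ups_runtime_hours(500.0, load_w) * 60
    print(f"500 Wh battery at {load_w:.0f} W: about {minutes:.0f} minutes")

    for label, hours in (("10-minute graceful shutdown", 10 / 60),
                         ("6-hour outage", 6.0)):
        print(f"{label}: about {battery_wh_needed(load_w, hours):.0f} Wh")
```

Under these assumptions, riding out a six-hour outage needs dozens of times more battery than a ten-minute graceful shutdown, which is the gap the posts above are arguing about.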
Cheopis
Joined: 17 Sep 00
Posts: 150
Credit: 16,554,824
RAC: 1,411
United States
Message 1244997 - Posted: 12 Jun 2012, 10:28:33 UTC
Last modified: 12 Jun 2012, 10:30:31 UTC

I do not think it is reasonable to try to get a UPS system that will do more than protect the machines, and allow them enough time to gracefully power off after a short timeframe running with no power. Maybe 10 minutes.

Power conditioning and voltage regulation, if they are not already a part of the lab's UPS system, should be considered. Every time you have an outage like this one (especially in an older building), some other part of the electrical system gets stressed. You might have cascading problems every few weeks for the next year before everything is all ironed out.

ID: 1244997
Slavac
Volunteer tester
Joined: 27 Apr 11
Posts: 1932
Credit: 17,952,639
RAC: 0
United States
Message 1245056 - Posted: 12 Jun 2012, 14:17:38 UTC - in response to Message 1244997.  

We've floated the idea of power stabilizing hardware to the lab; I'll let everyone know if they decide they'd like some.

It's heartbreaking that our two new workstations got crippled, but given the past few weeks it's understandable. We'll replace the damaged components ASAP once Matt et al figure out the issues.




Executive Director GPU Users Group Inc. -
brad@gpuug.org

ID: 1245056
Chris S
Crowdfunding Project Donor
Volunteer tester
Joined: 19 Nov 00
Posts: 38182
Credit: 21,365,988
RAC: 27,736
United Kingdom
Message 1245492 - Posted: 13 Jun 2012, 18:38:43 UTC

Thanks Slavac, your heads up is appreciated.

ID: 1245492
edjcox
Joined: 20 May 99
Posts: 88
Credit: 4,592,005
RAC: 571
United States
Message 1245745 - Posted: 14 Jun 2012, 5:53:17 UTC

Even some small UPS equipment for the PC's would help keep the power gremlins from disturbing circuitry and such and shortening lifespan. I have all my gear at home on UPS for graceful shutdown and power conditioning at all times...

Find out who your campus engineer is and raise hell ... Let people know they are destroying equipment with their shenanigans. This should be upchanneled as much as possible to let management know this is costing them money, time, equipment...


Never engage stupid people at their level, they then have the home court advantage.....

ID: 1245745

©2016 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.