Power - the Reunion Tour (Jun 11 2012)


Message boards : Technical News : Power - the Reunion Tour (Jun 11 2012)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 1244754 - Posted: 11 Jun 2012, 22:17:30 UTC

Kind of a bumpy weekend. So we moved that database (which handles the seti.berkeley.edu website) from Dan's new but oddly crashy desktop onto my new desktop. Then over the weekend MY new desktop started crashing at random. You'd think this is now clearly related to the database, but Dan's desktop continued to crash after moving the mysql database off of it. And upon further inspection both systems sometimes crash before the OS is even loaded.

So this looks like a hardware problem after all. Funny how both of these new systems are failing in the same manner. We think it has to do with the power outages from a couple weeks ago sending some jolts into these perhaps more sensitive systems.

But speaking of outages, completely separate from those previous power issues which have since been fixed, there was a brand new problem affecting just this building (and all the projects within it, including SETI@home/BOINC). This one was worse, starting in the middle of the night, and by the time anybody could do anything power had gone up and down several times, with some outlets delivering half power, etc.

The repairs were much faster, and we were stable again around noon, but upon turning everything back on we found we completely lost thinman, the main web server. Totally dead. However, quite luckily, we happened to have a spare old frankenstein machine kicking around, and I was able to do a "brain transplant" i.e. swap the drives from thinman to this other machine. Now this other machine thinks it is thinman and is working quite well as a web server. Dodged a major bullet there.

I also happened to have my old desktop nearby, so I'm using that as I diagnose the new crashy one. Not sure who is responsible for all these damages and lost time, but it definitely shouldn't be us.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Claggy
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4141
Credit: 33,586,832
RAC: 26,816
United Kingdom
Message 1244764 - Posted: 11 Jun 2012, 22:28:48 UTC - in response to Message 1244754.

Thanks for the update Matt,

Claggy

Chris S
Project donor
Volunteer tester
Joined: 19 Nov 00
Posts: 32092
Credit: 13,773,611
RAC: 25,410
United Kingdom
Message 1244765 - Posted: 11 Jun 2012, 22:31:04 UTC

Thanks for the update Matt.

We think it has to do with the power outages from a couple weeks ago sending some jolts into these perhaps more sensitive systems.

Does newer kit need to be run off UPS's to regularise dirty mains?

Hope Thinman might be recoverable, but if not can it be used for spares?

Not sure who is responsible for all these damages and lost time, but it definitely shouldn't be us.

Quite right, no it shouldn't. But as usual I expect politics will play a part in it. Still, not the best way to start the week.

Gary Charpentier
Project donor
Volunteer tester
Joined: 25 Dec 00
Posts: 12729
Credit: 7,264,581
RAC: 17,551
United States
Message 1244769 - Posted: 11 Jun 2012, 22:41:08 UTC

Thanks for the update, and let us know if you need a petition drive to hold the powers that be responsible for the damage.

Actually wouldn't surprise me if the first outage stressed something in the building and it went.


____________

DJStarfox
Joined: 23 May 01
Posts: 1045
Credit: 560,168
RAC: 442
United States
Message 1244805 - Posted: 12 Jun 2012, 0:01:31 UTC - in response to Message 1244754.

Now would be a great time to get the funds for those whole-closet UPS devices. How much could that possibly cost the school? ;)

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2287
Credit: 8,811,200
RAC: 4,080
United States
Message 1244824 - Posted: 12 Jun 2012, 1:11:20 UTC - in response to Message 1244805.
Last modified: 12 Jun 2012, 1:12:51 UTC

Now would be a great time to get the funds for those whole-closet UPS devices. How much could that possibly cost the school? ;)

Or at the very least, some line conditioners, which are usually built into UPS units. Line conditioners will clean up noisy power, and most of the time also handle very strong surges just fine. They may help keep weird power scenarios from taking out machines. Or dirty/noisy power may be what is causing those strange and random crashes.

One of my long-since retired crunchers continues to do other things for me around the house and it was acting weird and would randomly crash. Sometimes it would be weeks before it did it, other times it would be repeatedly for an hour or so. I ran memtest on it and discovered the RAM needed more voltage. Instead of the 2.6 that it wanted, I already had the board set for 2.8, so I had to crank it to 2.9, and that fixed it.

Might just be a power issue, either internal or external.
____________

Linux laptop uptime: 1484d 22h 42m
Ended due to UPS failure, found 14 hours after the fact

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 5051
Credit: 73,846,081
RAC: 12,086
Australia
Message 1244871 - Posted: 12 Jun 2012, 3:08:23 UTC - in response to Message 1244754.
Last modified: 12 Jun 2012, 3:10:24 UTC

... Dan's new but oddly crashy desktop onto my new desktop. Then over the weekend MY new desktop started crashing at random. You'd think this is now clearly related to the database, but Dan's desktop continued to crash after moving the mysql database off of it. And upon further inspection both systems sometimes crash before the OS is even loaded.

So this looks like a hardware problem after all. Funny how both of these new systems are failing in the same manner. ...


One relatively newer possibility, in addition to the usual checks, that's quick & easy to eliminate. There's been a general trend evolving lately, to supply XMP profile (or other high frequency with tight latency) memory defaulting to 'normal undervolts'.

After a typical 14 hour or so burnin period the crashy symptoms appear, & gradually worsen over time. Heavy RAM usage patterns in particular then throw either controller or RAM modules over the edge, while memtests often show clear.

The quick check is to make sure the DIMM voltage matches the XMP profile spec, and that VID (memory controller in the CPU) is set to about 70% of that (which is for impedance matching purposes, maximising signal integrity & stopping the memory controller sinking excessive current).
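
As a rough illustration of the quick check described above, here is a minimal Python sketch. It is not from the original post: the function name, the 0.05 V tolerance, and the example voltages are all hypothetical, and the 0.70 ratio is simply the "about 70%" figure quoted above. Real values have to be read from the BIOS and the memory vendor's XMP spec.

```python
# Hypothetical helper illustrating the "quick check" described in the post.
# The 0.70 ratio comes from the post's "about 70%"; the tolerance is an assumption.

def check_memory_voltages(dimm_v, xmp_spec_v, controller_v, ratio=0.70, tol=0.05):
    """Compare measured voltages against the XMP spec and the ~70% rule of thumb."""
    controller_target = xmp_spec_v * ratio
    return {
        # DIMM voltage should match what the XMP profile specifies
        "dimm_ok": abs(dimm_v - xmp_spec_v) <= tol,
        # Memory controller voltage should sit near ratio * XMP voltage
        "controller_target_v": controller_target,
        "controller_ok": abs(controller_v - controller_target) <= tol,
    }


if __name__ == "__main__":
    # Example values are made up for illustration only.
    report = check_memory_voltages(dimm_v=1.50, xmp_spec_v=1.65, controller_v=1.00)
    for key, value in report.items():
        print(key, value)
```

A failing "dimm_ok" here would correspond to the undervolted default described above: the module is rated for the XMP voltage but the board is supplying less.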

Jason
____________
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin

Chris S
Project donor
Volunteer tester
Joined: 19 Nov 00
Posts: 32092
Credit: 13,773,611
RAC: 25,410
United Kingdom
Message 1244978 - Posted: 12 Jun 2012, 9:25:25 UTC

Now would be a great time to get the funds for those whole-closet UPS devices. How much could that possibly cost the school?

I agree. I don't know whether the Seti server closet and other kit has rack mounted UPS's, but if not then they really should have. No UPS will last for a 5 or 6 hour outage, but they will shut down kit gracefully much earlier without any damage, and they protect against the brownouts mentioned by Matt. Seti having its own automatic backup diesel generator would probably be unrealistic.

But if these power problems and outages are likely to continue over the summer then the project has to take steps to protect its kit. If UPS's are needed then I am sure we could start an emergency fund raising drive once we know what is needed and the cost. I'll most certainly chip in what I can afford.

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5864
Credit: 60,562,021
RAC: 47,732
Australia
Message 1244988 - Posted: 12 Jun 2012, 10:02:56 UTC - in response to Message 1244978.

No UPS will last for a 5 or 6 hour outage,

They can, but it takes big batteries.
The main use for UPSs is protection from surges, brownouts & power failures. If the failure is long enough, then it allows the hardware to be shut down normally.
Larger UPS units are designed to keep systems up till such time as a backup generator can come online, and then keep things up when that shuts down & the system switches back to mains power.
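
To put a rough number on "big batteries", here is a back-of-the-envelope Python sketch. It is not from the original post and rests on simplifying assumptions: a constant load, a fixed inverter efficiency (0.85 is a guess), and no allowance for battery ageing or for the reduced capacity batteries deliver at high discharge rates. The function names and example figures are hypothetical.

```python
# Back-of-the-envelope UPS sizing. All figures are illustrative assumptions.

def ups_runtime_hours(battery_wh, load_w, inverter_efficiency=0.85):
    """Estimated runtime in hours for a given usable battery capacity and load."""
    if load_w <= 0:
        raise ValueError("load_w must be positive")
    return battery_wh * inverter_efficiency / load_w


def battery_wh_needed(load_w, hours, inverter_efficiency=0.85):
    """Usable battery capacity (watt-hours) needed to carry a load for `hours`."""
    if hours < 0 or load_w < 0:
        raise ValueError("load_w and hours must be non-negative")
    return load_w * hours / inverter_efficiency


if __name__ == "__main__":
    # A hypothetical 500 W server riding out a 6 hour outage.
    print(round(battery_wh_needed(load_w=500, hours=6)), "Wh needed")
    # The same load on a hypothetical 1000 Wh battery pack.
    print(round(ups_runtime_hours(battery_wh=1000, load_w=500), 2), "hours of runtime")
```

Under these assumptions the required capacity grows linearly with both load and outage length, which is why long outages are usually bridged with a generator rather than batteries alone.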
____________
Grant
Darwin NT.

Cheopis
Joined: 17 Sep 00
Posts: 139
Credit: 11,312,147
RAC: 8,080
United States
Message 1244997 - Posted: 12 Jun 2012, 10:28:33 UTC
Last modified: 12 Jun 2012, 10:30:31 UTC

I do not think it is reasonable to try to get a UPS system that will do more than protect the machines, and allow them enough time to gracefully power off after a short timeframe running with no power. Maybe 10 minutes.

Power conditioning and voltage regulation, if they are not already a part of the lab's UPS system, should be considered. Every time you have an outage like this one (especially in an older building), some other part of the electrical system gets stressed. You might have cascading problems every few weeks for the next year before everything is all ironed out.

Slavac
Volunteer tester
Joined: 27 Apr 11
Posts: 1932
Credit: 17,952,639
RAC: 0
United States
Message 1245056 - Posted: 12 Jun 2012, 14:17:38 UTC - in response to Message 1244997.

We've floated the idea of power stabilizing hardware to the lab, I'll let everyone know if they decide they'd like some of the same.

It's heartbreaking that our two new workstations got crippled but given the past few weeks it's understandable. We'll replace the damaged components ASAP once Matt et al figure out the issues.
____________


Executive Director GPU Users Group Inc. -
brad@gpuug.org

Chris S
Project donor
Volunteer tester
Joined: 19 Nov 00
Posts: 32092
Credit: 13,773,611
RAC: 25,410
United Kingdom
Message 1245492 - Posted: 13 Jun 2012, 18:38:43 UTC

Thanks Slavac, your heads up is appreciated.

edjcox
Joined: 20 May 99
Posts: 68
Credit: 4,024,854
RAC: 1,152
United States
Message 1245745 - Posted: 14 Jun 2012, 5:53:17 UTC

Even some small UPS equipment for the PC's would help keep the power gremlins from disturbing circuitry and such and shortening lifespan. I have all my gear at home on UPS for graceful shutdown and power conditioning at all times...

Find out who your campus engineer is and raise hell ... Let people know they are destroying equipment with their shenanigans. This should be upchanneled as much as possible to let management know this is costing them money, time, equipment...
____________
Never engage stupid people at their level, they then have the home court advantage.....


Copyright © 2014 University of California