Stormy (Nov 22 2010)


Message boards : Technical News : Stormy (Nov 22 2010)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 1050343 - Posted: 22 Nov 2010, 18:50:27 UTC

I'll write today's message early as this week is a short holiday week so we're kinda busy.

First and foremost, carolyn is now the *only* mysql replica - I just turned the other replica (the troublesome server mork) off, perhaps for good. Yay! That's one of the two new servers more or less ready for prime time, though we still hope to make carolyn the master (and jocelyn the replica) today or tomorrow.

We're still far from getting the whole project back on line - we have the other new server, oscar, installed and ready to roll, but still need to (a) install and configure informix on it, (b) clean up the science database on thumper, and then (c) transfer all the data from thumper to oscar. This may take a while - the spike merge (which was the last major part of the "clean up") did finally complete last week (after running about 2-3 months) but there was still a discrepancy of about a million missing spikes which Jeff is successfully tracking down. So there are a few extra merges to do yet. We probably won't really dig into getting oscar on line until after Thanksgiving.

Of course, what's a weekend without an unexpected server crash or two? On Saturday afternoon a major lightning storm swept through the Bay Area. Other projects in the lab (located in the other building) had major power outages. Luckily we were spared a full outage, but apparently a couple of our servers got hung up around this time, perhaps due to some kind of non-zero power fluctuation. The servers were thumper and marvin - each located in different rooms, and on different breakers. It is funny that these two machines are our current two informix servers (thumper holds the SETI@home scientific data, and marvin holds Astropulse). So there was some cleanup to deal with this morning (database/filesystem recovery, hung mounts, etc.) but really no big shakes and we're back to normal (whatever normal is these days). Both systems were on surge protectors so I'm not sure why they were so sensitive - maybe the crashes were random and the timing was coincidental with the storm.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Bill G
Project donor
Joined: 1 Jun 01
Posts: 349
Credit: 43,132,712
RAC: 47,677
United States
Message 1050347 - Posted: 22 Nov 2010, 19:40:14 UTC - in response to Message 1050343.

Great and thanks for the info.
____________

perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 15,946,303
RAC: 11,984
United States
Message 1050348 - Posted: 22 Nov 2010, 19:40:46 UTC - in response to Message 1050343.

Thanks for the update, Matt. Take your time; we will be here when you're ready to turn it back on. Maybe soon these good news/bad news messages will turn into only good news for a long time to come.
____________


PROUD MEMBER OF Team Starfire World BOINC

SMW
Project donor
Joined: 16 May 99
Posts: 21
Credit: 11,230,046
RAC: 8,289
United States
Message 1050349 - Posted: 22 Nov 2010, 19:44:21 UTC

Thanks for keeping us in the loop on what's happening; we appreciate it.
____________
"It is better to be hated for what you are than to be loved for what you are not"
- Andre Gide (1869-1951)

DJStarfox
Joined: 23 May 01
Posts: 1045
Credit: 561,241
RAC: 518
United States
Message 1050350 - Posted: 22 Nov 2010, 19:48:26 UTC - in response to Message 1050343.

Thanks for the update.

Yeah, getting jocelyn the replica and all that working will be a great way to go into the holiday weekend. Oscar can wait.

BTW, it seems the website/forums are fast and snappy compared to a month ago.

Gary Charpentier
Project donor
Volunteer tester
Joined: 25 Dec 00
Posts: 12743
Credit: 7,285,538
RAC: 17,957
United States
Message 1050353 - Posted: 22 Nov 2010, 19:54:43 UTC

Thanks for the update Matt.

Also, you might want to replace those surge protectors if you had local strikes. There's a good chance they did their job and protected you but lost their life doing it. As you know, MOVs die with time.

____________

OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 13625
Credit: 31,057,329
RAC: 20,600
United States
Message 1050364 - Posted: 22 Nov 2010, 20:22:58 UTC - in response to Message 1050343.

If we don't hear back from any of you guys for the rest of the week, I want to wish everyone at the lab a happy Thanksgiving.

John Clark
Volunteer tester
Joined: 29 Sep 99
Posts: 16515
Credit: 4,418,829
RAC: 0
United Kingdom
Message 1050366 - Posted: 22 Nov 2010, 20:41:40 UTC

Have a good break, when it comes, and thanks for the update.
____________
It's good to be back amongst friends and colleagues



Roy Wall (shiny sides)
Joined: 8 Nov 99
Posts: 5
Credit: 2,642,427
RAC: 2,944
United States
Message 1050389 - Posted: 22 Nov 2010, 21:58:44 UTC - in response to Message 1050343.

Thanks Matt for the update. Keep up the good work.
____________

Kibble (KB7TIB)
Joined: 6 Dec 99
Posts: 21
Credit: 1,696,913
RAC: 5,288
United States
Message 1050394 - Posted: 22 Nov 2010, 22:17:07 UTC - in response to Message 1050343.

I agree that you guys are doing a superb job, Matt. Having fun with the new toys. :-) And thank you for the update. We are all patiently waiting for the new systems to go live. I'll just continue chewing on Einstein and LHC w/u's here until then.

It might be a good idea to acquire some backup power units rather than simple surge protectors. Modern ones will allow the servers to gracefully shut down from battery power when the mains go out, and let the batteries take the hits from surges.

Regardless, hope your feasting with friends and family goes well.
____________

Swibby Bear
Joined: 1 Aug 01
Posts: 236
Credit: 7,276,504
RAC: 3
United States
Message 1050446 - Posted: 23 Nov 2010, 2:19:57 UTC - in response to Message 1050394.

It might be a good idea to acquire some backup power units rather than simple surge protectors. Modern ones will allow the servers to gracefully shut down from battery power when the mains go out, and let the batteries take the hits from surges.


Matt has described over the years that the servers are all on heavy-duty UPS backup systems.

But any surge protectors are sacrificial as they age.

lupo
Joined: 29 Aug 10
Posts: 91
Credit: 4,736,407
RAC: 0
United States
Message 1050462 - Posted: 23 Nov 2010, 2:47:05 UTC

So, what kind of time frame do you think until the project is back up? Another few weeks? Another few months? Just curious.

Adam

cer
Joined: 15 Apr 00
Posts: 3
Credit: 959,601
RAC: 0
United States
Message 1050494 - Posted: 23 Nov 2010, 5:34:01 UTC - in response to Message 1050343.

...Of course, what's a weekend without an unexpected server crash or two? On Saturday afternoon a major lightning storm swept through the Bay Area. Other projects in the lab (located in the other building) had major power outages. Luckily we were spared a full outage, but apparently a couple of our servers got hung up around this time, perhaps due to some kind of non-zero power fluctuation. The servers were thumper and marvin - each located in different rooms, and on different breakers. It is funny that these two machines are our current two informix servers (thumper holds the SETI@home scientific data, and marvin holds Astropulse)...

... Both systems were on surge protectors so I'm not sure why they were so sensitive - maybe the crashes were random and the timing was coincidental with the storm.

- Matt

First Matt... thank you for taking time to issue these updates. You can't imagine how important they are to the community. Personally, I hardly ever respond, but believe me that's no indication of their value.

What struck me about your post was the closing supposition... One crash with a storm might be random, not two.

Others here have observed that suppression is sometimes sacrificial. I have found this to be true.

I don't know if you regularly do any EMC testing of suppression integrity there, but I encourage your group to do so. From your description, I'd begin with the facility grounding system.

Good luck, and again.... Thank You.
____________

tullio
Project donor
Joined: 9 Apr 04
Posts: 3756
Credit: 388,305
RAC: 120
Italy
Message 1050506 - Posted: 23 Nov 2010, 6:24:04 UTC - in response to Message 1050462.

I bought a UPS last summer for 79 euros to protect my SUN workstation from summer blackouts caused by air conditioners, and it worked well. I remember one summer day at Area Research Park in Trieste when the UPSs shut down because of poor air conditioning in their closet and all Area computers were stopped, including that of Nobelist Carlo Rubbia, who was building the Elettra synchrotron radiation machine. He was rather upset.
Tullio
____________

Cosmic_Ocean
Joined: 23 Dec 00
Posts: 2290
Credit: 8,814,369
RAC: 4,017
United States
Message 1050507 - Posted: 23 Nov 2010, 6:24:14 UTC

I have seen in the past that the throw time for a UPS combined with a power supply's hold-up time can be very close to being truly uninterrupted. Sometimes if the right conditions happen, you still end up with a brown-out on the DC side of the power supply. Most times the system will just shut off, but sometimes it will just freeze due to CPU/RAM/chipset forgetting what it was doing due to reduced power, albeit briefly.
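
To make the timing argument concrete, here is a minimal Python sketch. The function name and the millisecond figures are illustrative assumptions, not measurements from any particular UPS or power supply. The only idea it captures is that the load rides through a switchover when the power supply's hold-up time exceeds the UPS transfer ("throw") time.

```python
def rides_through(ups_transfer_ms: float, psu_holdup_ms: float,
                  margin_ms: float = 0.0) -> bool:
    """Return True if the PSU can hold its DC output for longer than
    the UPS takes to switch over to battery, by more than margin_ms."""
    return psu_holdup_ms - ups_transfer_ms > margin_ms


# Hypothetical figures, for illustration only.
print(rides_through(ups_transfer_ms=8.0, psu_holdup_ms=16.0))   # hold-up outlasts the transfer
print(rides_through(ups_transfer_ms=12.0, psu_holdup_ms=10.0))  # DC side sags before the UPS takes over
```

Hold-up time generally shrinks as the load on the supply rises, so a heavily loaded machine would have less margin than a lightly loaded one on the same UPS.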

UPS battery packs do in fact become effectively useless after a few years, though I have heard on numerous occasions that discharging the batteries to at least 50% once per month can in some cases double the life of them.

Once your batteries do become useless, depending on how much a new equivalent unit costs, it is very cost-effective to replace the batteries, often several times before it becomes time to just buy a new unit. I replaced the batteries in my 1500 about three years ago for US$120, when a new 1500 like it was well over $500. Then I brought home two 1400 carcasses from work and got batteries for them for less than $200 total. Batteries are inexpensive in comparison a lot of the time.
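
As a rough worked version of that comparison, the sketch below uses the figures quoted above (US$120 for batteries, a new unit at US$500 or more). The function name is hypothetical, and since the new-unit price is only a lower bound, the result is a lower bound on the saving.

```python
def replacement_savings(battery_cost: float, new_unit_cost: float) -> float:
    """Fraction of a new unit's price saved by replacing only the batteries."""
    return 1.0 - battery_cost / new_unit_cost


# Figures from the post: US$120 for batteries, new unit at least US$500.
saving = replacement_savings(120.0, 500.0)
print(f"Saved at least {saving:.0%} of the new-unit price")
```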
____________

Linux laptop uptime: 1484d 22h 42m
Ended due to UPS failure, found 14 hours after the fact

lupo
Joined: 29 Aug 10
Posts: 91
Credit: 4,736,407
RAC: 0
United States
Message 1050510 - Posted: 23 Nov 2010, 6:29:00 UTC - in response to Message 1050464.
Last modified: 23 Nov 2010, 6:36:46 UTC

Thanks, KittyMan. Sometimes I find it frustrating wanting to help in their rebuild in a field that I have expertise in. I'm trying hard not to be an arm-chair quarterback since I do not know all the ins and outs of their current situation. However, when I saw the photos of their server rack I was more than a little shocked. It was hard to believe that they were supporting so many clients in the real world on that setup. I understand that there are financial limitations that make it hard for the seti guys to have the latest and greatest hardware, but a lot can be done with just some common sense and a shoe-string budget.

The power issues are a great concern to me. If I were Seti, I would consider co-locating their servers in a Tier 4 data center. A cage big enough to house their equipment would cost very little and all access can be done remotely (unless hardware changes are required.) In our setup, my team and I manage over 10K Windows servers remotely in our two Tier 4 data centers. We have two people on site that handle any hardware changes that are required and at least 1 person on site per 8 hour shift in the command center in the event of an emergency. (My team is myself and 3 other Sr. Engineers, 15 system engineers in India, and 4 interns.)

I bet with a little work Seti could get the cage donated and their costs would be practically 0. I would think their highest MRC would be bandwidth charges. (Hell, if I was given the ability to speak as a duly authorized agent on their behalf, I could probably find them the co-location facility and get a cage donated.)

Again, I apologize, and I am not trying to attack anyone's work ethic, but there are times I want to help the project so badly, and not being able to lend my expertise is quite frustrating.

One thing I will recommend: go to a company like upsforless.com and purchase a few Online Double Conversion UPSs. (Make sure to get the Double Conversion UPSs. They are the best and most secure type of UPS available.) I have purchased two of their Liebert UPSs and they are great. (One for my home theater, one for the computers in my office.) They are refurbished units but come with a full warranty and are a hell of a bargain. (I have nothing to do with the company, just pointing out a good value.)

1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 1050517 - Posted: 23 Nov 2010, 6:55:14 UTC - in response to Message 1050510.
Last modified: 23 Nov 2010, 6:56:17 UTC

I bet with a little work Seti could get the cage donated and their costs would be practically 0. I would think their highest MRC would be bandwidth charges. (Hell, if I was given the ability to speak as a duly authorized agent on their behalf, I could probably find them the co-location facility and get a cage donated.)

Okay, let's assume that for $0, SETI could get space in a nice data center.

They'll still need to pay for bandwidth between the servers (the data center) and the users.

Then we have the "tapes" from Arecibo, which are shipped from Puerto Rico, and have to be mounted and copied to the servers to be split.

That's bandwidth from Campus to the Data Center, probably equal to what they currently have (and have to pay for) -- and you need that bandwidth to bring the completed work back.

Doubling the monthly bandwidth expense may not turn out to be "help" -- and that's why a data center may not be as good an idea as it might seem.
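
The doubling argument can be sketched as a toy cost model. Everything here is an assumption for illustration: the function name, the dollar figure, and above all the premise (taken from "probably equal" above) that a campus-to-data-center link would cost about as much as the existing link to users.

```python
def monthly_bandwidth_cost(current_link_cost: float, colocated: bool) -> float:
    """Toy model: co-location keeps the link to volunteers and adds a
    second, similarly sized link between campus and the data center."""
    servers_to_users = current_link_cost  # still needed wherever the servers live
    # Raw data goes out to the data center and completed work comes back.
    campus_to_datacenter = current_link_cost if colocated else 0.0
    return servers_to_users + campus_to_datacenter


# Hypothetical monthly figure, not SETI@home's actual bill.
print(monthly_bandwidth_cost(1000.0, colocated=False))
print(monthly_bandwidth_cost(1000.0, colocated=True))
```

Under these assumptions a free cage still doubles the recurring bandwidth bill, which is the point being made; a cheaper or donated campus link would change the conclusion.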

Grant (SSSF)
Joined: 19 Aug 99
Posts: 5868
Credit: 60,604,575
RAC: 47,519
Australia
Message 1050531 - Posted: 23 Nov 2010, 8:47:52 UTC - in response to Message 1050507.

...though I have heard on numerous occasions that discharging the batteries to at least 50% once per month can in some cases double the life of them.

Nope.
Heat tends to be the biggest killer of Lead Acid batteries.
Here in Darwin, if you get 2 years out of a car battery, that's pretty good going. When I lived down south (much further down south) 10 years wasn't unusual.

When a lead acid battery voltage drops to 10V, it's as good as dead. Deep cycle batteries can handle such a deep state of discharge, but not often or regularly.
____________
Grant
Darwin NT.


Copyright © 2014 University of California