Stormy (Nov 22 2010)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 1050343 - Posted: 22 Nov 2010, 18:50:27 UTC

I'll write today's message early as this week is a short holiday week so we're kinda busy.

First and foremost, carolyn is now the *only* mysql replica - I just turned the other replica (the troublesome server mork) off, perhaps for good. Yay! That's one of the two new servers more or less ready for prime time, though we still hope to make carolyn the master (and jocelyn the replica) today or tomorrow.

We're still far from getting the whole project back on line - we have the other new server, oscar, installed and ready to roll, but still need to (a) install and configure informix on it, (b) clean up the science database on thumper, and then (c) transfer all the data from thumper to oscar. This may take a while - the spike merge (which was the last major part of the "clean up") did finally complete last week (after running about 2-3 months) but there was still a discrepancy of about a million missing spikes which Jeff is successfully tracking down. So there are a few extra merges to do yet. We probably won't really dig into getting oscar on line until after Thanksgiving.

Of course, what's a weekend without an unexpected server crash or two? On Saturday afternoon a major lightning storm swept through the Bay Area. Other projects in the lab (located in the other building) had major power outages. Luckily we were spared a full outage, but apparently a couple of our servers got hung up around this time, perhaps due to some kind of non-zero power fluctuation. The servers were thumper and marvin - each located in different rooms, and on different breakers. It is funny that these two machines are our current two informix servers (thumper holds the SETI@home scientific data, and marvin holds Astropulse). So there was some cleanup to deal with this morning (database/filesystem recovery, hung mounts, etc.) but really no big shakes and we're back to normal (whatever normal is these days). Both systems were on surge protectors so I'm not sure why they were so sensitive - maybe the crashes were random and the timing was coincidental with the storm.

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 1050343 · Report as offensive
Profile Bill G Special Project $75 donor
Joined: 1 Jun 01
Posts: 1282
Credit: 187,688,550
RAC: 182
United States
Message 1050347 - Posted: 22 Nov 2010, 19:40:14 UTC - in response to Message 1050343.  

Great and thanks for the info.

SETI@home classic workunits 4,019
SETI@home classic CPU time 34,348 hours
ID: 1050347 · Report as offensive
Profile perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 1050348 - Posted: 22 Nov 2010, 19:40:46 UTC - in response to Message 1050343.  

Thanks for the update, Matt. Take your time; we will be here when you're ready to turn it back on. Maybe soon these good news/bad news messages will turn into only good news for a long time to come.


PROUD MEMBER OF Team Starfire World BOINC
ID: 1050348 · Report as offensive
Profile SMW

Joined: 16 May 99
Posts: 22
Credit: 29,285,238
RAC: 16
United States
Message 1050349 - Posted: 22 Nov 2010, 19:44:21 UTC

Thanks for keeping us in the loop on what's happening, we appreciate this.
"It is better to be hated for what you are than to be loved for what you are not"
- Andre Gide (1869-1951)
ID: 1050349 · Report as offensive
DJStarfox

Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 1050350 - Posted: 22 Nov 2010, 19:48:26 UTC - in response to Message 1050343.  

Thanks for the update.

Yeah, getting jocelyn the replica and all that working will be a great way to go into the holiday weekend. Oscar can wait.

BTW, it seems the website/forums are fast and snappy compared to a month ago.
ID: 1050350 · Report as offensive
Profile Gary Charpentier Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 25 Dec 00
Posts: 30981
Credit: 53,134,872
RAC: 32
United States
Message 1050353 - Posted: 22 Nov 2010, 19:54:43 UTC

Thanks for the update Matt.

Also, you might want to replace those surge protectors if you had local strikes. There's a good chance they did their job and protected you but gave up their life doing it. As you know, MOVs die with time.

ID: 1050353 · Report as offensive
OzzFan Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 1050364 - Posted: 22 Nov 2010, 20:22:58 UTC - in response to Message 1050343.  

If we don't hear back from any of you guys for the rest of the week, I want to wish everyone at the lab a happy Thanksgiving.
ID: 1050364 · Report as offensive
Profile John Clark
Volunteer tester
Joined: 29 Sep 99
Posts: 16515
Credit: 4,418,829
RAC: 0
United Kingdom
Message 1050366 - Posted: 22 Nov 2010, 20:41:40 UTC

Have a good break, when it comes, and thanks for the update.
It's good to be back amongst friends and colleagues



ID: 1050366 · Report as offensive
Roy Wall (shiny sides)

Joined: 8 Nov 99
Posts: 5
Credit: 5,099,610
RAC: 0
United States
Message 1050389 - Posted: 22 Nov 2010, 21:58:44 UTC - in response to Message 1050343.  

Thanks Matt for the update. Keep up the good work.
ID: 1050389 · Report as offensive
Profile Kibble (KB7TIB)
Joined: 6 Dec 99
Posts: 27
Credit: 10,121,469
RAC: 2
United States
Message 1050394 - Posted: 22 Nov 2010, 22:17:07 UTC - in response to Message 1050343.  

I agree that you guys are doing a superb job, Matt. Having fun with the new toys. :-) And thank you for the update. We are all patiently waiting for the new systems to go live. I'll just continue chewing on Einstein and LHC w/u's here until then.

It might be a good idea to acquire some backup power units rather than simple surge protectors. Modern ones will allow the servers to gracefully shut down from battery power when the mains go out, and let the batteries take the hits from surges.

Regardless, hope your feasting with friends and family goes well.
ID: 1050394 · Report as offensive
Swibby Bear

Joined: 1 Aug 01
Posts: 246
Credit: 7,945,093
RAC: 0
United States
Message 1050446 - Posted: 23 Nov 2010, 2:19:57 UTC - in response to Message 1050394.  

It might be a good idea to acquire some backup power units rather than simple surge protectors. Modern ones will allow the servers to gracefully shut down from battery power when the mains go out, and let the batteries take the hits from surges.


Matt has described over the years that each of the servers is on a heavy-duty UPS backup system.

But any surge protectors are sacrificial as they age.
ID: 1050446 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51477
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1050452 - Posted: 23 Nov 2010, 2:29:19 UTC - in response to Message 1050446.  

It might be a good idea to acquire some backup power units rather than simple surge protectors. Modern ones will allow the servers to gracefully shut down from battery power when the mains go out, and let the batteries take the hits from surges.


Matt has described over the years that each of the servers is on a heavy-duty UPS backup system.

But any surge protectors are sacrificial as they age.

True server grade online UPS systems can be thousands of dollars.....
Not the $100.00 APS rigs that some might buy hoping to shore up their living room PC.
I have a couple of 1500w units that, due to their age, are probably only still good at surge suppression and voltage regulation, because their battery packs are long past their prime.
The lead-acid gel cells used in most backups have a standby life of about 5 years. If you don't replace them at that point, their capacity is much diminished. And they are not real cheap to replace.

The best protection is a true online UPS.....
They convert the AC mains to DC, keep the batteries charged, and continuously convert the DC back to AC to feed to the computers. The rigs never touch the mains. They are a bit less efficient to operate, due to conversion losses, but they are the best at protecting the connected equipment.
And rather expensive.
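
To put rough numbers on the conversion-loss trade-off described above, here is a minimal sketch. The efficiency figures and the 900 W load are illustrative assumptions, not specifications of any particular UPS.

```python
def input_power_w(load_w: float, efficiency: float) -> float:
    """Mains power drawn to deliver load_w at the given efficiency (0 to 1)."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return load_w / efficiency


def loss_w(load_w: float, efficiency: float) -> float:
    """Power lost as heat in the AC -> DC -> AC conversion."""
    return input_power_w(load_w, efficiency) - load_w


# Assumed figures for illustration only.
LOAD_W = 900.0
ONLINE_EFFICIENCY = 0.90   # hypothetical double-conversion (online) unit
STANDBY_EFFICIENCY = 0.97  # hypothetical standby unit

for name, eff in [("online", ONLINE_EFFICIENCY), ("standby", STANDBY_EFFICIENCY)]:
    drawn = input_power_w(LOAD_W, eff)
    lost = loss_w(LOAD_W, eff)
    print(f"{name}: draws {drawn:.0f} W from the mains, loses {lost:.0f} W")
```

The point is only that the online design pays a continuous power cost in exchange for never exposing the load to the mains.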
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1050452 · Report as offensive
Profile lupo

Joined: 29 Aug 10
Posts: 91
Credit: 4,736,407
RAC: 0
United States
Message 1050462 - Posted: 23 Nov 2010, 2:47:05 UTC

So, what kind of time frame do you think until the project is back up? Another few weeks? Another few months? Just curious.

Adam

ID: 1050462 · Report as offensive
kittyman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51477
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1050464 - Posted: 23 Nov 2010, 2:49:04 UTC - in response to Message 1050462.  

So, what kind of time frame do you think until the project is back up? Another few weeks? Another few months? Just curious.

Adam


From Matt's post in tech news....and my own intuition, I might venture another week and a half, given they are probably on holiday for two days this week.

The kitties' best guess. I think they are as anxious to get the show back on the road as anybody else.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1050464 · Report as offensive
cer
Joined: 15 Apr 00
Posts: 3
Credit: 959,601
RAC: 0
United States
Message 1050494 - Posted: 23 Nov 2010, 5:34:01 UTC - in response to Message 1050343.  

...Of course, what's a weekend without an unexpected server crash or two? On Saturday afternoon a major lightning storm swept through the Bay Area. Other projects in the lab (located in the other building) had major power outages. Luckily we were spared a full outage, but apparently a couple of our servers got hung up around this time, perhaps due to some kind of non-zero power fluctuation. The servers were thumper and marvin - each located in different rooms, and on different breakers. It is funny that these two machines are our current two informix servers (thumper holds the SETI@home scientific data, and marvin holds Astropulse)...

... Both systems were on surge protectors so I'm not sure why they were so sensitive - maybe the crashes were random and the timing was coincidental with the storm.

- Matt

First Matt... thank you for taking time to issue these updates. You can't imagine how important they are to the community. Personally, I hardly ever respond, but believe me that's no indication of their value.

What struck me about your post was the closing supposition... One crash with a storm might be random, not two.

Others here have observed that suppression is sometimes sacrificial. I have found this to be true.

I don't know if you regularly do any EMC testing of suppression integrity there, but I encourage your group to do so. From your description, I'd begin with the facility grounding system.

Good luck, and again.... Thank You.
ID: 1050494 · Report as offensive
Profile tullio
Volunteer tester

Joined: 9 Apr 04
Posts: 8797
Credit: 2,930,782
RAC: 1
Italy
Message 1050506 - Posted: 23 Nov 2010, 6:24:04 UTC - in response to Message 1050462.  

Last summer I bought a UPS for 79 euros to protect my Sun workstation from summer blackouts caused by air conditioners, and it worked well. I remember one summer day at Area Research Park in Trieste when the UPSs shut down because of poor air conditioning in their closet and all the Area computers stopped, including that of Nobel laureate Carlo Rubbia, who was building the Elettra synchrotron radiation machine. He was rather upset.
Tullio
ID: 1050506 · Report as offensive
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1050507 - Posted: 23 Nov 2010, 6:24:14 UTC

I have seen in the past that the throw time for a UPS combined with a power supply's hold-up time can be very close to being truly uninterrupted. Sometimes if the right conditions happen, you still end up with a brown-out on the DC side of the power supply. Most times the system will just shut off, but sometimes it will just freeze due to CPU/RAM/chipset forgetting what it was doing due to reduced power, albeit briefly.
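
The timing argument above boils down to one comparison: the load rides through a switchover only if the power supply's hold-up time covers the UPS transfer ("throw") time. A minimal sketch follows; the millisecond figures are made-up examples, not measurements from any real unit.

```python
def rides_through(transfer_ms: float, holdup_ms: float) -> bool:
    """True if the PSU hold-up time covers the UPS transfer gap."""
    return holdup_ms >= transfer_ms


def margin_ms(transfer_ms: float, holdup_ms: float) -> float:
    """Spare hold-up time; a negative value means the DC side browns out."""
    return holdup_ms - transfer_ms


# Hypothetical (transfer, hold-up) pairs in milliseconds.
examples = [
    ("comfortable", 8.0, 16.0),
    ("marginal", 12.0, 12.5),
    ("brown-out risk", 14.0, 10.0),
]

for label, transfer, holdup in examples:
    ok = rides_through(transfer, holdup)
    print(f"{label}: margin {margin_ms(transfer, holdup):+.1f} ms, rides through: {ok}")
```

A small positive margin is the "very close" case: it works until load, component aging, or a slow switchover eats the difference.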

UPS battery packs do in fact become effectively useless after a few years, though I have heard on numerous occasions that discharging the batteries to at least 50% once per month can in some cases double the life of them.

Once your batteries do become useless, depending on how much a new equivalent unit costs, it is very cost-effective to replace the batteries, often several times, before it becomes time to just buy a new unit. I replaced the batteries in my 1500 about three years ago for US$120, when a new 1500 like it was well over $500. Then I brought home two 1400 carcasses from work and got batteries for them for less than $200 total. Batteries are often inexpensive in comparison.
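
As a back-of-the-envelope check on that cost argument, the sketch below counts how many battery swaps fit within the price of one new unit. It uses the US$120 figure from the post and treats $500 as the new-unit price, which is an assumption since the post only says "well over 500". It also ignores that a new unit ships with fresh batteries.

```python
def swaps_within_new_unit_price(battery_cost: float, new_unit_cost: float) -> int:
    """Whole battery replacements whose total cost does not exceed one new unit."""
    if battery_cost <= 0:
        raise ValueError("battery_cost must be positive")
    return int(new_unit_cost // battery_cost)


BATTERY_COST = 120.0   # from the post
NEW_UNIT_COST = 500.0  # assumed lower bound for a comparable new unit

swaps = swaps_within_new_unit_price(BATTERY_COST, NEW_UNIT_COST)
print(f"{swaps} battery swaps cost {swaps * BATTERY_COST:.0f}, "
      f"versus {NEW_UNIT_COST:.0f} for one new unit")
```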
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1050507 · Report as offensive
Profile lupo

Joined: 29 Aug 10
Posts: 91
Credit: 4,736,407
RAC: 0
United States
Message 1050510 - Posted: 23 Nov 2010, 6:29:00 UTC - in response to Message 1050464.  
Last modified: 23 Nov 2010, 6:36:46 UTC

Thanks, KittyMan. Sometimes I find it frustrating wanting to help in their rebuild in a field that I have expertise in. I'm trying hard not to be an armchair quarterback since I do not know all the ins and outs of their current situation. However, when I saw the photos of their server rack I was more than a little shocked. It was hard to believe that they were supporting so many clients in the real world on that setup. I understand that there are financial limitations that make it hard for the seti guys to have the latest and greatest hardware, but a lot can be done with just some common sense and a shoestring budget.

The power issues are a great concern to me. If I were Seti, I would consider co-locating their servers in a Tier 4 data center. A cage big enough to house their equipment would cost very little and all access can be done remotely (unless hardware changes are required). In our setup, my team and I manage over 10K Windows servers remotely in our two Tier 4 data centers. We have two people on site who handle any hardware changes that are required and at least 1 person on site per 8-hour shift in the command center in the event of an emergency. (My team is myself and 3 other Sr. Engineers, 15 system engineers in India, and 4 interns.)

I bet with a little work Seti could get the cage donated and their costs would be practically 0. I would think their highest MRC would be bandwidth charges. (Hell, if I was given the ability to speak as a duly authorized agent on their behalf, I could probably find them the co-location facility and get a cage donated.)

Again, I apologize, and I am not trying to attack anyone's work ethic, but there are times I want to help the project so badly, and not being able to lend my expertise is quite frustrating.

One thing I will recommend: go to a company like upsforless.com and purchase a few online double-conversion UPSs. (Make sure to get the double-conversion type. They are the best and most secure type of UPS available.) I have purchased two of their Liebert UPSs and they are great. (One for my home theater, one for the computers in my office.) They are refurbished units but come with a full warranty and are a hell of a bargain. (I have nothing to do with the company, just pointing out a good value.)
ID: 1050510 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 1050517 - Posted: 23 Nov 2010, 6:55:14 UTC - in response to Message 1050510.  
Last modified: 23 Nov 2010, 6:56:17 UTC

I bet with a little work Seti could get the cage donated and their costs would be practically 0. I would think their highest MRC would be bandwidth charges. (Hell, if I was given the ability to speak as a duly authorized agent on their behalf, I could probably find them the co-location facility and get a cage donated.)

Okay, let's assume that for $0, SETI could get space in a nice data center.

They'll still need to pay for bandwidth between the servers (the data center) and the users.

Then we have the "tapes" from Arecibo, which are shipped from Puerto Rico, and have to be mounted and copied to the servers to be split.

That's bandwidth from Campus to the Data Center, probably equal to what they currently have (and have to pay for) -- and you need that bandwidth to bring the completed work back.

Doubling the monthly bandwidth expense may not turn out to be "help" -- and that's why a data center may not be as good an idea as it might seem.
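
The doubling argument can be sketched with the current bill normalized to one unit. The assumption, taken from the reasoning above, is that the campus-to-data-center link needs roughly the same capacity as the existing link to users; the actual prices are not known.

```python
def monthly_bandwidth_cost(user_link: float, campus_link: float = 0.0) -> float:
    """Total monthly bandwidth bill: the link to users plus any campus link."""
    return user_link + campus_link


CURRENT_BILL = 1.0  # today's bandwidth bill, normalized to one unit (assumption)

on_campus = monthly_bandwidth_cost(user_link=CURRENT_BILL)
colocated = monthly_bandwidth_cost(user_link=CURRENT_BILL, campus_link=CURRENT_BILL)

print(f"on campus: {on_campus:.1f}x, co-located: {colocated:.1f}x")
```

Even with the cage itself at $0, the recurring bill in this model is twice today's.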
ID: 1050517 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13847
Credit: 208,696,464
RAC: 304
Australia
Message 1050531 - Posted: 23 Nov 2010, 8:47:52 UTC - in response to Message 1050507.  

...though I have heard on numerous occasions that discharging the batteries to at least 50% once per month can in some cases double the life of them.

Nope.
Heat tends to be the biggest killer of Lead Acid batteries.
Here in Darwin, if you get 2 years out of a car battery, that's pretty good going. When I lived down south (much further down south) 10 years wasn't unusual.

When a lead acid battery voltage drops to 10V, it's as good as dead. Deep cycle batteries can handle such a deep state of discharge, but not often or regularly.
Grant
Darwin NT
ID: 1050531 · Report as offensive
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.