Stormy (Nov 22 2010)
Author | Message |
---|---|
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
I'll write today's message early as this week is a short holiday week so we're kinda busy. First and foremost, carolyn is now the *only* mysql replica - I just turned the other replica (the troublesome server mork) off, perhaps for good. Yay! That's one of the two new servers more or less ready for prime time, though we still hope to make carolyn the master (and jocelyn the replica) today or tomorrow. We're still far from getting the whole project back on line - we have the other new server, oscar, installed and ready to roll, but still need to (a) install and configure informix on it, (b) clean up the science database on thumper, and then (c) transfer all the data from thumper to oscar. This may take a while - the spike merge (which was the last major part of the "clean up") did finally complete last week (after running about 2-3 months) but there was still a discrepancy of about a million missing spikes which Jeff is successfully tracking down. So there are a few extra merges to do yet. We probably won't really dig into getting oscar on line until after Thanksgiving. Of course, what's a weekend without an unexpected server crash or two? On Saturday afternoon a major lightning storm swept through the Bay Area. Other projects in the lab (located in the other building) had major power outages. Luckily we were spared a full outage, but apparently a couple of our servers got hung up around this time, perhaps due to some kind of non-zero power fluctuation. The servers were thumper and marvin - each located in different rooms, and on different breakers. It is funny that these two machines are our current two informix servers (thumper holds the SETI@home scientific data, and marvin holds Astropulse). So there was some cleanup to deal with this morning (database/filesystem recovery, hung mounts, etc.) but really no big shakes and we're back to normal (whatever normal is these days). 
Both systems were on surge protectors so I'm not sure why they were so sensitive - maybe the crashes were random and the timing was coincidental with the storm. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
Bill G Joined: 1 Jun 01 Posts: 1282 Credit: 187,688,550 RAC: 182 |
Great and thanks for the info. SETI@home classic workunits 4,019 SETI@home classic CPU time 34,348 hours |
perryjay Joined: 20 Aug 02 Posts: 3377 Credit: 20,676,751 RAC: 0 |
Thanks for the update, Matt. Take your time, we will be here when you're ready to turn it back on. Maybe soon these good news/bad news messages will turn into only good news for a long time to come. PROUD MEMBER OF Team Starfire World BOINC |
SMW Joined: 16 May 99 Posts: 22 Credit: 29,285,238 RAC: 16 |
Thanks for keeping us in the loop on what's happening; we appreciate this. "It is better to be hated for what you are than to be loved for what you are not" - Andre Gide (1869-1951) |
DJStarfox Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2 |
Thanks for the update. Yeah, getting jocelyn the replica and all that working will be a great way to go into the holiday weekend. Oscar can wait. BTW, it seems the website/forums are fast and snappy compared to a month ago. |
Gary Charpentier Joined: 25 Dec 00 Posts: 30981 Credit: 53,134,872 RAC: 32 |
Thanks for the update, Matt. Also, you might want to replace those surge protectors if you had local strikes. Good chance they did their thing and protected you but lost their life doing it. As you know, MOVs die with time. |
OzzFan Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28 |
If we don't hear back from any of you guys for the rest of the week, I want to wish everyone at the lab a happy Thanksgiving. |
John Clark Joined: 29 Sep 99 Posts: 16515 Credit: 4,418,829 RAC: 0 |
Have a good break, when it comes, and thanks for the update. It's good to be back amongst friends and colleagues |
Roy Wall (shiny sides) Joined: 8 Nov 99 Posts: 5 Credit: 5,099,610 RAC: 0 |
Thanks Matt for the update. Keep up the good work. |
Kibble (KB7TIB) Joined: 6 Dec 99 Posts: 27 Credit: 10,121,469 RAC: 2 |
I agree that you guys are doing a superb job, Matt. Having fun with the new toys. :-) And thank you for the update. We are all patiently waiting for the new systems to go live. I'll just continue chewing on Einstein and LHC w/u's here until then. It might be a good idea to acquire some backup power units rather than simple surge protectors. Modern ones will allow the servers to gracefully shut down from battery power when the mains go out, and let the batteries take the hits from surges. Regardless, hope your feasting with friends and family goes well. |
Swibby Bear Joined: 1 Aug 01 Posts: 246 Credit: 7,945,093 RAC: 0 |
It might be a good idea to acquire some backup power units rather than simple surge protectors. Modern ones will allow the servers to gracefully shut down from battery power when the mains go out, and let the batteries take the hits from surges. Matt has described over the years that the servers are each on heavy-duty UPS backup systems. But any surge protectors are sacrificial as they age. |
kittyman Joined: 9 Jul 00 Posts: 51477 Credit: 1,018,363,574 RAC: 1,004 |
It might be a good idea to acquire some backup power units rather than simple surge protectors. Modern ones will allow the servers to gracefully shut down from battery power when the mains go out, and let the batteries take the hits from surges. True server-grade online UPS systems can be thousands of dollars..... Not the $100.00 APS rigs that some might buy hoping to shore up their living room PC. I have a couple of 1500w units that, due to their age, are probably only still good at surge suppression and voltage regulation, because their battery packs are long past their prime. The lead-acid gel cells used in most backups have a standby life of about 5 years. If you don't replace them at that point, their capacity is much diminished. And they are not real cheap to replace. The best protection is a true online UPS..... They convert the AC mains to DC, keep the batteries charged, and continuously convert the DC back to AC to feed to the computers. The rigs never touch the mains. They are a bit less efficient to operate, due to conversion losses, but they are the best at protecting the connected equipment. And rather expensive. "Time is simply the mechanism that keeps everything from happening all at once." |
lupo Joined: 29 Aug 10 Posts: 91 Credit: 4,736,407 RAC: 0 |
So, what kind of time frame do you think until the project is back up? Another few weeks? Another few months? Just curious. Adam |
kittyman Joined: 9 Jul 00 Posts: 51477 Credit: 1,018,363,574 RAC: 1,004 |
So, what kind of time frame do you think until the project is back up? Another few weeks? Another few months? Just curious. From Matt's post in tech news....and my own intuition, I might venture another week and a half, given they are probably on holiday for two days this week. The kitties' best guess. I think they are as anxious to get the show back on the road as anybody else. "Time is simply the mechanism that keeps everything from happening all at once." |
cer Joined: 15 Apr 00 Posts: 3 Credit: 959,601 RAC: 0 |
...Of course, what's a weekend without an unexpected server crash or two? On Saturday afternoon a major lightning storm swept through the Bay Area. Other projects in the lab (located in the other building) had major power outages. Luckily we were spared a full outage, but apparently a couple of our servers got hung up around this time, perhaps due to some kind of non-zero power fluctuation. The servers were thumper and marvin - each located in different rooms, and on different breakers. It is funny that these two machines are our current two informix servers (thumper holds the SETI@home scientific data, and marvin holds Astropulse)... First Matt... thank you for taking time to issue these updates. You can't imagine how important they are to the community. Personally, I hardly ever respond, but believe me that's no indication of their value. What struck me about your post, was the closing supposition... One crash with a storm might be random, not two. Others here have observed that suppression is sometimes sacrificial. I have found this to be true. I don't know if you regularly do any EMC testing of suppression integrity there, but I encourage your group to do so. From your description, I'd begin with the facility grounding system. Good luck, and again.... Thank You. |
tullio Joined: 9 Apr 04 Posts: 8797 Credit: 2,930,782 RAC: 1 |
I bought a UPS last summer for 79 euros to protect my SUN workstation from summer blackouts caused by air conditioners, and it worked well. I remember one summer day at Area Research Park in Trieste when the UPSs shut down because of poor air conditioning in their closet and all Area computers were stopped, including that of Nobelist Carlo Rubbia, who was building the Elettra synchrotron radiation machine. He was rather upset. Tullio |
Cosmic_Ocean Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13 |
I have seen in the past that the throw time for a UPS combined with a power supply's hold-up time can be very close to being truly uninterrupted. Sometimes if the right conditions happen, you still end up with a brown-out on the DC side of the power supply. Most times the system will just shut off, but sometimes it will just freeze due to CPU/RAM/chipset forgetting what it was doing due to reduced power, albeit briefly. UPS battery packs do in fact become effectively useless after a few years, though I have heard on numerous occasions that discharging the batteries to at least 50% once per month can in some cases double the life of them. Once your batteries do become useless, depending on how much a new equivalent unit is, it is very cost-effective to replace the batteries, often times several times before it becomes time to just buy a new unit. I replaced the batteries in my 1500 about three years ago for US$120, when a new 1500 like it was well over 500. Then I brought home two 1400 carcasses from work and got batteries for them for less than 200 total. Batteries are inexpensive in comparison a lot of times. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving-up) |
lupo Joined: 29 Aug 10 Posts: 91 Credit: 4,736,407 RAC: 0 |
Thanks KittyMan. Sometimes I find it frustrating wanting to help in their rebuild in a field that I have expertise in. I'm trying hard not to be an armchair quarterback since I do not know all the ins and outs of their current situation. However, when I saw the photos of their server rack I was more than a little shocked. It was hard to believe that they were supporting so many clients in the real world on that setup. I understand that there are financial limitations that make it hard for the seti guys to have the latest and greatest hardware, but a lot can be done with just some common sense and a shoestring budget. The power issues are a great concern to me. If I were Seti, I would consider co-locating their servers in a Tier 4 data center. A cage big enough to house their equipment would cost very little and all access can be done remotely (unless hardware changes are required). In our setup, my team and I manage over 10K Windows servers remotely in our two Tier 4 data centers. We have two people on site who handle any hardware changes that are required and at least 1 person on site per 8-hour shift in the command center in the event of an emergency. (My team is myself and 3 other Sr. Engineers, 15 system engineers in India, and 4 interns.) I bet with a little work Seti could get the cage donated and their costs would be practically 0. I would think their highest MRC would be bandwidth charges. (Hell, if I was given the ability to speak as a duly authorized agent on their behalf, I could probably find them the co-location facility and get a cage donated.) Again, I apologize, and I am not trying to attack anyone's work ethic, but there are times I want to help the project so badly, and not being able to lend my expertise is quite frustrating. One thing I will recommend: go to a company like upsforless.com and purchase a few Online Double Conversion UPSs. (Make sure to get the Double Conversion UPSs. They are the best and most secure type of UPS available.) 
I have purchased two of their Liebert UPSs and they are great. (One for my home theater, one for the computers in my office.) They are refurbished units but come with a full warranty and are a hell of a bargain. (I have nothing to do with the company, just pointing out a good value.) |
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
I bet with a little work Seti could get the cage donated and their costs would be practically 0. I would think their highest MRC would be bandwidth charges. (Hell, if I was given the ability to speak as a duly authorized agent on their behalf, I could probably find them the co-location facility and get a cage donated.) Okay, let's assume that for $0, SETI could get space in a nice data center. They'll still need to pay for bandwidth between the servers (the data center) and the users. Then we have the "tapes" from Arecibo, which are shipped from Puerto Rico, and have to be mounted and copied to the servers to be split. That's bandwidth from Campus to the Data Center, probably equal to what they currently have (and have to pay for) -- and you need that bandwidth to bring the completed work back. Doubling the monthly bandwidth expense may not turn out to be "help" -- and that's why a data center may not be as good an idea as it might seem. |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13847 Credit: 208,696,464 RAC: 304 |
...though I have heard on numerous occasions that discharging the batteries to at least 50% once per month can in some cases double the life of them. Nope. Heat tends to be the biggest killer of lead-acid batteries. Here in Darwin, if you get 2 years out of a car battery, that's pretty good going. When I lived down south (much further down south), 10 years wasn't unusual. When a lead-acid battery's voltage drops to 10V, it's as good as dead. Deep-cycle batteries can handle such a deep state of discharge, but not often or regularly. Grant Darwin NT |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.