Happy Lupercalia! (Feb 14 2011)
Author | Message |
---|---|
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
Slow, steady progress... We're hoping to have everything copied from gowron onto thumper by tomorrow. Yeah, I know it's going slowly, but there's lots of bottlenecks (degraded RAID, NFS, tons of small files as opposed to a few big ones). After the usual outage we might actually have thumper ready to be the temporary workunit storage server so we can get back to business while doing the necessary upgrades on gowron (which may take as much as a week, unobtrusively running in the background). That new-ish server synergy rebooted itself on Sunday. This concerned me as this has happened a couple times already. However, I discovered the three reboots thus far all happened on Sunday at 3pm, and two weeks apart from each other. There are no smoking-gun cronjobs, but it is plugged into an old UPS of unknown quality, so we're going to remove that from the equation and watch what happens. The reboots have all been harmless thus far. Somebody somewhere on these forums asked what our server makeup was. It certainly isn't limited to what's on the server status page. If you just count the unix-based machines, there are currently 26 systems all told. Combining all the stuff inside, we have roughly 100 CPUs, 500GB RAM, and 150 TB raw storage. There are also several appliances (routers, switches, UPSes, kvms, remote controlled power strips, etc. etc.). Usually in these threads I'm griping about public facing servers, or ones causing the BOINC back end to jam up for one reason or another. I rarely mention the mundane, day-to-day, garden variety IT stuff. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
kittyman Joined: 9 Jul 00 Posts: 51478 Credit: 1,018,363,574 RAC: 1,004 |
Thank you for the news, Matt. Many have been waiting with much anticipation, and some even with a certain degree of patience....LOL. Would be great if you can get back up and running tomorrow while you continue the tedious task of Gowron repairs. Meow! "Time is simply the mechanism that keeps everything from happening all at once." |
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
"It would be nice to see a list of them and what they do." For security reasons, I tend to only name and define the systems that are already public facing, or otherwise already known. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
Saaby900T Joined: 24 Dec 10 Posts: 76 Credit: 4,971,171 RAC: 0 |
Thank You For the Update!!!!!! Plz let us Know if you Need some new parts!!! (I'd Be willing to donate Some $$$) to Help out with parts. If I knew what was needed. |
Jeff Mercer Joined: 14 Aug 08 Posts: 90 Credit: 162,139 RAC: 0 |
Hi Matt, and THANKS for the update. Glad to hear that things are looking up, and that things might be up and running by tomorrow. As usual, I'll be here waiting. I might hold off a day or two until the "TRAFFIC" dies down a little bit though. I have a feeling that the servers will be swamped for a while. Thank you for all your hard work. |
Jeff Mercer Joined: 14 Aug 08 Posts: 90 Credit: 162,139 RAC: 0 |
"Thank You For the Update!!!!!!" GOOD IDEA! Later on, when and IF you have time, maybe you could tell us what all is needed, and the cost of what is needed. Sure wouldn't hurt to let us know about it. |
Dimly Lit Lightbulb 😀 Joined: 30 Aug 08 Posts: 15399 Credit: 7,423,413 RAC: 1 |
"If you just count the unix-based machines, there are currently 26 systems all told. Combining all the stuff inside, we have roughly 100 CPUs, 500GB RAM, and 150 TB raw storage. There are also several appliances (routers, switches, UPSes, kvms, remote controlled power strips, etc. etc.)." Wow, that's a fair bit of equipment. Thanks for all the work we DON'T hear about that you do. |
Dirk Sadowski Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
Matt, thanks for the news! |
KWSN THE Holy Hand Grenade! Joined: 20 Dec 05 Posts: 3187 Credit: 57,163,290 RAC: 0 |
Maybe you should implement the slow ramp-up that was the rule back during the three-day outage era... Oh, and don't forget to bring up beta sometime soon, once production gets running nicely... . Hello, from Albany, CA!... |
System 3 Lab Joined: 29 Apr 08 Posts: 9 Credit: 3,319,213 RAC: 0 |
Just an FYI. I was up in Oakland during the last big one during the World Series. Anyway, all our equipment went down even though we had UPS's on everything. What caused it? Seems that the batteries in the UPS's are not eternal and have to be replaced like every 3 - 5 years! Could there be any relation to what is causing your issues? |
Douglas Davidson Joined: 26 Sep 05 Posts: 2 Credit: 1,586,580 RAC: 1 |
Thank you for all the hard work. I still have some W.U.'s left to do, so will wait. |
-BeNt- Joined: 17 Oct 99 Posts: 1234 Credit: 10,116,112 RAC: 0 |
"Usually in these threads I'm griping about public facing servers, or ones causing the BOINC back end to jam up for one reason or another. I rarely mention the mundane, day-to-day, garden variety IT stuff." Well, hopefully, Matt, sometime in the near future we can hear more about your mundane, day-to-day operations and less about the major issues. Growing pains are difficult, and mixing them with older hardware like you are dealing with can be... less than perfect, to put it mildly. Best of luck, gentlemen, and hopefully the IT gods will bless you and the gremlins let you stomp them for a change. Traveling through space at ~67,000mph! |
Big Bang Joined: 7 Jan 10 Posts: 670 Credit: 28,481 RAC: 0 |
Thanks for the update Matt. Trust you had a good gig the other night to re-align the neurons. Always good and appreciative thoughts toward you and your colleagues. Cheers mate. |
KWSN THE Holy Hand Grenade! Joined: 20 Dec 05 Posts: 3187 Credit: 57,163,290 RAC: 0 |
"Just an FYI. I was up in Oakland during the last big one during the World Series. Anyway, all our equipment went down even though we had UPS's on everything. What caused it? Seems that the batteries in the UPS's are not eternal and have to be replaced like every 3 - 5 years! Could there be any relation to what is causing your issues?" Are you talking about the Earthquake series in '89 or last year? The routinely spaced hits at the same wall-clock time two weeks apart that Matt is talking about don't seem to be battery-related to me... (assuming, of course, that Matt has checked the possibility that someone [who comes in alternate weeks...] is routinely turning off the circuit that the UPS in question is plugged into...) . Hello, from Albany, CA!... |
John McLeod VII Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0 |
"Just an FYI. I was up in Oakland during the last big one during the World Series. Anyway, all our equipment went down even though we had UPS's on everything. What caused it? Seems that the batteries in the UPS's are not eternal and have to be replaced like every 3 - 5 years! Could there be any relation to what is causing your issues?" The cleaning crew unplugging something to crank up their vacuum cleaners? BOINC WIKI |
David S Joined: 4 Oct 99 Posts: 18352 Credit: 27,761,924 RAC: 12 |
"The routinely spaced hits at the same wall-clock time two weeks apart that Matt is talking about don't seem to be battery-related to me... (assuming, of course, that Matt has checked the possibility that someone [who comes in alternate weeks...] is routinely turning off the circuit that the UPS in question is plugged into...)" In the school where I work it used to be standard procedure to turn off the power to all the computers in the computer lab every night, via 4 master switches in the lab director's office (now they keep them on 24/7). One night, I saw the custodian turn on the power so he could plug in the vacuum... which of course turned on all the computers. When he was done, he turned them all off again without doing a proper shutdown of Windows. At the time, they were, probably, Pentium 3s running Windows NT. It didn't seem to do them any harm, though (except maybe an unquantifiable shortening of their lives, but they were retired long before that became an issue). David David Sitting on my butt while others boldly go, Waiting for a message from a small furry creature from Alpha Centauri. |
KWSN THE Holy Hand Grenade! Joined: 20 Dec 05 Posts: 3187 Credit: 57,163,290 RAC: 0 |
Regularly, at the same time, every other weekend? Doesn't seem likely... The SSL doesn't have carpeted floors, so substitute "floor polisher(s)" for "vacuum cleaner"... ;-) . Hello, from Albany, CA!... |
Cheopis Joined: 17 Sep 00 Posts: 156 Credit: 18,451,329 RAC: 0 |
How tightly timed are the failures? Within seconds, minutes, or multiple minutes separation in time? Seems likely that the UPS is the culprit, but if not, the grouping of the timing might help determine the likelihood of different types of issues. |
blueone Joined: 18 Sep 00 Posts: 7 Credit: 250,657 RAC: 0 |
I think the timing of the reboots is kinda suspicious, and those kitties of Kittieman look kinda suspicious hmmm... It's the cats they must be aliens in disguise and Kittieman is the overlord they are holding the data back cuz they know the next wow signal is ready to be crunched.... :) Izzit |
David S Joined: 4 Oct 99 Posts: 18352 Credit: 27,761,924 RAC: 12 |
"I think the timing of the reboots is kinda suspicious, and those kitties of Kittieman look kinda suspicious hmmm... It's the cats they must be aliens in disguise and Kittieman is the overlord they are holding the data back cuz they know the next wow signal is ready to be crunched...." The alien from that stupid '70s movie with Sandy Duncan had kittens! Several generations by now... They're everywhere! It's war between them and the Geico squirrels! lol David David Sitting on my butt while others boldly go, Waiting for a message from a small furry creature from Alpha Centauri. |
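
On the question raised in the thread about how tightly the synergy reboots are grouped: a rough way to check is to take the reboot times from the system logs and compare their weekday, hour, and spacing. The sketch below is only an illustration. The three timestamps are made-up placeholders chosen to match the "Sunday, 3pm, two weeks apart" description, not values from the actual server, and `reboot_pattern` is a hypothetical helper name.

```python
from datetime import datetime, timedelta

def reboot_pattern(timestamps):
    """Summarize wall-clock alignment and spacing of a list of reboot times."""
    ts = sorted(timestamps)
    # Gap between each reboot and the next one
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    return {
        # datetime.weekday(): Monday is 0, Sunday is 6
        "weekdays": sorted({t.weekday() for t in ts}),
        "hours": sorted({t.hour for t in ts}),
        "gaps": gaps,
        # Difference between the longest and shortest gap
        "max_jitter": max(gaps) - min(gaps) if gaps else timedelta(0),
    }

# Placeholder timestamps (assumed for illustration, NOT real log data)
reboots = [
    datetime(2011, 1, 16, 15, 0),
    datetime(2011, 1, 30, 15, 2),
    datetime(2011, 2, 13, 15, 1),
]

summary = reboot_pattern(reboots)
print(summary["weekdays"], summary["hours"], summary["max_jitter"])
```

If the jitter comes out at a few seconds, a scheduled job or timer looks more plausible; if it wanders by many minutes, something external (power, a person, a failing UPS) would be the likelier suspect.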