Happy Lupercalia! (Feb 14 2011)

Message boards : Technical News : Happy Lupercalia! (Feb 14 2011)

Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 1077328 - Posted: 14 Feb 2011, 22:27:06 UTC

Slow, steady progress... We're hoping to have everything copied from gowron onto thumper by tomorrow. Yeah, I know it's going slowly, but there are lots of bottlenecks (degraded RAID, NFS, tons of small files as opposed to a few big ones). After the usual outage we might actually have thumper ready to be the temporary workunit storage server so we can get back to business while doing the necessary upgrades on gowron (which may take as much as a week, unobtrusively running in the background).
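
To see why tons of small files copy more slowly than a few big ones, here is a rough back-of-the-envelope sketch. It assumes a simple model: a fixed cost per file (metadata lookups, NFS round trips) plus the time to stream the bytes. The function name and every number below are hypothetical, picked only for illustration; none of them are measurements from gowron or thumper.

```python
def estimate_copy_seconds(num_files, total_bytes, per_file_overhead_s, bandwidth_bytes_per_s):
    """Rough copy-time model: fixed cost per file plus streaming time for the data."""
    return num_files * per_file_overhead_s + total_bytes / bandwidth_bytes_per_s


# Hypothetical numbers, chosen only for illustration.
TOTAL_BYTES = 10**12      # 1 TB of data in both scenarios
BANDWIDTH = 50 * 10**6    # 50 MB/s sustained throughput (assumed)
OVERHEAD = 0.01           # 10 ms of per-file overhead (assumed)

few_big = estimate_copy_seconds(1_000, TOTAL_BYTES, OVERHEAD, BANDWIDTH)
many_small = estimate_copy_seconds(10_000_000, TOTAL_BYTES, OVERHEAD, BANDWIDTH)

print(f"1,000 big files:        {few_big / 3600:.1f} hours")
print(f"10,000,000 small files: {many_small / 3600:.1f} hours")
```

With the same amount of data, the per-file term dominates once the file count gets large, which is the "small files" bottleneck in a nutshell. A degraded RAID would show up in this model as a lower effective bandwidth.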

That new-ish server synergy rebooted itself on Sunday. This concerned me as this has happened a couple times already. However, I discovered the three reboots thus far all happened on Sunday at 3pm, and two weeks apart from each other. There are no smoking-gun cronjobs, but it is plugged into an old UPS of unknown quality, so we're going to remove that from the equation and watch what happens. The reboots have all been harmless thus far.
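
A quick way to check a set of reboot times for this kind of pattern is sketched below. The helper name `describe_pattern` is made up for the example, and the exact dates are an assumption (three Sundays at 3pm, two weeks apart, ending the day before this post); real timestamps would come from the system logs.

```python
from datetime import datetime, timedelta


def describe_pattern(timestamps):
    """Return (same_weekday, same_hour, intervals) for a list of datetimes."""
    ordered = sorted(timestamps)
    same_weekday = len({t.weekday() for t in ordered}) == 1
    same_hour = len({t.hour for t in ordered}) == 1
    # Gaps between consecutive events, oldest first.
    intervals = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return same_weekday, same_hour, intervals


# Illustrative timestamps only: the exact dates are assumed, not taken from logs.
reboots = [
    datetime(2011, 1, 16, 15, 0),
    datetime(2011, 1, 30, 15, 0),
    datetime(2011, 2, 13, 15, 0),
]

same_weekday, same_hour, intervals = describe_pattern(reboots)
is_fortnightly = all(gap == timedelta(days=14) for gap in intervals)
print(same_weekday, same_hour, is_fortnightly)
```

If all three flags come back true, something on a schedule (a timer, a self-test, a person with a routine) is a more likely suspect than a random hardware fault.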

Somebody somewhere on these forums asked what our server makeup was. It certainly isn't limited to what's on the server status page. If you just count the unix-based machines, there are currently 26 systems all told. Combining all the stuff inside, we have roughly 100 CPUs, 500GB RAM, and 150 TB raw storage. There are also several appliances (routers, switches, UPSes, kvms, remote controlled power strips, etc. etc.). Usually in these threads I'm griping about public facing servers, or ones causing the BOINC back end to jam up for one reason or another. I rarely mention the mundane, day-to-day, garden variety IT stuff.

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 1077328 · Report as offensive
kittyman (Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1077332 - Posted: 14 Feb 2011, 22:33:43 UTC

Thank you for the news, Matt.
Many have been waiting with much anticipation, and some even with a certain degree of patience....LOL.
Would be great if you can get back up and running tomorrow while you continue the tedious task of Gowron repairs.

Meow!
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1077332 · Report as offensive
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 1077349 - Posted: 14 Feb 2011, 23:06:18 UTC - in response to Message 1077334.  

It would be nice to see a list of them and what they do.

For security reasons, I tend to only name and define the systems that are already public facing, or otherwise already known.

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 1077349 · Report as offensive
Saaby900T

Joined: 24 Dec 10
Posts: 76
Credit: 4,971,171
RAC: 0
United States
Message 1077351 - Posted: 14 Feb 2011, 23:09:26 UTC

Thank you for the update!

Please let us know if you need some new parts! I'd be willing to donate some $$$ to help out with parts, if I knew what was needed.
ID: 1077351 · Report as offensive
Profile Jeff Mercer

Joined: 14 Aug 08
Posts: 90
Credit: 162,139
RAC: 0
United States
Message 1077354 - Posted: 14 Feb 2011, 23:20:17 UTC

Hi Matt, and THANKS for the update. Glad to hear that things are looking up, and that things might be up and running by tomorrow. As usual, I'll be here waiting. I might hold off a day or two until the "TRAFFIC" dies down a little bit, though. I have a feeling that the servers will be swamped for a while.
Thank you for all your hard work.
ID: 1077354 · Report as offensive
Profile Jeff Mercer

Joined: 14 Aug 08
Posts: 90
Credit: 162,139
RAC: 0
United States
Message 1077356 - Posted: 14 Feb 2011, 23:26:33 UTC - in response to Message 1077351.  

Thank you for the update!

Please let us know if you need some new parts! I'd be willing to donate some $$$ to help out with parts, if I knew what was needed.



GOOD IDEA! Later on, when and IF you have time, maybe you could tell us what is needed and what it would cost. Sure wouldn't hurt to let us know about it.
ID: 1077356 · Report as offensive
Profile Dimly Lit Lightbulb 😀
Volunteer tester
Joined: 30 Aug 08
Posts: 15399
Credit: 7,423,413
RAC: 1
United Kingdom
Message 1077363 - Posted: 14 Feb 2011, 23:46:37 UTC - in response to Message 1077328.  

If you just count the unix-based machines, there are currently 26 systems all told. Combining all the stuff inside, we have roughly 100 CPUs, 500GB RAM, and 150 TB raw storage. There are also several appliances (routers, switches, UPSes, kvms, remote controlled power strips, etc. etc.).

Wow, that's a fair bit of equipment. Thanks for all the work we DON'T hear about that you do.
ID: 1077363 · Report as offensive
Profile Dirk Sadowski
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1077405 - Posted: 15 Feb 2011, 3:00:08 UTC - in response to Message 1077328.  

Matt, thanks for the news!

ID: 1077405 · Report as offensive
Profile KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 1077406 - Posted: 15 Feb 2011, 3:00:51 UTC
Last modified: 15 Feb 2011, 3:02:01 UTC

Maybe you should implement the slow ramp-up that was the rule back during the three-day outage era...

Oh, and don't forget to bring up beta sometime soon, once production gets running nicely...
.

Hello, from Albany, CA!...
ID: 1077406 · Report as offensive
System 3 Lab

Joined: 29 Apr 08
Posts: 9
Credit: 3,319,213
RAC: 0
United States
Message 1077407 - Posted: 15 Feb 2011, 3:01:11 UTC

Just an FYI. I was up in Oakland during the last big one during the World Series. Anyway, all our equipment went down even though we had UPS's on everything. What caused it? Seems that the batteries in the UPS's are not eternal and have to be replaced like every 3 - 5 years! Could there be any relation to what is causing your issues?
ID: 1077407 · Report as offensive
Douglas Davidson

Joined: 26 Sep 05
Posts: 2
Credit: 1,586,580
RAC: 1
United Kingdom
Message 1077411 - Posted: 15 Feb 2011, 3:27:25 UTC

Thank you for all the hard work. I still have some W.U.'s left to do, so will wait.
ID: 1077411 · Report as offensive
-BeNt-
Joined: 17 Oct 99
Posts: 1234
Credit: 10,116,112
RAC: 0
United States
Message 1077436 - Posted: 15 Feb 2011, 6:42:50 UTC - in response to Message 1077328.  

Usually in these threads I'm griping about public facing servers, or ones causing the BOINC back end to jam up for one reason or another. I rarely mention the mundane, day-to-day, garden variety IT stuff.

- Matt


Well, Matt, hopefully sometime in the immediate future we can hear more about your mundane, day-to-day operations and less about the major issues. Growing pains are difficult, and mixing them with older hardware like you are dealing with can be... less than perfect, to put it mildly. Best of luck, gentlemen, and hopefully the IT gods will bless you and the gremlins let you stomp them for a change.

Traveling through space at ~67,000mph!
ID: 1077436 · Report as offensive
Big Bang

Joined: 7 Jan 10
Posts: 670
Credit: 28,481
RAC: 0
Message 1077450 - Posted: 15 Feb 2011, 8:29:27 UTC

Thanks for the update Matt. Trust you had a good gig the other night to re-align the neurons. Always good and appreciative thoughts toward you and your colleagues. Cheers mate.
ID: 1077450 · Report as offensive
Profile KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 1077758 - Posted: 16 Feb 2011, 2:48:49 UTC - in response to Message 1077407.  
Last modified: 16 Feb 2011, 2:54:07 UTC

Just an FYI. I was up in Oakland during the last big one during the World Series. Anyway, all our equipment went down even though we had UPS's on everything. What caused it? Seems that the batteries in the UPS's are not eternal and have to be replaced like every 3 - 5 years! Could there be any relation to what is causing your issues?


Are you talking the Earthquake series in '89 or last year?

The routinely spaced hits at the same wall-clock time two weeks apart that Matt is talking about don't seem to be battery-related to me... (assuming, of course, that Matt has checked the possibility that someone [who comes in alternate weeks...] is routinely turning off the circuit that the UPS in question is plugged into...)
.

Hello, from Albany, CA!...
ID: 1077758 · Report as offensive
John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24806
Credit: 790,712
RAC: 0
United States
Message 1077769 - Posted: 16 Feb 2011, 3:40:39 UTC - in response to Message 1077758.  

Just an FYI. I was up in Oakland during the last big one during the World Series. Anyway, all our equipment went down even though we had UPS's on everything. What caused it? Seems that the batteries in the UPS's are not eternal and have to be replaced like every 3 - 5 years! Could there be any relation to what is causing your issues?


Are you talking the Earthquake series in '89 or last year?

The routinely spaced hits at the same wall-clock time two weeks apart that Matt is talking about don't seem to be battery-related to me... (assuming, of course, that Matt has checked the possibility that someone [who comes in alternate weeks...] is routinely turning off the circuit that the UPS in question is plugged into...)

The cleaning crew unplugging something to crank up their vacuum cleaners?


BOINC WIKI
ID: 1077769 · Report as offensive
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1077921 - Posted: 16 Feb 2011, 15:45:25 UTC - in response to Message 1077769.  

The routinely spaced hits at the same wall-clock time two weeks apart that Matt is talking about don't seem to be battery-related to me... (assuming, of course, that Matt has checked the possibility that someone [who comes in alternate weeks...] is routinely turning off the circuit that the UPS in question is plugged into...)

The cleaning crew unplugging something to crank up their vacuum cleaners?

In the school where I work it used to be standard procedure to turn off the power to all the computers in the computer lab every night, via 4 master switches in the lab director's office (now they keep them on 24/7). One night, I saw the custodian turn on the power so he could plug in the vacuum... which of course turned on all the computers. When he was done, he turned them all off again without doing a proper shutdown of Windows. At the time, they were, probably, Pentium 3s running Windows NT. It didn't seem to do them any harm, though (except maybe an unquantifiable shortening of their lives, but they were retired long before that became an issue).

David
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1077921 · Report as offensive
Profile KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 3187
Credit: 57,163,290
RAC: 0
United States
Message 1077952 - Posted: 16 Feb 2011, 17:50:34 UTC - in response to Message 1077769.  
Last modified: 16 Feb 2011, 17:53:30 UTC



The routinely spaced hits at the same wall-clock time two weeks apart that Matt is talking about don't seem to be battery-related to me... (assuming, of course, that Matt has checked the possibility that someone [who comes in alternate weeks...] is routinely turning off the circuit that the UPS in question is plugged into...)


The cleaning crew unplugging something to crank up their vacuum cleaners?


Regularly, at the same time, every other weekend? Doesn't seem likely...

The SSL doesn't have carpeted floors, so substitute "floor polisher(s)" for "vacuum cleaner"... ;-)
.

Hello, from Albany, CA!...
ID: 1077952 · Report as offensive
Cheopis

Joined: 17 Sep 00
Posts: 156
Credit: 18,451,329
RAC: 0
United States
Message 1078087 - Posted: 16 Feb 2011, 22:11:12 UTC

How tightly timed are the failures?

Within seconds, minutes, or multiple minutes separation in time?

Seems likely that the UPS is the culprit, but if not, the grouping of the timing might help determine the likelihood of different types of issues.
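
For what it's worth, the "how tightly timed" question can be sketched as a small check on the spread in time-of-day across the reboots. Everything here is hypothetical: the minutes and seconds are invented, the helper name is made up, and the 60-second and 10-minute thresholds are just a rough rule of thumb, not anything from the project.

```python
from datetime import datetime


def time_of_day_spread_seconds(timestamps):
    """Largest difference in time-of-day (seconds since midnight) across events.

    Note: this simple version does not handle events that straddle midnight.
    """
    seconds = [t.hour * 3600 + t.minute * 60 + t.second for t in timestamps]
    return max(seconds) - min(seconds)


# Hypothetical reboot times; the minutes and seconds are invented for illustration.
reboots = [
    datetime(2011, 1, 16, 15, 0, 12),
    datetime(2011, 1, 30, 15, 0, 9),
    datetime(2011, 2, 13, 15, 0, 15),
]

spread = time_of_day_spread_seconds(reboots)
if spread < 60:
    verdict = "within seconds: looks like something on a schedule"
elif spread < 600:
    verdict = "within minutes: possibly set off by a scheduled event"
else:
    verdict = "loosely grouped: more likely environmental or human"
print(spread, verdict)
```

A spread of a few seconds would point at a timer of some kind; a spread of many minutes would fit better with a person or a building system on a routine.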
ID: 1078087 · Report as offensive
Profile blueone

Joined: 18 Sep 00
Posts: 7
Credit: 250,657
RAC: 0
Canada
Message 1078300 - Posted: 17 Feb 2011, 15:21:10 UTC

I think the timing of the reboots is kinda suspicious, and those kitties of kittyman's look kinda suspicious, hmmm... It's the cats! They must be aliens in disguise, and kittyman is the overlord. They are holding the data back cuz they know the next Wow! signal is ready to be crunched....

:)

Izzit
ID: 1078300 · Report as offensive
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1078305 - Posted: 17 Feb 2011, 15:34:00 UTC - in response to Message 1078300.  

I think the timing of the reboots is kinda suspicious, and those kitties of kittyman's look kinda suspicious, hmmm... It's the cats! They must be aliens in disguise, and kittyman is the overlord. They are holding the data back cuz they know the next Wow! signal is ready to be crunched....

:)

Izzit

The alien from that stupid '70s movie with Sandy Duncan had kittens! Several generations by now... They're everywhere! It's war between them and the Geico squirrels!

lol

David
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1078305 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.