Happy Lupercalia! (Feb 14 2011)


Message boards : Technical News : Happy Lupercalia! (Feb 14 2011)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 1077328 - Posted: 14 Feb 2011, 22:27:06 UTC

Slow, steady progress... We're hoping to have everything copied from gowron onto thumper by tomorrow. Yeah, I know it's going slowly, but there are lots of bottlenecks (degraded RAID, NFS, tons of small files as opposed to a few big ones). After the usual outage we might actually have thumper ready to be the temporary workunit storage server so we can get back to business while doing the necessary upgrades on gowron (which may take as much as a week, unobtrusively running in the background).

That new-ish server synergy rebooted itself on Sunday. This concerned me as this has happened a couple times already. However, I discovered the three reboots thus far all happened on Sunday at 3pm, and two weeks apart from each other. There are no smoking-gun cronjobs, but it is plugged into an old UPS of unknown quality, so we're going to remove that from the equation and watch what happens. The reboots have all been harmless thus far.
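A quick way to sanity-check a pattern like this is to compute the gaps between reboot times and see whether they all match one period. The sketch below is only an illustration: the three timestamps are made up to fit the description above (Sundays, 3pm, two weeks apart) and are not taken from any real log, and the function names and 30-minute tolerance are arbitrary choices.

```python
from datetime import datetime, timedelta

# Hypothetical reboot times, chosen only to match the pattern described
# (Sundays at 3pm, two weeks apart). These are not real log entries.
reboots = [
    datetime(2011, 1, 16, 15, 0),
    datetime(2011, 1, 30, 15, 0),
    datetime(2011, 2, 13, 15, 0),
]


def gaps(times):
    """Return the intervals between consecutive events, oldest first."""
    ordered = sorted(times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def is_periodic(times, period, tolerance=timedelta(minutes=30)):
    """True if every gap between events is within `tolerance` of `period`."""
    return all(abs(gap - period) <= tolerance for gap in gaps(times))


# datetime.weekday() numbers Monday as 0, so Sunday is 6.
print("all on Sunday:", all(t.weekday() == 6 for t in reboots))
print("14-day period:", is_periodic(reboots, timedelta(days=14)))
```

With real timestamps pulled from the system logs, a strict period would point toward something scheduled, while irregular gaps would point elsewhere.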

Somebody somewhere on these forums asked what our server makeup was. It certainly isn't limited to what's on the server status page. If you just count the unix-based machines, there are currently 26 systems all told. Combining all the stuff inside, we have roughly 100 CPUs, 500GB RAM, and 150 TB raw storage. There are also several appliances (routers, switches, UPSes, kvms, remote controlled power strips, etc. etc.). Usually in these threads I'm griping about public facing servers, or ones causing the BOINC back end to jam up for one reason or another. I rarely mention the mundane, day-to-day, garden variety IT stuff.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

msattler
Project donor
Volunteer tester
Joined: 9 Jul 00
Posts: 38863
Credit: 577,377,544
RAC: 522,680
United States
Message 1077332 - Posted: 14 Feb 2011, 22:33:43 UTC

Thank you for the news, Matt.
Many have been waiting with much anticipation, and some even with a certain degree of patience....LOL.
Would be great if you can get back up and running tomorrow while you continue the tedious task of Gowron repairs.

Meow!
____________
*********************************************
Embrace your inner kitty...ya know ya wanna!

I have met a few friends in my life.
Most were cats.

Chris S
Project donor
Volunteer tester
Joined: 19 Nov 00
Posts: 31416
Credit: 12,098,950
RAC: 27,926
United Kingdom
Message 1077334 - Posted: 14 Feb 2011, 22:34:26 UTC

Thanks for the update there Matt.

Actually it was me that asked about all the non-public facing servers that Seti had :-) Sometime in the future when you are all not so busy, it would be nice to see a list of them and what they do. It will give people a real feel for the scope of the admin tasks that you all are responsible for.



____________
Damsel Rescuer, Kitty Patron, Uli Devotee, Julie Supporter
ES99 Admirer, Raccoon Friend, Anniet fan, RJ45 rulz OK!


Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 1077349 - Posted: 14 Feb 2011, 23:06:18 UTC - in response to Message 1077334.

It would be nice to see a list of them and what they do.

For security reasons, I tend to only name and define the systems that are already public facing, or otherwise already known.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Saaby900T
Joined: 24 Dec 10
Posts: 76
Credit: 4,971,171
RAC: 0
United States
Message 1077351 - Posted: 14 Feb 2011, 23:09:26 UTC

Thank You For the Update!!!!!!

Plz let us Know if you Need some new parts!!! (I'd Be willing to donate Some $$$) to Help out with parts. If I knew what was needed.

Jeff Mercer
Joined: 14 Aug 08
Posts: 90
Credit: 162,139
RAC: 0
United States
Message 1077354 - Posted: 14 Feb 2011, 23:20:17 UTC

Hi Matt, and THANKS for the update. Glad to hear that things are looking up, and that things might be up and running by tomorrow. As usual, I'll be here waiting. I might hold off a day or two until the "TRAFFIC" dies down a little bit, though. I have a feeling that the servers will be swamped for a while.
Thank you for all your hard work.

Jeff Mercer
Joined: 14 Aug 08
Posts: 90
Credit: 162,139
RAC: 0
United States
Message 1077356 - Posted: 14 Feb 2011, 23:26:33 UTC - in response to Message 1077351.

Thank You For the Update!!!!!!

Plz let us Know if you Need some new parts!!! (I'd Be willing to donate Some $$$) to Help out with parts. If I knew what was needed.



GOOD IDEA ! Later on, when and IF you have time, maybe you could tell us what all is needed, and the cost of what is needed. Sure wouldn't hurt to let us know about it.

Zapped Sparky
Project donor
Volunteer moderator
Volunteer tester
Joined: 30 Aug 08
Posts: 7301
Credit: 1,234,996
RAC: 1,321
United Kingdom
Message 1077363 - Posted: 14 Feb 2011, 23:46:37 UTC - in response to Message 1077328.

If you just count the unix-based machines, there are currently 26 systems all told. Combining all the stuff inside, we have roughly 100 CPUs, 500GB RAM, and 150 TB raw storage. There are also several appliances (routers, switches, UPSes, kvms, remote controlled power strips, etc. etc.).

Wow, that's a fair bit of equipment. Thanks for all the work we DON'T hear about that you do.

[seti.international] Dirk Sadowski
Project donor
Volunteer tester
Joined: 6 Apr 07
Posts: 7057
Credit: 59,948,898
RAC: 21,981
Germany
Message 1077405 - Posted: 15 Feb 2011, 3:00:08 UTC - in response to Message 1077328.

Matt, thanks for the news!

____________
BR



>Das Deutsche Cafe. The German Cafe.<

KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 1920
Credit: 9,696,040
RAC: 15,261
United States
Message 1077406 - Posted: 15 Feb 2011, 3:00:51 UTC
Last modified: 15 Feb 2011, 3:02:01 UTC

Maybe you should implement the slow ramp-up that was the rule back during the three-day outage era...

Oh, and don't forget to bring up beta sometime soon, once production gets running nicely...
____________
.

System 3 Lab
Joined: 29 Apr 08
Posts: 9
Credit: 1,585,775
RAC: 1,675
United States
Message 1077407 - Posted: 15 Feb 2011, 3:01:11 UTC

Just an FYI. I was up in Oakland during the last big one during the World Series. Anyway, all our equipment went down even though we had UPS's on everything. What caused it? Seems that the batteries in the UPS's are not eternal and have to be replaced like every 3 - 5 years! Could there be any relation to what is causing your issues?

Douglas Davidson
Joined: 26 Sep 05
Posts: 2
Credit: 687,522
RAC: 261
Canada
Message 1077411 - Posted: 15 Feb 2011, 3:27:25 UTC

Thank you for all the hard work. I still have some W.U.'s left to do, so will wait.
____________

-BeNt-
Joined: 17 Oct 99
Posts: 1234
Credit: 10,116,112
RAC: 0
United States
Message 1077436 - Posted: 15 Feb 2011, 6:42:50 UTC - in response to Message 1077328.

Usually in these threads I'm griping about public facing servers, or ones causing the BOINC back end to jam up for one reason or another. I rarely mention the mundane, day-to-day, garden variety IT stuff.

- Matt


Well, hopefully, Matt, sometime in the immediate future we can hear more about your mundane, day-to-day operations and less about the major issues. Growing pains are difficult, and mixing them with older hardware like you are dealing with can be... less than perfect, to put it mildly. Best of luck, gentlemen, and hopefully the IT gods will bless you and the gremlins let you stomp them for a change.

____________
Traveling through space at ~67,000mph!

Big Bang
Joined: 7 Jan 10
Posts: 670
Credit: 28,481
RAC: 0
Message 1077450 - Posted: 15 Feb 2011, 8:29:27 UTC

Thanks for the update Matt. Trust you had a good gig the other night to re-align the neurons. Always good and appreciative thoughts toward you and your colleagues. Cheers mate.

KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 1920
Credit: 9,696,040
RAC: 15,261
United States
Message 1077758 - Posted: 16 Feb 2011, 2:48:49 UTC - in response to Message 1077407.
Last modified: 16 Feb 2011, 2:54:07 UTC

Just an FYI. I was up in Oakland during the last big one during the World Series. Anyway, all our equipment went down even though we had UPS's on everything. What caused it? Seems that the batteries in the UPS's are not eternal and have to be replaced like every 3 - 5 years! Could there be any relation to what is causing your issues?


Are you talking the Earthquake series in '89 or last year?

The routinely spaced hits at the same wall-clock time two weeks apart that Matt is talking about don't seem to be battery-related to me... (assuming, of course, that Matt has checked the possibility that someone [who comes in alternate weeks...] is routinely turning off the circuit that the UPS in question is plugged into...)
____________
.

John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24329
Credit: 519,750
RAC: 37
United States
Message 1077769 - Posted: 16 Feb 2011, 3:40:39 UTC - in response to Message 1077758.

Just an FYI. I was up in Oakland during the last big one during the World Series. Anyway, all our equipment went down even though we had UPS's on everything. What caused it? Seems that the batteries in the UPS's are not eternal and have to be replaced like every 3 - 5 years! Could there be any relation to what is causing your issues?


Are you talking the Earthquake series in '89 or last year?

The routinely spaced hits at the same wall-clock time two weeks apart that Matt is talking about don't seem to be battery-related to me... (assuming, of course, that Matt has checked the possibility that someone [who comes in alternate weeks...] is routinely turning off the circuit that the UPS in question is plugged into...)

The cleaning crew unplugging something to crank up their vacuum cleaners?
____________


BOINC WIKI

N9JFE David S
Project donor
Volunteer tester
Joined: 4 Oct 99
Posts: 11150
Credit: 13,922,376
RAC: 12,680
United States
Message 1077921 - Posted: 16 Feb 2011, 15:45:25 UTC - in response to Message 1077769.

The routinely spaced hits at the same wall-clock time two weeks apart that Matt is talking about don't seem to be battery-related to me... (assuming, of course, that Matt has checked the possibility that someone [who comes in alternate weeks...] is routinely turning off the circuit that the UPS in question is plugged into...)

The cleaning crew unplugging something to crank up their vacuum cleaners?

In the school where I work it used to be standard procedure to turn off the power to all the computers in the computer lab every night, via 4 master switches in the lab director's office (now they keep them on 24/7). One night, I saw the custodian turn on the power so he could plug in the vacuum... which of course turned on all the computers. When he was done, he turned them all off again without doing a proper shutdown of Windows. At the time, they were, probably, Pentium 3s running Windows NT. It didn't seem to do them any harm, though (except maybe an unquantifiable shortening of their lives, but they were retired long before that became an issue).

David
____________
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


KWSN THE Holy Hand Grenade!
Volunteer tester
Joined: 20 Dec 05
Posts: 1920
Credit: 9,696,040
RAC: 15,261
United States
Message 1077952 - Posted: 16 Feb 2011, 17:50:34 UTC - in response to Message 1077769.
Last modified: 16 Feb 2011, 17:53:30 UTC



The routinely spaced hits at the same wall-clock time two weeks apart that Matt is talking about don't seem to be battery-related to me... (assuming, of course, that Matt has checked the possibility that someone [who comes in alternate weeks...] is routinely turning off the circuit that the UPS in question is plugged into...)


The cleaning crew unplugging something to crank up their vacuum cleaners?


Regularly, at the same time, every other weekend? Doesn't seem likely...

The SSL doesn't have carpeted floors, so substitute "floor polisher(s)" for "vacuum cleaner"... ;-)
____________
.

Cheopis
Joined: 17 Sep 00
Posts: 139
Credit: 10,905,911
RAC: 9,332
United States
Message 1078087 - Posted: 16 Feb 2011, 22:11:12 UTC

How tightly timed are the failures?

Within seconds, minutes, or multiple minutes separation in time?

Seems likely that the UPS is the culprit, but if not, the grouping of the timing might help determine the likelihood of different types of issues.
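To put a number on "how tightly timed", one option is to compare only the time-of-day part of each reboot and look at the spread. This is a rough sketch with invented timestamps (the real ones would have to come from the server's logs), and the thresholds for "seconds" versus "minutes" are arbitrary assumptions. It also ignores events that straddle midnight.

```python
from datetime import datetime

# Hypothetical reboot timestamps, invented for illustration only.
reboots = [
    datetime(2011, 1, 16, 15, 0, 2),
    datetime(2011, 1, 30, 15, 0, 41),
    datetime(2011, 2, 13, 15, 3, 10),
]


def seconds_into_day(t):
    """Seconds elapsed since midnight for a given timestamp."""
    return t.hour * 3600 + t.minute * 60 + t.second


def time_of_day_spread(times):
    """Difference between the latest and earliest time of day, in seconds."""
    offsets = [seconds_into_day(t) for t in times]
    return max(offsets) - min(offsets)


def describe(spread_seconds):
    """Bucket a spread into the rough categories asked about above."""
    if spread_seconds < 60:
        return "within seconds"
    if spread_seconds < 600:
        return "within minutes"
    return "many minutes apart"


spread = time_of_day_spread(reboots)
print(spread, describe(spread))
```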

blueone
Joined: 18 Sep 00
Posts: 7
Credit: 250,657
RAC: 0
Canada
Message 1078300 - Posted: 17 Feb 2011, 15:21:10 UTC

I think the timing of the reboots is kinda suspicious, and those kitties of Kittieman look kinda suspicious, hmmm... It's the cats! They must be aliens in disguise, and Kittieman is the overlord. They are holding the data back cuz they know the next Wow! signal is ready to be crunched....

:)

Izzit
____________

Copyright © 2014 University of California