Upgrades (Jan 30 2013)


Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 1332922 - Posted: 30 Jan 2013, 20:12:18 UTC

The other day synergy (the scheduling server) had one of its (more and more frequent) CPU lockups. I'm pretty sure this is a problem with the Linux kernel, and not hardware, as this problem also happened on bruno when it was the scheduling server. Maybe it could be a software bug, but it's a pretty ugly crash that seems to come from an inability to handle high demand. Maybe it's the way we have the system tuned. In any case, this happened just before the regular weekly outage, so the timing wasn't too bad.
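
As an aside, this class of failure usually leaves telltale watchdog messages in the kernel log. Here's a minimal sketch of checking for them; the message patterns are the standard Linux lockup signatures, and the script itself is illustrative only:

    # Scan the kernel ring buffer for the standard Linux lockup
    # signatures, to help tell a kernel soft/hard lockup from a
    # hardware fault. Assumes a Linux host where `dmesg` is readable.
    import re
    import subprocess

    LOCKUP_PATTERNS = [
        r"soft lockup - CPU#\d+ stuck",        # softlockup watchdog
        r"hard LOCKUP",                        # NMI hardlockup watchdog
        r"blocked for more than \d+ seconds",  # khungtaskd hung-task check
    ]

    def find_lockups():
        """Return dmesg lines matching known kernel-lockup signatures."""
        out = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
        pattern = re.compile("|".join(LOCKUP_PATTERNS))
        return [line for line in out.splitlines() if pattern.search(line)]

    hits = find_lockups()
    print("%d lockup-related messages found" % len(hits))
    for line in hits[-10:]:  # show only the most recent few
        print(line)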

During the outage I wrapped up one lingering project - merging a couple of large tables in the Astropulse database. This is why the ap_assimilators have been off for most of the past week. I've also been getting more aggressive in upgrading the OSes on the backend systems for increased security and stability.

In reality, the main push for upgrading the OSes is to bring everything to a point that will require a minimal amount of hands-on server administration... because we are currently evaluating the pros and cons of moving our server farm to a colocation facility on campus. We haven't decided one way or the other yet, as we still have to determine the costs and feasibility of moving our Hurricane Electric connection down to campus (where the facility is located). If we do end up making the leap, we immediately gain (a) better air conditioning without worry, (b) full UPS without worry, and (c) much better remote KVM access without worry (our current situation is wonky at best). Maybe we'll also get more bandwidth (that's a big maybe). Plus they have staff on hand to kick machines if necessary. This would vastly free up time and mental bandwidth so Jeff, Eric, and I can work on other things, like science! The con, of course, is the inconvenience if we do have to be hands-on with a broken server. Anyway, exciting times! This wouldn't be possible, of course, without the many recent server upgrades that vastly reduced our physical footprint (or rackprint), thus bringing rack space rental at the colo within a reasonable limit.

I'll have more news on this front, of course, as we work our way through various hurdles, or end up backing out of the move and keeping things where they are. I should mention that the recent a/c fixes in our current closet were a total success, so there now seems to be less of a reason to rush into a colo situation. On the other hand, we have yet another planned lab-wide power outage coming up in February. We're getting real sick and tired of those. That wouldn't be an issue at the colo.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4048
Credit: 32,693,129
RAC: 604
United Kingdom
Message 1332933 - Posted: 30 Jan 2013, 20:28:26 UTC - in response to Message 1332922.

Thanks for the update, Matt.

Have you thought of reverting the feeder size back down to 100 from the 200 it seems to be at, and seeing if that reduces the number of scheduler timeouts?
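
For reference, that sizing lives in the project's config.xml. A minimal sketch of reading the current values, assuming the stock BOINC options shmem_work_items and feeder_query_size (names may differ by server version):

    # Read the feeder-related sizing out of a BOINC project's config.xml.
    # Assumes the stock <shmem_work_items> / <feeder_query_size> options.
    import xml.etree.ElementTree as ET

    def feeder_settings(path="config.xml"):
        root = ET.parse(path).getroot()
        keys = ("shmem_work_items", "feeder_query_size")
        return {k: (root.findtext(".//" + k) or "unset (default)") for k in keys}

    print(feeder_settings())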

Claggy

QSilver
Joined: 26 May 99
Posts: 227
Credit: 4,488,503
RAC: 3,306
United States
Message 1332937 - Posted: 30 Jan 2013, 20:35:20 UTC

Thanks, Matt. Keep up the good work!

-= Vyper =-
Volunteer tester
Joined: 5 Sep 99
Posts: 1039
Credit: 301,790,368
RAC: 164,080
Sweden
Message 1332948 - Posted: 30 Jan 2013, 21:12:28 UTC
Last modified: 30 Jan 2013, 21:13:53 UTC

Matt!

Check the motherboard to see what the brand and model no. are; perhaps it's a watchdog timer in the BIOS that locks the machine (fails to reset) when the context/interrupt switching load is high.
Today's virtualization technologies stress that to the max.
If so, tweak all the buffers and NICs to reduce that load, and if the machine "lasts" longer, then you're on the right track!
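
To see whether there's headroom in the NIC buffers to begin with, something like this works (a minimal sketch assuming Linux with ethtool installed; "eth0" is a placeholder interface name):

    # Dump a NIC's current RX/TX ring sizes next to the hardware maximums
    # (ethtool -g), to see if there is headroom to enlarge the buffers
    # with `ethtool -G`.
    import subprocess

    def ring_report(iface="eth0"):
        """Return ethtool's ring-buffer report for the given interface."""
        result = subprocess.run(["ethtool", "-g", iface],
                                capture_output=True, text=True, check=True)
        return result.stdout

    print(ring_report("eth0"))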

Kind regards Vyper
____________

_________________________________________________________________________
Addicted to SETI crunching!
Founder of GPU Users Group

Tom
Joined: 12 Aug 11
Posts: 114
Credit: 4,566,097
RAC: 0
United States
Message 1332967 - Posted: 30 Jan 2013, 22:15:03 UTC

Thank you, Matt.

Whatever you did to the scheduler seems to be working today, for me. Today is the first day in weeks I haven't had to continuously switch between direct access (to ask and report) and the proxy to download. When I try to use the proxy to ask and report, the gateway times out in a third of the time it takes with direct access.

Today I have been on the proxy all day, and what a relief it is :-) to just watch it work.

Thanks again.

Gary Charpentier
Volunteer tester
Joined: 25 Dec 00
Posts: 12130
Credit: 6,411,547
RAC: 8,235
United States
Message 1333038 - Posted: 31 Jan 2013, 2:53:38 UTC

Thanks for the updates as always.


msattler
Volunteer tester
Joined: 9 Jul 00
Posts: 38320
Credit: 559,575,414
RAC: 643,928
United States
Message 1333103 - Posted: 31 Jan 2013, 8:01:02 UTC

Matt, thank you so much for the news bits.
I am not sure colocation is really the best thing for the project, but I trust 'the boyz in da lab', as I fondly refer to you sometimes, to determine what will work best.
Any prospect of increased bandwidth is very exciting, as it has been needed for soooooooo long now. I really wish that relocating the servers were not necessary for that to become a reality, but I know that some hard-to-overcome politics are in play, and that is a darn shame.

Best of luck no matter what your decisions are. I know that they will be whatever is deemed best for the SETI project and its science and its scientists.

Best Regards and meows,
Mark.
____________
*********************************************
Embrace your inner kitty...ya know ya wanna!

I have met a few friends in my life.
Most were cats.

Chris S
Volunteer tester
Joined: 19 Nov 00
Posts: 31119
Credit: 11,288,590
RAC: 19,992
United Kingdom
Message 1333131 - Posted: 31 Jan 2013, 11:11:11 UTC

As always, many thanks for the update. Like you, I can see both pros and cons in co-locating the SETI servers elsewhere on campus. The immediate benefits of better air conditioning, full UPS, and better remote KVM access, plus staff on hand to kick machines, are of course a big plus. It would also mean less time for you guys spent as server admins, letting you get on with the science.

But what about the weekly outage: can that all be done remotely, or will it mean a physical visit? It's not unknown for lab staff to visit on a Saturday or Sunday to kick ailing servers, which is all well and good when you have your own lab front door key. Will the co-location facility allow access over a weekend? The Berkeley campus covers a fair area, so presumably we are talking about needing transport to travel between the lab and the co-location facility, not a 5-minute walk.

These continuing power shutdowns are getting out of hand and show no sign of diminishing, but if the lab is off the air, you wouldn't be able to remotely access the servers elsewhere anyway, except maybe from a laptop on battery power, assuming there is wifi. You wouldn't want to spend a day traveling up and down the hill a number of times if there is a real project hiccup, either.

As Mark so astutely observes, there are Berkeley internal politics that have to be taken into consideration as well, in addition to the Hurricane Electric link. My leaning is towards staying put, but I'm not there and you guys are, and I know you will not make any move without fully researching all aspects of it. Whichever way you go, you'll have my 100% support.

N9JFE David S
Volunteer tester
Joined: 4 Oct 99
Posts: 10739
Credit: 13,446,476
RAC: 13,745
United States
Message 1333172 - Posted: 31 Jan 2013, 14:40:14 UTC - in response to Message 1333131.

Maybe I just understand the situation differently than Chris S, but I say go for it unless you find a good reason not to. If I understand Matt correctly, the server kicking that the guys sometimes go in to do at odd times could be done by the staff of the colocation facility, at least most of the time. I've also always understood the weekly maintenance to be of data, not physical maintenance of the machines, so I'd think that could be done remotely.

Bottom line, even if you don't get better bandwidth (fingers crossed that you do), the more reliable power will be a big bonus.

____________
David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.


msattler
Volunteer tester
Joined: 9 Jul 00
Posts: 38320
Credit: 559,575,414
RAC: 643,928
United States
Message 1333193 - Posted: 31 Jan 2013, 16:32:22 UTC

If they stay where they are, a rather modest increase to 150 or 200 Mbit/s would probably be enough to make a radical difference in comms. I don't think we need a full gigabit pipe to handle things.
____________
*********************************************
Embrace your inner kitty...ya know ya wanna!

I have met a few friends in my life.
Most were cats.

Ex
Volunteer moderator
Volunteer tester
Joined: 12 Mar 12
Posts: 2895
Credit: 1,689,254
RAC: 1,284
United States
Message 1333205 - Posted: 31 Jan 2013, 16:54:07 UTC
Last modified: 31 Jan 2013, 16:56:19 UTC

I'm glad to see there is talk of moving the servers to (what seems to be) a much more suitable location!
But it saddens me that the move doesn't necessarily mean more bandwidth. If more bandwidth weren't such a "big maybe", this idea would sound almost too good to be true. :-)


Too bad they don't have OS-independent KVM support. IPMI is nice. With IPMI I have no power/reset switch wired on my server at all, no DVD-ROM drive, no monitor or keyboard; it wasn't even necessary to attach any of those to install the OS, as IPMI can mount your local machine's drive alongside the KVM connection. Even my BIOS can be accessed via IPMI during boot, as IPMI is running even when the system is soft off.
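
A minimal sketch of what that looks like in practice, driving a BMC with the stock ipmitool CLI (the host and credentials below are placeholders):

    # Out-of-band control via ipmitool over the lanplus interface.
    # Because the BMC runs on standby power, this works even when the
    # OS is down or the machine is soft off.
    import subprocess

    def ipmi(host, user, password, *args):
        cmd = ["ipmitool", "-I", "lanplus",
               "-H", host, "-U", user, "-P", password] + list(args)
        return subprocess.run(cmd, capture_output=True, text=True,
                              check=True).stdout

    # Usage (hypothetical BMC address):
    # print(ipmi("bmc.example.org", "admin", "secret",
    #            "chassis", "power", "status"))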
____________
-Dave #2

3.2.0-33

GALAXY-VOYAGER
Joined: 21 Oct 12
Posts: 85
Credit: 128,843
RAC: 41
Australia
Message 1333224 - Posted: 31 Jan 2013, 17:43:34 UTC

I seem to be having an issue with downloads. My HP notebook (this computer) has recently completed about 38 SETI tasks. Just prior to completing the final 3 or 4, a list appeared in the Transfers tab. After a short time it moved down the list from one item to the next, downloading each for a certain time. The Status column changed from "Download: active" or "Download: pending" to "Retrying in nn:nn:nn". A certain time would count down, and it would change to "Backing off in nn:nn:nn".
When it shows that it is about to back off (after so many seconds/minutes), I have clicked the item on the top line, and the Status column resets to "Download: active" or "Download: pending" and it continues to download.
Numerous tasks have downloaded successfully, but there are about a dozen yet to finish downloading.
However, after a while it keeps reverting to the retrying state, and some have gone to the backing-off state (though so far I have managed to hit Retry Now before they DO back off). As it stands at the moment, the remaining ones, about 11 of them, seem to be retrying and not threatening to back off.
I have been sitting here watching the progress and keeping things in hand by clicking Retry Now whenever they were about to back off, but I can't watch them any longer. I'll just have to hope for the best.
For future reference, if this happens again, what will happen if I suspend the project? Will it suspend the downloads, and if it does, what effect will it have?
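
For context, those "Retrying in nn:nn:nn" / "Backing off in nn:nn:nn" countdowns look like classic exponential backoff with a cap. A minimal sketch of the idea, as an assumption about the client's behavior rather than its actual code:

    # Exponential backoff with jitter: each failure roughly doubles the
    # wait, up to a cap, with randomness so many retries don't collide.
    import random

    def next_backoff(n_failures, base=60.0, cap=4 * 3600.0):
        """Seconds to wait before retry number n_failures (1-based)."""
        delay = min(cap, base * 2 ** (n_failures - 1))
        return random.uniform(0.5 * delay, delay)  # jitter spreads retries

    # e.g. typical waits after 1..5 consecutive failures:
    # [int(next_backoff(n)) for n in range(1, 6)]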
____________
GALAXY-VOYAGER

QSilver
Joined: 26 May 99
Posts: 227
Credit: 4,488,503
RAC: 3,306
United States
Message 1333226 - Posted: 31 Jan 2013, 17:48:55 UTC - in response to Message 1333224.

@GALAXY-VOYAGER... you really need to ask questions like that in the Number Crunching forum. Plenty of people over there are willing & able to help.

This forum is for technical news about the project from the project leaders. Getting assistance with your particular set-up or issues will happen more quickly in Number Crunching.

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 1333233 - Posted: 31 Jan 2013, 18:15:30 UTC

For clarity: the possible, easier upgrade in bandwidth would be a nice side effect of moving to the colo, not the main reason for moving. Also, the weekly outages happen entirely remotely, i.e. I sit at my desk instead of standing in the closet. The exception is when I'm moving servers around, replacing drives, etc. In preparation for all this, Jeff and I have been keeping track of how often we need to go into the closet for reasons that wouldn't be taken care of by the colo. We're looking at once a week, tops. Probably more like once a month. Oh yeah, part of the deal is that if we keep a stash of hard drives down there, they can do drive swaps for us as well.

- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

msattler
Volunteer tester
Joined: 9 Jul 00
Posts: 38320
Credit: 559,575,414
RAC: 643,928
United States
Message 1333235 - Posted: 31 Jan 2013, 18:32:29 UTC

Is this a Berkeley run facility? Or a 3rd party facility located on campus?
____________
*********************************************
Embrace your inner kitty...ya know ya wanna!

I have met a few friends in my life.
Most were cats.

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 1333237 - Posted: 31 Jan 2013, 18:36:20 UTC - in response to Message 1333235.

Is this a Berkeley run facility? Or a 3rd party facility located on campus?


Part of UC Berkeley. Otherwise there would be no way we could afford it!

- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

msattler
Volunteer tester
Joined: 9 Jul 00
Posts: 38320
Credit: 559,575,414
RAC: 643,928
United States
Message 1333238 - Posted: 31 Jan 2013, 18:39:15 UTC - in response to Message 1333237.

Is this a Berkeley run facility? Or a 3rd party facility located on campus?


Part of UC Berkeley. Otherwise there would be no way we could afford it!

- Matt

Then I suppose some consideration would be given to the reduced costs in the present location due to not using the electricity for the servers and AC unit?
Or is that already figured in?
____________
*********************************************
Embrace your inner kitty...ya know ya wanna!

I have met a few friends in my life.
Most were cats.

Wiyosaya
Joined: 19 May 99
Posts: 39
Credit: 801,970
RAC: 14
United States
Message 1333385 - Posted: 1 Feb 2013, 1:49:04 UTC

Thanks for the update, Matt.

Personally, I've been holding off on running S@H these days because it sometimes takes hours to get a single WU. If I hear this situation has gotten better, I'll likely run S@H again.

Gary Charpentier
Volunteer tester
Joined: 25 Dec 00
Posts: 12130
Credit: 6,411,547
RAC: 8,235
United States
Message 1333396 - Posted: 1 Feb 2013, 2:59:20 UTC - in response to Message 1333238.

Is this a Berkeley run facility? Or a 3rd party facility located on campus?


Part of UC Berkeley. Otherwise there would be no way we could afford it!

- Matt

Then I suppose some consideration would be given to the reduced costs in the present location due to not using the electricity for the servers and AC unit?
Or is that already figured in?

Berkeley pays for the electricity, A/C, etc. no matter where on campus the machines are. The additional costs should really only be the people being available to kick the machines, plus whatever markup the Regents want. Even the floor space of the server closet would be freed up for them to "rent" to some other activity at the SSL.

Releasing the SSL<-->campus IT fiber from dedicated SETI@home use to general use is hopefully a consideration that might help leverage better bandwidth from the colo to PAIX, but I don't know who "paid" for that fiber in the first place.

To keep this project running, I'm sure they are very good at counting beans.


Chris S
Volunteer tester
Joined: 19 Nov 00
Posts: 31119
Credit: 11,288,590
RAC: 19,992
United Kingdom
Message 1333531 - Posted: 1 Feb 2013, 14:49:23 UTC

Thanks for the extra clarity Matt.

weekly outages entirely happen remotely, i.e. I sit at my desk instead of standing in the closet. That is except if I'm moving servers around, replacing drives, etc.

That I wasn't fully aware of.

In preparation for all this Jeff and I have been keeping track how often we need to go into the closet for reasons that wouldn't be taken care of by the colo. We're looking at once a week, tops. Probably more like once a month. Oh yeah part of the deal is if we keep a stash of hard drives down there, they could do drive swaps for us as well.

In that case, in light of this new information, it does seem to make more sense to consider relocating the servers.

