Upgrades (Jan 30 2013)

Message boards : Technical News : Upgrades (Jan 30 2013)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 1332922 - Posted: 30 Jan 2013, 20:12:18 UTC

The other day synergy (the scheduling server) had one of its (more and more frequent) CPU locks. I'm pretty sure this is a problem with the Linux kernel, and not hardware, as the same problem happened on bruno when it was the scheduling server. Maybe this could be a software bug, but it's a pretty ugly crash that seems to be an inability to handle high demand. Maybe it's the way we have the system tuned. In any case, this happened just before the regular weekly outage, so the timing wasn't too bad.

During the outage I wrapped up one lingering project - merging a couple large tables in the Astropulse database. This is why the ap_assimilators have been off for most of the past week. I also have been getting more aggressive in upgrading the OSes on the backend systems for increased security and stability.

In reality the main push for upgrading the OSes is to bring everything to a point which will require a minimal amount of hands-on server administration... because we are currently evaluating the pros and cons of moving our server farm to a colocation facility on campus. We haven't decided one way or another yet, as we still have to determine costs and feasibility of moving our Hurricane Electric connection down on campus (where the facility is located). If we do end up making the leap, we immediately gain (a) better air conditioning without worry, (b) full UPS without worry, and (c) much better remote KVM access without worry (our current situation is wonky at best). Maybe we'll also get more bandwidth (that's a big maybe). Plus they have staff on hand to kick machines if necessary. This would vastly free up time and mental bandwidth so Jeff, Eric, and I can work on other things, like science! The con of course is the inconvenience if we do have to be hands-on with a broken server. Anyway, exciting times! This wouldn't be possible, of course, without many recent server upgrades that vastly reduced our physical footprint (or rackprint), thus bringing rack space rental at the colo within a reasonable limit.

I'll have more news on this front, of course, as we work our way through various hurdles, or end up backing out of the move and keeping things where they are. I should mention recent a/c fixes in our current closet were a total success, so there now seems to be less of a reason to rush into a colo situation. On the other hand, we have yet another planned lab-wide power outage coming up in February. We're getting real sick and tired of those. This wouldn't be an issue at the colo.

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 1332922 · Report as offensive
Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1332933 - Posted: 30 Jan 2013, 20:28:26 UTC - in response to Message 1332922.  

Thanks for the update Matt,

Have you thought of reverting the feeder size back down to 100 from the 200 it seems to be, and seeing if that reduces the number of scheduler timeouts?

Claggy
ID: 1332933 · Report as offensive
QSilver
Joined: 26 May 99
Posts: 232
Credit: 6,452,764
RAC: 0
United States
Message 1332937 - Posted: 30 Jan 2013, 20:35:20 UTC

Thanks, Matt. Keep up the good work!
ID: 1332937 · Report as offensive
-= Vyper =-
Volunteer tester
Joined: 5 Sep 99
Posts: 1652
Credit: 1,065,191,981
RAC: 2,537
Sweden
Message 1332948 - Posted: 30 Jan 2013, 21:12:28 UTC
Last modified: 30 Jan 2013, 21:13:53 UTC

Matt!

Check the motherboard's brand and model number; perhaps there's a watchdog timer in the BIOS that locks the machine (fails to reset) when the context/interrupt switching rate is high.
Today's virtualization technologies stress that to the max.
If so, tweak all buffers and NICs to reduce the switching mentioned above, and if it "lasts" longer then you're on the right track!

Kind regards Vyper

_________________________________________________________________________
Addicted to SETI crunching!
Founder of GPU Users Group
ID: 1332948 · Report as offensive
Tom*
Joined: 12 Aug 11
Posts: 127
Credit: 20,769,223
RAC: 9
United States
Message 1332967 - Posted: 30 Jan 2013, 22:15:03 UTC

Thank you Matt,

Whatever you did to the scheduler seems to be working today, for me.

Today is the first day in weeks I haven't had to continuously switch between direct access (to ask and report) and proxy to download. When I try to use the proxy to ask and report, the gateway times out in 1/3 of the time it takes when I use direct access.

Today I have been on the proxy all day, and what a relief it is :-) to just watch it work.

Thanks again.

ID: 1332967 · Report as offensive
Gary Charpentier (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 25 Dec 00
Posts: 31013
Credit: 53,134,872
RAC: 32
United States
Message 1333038 - Posted: 31 Jan 2013, 2:53:38 UTC

Thanks for the updates as always.

ID: 1333038 · Report as offensive
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1333103 - Posted: 31 Jan 2013, 8:01:02 UTC

Matt, thank you so much for the news bits.
I am not sure if colocation is really the best for the project, but I trust 'the boyz in da lab', as I fondly refer to you sometimes, shall determine what will work best.
Any prospect of increased bandwidth is very exciting, as it has been needed for soooooooo long now. I really would wish that relocating the servers would not be necessary for that to become a reality, but, I know that some hard to overcome politics are in play, and that is a darn shame.

Best of luck no matter what your decisions are. I know that they will be whatever is deemed the best for the Seti project and its science and its scientists.

Best Regards and meows,
Mark.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1333103 · Report as offensive
David S
Volunteer tester
Joined: 4 Oct 99
Posts: 18352
Credit: 27,761,924
RAC: 12
United States
Message 1333172 - Posted: 31 Jan 2013, 14:40:14 UTC - in response to Message 1333131.  

Maybe I just understand the situation differently than Chris S, but I say go for it unless you find a good reason not to. If I understand Matt correctly, the server kicking that the guys sometimes go in to do at odd times could be done by the staff of the colocation facility, at least most of the time. I've also always understood the weekly maintenance to be of data, not physical maintenance of the machines, so I'd think that could be done remotely.

Bottom line, even if you don't get better bandwidth (fingers crossed that you do), the more reliable power will be a big bonus.

David
Sitting on my butt while others boldly go,
Waiting for a message from a small furry creature from Alpha Centauri.

ID: 1333172 · Report as offensive
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1333193 - Posted: 31 Jan 2013, 16:32:22 UTC

If they stayed where they are, a rather modest increase to 150Mb or 200Mb would probably be enough to make a radical difference in comms. I don't think we need a full Gigabit pipe to handle things.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1333193 · Report as offensive
Ex: "Socialist"
Volunteer tester
Joined: 12 Mar 12
Posts: 3433
Credit: 2,616,158
RAC: 2
United States
Message 1333205 - Posted: 31 Jan 2013, 16:54:07 UTC
Last modified: 31 Jan 2013, 16:56:19 UTC

I'm glad to see there is talk of moving the servers to (what seems to be) a much more suitable location!
But it saddens me that the move doesn't necessarily mean more bandwidth. If more bandwidth weren't such a "big maybe", then this idea would sound too good to be true. :-)


Too bad they don't have OS-independent KVM support. IPMI is nice. With IPMI I have no power/reset wired on my server at all, no DVD-ROM drive, no monitor or keyboard. None of that was even necessary to install the OS, as IPMI supports your local machine's drive alongside a KVM connection. Even my BIOS can be accessed via IPMI during boot, as IPMI is running when your system is soft off.
#resist
ID: 1333205 · Report as offensive
GALAXY-VOYAGER
Joined: 21 Oct 12
Posts: 85
Credit: 157,743
RAC: 0
Australia
Message 1333224 - Posted: 31 Jan 2013, 17:43:34 UTC

I seem to be having an issue with downloads. My HP notebook (this computer) recently completed about 38 SETI tasks. Just before it completed the final 3 or 4, a list appeared in the Transfers tab. After a short time, it moved down the list from one item to the next, downloading each for a while. The Status column changed from Download:Active or Download:Pending to Retrying in nn:nn:nn. That time would count down, and then the status would change to Backing Off in nn:nn:nn.
When it shows that it will be backing off (after so many seconds/minutes), I have clicked the item on the top line, and the Status column resets to Download:Active or Download:Pending, and it continues to download.
Numerous tasks have downloaded successfully, but there are about a dozen yet to finish downloading.
However, after a while, it keeps reverting to Retrying, and some have gone to Backing Off (but so far I have managed to Retry Now before they do back off). As it stands at the moment, the remaining ones seem to be retrying and not threatening to back off. There seem to be only about 11 remaining to download.
I have been sitting here watching the progress and keeping it in hand by clicking Retry Now when they were about to back off, but I can't watch them any longer. I'll just have to hope for the best.
For future reference, if this happens again, what will happen if I suspend the project? Will it suspend the downloads, and if it does, what effect will it have?
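
For readers unfamiliar with this kind of countdown: a "Retrying in nn:nn:nn" timer that grows after repeated failures resembles a capped exponential backoff with random jitter, a common client-side retry scheme. The Python sketch below only illustrates that general technique. It is not the BOINC client's actual code, and the names `backoff_delay` and `format_countdown`, the 60-second base, and the 4-hour cap are all assumptions made up for this example.

```python
import random


def backoff_delay(attempt, base=60.0, cap=4 * 3600.0, rng=None):
    """Return how many seconds to wait before retry number `attempt` (1-based).

    Hypothetical parameters: `base` is the ceiling for the first retry and
    `cap` is the largest ceiling ever used. Neither value comes from BOINC.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    if rng is None:
        rng = random.Random()
    # The ceiling doubles with every failed attempt, but never exceeds `cap`.
    ceiling = min(cap, base * 2 ** (attempt - 1))
    # Jitter: pick a random delay up to the ceiling so that many clients
    # do not all retry against the server at the same moment.
    return rng.uniform(0.0, ceiling)


def format_countdown(seconds):
    """Format a delay in seconds as hh:mm:ss, like a transfer status column."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"
```

Under these assumptions, calling `format_countdown(backoff_delay(n))` for n = 1, 2, 3, ... gives countdowns whose upper bound doubles each time until it reaches the cap. Clicking a "retry now" button corresponds to skipping the wait, which is why doing so by hand across many clients adds load to an already busy server.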
GALAXY-VOYAGER
ID: 1333224 · Report as offensive
QSilver
Joined: 26 May 99
Posts: 232
Credit: 6,452,764
RAC: 0
United States
Message 1333226 - Posted: 31 Jan 2013, 17:48:55 UTC - in response to Message 1333224.  

@GALAXY-VOYAGER...you really need to ask questions like that in the Number Crunching Forum. Plenty of people over there are willing & able to help.

This forum is for technical news about the project from the project leaders. Getting assistance with your particular set-up or issues will happen more quickly in Number Crunching.
ID: 1333226 · Report as offensive
Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 1333233 - Posted: 31 Jan 2013, 18:15:30 UTC

For clarity: the possible, easier upgrade in bandwidth might be a nice side effect of moving to the colo, not the main reason for moving. Also, weekly outages are handled entirely remotely, i.e. I sit at my desk instead of standing in the closet. The exception is when I'm moving servers around, replacing drives, etc. In preparation for all this Jeff and I have been keeping track of how often we need to go into the closet for reasons that wouldn't be taken care of by the colo. We're looking at once a week, tops. Probably more like once a month. Oh yeah, part of the deal is that if we keep a stash of hard drives down there, they could do drive swaps for us as well.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 1333233 · Report as offensive
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1333235 - Posted: 31 Jan 2013, 18:32:29 UTC

Is this a Berkeley run facility? Or a 3rd party facility located on campus?
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1333235 · Report as offensive
Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 1333237 - Posted: 31 Jan 2013, 18:36:20 UTC - in response to Message 1333235.  

kittyman wrote: "Is this a Berkeley run facility? Or a 3rd party facility located on campus?"


Part of UC Berkeley. Otherwise there would be no way we could afford it!

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 1333237 · Report as offensive
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51478
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1333238 - Posted: 31 Jan 2013, 18:39:15 UTC - in response to Message 1333237.  

kittyman wrote: "Is this a Berkeley run facility? Or a 3rd party facility located on campus?"

Matt Lebofsky wrote: "Part of UC Berkeley. Otherwise there would be no way we could afford it!"

Then I suppose some consideration would be given to the reduced costs in the present location due to not using the electricity for the servers and AC unit?
Or is that already figured in?
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 1333238 · Report as offensive
Wiyosaya
Joined: 19 May 99
Posts: 39
Credit: 2,985,585
RAC: 0
United States
Message 1333385 - Posted: 1 Feb 2013, 1:49:04 UTC

Thanks for the update, Matt.

Personally, I've been holding off running S@H these days because it takes hours sometimes to get a single WU. If I hear this situation has gotten better, I would likely run S@H again.
ID: 1333385 · Report as offensive
Gary Charpentier (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 25 Dec 00
Posts: 31013
Credit: 53,134,872
RAC: 32
United States
Message 1333396 - Posted: 1 Feb 2013, 2:59:20 UTC - in response to Message 1333238.  

kittyman wrote: "Is this a Berkeley run facility? Or a 3rd party facility located on campus?"

Matt Lebofsky wrote: "Part of UC Berkeley. Otherwise there would be no way we could afford it!"

kittyman wrote: "Then I suppose some consideration would be given to the reduced costs in the present location due to not using the electricity for the servers and AC unit? Or is that already figured in?"

Berkeley pays for the electricity, A/C, etc. no matter where on campus the machines are. Additional costs should really only be the people being available to kick the machines, plus whatever markup the Regents want. Even the floor space of the server closet would be returned for them to "rent" to some other activity at the SSL.

Releasing the SSL<-->Campus IT fiber from being dedicated to Seti@Home to general use is hopefully a consideration that could be leveraged into better bandwidth from the colo to PAIX, but I don't know who "paid" for that fiber in the first place.

To keep this project running, I'm sure they are very good at counting beans.

ID: 1333396 · Report as offensive
Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1335298 - Posted: 6 Feb 2013, 23:53:31 UTC - in response to Message 1332922.  

Any chance you can turn the splitters on at Seti Beta, please? They haven't been running for at least four days now.

Claggy
ID: 1335298 · Report as offensive
juan BFP (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1335307 - Posted: 7 Feb 2013, 0:21:00 UTC

Any chance you can do something to let us at least report the already crunched WUs? I've had only server communication errors for the last 4 days.
ID: 1335307 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.