Upgrades (Jan 30 2013)
Author | Message |
---|---|
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
The other day synergy (the scheduling server) had one of its (more and more frequent) CPU locks. I'm pretty sure this is a problem with the Linux kernel, and not hardware, as this problem happened on bruno when it was the scheduling server. Maybe this could be a software bug, but it's a pretty ugly crash that seems to be an inability to handle high demand. Maybe it's the way we have the system tuned. In any case, this happened just before the regular weekly outage, so the timing wasn't too bad. During the outage I wrapped up one lingering project - merging a couple large tables in the Astropulse database. This is why the ap_assimilators have been off for most of the past week. I also have been getting more aggressive in upgrading the OSes on the backend systems for increased security and stability. In reality the main push for upgrading the OSes is to bring everything to a point which will require a minimal amount of hands-on server administration... because we are currently evaluating the pros and cons of moving our server farm to a colocation facility on campus. We haven't decided one way or another yet, as we still have to determine costs and feasibility of moving our Hurricane Electric connection down on campus (where the facility is located). If we do end up making the leap, we immediately gain (a) better air conditioning without worry, (b) full UPS without worry, and (c) much better remote KVM access without worry (our current situation is wonky at best). Maybe we'll also get more bandwidth (that's a big maybe). Plus they have staff on hand to kick machines if necessary. This would vastly free up time and mental bandwidth so Jeff, Eric, and I can work on other things, like science! The con of course is the inconvenience if we do have to be hands-on with a broken server. Anyway, exciting times!
This wouldn't be possible, of course, without many recent server upgrades that vastly reduced our physical footprint (or rackprint), thus bringing rack space rental at the colo within a reasonable limit. I'll have more news on this front, of course, as we work our way through various hurdles, or end up backing out of the move and keeping things where they are. I should mention recent a/c fixes in our current closet were a total success, so there now seems to be less of a reason to rush into a colo situation. On the other hand, we have yet another planned lab-wide power outage coming up in February. We're getting real sick and tired of those. This wouldn't be an issue at the colo. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
Thanks for the update, Matt. Have you thought of reverting the feeder size back down to 100 from the 200 it seems to be, to see if that reduces the number of scheduler timeouts? Claggy |
QSilver Joined: 26 May 99 Posts: 232 Credit: 6,452,764 RAC: 0 |
Thanks, Matt. Keep up the good work! |
-= Vyper =- Joined: 5 Sep 99 Posts: 1652 Credit: 1,065,191,981 RAC: 2,537 |
Matt! Check the motherboard to see what the brand and model no. are; perhaps it's a watchdog timer in the BIOS that locks the machine (fails to reset) when the context/interrupt switching is high. Today's virtualization technologies stress that to the max. If so, tweak all buffers and NICs to reduce the thing mentioned above, and if it "lasts" longer then you're on the right track! Kind regards Vyper _________________________________________________________________________ Addicted to SETI crunching! Founder of GPU Users Group |
Tom* Joined: 12 Aug 11 Posts: 127 Credit: 20,769,223 RAC: 9 |
Thank you Matt. Whatever you did to the scheduler seems to be working today, for me. Today is the first day in weeks I haven't had to continuously switch between direct access (to ask and report) and proxy to download. When I try to use the proxy to ask and report, the gateway times out in 1/3 of the time as when I use direct access. Today I have been on the proxy all day and what a relief it is:-) to just watch it work. Thanks again. |
Gary Charpentier Joined: 25 Dec 00 Posts: 31012 Credit: 53,134,872 RAC: 32 |
Thanks for the updates as always. |
kittyman Joined: 9 Jul 00 Posts: 51478 Credit: 1,018,363,574 RAC: 1,004 |
Matt, thank you so much for the news bits. I am not sure if colocation is really the best for the project, but I trust 'the boyz in da lab', as I fondly refer to you sometimes, shall determine what will work best. Any prospect of increased bandwidth is very exciting, as it has been needed for soooooooo long now. I really would wish that relocating the servers would not be necessary for that to become a reality, but I know that some hard-to-overcome politics are in play, and that is a darn shame. Best of luck no matter what your decisions are. I know that they will be whatever is deemed the best for the Seti project and its science and its scientists. Best Regards and meows, Mark. "Time is simply the mechanism that keeps everything from happening all at once." |
David S Joined: 4 Oct 99 Posts: 18352 Credit: 27,761,924 RAC: 12 |
Maybe I just understand the situation differently than Chris S, but I say go for it unless you find a good reason not to. If I understand Matt correctly, the server kicking that the guys sometimes go in to do at odd times could be done by the staff of the colocation facility, at least most of the time. I've also always understood the weekly maintenance to be of data, not physical maintenance of the machines, so I'd think that could be done remotely. Bottom line, even if you don't get better bandwidth (fingers crossed that you do), the more reliable power will be a big bonus. David Sitting on my butt while others boldly go, Waiting for a message from a small furry creature from Alpha Centauri. |
kittyman Joined: 9 Jul 00 Posts: 51478 Credit: 1,018,363,574 RAC: 1,004 |
If they stayed where they are, a rather modest increase to 150Mb or 200Mb would probably be enough to make a radical difference in comms. I don't think we need a full Gigabit pipe to handle things. "Time is simply the mechanism that keeps everything from happening all at once." |
Ex: "Socialist" Joined: 12 Mar 12 Posts: 3433 Credit: 2,616,158 RAC: 2 |
I'm glad to see there is talk of moving the servers to (what seems to be) a much more suitable location! But it saddens me that that move doesn't necessarily mean more bandwidth. If more bandwidth weren't such a "big maybe", this idea would sound too good to be true. :-) Too bad they don't have OS-independent KVM support. IPMI is nice. With IPMI I have no power/reset wired on my server at all, no DVD-ROM drive, no monitor or keyboard; none of that was even necessary to install the OS, as IPMI supports your local machine's drive alongside a KVM connection. Even my BIOS can be accessed via IPMI during boot, as IPMI is running when your system is soft off. #resist |
GALAXY-VOYAGER Joined: 21 Oct 12 Posts: 85 Credit: 157,743 RAC: 0 |
I seem to be having an Issue with Downloads. My HP Notebook (This Computer), has recently Completed about 38 SETI Tasks. Just Prior to Completing the final 3 or 4, a List Appeared in The TRANSFERS TAB. After a short time, it moved down the List from one to the next after Downloading each for a certain time. The STATUS Column changed from Download:Active or Download:Pending, to Retrying In nn:nn:nn. A certain time would Count Down, and it will change to Backing Off in nn:nn:nn. When it shows that it will be Backing Off (after so many seconds/minutes), I have Clicked the Item on The top line, and The Status Column is Reset to Download:Active or Download:Pending, and it Continues to Download. Numerous Tasks have Downloaded successfully, but there's about a dozen yet to finish Downloading. However, after a while, it keeps reverting back to The Retrying and some have gone to The Backing Off situations (but, so far I have managed to Retry now before they DO Back Off). As it stands at the moment, the remaining ones seem to be Retrying and not threatening to back Off. There only seems to be about 11 Remaining to Download. I have been sitting here watching the progress and keeping it in hand by Clicking Retry Now when they were about to Back Off, but I can't watch them any longer. I'll just have to hope for the best. For future Reference, if this happens again, what will happen if I SUSPEND The PROJECT? ..... Will it Suspend The Downloads, and if it does, what effect will it have? GALAXY-VOYAGER |
QSilver Joined: 26 May 99 Posts: 232 Credit: 6,452,764 RAC: 0 |
@GALAXY-VOYAGER...you really need to ask questions like that in the Number Crunching Forum. Plenty of people over there are willing & able to help. This forum is for technical news about the project from the project leaders. Getting assistance with your particular set-up or issues will happen more quickly in Number Crunching. |
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
For clarity: the possible, easier upgrade in bandwidth might be a nice side effect when moving to the colo, not the main reason for moving to colo. Also, weekly outages entirely happen remotely, i.e. I sit at my desk instead of standing in the closet. That is except if I'm moving servers around, replacing drives, etc. In preparation for all this Jeff and I have been keeping track of how often we need to go into the closet for reasons that wouldn't be taken care of by the colo. We're looking at once a week, tops. Probably more like once a month. Oh yeah, part of the deal is if we keep a stash of hard drives down there, they could do drive swaps for us as well. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
kittyman Joined: 9 Jul 00 Posts: 51478 Credit: 1,018,363,574 RAC: 1,004 |
Is this a Berkeley run facility? Or a 3rd party facility located on campus? "Time is simply the mechanism that keeps everything from happening all at once." |
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
Is this a Berkeley run facility? Or a 3rd party facility located on campus? Part of UC Berkeley. Otherwise there would be no way we could afford it! - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
kittyman Joined: 9 Jul 00 Posts: 51478 Credit: 1,018,363,574 RAC: 1,004 |
Is this a Berkeley run facility? Or a 3rd party facility located on campus? Then I suppose some consideration would be given to the reduced costs in the present location due to not using the electricity for the servers and AC unit? Or is that already figured in? "Time is simply the mechanism that keeps everything from happening all at once." |
Wiyosaya Joined: 19 May 99 Posts: 39 Credit: 2,985,585 RAC: 0 |
Thanks for the update, Matt. Personally, I've been holding off running S@H these days because it takes hours sometimes to get a single WU. If I hear this situation has gotten better, I would likely run S@H again. |
Gary Charpentier Joined: 25 Dec 00 Posts: 31012 Credit: 53,134,872 RAC: 32 |
Is this a Berkeley run facility? Or a 3rd party facility located on campus? Berkeley pays for the electricity, A/C, etc. no matter where on campus the machines are. Additional costs should really only be the people being available to kick the machines, plus whatever markup the Regents want. Even the floor space of the server closet would be returned for them to "rent" to some other activity at the SSL. Releasing the SSL<-->Campus IT fiber from dedicated Seti@Home use to general use is hopefully a consideration that could be leveraged into better bandwidth from the colo to PAIX, but I don't know who "paid" for that fiber in the first place. To keep this project running, I'm sure they are very good at counting beans. |
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
Any chance you can turn the splitters on at Seti Beta, please? They haven't been running for at least four days now. Claggy |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Any chance you can make something to allow us to at least report the already crunched WU? Only server communications errors for the last 4 days. |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.