Creep (Nov 17 2009)
Author | Message |
---|---|
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
Okay so mork (the mysql database server) crashed again on Friday, and Jeff/Eric took care of getting that all back on line without much ado. Okay, yes, this is a crisis now, but we're not sure what the problem is, nor do we have any immediate solution (since we don't have another 24 processor system with 64GB of memory hanging around). Each time this happens jocelyn (the replica server) gets out of sync and is rendered useless until we can recover it during the next Tuesday weekly outage (which we're just getting out of now, and the jocelyn recovery is taking place as I type). So it's slightly frustrating that jocelyn, a powerful server in its own right, is twiddling its thumbs a lot of the time these days waiting to be resynced. Sigh. We're also still hitting one snag or another trying to remove the corruption in the astropulse signal table. We'll fix it eventually - it's just a matter of shuffling around rather large tables containing millions of rows, etc. I tried doing an OS upgrade on our web server this afternoon, but this had to be abandoned as the root RAID device was showing up half degraded during the install for no apparent reason - and when I'd bail on the install and restart the old OS the root RAID would look just fine. Weird. Wow. Rereading these tech news items they always sound so negative. Okay then here's some good news: Eric and Jeff have been making great leaps in various parts of the scientific analysis back end, i.e. in the NTPCkr and first levels of interference rejection. I'm hoping there's more specific news to report on those fronts in the near future. And there was recent mention of SETI@home perhaps suffering from "feature/scope creep." I actually completely agree with this concern, but this is a common, general problem with academic (i.e. non-professional) endeavors. The lack of resources is usually the main cause, then catalysed by the lack of hard deadlines and financial risk. 
That said, I think we do a pretty amazing job, given what we have, keeping the whole engine running while making slow but nevertheless non-zero progress on the final data products. The glacial speeds sometimes drive me crazy, but I usually solve that by involving myself in other professional/commercial jobs on the side that have harder defined goals and immediate rewards. I would like to see SETI@home "take a break" to devote all our efforts towards the science part for a while, but I admit there are both pros and cons to going this route. I'm currently outvoted on this front, so we stick with the status quo. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
Fred W Joined: 13 Jun 99 Posts: 2524 Credit: 11,954,210 RAC: 0 |
Sorry to derail your fantasy, Matt, but "feature/scope creep" is NOT the prerogative of academia. It is one of the most difficult elements to eliminate from any project - particularly in large organisations and bureaucracies. F. [edit]And thanks as ever for taking the time to share what's going on on the "inside". [/edit] |
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
Oh I know - I didn't mean to make it sound so black/white. I'm firmly aware of many commercial ventures that never get off the ground due to what you said - that's why I stick to "pro jobs" that are small and digestible. [quote]Sorry to derail your fantasy, Matt, but "feature/scope creep" is NOT the prerogative of academia. It is one of the most difficult elements to eliminate from any project - particularly in large organisations and bureaucracies.[/quote] - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
John McLeod VII Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0 |
[quote]Sorry to derail your fantasy, Matt, but "feature/scope creep" is NOT the prerogative of academia. It is one of the most difficult elements to eliminate from any project - particularly in large organisations and bureaucracies.[/quote] Features must be added to stay competitive, but it is almost impossible to remove a feature - even one that is known to be used by nearly no one. BOINC WIKI |
Gary Charpentier Joined: 25 Dec 00 Posts: 31012 Credit: 53,134,872 RAC: 32 |
[quote]Sorry to derail your fantasy, Matt, but "feature/scope creep" is NOT the prerogative of academia. It is one of the most difficult elements to eliminate from any project - particularly in large organisations and bureaucracies.[/quote] Ah yes, the last 95 of 100 features added. Thanks for the updates Matt. |
Sebastian M. Bobrecki Joined: 7 Feb 02 Posts: 23 Credit: 38,375,443 RAC: 0 |
I have some thoughts for consideration. Maybe you should move the primary database back to jocelyn, which is more stable, and use mork for CPU-intensive tasks like the software radar blanker. I know that mork has a lot of RAM that would be wasted on such tasks, but maybe that RAM would fit in another server, perhaps jocelyn? As always, thanks for the update, Matt. |
supervlb Joined: 2 Aug 08 Posts: 12 Credit: 1,096,752 RAC: 0 |
Or maybe play memory stick round robin every Tuesday and see if you can get the crash to follow one particular chip (into the waste can) ... |
PhonAcq Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
One of Matt's great attributes as the in-the-bunker communicator, a sort of embedded journalist in this war on (ET) noise, is his willingness to be candid and truthful. So I was not surprised that he admits to the scope/creep issue, which I infer is greater than would be ideal. I think the primary driver for this project management problem is the tendency of participants to succumb to their zeal; thereby they take on more than they can handle, etc. In SETI this translates to processing too many WUs per day. So I wonder aloud (what's the word for wondering on an internet message board where no acoustics exist?) whether SETI would be better off just reducing the number of WUs issued per day for the project as a whole, backing off to a point where the system fails less frequently (assuming one exists). In tangible terms, would it really hurt anyone if 10% fewer WUs were issued per day? A related question is why not turn off the multibeam sub-project (if it can be called a subproject). Not overnight, but reduce the WU availability in deference to AP. I think that was the long-term plan, but why not accelerate things and whack the scope/creep problem in this fashion? |
Borgholio Joined: 2 Aug 99 Posts: 654 Credit: 18,623,738 RAC: 45 |
I know you said that the rest of the Seti team voted you down, but as an end user, I would not have any problems with Seti@home going offline for a couple months while you redo the plumbing and take care of any lingering problems / science that needs to be done. I have a dozen other BOINC projects to keep my processors warm while SETI does a long overdue spring cleaning. You will be assimilated...bunghole! |
LiliKrist Joined: 12 Aug 09 Posts: 333 Credit: 143,167 RAC: 0 |
Just wondering... [quote]I think the primary driver for this project management problem is the tendency of participants to succumb to their zeal; thereby they take on more than they can handle, etc.[/quote] I have heard about this problem too: clients bombard SETI with work requests when their downloads fail, and WUs pile up as well. Then I looked at the 'BOINC Manager Preferences', where the additional work buffer allows a maximum of 10 days. Why not change it? For example, make it 5 days max, so that crunchers cannot pile up too many WUs. N = R x fp x ne x fl x fi x fc x L |
Gary Charpentier Joined: 25 Dec 00 Posts: 31012 Credit: 53,134,872 RAC: 32 |
[quote]I know you said that the rest of the Seti team voted you down, but as an end user, I would not have any problems with Seti@home going offline for a couple months while you redo the plumbing and take care of any lingering problems / science that needs to be done. I have a dozen other BOINC projects to keep my processors warm while SETI does a long overdue spring cleaning.[/quote] Something tells me that a couple of months off will just increase the mission creep. But perhaps making the last Tuesday of the month outage an all-day outage will give them enough time to clear up all the server issues that build up. With no pressure to get back online quickly, there is time to get the RAIDs happy, move cables around, reduce interdependencies in file mounts, and update the O/S and database programs, plus all manner of other things, all of which can be done in single-user mode to speed them up. I also know getting NTPCkr going full blast and getting the radar blanking done are important. I also know there has to be cash in hand to pay the salary of the staff programmers for that. I assume once there is funding, time to do the work isn't the issue. |
Tom95134 Joined: 27 Nov 01 Posts: 216 Credit: 3,790,200 RAC: 0 |
The only way to stop mission creep is for someone with a little authority to jump up on the table and say, "NO!, NO!, HELL NO!" to all these "little" expansions. Mission creep is very common to commercial projects where the project manager is too willing to let marketing have their way and keep expanding the feature list. It is the sure way to kill a project even after millions of dollars have been invested because you never really deliver a stable product. Just say NO! |
gizbar Joined: 7 Jan 01 Posts: 586 Credit: 21,087,774 RAC: 0 |
I think we're gonna have to send Mork back to Orson! Not sure how easy it is to diagnose a fault on a big server like that, I only have my 2 home machines to look after. (And half a dozen friends' machines when they get a problem, lol!) I'm presuming that the same rules apply though... I'm also presuming that all the obvious stuff has also been checked too, and that no more can be done with the machine needing to be available and online. Hope you get it sorted, Matt. regards, Gizbar. A proud GPU User Server Donor! |
Paul_Tergeist Joined: 17 Sep 04 Posts: 8 Credit: 11,760,089 RAC: 0 |
Would a single-day outage even permit a better diagnosis of Mork and Ptolemy's crashes? Matt said Jeff had issues with a debug kernel tested on his desktop, and as far as I can gather those haven't been resolved or are low-priority. I'm guessing by the move to try a debug kernel that one of the suspects (and possibly the nastiest) is the O/S interface with the hardware. If that is indeed the culprit, Mork at least may be fine and dandy hardware-wise. I haven't been keeping up with Ptolemy, so I'll reserve comment there. Diagnosing the issue may take several crashes to compare multiple dumps. At this point, there's no information on the cause to try to replicate the crash. Without being able to use a debug kernel to get thorough information on errors before and leading up to a crash, the options for diagnosis are limited. Matt - have you had the opportunity to run a memory test on Mork? I think that was one of the first things done before it was brought online (back when it first received a questionable reputation). I have no clue how long a full series of tests takes on 64 GB using one 2.13 GHz core, so perhaps that may not be convenient. Matt, my hat goes off to you for your sys-admin work. These computers are unique beasts. As always, thanks for keeping us all in the loop. <<Photo: ISS & Shuttle Discovery on STS-116 parting ways>> The SETI@home distributed cluster is listening. |
kittyman Joined: 9 Jul 00 Posts: 51478 Credit: 1,018,363,574 RAC: 1,004 |
Some things can only be truly tested under 'full working load'...... And I suspect that is true of the tangled Seti server configuration. My rigs are all nastily OC'd...and although they may boot at speeds unknown to some......once Seti crunching commences..... If the settings are not all exactly correct, mayhem ensues. I would posit that the same is true of the Seti servers. Lighten the load, and they might do just fine. But that is not really the point here, is it? You will not find the faults unless you bring things to the breaking point. "Time is simply the mechanism that keeps everything from happening all at once." |
DJStarfox Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2 |
I'm afraid the only way to solve this mystery is a complete shutdown of Mork and a run of 64-bit memtest. After that, run a stability test (or some other such test) on all cores, under the debug kernel. Some options: CPUBurn http://users.bigpond.net.au/CPUburn/ OverclockIX (liveCD) http://iso.linuxquestions.org/download/396/389/http/overclockix.octeams.com/Overclockix_3.8.iso StressCPU http://oldwww.gromacs.org/component/option,com_docman/task,doc_details/gid,80/Itemid,26/ Once you've verified that CPU/memory/chipset are stable, then we can look at disk controllers, drivers, etc. I expect these tests will take 24 hours of downtime (if run back-to-back). Also, is ECC memory enabled in the BIOS? If so, "correct errors" should be enabled. It's worth checking next time you reboot. |
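On the ECC question raised in the last post: one way to see whether the kernel is reporting memory errors at all, without taking the machine down, is to read the EDAC counters that Linux exposes under sysfs. The sketch below is only an illustration. It assumes a Linux host with an EDAC driver loaded for the chipset, which the thread does not confirm for mork, and the function name `read_edac_counts` is made up for this example.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from the Linux EDAC sysfs tree.

    An empty dict means no memory controller is registered with EDAC, which
    usually means the EDAC driver is not loaded or ECC reporting is not active.
    """
    counts = {}
    for mc in sorted(Path(edac_root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("No EDAC memory controllers found; ECC reporting may be off.")
    for name, (ce, ue) in counts.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

A corrected-error count that keeps climbing on one controller would point towards a particular bank of memory, which fits the "swap the sticks around" idea earlier in the thread. An empty result only says the kernel is not reporting ECC events; it does not prove how the BIOS is set.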
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.