| Author | Message |
Matt Lebofsky (Volunteer moderator, Project administrator, Project developer, Project scientist)
Joined: 1 Mar 99 Posts: 1379 Credit: 74,079 RAC: 0

|
|
Okay so mork (the mysql database server) crashed again on Friday, and Jeff/Eric took care of getting that all back on line without much ado. Okay, yes, this is a crisis now, but we're not sure what the problem is, nor do we have any immediate solution (since we don't have another 24 processor system with 64GB of memory hanging around). Each time this happens jocelyn (the replica server) gets out of sync and is rendered useless until we can recover it during the next Tuesday weekly outage (which we're just getting out of now, and the jocelyn recovery is taking place as I type). So it's slightly frustrating that jocelyn, a powerful server in its own right, is twiddling its thumbs a lot of the time these days waiting to be resynced. Sigh.
We're also still hitting one snag or another trying to remove the corruption in the astropulse signal table. We'll fix it eventually - it's just a matter of shuffling around rather large tables containing millions of rows, etc.
I tried doing an OS upgrade on our web server this afternoon, but this had to be abandoned as the root RAID device was showing up half degraded during the install for no apparent reason - and when I'd bail on the install and restart the old OS the root RAID would look just fine. Weird.
Wow. Rereading these tech news items they always sound so negative. Okay then here's some good news: Eric and Jeff have been making great leaps in various parts of the scientific analysis back end, i.e. in the NTPCkr and first levels of interference rejection. I'm hoping there's more specific news to report on those fronts in the near future.
And there was recent mention of SETI@home perhaps suffering from "feature/scope creep." I actually completely agree with this concern, but this is a common, general problem with academic (i.e. non-professional) endeavors. The lack of resources is usually the main cause, then catalysed by the lack of hard deadlines and financial risk. That said, I think we do a pretty amazing job, given what we have, keeping the whole engine running while making slow but nevertheless non-zero progress on the final data products. The glacial speeds sometimes drive me crazy, but I usually solve that by involving myself in other professional/commercial jobs on the side that have harder defined goals and immediate rewards. I would like to see SETI@home "take a break" to devote all our efforts towards the science part for a while, but I admit there are both pros and cons to going this route. I'm currently outvoted on this front, so we stick with the status quo.
- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
|
|
|
|
|
Sorry to derail your fantasy, Matt, but "feature/scope creep" is NOT the prerogative of academia. It is one of the most difficult elements to eliminate from any project - particularly in large organisations and bureaucracies.
F.
[edit]And thanks as ever for taking the time to share what's going on on the "inside". [/edit]
____________
|
|
|
Matt Lebofsky

|
|
[quote]Sorry to derail your fantasy, Matt, but "feature/scope creep" is NOT the prerogative of academia. It is one of the most difficult elements to eliminate from any project - particularly in large organisations and bureaucracies.[/quote]
Oh I know - I didn't mean to make it sound so black/white. I'm firmly aware of many commercial ventures that never get off the ground due to what you said - that's why I stick to "pro jobs" that are small and digestible.
- Matt
|
|
|
|
[quote]Sorry to derail your fantasy, Matt, but "feature/scope creep" is NOT the prerogative of academia. It is one of the most difficult elements to eliminate from any project - particularly in large organisations and bureaucracies.
F.
[edit]And thanks as ever for taking the time to share what's going on on the "inside". [/edit][/quote]
Features must be added to stay competitive, but it is almost impossible to remove a feature - even one that is known to be used by nearly no one.
____________
BOINC WIKI |
|
|
|
|
[quote][quote]Sorry to derail your fantasy, Matt, but "feature/scope creep" is NOT the prerogative of academia. It is one of the most difficult elements to eliminate from any project - particularly in large organisations and bureaucracies.
F.
[edit]And thanks as ever for taking the time to share what's going on on the "inside". [/edit][/quote]
Features must be added to stay competitive, but it is almost impossible to remove a feature - even one that is known to be used by nearly no one.[/quote]
Ah yes, the last 95 of 100 features added.
Thanks for the updates Matt.
____________
|
|
|
|
|
|
I have some thoughts for consideration.
Maybe you should move the primary database back to jocelyn, which is more stable, and use mork for CPU-intensive tasks like the software radar blanker. I know mork has a lot of RAM that would be wasted on such tasks, but that RAM might fit in another server, maybe jocelyn?
As always, thanks for the update Matt.
____________
|
|
|
|
|
|
Or maybe play memory stick round robin every Tuesday and see if you can get the crash to follow one particular chip (into the waste can) ... |
|
|
|
|
|
One of Matt's great attributes as the in-the-bunker communicator, a sort of embedded journalist in this war on (ET) noise, is his willingness to be candid and truthful. So I was not surprised that he admits to the scope/creep issue, which I infer is greater than would be ideal.
I think the primary driver for this project management problem is the tendency of participants to succumb to their zeal; thereby they take on more than they can handle, etc.
In seti this translates to processing too many wu's per day. So I wonder aloud (what's the word for wondering on an internet message board where no acoustics exist?) whether seti would be better off just reducing the number of wu's issued per day for the project as a whole, backing off to a point where the system fails less frequently (assuming one exists). In tangible terms, would it really hurt anyone if 10% fewer wu's were issued per day?
A related question: why not turn off the multibeam sub-project (if it can be called a subproject)? Not overnight, but reduce the wu availability in deference to AP. I think that was the long-term plan, but why not accelerate things and whack the scope/creep problem in this fashion? |
|
|
|
|
|
I know you said that the rest of the Seti team voted you down, but as an end user, I would not have any problems with Seti@home going offline for a couple months while you redo the plumbing and take care of any lingering problems / science that needs to be done. I have a dozen other BOINC projects to keep my processors warm while SETI does a long overdue spring cleaning.
____________
You will be assimilated...bunghole!
|
|
|
|
|
|
Just wondering...
[quote]I think the primary driver for this project management problem is the tendency of participants to succumb to their zeal; thereby they take on more than they can handle, etc.[/quote]
I have heard about this problem too: clients bombard SETI with work requests when their downloads fail, and they pile up wu's as well.
Then I looked at 'BOINC Manager Preferences': the additional work buffer allows a maximum of 10 days, so why not change it? For example, with a maximum of 5 days, crunchers would not be able to pile up so many wu's.
____________
N = R x fp x ne x fl x fi x fc x L |
|
|
|
|
[quote]I know you said that the rest of the Seti team voted you down, but as an end user, I would not have any problems with Seti@home going offline for a couple months while you redo the plumbing and take care of any lingering problems / science that needs to be done. I have a dozen other BOINC projects to keep my processors warm while SETI does a long overdue spring cleaning.[/quote]
Something tells me that a couple of months off will just increase the mission creep. But perhaps making the last Tuesday of the month outage an all day outage will give them enough time to clear up all the server issues that build up. With no pressure to get back online quick there is time to get RAIDS happy, move cables around, reduce interdependencies in file mounts, time to update O/S and database programs and all manner of other things all of which can be done single user to speed them up.
I also know getting Ntpckr going full blast and getting the radar blanking are important. I also know there has to be cash in hand to pay the salary of the staff programmers for that. I assume once there is funding, time to do the work isn't the issue.
____________
|
|
|
|
|
|
The only way to stop mission creep is for someone with a little authority to jump up on the table and say, "NO!, NO!, HELL NO!" to all these "little" expansions.
Mission creep is very common to commercial projects where the project manager is too willing to let marketing have their way and keep expanding the feature list. It is the sure way to kill a project even after millions of dollars have been invested because you never really deliver a stable product.
Just say NO!
____________
|
|
|
|
|
|
I think we're gonna have to send Mork back to Orson! Not sure how easy it is to diagnose a fault on a big server like that, I only have my 2 home machines to look after. (And half a dozen friends' machines when they get a problem, lol!)
I'm presuming that the same rules apply though...
I'm also presuming that all the obvious stuff has also been checked too, and that no more can be done with the machine needing to be available and online.
Hope you get it sorted, Matt.
regards, Gizbar.
____________
A proud GPU User Server Donor! |
|
|
|
|
|
Would a single-day outage even permit a better diagnosis of Mork and Ptolemy's crashes? Matt said Jeff had issues with a debug kernel tested on his desktop, and as far as I can gather those haven't been resolved or are low-priority.
I'm guessing by the move to try a debug kernel that one of the suspects (and possibly nastiest) is the O/S interface with the hardware. If that is indeed the culprit, Mork at least may be fine and dandy hardware-wise. I haven't been keeping up with Ptolemy, so I'll reserve comment there. Diagnosing the issue may take several crashes to compare multiple dumps. At this point, there's no information on the cause to try to replicate the crash. Without being able to use a debug kernel to get thorough information on errors before and leading up to a crash the options for diagnosis are limited.
Matt - have you had the opportunity to run a memory test on Mork? I think that was one of the first things done before it was brought online (back when it first received a questionable reputation). I have no clue how long a full series of tests takes on 64 GB using one 2.13 GHz core, so perhaps that may not be convenient.
Matt, my hat goes off to you for your sys-admin work. These computers are unique beasts. As always, thanks for keeping us all in the loop.
____________
<<Photo: ISS & Shuttle Discovery on STS-116 parting ways>>
The SETI@home distributed cluster is listening. |
|
|
|
|
|
Some things can only be truly tested under 'full working load'......
And I suspect that is true of the tangled Seti server configuration.
My rigs are all nastily OC'd...and although they may boot at speed unknown to some......once Seti crunching commences.....
If the settings are not all exactly correct, mayhem ensues.
I would posit that the same is true of the Seti servers.
Lighten the load, and they might do just fine.
But that is not really the point here, is it?
You will not find the faults unless you bring things to the breaking point.
____________
******
"Ask not, what your kitty can do for you. Ask what you can do for your kitty."
As it is kitten, so shall it be done.
|
|
|
|
|
|
I'm afraid the only way to solve this mystery is a complete shutdown of Mork and a run of the 64-bit memtest.
After that, a stability test (or some other such test) on all cores, under the debug kernel. Some options:
CPUBurn http://users.bigpond.net.au/CPUburn/
OverclockIX (liveCD) http://iso.linuxquestions.org/download/396/389/http/overclockix.octeams.com/Overclockix_3.8.iso
StressCPU http://oldwww.gromacs.org/component/option,com_docman/task,doc_details/gid,80/Itemid,26/
Once you've verified that CPU/memory/chipset are stable, then we can look at disk controllers, drivers, etc. I expect these tests will take 24 hours of downtime (if run back-to-back).
Also, is ECC memory enabled in the BIOS? It should be set to "correct errors". It's worth checking next time you reboot. |
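If ECC does turn out to be enabled, the error counters can possibly be read without any downtime. Here is a rough sketch, assuming Mork runs Linux and that an EDAC driver for its chipset is loaded (I don't know whether either is true). In that case the kernel exposes corrected and uncorrected error counts per memory controller under /sys/devices/system/edac/mc. The function name read_edac_counts is just something I made up for illustration.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from the Linux EDAC sysfs tree.

    Assumption: an EDAC driver is loaded. If the directory is missing,
    an empty dict is returned instead of raising.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts  # no EDAC driver loaded, or not a Linux system
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())  # corrected errors
            ue = int((mc / "ue_count").read_text().strip())  # uncorrected errors
        except (OSError, ValueError):
            continue  # skip controllers whose counters are unreadable
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

A corrected count that keeps climbing on one controller between crashes would point at a specific bank of memory, which is cheaper to find out than 24 hours of memtest. Counts of zero prove nothing, though, so the offline tests would still be needed.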
|
|