Creep (Nov 17 2009)


Message boards : Technical News : Creep (Nov 17 2009)

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 947917 - Posted: 17 Nov 2009, 22:48:26 UTC

Okay so mork (the mysql database server) crashed again on Friday, and Jeff/Eric took care of getting that all back on line without much ado. Okay, yes, this is a crisis now, but we're not sure what the problem is, nor do we have any immediate solution (since we don't have another 24 processor system with 64GB of memory hanging around). Each time this happens jocelyn (the replica server) gets out of sync and is rendered useless until we can recover it during the next Tuesday weekly outage (which we're just getting out of now, and the jocelyn recovery is taking place as I type). So it's slightly frustrating that jocelyn, a powerful server in its own right, is twiddling its thumbs a lot of the time these days waiting to be resynced. Sigh.

We're also still hitting one snag or another trying to remove the corruption in the astropulse signal table. We'll fix it eventually - it's just a matter of shuffling around rather large tables containing millions of rows, etc.

I tried doing an OS upgrade on our web server this afternoon, but this had to be abandoned as the root RAID device was showing up half degraded during the install for no apparent reason - and when I'd bail on the install and restart the old OS the root RAID would look just fine. Weird.

Wow. Rereading these tech news items they always sound so negative. Okay then here's some good news: Eric and Jeff have been making great leaps in various parts of the scientific analysis back end, i.e. in the NTPCkr and first levels of interference rejection. I'm hoping there's more specific news to report on those fronts in the near future.

And there was recent mention of SETI@home perhaps suffering from "feature/scope creep." I actually completely agree with this concern, but this is a common, general problem with academic (i.e. non-professional) endeavors. The lack of resources is usually the main cause, then catalysed by the lack of hard deadlines and financial risk. That said, I think we do a pretty amazing job, given what we have, keeping the whole engine running while making slow but nevertheless non-zero progress on the final data products. The glacial speeds sometimes drive me crazy, but I usually solve that by involving myself in other professional/commercial jobs on the side that have more firmly defined goals and immediate rewards. I would like to see SETI@home "take a break" to devote all our efforts towards the science part for a while, but I admit there are both pros and cons to going this route. I'm currently outvoted on this front, so we stick with the status quo.

- Matt

____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

Fred W
Volunteer tester
Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 947924 - Posted: 17 Nov 2009, 23:03:44 UTC
Last modified: 17 Nov 2009, 23:04:55 UTC

Sorry to derail your fantasy, Matt, but "feature/scope creep" is NOT the prerogative of academia. It is one of the most difficult elements to eliminate from any project - particularly in large organisations and bureaucracies.

F.

[edit]And thanks as ever for taking the time to share what's going on on the "inside". [/edit]
____________

Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Mar 99
Posts: 1389
Credit: 74,079
RAC: 0
United States
Message 947931 - Posted: 17 Nov 2009, 23:28:55 UTC - in response to Message 947924.

Oh I know - I didn't mean to make it sound so black/white. I'm firmly aware of many commercial ventures that never get off the ground due to what you said - that's why I stick to "pro jobs" that are small and digestible.

[quote]Sorry to derail your fantasy, Matt, but "feature/scope creep" is NOT the prerogative of academia. It is one of the most difficult elements to eliminate from any project - particularly in large organisations and bureaucracies.[/quote]


- Matt
____________
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude

John McLeod VII
Volunteer developer
Volunteer tester
Joined: 15 Jul 99
Posts: 24102
Credit: 518,020
RAC: 150
United States
Message 947969 - Posted: 18 Nov 2009, 3:19:36 UTC - in response to Message 947924.

[quote]Sorry to derail your fantasy, Matt, but "feature/scope creep" is NOT the prerogative of academia. It is one of the most difficult elements to eliminate from any project - particularly in large organisations and bureaucracies.

F.

[edit]And thanks as ever for taking the time to share what's going on on the "inside". [/edit][/quote]

Features must be added to stay competitive, but it is almost impossible to remove a feature - even one that is known to be used by nearly no one.
____________


BOINC WIKI

Gary Charpentier
Volunteer tester
Joined: 25 Dec 00
Posts: 12144
Credit: 6,424,138
RAC: 8,100
United States
Message 947999 - Posted: 18 Nov 2009, 6:15:37 UTC - in response to Message 947969.

[quote][quote]Sorry to derail your fantasy, Matt, but "feature/scope creep" is NOT the prerogative of academia. It is one of the most difficult elements to eliminate from any project - particularly in large organisations and bureaucracies.

F.

[edit]And thanks as ever for taking the time to share what's going on on the "inside". [/edit][/quote]

Features must be added to stay competitive, but it is almost impossible to remove a feature - even one that is known to be used by nearly no one.[/quote]

Ah yes, the last 95 of 100 features added.

Thanks for the updates Matt.

____________

Sebastian M. Bobrecki
Joined: 7 Feb 02
Posts: 13
Credit: 16,018,626
RAC: 0
Poland
Message 948028 - Posted: 18 Nov 2009, 10:57:32 UTC

I have some thoughts for consideration.

Maybe you should move the primary database back to jocelyn, which is more stable, and use mork for some CPU-intensive tasks like the software radar blanker. I know that mork has a lot of RAM that would be wasted on such tasks, but it might fit in another server, maybe jocelyn?

As always, thanks for the update Matt.
____________

supervlb
Joined: 2 Aug 08
Posts: 12
Credit: 1,046,468
RAC: 0
United States
Message 948036 - Posted: 18 Nov 2009, 13:29:45 UTC - in response to Message 948028.

Or maybe play memory stick round robin every Tuesday and see if you can get the crash to follow one particular chip (into the waste can) ...

PhonAcq
Joined: 14 Apr 01
Posts: 1622
Credit: 21,971,983
RAC: 3,924
United States
Message 948056 - Posted: 18 Nov 2009, 15:47:55 UTC

One of Matt's great attributes as the in-the-bunker communicator, a sort of embedded journalist in this war on (ET) noise, is his willingness to be candid and truthful. So I was not surprised that he admits to the scope creep issue, which I infer is greater than would be ideal.

I think the primary driver for this project management problem is the tendency of participants to succumb to their zeal; thereby they take on more than they can handle, etc.

In seti this translates to processing too many wu's per day. So I wonder aloud (what's the word for wondering on an internet message board where no acoustics exist?) whether seti would be better off just reducing the number of wu's issued per day for the project as a whole, backing off to a point where the system fails less frequently (assuming one exists). In tangible terms, would it really hurt anyone if 10% fewer wu's were issued per day?

A related question: why not turn off the multibeam sub-project (if it can be called a sub-project)? Not overnight, but by reducing the wu availability in deference to AP. I think that was the long-term plan, but why not accelerate things and whack the scope creep problem in this fashion?

Borgholio
Joined: 2 Aug 99
Posts: 651
Credit: 11,874,679
RAC: 4,106
United States
Message 948129 - Posted: 18 Nov 2009, 22:56:07 UTC

I know you said that the rest of the Seti team voted you down, but as an end user, I would not have any problems with Seti@home going offline for a couple months while you redo the plumbing and take care of any lingering problems / science that needs to be done. I have a dozen other BOINC projects to keep my processors warm while SETI does a long overdue spring cleaning.
____________


You will be assimilated...bunghole!

LiliKrist
Volunteer tester
Joined: 12 Aug 09
Posts: 333
Credit: 143,167
RAC: 0
Indonesia
Message 948161 - Posted: 19 Nov 2009, 1:23:43 UTC - in response to Message 948056.

Just wondering...

[quote]I think the primary driver for this project management problem is the tendency of participants to succumb to their zeal; thereby they take on more than they can handle, etc.[/quote]


I have heard about this problem too: clients bombard SETI with work requests when their downloads fail, and work units pile up as well.

And when I look at the 'BOINC Manager Preferences', the additional work buffer has a maximum of 10 days. Why not change it? For example, make it 5 days max, so that crunchers cannot pile up too many work units.
____________


N = R x fp x ne x fl x fi x fc x L

Gary Charpentier
Volunteer tester
Joined: 25 Dec 00
Posts: 12144
Credit: 6,424,138
RAC: 8,100
United States
Message 948202 - Posted: 19 Nov 2009, 6:36:53 UTC - in response to Message 948129.

[quote]I know you said that the rest of the Seti team voted you down, but as an end user, I would not have any problems with Seti@home going offline for a couple months while you redo the plumbing and take care of any lingering problems / science that needs to be done. I have a dozen other BOINC projects to keep my processors warm while SETI does a long overdue spring cleaning.[/quote]

Something tells me that a couple of months off will just increase the mission creep. But perhaps making the last Tuesday outage of each month an all-day outage would give them enough time to clear up all the server issues that build up. With no pressure to get back online quickly, there is time to get the RAIDs happy, move cables around, reduce interdependencies in file mounts, and update the O/S and database programs, plus all manner of other things, all of which can be done in single-user mode to speed them up.

I also know getting Ntpckr going full blast and getting the radar blanking are important. I also know there has to be cash in hand to pay the salary of the staff programmers for that. I assume once there is funding, time to do the work isn't the issue.
____________

Tom95134
Joined: 27 Nov 01
Posts: 213
Credit: 3,305,322
RAC: 1,007
United States
Message 948274 - Posted: 19 Nov 2009, 16:38:43 UTC

The only way to stop mission creep is for someone with a little authority to jump up on the table and say, "NO!, NO!, HELL NO!" to all these "little" expansions.

Mission creep is very common to commercial projects where the project manager is too willing to let marketing have their way and keep expanding the feature list. It is the sure way to kill a project even after millions of dollars have been invested because you never really deliver a stable product.

Just say NO!
____________

gizbar
Joined: 7 Jan 01
Posts: 586
Credit: 21,087,774
RAC: 0
United Kingdom
Message 948284 - Posted: 19 Nov 2009, 17:17:28 UTC
Last modified: 19 Nov 2009, 17:18:45 UTC

I think we're gonna have to send Mork back to Orson! Not sure how easy it is to diagnose a fault on a big server like that, I only have my 2 home machines to look after. (And half a dozen friends' machines when they get a problem, lol!)

I'm presuming that the same rules apply though...

I'm also presuming that all the obvious stuff has also been checked too, and that no more can be done with the machine needing to be available and online.

Hope you get it sorted, Matt.

regards, Gizbar.
____________


A proud GPU User Server Donor!

Paul_Tergeist
Joined: 17 Sep 04
Posts: 8
Credit: 11,760,089
RAC: 0
United States
Message 948482 - Posted: 20 Nov 2009, 8:38:53 UTC

Would a single-day outage even permit a better diagnosis of Mork and Ptolemy's crashes? Matt said Jeff had issues with a debug kernel tested on his desktop, and as far as I can gather those haven't been resolved or are low-priority.

I'm guessing by the move to try a debug kernel that one of the suspects (and possibly nastiest) is the O/S interface with the hardware. If that is indeed the culprit, Mork at least may be fine and dandy hardware-wise. I haven't been keeping up with Ptolemy, so I'll reserve comment there. Diagnosing the issue may take several crashes to compare multiple dumps. At this point, there's no information on the cause to try to replicate the crash. Without being able to use a debug kernel to get thorough information on errors before and leading up to a crash the options for diagnosis are limited.

Matt - have you had the opportunity to run a memory test on Mork? I think that was one of the first things done before it was brought online (back when it first received a questionable reputation). I have no clue how long a full series of tests takes on 64 GB using one 2.13 GHz core, so perhaps that may not be convenient.

Matt, my hat goes off to you for your sys-admin work. These computers are unique beasts. As always, thanks for keeping us all in the loop.
____________
<<Photo: ISS & Shuttle Discovery on STS-116 parting ways>>

The SETI@home distributed cluster is listening.

msattler
Volunteer tester
Joined: 9 Jul 00
Posts: 38327
Credit: 560,796,351
RAC: 655,906
United States
Message 948483 - Posted: 20 Nov 2009, 8:44:35 UTC

Some things can only be truly tested under 'full working load'......

And I suspect that is true of the tangled Seti server configuration.

My rigs are all nastily OC'd...and although they may boot at speed unknown to some......once Seti crunching commences.....

If the settings are not all exactly correct, mayhem ensues.

I would posit that the same is true of the Seti servers.
Lighten the load, and they might do just fine.

But that is not really the point here, is it?
You will not find the faults unless you bring things to the breaking point.
____________
*********************************************
Embrace your inner kitty...ya know ya wanna!

I have met a few friends in my life.
Most were cats.

DJStarfox
Joined: 23 May 01
Posts: 1040
Credit: 540,292
RAC: 561
United States
Message 948587 - Posted: 20 Nov 2009, 19:05:22 UTC - in response to Message 948482.

I'm afraid the only way to solve this mystery is a complete shutdown of Mork and a run of 64-bit memtest.

After that, run a stability test (or some other such test) on all cores, under the debug kernel. Some options:
CPUBurn http://users.bigpond.net.au/CPUburn/
OverclockIX (liveCD) http://iso.linuxquestions.org/download/396/389/http/overclockix.octeams.com/Overclockix_3.8.iso
StressCPU http://oldwww.gromacs.org/component/option,com_docman/task,doc_details/gid,80/Itemid,26/

Once you've verified that CPU/memory/chipset are stable, then we can look at disk controllers, drivers, etc. I expect these tests will take 24 hours of downtime (if run back-to-back).

Also, is ECC memory enabled in the BIOS? It should be set to "correct errors". It's worth checking next time you reboot.
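
For anyone wondering how a downtime figure like the 24 hours above might be sanity-checked, here is a rough back-of-envelope sketch. It is not a measurement of Mork: the only number taken from this thread is the 64 GB of RAM (treated as GiB here). The function name `memtest_hours`, the pass count, the tests per pass, and the sustained throughput are all hypothetical placeholders, and the model simply assumes each test writes and then reads all of memory once at a constant rate.

```python
def memtest_hours(ram_gib, passes, tests_per_pass, gib_per_second):
    """Estimate wall-clock hours for a sequential, single-core memory test.

    Assumes every test in every pass writes all of RAM once and reads it
    back once, at a constant sustained rate.
    """
    if gib_per_second <= 0:
        raise ValueError("gib_per_second must be positive")
    gib_touched = ram_gib * passes * tests_per_pass * 2  # one write + one read
    return gib_touched / gib_per_second / 3600.0


# 64 GiB comes from the thread; the other three values are placeholders
# chosen only to show how the estimate scales, not measured figures.
estimate = memtest_hours(ram_gib=64, passes=2, tests_per_pass=10, gib_per_second=1.0)
print(f"Estimated test time: {estimate:.2f} hours")
```

The estimate scales linearly with RAM size, passes, and tests per pass, and inversely with throughput, so plugging in a throughput actually observed on the machine is what would make the number meaningful.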


Copyright © 2014 University of California