Testing the Plumbing (Nov 12 2009)
Author | Message |
---|---|
Matt Lebofsky Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
Turns out the replica recovery was much faster than expected on Tuesday, so I was able to get that online before the day was out. Then we had the day off yesterday, and now today. Let's see. Seems like I've been lost in testing land today. First, we finally decided on a method to fix the corruption in our Astropulse signal table. It's just one row that needs to be deleted, but we can't just delete it using SQL - we have to dump the entire database fragment (containing 25% of all the AP signals) and reload it without the one bad row. I wrote a program to test the data flowing in and out of this plumbing to make sure all the funny blob columns remain intact during the procedure. Bob also sleuthed out that this particular corruption actually happened months ago, not during this last RAID hiccup. Fine. Second, I'm also working on a suite of more robust tests/etc. for the software radar blanked results, now that we're getting lots of them. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
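For readers curious what a dump-and-reload check of this kind might look like, here is a minimal sketch. It is not Matt's program: it uses an in-memory SQLite database as a stand-in for the real science database, and the table name `ap_signal`, the columns `id` and `data`, and every function name are hypothetical. The idea is simply to hash each blob before the dump, reload everything except the one bad row, and confirm the surviving blobs are byte-identical.

```python
import hashlib
import sqlite3


def blob_digests(conn, table):
    """Map each row id to a SHA-256 digest of its blob column."""
    rows = conn.execute(f"SELECT id, data FROM {table}")
    return {row_id: hashlib.sha256(data).hexdigest() for row_id, data in rows}


def dump_and_reload(src, dst, table, bad_id):
    """Copy every row except bad_id from src into a fresh table in dst."""
    dst.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, data BLOB)")
    rows = src.execute(f"SELECT id, data FROM {table} WHERE id != ?", (bad_id,))
    dst.executemany(f"INSERT INTO {table} (id, data) VALUES (?, ?)", rows)
    dst.commit()


def verify_round_trip(src, dst, table, bad_id):
    """Return True if all surviving blobs are byte-identical after reload."""
    before = blob_digests(src, table)
    before.pop(bad_id, None)  # the bad row is expected to be missing
    return before == blob_digests(dst, table)


def demo():
    """Build a toy table (hypothetical schema), drop row 2, and verify."""
    src = sqlite3.connect(":memory:")
    dst = sqlite3.connect(":memory:")
    src.execute("CREATE TABLE ap_signal (id INTEGER PRIMARY KEY, data BLOB)")
    src.executemany(
        "INSERT INTO ap_signal (id, data) VALUES (?, ?)",
        [(1, b"\x00\x01\xff"), (2, b"corrupt"), (3, bytes(range(256)))],
    )
    dump_and_reload(src, dst, "ap_signal", bad_id=2)
    ok = verify_round_trip(src, dst, "ap_signal", bad_id=2)
    return ok, sorted(blob_digests(dst, "ap_signal"))
```

Comparing digests rather than raw blobs keeps the check cheap to store and log, which matters when the fragment holds a quarter of all the signals; a real tool would also need to handle NULL blobs and the production database's own dump format.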
LiliKrist Joined: 12 Aug 09 Posts: 333 Credit: 143,167 RAC: 0 |
Thank you for the update news Master Matt =) N = R x fp x ne x fl x fi x fc x L |
BMgoau Joined: 8 Jan 07 Posts: 29 Credit: 1,562,200 RAC: 0 |
I don't really comment here very often, but I have been following the technical news for years. I've noticed in the last two or so, with increasing regularity, that SETI@home seems to be suffering from "creep". Be it management creep, scope creep or mission creep. Take your pick: http://en.wikipedia.org/wiki/Creep_%28project_management%29 http://en.wikipedia.org/wiki/Mission_creep http://en.wikipedia.org/wiki/Scope_creep It seems like the project's only goal is to keep the data flowing. To perpetuate itself without end. The goal is still clear: find life, but I think the project has become bogged down in the methodology. You're a legend Matt, your tireless work and effort are wonderful, and I appreciate you always keeping us up to date. It's not something you have to do, but I'm sure we all very much appreciate it. As you mentioned, you have been working on radar blanking and ntpckr, and achieved some wonderful results. But I imagine how much more might be achieved if the project (the @home part) was shut down for say 12 months so all these continual bugs and things like radar blanking could be worked out without the overhanging need to keep the data pipeline flowing. You have suggested this once before, and I think it would be a great idea. The project could be consolidated, redefined and smoothed out into something more effective and manageable so more time can be spent on analysing the data, rather than just ensuring it flows. Just my 2c :) |
John McLeod VII Joined: 15 Jul 99 Posts: 24806 Credit: 790,712 RAC: 0 |
I don't really comment here very often, but I have been following the technical news for years. We have had this discussion before. The bugs that show up are mostly because of the high load on the servers. If you remove the stress by not handing out work, the bugs disappear as well. BOINC WIKI |
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
Just my 2c :) As JM7 pointed out, if you stop distributing data, the load goes down and all of the "bugs" you mention vanish and the servers run smoothly at zero load. More to the point, they would no longer have an environment where they can test the problems caused by such high loading. But that isn't the problem, and taking a year off won't solve the problem. The problem is: too many users think that the problems are show-stoppers -- that they're issues that must be fixed right now! or the project is doomed. IMO, the project doesn't need "time off" to fix problems, what they need is a bit of better hardware (new stuff that isn't hand-me-down) and a little more staff. Unfortunately, two cents won't buy that. I think the idea that this is caused by "creep" is simply incorrect -- the goals are the same, but the budget doesn't allow things to be done as quickly as anyone would like. |
kittyman Joined: 9 Jul 00 Posts: 51478 Credit: 1,018,363,574 RAC: 1,004 |
Shutting the project down and doing a restart months later would trash the project's user base and its reputation. It might take them years to recover. Much better that Matt, Eric, and the others fight the good fight and keep us all crunching the data whilst they work out the details. And Matt, please do keep an eye on those 'funny blob columns' for us, eh? LOL. "Time is simply the mechanism that keeps everything from happening all at once." |
PhonAcq Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
I think BMgoau isn't entirely wrong, yet some of us do treat this project as though it were something more than it is. One mistaken assumption most of us have is that this project is uniquely about finding ET. That most certainly is not what it is today, although it is how SETI was started. Today, the project has two themes with almost equal purpose: finding ET and developing BOINC. It isn't hard to imagine that having two masters it may be difficult to satisfy either, despite the apparent synergies between them. Reading this thread, it appears that if data distribution were stopped, the servers would hum along nicely doing nothing and if we go full tilt, which I assume is the case today, then we have intermittent problems. Ok, then wouldn't it make sense to cut back a bit to understand the point at which the stress introduces problems, fix them, and then increase the distribution rate and fix the next layer of problems that show up, and so on? If the current budget and technology prevent solving the problem in that manner, is it efficient to run the project at a higher level? Isn't that kind of like watering your yard during a haboob? Are we wasting resources by running at too high a (distribution) level? Today this question is probably rhetorical, of course. A professional development project would also embrace a road map methodology, which would detail how to get to certain performance goals, rather than a fire drill one. In the BOINC case, if it were to be treated as 'professional', qualities like reliability would have measurable metrics and be used to structure the problem solving and development activities. Another would be the number of active hosts; I for one would like to see a goal of hosting 2x our current number of 300K active hosts within reasonable and published reliability criteria.
As a counter example, the budget should not be used as an excuse, but be used as another boundary condition; few budgets are boundless, so whining about SETI's is not productive. Another project management tool to use to demonstrate progress and achievement is the breaking of a complex project into subprojects. SETI seems to be an endless stream of calculation and filling a table that is getting more and more unwieldy. I wish they could package SETI in this way, so we can say we are doing more than that; it might also be a way to increase the active user base more rapidly. The big variable in all this is probably the source of data to analyze. This fact may indeed limit the SETI side of the equation, at least, or alternatively be used to justify an alternative source of data to analyze. Ok, I'm rambling. [/ramblin] |
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
Reading this thread, it appears that if data distribution were stopped, the servers would hum along nicely doing nothing and if we go full tilt, which I assume is the case today, then we have intermittent problems. Ok, then wouldn't it make sense to cut back a bit to understand the point at which the stress introduces problems, fix them, and then increase the distribution rate and fix the next layer of problems that show up, and so on? If the current budget and technology prevent solving the problem in that manner, is it efficient to run the project at a higher level? Isn't that kind of like watering your yard during a haboob? Are we wasting resources by running at too high a (distribution) level? Today this question is probably rhetorical, of course. Reducing the number of hosts may allow work to upload and download more smoothly, but the goal is not to run smoothly, it is to search for signals in recorded data from the telescope. To do so on a small budget, they are pushing very high loading compared to typical e-commerce standards. This is all set out in the whitepapers at boinc.berkeley.edu. I'm sure SETI@Home would like the budget to increase staff, get more bandwidth, buy faster servers, etc. But you're both missing the most important question: how far can they really push it, and what can they do to better utilize a small, finite resource? Matt gives us a valuable view into the challenges of running at high sustained levels, and some of us take it as a sign of impending doom instead of charting new territory. |
Dr.Argentum Joined: 24 Nov 99 Posts: 6 Credit: 6,977,431 RAC: 0 |
I don't really comment here very often, but I have been following the technical news for years. I'm much the same, but I want to say that I am impressed with the work that Matt & crew do to keep data flowing, including to us in the forum. I am surprised that there have been very few major interruptions. This is a research project in more areas than one. On the other hand, I now work in a government building and we get interrupted about every quarter. I have often wondered why the staff has not planned to take the project offline for two to three days to overhaul/reset the system. There have been a few times, particularly in the last year, when I was fully expecting this to happen. That such maintenance is made to fit into the weekly outages is also impressive. (I'm a chemical / metallurgical engineer and plant shutdowns are common yearly occurrences.) But then, the start-up after a three-day shutdown would swamp the servers for a week after... |
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
I have often wondered why the staff has not planned to take the project offline for two to three days to overhaul/reset the system. Probably because that isn't good science. To actually diagnose the problem, you want to isolate each variable, and test each one. You want to make one change at a time. It's slow, but when you make one change, and the logjam lets go, you know what you changed to do that. If you "shotgun" the fix (change everything all at once) and it fixes it, you run the risk of the problem coming back someday, and you're in the same spot you were in before. Besides, a major overhaul can probably be divided into smaller tasks, and taken on one at a time without causing a multi-day outage. |
Bill Walker Joined: 4 Sep 99 Posts: 3868 Credit: 2,697,267 RAC: 0 |
I think Ned and John have made some good points. As I understand it, there are 2 goals here: finding ET, and pushing the limits of distributed computing. Us out here in the peanut gallery expect a good data flow, but that is really only a by-product of the 2 main goals. The analogy of the annual plant shut down is a good one only if your only goal is to maintain a high average production. A better analogy here might be found in developmental hardware testing: push it, break it, fix it, push harder till it breaks again, repeat as long as the money lasts. It reminds me of a quote from one of my first bosses in the testing business - "if you don't break something once in awhile you are not trying hard enough". Having said all that, many thanks to Matt and crew for all they do. |
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
It reminds me of a quote from one of my first bosses in the testing business - "if you don't break something once in awhile you are not trying hard enough". You used to work for Red Green? |
zour Joined: 7 Jun 08 Posts: 10 Credit: 369,794 RAC: 0 |
Without having a clue about the technical aspects of SETI, I have some rather simple questions. My last workunits are from April and now March 2007. They have already been processed two years ago, am I right? Why do this again? Just to pretend that SETI is still intact? What about the signals from the last few weeks, when fresh WUs ran out? Were they recorded and will they be available, or has the dish not been working since then? If so, when will the dish work again? Any guesses or estimations are welcome. Sorry for being so impatient! |
Bill Walker Joined: 4 Sep 99 Posts: 3868 Credit: 2,697,267 RAC: 0 |
Zour, I think this has been covered before on the Number Crunching forum and here. There has been no new data for a few months, but Matt and friends have implemented new radar blanking software that permits the crunching of old data that either wasn't sent out before or bombed immediately when sent out, because of the radar interference in the data. The radar blanking software pre-processes the data, so us normal people can crunch it in a useful way. Last I heard, fresh data should start around the end of November. |
Bill Walker Joined: 4 Sep 99 Posts: 3868 Credit: 2,697,267 RAC: 0 |
It reminds me of a quote from one of my first bosses in the testing business - "if you don't break something once in awhile you are not trying hard enough". Now that you mention it, he did look a little like Mr. Green. And I looked a little like Harold at the time. Red Green actually just recycled a lot of old Canadian sayings, I think all his standards were in use long before the show started. I can only think of one that was truly original, "You're only young once, but you can always be immature". |
zoom3+1=4 Joined: 30 Nov 03 Posts: 66359 Credit: 55,293,173 RAC: 49 |
It reminds me of a quote from one of my first bosses in the testing business - "if you don't break something once in awhile you are not trying hard enough". So that's why that show sounds like a friend of Mine and yep, He's Canadian, He works nearby and has an American wife and 2 kids(1 His with Her and 1 that I think they adopted or wanted to adopt). Savoir-Faire is everywhere! The T1 Trust, T1 Class 4-4-4-4 #5550, America's First HST |
Bill Walker Joined: 4 Sep 99 Posts: 3868 Credit: 2,697,267 RAC: 0 |
Well ya see, we all talk like that, eh? |
zoom3+1=4 Joined: 30 Nov 03 Posts: 66359 Credit: 55,293,173 RAC: 49 |
Well ya see, we all talk like that, eh? Well It's not like Yer hard to understand or that Yer speaking a foreign tongue, Different would be an Aussie, At least accent wise, But still understandable beyond the slang terms they use, Which I think Canadians and Americans as a whole don't use(using their slang that is). :D Savoir-Faire is everywhere! The T1 Trust, T1 Class 4-4-4-4 #5550, America's First HST |
1mp0£173 Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
They have already been processed two years ago, am I right? It's best not to assume that recordings are processed in any particular order. It's more likely now, but at one time work was split from tapes, based on whatever happened to be near the front of the shelf. Most of the time, that was new work, and old work was on tapes at the back of the shelf. SETI@Home said they would not reprocess work just to keep us "busy" -- there are other worthy projects if they have a long dry-spell. |
zour Joined: 7 Jun 08 Posts: 10 Credit: 369,794 RAC: 0 |
Can you recommend another project you find worth supporting as well? I have no idea where to start. |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.