Message boards :
Technical News :
Gnat Attack (Dec 04 2007)
Matt Lebofsky Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
Yesterday afternoon some of our servers choked on random NFS mounts again. This may have been due to me messing around with sshd, of all things. I doubt it, as the reasons why are totally mysterious, but the timing was fairly coincidental. Anyway, this simply meant kicking some NFS services and restarting informix on the science db servers. The secondary db on bambi actually got stuck during recovery and was restarted/fixed this morning before the outage. The outage itself was fairly uneventful. Question: Will doubling the WU size help? Unfortunately it's not that simple. It will have the immediate benefit of reducing the bandwidth/database load. But while the results are out in the field the workunits remain on disk, which means the workunits will be stuck on disk at least twice as long as the current average. As long as redundancy is set to two (see below) this is nearly a wash - slower computers will have a greater opportunity to dominate and keep more work on disk than before, at least that's been our experience. Long story short, doubling WU size does help, but not as much as you'd think, and it would be months before we saw any positive results. Question from previous thread: Why do we need two results to validate? Until BOINC employs some kind of "trustworthiness" score per host - and even then - we'll need two results per workunit for scientific validation. Checksumming plays no part. What we find at every frequency/chirp rate/sky position is as important as what we don't find, and there's no way to tell beforehand just by looking at the raw data. So every client has to go through every permutation of the above. Nefarious people (or CPU hiccups) can add signals, delete signals, or alter signals, and the only way to catch this is by chewing on the complete workunit twice. We could go down to accepting just one result, and statistically we might have well over 99% validity. But it's still not 100%. 
If one in every thousand results is messed up that would be a major headache when looking for repeating events. With two results, the odds are one in a million that two matched results would both be messed up, and far less likely messed up in the exact same way, so they won't be validated. Not sure if I stated this analogy elsewhere, but we who work on the SETI@home/BOINC project are like a basketball team. Down on the court, in the middle of the action, it's very hard to see everything going on. We're all experienced pros fighting through the immediate chaos of our surroundings, not always able to find the open teammate or catch the coach's signals. This shouldn't be seen as a poor reflection of our abilities - just the nature of the game. Up in the stands, observers see a bigger picture. It's no surprise the people in the crowd are sometimes confused or frustrated by the actions of the players when they have the illusion of "seeing it all." Key word: "illusion." Comments from the fans to the players (and vice versa) usually reflect this disparity in perspective, which is fine as long as both parties are aware of it. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
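Matt's quorum arithmetic can be sanity-checked in a few lines. This is a back-of-envelope sketch assuming independent per-result errors; the 1-in-1000 rate is the illustrative figure from the post, not a measured value:

```python
# Back-of-envelope check of the quorum math above, assuming each result
# has an independent chance p of being "messed up". The 1-in-1000 rate
# is the illustrative figure from the post, not a measured error rate.
p = 1.0 / 1000            # assumed per-result error rate

single = p                # one result per WU: bad results slip through
both = p * p              # quorum of two: both copies must be bad

print(f"single-result scheme: 1 in {round(1 / single):,}")   # 1 in 1,000
print(f"two-result quorum:    1 in {round(1 / both):,}")     # 1 in 1,000,000
```

And as the post notes, even that one-in-a-million case is pessimistic: the two corrupted copies must also be corrupted in exactly the same way to pass validation, which is far less likely still.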
Labbie Send message Joined: 19 Jun 06 Posts: 4083 Credit: 5,930,102 RAC: 0 |
Thanks for the update Matt. Are you going to go thru all the Magnum Opus subtitles this week? Calm Chaos Forum...Join Calm Chaos Now |
Brian Silvers Send message Joined: 11 Jun 99 Posts: 1681 Credit: 492,052 RAC: 0 |
As mentioned in the previous thread, project result deadlines are set by the individual project teams. I, and a few others, have suggested that your choice (SETI's choice) of deadline length is too long. It seems that we are both saying something, perhaps without "hearing" the other. If your choice of deadline length was to greatly reduce the incidence of participants getting upset about Earliest Deadline First (EDF), then perhaps you overshot the optimal choice of deadline length and are now reaping some of what you have sown. If your choice of deadline length is grounded in solid scientific analysis of what a reasonable deadline for a specific amount of work should be, then pardon my intrusion into this matter as I go back up to the stands and happily eat my popcorn... ;) |
Matt Lebofsky Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
It seems that we are both saying something, perhaps without "hearing" the other. True - one big part you're missing is that I had, and continue to have, nothing to do with creating/editing BOINC deadline and credit policies. So these words are completely lost on me as I have zero knowledge of the history. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
Dr. C.E.T.I. Send message Joined: 29 Feb 00 Posts: 16019 Credit: 794,685 RAC: 0 |
Message 688764 - well said Matt, well said . . . Thanks for the Post Sir . . . BOINC Wiki . . . Science Status Page . . . |
Neil Blaikie Send message Joined: 17 May 99 Posts: 143 Credit: 6,652,341 RAC: 0 |
As always you guys at Berkeley are working hard to keep us users happy and things at your end running as smoothly as is humanly possible. Personally, I let BOINC do the work it's got, 24/7; I don't use it to be critical of what might be good or bad for it. My goal is to be part of what it was set up to do in the first place, which is help find "those interesting, potentially life-changing signals". Yes, I could be critical of a few things, but personally if it stops working, it stops working until it is fixed. I think that analogy was very good Matt, it happens all over the place. I manage 9 x 3TB+ servers at work, which have recently been offline a bit as we make changes to filesystems etc etc. No matter how many messages were sent out or memos posted, you always get the "grandstand" crowd wanting to know why x isn't working or y has been acting up. |
Brian Silvers Send message Joined: 11 Jun 99 Posts: 1681 Credit: 492,052 RAC: 0 |
It seems that we are both saying something, perhaps without "hearing" the other. I think you (Matt) took my use of "you"/"your" to mean YOU (Matt) instead of the team as a whole. See Richard's reply to you for what he and I are asking... Thanks, Brian |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14653 Credit: 200,643,578 RAC: 874 |
It seems that we are both saying something, perhaps without "hearing" the other. Matt, I accept and appreciate that you have no role to play in creating the deadline policies (and let's not even talk about credit - no need to go there). But I do think that those deadline policies have a significant bearing on the project's database and storage needs, and hence the server capacities you do have to manage. And the deadline policies are set - at the SETI project level, not by BOINC - by your friend and colleague Eric Korpela. Perhaps you and he could have at least a chat about the subject, rather than dismissing all the users' suggestions out of hand? |
Ncrab Send message Joined: 1 Dec 07 Posts: 10 Credit: 57,389 RAC: 0 |
... With two results, the odds are one in a million that two matched results would both be messed up, and far less likely messed up in the exact same way, so they won't be validated. What are the odds if you perform: 1) 100% redundancy (the current approach, all phases recalculated): "far less than one in a million"; 2) 90% redundancy (only key or sampled processing recalculated): ???, but you save 10% of computing power; 3) 80% redundancy (...); 4) ... Maybe the answer is: testing. What odds are acceptable (one in a million)? What is the primary cause when we actually catch a failure in validation? This issue resembles "program validation methods", but here things are simpler because we have a straightforward way to do it: running twice. The same program run twice achieves the "odds", but it seems that isn't the only way, just the simplest way. And I'm not talking about checksumming. For instance, one slightly different approach (which still meets the current rules) is to send a WU for retesting only after the first result reaches the server. That way, the first result could be encoded and attached to the second machine's task, and errors, if they exist, could be detected earlier in the process. Well, I know you have a lot more information than we have, so these questions may be useless, but this issue is very interesting. If that's the case, thanks for your patience. |
Matt Lebofsky Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
Perhaps you and he could have at least a chat about the subject, rather than dismissing all the users' suggestions out of hand? Sorry if I seem dismissive. Of course these comments bubble their way to the right people eventually. We're forwarding messages to each other all the time. Just want to let you know I can only ingest so much, and my lack of response is sometimes wrongly seen as (a) indifference (b) concurrence or (c) disagreement when really I'm either too busy or you're talkin' to the wrong guy. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14653 Credit: 200,643,578 RAC: 874 |
Perhaps you and he could have at least a chat about the subject, rather than dismissing all the users' suggestions out of hand? That's OK: as Brian says - sometimes it's our fault for failing to distinguish between you the individual, and you the (collective) project team. |
Keith T. Send message Joined: 23 Aug 99 Posts: 962 Credit: 537,293 RAC: 9 |
5.10.x (5.8.17 for Mac) supports lost/ghost WU re-sends. My P233 MMX has had a pending WU since 24 October (sent 16 October); crunch time was 451347 seconds (125.37 hours or 5.22 days). The WU took just over 8 days to return because the host crunches other projects as well. AR is 0.387146, Claimed Credit 74.13. The crunching partner (wingman) appears to be using BOINC ver 4.25 according to his returned results. He has returned at least 5 results in the last few days, but has over 250 unreturned from the period 15 - 18 October. http://setiathome.berkeley.edu/results.php?hostid=844931. I think that it is highly likely that TMR had a problem with that host, and those 250+ WUs will get re-issued over the next few weeks. 250 * 375,300 bytes = 93,825,000 bytes, or roughly 89.5 MB - and that's just for one host's cache. I have seen reports of 1500 - 2000 WU caches in extreme cases. Enforcing a BOINC version may be unpopular with some, but it would help to rid the project of some of the Ghost/Zombie/Lost WUs. [edit] I just noticed that TMR has about 25 active hosts. They are running a wide variety of BOINC versions ranging from 4.25 to 5.10.7, most of them seem to be 4.xx versions. Sir Arthur C Clarke 1917-2008 |
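The storage arithmetic in the post above can be sketched quickly. A rough estimate using the 375,300-byte multibeam WU size quoted there (the cache counts are the reported figures, not exact):

```python
# Rough storage estimate for unreturned ("ghost") workunits, using the
# 375,300-byte WU size quoted in the post above. Cache counts are the
# reported figures from the thread, not measured values.
WU_BYTES = 375_300

one_host = 250 * WU_BYTES      # the wingman host discussed above
extreme = 2_000 * WU_BYTES     # an extreme reported cache size

print(f"250 WUs:  {one_host:,} bytes = {one_host / 2**20:.1f} MiB")
print(f"2000 WUs: {extreme:,} bytes = {extreme / 2**20:.1f} MiB")
```

So a single extreme cache can pin roughly three quarters of a gigabyte of workunit storage on the servers until the results come back or time out.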
DJStarfox Send message Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2 |
Thank you for answering those questions, Matt. Yes, that is a good analogy. The users can see things overall, and we can tell you what's happening from a different perspective. But we don't have the skills or training to understand why certain things are. Part of the game is, of course, watching to see what happens. Anyway, good luck with your plays. :) |
PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
Not sure you understood my suggestion: Again, reduce WUs to 1 send. When promising signals are detected/claimed, then send out one or more replications to check the original client's result. This has the effect of dividing your storage by two (more or less); granting credit immediately (sticky subject, but true in theory); and letting you churn through more seti-noise faster. One downside is missing any false negative signals (signal was there but not detected by the original client) - this would be a rare event given the quality of the clients today. A second downside is granting credit to cheats. I suppose one can filter for this too and issue redundant WUs based on that criterion. ASIDE: Also, note that as far as I am concerned nobody is getting personal about criticisms on this board. You sound rather defensive, which is understandable at times I'm sure. |
Brian Silvers Send message Joined: 11 Jun 99 Posts: 1681 Credit: 492,052 RAC: 0 |
What would a filter like that look like? How would you handle early enough detection of a "cheat" result so that you didn't generate spurious follow-ups treating a fraudulent result as something like, I dunno, a new "WOW signal"? How would you handle detecting "cheat" results that come in that do not meet the criteria for a second look? Based on the complexity of what I think the "cheat detector" would look like, the feasibility goes out the window pretty quickly, IMO, but perhaps there is a simpler approach to its design... |
Jesse Viviano Send message Joined: 27 Feb 00 Posts: 100 Credit: 3,949,583 RAC: 0 |
Trustworthiness scores will not help much. Occasionally, a computer will overheat and start spewing garbage results. This happens with my laptop when the ventilators, which exhaust to the left and the back, are blocked on the left. Your computer might be vacuuming up enough dust that it starts overheating if you use a desktop computer. Your computer's fan might fail. Some radiation flips a memory cell. Someone's optimized client might be buggy. Since there is no control over the compute nodes, we need redundancy to check the results. They will help in determining that a particular host needs attention and might pop up a dialog box stating that the computer needs examination, though. Some people are unaware that they need to open their computers to vacuum them out. |
PhonAcq Send message Joined: 14 Apr 01 Posts: 1656 Credit: 30,658,217 RAC: 1 |
The credit granting policy is pretty difficult to address here. But to respond to the above, it is within SETI's prerogative to remove credit unjustly granted. A claim to grossly excessive credit can be re-examined by issuing redundant units; the credit granted becomes the lowest value, like it is now. This works as long as the redundant units issued are a small fraction of the overall workload. What may be hard to detect is someone who diddles with the code and asks for a modest 10% more undeserved credit each time. A second difficult case is if a large team begins running diddled code, so that there is a high likelihood that redundant results become 'validated' by bogus code. Einstein works on a single-WU basis, but I think their stock project only uses closed code; they rely on their beta project to find programming improvements to roll out. (I'm not an expert, so I don't really know.) |
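The spot-check scheme described above - flag a grossly excessive claim, reissue a redundant copy, and grant the lowest matching value - could be sketched roughly like this. Everything here (function names, the 1.5x threshold, the history-based baseline) is an illustrative assumption, not actual SETI/BOINC server code:

```python
# Hypothetical sketch of the spot-check idea from the post above: flag
# credit claims grossly out of line with the host's recent history and
# queue a redundant copy of the workunit for re-validation. Names and
# thresholds are illustrative assumptions, not real SETI/BOINC code.
from statistics import mean

def needs_recheck(claimed, history, factor=1.5):
    """Flag a claim more than `factor` times the host's recent average."""
    if not history:
        return False          # no baseline yet; let it pass for now
    return claimed > factor * mean(history)

def grant(claims):
    """Grant the lowest matching claim, as the current validator does."""
    return min(claims)

# usage sketch
history = [52.0, 48.5, 50.1, 49.7]        # host's recent claims
print(needs_recheck(51.0, history))       # ordinary claim -> False
print(needs_recheck(95.0, history))       # suspicious -> True, reissue WU
print(grant([51.0, 49.2]))                # lowest claim granted -> 49.2
```

As Brian notes below, the hard part isn't this arithmetic but the edge cases: a slow 10% inflation stays under any reasonable threshold, and colluding hosts can validate each other's bogus claims.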
Brian Silvers Send message Joined: 11 Jun 99 Posts: 1681 Credit: 492,052 RAC: 0 |
The Einstein application is closed-source, but they have the exact same Initial Replication and Minimum Quorum (IR=2, MQ=2) as here. What you are likely thinking about is the credit claims of the participants do not matter, as credit granted is set at a fixed amount on the server. Recently, this has been slightly troublesome as instead of having results of the same credit being around the same runtime, they have a very wide range of runtimes, +/- 20% in some cases. Oh, and BTW, don't get me started about the way they've handled "beta" lately. I like Bernd and think he's doing a good job in finding and fixing things as they go, but the past two applications / data runs (S5R2 and S5R3) have both essentially been userbase-wide public betas with a small set of users being quasi-alpha testers, somewhere between alpha and beta... |
Matt Lebofsky Send message Joined: 1 Mar 99 Posts: 1444 Credit: 957,058 RAC: 0 |
Not sure you understood my suggestion: Again, reduce wu to 1 send. When promising signals are detected/claimed, then send out one or more replications to check the original client's result. Ah... here's the catch - the definition of "promising." No single result is promising - it has to be matched with all the signals from previous results to determine if there is any interesting persistency. As of right now this process happens every 3-4 years. Redundancy/validation is much faster. This is where the Near Time Persistency Checker comes in (still in development - news to come on that at some point), which will continually be checking persistency on incoming results. But the key here is "near time" - i.e. not right away. It may be a couple weeks behind on checking your result, maybe less, maybe more - so no significant gain, if any, over the current 2/2 scheme. ASIDE: Also, note that as far as I am concerned nobody is getting personal about criticisms on this board. You sound rather defensive, which is understandable at times I'm sure. I didn't think any of this was personal, nor was I feeling particularly defensive. Once again the magic of the internet seems to artificially amplify everybody's jerk factor by 500%. Tone of voice/body language is far more important than people think. Given my dealings with the music industry (and music press) I cannot possibly be as offended by what anybody says around here. If anybody thinks I'm being curt or dismissive - this is probably because I don't have the time to ingest everything people are suggesting/asking. - Matt -- BOINC/SETI@home network/web/science/development person -- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.