Gnat Attack (Dec 04 2007)

Message boards : Technical News : Gnat Attack (Dec 04 2007)
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Avatar

Send message
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 688764 - Posted: 4 Dec 2007, 22:15:22 UTC

Yesterday afternoon some of our servers choked on random NFS mounts again. This may have been due to me messing around with sshd, of all things. I doubt it, as the reasons why are totally mysterious, but the timing was fairly coincidental. Anyway, this simply meant kicking some NFS services and restarting Informix on the science db servers. The secondary db on bambi actually got stuck during recovery and was restarted/fixed this morning before the outage. The outage itself was fairly uneventful.

Question: Will doubling the WU size help?

Unfortunately it's not that simple. It would have the immediate benefit of reducing the bandwidth/database load. But while the results are out in the field the workunits remain on disk, which means the workunits would be stuck on disk at least twice as long as the current average. As long as redundancy is set to two (see below) this isn't a wash: slower computers will have a greater opportunity to dominate and keep more work on disk than before, at least that's been our experience. Long story short, doubling WU size does help, but not as much as you'd think, and it would be months before we saw any positive results.
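The storage arithmetic behind this can be sketched with Little's law (items in flight = arrival rate × residency time). All the rates and times below are hypothetical, chosen only to illustrate the trade-off; the 375,300-byte workunit size is the figure quoted later in this thread.

```python
# Back-of-envelope model of the doubled-workunit trade-off.
# All rates and times here are made up, for illustration only.

WU_BYTES = 375_300  # approximate size of one current workunit

def in_flight(issue_rate_per_hour, turnaround_hours):
    """Little's law: workunits in the field = issue rate x average turnaround."""
    return issue_rate_per_hour * turnaround_hours

# Current scheme: many small workunits.
count_now = in_flight(10_000, 48)
bytes_now = count_now * WU_BYTES

# Doubled workunits: half as many issued per hour (same total data rate),
# but each takes at least twice as long to crunch and return.
count_2x = in_flight(5_000, 96)
bytes_2x = count_2x * 2 * WU_BYTES

assert count_2x == count_now        # rows in flight in the database: unchanged
assert bytes_2x == 2 * bytes_now    # bytes parked on disk: doubled

# The real win is transactional: half as many results per hour to
# schedule, receive, and validate for the same amount of science.
```

Under these toy numbers, the in-flight workunit count is unchanged while the bytes on disk double, which matches the "stuck on disk at least twice the current average" point above.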

Question from previous thread: Why do we need two results to validate?

Even if BOINC someday employs some kind of "trustworthiness" score per host, we'll still need two results per workunit for scientific validation. Checksumming plays no part. What we find at every frequency/chirp rate/sky position is as important as what we don't find, and there's no way to tell beforehand just by looking at the raw data. So every client has to go through every permutation of the above. Nefarious people (or CPU hiccups) can add signals, delete signals, or alter signals, and the only way to catch this is by chewing on the complete workunit twice. We could go down to accepting just one result, and statistically we might have well over 99% validity. But it's still not 100%. If one in every thousand results is messed up, that would be a major headache when looking for repeating events. With two results, the odds are one in a million that two matched results would both be messed up, and it's far less likely they'd be messed up in the exact same way, so they would fail validation.
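The two-result scheme can be illustrated with a toy validator. This is a sketch of the general idea only, not SETI@home's actual validation code; the signal tuples and the tolerance value are invented:

```python
# Toy quorum-of-two validator. Each result is a list of
# (frequency, chirp_rate, power) tuples reported by one host.

def signals_match(a, b, tol=1e-3):
    """Two signals match if every field agrees within a relative tolerance."""
    return all(abs(x - y) <= tol * max(abs(x), abs(y), 1.0) for x, y in zip(a, b))

def validate(result_a, result_b, tol=1e-3):
    """A workunit validates only if the two results agree signal-for-signal."""
    if len(result_a) != len(result_b):
        return False  # an added or deleted signal breaks the quorum
    return all(signals_match(sa, sb, tol) for sa, sb in zip(result_a, result_b))

good   = [(1420.10, -0.52, 3.1), (1420.90, 0.07, 2.4)]
forged = [(1420.10, -0.52, 9.9), (1420.90, 0.07, 2.4)]  # one altered power

print(validate(good, list(good)))  # True  - independent agreement
print(validate(good, forged))      # False - tampering or a CPU hiccup is caught
```

A bad host is caught this way only because its partner computed the whole workunit independently; a checksum of the downloaded data would say nothing about the computation itself.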

Not sure if I stated this analogy elsewhere, but we who work on the SETI@home/BOINC project are like a basketball team. Down on the court, in the middle of the action, it's very hard to see everything going on. We're all experienced pros fighting through the immediate chaos of our surroundings, not always able to find the open teammate or catch the coach's signals. This shouldn't be seen as a poor reflection of our abilities - just the nature of the game. Up in the stands, observers see a bigger picture. It's no surprise the people in the crowd are sometimes confused or frustrated by the actions of the players when they have the illusion of "seeing it all." Key word: "illusion." Comments from the fans to the players (and vice versa) usually reflect this disparity in perspective, which is fine as long as both parties are aware of it.

- Matt

-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 688764 · Report as offensive
Profile Labbie
Avatar

Send message
Joined: 19 Jun 06
Posts: 4083
Credit: 5,930,102
RAC: 0
United States
Message 688771 - Posted: 4 Dec 2007, 22:50:23 UTC

Thanks for the update Matt.

Are you going to go through all the Magnum Opus subtitles this week?


Calm Chaos Forum...Join Calm Chaos Now
ID: 688771 · Report as offensive
Brian Silvers

Send message
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 688773 - Posted: 4 Dec 2007, 22:58:48 UTC - in response to Message 688764.  


It's no surprise the people in the crowd are sometimes confused or frustrated by the actions of the players when they have the illusion of "seeing it all." Key word: "illusion." Comments from the fans to the players (and vice versa) usually reflect this disparity in perspective, which is fine as long as both parties are aware of it.


As mentioned in the previous thread, project result deadlines are set by the individual project teams. I, and a few others, have suggested that your choice (SETI's choice) of deadline length is too long. It seems that we are both saying something, perhaps without "hearing" the other.

If your choice of deadline length was to greatly reduce the incidence of participants getting upset about Earliest Deadline First (EDF), then perhaps you overshot the optimal choice of deadline length and are now reaping some of what you have sown.

If your choice of deadline length is grounded in solid scientific analysis of what a reasonable deadline for a specific amount of work should be, then pardon my intrusion into this matter as I go back up to the stands and happily eat my popcorn... ;)
ID: 688773 · Report as offensive
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Avatar

Send message
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 688776 - Posted: 4 Dec 2007, 23:12:41 UTC - in response to Message 688773.  

It seems that we are both saying something, perhaps without "hearing" the other.


True - one big part you're missing is that I had, and continue to have, nothing to do with creating/editing BOINC deadline and credit policies. So these words are completely lost on me, as I have zero knowledge of the history.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 688776 · Report as offensive
Profile Dr. C.E.T.I.
Avatar

Send message
Joined: 29 Feb 00
Posts: 16019
Credit: 794,685
RAC: 0
United States
Message 688796 - Posted: 4 Dec 2007, 23:56:15 UTC


Message 688764 -


well said Matt, well said . . . Thanks for the Post Sir . . .


BOINC Wiki . . .

Science Status Page . . .
ID: 688796 · Report as offensive
Profile Neil Blaikie
Volunteer tester
Avatar

Send message
Joined: 17 May 99
Posts: 143
Credit: 6,652,341
RAC: 0
Canada
Message 688798 - Posted: 4 Dec 2007, 23:59:28 UTC

As always, you guys at Berkeley are working hard to keep us users happy and things at your end running as smoothly as is humanly possible.

Personally, I let BOINC do the work it's been given 24/7; I'm not using it to be critical of what might be good or bad for it. My goal is to be part of what it was set up to do in the first place, which is help find "those interesting, potentially life-changing signals".

Yes, I could be critical of a few things, but personally, if it stops working, it stops working until it is fixed. I think that analogy was very good, Matt; it happens all over the place. I manage 9 x 3TB+ servers at work, which have recently been offline a bit as we make changes to filesystems etc. No matter how many messages are sent out or memos posted, you always get the "grandstand" crowd wanting to know why x isn't working or y has been acting up.

ID: 688798 · Report as offensive
Brian Silvers

Send message
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 688800 - Posted: 5 Dec 2007, 0:03:27 UTC - in response to Message 688776.  
Last modified: 5 Dec 2007, 0:24:45 UTC

It seems that we are both saying something, perhaps without "hearing" the other.


True - one big part you're missing is that I had, and continue to have, nothing to do with creating/editing BOINC deadline and credit policies. So these words are completely lost on me, as I have zero knowledge of the history.

- Matt


I think you (Matt) took my use of "you"/"your" to mean YOU (Matt) instead of the team as a whole.

See Richard's reply to you for what he and I are asking...

Thanks,

Brian
ID: 688800 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 688802 - Posted: 5 Dec 2007, 0:08:07 UTC - in response to Message 688776.  

It seems that we are both saying something, perhaps without "hearing" the other.

True - one big part you're missing is that I had, and continue to have, nothing to do with creating/editing BOINC deadline and credit policies. So these words are completely lost on me, as I have zero knowledge of the history.

- Matt

Matt,

I accept and appreciate that you have no role to play in creating the deadline policies (and let's not even talk about credit - no need to go there).

But I do think that those deadline policies have a significant bearing on the project's database and storage needs, and hence the server capacities you do have to manage. And the deadline policies are set - at the SETI project level, not by BOINC - by your friend and colleague Eric Korpela. Perhaps you and he could have at least a chat about the subject, rather than dismissing all the users' suggestions out of hand?
ID: 688802 · Report as offensive
Ncrab

Send message
Joined: 1 Dec 07
Posts: 10
Credit: 57,389
RAC: 0
Brazil
Message 688806 - Posted: 5 Dec 2007, 0:17:23 UTC - in response to Message 688764.  

... With two results, the odds are one in a million that two matched results would both be messed up, and far less likely messed up in the exact same way, so they won't be validated.


What are the odds if you perform:

1) 100% redundancy (the current approach, all phases recalculated):
"far less than one in a million"

2) 90% redundancy (only key or sampled processes recalculated):
???, and you save 10% of the computing power

3) 80% redundancy (...)
...

4) ...

Maybe the answer is: testing.

What odds are acceptable (one in a million)?

What is the primary cause when we actually catch a failure in validation?

This issue resembles "program validation methods", but here things are simpler because we have an obvious way to do it: running twice. Running the same program twice achieves the required odds, but it doesn't seem to be the only way, just the simplest way. And I'm not talking about checksumming.

For instance, one slightly different approach (which still meets the current rules) is to send a WU for retesting only after the first result reaches the server. That way, the first result could be encoded and attached for the second machine, and errors, if they exist, could be detected earlier in the process.

Well, I know you have a lot more information than we do, so these questions may be useless, but this issue is very interesting. In any case, thanks for your patience.
ID: 688806 · Report as offensive
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Avatar

Send message
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 688815 - Posted: 5 Dec 2007, 0:42:01 UTC - in response to Message 688802.  

Perhaps you and he could have at least a chat about the subject, rather than dismissing all the users' suggestions out of hand?


Sorry if I seem dismissive. Of course these comments bubble their way to the right people eventually. We're forwarding messages to each other all the time. Just want to let you know I can only ingest so much, and my lack of response is sometimes wrongly seen as (a) indifference (b) concurrence or (c) disagreement when really I'm either too busy or you're talkin' to the wrong guy.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 688815 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 688817 - Posted: 5 Dec 2007, 0:50:45 UTC - in response to Message 688815.  

Perhaps you and he could have at least a chat about the subject, rather than dismissing all the users' suggestions out of hand?


Sorry if I seem dismissive. Of course these comments bubble their way to the right people eventually. We're forwarding messages to each other all the time. Just want to let you know I can only ingest so much, and my lack of response is sometimes wrongly seen as (a) indifference (b) concurrence or (c) disagreement when really I'm either too busy or you're talkin' to the wrong guy.

- Matt

That's OK: as Brian says - sometimes it's our fault for failing to distinguish between you the individual, and you the (collective) project team.
ID: 688817 · Report as offensive
Profile Keith T.
Volunteer tester
Avatar

Send message
Joined: 23 Aug 99
Posts: 962
Credit: 537,293
RAC: 9
United Kingdom
Message 688821 - Posted: 5 Dec 2007, 1:04:02 UTC
Last modified: 5 Dec 2007, 1:22:34 UTC

5.10.x (5.8.17 for Mac) supports lost/ghost WU re-sends.

My P233 MMX has had a pending WU since 24 October (sent 16 October); crunch time was 451,347 seconds (125.37 hours, or 5.22 days). The WU took just over 8 days to return because the host crunches other projects as well. AR was 0.387146, claimed credit 74.13.

The crunching partner (wingman) appears to be using BOINC ver 4.25 according to his returned results.

He has returned at least 5 results in the last few days, but has over 250 unreturned from the period 15 - 18 October: http://setiathome.berkeley.edu/results.php?hostid=844931

I think it is highly likely that TMR had a problem with that host, and those 250+ WUs will get re-issued over the next few weeks.

250 × 375,300 bytes = 93,825,000 bytes, or roughly 89.5 MiB - and that's just for one host's cache. I have seen reports of 1,500 - 2,000 WU caches in extreme cases.

Enforcing a BOINC version may be unpopular with some, but it would help to rid the project of some of the ghost/zombie/lost WUs.

[edit]
I just noticed that TMR has about 25 active hosts. They are running a wide variety of BOINC versions ranging from 4.25 to 5.10.7, most of them seem to be 4.xx versions.
Sir Arthur C Clarke 1917-2008
ID: 688821 · Report as offensive
DJStarfox

Send message
Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 688835 - Posted: 5 Dec 2007, 1:47:38 UTC - in response to Message 688764.  

Thank you for answering those questions, Matt. Yes, that is a good analogy. The users can see things overall, and we can tell you what's happening from a different perspective. But we don't have the skills or training to understand why certain things are. Part of the game is, of course, watching to see what happens. Anyway, good luck with your plays. :)
ID: 688835 · Report as offensive
PhonAcq

Send message
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 688843 - Posted: 5 Dec 2007, 2:25:13 UTC - in response to Message 688764.  


Question from previous thread: Why do we need two results to validate?

Until BOINC employs some kind of "trustworthiness" score per host, and even then, we'll need two results per workunit for scientific validation. Checksumming plays no part. What we find at every frequency/chirp rate/sky position is as important as what we don't find. And there's no way to tell beforehand just looking at the raw data. So every client has to go through every permutation of the above. Nefarious people (or CPU hiccups) can add signals, delete signals, or alter signals and the only way to catch this is by chewing on the complete workunit twice. We could go down to accepting just one result, and statistically we might have well over 99% validity. But it's still not 100%. If one in every thousand results is messed up that would be a major headache when looking for repeating events. With two results, the odds are one in a million that two matched results would both be messed up, and far less likely messed up in the exact same way, so they won't be validated.



Not sure you understood my suggestion: again, reduce WUs to one send. When promising signals are detected/claimed, then send out one or more replications to check the original client's result.

This has the effect of dividing your storage by two (more or less); granting credit immediately (sticky subject, but true in theory); and lets you churn through more seti-noise faster.

Downside is missing any false negative signals (signal was there but not detected by original client). This would be a rare event given the quality of the clients today.

A second downside is the granting of credit to cheats. I suppose one could filter for this too, and issue redundant WUs based on that criterion.
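The proposed single-send policy might look something like this as server-side logic. This is purely hypothetical: none of these names exist in BOINC, and the power threshold is invented.

```python
from dataclasses import dataclass

@dataclass
class Signal:
    frequency: float
    power: float

@dataclass
class Result:
    workunit_id: int
    host_id: int
    signals: list

replication_queue = []  # workunits re-issued for a confirming second pass
accepted_single = []    # results trusted on a single host's word

def handle_result(result, threshold=2.5):
    """Single-send policy: replicate only results claiming strong signals."""
    if any(s.power >= threshold for s in result.signals):
        replication_queue.append(result.workunit_id)  # confirm before believing
    else:
        accepted_single.append(result.workunit_id)    # the false-negative risk

handle_result(Result(workunit_id=1, host_id=42, signals=[Signal(1420.4, 3.7)]))
handle_result(Result(workunit_id=2, host_id=43, signals=[Signal(1419.8, 0.9)]))
print(replication_queue, accepted_single)  # [1] [2]
```

Everything that lands on the `else` branch is exactly the false-negative exposure described as the first downside.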

ASIDE: Also note that, as far as I am concerned, nobody is getting personal about criticisms on this board. You sound rather defensive, which is understandable at times, I'm sure.
ID: 688843 · Report as offensive
Brian Silvers

Send message
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 688844 - Posted: 5 Dec 2007, 2:37:38 UTC - in response to Message 688843.  


A second downside is the granting of credit to cheats. I suppose one can filter for this as well and issue redundant wu's based on this criteria as well.


What would a filter like that look like? How would you handle early enough detection of a "cheat" result so that you didn't generate spurious results thinking that they need to validate the fraudulent result as being something like, I dunno, a new "WOW signal"? How would you handle detecting "cheat" results that come in that do not meet the criteria of a second look?

Based on the complexity of what I think the "cheat detector" would look like, the feasibility goes out the window pretty quickly, IMO, but perhaps there is a simpler approach to its design...
ID: 688844 · Report as offensive
Jesse Viviano

Send message
Joined: 27 Feb 00
Posts: 100
Credit: 3,949,583
RAC: 0
United States
Message 688888 - Posted: 5 Dec 2007, 7:24:35 UTC

Trustworthiness scores will not help much. Occasionally a computer will overheat and start spewing garbage results. This happens with my laptop when the vents, which exhaust to the left and the back, are blocked on the left. If you use a desktop computer, it might vacuum up enough dust that it starts overheating. Your computer's fan might fail. Some radiation flips a memory cell. Someone's optimized client might be buggy. Since there is no control over the compute nodes, we need redundancy to check the results.

They will help in determining that a particular host needs attention, though, and might pop up a dialog box stating that the computer needs examination. Some people are unaware that they need to open their computers to vacuum them out.
ID: 688888 · Report as offensive
PhonAcq

Send message
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 688930 - Posted: 5 Dec 2007, 13:26:42 UTC - in response to Message 688844.  


A second downside is the granting of credit to cheats. I suppose one can filter for this as well and issue redundant wu's based on this criteria as well.


What would a filter like that look like? How would you handle early enough detection of a "cheat" result so that you didn't generate spurious results thinking that they need to validate the fraudulent result as being something like, I dunno, a new "WOW signal"? How would you handle detecting "cheat" results that come in that do not meet the criteria of a second look?

Based on the complexity of what I think the "cheat detector" would look like, the feasibility goes out the window pretty quickly, IMO, but perhaps there is a more simplistic approach to its' design...


The credit granting policy is pretty difficult to address here. But to respond to the above, it is within SETI's prerogative to remove credit unjustly granted. A claim to grossly excessive credit can be re-examined with redundant units issued; the credit granted becomes the lowest value, like it is now. This works as long as the redundant units issued are a small fraction of the overall workload.

What may be hard to detect is someone who diddles with the code and asks for a modest 10% more undeserved credit each time. A second difficult case is a large team running diddled code, so that there is a high likelihood that redundant results become 'validated' by bogus code.

Einstein works on a single wu basis, but I think their stock project only uses closed code; they rely on their beta project to find programming improvements to roll out. (I'm not an expert, so I don't really know.)



ID: 688930 · Report as offensive
Brian Silvers

Send message
Joined: 11 Jun 99
Posts: 1681
Credit: 492,052
RAC: 0
United States
Message 688934 - Posted: 5 Dec 2007, 13:48:09 UTC - in response to Message 688930.  


Einstein works on a single wu basis, but I think their stock project only uses closed code; they rely on their beta project to find programming improvements to roll out. (I'm not an expert, so I don't really know.)


The Einstein application is closed-source, but they have the exact same Initial Replication and Minimum Quorum (IR=2, MQ=2) as here. What you are likely thinking of is that the participants' credit claims do not matter, as credit granted is set at a fixed amount on the server. Recently this has been slightly troublesome: instead of results with the same credit having similar runtimes, they span a very wide range of runtimes, +/- 20% in some cases.

Oh, and BTW, don't get me started about the way they've handled "beta" lately. I like Bernd and think he's doing a good job of finding and fixing things as they go, but the past two applications/data runs (S5R2 and S5R3) have both essentially been userbase-wide public betas, with a small set of users acting as quasi-alpha testers, somewhere between alpha and beta...
ID: 688934 · Report as offensive
Profile Matt Lebofsky
Volunteer moderator
Project administrator
Project developer
Project scientist
Avatar

Send message
Joined: 1 Mar 99
Posts: 1444
Credit: 957,058
RAC: 0
United States
Message 688996 - Posted: 5 Dec 2007, 20:24:07 UTC - in response to Message 688843.  

Not sure you understood my suggestion: Again, reduce wu to 1 send. When promising signals are detected/claimed, then send out one or more replications to check the original client's result.


Ah... here's the catch: the definition of "promising." No single result is promising - it has to be matched against all the signals from previous results to determine whether there is any interesting persistency. As of right now this process happens every 3-4 years. Redundancy/validation is much faster.

This is where the Near Time Persistency Checker comes in (still in development - news to come on that at some point), which will continually check persistency on incoming results. But the key here is "near time" - i.e. not right away. It may be a couple weeks behind on checking your result, maybe less, maybe more - so no significant gain, if any, over the current 2/2 scheme.
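Conceptually, a persistency check of this kind bins detected signals by sky position and frequency and flags cells hit at more than one observation time. The sketch below is a guess at the general idea only; none of the names or bin sizes come from the real Near Time Persistency Checker:

```python
from collections import defaultdict

def find_persistent(signals, sky_bin=0.1, freq_bin=0.01, min_times=2):
    """Flag (sky position, frequency) cells seen at two or more distinct times."""
    cells = defaultdict(set)
    for s in signals:  # each signal: dict with ra, dec, freq (MHz), time
        key = (round(s["ra"] / sky_bin), round(s["dec"] / sky_bin),
               round(s["freq"] / freq_bin))
        cells[key].add(s["time"])
    return [key for key, times in cells.items() if len(times) >= min_times]

observations = [
    {"ra": 10.00, "dec": 41.20, "freq": 1420.40, "time": "2007-01-03"},
    {"ra": 10.01, "dec": 41.21, "freq": 1420.40, "time": "2007-06-19"},  # repeat
    {"ra": 200.5, "dec": -3.10, "freq": 1419.99, "time": "2007-02-11"},
]
print(find_persistent(observations))  # one cell flagged: the repeated source
```

The "near time" caveat is about how often a pass like this runs and how far behind the incoming result stream it sits, not about the binning itself.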

ASIDE: Also, note, that as far as I am concerned nobody is getting personal about criticisms on this board. You sound rather defensive, which is understandable at times I'm sure.


I didn't think any of this was personal, nor was I feeling particularly defensive. Once again the magic of the internet seems to artificially amplify everybody's jerk factor by 500%. Tone of voice/body language is far more important than people think. Given my dealings with the music industry (and music press) I cannot possibly be offended by anything anybody says around here. If anybody thinks I'm being curt or dismissive, it's probably because I don't have the time to ingest everything people are suggesting/asking.

- Matt
-- BOINC/SETI@home network/web/science/development person
-- "Any idiot can have a good idea. What is hard is to do it." - Jeanne-Claude
ID: 688996 · Report as offensive

©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.