Joined: 13 Feb 99
SETI@home is a chain of algorithms, most of them pretty complex, most of them developed by trial and error. Every chain has a weakest link, and the weakness of our weakest link could determine whether we find ET. Although Nebula's "birdie" framework has been helpful in evaluating the algorithms, it's certainly not infallible.
The algorithms have accumulated over the years, and many people have contributed to them. The recent ideas have mostly come from a small group: Eric and me, with input from Dan and Jeff. Like everyone else, we make mistakes and have blind spots. What if we're missing something obvious, and one of our links is very weak?
To guard against this, I decided that we need an "external design review" - bring in some outside experts, explain our algorithms to them, and get their feedback. We did this last week (over Zoom, like everything else these days) with experts from Harvard, UC Berkeley, and Australia. I thought it went extremely well. We focused on the multiplet scoring function. The experts understood what we're currently doing and didn't find any glaring problems with it, but each of them had some ideas for improvements.
Two key ideas emerged, which I've implemented in the last few days.
1) Normalize score factors
As discussed elsewhere, multiplet scores have several factors. The original idea was that each factor was a probability: it measured a particular property of the multiplet, and estimated the probability that a multiplet with that value of the property would occur in noise. Then - assuming that the properties are independent - we multiply the factors to get an overall probability. (Actually we work in log space, so we add the factors).
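As a rough sketch of that idea (the factor names and values here are made up for illustration, not taken from the actual code), combining per-factor noise probabilities in log space looks like this:

```python
import math

# Hypothetical factor values for one multiplet: each is an estimate of
# the probability that a noise multiplet would show that property value.
factor_probs = {"nd": 1e-30, "time": 0.2, "freq": 0.05}

# Multiplying the probabilities is the same as summing their logs;
# working in log space also avoids underflow for tiny probabilities.
log_score = sum(math.log10(p) for p in factor_probs.values())

print(log_score)  # more negative = less likely to arise from noise
```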
For various reasons, the factors have different variances. The ND factor varies (in log space) by 10 or 100, while the time factor varies by only 1 or so. So if we simply add the factors, the time factor doesn't make much difference: a multiplet could have a great time factor but still get a bad score.
I tried to solve this problem by optimizing the score factor weights to favor birdies, but this was unsuccessful.
So instead (on the suggestion of one of the experts) I decided to use a simple solution: normalize the factors so they all have the same variance - let the data tell us how to scale the factors. Actually I used a slightly more robust approach: scale the factors so that the 25% and 75% quantiles are the same across factors.
Note: we do this normalization separately for each multiplet "category", i.e. combination of detection type and baryness. The factor ranges are quite different across categories. For example, for bary spike/gaussian multiplets, the quantiles for ND factor are -44 and -15, while for non-bary spike/gaussian multiplets they're -148 and -65.
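A minimal sketch of that quantile-based normalization, assuming each factor's log-space values are available as a list per category (the factor names and sample values below are illustrative, not from the Nebula code):

```python
import statistics

def normalize_factors(factor_values, target_iqr=1.0):
    """Rescale each factor's log-space values so that the spread between
    its 25% and 75% quantiles matches target_iqr.  This would be run
    separately for each multiplet category, since the factor ranges
    differ a lot between categories."""
    scaled = {}
    for name, values in factor_values.items():
        # statistics.quantiles(n=4) returns the three quartiles.
        q25, _, q75 = statistics.quantiles(values, n=4)
        scale = target_iqr / (q75 - q25)
        scaled[name] = [v * scale for v in values]
    return scaled

# Illustrative data: a wide-ranging factor and a narrow one.
example = {
    "nd":   [-148.0, -120.0, -100.0, -80.0, -65.0],
    "time": [-1.5, -1.0, -0.7, -0.4, -0.1],
}
normalized = normalize_factors(example)
```

After rescaling, both factors have the same interquartile range, so neither dominates when they are summed.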
2) Score variants
We designed the score factors to measure properties that we expect to distinguish ET from noise. But what if - perhaps for a particular category - one of the properties isn't doing this? Perhaps its value is effectively random. Then - especially now that we're normalizing the factors - this could cause an ET multiplet to get a mediocre score and be missed. Again, I had hoped that weight optimization would discover and fix these situations, but it didn't.
So (also at the suggestion of one of the experts) I'm trying a simpler approach: look at all 7 nonempty combinations of the three scoring factors. I call these "score variants". The web interface now lets you look at the top multiplet lists for any category and any score variant.
Score variants give us a tool for understanding the score factors. It's possible we'll find that for some categories it's better to omit one or two of the factors. For the spike/gaussian categories we can see which score variant finds the most birdies, and use that one. For the other categories we don't have birdies, but we can look at the top-scoring multiplets for each variant and judge - by intuition - how much they look like ET. Or we could make a combined list of the multiplets that score highly (say, in the top 501) for any of the variants.
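Enumerating the variants is just taking every nonempty subset of the factors; a sketch with hypothetical factor names:

```python
from itertools import combinations

FACTORS = ("nd", "time", "freq")  # illustrative names for the three factors

def score_variants(factors=FACTORS):
    """All nonempty subsets of the factors: 2**3 - 1 = 7 variants."""
    return [c for k in range(1, len(factors) + 1)
              for c in combinations(factors, k)]

def variant_score(factor_logs, variant):
    # Sum only the (normalized, log-space) factors in this variant.
    return sum(factor_logs[f] for f in variant)

variants = score_variants()
# To compare variants, rank the multiplets by variant_score(m, v) for
# each variant v, then inspect the top of each list.
```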
Joined: 25 Nov 01
Good to prove that normalization is good, geometric means are bad! (Except for very special cases or for just glossing over the real results...)
Silly simplistic question (something already done?):
Could a blind look at a matrix comparing all (normalised) parameters for noise vs pulses show what is most significant?
Aside: I still wonder if artifacts lurk in the 1-bit source data collection.
Joined: 20 Apr 00
Are similar design review meetings planned to look at the other functions (RFI removal, etc.)?
Great to know that the external experts broadly agree with your solutions to date and are able to bring in further refinements.
Getting closer to ET with each step.
Michael E. Hoffpauir
Joined: 13 Nov 99
Glad you all are still thinking and analyzing. Would also hope the data is all stored for the next generation to apply new algorithms and computing technologies ... to find what has been missed, but always there.