Rethinking RFI algorithms

David Anderson
Volunteer moderator · Project administrator · Project developer · Project scientist
Joined: 13 Feb 99 · Posts: 176 · Credit: 502,653 · RAC: 0
Message 2058408 - Posted: 5 Oct 2020, 21:09:46 UTC - Last modified: 5 Oct 2020, 21:42:31 UTC

Things have been a little crazy around here, what with the rapidly accelerating climate change eco-disaster and the political machinations of its instigators. The down-sizing of the SETI@home server complex is finally finished, and Eric is ramping back up on Nebula.

A lot of my recent work involved getting the Nebula software to run on Atlas again. Atlas is in the process of upgrading their machines from Debian 8 to 10, and the system libraries changed in a non-backward-compatible way. Also the new machines don't have libraries we need (healpix and fitsio). I spent a while trying to get things to work with shared (.so) libraries. Eventually we gave up, got the latest sources, and built the libraries in static (.a) form.

I spent several days trying to figure out a problem that turned out to be in our data - there were millions of copies of the same small set of detections, and the RFI removal code choked on that. Probably a bug in the assimilator. I fixed this by having "filter" limit the number of detections per result.

After all this, I finally finished a complete run - RFI removal, scoring of 256K pixels - on the new (final) data set, which includes the last several years of results. Everything was a little slower, of course, but this was balanced by the speed of the 96-core Atlas machine we're using now.

Eric looked at the results and found some pulse/triplet detections that should have been flagged as RFI, but weren't. We also noticed that some birdie detections (spikes) were being flagged as RFI, where they didn't look like RFI on the waterfall plots. This got us looking at RFI removal again.

Currently there are three main algorithms.
They have the same basic idea: find groups of detections that are separated in sky position but similar in other respects. They differ in terms of time scale: The zone algorithm looks at the entire 15-year data set. The drifting algorithm looks at 10-minute intervals. The multibeam algorithm looks at the duration of a detection (a few seconds or less). We ended up rethinking all three of these.

For the zone algorithm, we divide the detections of a given type and FFT length into "bins" based on their frequency (spikes, gaussians) or period (pulses, triplets). For each bin, we count the number of 0.1-day periods during which it has a statistical excess of detections. We then discard the 5% (for spikes) or 2% (others) of the bins for which this count is highest.

I asked: why 5% and 2%? Turns out they're guesstimates. To get more defensible values, I generated graphs that show the % of detections discarded as a function of % of bins discarded. If there really is zone RFI, we'd expect this function to rise sharply, then have a "knee" and level off. We'd then want to discard the bins up to and including the knee, but not beyond. I did this and the results are here. It shows (I think) that we should use 2% instead of 5% for spikes, and for triplets we shouldn't use this approach at all.

On to the drifting algorithm, which looks for excesses of detections in opposed triangles in time/freq space. We designed this for spikes and gaussians but were using it for pulses and triplets as well, and it's not appropriate for these types because the detections have large bandwidth. So Eric thought up a new algorithm that works on the same time scale (~10 minutes).

It works as follows. For a given detection type, process detections in 10-minute windows. For a given window, divide the detections by frequency into "bins", whose size is the median bandwidth for that type (38 Hz for pulses, 305 Hz for triplets). Find the sky position of a detection in the middle of the window.
For each bin, count the number of detections that are "far" from this position (more than 1.75 beam widths); actually, compute separate counts for detections before and after the midpoint. See which bins have a statistical excess of detections on both sides of the midpoint, and flag all their detections (near and far) as RFI. We'll use this algorithm in the next Nebula run and look at the results. It may need some tweaking.

Finally, multibeam. Currently (slightly simplified) this algorithm computes a time/freq rectangle for each detection D based on its FFT length. If it finds another detection D1 whose center is in this rectangle and is "far" (> 1.75 beam widths) from D, it flags all detections in the rectangle as RFI.

Eric observed a problem with this: if there are a lot of detections in this time/freq/position area, it's likely that such a D1 will exist randomly, but that doesn't mean everything is RFI; the algorithm was flagging too much, especially for pulse/triplet, whose rectangles generally have a larger area than for spikes/gaussians.

So he refined the algorithm as follows. As before, we compute the time/freq rectangle for D. Then, based on the density of detections in this time window across all frequencies, we compute the number of "far" signals that we'd expect to find, assuming that the other 6 beams are all "far". Then, using the incomplete Gamma function, we compute the probability of finding as many "far" signals as we did. If this is below a threshold (.001) we flag D (and only D) as RFI.

Again, we'll need to try this out in our next run and see how it performs.
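
For readers who want to see the shape of the refined multibeam test, here is a minimal Python sketch. It is not the Nebula code. It assumes the count of "far" detections is Poisson-distributed, in which case the probability of seeing at least k of them with mean lam equals the regularized lower incomplete Gamma function P(k, lam). The post does not say how the expected count is derived from the density, so `expected_far` is a guess: it scales the detections in the time window by the rectangle's share of the band and by 6/7 for the other beams. All function and parameter names are hypothetical.

```python
import math

N_BEAMS = 7            # assumption: the post's "other 6 beams" implies 7 in total
RFI_THRESHOLD = 0.001  # threshold quoted in the post


def poisson_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam).

    For integer k >= 1 this equals the regularized lower incomplete
    Gamma function P(k, lam).
    """
    if k <= 0:
        return 1.0
    if lam <= 0.0:
        return 0.0
    # P(X = 0). Note: exp(-lam) underflows to 0 for lam above ~745,
    # so this simple sketch is only suitable for modest expected counts.
    term = math.exp(-lam)
    cdf = 0.0
    for i in range(k):
        cdf += term            # add P(X = i)
        term *= lam / (i + 1)  # advance to P(X = i + 1)
    return max(0.0, 1.0 - cdf)


def expected_far(n_in_window, rect_bandwidth_hz, total_bandwidth_hz):
    """Hypothetical estimate of the expected number of "far" detections.

    Assumes detections in the time window are spread uniformly over the
    band, and that the other 6 of 7 beams all count as "far".
    """
    density = n_in_window / total_bandwidth_hz  # detections per Hz
    return density * rect_bandwidth_hz * (N_BEAMS - 1) / N_BEAMS


def is_multibeam_rfi(n_far_observed, n_in_window,
                     rect_bandwidth_hz, total_bandwidth_hz):
    """Flag detection D if its "far" count is improbably high."""
    lam = expected_far(n_in_window, rect_bandwidth_hz, total_bandwidth_hz)
    return poisson_tail(n_far_observed, lam) < RFI_THRESHOLD
```

Under these assumptions, 700 detections in the window and a rectangle covering 1% of the band give an expected count of about 6. Observing 20 "far" detections would then flag D, while observing 8 would not, which matches the goal of no longer flagging a detection just because a single "far" neighbour happens to exist in a crowded region.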

Jon Golding
Joined: 20 Apr 00 · Posts: 105 · Credit: 841,861 · RAC: 0
Message 2058462 - Posted: 6 Oct 2020, 15:19:54 UTC - in response to Message 2058408. Last modified: 6 Oct 2020, 15:20:25 UTC

Not really expecting an answer to this (as it would obviously take the novelty off any paper or announcement), but are you seeing any consistent 'hotspots' in any regions of the sky from the complete run?

Sesson
Joined: 29 Feb 16 · Posts: 44 · Credit: 1,353,463 · RAC: 3
Message 2058922 - Posted: 11 Oct 2020, 17:17:05 UTC - in response to Message 2058462.

I don't think anything would be found right now. As I learned from Zooniverse projects, telescopes have some pointing error, which results in an offset of one or two pixels among images of the same target taken at different times. What's more, if ET is near the Earth, say 20 pc away, their proper motion over 20 years would be significant, and I haven't seen any model of that in the Nebula code.

Jon Golding
Joined: 20 Apr 00 · Posts: 105 · Credit: 841,861 · RAC: 0
Message 2058928 - Posted: 11 Oct 2020, 17:40:28 UTC - in response to Message 2058922.

Excellent point about the drift of stellar position. Presumably not an issue for targeted observations of the same star over time, but could certainly affect basket-weave scans. You might also expect the search sensitivity to improve over time, as new equipment comes along, and I'm unaware whether that has been considered. I suspect these data will be analysed and re-analysed for years to come as new ideas and algorithms come along. Hopefully, there's something in there just waiting to be discovered....

rob smith
Volunteer moderator · Volunteer tester
Joined: 7 Mar 03 · Posts: 22823 · Credit: 416,307,556 · RAC: 380
Message 2058929 - Posted: 11 Oct 2020, 17:42:26 UTC - in response to Message 2058922.

The data collected by SETI is nothing like the data used for the Zooniverse work: SETI uses radio telescopes, whereas the Zooniverse work uses visible light. The exact position of the two radio telescopes used to feed SETI@home has been very well calibrated over the years.

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
