Joined: 13 Feb 99
So, I've been working for the past month primarily on birdies, and reached a point where I could generate a lot of birdies, add them to the signal data, and run the entire Nebula pipeline to see if the birdies were rejected as RFI, and if they were detected as multiplets.
In the course of doing this, I realized that running the pipeline was taking too long, and was error-prone. Instead of continuing with birdies, I decided to go back to basics and fix these problems.
The first issue is that the pipeline consists of a dozen or so steps, some of which take many hours to finish. I had been running these steps manually. This caused lots of dead time between when a step finished and when I noticed and started the next one. Also, I would sometimes make errors like leaving out a step, which could waste lots (like, days) of time.
I fixed this by automating the pipeline with "make". The makefile does all the steps, using dependencies to decide what steps are needed, and there's no dead time.
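To make the idea concrete, here's a minimal sketch of what such a makefile looks like. The step names and commands are hypothetical, not Nebula's actual targets; the point is that each rule runs only when its inputs are newer than its output, so a re-run resumes where the last one stopped and each step starts the moment its dependencies are ready.

```make
# Hypothetical pipeline steps (not Nebula's real ones).
all: multiplets.out

# Sort the text-format signals numerically by their first field (time).
signals.sorted: signals.txt
	sort -k1,1g signals.txt > signals.sorted

# Downstream step depends on the sorted file.
multiplets.out: signals.sorted
	./find_multiplets signals.sorted > multiplets.out
```

Running `make` then does only the steps whose outputs are missing or out of date.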
The other issues have to do with data storage.
From the beginning of Nebula, I assumed that it would be better to store signals in binary rather than text format. Text format requires conversion with scanf and printf, which are relatively slow. Binary files can be memory-mapped, letting you access the signal structures with no overhead at all.
The problem with binary files is that (shockingly, in my opinion) there is no Unix utility for sorting large binary files. There's an excellent program ("sort") for sorting large text files. So in order to sort signals (by time, and later by pixel) I was having to convert back and forth between binary and text format. This negated the performance gain of binary files.
So I decided to switch to text files exclusively.
The final issue involves indexing. For example, after sorting signals by time, I needed to be able to efficiently access the signals in any given 5-minute period. I had been doing this by putting the signals for each period in a separate file, and storing these ~1 million files in a directory hierarchy. Essentially this used the Unix file system as the index.
An alternative approach is to store the time-sorted signals in one big data file, and construct a separate "index" file that maps time to offsets in the data file. The index file is binary, and you can memory-map it.
This approach is simpler, and eliminates the overhead of creating and deleting millions of files. It retains the key property that you can access signals with sequential disk reads.
These changes aren't as sweeping as they sound - they involve changes only to the code that inputs and outputs signals, which is a relatively small part of the total.
Nonetheless, the changes are a bit complex, and I was hesitant to do this at a point where Nebula is close to being finished. I thought about it for some time, and decided the benefits outweighed the costs.
This is a common situation in software development: you make some initial design decisions, and later realize they weren't quite right, and you have to decide whether to go back and redo. It's analogous to deciding how long to keep pumping money into an old car, which coincidentally is a situation I'm also currently in.
Joined: 16 Jun 01
Thanks for the update. Quite a vivid analogy :)
Regarding binary sorting - maybe write your own STL-based utility? Operating on numbers in text format sounds weird by definition...
Joined: 20 Apr 00
So, in conclusion, have the recent code changes resulted in a system that's more efficient and less error-prone?
Is Nebula now ready for routine use, and can you estimate how long it will take to process all the SETI@home result files?
Many thanks for all your hard work on this.