Performance

Performance is a huge factor in Nebula. We are constantly tweaking all the alorithms. If it takes a long time - say a month - to see the effects of a tweak, it won't be possible to finish the project. Our goal is to have the entire pipeline take less than a day to complete.

Here are some the techniques we use to boost performance.

R-trees

All of Nebula's major components need some form of geometric lookup - e.g. out of a large set of signals, to find those that lie within a given time/frequency rectangle. Doing this the crude way leads to O(N^2) performance, which is a killer.

For this purpose I used a wonderful data structure called an R-tree. I'd like to say I already knew about R-trees, but in fact I discovered them using a web search.

An R-tree stores a set of points or rectangles in any number of dimensions. It then lets you do various types of queries: given another rectangle, you can find which rectangles it overlaps, or which it's contained in, and so on. All this can be done efficiently: the cost of adding or removing an item is log(N) where N is the size of the set, and the cost of doing a query is Mlog(N), where M is the number of items returned.

Many implementations of R-trees are available. I tried a few, and chose the one that is part of the Boost geometry library. It was both fast and general.

File formats

The way Nebula stores and accesses data has a big impact on the performance of I/O-intensive back-end tasks.

Nebula stores signals in files, not in a database. Initially I used binary files. But it turns out that the Unix "sort" program, which works only on text tiles, was so incredibly fast and useful that I switched to using only text files. Record are lines of ASCII text, and each line is a list of pipe-separated values. This is the Informix DB dump format.

Memory-mapped files

Nebula makes heavy use of a Unix feature called "memory-mapped files", which lets you embed a file in the address space of a process. The contents of the file (typically an array of binary records) are not initially read from disk. Rather, when the program accesses the memory segment, a virtual-memory page fault occurs and the page is read from disk. This has the following benefits, especially when array accesses are non-sequential:

Indices

At various points the back-end code needs to get all the signals for which some attribute is within a given range - for example, the sky position is in a given pixel, or the time is in a particular hour. An index is a structure that lets this be done efficiently. Without an index we'd have to scan through the entire set of signals, which would take too long.

At first I used directory hierarchies as an index. For each signal type, I divided the signals into about 16 million files, one per pixel, and stored this in 4096 directories. To my surprise, I found that creating this hierarchy was much slower than I expected. As the directories got large, the time to create new files increased, and the whole thing was taking days instead of hours.

I never figured out the underlying problem; maybe my assumptions about Unix directories were wrong. Instead, I used a completely different approach for indexing by pixel. I store all the signals of a given type in a single file, ordered by pixel. In the process of creating this file, I also create a binary index file, which is essentially an array indexed by pixel number. Each entry consists of an offset into the signal file (i.e. where the signals in that pixel start) and a count (how many signals are in the pixel).

This works fine, and creating these files is extremely fast. Also, it's very simple to access the signals. A program can memory-map the index file, and can treat it as an array.


Next: Flattening the table hierarchy.



 
©2018 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.