Storing and accessing data

I've talked about Nebula's implementation of RFI removal and scoring - in particular its use of R-trees to avoid O(N2) runtime. Now I'll back up and describe how Nebula stores and accesses data. This has a big impact on the performance of I/O-intensive back-end tasks.

File formats

Nebula stores signals in files, not in a database. There are two formats for signal records:

Memory-mapped files

Nebula makes heavy use of a Unix feature called "memory-mapped files", which lets you embed a file in the address space of a process. The contents of the file (typically an array of binary records) are not initially read from disk. Rather, when the program accesses the memory segment, a virtual-memory page fault occurs and the page is read from disk. This has the following benefits, especially when array accesses are non-sequential:

Indices

At various points the back-end code needs to get all the signals for which some attribute is within a given range - for example, the sky position is in a given pixel, or the time is in a particular hour. An index is a structure that lets this be done efficiently. Without an index we'd have to scan through the entire set of signals, which would take too long.

Directories as indices

In some cases Nebula uses Unix directories as indices. For example, to index spikes by time, we divide them into 62,000 files, one for each 0.1 day period since the start of SETI@home. To get the spikes in period number 1000, we read the file spike_1000.bin.

This is predicated on the fact that modern Unix filesystems implement directory lookups efficiently. When we open a file in a given directory, Unix doesn't scan sequentially through the directory to find it; it maintains its own internal index - a hash table - that lets it locate the file in a small, constant amount of time.

Because Unix directory lookup is efficient, we could keep all these little files in a single directory. However, doing "ls" in such a directory would take forever. So instead we divide the files into directory hierarchies in a way that limits the number of files per directory. For example, for the index on time for spikes we create a directory structure like this:

spike/
    0/
        spike_1024.bin
        spike_2048.bin
        ...
    1/
        spike_1023.bin
    ...
    1023
gaussian/
...
The files are divided into 1024 directories; directory i contains the time periods p for which p mod 1024 = i. There are about 62,000 time periods, so each directory has about 62 files - a manageable number.

Index files

At first I used the above approach for pixel indexing as well as time. For each signal type, I divided the signals into about 16 million files, one per pixel, and stored this in 4096 directories.

To my surprise, I found that creating this hierarchy was much slower than I expected. As the directories got large, the time to create new files increased, and the whole thing was taking days instead of hours.

I never figured out the underlying problem; maybe my assumptions about Unix directories were wrong. Instead, I used a completely different approach for indexing by pixel. I store all the signals of a given type in a single binary file, ordered by pixel. In the process of creating this file, I also create a binary index file, which is essentially an array indexed by pixel number. Each entry consists of an offset into the signal file (i.e. where the signals in that pixel start) and a count (how many signals are in the pixel).

This works fine, and creating these files is extremely fast. Also, it's very simple to access the signals. A program can memory-map both the signal and the index file, and can treat them as arrays.

Next: Flattening the table hierarchy.



 
©2018 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.