Nebula pipeline

I've described most of the major pieces of Nebula. Now let's look at how they connect together to form the "Nebula pipeline" that converts data to results.

The steps are as follows. The first few need to be done only once; if the RFI or scoring algorithms change, only the steps from that point onward need to be rerun.

Dump the Informix database to flat files

This is fast compared to other types of DB access; it takes a day or two to dump the entire database. The flat files total about 3.3 TB. The largest of these is the spike table, at 1.01 TB.
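
For illustration, the dump can be done with Informix's dbexport utility, which writes each table to a delimited text file; something like this (the database name and output path are made up):

    # Hypothetical sketch: export the whole database to flat files.
    dbexport sah_db -o /data/dumps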

Transfer files to Atlas

I use "rsynch" for this; it's restartable and does compression. I transfer all the files (~20 of them) in parallel. Given the speed of the Internet connection between UC Berkeley and AEI (about 100 Mbps), it takes 2-3 days.

Flatten tables

Input: flat files for the tape, workunit group, workunit, and result tables.

Output: binary files mapping result ID to redundant flag, beam number, and angle range.

Code: digest.cpp, discussed earlier.
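
The output records might be laid out like this sketch; the actual format is digest.cpp's, and may differ:

    // Hypothetical sketch of the flattened output: one fixed-size record
    // per result ID, so later stages can look up a result with a seek.
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    #pragma pack(push, 1)
    struct ResultInfo {
        uint8_t redundant;   // 1 if the result is from a redundant tape
        int16_t beam;        // receiver beam number
        float   angle_lo;    // angle range start
        float   angle_hi;    // angle range end
    };
    #pragma pack(pop)

    void write_flattened(const std::vector<ResultInfo>& recs, const char* path) {
        FILE* f = fopen(path, "wb");
        if (!f) return;
        // Record i corresponds to result ID i.
        fwrite(recs.data(), sizeof(ResultInfo), recs.size(), f);
        fclose(f);
    }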

Generate birdies

Input: beam pointing histories (stored in the workunit_group table).

Output: a set of birdie signals.
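
A sketch of one way this could work; the structures and band limits are my assumptions, not the actual code:

    // Hypothetical sketch of birdie generation: pick random times, look
    // up where the beam was pointing at each time, and emit a synthetic
    // signal at that sky position.
    #include <cstdlib>
    #include <vector>

    struct Pointing { double time, ra, dec; };        // one pointing sample
    struct Birdie   { double time, ra, dec, freq; };  // synthetic signal

    // Assumes a non-empty pointing history; frequencies are drawn from
    // a ~2.5 MHz band (assumed limits).
    std::vector<Birdie> make_birdies(const std::vector<Pointing>& history, int n) {
        std::vector<Birdie> out;
        for (int i = 0; i < n; i++) {
            const Pointing& p = history[rand() % history.size()];
            double freq = 1418.75e6 + rand() / (double)RAND_MAX * 2.5e6;
            out.push_back({p.time, p.ra, p.dec, freq});
        }
        return out;
    }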

Filter signals

Input: signal files.

Output: signal files from which invalid signals, and signals from redundant tapes, have been removed.
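
As a sketch, the filter predicate might look like this (the validity checks and field names are guesses; the redundant flags come from the flattening step):

    // Hypothetical filter predicate for one signal.
    #include <cmath>
    #include <cstdint>
    #include <vector>

    struct Signal { int64_t result_id; double freq, power; };

    bool keep_signal(const Signal& s, const std::vector<uint8_t>& redundant) {
        if (!std::isfinite(s.freq) || !std::isfinite(s.power))
            return false;                                   // invalid fields
        if (s.result_id < 0 || s.result_id >= (int64_t)redundant.size())
            return false;                                   // bad result ID
        return !redundant[s.result_id];                     // redundant tape?
    }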

Time-sort flat files

Input: filtered signal files and birdie signal files.

Output: combined signal files, sorted by time, and corresponding index files.

This uses the Unix "sort" program, with the "--parallel" option to use multiple cores. Signal types are handled sequentially. The whole step takes a few hours.
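
For one signal type the invocation looks roughly like this (the delimiter, time column, and file names are made up):

    # Hypothetical: merge the filtered signals with the birdies and sort
    # by the time column (here column 3), using 8 cores and a big buffer.
    sort --parallel=8 -S 16G -T /scratch/tmp -t '|' -k 3,3g \
        spike_filtered.tbl spike_birdies.tbl > spike_time_sorted.tbl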

Find RFI zones

Input: time-sorted signal files.

Output: zone RFI files, one per (signal type, FFT length). Each file is a bitmap, one bit per RFI zone.
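
The bitmap itself can be as simple as this sketch; how zones are defined and flagged is elided:

    // Hypothetical sketch of a zone RFI file's contents: one bit per
    // zone, set if the zone is flagged as RFI.
    #include <cstdint>
    #include <vector>

    struct ZoneBitmap {
        std::vector<uint8_t> bytes;
        explicit ZoneBitmap(size_t nzones) : bytes((nzones + 7) / 8, 0) {}
        void flag(size_t z)         { bytes[z / 8] |= uint8_t(1u << (z % 8)); }
        bool is_rfi(size_t z) const { return bytes[z / 8] & (1u << (z % 8)); }
    };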

Remove RFI

Input: time-sorted signal files.

Output: "clean" time-sorted signal files.

The program is multithreaded; it takes a "--ncpus" argument and uses that many CPUs in parallel, each processing a subset of the time_rfi/ files. We also parallelize over signal types. The result is very fast: an hour or two to remove RFI from all 5 billion signals.
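
The parallelism might be structured like this sketch, where remove_rfi_from_file() is a stand-in for the real per-file work:

    // Hypothetical sketch of the "--ncpus" parallelism: each thread
    // handles a disjoint subset of the input files.
    #include <string>
    #include <thread>
    #include <vector>

    void remove_rfi_from_file(const std::string& path) {
        // ... read one time_rfi/ file, drop RFI signals, write it back ...
        (void)path;
    }

    void remove_rfi_all(const std::vector<std::string>& files, int ncpus) {
        std::vector<std::thread> workers;
        for (int t = 0; t < ncpus; t++) {
            workers.emplace_back([&files, t, ncpus] {
                // Thread t processes files t, t+ncpus, t+2*ncpus, ...
                for (size_t i = t; i < files.size(); i += ncpus)
                    remove_rfi_from_file(files[i]);
            });
        }
        for (auto& w : workers) w.join();
    }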

Index signals by pixel

Input: clean time-sorted files.

Output: signal files ordered by pixel, and corresponding indices. Also creates a list of pixels that have signals.
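
Once the signals are in pixel order, the index can be built in one pass; this sketch uses made-up structures:

    // Hypothetical one-pass index build over signals already sorted by
    // pixel. Pixels absent from the map are exactly the pixels with no
    // signals, which gives the pixel list for free.
    #include <cstdint>
    #include <map>
    #include <vector>

    struct Signal     { int32_t pixel; /* ... other fields ... */ };
    struct IndexEntry { int64_t first; int64_t count; };  // record range

    std::map<int32_t, IndexEntry> build_pixel_index(const std::vector<Signal>& sorted) {
        std::map<int32_t, IndexEntry> idx;
        for (int64_t i = 0; i < (int64_t)sorted.size(); i++) {
            auto it = idx.find(sorted[i].pixel);
            if (it == idx.end()) idx[sorted[i].pixel] = IndexEntry{i, 1};
            else it->second.count++;
        }
        return idx;
    }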

Scoring

Multiplet finding and scoring is done by a program, nebula_score, that takes a pixel number as a command-line argument and generates three files:

nebula_score is the central part of a system for scoring some or all of the 16 million pixels. This system is based on two concepts:

The steps in the scoring process are:

The pipeline Makefile

If you've programmed on Unix, you're probably familiar with the "make" program. It lets you express, in a "Makefile", the dependencies between files, along with rules for regenerating a file when a file it depends on changes.

This is usually used for building large programs - when you change a file, it automatically rebuilds everything that depends on it. But it can be used for other purposes.

The Nebula pipeline is expressed in a Makefile. Each stage is represented by a "done file"; when a stage completes, its done file is touched. The next stage, which depends on that done file, is then started.

Before it occurred to me to use make, I ran the stages manually, waiting for each one to finish and starting the next. This was extremely error-prone and also wasted lots of time.

If I change an algorithm in the middle of the pipeline, there's no reason to start from the beginning. With the "make" approach, for example, if I change the RFI algorithm I just remove the "done file" for RFI removal and run make. The necessary stages are run, no more and no less.
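
A minimal sketch of the pattern (stage and program names are illustrative, not taken from the actual Makefile; recipe lines start with a tab):

    # Each stage's target is a marker ("done") file, touched on success;
    # each stage depends on the previous stage's marker.
    rfi_zones.done: time_sort.done
    	./find_rfi_zones
    	touch $@

    rfi_removal.done: rfi_zones.done
    	./remove_rfi --ncpus 16
    	touch $@

    # After changing the RFI algorithm: rm rfi_removal.done && make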

Moving data to Berkeley

The pipeline makefile runs on Atlas. When it's done, there are two more steps, which are initiated from Berkeley computers. The first is to copy the signal files, both pre- and post-RFI removal. These are necessary to efficiently generate waterfall plots on the SETI@home web site.

The second is to get recently-generated scoring data from Atlas: files with lists of multiplet and pixel scores and parameters. These are appended to existing files, and then multiplet scores are normalized across signal types, as described here.

Code

The pipeline Makefile is here.

The various programs and scripts it refers to are all in the Nebula directory.


Next: Status and plans.



 