Nebula workflow

The following are the steps for back-end processing with Nebula. The first few steps need to be done only once; if the RFI or scoring algorithms change, only the steps from that point onward need to be redone.

Dump the Informix database to flat files

This is fast compared to other types of DB access; it takes a day or two to dump the entire database. The flat files total about 3.3 TB. The largest of these is the spike table, at 1.01 TB.

Transfer files to Atlas

I use "rsync" for this; it's restartable and supports compression. I transfer all the files (~20 of them) in parallel. Given the speed of the Internet connection between UC Berkeley and AEI (about 100 Mbps), this takes 2-3 days.

Flatten tables

Input: flat files for the tape, workunit group, workunit, and result tables.

Output: binary files mapping result ID to redundant flag, beam number, and angle range.

Code: digest.cpp, discussed earlier.
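The output files can be thought of as flat arrays of fixed-size records indexed by result ID, so a lookup is O(1) arithmetic. A minimal sketch, assuming a hypothetical record layout (the actual format in digest.cpp may differ):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fixed-size record per result; the real layout
// used by digest.cpp may differ.
struct ResultInfo {
    uint8_t redundant;    // nonzero if the result is redundant
    uint8_t beam;         // beam number
    float ang_min;        // angle range (assumed representation)
    float ang_max;
};

// With fixed-size records, the table can live in one binary file
// (or an in-memory array) indexed directly by result ID.
std::vector<ResultInfo> make_table(size_t nresults) {
    return std::vector<ResultInfo>(nresults);
}

// Byte offset of a given result's record in the binary file.
size_t record_offset(uint64_t result_id) {
    return result_id * sizeof(ResultInfo);
}
```

The point of the fixed-size layout is that later stages can map result IDs to flags without any searching.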

Time-sort flat files

Input: signal table dump files.

Output: same, sorted by time.

Uses the Unix "sort" program, with the "--parallel=16" option to use multiple cores. Signal types are sorted sequentially; the whole process takes a few hours.

Code: time_sort.sh

Split signals by time

Input: time-sorted flat files.

Output: separate files for each 0.1-day period, in a "time_rfi/" directory hierarchy.

Note: it's critical to start with time-sorted files; otherwise the program must constantly open and close output files. I tried this, and it was far too slow.

Code: time_split.cpp
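A sketch of why time-sorting matters: with sorted input the 0.1-day bin index only ever advances, so each output file is opened exactly once. Counting file switches makes the cost of unsorted input visible (helper names are hypothetical, not from time_split.cpp):

```cpp
#include <cmath>
#include <vector>

// Bin index for a signal time, in 0.1-day periods.
int time_bin(double t_days) {
    return (int)std::floor(t_days / 0.1);
}

// Count how many times the splitter must close one output file and
// open another when streaming signals in the given order.
int file_switches(const std::vector<double>& times) {
    int switches = 0;
    int cur = -1;  // no file open yet
    for (double t : times) {
        int b = time_bin(t);
        if (b != cur) {
            switches++;  // close current file, open the one for bin b
            cur = b;
        }
    }
    return switches;
}
```

With time-sorted input the number of switches equals the number of distinct bins; unsorted input can force a switch on nearly every signal.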

Find RFI zones

Input: time_rfi/ hierarchy

Output: zone RFI files, one per (signal type, FFT length). Each file is a bitmap, one bit per RFI zone.

Code: zone_finder.cpp. The algorithm was described earlier.
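A one-bit-per-zone file is just a packed bit array. This is an illustrative sketch of that representation, not the actual zone_finder.cpp layout:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One bit per RFI zone: bit i set means zone i is flagged as RFI.
// The byte vector can be written to (or read from) the zone file as-is.
struct ZoneBitmap {
    std::vector<uint8_t> bits;

    explicit ZoneBitmap(size_t nzones)
        : bits((nzones + 7) / 8, 0) {}

    void set(size_t i) {
        bits[i / 8] |= (uint8_t)(1u << (i % 8));
    }
    bool test(size_t i) const {
        return bits[i / 8] & (1u << (i % 8));
    }
};
```

One file per (signal type, FFT length) pair keeps each bitmap small enough to hold in memory during RFI removal.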

Remove RFI

Input: time_rfi/ hierarchy and zone RFI files.

Output: time_clean/ hierarchy (same structure as time_rfi/, but with RFI removed).

This program scans the files in the time_rfi/ hierarchy and applies the three types of RFI removal described earlier. The program is multithreaded: it takes an "--ncpus" argument and uses that many CPUs in parallel, each processing a subset of the time_rfi/ files. We also parallelize over signal types. The result is very fast: about an hour to remove RFI from all 5 billion signals.

Code: remove_rfi.cpp
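The parallelism can be sketched with standard threads: each worker takes a strided, disjoint subset of the files, so no work queue or locking is needed. This is a simplified sketch; remove_rfi.cpp's actual partitioning may differ:

```cpp
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Run process(i) for every file index 0..nfiles-1, with worker w
// handling the indices i ≡ w (mod nthreads). Because the subsets are
// disjoint, workers never touch the same file.
void parallel_over_files(size_t nfiles, int nthreads,
                         const std::function<void(size_t)>& process) {
    std::vector<std::thread> pool;
    for (int w = 0; w < nthreads; w++) {
        pool.emplace_back([=, &process] {
            for (size_t i = (size_t)w; i < nfiles; i += nthreads) {
                process(i);
            }
        });
    }
    for (auto& t : pool) t.join();
}
```

Static striding works well here because the per-file cost is roughly uniform; a shared work queue would only add synchronization overhead.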

Index signals by pixel

Input: the time_clean/ hierarchy.

Output: for each signal type, a signal file and index file, as described earlier. Also creates a list of all pixels that have signals.

In addition to the signal and index files, pixelize produces a list of pixels for each signal type. A script, merge_pixel_lists.sh, merges these lists, sorts the result, and removes duplicates.

Code: pixelize.cpp
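The index file maps each pixel to the position of its signals in the signal file. Assuming the signals have already been grouped by pixel, the index can be built in one pass; this is a sketch with an assumed record layout, which may differ from what pixelize.cpp writes:

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Index entry: where a pixel's signals start in the signal file,
// and how many there are. (Assumed layout, for illustration.)
struct IndexEntry {
    uint64_t offset;  // position of first signal for this pixel
    uint32_t count;   // number of signals for this pixel
};

// Build the index from signals already grouped by pixel.
// pixel_of_signal[i] is the pixel number of the i-th signal record.
std::map<uint32_t, IndexEntry>
build_index(const std::vector<uint32_t>& pixel_of_signal) {
    std::map<uint32_t, IndexEntry> index;
    for (uint64_t i = 0; i < pixel_of_signal.size(); i++) {
        uint32_t p = pixel_of_signal[i];
        auto it = index.find(p);
        if (it == index.end()) {
            index[p] = IndexEntry{i, 1};  // first signal of this pixel
        } else {
            it->second.count++;
        }
    }
    return index;
}
```

With this structure, the scoring stage can seek directly to a pixel's signals without scanning the whole signal file.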

Scoring

Multiplet finding and scoring is done by a program, nebula_score, that takes a pixel number as a command-line argument and generates three output files.

Code: nebula_score.cpp.

nebula_score is the central part of a system for scoring some or all of the 16 million pixels. This system is based on two concepts.

The scoring process is carried out by a set of scripts. Code: group_pixels.py, make_todo.py, score_setup.py, score_finish.py.
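The script names suggest that pixels are grouped into batch jobs before scoring. A hedged sketch of such batching (the batch size and grouping policy here are assumptions, not taken from group_pixels.py):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Split a pixel list into fixed-size batches, one per scoring job.
// Each batch would then be scored by running nebula_score on each
// of its pixel numbers. (Hypothetical; for illustration only.)
std::vector<std::vector<int>>
group_pixels(const std::vector<int>& pixels, size_t batch) {
    std::vector<std::vector<int>> jobs;
    for (size_t i = 0; i < pixels.size(); i += batch) {
        size_t end = std::min(i + batch, pixels.size());
        jobs.push_back(std::vector<int>(pixels.begin() + i,
                                        pixels.begin() + end));
    }
    return jobs;
}
```

Batching amortizes per-job overhead when scoring millions of pixels on a cluster.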

Next: Status and plans.

©2017 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. Astropulse is funded in part by the NSF through grant AST-0307956.