The following are the steps in back-end processing using Nebula.
The first few steps need to be done only once.
If the RFI or scoring algorithms change,
only the steps from that point onward need to be redone.
Dump the Informix database to flat files
This is fast compared to other types of DB access;
it takes a day or two to dump the entire database.
The flat files total about 3.3 TB.
The largest of these is the spike table, at 1.01 TB.
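The dump mechanism isn't described here; as one possible sketch,
using Informix's dbaccess utility and its UNLOAD statement (the
database name and table list are placeholders, not the real ones):

    # Sketch: dump each table to a pipe-delimited flat file.
    # Database and table names are placeholders.
    import subprocess

    TABLES = ["tape", "workunit_grp", "workunit", "result"]

    for table in TABLES:
        sql = f'UNLOAD TO "{table}.dump" DELIMITER "|" SELECT * FROM {table};'
        subprocess.run(["dbaccess", "mydb", "-"], input=sql,
                       text=True, check=True)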
Transfer files to Atlas
I use "rsynch" for this; it's restartable and does compression.
I transfer all the files (~20 of them) in parallel.
Given the speed of the Internet connection between
UC Berkeley and AEI (about 100 Mbps), it takes 2-3 days.
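A sketch of the transfer step in Python (the destination host and
path are placeholders; -z gives compression and --partial makes
interrupted copies restartable):

    # Sketch: one rsync per dump file, all running in parallel.
    import glob, subprocess

    procs = [subprocess.Popen(
                ["rsync", "-az", "--partial", path, "atlas:/data/nebula/"])
             for path in glob.glob("*.dump")]
    for p in procs:
        p.wait()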
Digest database tables
Input: flat files for the tape, workunit group, workunit, and result tables.
Output: binary files mapping result ID to redundant flag,
beam number, and angle range.
Code: digest.cpp, discussed earlier.
Time-sort flat files
Input: signal table dump files.
Output: same, sorted by time.
Uses the Unix "sort" program, with the "--parallel 16" option to
use multiple cores.
Signal types are done sequentially.
Takes a few hours.
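A sketch of the sort invocation (the timestamp field position, and
the signal-type names other than spike, are assumptions):

    # Sketch: sort each signal dump by time with GNU sort.
    # Assumes pipe-delimited records with the time in field 2.
    import subprocess

    for sig in ["spike", "gaussian", "pulse", "triplet"]:
        subprocess.run(
            ["sort", "--parallel=16", "-t", "|", "-k", "2,2n",
             "-o", f"{sig}.sorted", f"{sig}.dump"],
            check=True)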
Split signals by time
Input: time-sorted flat files.
Output: separate files for each 0.1-day period,
in a "time_rfi/" directory hierarchy.
In the process:
- Convert to binary.
- Remove redundant signals.
- Remove invalid signals (bad FFT len, time, or detection frequency).
- Add beam# to binary record.
Note: it's critical to start with time-sorted files.
Otherwise you have to constantly open/close output files.
I tried this and it was too slow.
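To see why sorted input matters, here's a minimal Python sketch of
the split: at most one output file is open at a time, written
sequentially and closed. Field positions, the bin numbering, and the
binary record layout are all assumptions, and the redundancy and
validity checks are omitted:

    import os, struct

    os.makedirs("time_rfi/spike", exist_ok=True)
    out, cur_bin = None, None
    for line in open("spike.sorted"):
        fields = line.rstrip("\n").split("|")
        t = float(fields[1])          # assumed: time in field 2 (days)
        b = int(t * 10)               # index of the 0.1-day bin
        if b != cur_bin:
            if out:
                out.close()
            out = open(f"time_rfi/spike/{b}.bin", "wb")
            cur_bin = b
        # placeholder record: time, detection frequency, beam number
        out.write(struct.pack("<ddi", t, float(fields[2]), int(fields[3])))
    if out:
        out.close()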
Find RFI zones
Input: time_rfi/ hierarchy.
Output: zone RFI files, one per (signal type, FFT length).
Each file is a bitmap, one bit per RFI zone.
The algorithm was described earlier.
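Since the flagging algorithm itself was described earlier, this
sketch shows only the bitmap representation, with a simple count
threshold standing in for the real test (the zone count, threshold,
and filename are placeholders):

    N_ZONES = 1 << 20
    THRESHOLD = 1000
    zone_counts = [0] * N_ZONES   # placeholder; filled by scanning time_rfi/

    bitmap = bytearray(N_ZONES // 8)

    def flag_zone(z):
        bitmap[z >> 3] |= 1 << (z & 7)

    def is_rfi_zone(z):
        return bool(bitmap[z >> 3] & (1 << (z & 7)))

    for z, count in enumerate(zone_counts):
        if count > THRESHOLD:
            flag_zone(z)

    with open("zone_rfi_spike_131072", "wb") as f:
        f.write(bitmap)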
Remove RFI
Input: time_rfi/ hierarchy and zone RFI files.
Output: time_clean/ hierarchy (same structure as time_rfi/, but with RFI removed).
This program scans files in the time_rfi/
hierarchy and applies the 3 types of RFI removal.
The program is multithreaded; it takes a "--ncpus" argument and uses
that many CPUs in parallel, each processing a subset of the time_rfi/ files.
We also parallelize over signal types.
The result is very fast: about an hour to remove RFI from all 5 billion signals.
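A minimal sketch of this parallelism (the actual program's language
isn't specified here; clean_file() is a placeholder for the three
removal passes):

    import argparse, glob, multiprocessing

    def clean_file(path):
        # read a time_rfi/ file, drop signals that fail the RFI
        # tests, write the survivors to the same path under time_clean/
        pass

    if __name__ == "__main__":
        p = argparse.ArgumentParser()
        p.add_argument("--ncpus", type=int, default=1)
        args = p.parse_args()
        files = glob.glob("time_rfi/*/*.bin")
        with multiprocessing.Pool(args.ncpus) as pool:
            pool.map(clean_file, files)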
Index signals by pixel
Input: the time_clean/ hierarchy.
Output: for each signal type, a signal file and index file,
as described earlier.
Also creates a list of all pixels that have signals.
This involves the following steps:
- For each signal type, read the binary files from time_clean/
and write a temp file in text format.
- Sort these files by pixel using the Unix sort program.
- Run the pixelize program, which creates the signal and index files
and also produces a list of pixels for that signal type.
A script, merge_pixel_lists.sh, merges these per-type pixel lists,
sorts the result, and removes duplicates.
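Here's a sketch of the signal/index file idea from the pixelize
step: with records sorted by pixel, the signal file is written
sequentially, and the index records where each pixel's signals start
and how many there are. The record format and field positions are
placeholders:

    import struct

    sig = open("spike.sig", "wb")
    idx = open("spike.idx", "wb")
    pixels = []
    cur, start, count = None, 0, 0

    for line in open("spike.pixel_sorted"):
        pixel = int(line.split("|", 1)[0])    # assumed: pixel in field 1
        if pixel != cur:
            if cur is not None:
                idx.write(struct.pack("<qqq", cur, start, count))
            pixels.append(pixel)
            cur, start, count = pixel, sig.tell(), 0
        sig.write(line.encode())              # real files are binary
        count += 1
    if cur is not None:
        idx.write(struct.pack("<qqq", cur, start, count))

    with open("spike.pixel_list", "w") as f:
        f.writelines(f"{p}\n" for p in pixels)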
Find and score multiplets
Multiplet finding and scoring is done by a program, nebula_score,
that takes a pixel number as a command-line argument and generates three files:
- detail_N (where N is the pixel number):
a JSON-format file containing detailed output:
a list of all the multiplets found in the pixel,
summaries of their constituent signals,
and the score of the pixel.
- A text-format file with 1-line summaries of the multiplets, including their scores.
- A 1-line summary of the pixel, including its score.
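For illustration, here's roughly what producing the three outputs
could look like; the JSON field names, the summary-file names, and
the rule that a pixel's score is its best multiplet score are all
assumptions:

    import json

    pixel = 123456                                 # example pixel number
    multiplets = [{"score": 3.7, "signals": []}]   # placeholder result

    pixel_score = max(m["score"] for m in multiplets)
    with open(f"detail_{pixel}", "w") as f:
        json.dump({"pixel": pixel, "score": pixel_score,
                   "multiplets": multiplets}, f)
    with open(f"multiplets_{pixel}", "w") as f:    # hypothetical filename
        for m in multiplets:
            f.write(f"{pixel} {m['score']}\n")
    with open(f"summary_{pixel}", "w") as f:       # hypothetical filename
        f.write(f"{pixel} {pixel_score}\n")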
nebula_score is the central part of a system for scoring
some or all of the 16 million pixels.
This system is based on two concepts:
- A job is a set of 64 pixels, starting on a multiple of 64.
When scoring is done on the Atlas cluster, this is the unit of work.
- A job group is a set of 64 jobs.
It's the unit of accounting - keeping track of what pixels have been scored so far.
Each job group is 4096 pixels, so there are roughly 4,000 job groups.
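These definitions are purely arithmetic, so they pin down the
pixel-to-job mapping and the group count; a small, illustrative
sketch:

    PIXELS_PER_JOB = 64
    JOBS_PER_GROUP = 64

    def job_of(pixel):
        # a job is identified by its starting pixel, a multiple of 64
        return (pixel // PIXELS_PER_JOB) * PIXELS_PER_JOB

    # 16 million pixels / 4096 pixels per group = about 3900 job groups
    print(16_000_000 // (PIXELS_PER_JOB * JOBS_PER_GROUP))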
The steps in the scoring process are:
- group_pixels.py: reads the list of pixels and generates
a list of jobs.
- make_todo.py: reads the list of jobs
and generates a list of job groups.
Each job group is represented by a file with one job
(its starting pixel number) per line;
a sketch of this bookkeeping follows this list.
The jobs in a job group are not sequential;
in fact they are chosen randomly, so that if we score only a few
job groups we get pixels scattered across the sky.
The job group files are stored in a directory todo/.
After a job group is processed, its file is moved to another directory, done/.
- A third script takes a number of pixels as a command-line argument.
It chooses job groups (from todo/) containing at least that many pixels
and generates the Condor input files for
the jobs needed to score the pixels in those groups.
- Using condor commands (condor_submit_dag, condor_wait) we
run those jobs on the Atlas cluster.
- When the jobs finish, a post-processing step appends the multiplet
summary and pixel summary files
to "master" files, and sorts these master files by score.
This gives us the highest-scoring multiplets and pixels found so far.
It also moves the JSON "detail" files into a directory hierarchy.
Finally, it moves the job group files from todo/ to done/.
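A sketch of the job-group bookkeeping described above (the file
naming and the job-list format are placeholders):

    import os, random, shutil

    jobs = [int(line) for line in open("job_list")]  # one starting pixel per job
    random.shuffle(jobs)            # scatters each group across the sky

    os.makedirs("todo", exist_ok=True)
    os.makedirs("done", exist_ok=True)
    for i in range(0, len(jobs), 64):
        with open(f"todo/group_{i // 64}", "w") as f:
            f.writelines(f"{j}\n" for j in jobs[i:i + 64])

    def finish_group(name):
        shutil.move(os.path.join("todo", name), os.path.join("done", name))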
Next: Status and plans.