Keeping up with data in 2002
Bob Bankay, Jeff Cobb, Eric Person
The search for extraterrestrial intelligence involves sifting through mountains of data. As of January 27, 2003, SETI@home receives over one million results per day from its participants, and each returned result includes information about the spikes, Gaussians, pulses, and triplets detected by the screensaver client's analysis. The graph below shows the number of results SETI@home has amassed since its beginning in 1999.
Examples of large-scale database modifications in 2002
To keep up with such a fast rate of data accumulation, SETI@home monitors and updates its data storage capacity on a regular basis. For example, last year our Informix database underwent major updates to handle our huge volume of spikes. The graph below shows that we have processed and stored almost 5 billion spikes (as of January 2003).
The following are examples of large-scale database modifications SETI@home performed in 2002:
Modification of Spike ID Field: Early in 2002 our database table for spikes ran out of assignable IDs: we hit the limit of roughly 2 billion IDs (2,147,483,647, the largest signed 32-bit value) available to identify rows in our spike table. The limit exists because Informix uses a 32-bit integer by default for a serial ID type. The solution was to define the ID as serial8 (a 64-bit integer) in a new spike table. A major side effect of this change was an increase in row size, since Informix pads row lengths to 8-byte boundaries. Consequently, we had to place the new spike table in a new, larger storage area. To insulate the tables from on-line activity during the data migration, we diverted all incoming results into temporary storage for the duration (see Improved Flow Control Processing below). Migrating all of the spike data to the new format took approximately 2 weeks. Over the following week the new spike table was renamed and tested against the old spike table to ensure that the data had been moved correctly. Once testing was completed, the new spike table was put into service.
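The two effects described above, the 32-bit ID ceiling and the growth in row size from 8-byte padding, can be illustrated with a little arithmetic. This is only a sketch in Python: the function names are hypothetical, the row lengths passed in are made-up examples rather than the real spike row layout, and the 4-byte to 8-byte widening is assumed from the sizes of 32-bit and 64-bit integers.

```python
# Largest IDs available from signed 32-bit and 64-bit integers.
INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1


def padded_row_size(raw_bytes: int, boundary: int = 8) -> int:
    """Round a row length up to the next multiple of `boundary` bytes."""
    return -(-raw_bytes // boundary) * boundary


def row_size_after_widening(old_raw_bytes: int) -> int:
    """Padded row length once a 4-byte ID is replaced by an 8-byte ID.

    `old_raw_bytes` is the unpadded length of the old row (hypothetical
    input; the actual spike row layout is not given in the text).
    """
    return padded_row_size(old_raw_bytes - 4 + 8)


if __name__ == "__main__":
    print("32-bit ID ceiling:", INT32_MAX)
    print("64-bit ID ceiling:", INT64_MAX)
    # A made-up 64-byte row grows to 72 bytes after the ID is widened.
    print(padded_row_size(64), "->", row_size_after_widening(64))
```

Because padding works in whole 8-byte steps, a row that already sits exactly on a boundary grows by a full 8 bytes when only 4 bytes of data are added, which across billions of rows helps explain why a larger storage area was needed.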
Improved Flow Control Processing: Flow control processing involves diverting incoming results into temporary storage during database maintenance tasks, then moving these results into permanent storage once maintenance is completed. Early in the project we could process real-time returned results and temporarily stored results simultaneously without difficulty. In 2002, however, the volume of incoming results grew to the point where we could no longer process results from temporary storage without causing significant slowdowns for on-line users. Over a period of a few weeks we determined that a fixed number of results should be accepted from temporary storage every 5 minutes, and only if the queue of on-line results was sufficiently low. This solution resolved the overloading problem.
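The throttling rule described above can be sketched as a single scheduling step. This is an illustration in Python, not the project's actual code: the function name, the batch size, and the queue threshold are all hypothetical, since the text gives only the 5-minute interval and not the specific values used.

```python
from collections import deque


def drain_step(temp_queue: deque, online_queue_len: int,
               batch_size: int, low_water_mark: int) -> list:
    """Run one flow-control tick (intended to be called every 5 minutes).

    Releases at most `batch_size` results from temporary storage, and only
    when the on-line queue is shorter than `low_water_mark`. Both values
    are hypothetical tuning parameters.
    """
    if online_queue_len >= low_water_mark:
        return []  # on-line traffic is busy: leave temporary storage alone
    released = []
    while temp_queue and len(released) < batch_size:
        released.append(temp_queue.popleft())  # oldest stored results first
    return released


if __name__ == "__main__":
    backlog = deque(range(10))  # stand-ins for temporarily stored results
    print(drain_step(backlog, online_queue_len=50, batch_size=3, low_water_mark=20))
    print(drain_step(backlog, online_queue_len=5, batch_size=3, low_water_mark=20))
```

Capping each batch bounds the extra load a single tick can add, and the queue check lets on-line results take priority whenever the system is busy.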
Increase of Database Storage Capacity: In 2001 we distributed the spike table among 8 Informix dbspaces. (A dbspace is a storage allocation that can store up to a fixed number of data rows.) Later in 2002 we found that the number of stored spikes had outgrown the capacity of the 8 dbspaces we had allocated. To solve this problem we replaced the 8-dbspace spike table with one spread across 16 dbspaces. We kept the old spike table because we need it for data integrity testing and signal archiving.
Improved Signal Archival: Near the end of 2002, we improved signal archiving by introducing a new table linking all results to their source tapes. (Previously we needed to perform an expensive query to access tape information for individual results.) This modification accelerated our rate of archiving signals to tape and deleting them from the online tables, keeping our database lean and efficient.
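The idea of a table linking results to tapes can be sketched with a small in-memory database. This uses Python's built-in sqlite3 module rather than Informix, and every table and column name here is hypothetical: the text does not describe the actual schema, only that a linking table replaced an expensive query.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a toy schema with a result-to-tape link table (names are hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE tape   (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE result (id INTEGER PRIMARY KEY);
        -- One row per result, pointing directly at its source tape.
        CREATE TABLE result_tape (
            result_id INTEGER PRIMARY KEY REFERENCES result(id),
            tape_id   INTEGER NOT NULL REFERENCES tape(id)
        );
        CREATE INDEX result_tape_by_tape ON result_tape(tape_id);
    """)
    conn.executemany("INSERT INTO tape VALUES (?, ?)", [(1, "tape_A"), (2, "tape_B")])
    conn.executemany("INSERT INTO result VALUES (?)", [(10,), (11,), (12,)])
    conn.executemany("INSERT INTO result_tape VALUES (?, ?)", [(10, 1), (11, 1), (12, 2)])
    return conn


def results_for_tape(conn: sqlite3.Connection, tape_id: int) -> list:
    """Return the IDs of all results from one tape via the link table."""
    rows = conn.execute(
        "SELECT result_id FROM result_tape WHERE tape_id = ? ORDER BY result_id",
        (tape_id,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = build_demo_db()
    print(results_for_tape(db, 1))
```

With an index on the tape column, finding every result that belongs to a tape being archived becomes a direct lookup instead of a query that has to work back through intermediate records.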