Memory Access Efficient Pulse Folding Algorithms

Message boards : Number crunching : Memory Access Efficient Pulse Folding Algorithms
Message board moderation

To post messages, you must log in.

AuthorMessage
Profile Eddie Cottongim Project Donor
Volunteer tester

Send message
Joined: 15 Sep 01
Posts: 26
Credit: 6,149,499
RAC: 6
United States
Message 782067 - Posted: 13 Jul 2008, 7:12:31 UTC

I was wondering if anyone has come across pulse folding algorithms that minimize the required number of passes through memory. I haven't had much luck coming up with one myself. There are few papers that could be related but no definite hits yet.

Such an algorithm for the FFT exists, known as the "Four Step" FFT. It basically treats a big 1D FFT as a 2D FFT and deals with rows and columns at a time, achieving as low as two passes through memory at the cost of some more operations.

This has applications for FPGA implementation. FPGAs often don't have enough on board memory for an entire folding set. (Or if they do, it isn't just one big chunk, which is good and bad, but not simple in any case) They can have dozens of adders, but just one adder is enough to saturate a memory bus in a straightforward implementation like SETI@Home's. However, once the data set is small enough, they can work very rapidly.

Thanks!
ID: 782067 · Report as offensive
Profile Gecko
Volunteer tester
Avatar

Send message
Joined: 17 Nov 99
Posts: 454
Credit: 6,946,910
RAC: 47
United States
Message 782174 - Posted: 13 Jul 2008, 15:37:46 UTC

With the limited cache per PPE (512K) & SPEs (256K) on CBE, maybe this is a better approach?
It would also appear to suit GPU computing to minimize data trannsfer latency between GPU & system main memory?

Maybe Mimo & Dotsch can weigh-in on this & their familiarity re: above referenced H/W.
ID: 782174 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 782230 - Posted: 13 Jul 2008, 16:51:25 UTC
Last modified: 13 Jul 2008, 16:53:15 UTC

Funny, I was looking along similar FFA lines the other day and had similar problems locating material. All the material I came across was either derivative of an original 1969 staelin paper, or sufficiently vague descriptions of pulsar detection setups with no low-level algorithmic detail.

With similar proprietary signal processing (not SaH FFA) code later implemented in fpga hardware that I've seen, the implementations suffered the very same problems you describe when implemented in Streams-C derived directly from a software implementation.

Re-implementation of the mechanism directly in hardware VHDL by an experienced designer, in that case reduced logic area by half, doubled clock rate and completely eliminated the scratch RAM requirement.

The lesson learned from that was naive, even brute force, algorithms may be more appropriate where hardware parallelism is available, simply because of the high gate densities now available, that simpler algorithms are more easily divided, and that sophisticated 'cache oblivious' designs you describe meant for streaming through serial processors, being serial in nature, are inherently difficult to parallelise, and require intermediate storage at each step by design, which leads again to the problems you describe.

I haven't seen a cache oblivious FFA that I'm aware of, though my memory for such things has become shocking, and could likely have missed one in front of my face while preoccupied with other things.

I'd suggest consider what level of parallelism you have available (if any) and pick the simplest algorithm to maximise the use of it, and you may find a serial prologue and epilogue then easier to implement.

Sorry I can't be more helpful than that, but it really is something I've also been struggling with.

Jason
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 782230 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Send message
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 782243 - Posted: 13 Jul 2008, 17:09:13 UTC

The maximum array to be folded is 40960 single floats (set by the WU header <pulse_max>40960 parameter). In practice, the longest commonly encountered is 32768 in Very Low Angle Range WUs. Only WUs with angle range between 0.08 and 0.20 use anything longer, and they're quite rare.

In addition, the longest arrays in a WU represent the broadest bandwidth so are done relatively few times, for instance the 32K length for VLARs is used 2387 times during a full WU run. In those same VLARs, length 128 fold arrays are used 163107383 times. For S@H, the memory issue is more getting the array set up than folding it.

I haven't come across any variant folding algorithms which are aimed at memory efficiency, but I also haven't been looking for such.
                                                               Joe

ID: 782243 · Report as offensive

Message boards : Number crunching : Memory Access Efficient Pulse Folding Algorithms


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.