Message boards :
Number crunching :
Documentation of multibeam app internal programming?
Ben Send message Joined: 15 Jun 99 Posts: 54 Credit: 60,003,756 RAC: 150 |
Is there any documentation for how the multibeam app works? I would like to work on the OpenCL GPU version, but so far it is slow going. Also, does anyone have any Linux-based debugger suggestions? CodeXL works with OpenCL but is so slow as to be unusable so far: a test work unit that takes 1.5 minutes on my RX 570 takes hours with CodeXL just to reach my breakpoint! Thank you. |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13161 Credit: 1,160,866,277 RAC: 1,873 |
Nothing that I know of. There are snippets of explanations about the search algorithm scattered in the Number Crunching threads and at Beta. You would need developer access to Lunatics to get to the source code and any documentation there might be. You'd probably have better luck starting a PM discussion with the developer, Raistmer. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Mike Send message Joined: 17 Feb 01 Posts: 34253 Credit: 79,922,639 RAC: 80 |
The source code is not at Lunatics; it's on the SETI SVN: https://setisvn.ssl.berkeley.edu/svn/ For Linux it would be better to try to get in contact with Urs Echternacht. He is porting Raistmer's code to Linux. With each crime and every kindness we birth our future. |
Ben Send message Joined: 15 Jun 99 Posts: 54 Credit: 60,003,756 RAC: 150 |
I did that; he suggested asking here. He also suggested that I read the CPU version, as that is easier to understand, so I will try that version first. Thank you. |
Ben Send message Joined: 15 Jun 99 Posts: 54 Credit: 60,003,756 RAC: 150 |
That is useful; I thought Raistmer was the main person. Thank you. |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13161 Credit: 1,160,866,277 RAC: 1,873 |
Raistmer is the prime developer. Urs developed the older beta SoG code for Linux but as far as I know hasn't been doing anything with the current code branch. But without knowing anything about the special developer accounts, I could be just guessing. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Regarding the code itself, it's here: https://setisvn.ssl.berkeley.edu/svn/branches/sah_v7_opt I attempted to add textual comments to ease initial code understanding, as well as to help recall which part does what after long breaks. Regarding CPU versions, keep in mind that there are two source trees for the CPU: the "stock" one and the "AKv8" one. I chose the optimized AKv8 CPU tree as the base for the GPU MB builds, while the original CUDA client and Petri's special CUDA app use the stock codebase. The stock CPU codebase is located here: https://setisvn.ssl.berkeley.edu/svn/seti_boinc Hence, for the OpenCL build, it is better to get an impression of the high-level app organization by looking into the stock codebase, and then compare the CPU and GPU builds by looking into the AKv8 source tree.

Regarding algorithm explanation: there were some descriptions of the MultiBeam algorithm I read here on the SETI@home site, but I was unable to find them quickly just recently. Maybe someone could provide direct links. It would also be worth reading the relatively recently added Nebula algorithm descriptions (and there could be links there to MultiBeam itself).

And don't be deceived by running times under the profiler. Modern GPU cards do really HUGE work in a very short time, and logging and processing everything that was done (which is CPU work, after all) is hard. That's why I recommend not going by how fast your GPU processes a task: it could be a minute, but that is still too much for the profiler. Usage of a shortened task is almost mandatory. To do that you need to open the task and edit these fields:

<chirps>
<chirp_parameter_t>
<chirp_limit>30</chirp_limit>
<chirp_parameter_t>
<chirp_limit>100</chirp_limit>

For benchmark tasks they are set to 3 and 10. But even shortened benchmark tasks are too long for GPU profiling. During app development I found it most convenient to modify the code itself and set a hard limit on the number of full iterations.
The corresponding place in OpenCL MB is:

//R: for profiling only, should be commented out in production release
// if(icfft>=500){ DoSyncExit(); }

The number can be adjusted if needed, but the first 500 iterations give enough info for profiling. Also, the sync exit terminates the app correctly, allowing the profiler to handle all the gathered info; a simple exit() will not work. Feel free to ask for more particular details, but better in this thread than by PM, so that all can benefit from the conversation (and find the thread in the board history later). In a PM one can't even see one's own answer after sending it. SETI apps news We're not gonna fight them. We're gonna transcend them. |
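The shortened-task edit Raistmer describes can also be scripted. A minimal sketch, assuming the work unit is small enough to read into memory and the `<chirp_limit>` tags appear exactly as in the excerpt above (for illustration it sets every limit to one value, whereas the post uses two different values):

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: rewrite every <chirp_limit>N</chirp_limit> value
// in the work-unit text to a smaller cap, producing a shortened task
// as described in the post. The tag layout is assumed to match the
// excerpt; a real work-unit file may differ.
std::string shorten_chirp_limits(const std::string& wu_text, int new_limit) {
    std::regex tag("<chirp_limit>[0-9.]+</chirp_limit>");
    return std::regex_replace(
        wu_text, tag,
        "<chirp_limit>" + std::to_string(new_limit) + "</chirp_limit>");
}
```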
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13720 Credit: 208,696,464 RAC: 304 |
In PM one even can’t see own answer after sending it. Yeah, I think that would be my biggest issue with the PM system as it stands. Grant Darwin NT |
Ben Send message Joined: 15 Jun 99 Posts: 54 Credit: 60,003,756 RAC: 150 |
The CPU version is definitely easier to understand, thanks. I have another question, though: why is the FFT part of the program embedded in a C++ source file and not kept as a separate file? It is really hard to follow. Thank you. |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Could you be more specific about which particular FFT you mean? There are FFTW, Ooura, clFFT, and cuFFT libraries in use for FFT, along with some codelets for particular FFT sizes. FFTW is generally used via library calls, so it is "in a separate file". But some sizes allow a closer interconnect between the FFT and the next algorithm stages. In such cases it is better, from a performance point of view, to perform the FFT plus something else in a single function call. All those different GPU versions are about performance improvement inside the same algorithmic framework. And in the GPU world, re-reading even from GPU memory (not to mention re-transfer from host memory; that's an absolute performance killer for non-APU devices) is costly. Often it is faster to re-compute something from data already in registers than to compute it once, store it in memory, and re-load it from memory later. SETI apps news We're not gonna fight them. We're gonna transcend them. |
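The fusion idea described above can be illustrated with a CPU-side analogy (my own toy example, not project code): doing the transform stage and the follow-on stage in one pass consumes each value while it is still in a register, instead of writing and re-reading an intermediate buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two-pass version: stage 1 writes an intermediate buffer to memory,
// stage 2 re-reads it. On a GPU that memory round trip is the costly part.
float power_two_pass(const std::vector<float>& x) {
    std::vector<float> tmp(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) tmp[i] = x[i] * x[i];
    float sum = 0.0f;
    for (float v : tmp) sum += v;
    return sum;
}

// Fused version: the "FFT + something else in a single call" idea --
// the squared value is consumed immediately, so no intermediate buffer
// is ever written out or re-loaded.
float power_fused(const std::vector<float>& x) {
    float sum = 0.0f;
    for (float v : x) sum += v * v;
    return sum;
}
```

Both functions compute the same result; only the memory traffic differs, which is exactly the trade-off the post describes.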
Ben Send message Joined: 15 Jun 99 Posts: 54 Credit: 60,003,756 RAC: 150 |
I was looking at src/OpenCL_FFT/fft_kernelstring.cpp. It seems to have OpenCL encapsulated in strings, or am I reading it wrong? I didn't realize SETI uses so many FFT libraries. |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Yes, that's the form in which I found that lib. And it's in such an unfamiliar form for a reason: it uses a codelets approach, so each particular kernel is generated "on the fly". With a static CL file that would be impossible to achieve without calling subroutines (i.e., losing performance on the calls). SETI apps news We're not gonna fight them. We're gonna transcend them. |
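The "kernel generated on the fly" approach can be sketched with a toy generator (the kernel body here is hypothetical; the real fft_kernelstring.cpp emits full codelets, this only shows the string-building idea):

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Toy illustration of runtime kernel-string generation: the FFT size is
// baked into the OpenCL source as a constant, so the butterfly stages can
// be specialized per size with no subroutine call at run time. The
// resulting string would be handed to clCreateProgramWithSource.
std::string make_kernel_source(unsigned fft_size) {
    std::ostringstream src;
    src << "__kernel void fft" << fft_size
        << "(__global float2* data) {\n"
        << "  const uint N = " << fft_size << ";\n"
        << "  // ... size-specialized butterfly stages emitted here ...\n"
        << "}\n";
    return src.str();
}
```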
Ben Send message Joined: 15 Jun 99 Posts: 54 Credit: 60,003,756 RAC: 150 |
Just one more question before I go back to reading the code. In the pulse finding routines "folding" is mentioned but I don't see anything that says what that means. Thank you. |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Folding is the addition of slices of time-scale data. If the slice size corresponds to a signal's period, one sees a sharp increase in such sums and can detect a signal with that particular period over the noise. Because we don't know which periods we need, we try all possible ones (with some reasonable step); hence so many different foldings are performed in the PulseFind part. It corresponds to AstroPulse's FFA (in my understanding). One of the unsolved enigmas for me is why the OpenCL AP implementation achieves much better speedups than PulseFind's in MB. Only bigger array sizes with fewer conditionals, or something else... Perhaps comparing with Petri's implementation would give me some hints. SETI apps news We're not gonna fight them. We're gonna transcend them. |
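Folding as described above can be sketched on the CPU in a few lines (an illustration only; the names are mine, not from the pulsefind code):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fold a power-over-time (PoT) series at a trial period: every sample is
// added into the bin given by its phase within the period. A real
// periodic pulse adds coherently, so when the trial period matches, the
// folded profile shows a sharp peak above the noise.
std::vector<float> fold(const std::vector<float>& pot, std::size_t period) {
    std::vector<float> profile(period, 0.0f);
    for (std::size_t i = 0; i < pot.size(); ++i)
        profile[i % period] += pot[i];
    return profile;
}
```

A pulse repeating every 4 samples, folded at period 4, piles all its energy into one bin; folded at a wrong period it smears out, which is why many trial periods must be tested.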
petri33 Send message Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
Hi, I'm sorry if I don't have enough energy in me to answer any more questions after work days before the summer comes, but I'd like to guess an answer to this AP/MB pulse-find speed enigma.

a) The AP has unroll. It may use more GPU SM units at a time (parallel processing). The original CUDA version of MB used only one SM before I implemented unroll in the CUDA code. (An SM is a streaming multiprocessor. There are 80 of those in a Titan V and about 70 in an RTX 2080/Ti, 28 in a GTX 1080 Ti, 20 in a GTX 1080. The 10x0 series runs 128 simultaneous threads and the Titan & RTX run 64, or something like that. A Titan V's performance is roughly twice the 1080's.) The unroll means at best about 40 times more performance, but it increases the need for housekeeping over which results came first (to maintain acceptable compatibility with the sequential CPU version).

b) The AP has a hand-coded part that does something like summing up the first 64 values, then 32, 16, 8, 4, 2, down to 1, with the values compared to a limit to check if something is found, all in an optimized way. All of that reduces the writes and reads to GPU memory when 'folding'. As Raistmer stated, memory access kills performance, even if done on GPU memory.

That is my guess. Raistmer has the best knowledge of the OpenCL and AP software internals. To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
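Point (b) above, the cascade of sums over 64, 32, 16, ... values with a threshold check at each level, can be sketched roughly like this (my reconstruction of the idea, not AP's actual code):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Repeatedly fold the array in half, summing the two halves in place,
// and check every intermediate sum against a detection threshold.
// Keeping the whole cascade in one pass is what avoids re-reading
// partial sums from GPU memory, the saving described in (b).
// Assumes a power-of-two input length.
bool cascade_fold_detect(std::vector<float> v, float threshold) {
    while (v.size() > 1) {
        std::size_t half = v.size() / 2;
        for (std::size_t i = 0; i < half; ++i)
            v[i] += v[i + half];        // fold upper half onto lower half
        v.resize(half);
        for (std::size_t i = 0; i < half; ++i)
            if (v[i] > threshold) return true;  // candidate pulse found
    }
    return false;
}
```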
Tom M Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462 |
Thank you. I am an amateur at programming but I am still interested in the discussion. A proud member of the OFA (Old Farts Association). |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Well, SoG's PulseFind has unrolls in some sense too... Maybe it's not PulseFind after all. But comparing CPU/GPU pairs for AP and MB, I see a better speedup on AP (staying in the OpenCL area on the GPU, of course). BTW, did anyone compare mean CPU temperatures doing MB vs AP tasks (optimized ones)? Device temperature could indicate computational density and, indirectly, optimization level (only indirectly, of course, because avoidable computations warm the device just as well as needed ones). SETI apps news We're not gonna fight them. We're gonna transcend them. |
Ben Send message Joined: 15 Jun 99 Posts: 54 Credit: 60,003,756 RAC: 150 |
So, a new question. In "pulsefind.h":

struct PoTPlan {
  unsigned long cperiod; // *0xC000 <-- what does cperiod represent?
};

Also, what are:

#define C3X2TO14 0xC000
#define C3X2TO13 0x6000
// constant for preplan limit
#define PPLANMAX 511

Thank you. |
Ben Send message Joined: 15 Jun 99 Posts: 54 Credit: 60,003,756 RAC: 150 |
One other thing: looking at the code, I'm guessing it chops up the PoT down to a minimum of 32 time slices, divided in half? Thank you. |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
No idea; I didn't touch those values in the GPU version. SETI apps news We're not gonna fight them. We're gonna transcend them. |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.