Documentation of multibeam app internal programming?

Message boards : Number crunching : Documentation of multibeam app internal programming?
Ben

Joined: 15 Jun 99
Posts: 54
Credit: 60,003,756
RAC: 150
United States
Message 1978958 - Posted: 6 Feb 2019, 19:37:35 UTC

Is there any documentation for how the multibeam app works? I would like to work on the OpenCL GPU version, but so far it is slow going. Also, does anyone have any Linux-based debugger suggestions? CodeXL works with OpenCL but is so slow as to be unusable so far. A test work unit that takes 1.5 minutes on my RX 570 takes hours with CodeXL just to reach my breakpoint!

Thank you.
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1978968 - Posted: 6 Feb 2019, 20:22:05 UTC - in response to Message 1978958.  

Nothing that I know of. There are snippets of explanations about the search algorithm scattered in the Number Crunching threads and at Beta. You would need developer access to Lunatics to get to the source code and any documentation that there might be. Probably better luck starting a PM discussion with the developer Raistmer.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34253
Credit: 79,922,639
RAC: 80
Germany
Message 1978986 - Posted: 6 Feb 2019, 21:42:27 UTC
Last modified: 6 Feb 2019, 21:43:21 UTC

The source code is not at Lunatics; it's on the SETI SVN: https://setisvn.ssl.berkeley.edu/svn/
For Linux it would be better to try to get in contact with Urs Echternacht.
He is porting Raistmer's code to Linux.


With each crime and every kindness we birth our future.
Ben
Joined: 15 Jun 99
Posts: 54
Credit: 60,003,756
RAC: 150
United States
Message 1979064 - Posted: 7 Feb 2019, 5:55:01 UTC - in response to Message 1978968.  

Nothing that I know of. There are snippets of explanations about the search algorithm scattered in the Number Crunching threads and at Beta. You would need developer access to Lunatics to get to the source code and any documentation that there might be. Probably better luck starting a PM discussion with the developer Raistmer.


I did that; he suggested asking here. He also suggested that I read the CPU version, as that is easier to understand. So I will try that version first.

Thank you.
Ben
Joined: 15 Jun 99
Posts: 54
Credit: 60,003,756
RAC: 150
United States
Message 1979065 - Posted: 7 Feb 2019, 5:57:23 UTC - in response to Message 1978986.  

The source code is not at Lunatics; it's on the SETI SVN: https://setisvn.ssl.berkeley.edu/svn/
For Linux it would be better to try to get in contact with Urs Echternacht.
He is porting Raistmer's code to Linux.


That is useful; I thought Raistmer was the main person.

Thank you.
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1979067 - Posted: 7 Feb 2019, 6:52:39 UTC

Raistmer is the prime developer. Urs developed the older beta SoG code for Linux but as far as I know hasn't been doing anything with the current code branch. But without knowing anything about the special developer accounts, I could be just guessing.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1979151 - Posted: 7 Feb 2019, 18:25:52 UTC
Last modified: 7 Feb 2019, 18:27:49 UTC

Regarding the code itself, it's here: https://setisvn.ssl.berkeley.edu/svn/branches/sah_v7_opt
I have tried to add textual comments to ease initial understanding of the code, as well as to help recall what each part does after long breaks.
And regarding CPU versions, keep in mind that there are two CPU source trees: the "stock" one and the "AKv8" one. I chose the optimized AKv8 CPU tree as the base for the GPU MB builds, while the original CUDA client and Petri's Special CUDA app use the stock codebase.
The stock CPU codebase is located here: https://setisvn.ssl.berkeley.edu/svn/seti_boinc
Hence, for the OpenCL build, it's better to get an impression of the high-level app organization by looking into the stock codebase, and then compare the CPU and GPU builds by looking into the AKv8 source tree.
Regarding algorithm explanation: there were some descriptions of the MultiBeam algorithm that I read here on the SETI@home site, but I was unable to find them quickly just recently. Maybe someone could provide direct links.
It would also be worth reading the relatively recently added Nebula algorithm descriptions (and those could contain the missing links to MultiBeam itself).
And don't be deceived by running times under a profiler. Modern GPU cards do a really HUGE amount of work in a very short time, and logging and processing everything that was done (which is CPU work, after all) is hard.
That's why I recommend not going by how fast your GPU processes a task. It could be a minute and still be too much for the profiler.
Using a shortened task is almost mandatory.
To do that you need to open the task file and edit these fields:

<chirps>
<chirp_parameter_t>
<chirp_limit>30</chirp_limit>

<chirp_parameter_t>
<chirp_limit>100</chirp_limit>

For benchmark tasks they are set to 3 and 10.
But even shortened benchmark tasks are too long for GPU profiling.
During app development I found it most convenient to modify the code itself and set a hard limit on the number of full iterations.

The corresponding place in OpenCL MB is:

//R:for profiling only, should be commented out in production release
// if(icfft>=500){ DoSyncExit(); }

The number can be adjusted if needed, but the first 500 iterations give enough info for profiling.
Also, the sync exit terminates the app correctly, allowing the profiler to handle all the gathered info; a simple exit() will not work.
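
In outline the pattern is just a counter guard inside the main analysis loop. A minimal sketch, assuming a simplified loop: the loop body, the limit constant and finish_profiling_run() are illustrative stand-ins, not names from the SETI@home source (the real code uses DoSyncExit() as shown above).

#include <cstdlib>

// Illustrative stand-in for the profiling cap described above.
static const int PROFILE_ICFFT_LIMIT = 500;

static void finish_profiling_run() {
    // In the real app this is where command queues are flushed and
    // resources released (DoSyncExit()), so the profiler can write out
    // the data it gathered; a bare exit() would lose that info.
    std::exit(0);
}

void analysis_main_loop(int num_cfft_steps) {
    for (int icfft = 0; icfft < num_cfft_steps; ++icfft) {
        // ... chirp / FFT / signal-search work for this step ...
        if (icfft >= PROFILE_ICFFT_LIMIT) {   // profiling builds only
            finish_profiling_run();
        }
    }
}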

Feel free to ask for more particular details, but better in this thread than by PM, so that everyone can benefit from the conversation (and can find the thread in the board's history later). In a PM one can't even see one's own answer after sending it.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1979196 - Posted: 7 Feb 2019, 22:58:13 UTC - in response to Message 1979151.  

In a PM one can't even see one's own answer after sending it.

Yeah, I think that would be my biggest issue with the PM system as it stands.
Grant
Darwin NT
Ben
Joined: 15 Jun 99
Posts: 54
Credit: 60,003,756
RAC: 150
United States
Message 1979217 - Posted: 8 Feb 2019, 2:29:02 UTC - in response to Message 1979151.  

The CPU version is definitely easier to understand, thanks. I have another question, though: why is the FFT part of the program embedded in a C++ file rather than kept in a separate file? It is really hard to follow.

Thank you.
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1979323 - Posted: 8 Feb 2019, 14:39:05 UTC - in response to Message 1979217.  

The CPU version is definitely easier to understand, thanks. I have another question, though: why is the FFT part of the program embedded in a C++ file rather than kept in a separate file? It is really hard to follow.

Thank you.

Could you be more specific about which particular FFT you mean? There are FFTW, Ooura, clFFT and cuFFT libraries in use for FFT, along with some codelets for particular FFT sizes. FFTW is generally used via library calls, so it is "in a separate file". But some sizes allow a closer interconnect between the FFT and the next algorithm stages; in such cases it is better, from a performance point of view, to perform the FFT plus something else in a single function call. All those different GPU versions are about performance improvement while staying inside the same algorithmic frame. And in the GPU world, re-reading even from GPU memory (not to mention re-transferring from host memory, which is an absolute performance killer for non-APU devices) is costly. It is often faster to re-compute something from data already in registers than to compute it once, store it in memory and re-load it from memory later.
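
A minimal CPU-side sketch of that trade-off, with made-up names and plain C++ rather than OpenCL (on a GPU each pass would typically be a separate kernel and the intermediate array would live in GPU memory):

#include <cmath>
#include <vector>

// Two passes: the intermediate 'power' array is stored and then re-read.
std::vector<float> two_pass(const std::vector<float>& re,
                            const std::vector<float>& im) {
    std::vector<float> power(re.size()), result(re.size());
    for (size_t i = 0; i < re.size(); ++i)
        power[i] = re[i] * re[i] + im[i] * im[i];   // pass 1: store
    for (size_t i = 0; i < re.size(); ++i)
        result[i] = std::sqrt(power[i]);            // pass 2: reload
    return result;
}

// Fused: the intermediate value stays in a register-like local, so the
// round trip through memory disappears.
std::vector<float> fused(const std::vector<float>& re,
                         const std::vector<float>& im) {
    std::vector<float> result(re.size());
    for (size_t i = 0; i < re.size(); ++i) {
        float p = re[i] * re[i] + im[i] * im[i];    // never hits memory
        result[i] = std::sqrt(p);
    }
    return result;
}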
SETI apps news
We're not gonna fight them. We're gonna transcend them.
Ben
Joined: 15 Jun 99
Posts: 54
Credit: 60,003,756
RAC: 150
United States
Message 1979423 - Posted: 8 Feb 2019, 21:48:05 UTC - in response to Message 1979323.  

Could you be more specific about which particular FFT you mean? There are FFTW, Ooura, clFFT and cuFFT libraries in use for FFT, along with some codelets for particular FFT sizes. FFTW is generally used via library calls, so it is "in a separate file". But some sizes allow a closer interconnect between the FFT and the next algorithm stages; in such cases it is better, from a performance point of view, to perform the FFT plus something else in a single function call. All those different GPU versions are about performance improvement while staying inside the same algorithmic frame. And in the GPU world, re-reading even from GPU memory (not to mention re-transferring from host memory, which is an absolute performance killer for non-APU devices) is costly. It is often faster to re-compute something from data already in registers than to compute it once, store it in memory and re-load it from memory later.

I was looking at: src/OpenCL_FFT/fft_kernelstring.cpp
It seems to have OpenCL encapsulated in strings, or am I reading it wrong?

I didn't realize SETI uses so many FFT libraries.
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1979429 - Posted: 8 Feb 2019, 22:08:35 UTC - in response to Message 1979423.  


I was looking at: src/OpenCL_FFT/fft_kernelstring.cpp
It seems to have OpenCL encapsulated in strings, or am I reading it wrong?

Yes, that's the form in which I found that lib. And it's in such an unfamiliar form for a reason: it uses a codelet approach, so each particular kernel is generated "on the fly".
With a static CL file that would be impossible to achieve without calling some subroutines (== losing performance on the calls).
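
As a rough illustration of the codelet idea only (this is not the actual fft_kernelstring.cpp, and the kernel it builds is a trivial sum rather than an FFT), the host generates the kernel source as a string for one fixed size, so the size-dependent loop can be emitted fully unrolled before the OpenCL compiler ever sees it:

#include <iostream>
#include <sstream>
#include <string>

std::string make_sum_kernel(int n) {
    std::ostringstream src;
    src << "__kernel void sum" << n << "(__global const float* in,\n"
        << "                             __global float* out)\n"
        << "{\n"
        << "    const int gid = get_global_id(0);\n"
        << "    float acc = 0.0f;\n";
    // The size-dependent loop is emitted fully unrolled, so the compiled
    // kernel is straight-line code with no loop overhead for this size.
    for (int i = 0; i < n; ++i)
        src << "    acc += in[gid * " << n << " + " << i << "];\n";
    src << "    out[gid] = acc;\n"
        << "}\n";
    return src.str();   // would go to clCreateProgramWithSource() on the host
}

int main() {
    std::cout << make_sum_kernel(8);   // print the generated kernel source
}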
SETI apps news
We're not gonna fight them. We're gonna transcend them.
Ben
Joined: 15 Jun 99
Posts: 54
Credit: 60,003,756
RAC: 150
United States
Message 1979444 - Posted: 8 Feb 2019, 22:46:09 UTC - in response to Message 1979429.  

Just one more question before I go back to reading the code. In the pulse finding routines "folding" is mentioned but I don't see anything that says what that means.

Thank you.
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1979449 - Posted: 8 Feb 2019, 23:05:00 UTC - in response to Message 1979444.  
Last modified: 8 Feb 2019, 23:05:49 UTC

Just one more question before I go back to reading the code. In the pulse finding routines "folding" is mentioned but I don't see anything that says what that means.

Thank you.

Folding is the addition of slices of time-scale data. If the slice size corresponds to the signal period, one sees a sharp increase in such sums and can detect a signal with that particular period above the noise. Because we don't know which periods we need, we try all possible ones (with some reasonable step), hence the many different foldings performed in the PulseFind part. It corresponds to AstroPulse's FFA (in my understanding). One of the unsolved enigmas for me is why the OpenCL AP implementation achieves much better speedups than MB's PulseFind does. Only bigger array sizes with fewer conditionals, or something else... Perhaps by comparing with Petri's implementation I would get some hints.
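
A minimal sketch of the idea, assuming nothing more than a plain array of power-over-time samples (the names are illustrative, not taken from the SETI source): fold the series at a trial period, score how sharply the folded profile peaks, then sweep the trial periods.

#include <algorithm>
#include <numeric>
#include <vector>

// Sum the series into 'period' bins (one folded profile).
std::vector<float> fold(const std::vector<float>& pot, size_t period) {
    std::vector<float> profile(period, 0.0f);
    for (size_t i = 0; i < pot.size(); ++i)
        profile[i % period] += pot[i];
    return profile;
}

// Peak-to-mean ratio of the folded profile: a pulse whose period matches
// 'period' piles up in one bin and pushes this ratio well above 1.
float fold_score(const std::vector<float>& pot, size_t period) {
    std::vector<float> profile = fold(pot, period);
    float peak = *std::max_element(profile.begin(), profile.end());
    float mean = std::accumulate(profile.begin(), profile.end(), 0.0f)
                 / profile.size();
    return peak / mean;
}

// The real PulseFind sweeps many trial periods; here we just scan a range.
size_t best_period(const std::vector<float>& pot,
                   size_t min_p, size_t max_p) {
    size_t best = min_p;
    float best_score = 0.0f;
    for (size_t p = min_p; p <= max_p; ++p) {
        float s = fold_score(pot, p);
        if (s > best_score) { best_score = s; best = p; }
    }
    return best;
}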
SETI apps news
We're not gonna fight them. We're gonna transcend them.
petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1979580 - Posted: 9 Feb 2019, 17:16:48 UTC - in response to Message 1979449.  

Just one more question before I go back to reading the code. In the pulse finding routines "folding" is mentioned but I don't see anything that says what that means.

Thank you.

Folding is the addition of slices of time-scale data. If the slice size corresponds to the signal period, one sees a sharp increase in such sums and can detect a signal with that particular period above the noise. Because we don't know which periods we need, we try all possible ones (with some reasonable step), hence the many different foldings performed in the PulseFind part. It corresponds to AstroPulse's FFA (in my understanding). One of the unsolved enigmas for me is why the OpenCL AP implementation achieves much better speedups than MB's PulseFind does. Only bigger array sizes with fewer conditionals, or something else... Perhaps by comparing with Petri's implementation I would get some hints.


Hi,

I'm sorry if I don't have enough energy in me to answer any more questions after work days before the summer comes, but I'd like to guess an answer to this AP/MB pulse-find speed enigma.

a) The AP has unroll. It may use more GPU SM units at a time (parallel processing). The original CUDA version of MB used only one SM before I added unrolling to the CUDA code. (An SM is a streaming multiprocessor; there are 80 of them in a Titan V and about 70 in an RTX 2080/Ti, 28 in a GTX 1080 Ti, 20 in a GTX 1080. The 10x0 series run 128 simultaneous threads per SM and the Titan & RTX run 64. Or something like that. A Titan V's performance is roughly twice that of a 1080.) -- The unroll means at best about 40 times more performance, but it increases the need for housekeeping to track which results came first (to maintain acceptable compatibility with the sequential CPU version).

b) The AP has a hand-coded part that does something like summing up the first 64 values, then 32, 16, 8, 4, 2, down to 1, with the values compared to a limit at each step to check whether something has been found, all in an optimized way (a rough sketch of that kind of halving reduction appears below). All of that reduces the writes and reads to GPU memory when 'folding'. As Raistmer stated, memory access kills performance, even when done on GPU memory.

That is my guess.

Raistmer has the best knowledge of the OpenCL and AP software internals.
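
Purely as an illustration of that halving pattern, in plain C++ with made-up names rather than a GPU kernel: starting from 64 values held locally, sum pairs down to 32, 16, 8, 4, 2 and 1 sums, testing each level against its threshold without ever writing the intermediate sums back to (GPU) memory.

#include <array>

// Illustrative only: in-place pairwise halving reduction with a per-level
// detection threshold. thresholds[k] applies to sums of 2^k consecutive
// samples (index 0, single samples, is unused here).
bool halving_fold_check(std::array<float, 64> v,
                        const std::array<float, 7>& thresholds) {
    int len = 64;
    for (int level = 1; level <= 6; ++level) {   // 32, 16, 8, 4, 2, 1 sums
        len /= 2;
        for (int i = 0; i < len; ++i) {
            v[i] = v[2 * i] + v[2 * i + 1];      // stays in local storage
            if (v[i] > thresholds[level])
                return true;                     // candidate pulse found
        }
    }
    return false;
}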
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
Tom M
Volunteer tester
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1979606 - Posted: 9 Feb 2019, 20:20:22 UTC

Thank you. I am an amateur at programming but I am still interested in the discussion.
A proud member of the OFA (Old Farts Association).
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1979719 - Posted: 10 Feb 2019, 15:08:15 UTC - in response to Message 1979580.  


a) The AP has unroll. It may use more GPU SM units at a time (parallel processing). The original CUDA version of MB used only one SM before I added unrolling to the CUDA code. (An SM is a streaming multiprocessor; there are 80 of them in a Titan V and about 70 in an RTX 2080/Ti, 28 in a GTX 1080 Ti, 20 in a GTX 1080. The 10x0 series run 128 simultaneous threads per SM and the Titan & RTX run 64. Or something like that. A Titan V's performance is roughly twice that of a 1080.) -- The unroll means at best about 40 times more performance, but it increases the need for housekeeping to track which results came first (to maintain acceptable compatibility with the sequential CPU version).

Well, SoG's PulseFind has unrolls in some sense too... Maybe it's not PulseFind after all.
But comparing CPU/GPU pairs for AP and MB, I see a better speedup on AP (staying in the OpenCL area on the GPU, of course).
BTW, did anyone compare mean CPU temperatures doing MB vs AP tasks (optimized ones)? Device temperature could indicate computational density and, indirectly, optimization level (only indirectly, of course, because avoidable computations warm a device as well as needed ones do).
SETI apps news
We're not gonna fight them. We're gonna transcend them.
Ben
Joined: 15 Jun 99
Posts: 54
Credit: 60,003,756
RAC: 150
United States
Message 1980366 - Posted: 14 Feb 2019, 21:24:31 UTC - in response to Message 1979719.  

So, a new question in "pulsefind.h":

struct PoTPlan {

unsigned long cperiod; // *0xC000 <-- what does cperiod represent?
}

Also what are:

#define C3X2TO14 0xC000
#define C3X2TO13 0x6000

// constant for preplan limit
#define PPLANMAX 511

Thank you.
Ben
Joined: 15 Jun 99
Posts: 54
Credit: 60,003,756
RAC: 150
United States
Message 1980370 - Posted: 14 Feb 2019, 21:30:35 UTC - in response to Message 1980366.  

One other thing: looking at the code, I'm guessing it chops up the PoT down to a minimum of 32 time slices, divided in half?

Thank you.
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1980476 - Posted: 15 Feb 2019, 10:50:55 UTC - in response to Message 1980366.  

So, a new question in "pulsefind.h":

struct PoTPlan {

unsigned long cperiod; // *0xC000 <-- what does cperiod represent?
}

Also what are:

#define C3X2TO14 0xC000
#define C3X2TO13 0x6000

// constant for preplan limit
#define PPLANMAX 511

Thank you.


No idea; I didn't touch those values in the GPU version.
SETI apps news
We're not gonna fight them. We're gonna transcend them.