Some considerations regarding OpenCL MultiBeam app tuning from algorithm view

Message boards : Number crunching : Some considerations regarding OpenCL MultiBeam app tuning from algorithm view

Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1793964 - Posted: 6 Jun 2016, 13:18:58 UTC
Last modified: 6 Jun 2016, 13:51:07 UTC

Here: http://lunatics.kwsn.info/index.php/topic,1808.msg60931.html#msg60931 I tried to explain some of the peculiarities of VLAR tasks and the options (work in progress) of the OpenCL MultiBeam app that could help to deal with them.

If any clarification is needed, please ask here and I will edit the original text if deemed necessary.

(link corrected)
ID: 1793964 · Report as offensive
Profile William
Volunteer tester
Joined: 14 Feb 13
Posts: 2037
Credit: 17,689,662
RAC: 0
Message 1793968 - Posted: 6 Jun 2016, 13:34:31 UTC - in response to Message 1793964.  

Here: http://lunatics.kwsn.info/index.php/topic,1808.msg60931.html#msg60931 I tried to explain some of the peculiarities of VLAR tasks and the options (work in progress) of the OpenCL MultiBeam app that could help to deal with them.

If any clarification is needed, please ask here and I will edit the original text if deemed necessary.

made link clickable.
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 1793968 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1793972 - Posted: 6 Jun 2016, 14:29:32 UTC

added -sbs N option description
ID: 1793972 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1793991 - Posted: 6 Jun 2016, 15:54:58 UTC - in response to Message 1793972.  

added -use_sleep description
ID: 1793991 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13732
Credit: 208,696,464
RAC: 304
Australia
Message 1794098 - Posted: 6 Jun 2016, 22:19:09 UTC - in response to Message 1793964.  

Here: http://lunatics.kwsn.info/index.php/topic,1808.msg60931.html#msg60931 I tried to explain some of the peculiarities of VLAR tasks and the options (work in progress) of the OpenCL MultiBeam app that could help to deal with them.

Thank you!
This is what I've been hoping for. Looking forward to the explanations of the other options.
Knowing what they are & why, as well as an idea of why particular values are selected will make it a lot easier for people to tweak their settings intelligently, and not just randomly or asking Mike over & over again for values & settings to try.
Grant
Darwin NT
ID: 1794098 · Report as offensive
Profile BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1794138 - Posted: 7 Jun 2016, 0:39:51 UTC - in response to Message 1793964.  
Last modified: 7 Jun 2016, 0:49:30 UTC

-period_iterations_num N
"splits single call to N subsequent calls each of those process only 1/Nth of all available periods"

What is the max and min # of "all available periods"?
(1/50 of 1000 can be easy on GPU and 1/50 of 1000000 can be hard (lags))

1/Nth is relative to the # of all available periods

Do we have a way to set the (absolute) max # of periods in single kernel call?
e.g. something like:
-max_num_of_periods_per_call 1000
-num_of_periods_per_call_max 1000

(i.e. if "1/Nth of all available periods" gives more than 1000, N should be changed automatically (for this sequence of calls) so that fewer than 1000 periods are computed in a single kernel call.
(I don't know if 1000 is anywhere near reality.)
)
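
A minimal sketch of the cap proposed above, in Python purely for illustration. The option -max_num_of_periods_per_call is only a suggestion from this post, not an existing app option, and the function names are hypothetical. It assumes the periods are split evenly across the calls.

```python
import math

def effective_iterations(total_periods, n, max_periods_per_call):
    """Return the number of sub-calls to use for one PulseFind call.

    total_periods        -- all available periods for this call
    n                    -- user value of -period_iterations_num
    max_periods_per_call -- hypothetical absolute cap (not a real option)
    """
    if total_periods <= 0 or n <= 0 or max_periods_per_call <= 0:
        raise ValueError("all arguments must be positive")
    # Smallest number of calls that keeps every call at or under the cap.
    needed = math.ceil(total_periods / max_periods_per_call)
    # Never split less than the user asked for.
    return max(n, needed)

def periods_per_call(total_periods, iterations):
    """Largest number of periods any single call receives (even split)."""
    return math.ceil(total_periods / iterations)

for total in (1_000, 1_000_000):
    it = effective_iterations(total, n=50, max_periods_per_call=1000)
    print(total, it, periods_per_call(total, it))
```

With the two totals from the example above, N stays at 50 for 1000 periods and is raised for 1000000 periods so that no single call exceeds the cap.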
 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1794138 · Report as offensive
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1794139 - Posted: 7 Jun 2016, 0:44:11 UTC - in response to Message 1793964.  

Here: http://lunatics.kwsn.info/index.php/topic,1808.msg60931.html#msg60931 I tried to explain some of the peculiarities of VLAR tasks and the options (work in progress) of the OpenCL MultiBeam app that could help to deal with them.

If any clarification is needed, please ask here and I will edit the original text if deemed necessary.

(link corrected)



. . Hi,

. . Is it OK if I edit that and add it to the text file with the distribution package?
ID: 1794139 · Report as offensive
Profile Gianfranco Lizzio
Volunteer tester
Joined: 5 May 99
Posts: 39
Credit: 28,049,113
RAC: 87
Italy
Message 1794189 - Posted: 7 Jun 2016, 8:13:15 UTC - in response to Message 1793964.  

Raistmer,
with Arecibo data the processing times were always the same for the same AR. But with Green Bank data this is no longer so: at the same AR the times are very different, depending on whether it's blc2, blc3, blc5 or blc6 data. The question then is, what changes as the blc changes?
I asked Eric this question without receiving any response from him...
I don't want to believe, I want to know!
ID: 1794189 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1794207 - Posted: 7 Jun 2016, 9:05:29 UTC - in response to Message 1794138.  

-period_iterations_num N
"splits single call to N subsequent calls each of those process only 1/Nth of all available periods"

What is the max and min # of "all available periods"?
(1/50 of 1000 can be easy on GPU and 1/50 of 1000000 can be hard (lags))

You are right about that. Currently, another parameter of the PulseFind algorithm is used to separate "easy" cases from "hard" ones, so some calls are not divided at all.


Do we have a way to set the (absolute) max # of periods in single kernel call?
e.g. something like:
-max_num_of_periods_per_call 1000
-num_of_periods_per_call_max 1000

(i.e. if "1/Nth of all available periods" gives more than 1000, N should be changed automatically (for this sequence of calls) so that fewer than 1000 periods are computed in a single kernel call.
(I don't know if 1000 is anywhere near reality.)
)


It's possible to add such a limit (actually, this approach, namely a fixed number of periods per kernel call, was used in the original CUDA MB). But there is another issue that IMHO makes this approach as ineffective as the one I chose: kernel execution time depends not only on the number of periods. The length of the array to fold (btw, the number of periods is computed as a fraction of this size) influences kernel time directly.
So, with the limit set to some particular number, we again get "easy" kernels and "hard" kernels, just because PulsePoTLen (the size of the array to fold) differs between tasks and between FFT sizes inside the same task.

So, to improve the app's adaptation to different PulseFind kernels, one should look at how to derive/predict the kernel call duration from a given PulsePoTLen. Then the number of iterations for a particular kernel call can indeed be made a function of the given PulsePoTLen value.
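
As a rough illustration of that idea (not the app's actual code), the Python sketch below picks the iteration count from PulsePoTLen. The cost model (work proportional to the number of periods times the array length), the periods_fraction value and the work_budget are all assumptions made up for this example; real values would have to come from measurements on a particular device.

```python
import math

def iterations_for_call(pulse_pot_len, periods_fraction=0.5,
                        work_budget=1_000_000):
    """Choose how many sub-calls one PulseFind kernel call is split into.

    pulse_pot_len    -- size of the array to fold (PulsePoTLen)
    periods_fraction -- assumed ratio of periods to array length
    work_budget      -- assumed amount of work one sub-call may contain
    """
    if pulse_pot_len <= 0:
        raise ValueError("pulse_pot_len must be positive")
    # The number of periods is taken as a fraction of the array length.
    num_periods = max(1, int(pulse_pot_len * periods_fraction))
    # Hypothetical cost model: every period touches the whole array.
    estimated_work = num_periods * pulse_pot_len
    n = math.ceil(estimated_work / work_budget)
    # A call cannot be split into more pieces than it has periods.
    return max(1, min(n, num_periods))

for pot_len in (64, 1024, 4096, 16384):
    print(pot_len, iterations_for_call(pot_len))
```

Short arrays go through undivided, while long ones are split into more sub-calls, so the work per sub-call stays roughly bounded under the assumed model.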

It's worth considering as the next possible optimization.

P.S. I plan to embed something like my own profiler into the app so that some statistical data about kernel execution time on a particular device can be collected. Then, with such statistics, additional improvements (and, most importantly, automatic improvements) in adaptation become possible. Not true AI yet, but a step towards it, LoL.
ID: 1794207 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1794208 - Posted: 7 Jun 2016, 9:08:02 UTC - in response to Message 1794139.  


. . Is it OK if I edit that and add it to the text file with the distribution package?

What distribution package? What do you want to distribute (and where)?
ID: 1794208 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1794209 - Posted: 7 Jun 2016, 9:09:16 UTC - in response to Message 1794098.  

to tweak their settings intelligently, and not just randomly or asking Mike over & over again for values & settings to try.

Yep, that is exactly the goal for which I sacrificed a whole evening on those writings.
ID: 1794209 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1794213 - Posted: 7 Jun 2016, 9:34:48 UTC - in response to Message 1794189.  
Last modified: 7 Jun 2016, 9:36:52 UTC

Raistmer,
with Arecibo data the processing times were always the same for the same AR. But with Green Bank data this is no longer so: at the same AR the times are very different, depending on whether it's blc2, blc3, blc5 or blc6 data. The question then is, what changes as the blc changes?
I asked Eric this question without receiving any response from him...

I'm not familiar with these differences (Joe, we miss you so much :( )...
Well, I'll try to look and spot some pattern. My first guess would be a different number of so-called icffts per task. An icfft is a chirp/fft pair. The chirp component defines how many Doppler shift corrections (to account for the relative movement of Earth and a possible signal source) will be made per task. The fft part defines how many (and which) frequency ranges/bands will be used in processing a given Doppler-shift-corrected data array. The number varies because, if the correction is too big, there is no sense in looking at some bands as even putatively correct. BTW, this explains the inherently (!) non-linear progress of a MultiBeam task. As I wrote somewhere before, it's an inherent property of the MB task. The CPU experiences the same inhomogeneity as the GPU while processing a task. What differs is the scale of the reaction to this inhomogeneity, which is quite low for the CPU and currently at its maximum for the NV SoG app.
Processing advances from "no correction" (zero shift) to the highest one. Along the way some searches are switched off. How far the correction goes, and with what incremental step, depends on the task header (btw, the v6->v7 transition, as far as I can recall, halved the chirp step size, so the number of icfft pairs roughly doubled).

You can check this hypothesis by comparing the <chirps> tag in those tasks.
It looks like:

<chirps>
<chirp_parameter_t>
<chirp_limit>30</chirp_limit>
<fft_len_flags>262136</fft_len_flags>
</chirp_parameter_t>
<chirp_parameter_t>
<chirp_limit>100</chirp_limit>
<fft_len_flags>65528</fft_len_flags>
</chirp_parameter_t>
</chirps>
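
To make the comparison between tasks easier, here is a small Python sketch that reads such a <chirps> block. It rests on an assumption not stated in this thread: that <fft_len_flags> is a bitmask in which each set bit enables the FFT length equal to that bit's value. The function names are invented for this example.

```python
import xml.etree.ElementTree as ET

CHIRPS_XML = """<chirps>
<chirp_parameter_t>
<chirp_limit>30</chirp_limit>
<fft_len_flags>262136</fft_len_flags>
</chirp_parameter_t>
<chirp_parameter_t>
<chirp_limit>100</chirp_limit>
<fft_len_flags>65528</fft_len_flags>
</chirp_parameter_t>
</chirps>"""

def decode_fft_lengths(flags):
    """Assumed meaning: bit k set => FFT length 2**k is enabled."""
    return [1 << bit for bit in range(flags.bit_length())
            if flags & (1 << bit)]

def parse_chirps(xml_text):
    """Return a list of (chirp_limit, enabled FFT lengths) tuples."""
    root = ET.fromstring(xml_text)
    entries = []
    for node in root.findall("chirp_parameter_t"):
        limit = float(node.findtext("chirp_limit"))
        flags = int(node.findtext("fft_len_flags"))
        entries.append((limit, decode_fft_lengths(flags)))
    return entries

for limit, lengths in parse_chirps(CHIRPS_XML):
    print("chirp_limit=%g: FFT lengths %d..%d (%d sizes)"
          % (limit, lengths[0], lengths[-1], len(lengths)))
```

Running the same function over the headers of a short and a long task shows at a glance whether the limits or the enabled lengths differ.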

Are the parameters the same or bigger in the longer tasks?

Also, you can look at the very beginning of the state.sah file:

<ncfft>191816</ncfft>
<cr>-9.673405e+001</cr>
<fl>16384</fl>
<prog>0.97944197</prog>

The <prog> tag shows the progress made at the last checkpoint. Wait until it is near 100%, i.e. close to 1.0 in the tag (you need as late a checkpoint as possible).
In this example it's the last checkpoint of some offline bench, ~98% of the task done, good enough. Then look at the very first tag, <ncfft>: it shows the number of the icfft pair where processing was at the time of the checkpoint. For an almost finished task it will be almost the maximum possible number of icfft pairs for this task.
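
A minimal Python sketch for pulling these two values out of the checkpoint text. It only assumes the tag layout shown in the fragment above; the fragment has no single root element, so a regular expression is used instead of an XML parser. The helper names are hypothetical, and the other tags are left uninterpreted.

```python
import re

STATE_SAMPLE = """<ncfft>191816</ncfft>
<cr>-9.673405e+001</cr>
<fl>16384</fl>
<prog>0.97944197</prog>
"""

def read_tag(text, name):
    """Return the text inside the first <name>...</name>, or None."""
    pattern = r"<{0}>\s*([^<]+?)\s*</{0}>".format(re.escape(name))
    match = re.search(pattern, text)
    return match.group(1) if match else None

def checkpoint_summary(text):
    """Return (icfft pair index at checkpoint, progress fraction)."""
    ncfft = read_tag(text, "ncfft")
    prog = read_tag(text, "prog")
    if ncfft is None or prog is None:
        raise ValueError("ncfft or prog tag not found")
    return int(ncfft), float(prog)

ncfft, prog = checkpoint_summary(STATE_SAMPLE)
print("icfft pairs reached: %d, progress: %.1f%%" % (ncfft, prog * 100))
```

Because task progress is non-linear, the sketch only reports the values and does not try to extrapolate the total number of icfft pairs from them.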

And finally (perhaps the easiest way ;) ) you can look into the task's stderr for any of my builds:

ar=0.411824 NumCfft=199713 NumGauss=1147531846 NumPulse=226386633766 NumTriplet=452774020430


It lists the task's AR and the number of chirp/fft pairs the task consists of: NumCfft=199713

Compare these numbers. Also compare the numbers of PoT searches to be done (the other 3 values in that line).
ID: 1794213 · Report as offensive
Profile Gianfranco Lizzio
Volunteer tester
Joined: 5 May 99
Posts: 39
Credit: 28,049,113
RAC: 87
Italy
Message 1794237 - Posted: 7 Jun 2016, 11:42:07 UTC - in response to Message 1794213.  

And finally (perhaps the easiest way ;) ) you can look into the task's stderr for any of my builds:


Raistmer as you suggested

blc5 data reports

ar=0.007159 NumCfft=123489 NumGauss=0 NumPulse=54509597824 NumTriplet=67492265376


blc6 data reports

ar=0.006972 NumCfft=99877 NumGauss=0 NumPulse=29801966464 NumTriplet=42750031008


and so your assumptions were correct.
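
For anyone who wants to repeat the comparison, a small Python sketch that parses two such stderr lines and prints the ratios between them. The line format is taken from the examples in this thread; the function names are made up here.

```python
import re

BLC5_LINE = ("ar=0.007159 NumCfft=123489 NumGauss=0 "
             "NumPulse=54509597824 NumTriplet=67492265376")
BLC6_LINE = ("ar=0.006972 NumCfft=99877 NumGauss=0 "
             "NumPulse=29801966464 NumTriplet=42750031008")

def parse_summary(line):
    """Turn 'key=value key=value ...' into a dict of numbers."""
    pairs = re.findall(r"(\w+)=([-+.\deE]+)", line)
    return {key: float(value) if key == "ar" else int(value)
            for key, value in pairs}

def ratios(first, second):
    """Ratio first/second for every counter that is non-zero in second."""
    return {key: first[key] / second[key]
            for key in first
            if key != "ar" and second.get(key)}

blc5 = parse_summary(BLC5_LINE)
blc6 = parse_summary(BLC6_LINE)
for key, value in ratios(blc5, blc6).items():
    print("%s: blc5/blc6 = %.2f" % (key, value))
```

Counters that are zero in the second task (NumGauss here) are skipped to avoid dividing by zero.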
I don't want to believe, I want to know!
ID: 1794237 · Report as offensive
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1794302 - Posted: 8 Jun 2016, 0:37:51 UTC - in response to Message 1794208.  


. . Is it OK if I edit that and add it to the text file with the distribution package?

What distribution package? What do you want to distribute (and where)?


. . Sorry I wasn't clear. What I meant was can I add the info from those postings to the *.txt files that came with the BOINC/Lunatics distribution package?
ID: 1794302 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1794340 - Posted: 8 Jun 2016, 4:35:08 UTC - in response to Message 1794302.  


. . Is it OK if I edit that and add it to the text file with the distribution package?

What distribution package? What do you want to distribute (and where)?


. . Sorry I wasn't clear. What I meant was can I add the info from those postings to the *.txt files that came with the BOINC/Lunatics distribution package?

yes
ID: 1794340 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1794682 - Posted: 9 Jun 2016, 11:32:49 UTC

And now we have a new blc7 with (on a sample of one)

<chirps>
<chirp_parameter_t>
<chirp_limit>30</chirp_limit>
<fft_len_flags>262136</fft_len_flags>
</chirp_parameter_t>
<chirp_parameter_t>
<chirp_limit>100</chirp_limit>
<fft_len_flags>65528</fft_len_flags>
</chirp_parameter_t>
</chirps>
ID: 1794682 · Report as offensive
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1794866 - Posted: 10 Jun 2016, 0:04:49 UTC - in response to Message 1794340.  


. . Is it OK if I edit that and add it to the text file with the distribution package?

What distribution package? What do you want to distribute (and where)?


. . Sorry I wasn't clear. What I meant was can I add the info from those postings to the *.txt files that came with the BOINC/Lunatics distribution package?

yes


Thanks
ID: 1794866 · Report as offensive
Stephen "Heretic" (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1794875 - Posted: 10 Jun 2016, 0:22:28 UTC - in response to Message 1794682.  

And now we have a new blc7 with (on a sample of one)

<chirps>
<chirp_parameter_t>
<chirp_limit>30</chirp_limit>
<fft_len_flags>262136</fft_len_flags>
</chirp_parameter_t>
<chirp_parameter_t>
<chirp_limit>100</chirp_limit>
<fft_len_flags>65528</fft_len_flags>
</chirp_parameter_t>
</chirps>


. .

. . Well I am still working on understanding that but so far blc7 and blc6 seem to process in the same or similar runtimes, it is only blc5 that adds about 50% to runtimes. I am not seeing blc2 or blc3 in my allotment of WUs
ID: 1794875 · Report as offensive



 