Some considerations regarding OpenCL MultiBeam app tuning from algorithm view
Author | Message |
---|---|
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Here: http://lunatics.kwsn.info/index.php/topic,1808.msg60931.html#msg60931 I tried to explain some of the peculiarities of VLAR tasks and the options (in progress) of the OpenCL MultiBeam app that could help deal with them. If any clarification is needed, please ask here and I will edit the original text if deemed required. (link corrected) |
William Send message Joined: 14 Feb 13 Posts: 2037 Credit: 17,689,662 RAC: 0 |
Here: http://lunatics.kwsn.info/index.php/topic,1808.msg60931.html#msg60931 I tried to explain some of the peculiarities of VLAR tasks and the options (in progress) of the OpenCL MultiBeam app that could help deal with them. Made the link clickable. A person who won't read has no advantage over one who can't read. (Mark Twain) |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Added a description of the -sbs N option. |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Added a description of the -use_sleep option. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13732 Credit: 208,696,464 RAC: 304 |
Here: http://lunatics.kwsn.info/index.php/topic,1808.msg60931.html#msg60931 I tried to explain some of the peculiarities of VLAR tasks and the options (in progress) of the OpenCL MultiBeam app that could help deal with them. Thank you! This is what I've been hoping for. Looking forward to the explanations of the other options. Knowing what they are & why, as well as having an idea of why particular values are selected, will make it a lot easier for people to tweak their settings intelligently, and not just randomly or asking Mike over & over again for values & settings to try. Grant Darwin NT |
BilBg Send message Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0 |
-period_iterations_num N What are the max and min # of "all available periods"? (1/50 of 1000 can be easy on the GPU, while 1/50 of 1000000 can be hard and cause lags.) 1/Nth is relative to the # of all available periods. Do we have a way to set the (absolute) max # of periods in a single kernel call? E.g. something like: -max_num_of_periods_per_call 1000 or -num_of_periods_per_call_max 1000 (i.e. if "1/Nth of all available periods" gives more than 1000, N should be changed automatically (for this sequence of calls) so that fewer than 1000 periods are computed in a single kernel call. I don't know if 1000 is anywhere near reality.) - ALF - "Find out what you don't do well ..... then don't do it!" :) |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
Here: http://lunatics.kwsn.info/index.php/topic,1808.msg60931.html#msg60931 I tried to explain some of the peculiarities of VLAR tasks and the options (in progress) of the OpenCL MultiBeam app that could help deal with them. . . Hi, . . Is it OK if I edit that and add it to the text file with the distribution package? |
Gianfranco Lizzio Send message Joined: 5 May 99 Posts: 39 Credit: 28,049,113 RAC: 87 |
Raistmer, with Arecibo data the processing times were always the same for the same AR. But now, with Green Bank data, that is no longer so: for the same AR the times are very different, depending on whether it's blc2, blc3, blc5 or blc6 data. The question then is, what changes as the blc changes? I asked Eric this question without receiving any response from him... I don't want to believe, I want to know! |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
-period_iterations_num N You are right about that. Currently, another parameter of the PulseFind algorithm is used to separate "easy" and "hard" cases. So some of the calls are not divided at all.
It's possible to add such a limitation (actually, this approach, namely a fixed number of periods per kernel call, was used in the original CUDA MB). But there is another issue that IMHO makes this approach as ineffective as the one I chose: kernel execution time depends not only on the number of periods. The length of the array to fold (btw, the number of periods is computed as a fraction of this size) influences kernel time directly. So, with the limit set at some particular number, we again get "easy" kernels and "hard" kernels, just because PulsePoTLen (the size of the array to fold) differs between tasks and between FFT sizes inside the same task. So, if one wants to improve the app's adaptation to different PulseFind kernels, one should look at how to derive/predict kernel call duration from a given PulsePoTLen size. Then the number of iterations for a particular kernel call can indeed be made a function of the given PulsePoTLen value. It's worth considering as a next possible optimization. P.S. I plan to embed something like my own profiler into the app so that some statistical data about kernel execution time on a particular device can be collected. Then, with such statistics, additional improvements (and, most importantly, automatic improvements) in adaptation become possible. Not true AI yet, but a step toward it, LoL. |
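The idea in the post above, making the number of iterations a function of PulsePoTLen, could be sketched roughly as follows. This is only an illustration in Python, not code from the app: the function name `choose_iterations`, the linear cost model, and the parameters `cost_per_element_us` and `target_kernel_us` are all assumptions, and the cost figure would have to come from profiling on the particular device.

```python
import math


def choose_iterations(pulse_pot_len, num_periods, cost_per_element_us, target_kernel_us):
    """Return how many kernel calls one PulseFind search should be split into.

    ASSUMED cost model (not taken from the app): the time of one kernel call
    grows as periods_in_call * pulse_pot_len * cost_per_element_us.
    """
    if pulse_pot_len <= 0 or num_periods <= 0:
        raise ValueError("pulse_pot_len and num_periods must be positive")
    if cost_per_element_us <= 0 or target_kernel_us <= 0:
        raise ValueError("cost and target must be positive")

    # Estimated time if every period were folded in one undivided kernel call.
    full_call_us = num_periods * pulse_pot_len * cost_per_element_us

    # Split into enough calls to stay under the target time,
    # but never into more calls than there are periods.
    iterations = math.ceil(full_call_us / target_kernel_us)
    return max(1, min(iterations, num_periods))


# Hypothetical numbers: a long array gets split, a short one goes undivided.
print(choose_iterations(1000, 100, 0.5, 10000))
print(choose_iterations(16, 100, 0.5, 10000))
```

With this kind of rule a short PulsePoTLen naturally ends up as a single "easy" call, while a long one is divided more finely, instead of using one fixed N for every call.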
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
What distribution package? What do you want to distribute (and where)? |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
to tweak their settings intelligently, and not just randomly or asking Mike over & over again for values & settings to try. Yep, that is exactly the goal I had in sacrificing a whole evening to those writings. |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Raistmer, I'm not familiar with these differences (Joe, we miss you so much :( )... Well, I'll try to look and spot some pattern. My first guess would be a different number of so-called icffts per task. An icfft is a chirp/FFT pair. The chirp component defines how many Doppler shift corrections (to account for the relative movement of Earth and a possible signal source) will be made per task. The FFT part defines how many (and which) frequency ranges/bands will be used in data processing for a given Doppler-shift-corrected data array. The number varies because, if the correction is too big, there is no sense in looking at some bands as even putatively correct. BTW, this explains the inherently (!) non-linear MultiBeam task progress. As I wrote somewhere before, it's an inherent property of the MB task. The CPU experiences the same inhomogeneity as the GPU while processing a task. What differs is the scale of the reaction to this inhomogeneity, which is quite low for the CPU and currently at its maximum in the NV SoG app. Processing advances from "no correction" (zero shift) to the highest one. Along the way some searches are switched off. How far the correction goes, and with what incremental step, depends on the task header (btw, as far as I can recall, the v6->v7 transition halved the chirp step size, so the number of icfft pairs roughly doubled). You can check this hypothesis by comparing the <chirps> tag in those tasks. It looks like: <chirps> <chirp_parameter_t> <chirp_limit>30</chirp_limit> <fft_len_flags>262136</fft_len_flags> </chirp_parameter_t> <chirp_parameter_t> <chirp_limit>100</chirp_limit> <fft_len_flags>65528</fft_len_flags> </chirp_parameter_t> </chirps> Will the parameters be the same or bigger in the longer task? Also, you can look at the state.sah file, at its very beginning: <ncfft>191816</ncfft> <cr>-9.673405e+001</cr> <fl>16384</fl> <prog>0.97944197</prog> The <prog> tag shows the progress made at the last checkpoint. Wait until it is near 100% (you need as late a checkpoint as possible).
In this example it's the last checkpoint of an offline bench run, ~98% of the task done, which is good enough. Then look at the very first tag, <ncfft>: it shows the number of the cfft pair where processing was at the time of the checkpoint. For an almost finished task it will be close to the maximum possible number of icfft pairs for this task. And finally (perhaps the easiest way ;) ) you can look into the task's stderr for any of my builds: ar=0.411824 NumCfft=199713 NumGauss=1147531846 NumPulse=226386633766 NumTriplet=452774020430 It lists the task's AR and the number of chirp/FFT pairs the task consists of: NumCfft=199713 Compare these numbers. Also compare the numbers of PoT searches to be done (the other 3 values in that line). |
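For anyone who wants to compare the <chirps> blocks of two tasks without reading them by eye, a small script may help. The sketch below is Python using only the standard library, fed with the XML quoted in the post above. The helper names `parse_chirps` and `fft_lengths` are hypothetical, and reading fft_len_flags as a bitmask of enabled FFT lengths is an assumption that this thread does not confirm.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<chirps>
  <chirp_parameter_t>
    <chirp_limit>30</chirp_limit>
    <fft_len_flags>262136</fft_len_flags>
  </chirp_parameter_t>
  <chirp_parameter_t>
    <chirp_limit>100</chirp_limit>
    <fft_len_flags>65528</fft_len_flags>
  </chirp_parameter_t>
</chirps>
"""


def parse_chirps(xml_text):
    """Return a list of (chirp_limit, fft_len_flags) pairs from a <chirps> block."""
    root = ET.fromstring(xml_text.strip())
    return [
        (int(p.findtext("chirp_limit")), int(p.findtext("fft_len_flags")))
        for p in root.iter("chirp_parameter_t")
    ]


def fft_lengths(flags):
    """ASSUMPTION: bit k set in flags means FFT length 2**k is enabled."""
    return [1 << k for k in range(flags.bit_length()) if (flags >> k) & 1]


for limit, flags in parse_chirps(SAMPLE):
    lengths = fft_lengths(flags)
    print(limit, flags, lengths[0], lengths[-1])
```

Running the same function over the header of a short task and a long task makes it easy to see whether the chirp limits or the flags differ between them.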
Gianfranco Lizzio Send message Joined: 5 May 99 Posts: 39 Credit: 28,049,113 RAC: 87 |
And finally (perhaps most easy way ;) ) you can look into task's stderr for any of my builds: Raistmer, as you suggested: blc5 data reports ar=0.007159 NumCfft=123489 NumGauss=0 NumPulse=54509597824 NumTriplet=67492265376 and blc6 data reports ar=0.006972 NumCfft=99877 NumGauss=0 NumPulse=29801966464 NumTriplet=42750031008, so your assumptions were correct. I don't want to believe, I want to know! |
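The comparison of the two stderr lines above can also be done with a few lines of Python. This is just a sketch: `parse_summary` and `ratios` are hypothetical helper names, and the regular expression assumes the line always has the `name=value` layout shown in this thread.

```python
import re


def parse_summary(line):
    """Parse 'ar=... NumCfft=... NumGauss=... NumPulse=... NumTriplet=...' into a dict."""
    fields = dict(re.findall(r"(\w+)=([-+.\deE]+)", line))
    # ar is a fraction, the Num* fields are integer counts.
    return {k: (float(v) if k == "ar" else int(v)) for k, v in fields.items()}


def ratios(a, b):
    """Ratio a/b for every count field, skipping fields that are zero in b."""
    return {k: a[k] / b[k] for k in a if k != "ar" and b.get(k)}


blc5 = parse_summary(
    "ar=0.007159 NumCfft=123489 NumGauss=0 NumPulse=54509597824 NumTriplet=67492265376"
)
blc6 = parse_summary(
    "ar=0.006972 NumCfft=99877 NumGauss=0 NumPulse=29801966464 NumTriplet=42750031008"
)

for name, value in ratios(blc5, blc6).items():
    print(f"{name}: blc5/blc6 = {value:.2f}")
```

For these two tasks the blc5 one has roughly 1.24 times the chirp/FFT pairs, about 1.83 times the pulse searches and about 1.58 times the triplet searches of the blc6 one, at nearly the same AR.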
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
. . Sorry I wasn't clear. What I meant was: can I add the info from those postings to the *.txt files that come with the BOINC/Lunatics distribution package? |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
yes |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
And now we have a new blc7 with (on a sample of one) <chirps> <chirp_parameter_t> <chirp_limit>30</chirp_limit> <fft_len_flags>262136</fft_len_flags> </chirp_parameter_t> <chirp_parameter_t> <chirp_limit>100</chirp_limit> <fft_len_flags>65528</fft_len_flags> </chirp_parameter_t> </chirps> |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
Thanks |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
And now we have a new blc7 with (on a sample of one) . . . . Well, I am still working on understanding that, but so far blc7 and blc6 seem to process in the same or similar runtimes; it is only blc5 that adds about 50% to runtimes. I am not seeing blc2 or blc3 in my allotment of WUs. |