Some considerations regarding OpenCL MultiBeam app tuning from algorithm view

Author	Message
Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1793964 - Posted: 6 Jun 2016, 13:18:58 UTC Last modified: 6 Jun 2016, 13:51:07 UTC
Here: http://lunatics.kwsn.info/index.php/topic,1808.msg60931.html#msg60931 I tried to explain some of the peculiarities of VLAR tasks and the options (in progress) of the OpenCL MultiBeam app that could help to deal with them. If some clarification is needed, please ask here and I will edit the original text if deemed required. (link corrected)

William Volunteer tester Joined: 14 Feb 13 Posts: 2037 Credit: 17,689,662 RAC: 0	Message 1793968 - Posted: 6 Jun 2016, 13:34:31 UTC - in response to Message 1793964.
Raistmer wrote:
Here: http://lunatics.kwsn.info/index.php/topic,1808.msg60931.html#msg60931 I tried to explain some of the peculiarities of VLAR tasks and the options (in progress) of the OpenCL MultiBeam app that could help to deal with them. If some clarification is needed, please ask here and I will edit the original text if deemed required.

made link clickable.

A person who won't read has no advantage over one who can't read. (Mark Twain)

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1793972 - Posted: 6 Jun 2016, 14:29:32 UTC
Added a description of the -sbs N option.

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1793991 - Posted: 6 Jun 2016, 15:54:58 UTC - in response to Message 1793972.
Added a description of -use_sleep.

Grant (SSSF) Volunteer tester Joined: 19 Aug 99 Posts: 13732 Credit: 208,696,464 RAC: 304	Message 1794098 - Posted: 6 Jun 2016, 22:19:09 UTC - in response to Message 1793964.
Raistmer wrote:
Here: http://lunatics.kwsn.info/index.php/topic,1808.msg60931.html#msg60931 I tried to explain some of the peculiarities of VLAR tasks and the options (in progress) of the OpenCL MultiBeam app that could help to deal with them.

Thank you! This is what I've been hoping for. Looking forward to the explanations of the other options. Knowing what they are & why, as well as having an idea of why particular values are selected, will make it a lot easier for people to tweak their settings intelligently, and not just randomly or by asking Mike over & over again for values & settings to try.

Grant
Darwin NT

BilBg Volunteer tester Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0	Message 1794138 - Posted: 7 Jun 2016, 0:39:51 UTC - in response to Message 1793964. Last modified: 7 Jun 2016, 0:49:30 UTC
-period_iterations_num N
"splits single call to N subsequent calls each of those process only 1/Nth of all available periods"

What is the max and min # of "all available periods"? (1/50 of 1000 can be easy on the GPU and 1/50 of 1000000 can be hard (lags).) 1/Nth is relative to the # of all available periods.

Do we have a way to set the (absolute) max # of periods in a single kernel call? E.g. something like:
-max_num_of_periods_per_call 1000
-num_of_periods_per_call_max 1000
(i.e. if "1/Nth of all available periods" gives more than 1000, N should be changed automatically (for this sequence of calls) so that fewer than 1000 periods are computed in a single kernel call. I don't know if 1000 is anywhere near reality.)

- ALF - "Find out what you don't do well ..... then don't do it!" :)
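
The arithmetic behind this question can be sketched in a few lines of Python. This is only an illustration of the proposal, not code from the app: the function names and the `max_periods_per_call` parameter are hypothetical (the app has no such option at the time of this thread), and the period counts are the made-up figures from the post above.

```python
import math

def periods_per_call(total_periods: int, n_iterations: int) -> int:
    """Periods handled by one kernel call when the work is split N ways."""
    return math.ceil(total_periods / n_iterations)

def iterations_with_cap(total_periods: int, n_iterations: int,
                        max_periods_per_call: int) -> int:
    """Raise N, if needed, so that no call exceeds the (hypothetical) cap."""
    needed = math.ceil(total_periods / max_periods_per_call)
    return max(n_iterations, needed)

# With a fixed N = 50 the per-call load scales with the total:
small = periods_per_call(1_000, 50)        # light call
large = periods_per_call(1_000_000, 50)    # much heavier call

# With a cap of 1000 periods per call, N grows only for the big case:
n_small = iterations_with_cap(1_000, 50, 1_000)
n_large = iterations_with_cap(1_000_000, 50, 1_000)
```

A fixed N keeps the number of calls constant and lets the per-call load float; the cap does the opposite.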

Stephen "Heretic" Volunteer tester Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1794139 - Posted: 7 Jun 2016, 0:44:11 UTC - in response to Message 1793964.
Raistmer wrote:
Here: http://lunatics.kwsn.info/index.php/topic,1808.msg60931.html#msg60931 I tried to explain some of the peculiarities of VLAR tasks and the options (in progress) of the OpenCL MultiBeam app that could help to deal with them. If some clarification is needed, please ask here and I will edit the original text if deemed required. (link corrected)

. . Hi,
. . Is it OK if I edit that and add it to the text file with the distribution package?

Gianfranco Lizzio Volunteer tester Joined: 5 May 99 Posts: 39 Credit: 28,049,113 RAC: 87	Message 1794189 - Posted: 7 Jun 2016, 8:13:15 UTC - in response to Message 1793964.
Raistmer, with Arecibo data the processing times were always the same for the same AR. But now, with data from Green Bank, this is no longer so: with the same AR the times are very different, depending on whether it's blc2, blc3, blc5 or blc6 data. The question then is, what changes as the blc changes? I asked Eric this question without receiving any response from him...

I don't want to believe, I want to know!

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1794207 - Posted: 7 Jun 2016, 9:05:29 UTC - in response to Message 1794138.
BilBg wrote:
-period_iterations_num N
"splits single call to N subsequent calls each of those process only 1/Nth of all available periods"
What is the max and min # of "all available periods"? (1/50 of 1000 can be easy on the GPU and 1/50 of 1000000 can be hard (lags).)

You are right about that. Currently, another parameter of the PulseFind algorithm is used to separate "easy" and "hard" cases, so some calls are not divided at all.

BilBg wrote:
Do we have a way to set the (absolute) max # of periods in a single kernel call? E.g. something like:
-max_num_of_periods_per_call 1000
-num_of_periods_per_call_max 1000
(i.e. if "1/Nth of all available periods" gives more than 1000, N should be changed automatically (for this sequence of calls) so that fewer than 1000 periods are computed in a single kernel call. I don't know if 1000 is anywhere near reality.)

It's possible to add such a limit (actually, this approach, namely a fixed number of periods per kernel call, is used in the original CUDA MB). But there is another issue that makes this approach, IMHO, as ineffective as the one I chose: kernel execution time depends not only on the number of periods. The length of the array to fold (btw, the number of periods is computed as a fraction of this size) influences kernel time directly. So, with the limit set at some particular number, we again get "easy" kernels and "hard" kernels, just because PulsePoTLen (the size of the array to fold) differs between tasks and between FFT sizes inside the same task.

So, to improve the app's adaptation to different PulseFind kernels, one should look at how to derive/predict kernel call length from a given PulsePoTLen size. Then the number of iterations for a particular kernel call can indeed be made a function of the given PulsePoTLen value. It's worth considering as the next possible optimization.

P.S. I plan to embed something like my own profiler into the app so that some statistical data about kernel execution time on a particular device can be collected. Then, with such statistics, additional improvements (and, most importantly, automatic improvements) in adaptation become possible. Not true AI yet, but a step towards it, LoL.
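
As a rough illustration of the idea in this post (not the app's actual code), the sketch below collects kernel timings and uses them to pick the number of iterations for a call. Everything here is an assumption made for the example: the `KernelProfile` class, its method names, the `target_ms` budget, and above all the cost model, which simply takes kernel time to be proportional to PulsePoTLen times the number of periods. The real dependence would have to be measured on each device.

```python
import math

class KernelProfile:
    """Hypothetical per-device statistics for PulseFind kernel calls."""

    def __init__(self) -> None:
        self.total_work = 0.0   # sum of pot_len * periods over all samples
        self.total_ms = 0.0     # sum of measured kernel times

    def record(self, pot_len: int, periods: int, elapsed_ms: float) -> None:
        """Store one measured kernel call."""
        self.total_work += pot_len * periods
        self.total_ms += elapsed_ms

    def ms_per_unit(self) -> float:
        """Average cost of one (array element x period) unit of work."""
        if self.total_work == 0:
            raise ValueError("no samples recorded yet")
        return self.total_ms / self.total_work

    def iterations_for(self, pot_len: int, periods: int,
                       target_ms: float) -> int:
        """Split a call so each part is predicted to fit the time budget."""
        predicted_ms = self.ms_per_unit() * pot_len * periods
        return max(1, math.ceil(predicted_ms / target_ms))

profile = KernelProfile()
profile.record(pot_len=1024, periods=64, elapsed_ms=8.0)  # made-up sample

# A short array stays undivided; a long one is split into many calls.
easy = profile.iterations_for(pot_len=16, periods=4, target_ms=4.0)
hard = profile.iterations_for(pot_len=4096, periods=256, target_ms=4.0)
```

With this kind of model the iteration count becomes a function of PulsePoTLen rather than one fixed value, which is the adaptation the post describes.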

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1794208 - Posted: 7 Jun 2016, 9:08:02 UTC - in response to Message 1794139.
Stephen "Heretic" wrote:
. . Is it OK if I edit that and add it to the text file with the distribution package?

What distribution package? What do you want (and where?) to distribute?

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1794209 - Posted: 7 Jun 2016, 9:09:16 UTC - in response to Message 1794098.
Grant (SSSF) wrote:
to tweak their settings intelligently, and not just randomly or asking Mike over & over again for values & settings to try.

Yep, that is exactly the goal I had in sacrificing a whole evening to those writings.

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1794213 - Posted: 7 Jun 2016, 9:34:48 UTC - in response to Message 1794189. Last modified: 7 Jun 2016, 9:36:52 UTC
Gianfranco Lizzio wrote:
Raistmer, with Arecibo data the processing times were always the same for the same AR. But now, with data from Green Bank, this is no longer so: with the same AR the times are very different, depending on whether it's blc2, blc3, blc5 or blc6 data. The question then is, what changes as the blc changes? I asked Eric this question without receiving any response from him...

I'm not familiar with these differences (Joe, we miss you so much :( )... Well, I'll try to look and spot some pattern.

My first guess would be a different number of so-called icffts per task. An icfft is a chirp/fft pair. The chirp component defines how many Doppler shift corrections (to account for the relative movement of Earth and a possible signal source) will be made per task. The fft part defines how many (and which) frequency ranges/bands will be used in data processing for a given Doppler-shift-corrected data array. The number varies because, if the correction is too big, there is no sense in looking at some bands as even putatively correct.

BTW, this explains the inherently (!) non-linear progress of a MultiBeam task. As I wrote somewhere before, it's an inherent property of an MB task. The CPU experiences the same inhomogeneity as the GPU while processing a task. What differs is the scale of the reaction to this inhomogeneity, which is quite low for the CPU and currently largest for the NV SoG app. Processing advances from "no correction" (zero shift) to the highest one. Along the way some searches are switched off. How far the correction goes, and with what incremental step, depends on the task header (btw, as far as I can recall, the v6->v7 transition halved the chirp step size, so the number of icfft pairs roughly doubled).

You can check this hypothesis by comparing the <chirps> tag in those tasks. It looks like:
<chirps>
<chirp_parameter_t>
<chirp_limit>30</chirp_limit>
<fft_len_flags>262136</fft_len_flags>
</chirp_parameter_t>
<chirp_parameter_t>
<chirp_limit>100</chirp_limit>
<fft_len_flags>65528</fft_len_flags>
</chirp_parameter_t>
</chirps>
Will the parameters be the same or bigger in the longer task?

Also, you can look at the very beginning of the state.sah file:
<ncfft>191816</ncfft>
<cr>-9.673405e+001</cr>
<fl>16384</fl>
<prog>0.97944197</prog>
The <prog> tag shows the progress made at the last checkpoint. Wait until it is near 100% (you need as late a checkpoint as possible). In this example it's the last checkpoint of an offline bench, ~98% of the task done, good enough. Then look at the very first tag, <ncfft>: it shows the number of the cfft pair where processing was at the time of the checkpoint. For an almost finished task it will be almost the maximum possible number of icfft pairs for this task.

And finally (perhaps the easiest way ;) ) you can look into the task's stderr for any of my builds:
ar=0.411824 NumCfft=199713 NumGauss=1147531846 NumPulse=226386633766 NumTriplet=452774020430
It lists the task's AR and the number of chirp/fft pairs the task consists of: NumCfft=199713. Compare these numbers. Also compare the numbers of PoT searches to be done (the other 3 values in that line).
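
For anyone who wants to collect these numbers from many tasks, here is a minimal Python sketch that pulls them out of the two text formats quoted above. It assumes the stderr summary line and the state.sah tags look exactly like the samples in this post; the function names are made up for the example and are not part of any SETI@home tool.

```python
import re

PAIR_RE = re.compile(r"(\w+)=(\S+)")                   # key=value tokens
TAG_RE = re.compile(r"<(ncfft|prog)>([^<]+)</\1>")     # two state.sah tags

def _number(text: str):
    """Return an int when possible, otherwise a float."""
    try:
        return int(text)
    except ValueError:
        return float(text)

def parse_stderr_summary(line: str) -> dict:
    """Parse a line such as 'ar=0.41 NumCfft=199713 ...' into a dict."""
    return {key: _number(value) for key, value in PAIR_RE.findall(line)}

def parse_checkpoint(state_text: str) -> dict:
    """Extract <ncfft> and <prog> from the start of a state.sah file."""
    return {tag: _number(value) for tag, value in TAG_RE.findall(state_text)}

summary = parse_stderr_summary(
    "ar=0.411824 NumCfft=199713 NumGauss=1147531846 "
    "NumPulse=226386633766 NumTriplet=452774020430"
)
checkpoint = parse_checkpoint(
    "<ncfft>191816</ncfft> <cr>-9.673405e+001</cr> "
    "<fl>16384</fl> <prog>0.97944197</prog>"
)

# Fraction of chirp/fft pairs reached at the checkpoint, for comparison
# with the <prog> value reported by the app.
pairs_done = checkpoint["ncfft"] / summary["NumCfft"]
```

Note that the sample stderr line and the sample state.sah fragment in the post need not come from the same task, so `pairs_done` here only shows the calculation.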

Gianfranco Lizzio Volunteer tester Joined: 5 May 99 Posts: 39 Credit: 28,049,113 RAC: 87	Message 1794237 - Posted: 7 Jun 2016, 11:42:07 UTC - in response to Message 1794213.
Raistmer wrote:
And finally (perhaps the easiest way ;) ) you can look into the task's stderr for any of my builds:

Raistmer, as you suggested:
blc5 data reports
ar=0.007159 NumCfft=123489 NumGauss=0 NumPulse=54509597824 NumTriplet=67492265376
blc6 data reports
ar=0.006972 NumCfft=99877 NumGauss=0 NumPulse=29801966464 NumTriplet=42750031008
and so your assumptions were correct.

I don't want to believe, I want to know!
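
To put numbers on that comparison, the two reported lines can be turned into ratios. The sketch below uses only the figures posted above; the `work_ratios` helper is a made-up name for this example. On these two tasks the blc5 one has roughly 1.24 times the chirp/fft pairs, about 1.83 times the pulse searches and about 1.58 times the triplet searches of the blc6 one. This is a sample of one task each, so it shows the direction of the difference rather than a general rule.

```python
BLC5 = {"ar": 0.007159, "NumCfft": 123489, "NumGauss": 0,
        "NumPulse": 54509597824, "NumTriplet": 67492265376}
BLC6 = {"ar": 0.006972, "NumCfft": 99877, "NumGauss": 0,
        "NumPulse": 29801966464, "NumTriplet": 42750031008}

def work_ratios(a: dict, b: dict) -> dict:
    """Ratio a/b for each work counter; None when b has no such work."""
    ratios = {}
    for key in ("NumCfft", "NumGauss", "NumPulse", "NumTriplet"):
        ratios[key] = a[key] / b[key] if b[key] else None
    return ratios

ratios = work_ratios(BLC5, BLC6)
for key, value in ratios.items():
    print(key, "n/a" if value is None else f"{value:.3f}")
```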

Stephen "Heretic" Volunteer tester Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1794302 - Posted: 8 Jun 2016, 0:37:51 UTC - in response to Message 1794208.
Stephen "Heretic" wrote:
. . Is it OK if I edit that and add it to the text file with the distribution package?
Raistmer wrote:
What distribution package? What do you want (and where?) to distribute?

. . Sorry I wasn't clear. What I meant was: can I add the info from those postings to the *.txt files that came with the BOINC/Lunatics distribution package?

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1794340 - Posted: 8 Jun 2016, 4:35:08 UTC - in response to Message 1794302.
Stephen "Heretic" wrote:
. . Is it OK if I edit that and add it to the text file with the distribution package?
Raistmer wrote:
What distribution package? What do you want (and where?) to distribute?
Stephen "Heretic" wrote:
. . Sorry I wasn't clear. What I meant was: can I add the info from those postings to the *.txt files that came with the BOINC/Lunatics distribution package?

Yes.

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1794682 - Posted: 9 Jun 2016, 11:32:49 UTC
And now we have a new blc7 with (on a sample of one):
<chirps>
<chirp_parameter_t>
<chirp_limit>30</chirp_limit>
<fft_len_flags>262136</fft_len_flags>
</chirp_parameter_t>
<chirp_parameter_t>
<chirp_limit>100</chirp_limit>
<fft_len_flags>65528</fft_len_flags>
</chirp_parameter_t>
</chirps>
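
A <chirps> block like this can be compared between tasks with a few lines of Python. The sketch below is an illustration only: the function names are made up, and reading fft_len_flags as a bitmask in which bit k enables FFT length 2^k is an assumption that is not stated anywhere in this thread. Under that assumption, 262136 (2^18 - 8) would enable lengths 8 through 131072, and 65528 (2^16 - 8) would enable lengths 8 through 32768.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<chirps>
  <chirp_parameter_t>
    <chirp_limit>30</chirp_limit>
    <fft_len_flags>262136</fft_len_flags>
  </chirp_parameter_t>
  <chirp_parameter_t>
    <chirp_limit>100</chirp_limit>
    <fft_len_flags>65528</fft_len_flags>
  </chirp_parameter_t>
</chirps>
"""

def parse_chirps(xml_text: str) -> list:
    """Return (chirp_limit, fft_len_flags) for each chirp_parameter_t."""
    root = ET.fromstring(xml_text)
    return [
        (float(p.findtext("chirp_limit")), int(p.findtext("fft_len_flags")))
        for p in root.iter("chirp_parameter_t")
    ]

def fft_lengths(flags: int) -> list:
    """ASSUMPTION: bit k set in flags means FFT length 2**k is searched."""
    return [1 << bit for bit in range(flags.bit_length()) if (flags >> bit) & 1]

for limit, flags in parse_chirps(SAMPLE):
    lengths = fft_lengths(flags)
    print(f"chirp_limit={limit}: FFT lengths {lengths[0]}..{lengths[-1]}")
```

Two tasks with identical <chirps> blocks, as in the blc7 sample and the example in Message 1794213, give identical output here, so any runtime difference between them would have to come from something else.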

Stephen "Heretic" Volunteer tester Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1794866 - Posted: 10 Jun 2016, 0:04:49 UTC - in response to Message 1794340.
Stephen "Heretic" wrote:
. . Sorry I wasn't clear. What I meant was: can I add the info from those postings to the *.txt files that came with the BOINC/Lunatics distribution package?
Raistmer wrote:
Yes.

Thanks

Stephen "Heretic" Volunteer tester Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628	Message 1794875 - Posted: 10 Jun 2016, 0:22:28 UTC - in response to Message 1794682.
Richard Haselgrove wrote:
And now we have a new blc7 with (on a sample of one):
<chirps>
<chirp_parameter_t>
<chirp_limit>30</chirp_limit>
<fft_len_flags>262136</fft_len_flags>
</chirp_parameter_t>
<chirp_parameter_t>
<chirp_limit>100</chirp_limit>
<fft_len_flags>65528</fft_len_flags>
</chirp_parameter_t>
</chirps>

. . Well, I am still working on understanding that, but so far blc7 and blc6 seem to process in the same or similar runtimes; it is only blc5 that adds about 50% to runtimes. I am not seeing blc2 or blc3 in my allotment of WUs.
