OpenCL NV MultiBeam v8 SoG edition for Windows

Author	Message
Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1770397 - Posted: 8 Mar 2016, 9:52:23 UTC Also, please report the AR of any task where you see a slowdown. I expect some changes for low and mid ARs between r3366 and r3401, but no changes for high ARs. If you see a slowdown at a high AR value, make that clear, because it's unexpected.

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1770402 - Posted: 8 Mar 2016, 10:08:00 UTC @ all who have "alpha-tester" status on the Lunatics boards and have NV FERMI+ hardware, please read this: http://lunatics.kwsn.info/index.php/topic,1777.msg60748.html#msg60748 and draw your own conclusions.

Chris Adamek Volunteer tester Joined: 15 May 99 Posts: 251 Credit: 434,772,072 RAC: 236	Message 1770441 - Posted: 8 Mar 2016, 15:04:54 UTC - in response to Message 1770402. Speaking of lunatics, is there a way to register over there anymore? I used to have an account and it got lost somewhere in the transition and now it says it no longer accepts new users. Thanks, Chris

Jimbocous Volunteer tester Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349	Message 1770467 - Posted: 8 Mar 2016, 23:00:26 UTC - in response to Message 1770441. "Speaking of lunatics, is there a way to register over there anymore? I used to have an account and it got lost somewhere in the transition and now it says it no longer accepts new users. Thanks, Chris" Dunno. I tried for almost a year to get verified, finally gave it up and found another team to join. Seems like no one is minding the store. Went to Arkayn's place instead ...

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1770534 - Posted: 9 Mar 2016, 10:26:55 UTC - in response to Message 1770467. There are issues with site management.

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1770535 - Posted: 9 Mar 2016, 10:27:52 UTC All OpenCL Windows MultiBeam builds were updated on the Beta project; please test there to speed up release to main as the stock app.

Joe Januzzi Volunteer tester Joined: 13 Apr 03 Posts: 54 Credit: 307,134,110 RAC: 492	Message 1770580 - Posted: 9 Mar 2016, 19:20:25 UTC - in response to Message 1770535. FYI When I used V. 3401 my CPU was running mostly at 100%. The screen lagged for the first time. I tried different values for “increase -period_iterations_num” in mb_cmdline*.txt. No number stopped the screen lags. When I used V. 3366 the CPU only hit 100% at times, and when it did I had no screen lags. V. 3401 worked my system too hard. On either version, if I could throttle the CPU just a little, it would be real nice. Would using Tthrottle work? My RAC on my GTX 560 Ti running V. 3366 is going up, even with 1 CPU WU running. When my CPU WUs are done (like watching water boil), I'll test with GPU only. After that I'd like to run V. 3401, because I have only one card in this system. "All OpenCL Windows MultiBeam builds were updated on the Beta project; please test there to speed up release to main as the stock app." Raistmer, by the time I saw this post, I was running OpenCL Windows MultiBeam on main. I know you said “all” OpenCL on Beta. Do you mean starting at V3401 and up? Joe Real Join Date: Joe Januzzi (ID 253343) 29 Sep 1999, 22:30:36 UTC Try to learn something new everyday.

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1770581 - Posted: 9 Mar 2016, 19:24:29 UTC - in response to Message 1770580. "Would using Tthrottle work? Joe" It should. Regarding beta testing: the more hosts that are attached to beta and run at default settings, the sooner most bugs will be caught and the app released to main for all.

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1770610 - Posted: 9 Mar 2016, 21:41:01 UTC - in response to Message 1770582. Last modified: 9 Mar 2016, 21:41:44 UTC User settings have priority. This option just disables auto-tuning. Currently it disables tuning to a very high iterations number for low-end cards (for MB) and fetch and unroll auto-tuning for AP. If a user setting is detected, it will be used instead.

Chris Adamek Volunteer tester Joined: 15 May 99 Posts: 251 Credit: 434,772,072 RAC: 236	Message 1770630 - Posted: 9 Mar 2016, 22:43:52 UTC - in response to Message 1770614. Last modified: 9 Mar 2016, 23:18:01 UTC The new build has quite a bit less utilization (old version kept the GPU at about 97-98% vs 3401 bouncing between 71-84%) so I'm trying to adjust the -sbs and work group size as you described above. It's in beta, also using -v 8 so hopefully you can see a bit more about what's going on. Update: never did find a combination of those two values that smoothed out the GPU utilization. Bounces all over the place pretty much regardless of what those settings are. When I was running it on main, 3366 had an APR of around 350 GFlops, 3401 dropped that to 280-290. Numbers in beta are a bit less than that, but there's a much smaller run of wu's at the moment. Chris
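As a rough check on the figures in the post above, the APR drop from about 350 GFlops to 280-290 GFlops can be expressed as a percentage. This is only a minimal sketch: the helper name `percent_drop` is made up for illustration and is not part of any SETI@home or BOINC tool.

```python
def percent_drop(old, new):
    """Return how far `new` is below `old`, as a percentage of `old`."""
    return (old - new) / old * 100.0


if __name__ == "__main__":
    # APR figures reported in the post: about 350 GFlops on r3366,
    # 280-290 GFlops on r3401.
    for new_apr in (280, 290):
        print(f"{new_apr} GFlops is {percent_drop(350, new_apr):.1f}% below 350 GFlops")
```

With those figures, the reported slowdown works out to roughly 17-20%.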

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1770632 - Posted: 9 Mar 2016, 22:56:33 UTC - in response to Message 1770614. The -sbs N setting may need to change, because PulseFind behavior changed.

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1770636 - Posted: 9 Mar 2016, 23:13:15 UTC - in response to Message 1770630. "The new build has quite a bit less utilization (old version kept the GPU at about 97-98% vs 3401 bouncing between 71-84%) so I'm trying to adjust the -sbs and work group size as you described above. It's in beta, also using -v 8 so hopefully you can see a bit more about what's going on. Chris" -v 8 makes sense in offline runs. In a full-scale live run it just overflows the return buffer.

Chris Adamek Volunteer tester Joined: 15 May 99 Posts: 251 Credit: 434,772,072 RAC: 236	Message 1770638 - Posted: 9 Mar 2016, 23:19:23 UTC - in response to Message 1770636. "The new build has quite a bit less utilization (old version kept the GPU at about 97-98% vs 3401 bouncing between 71-84%) so I'm trying to adjust the -sbs and work group size as you described above. It's in beta, also using -v 8 so hopefully you can see a bit more about what's going on. Chris" "-v 8 makes sense in offline runs. In a full-scale live run it just overflows the return buffer." So I see. =) I'll try to get things downloaded to do some offline testing tonight after the little one is asleep. Thanks, Chris

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1770816 - Posted: 10 Mar 2016, 20:47:28 UTC - in response to Message 1770757. "The -sbs N setting may need to change, because PulseFind behavior changed." "OK, will take that into consideration when I change app, the coming weekend." Offline tests show a speedup for both AMD and NV apps with default settings.

Joe Januzzi Volunteer tester Joined: 13 Apr 03 Posts: 54 Credit: 307,134,110 RAC: 492	Message 1771199 - Posted: 12 Mar 2016, 16:37:16 UTC - in response to Message 1771000. FYI Running Version #3401 (SoG). I hope I can fine-tune each video card separately. Added the “MultiBeam_NV_config.xml” file to the mix. Won't be able to make any adjustments until Sunday (fishing). Raistmer, do you think that some day this “<instances_per_device>N</instances_per_device>” could be added to the MultiBeam_NV_config.xml file? Still didn't use Tthrottle. CPU usage is higher than V. 3366 on my system. I still hit 100% at times, but with a lot fewer screen lags. I can live with that. Here are some WUs. Device 0 http://setiathome.berkeley.edu/workunit.php?wuid=2089786243 Device 1 http://setiathome.berkeley.edu/workunit.php?wuid=2090191161 Device 2 http://setiathome.berkeley.edu/workunit.php?wuid=2089927129 Device 3 http://setiathome.berkeley.edu/workunit.php?wuid=2090899771 Joe Here's my “MultiBeam_NV_config.xml” file.
;;; GTX 980
<device0>
  <period_iterations_num>40</period_iterations_num>
  <spike_fft_thresh>4096</spike_fft_thresh>
  <sbs>192</sbs>
  <oclfft_plan>
    <size>256</size>
    <global_radix>256</global_radix>
    <local_radix>16</local_radix>
    <workgroup_size>256</workgroup_size>
    <max_local_size>512</max_local_size>
    <localmem_banks>64</localmem_banks>
    <localmem_coalesce_width>64</localmem_coalesce_width>
  </oclfft_plan>
</device0>
;;; GTX 780
<device1>
  <spike_fft_thresh>4096</spike_fft_thresh>
  <sbs>192</sbs>
  <oclfft_plan>
    <size>256</size>
    <global_radix>256</global_radix>
    <local_radix>16</local_radix>
    <workgroup_size>256</workgroup_size>
    <max_local_size>512</max_local_size>
    <localmem_banks>64</localmem_banks>
    <localmem_coalesce_width>64</localmem_coalesce_width>
  </oclfft_plan>
</device1>
;;; GTX 980
<device2>
  <spike_fft_thresh>4096</spike_fft_thresh>
  <sbs>192</sbs>
  <oclfft_plan>
    <size>256</size>
    <global_radix>256</global_radix>
    <local_radix>16</local_radix>
    <workgroup_size>256</workgroup_size>
    <max_local_size>512</max_local_size>
    <localmem_banks>64</localmem_banks>
    <localmem_coalesce_width>64</localmem_coalesce_width>
  </oclfft_plan>
</device2>
;;; GTX 960
<device3>
  <spike_fft_thresh>4096</spike_fft_thresh>
  <sbs>192</sbs>
  <oclfft_plan>
    <size>256</size>
    <global_radix>256</global_radix>
    <local_radix>16</local_radix>
    <workgroup_size>256</workgroup_size>
    <max_local_size>512</max_local_size>
    <localmem_banks>64</localmem_banks>
    <localmem_coalesce_width>64</localmem_coalesce_width>
  </oclfft_plan>
</device3>
Here's my “mb_cmdline_win_x86_SSE3_OpenCL_NV.txt” file.
-instances_per_device 3 -tune 1 64 1 4
Here's my “app_info.xml” file. I'm only showing the SoG portion that changed.
<app>
  <name>setiathome_v8</name>
</app>
<file_info>
  <name>MB8_win_x86_SSE3_OpenCL_NV_r3401_SoG.exe</name>
  <executable/>
</file_info>
<file_info>
  <name>libfftw3f-3-3-4_x86.dll</name>
  <executable/>
</file_info>
<file_ref>
  <file_name>MultiBeam_Kernels_r3401.cl</file_name>
</file_ref>
<file_info>
  <name>mb_cmdline_win_x86_SSE3_OpenCL_NV.txt</name>
</file_info>
<file_info>
  <name>MultiBeam_NV_config.xml</name>
</file_info>
<app_version>
  <app_name>setiathome_v8</app_name>
  <version_num>800</version_num>
  <platform>windows_intelx86</platform>
  <avg_ncpus>0.04</avg_ncpus>
  <max_ncpus>0.2</max_ncpus>
  <plan_class>opencl_nvidia_SoG</plan_class>
  <cmdline></cmdline>
  <coproc>
    <type>CUDA</type>
    <count>1</count>
  </coproc>
  <file_ref>
    <file_name>MB8_win_x86_SSE3_OpenCL_NV_r3401_SoG.exe</file_name>
    <main_program/>
  </file_ref>
  <file_ref>
    <file_name>libfftw3f-3-3-4_x86.dll</file_name>
  </file_ref>
  <file_ref>
    <file_name>mb_cmdline_win_x86_SSE3_OpenCL_NV.txt</file_name>
    <open_name>mb_cmdline.txt</open_name>
  </file_ref>
  <file_ref>
    <file_name>MultiBeam_NV_config.xml</file_name>
  </file_ref>
</app_version>
Real Join Date: Joe Januzzi (ID 253343) 29 Sep 1999, 22:30:36 UTC Try to learn something new everyday.
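For anyone who wants to compare per-device settings in a file like the MultiBeam_NV_config.xml posted above, here is a minimal Python sketch. It rests on two assumptions that the thread does not confirm: that lines starting with `;;;` are comments, and that the top-level `<deviceN>` blocks can be wrapped in a made-up root element so a standard XML parser accepts them. The function name `read_device_settings` and the shortened sample are hypothetical, and this says nothing about how the app itself reads the file.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample in the same shape as the posted config.
SAMPLE_CONFIG = """\
;;; GTX 980
<device0>
  <period_iterations_num>40</period_iterations_num>
  <sbs>192</sbs>
  <oclfft_plan>
    <size>256</size>
    <workgroup_size>256</workgroup_size>
  </oclfft_plan>
</device0>
;;; GTX 780
<device1>
  <sbs>192</sbs>
</device1>
"""


def read_device_settings(text):
    """Return {device_tag: {setting: value}} for top-level <deviceN> blocks."""
    # Assumption: lines starting with ';;;' are comments, so drop them.
    body = "\n".join(
        line for line in text.splitlines()
        if not line.lstrip().startswith(";;;")
    )
    # The file has no single root element; wrap it so ElementTree can parse it.
    root = ET.fromstring("<config>" + body + "</config>")
    settings = {}
    for device in root:
        if not re.fullmatch(r"device\d+", device.tag):
            continue
        values = {}
        for child in device:
            if len(child) == 0:
                values[child.tag] = (child.text or "").strip()
            else:
                # Nested block such as <oclfft_plan>: flatten to "parent.child".
                for leaf in child:
                    values[child.tag + "." + leaf.tag] = (leaf.text or "").strip()
        settings[device.tag] = values
    return settings


if __name__ == "__main__":
    for device, values in read_device_settings(SAMPLE_CONFIG).items():
        print(device, values)
```

Run against the full posted file, a listing like this makes it easy to spot that only device0 sets period_iterations_num, while the other three devices omit it.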

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1771227 - Posted: 12 Mar 2016, 18:49:36 UTC - in response to Message 1771199. "Raistmer, do you think that some day this “<instances_per_device>N</instances_per_device>” could be added to the MultiBeam_NV_config.xml file?" Hardly. And there is definitely no sense in doing that until BOINC is able to run a different number of tasks on different GPUs from the same vendor. Can it?

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1773210 - Posted: 22 Mar 2016, 7:51:13 UTC - in response to Message 1773039. Indeed, the performance variation from AR changes is bigger than any possible change in performance from switching between 3 and 4 tasks per GPU. So really good statistics, or some offline tests in a controlled environment, are required for that.

Joe Januzzi Volunteer tester Joined: 13 Apr 03 Posts: 54 Credit: 307,134,110 RAC: 492	Message 1773305 - Posted: 22 Mar 2016, 21:55:46 UTC - in response to Message 1772694. "Dropping -period_iterations_num to 20, from default 50, increased the speed considerably." Dropping it to 20 helped my speed too! Thanks, Tutankhamon, for the info. I had the -v 8 switch running without knowing it for about a week. The -v 8 switch was at the tail end of the commands, which I didn't see because my screen was too small :-( So now I'm backtracking a little bit. Hopefully with better data this time. Here's my “mb_cmdline_win_x86_SSE3_OpenCL_NV.txt” file. -sbs 192 -instances_per_device 3 -period_iterations_num 20 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 16 -oclfft_tune_cw 16 Real Join Date: Joe Januzzi (ID 253343) 29 Sep 1999, 22:30:36 UTC Try to learn something new everyday.
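A long single-line options file like the one above makes it easy to overlook a switch at the far end, as happened with -v 8. The sketch below splits such a line into one option per entry so every switch is visible. It is a hypothetical helper written for this thread, not part of the app: `parse_cmdline` simply assumes that a token beginning with `-` followed by a non-digit starts a new option and that the tokens after it are its values.

```python
def parse_cmdline(text):
    """Split an options line into {option_name: [values...]}."""
    options = {}
    current = None
    for token in text.split():
        # A token like "-sbs" starts a new option; anything else is a value.
        if token.startswith("-") and not token[1:2].isdigit():
            current = token.lstrip("-")
            options[current] = []
        elif current is not None:
            options[current].append(token)
    return options


# The options line from the post above.
CMDLINE = (
    "-sbs 192 -instances_per_device 3 -period_iterations_num 20 "
    "-spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 "
    "-oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 "
    "-oclfft_tune_bn 16 -oclfft_tune_cw 16"
)

if __name__ == "__main__":
    opts = parse_cmdline(CMDLINE)
    for name, values in opts.items():
        print(name, " ".join(values))
    # A stray verbosity switch is easy to miss at the end of a long line.
    print("-v present:", "v" in opts)
```

Printing one option per line keeps multi-value switches such as -tune together with their four arguments.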

Rasputin42 Volunteer tester Joined: 25 Jul 08 Posts: 412 Credit: 5,834,661 RAC: 0	Message 1773310 - Posted: 22 Mar 2016, 22:24:28 UTC The r3401 version is not working well for me. I guess it is no good for cards with few compute units (2 in my case). One of the test WUs (from Lunatics) does not even run at all (no error, but no CPU or GPU usage). I tried all sorts of tweaking and different drivers, but performance is bad. The r3366 works fine.

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1773375 - Posted: 23 Mar 2016, 5:53:24 UTC - in response to Message 1773310. Last modified: 23 Mar 2016, 6:01:25 UTC "I tried all sorts of tweaking" What exactly did you try? r3401 is currently an RC build, so any usability degradation not solved on beta will remain after release. Did you check the system log for driver restart events? EDIT: on beta I see this in a completed result: "Defaults scaling is disabled, basic defaults will be used. Tuning on user's discretion. Number of period iterations for PulseFind set to:5" Such tuning is definitely not correct for a low-performance card with a small number of CUs. You are deliberately worsening app usability with such tuning. For low-performance GPUs the default value is 500; for mid-range and high-end GPUs the default is 50. So a value of 5 can be a complete no-go for your device. Did you experience issues with the default settings before these tuning attempts?
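The rule described in this thread (a user setting has priority; otherwise the default is 500 for low-performance GPUs and 50 for mid-range and high-end ones) can be sketched as below. This is an illustration only, not the app's code: the function name and the `low_end` flag are hypothetical, and the thread does not say how the app decides that a card is low-performance.

```python
LOW_END_DEFAULT = 500   # default stated in the thread for low-performance GPUs
STANDARD_DEFAULT = 50   # default stated for mid-range and high-end GPUs


def effective_period_iterations(user_value=None, low_end=False):
    """Pick the value to use: an explicit user setting wins over the defaults."""
    if user_value is not None:
        return user_value
    return LOW_END_DEFAULT if low_end else STANDARD_DEFAULT


if __name__ == "__main__":
    # Hypothetical low-end card with no user setting: falls back to 500.
    print(effective_period_iterations(low_end=True))
    # The same card with -period_iterations_num 5: the user value is used,
    # which is 100 times smaller than the default for that class of card.
    print(effective_period_iterations(user_value=5, low_end=True))
```

Seen this way, setting 5 on a low-end card overrides a default of 500, which is why the tuning in the quoted result is called out as a likely cause of the problems.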
