OpenCL NV MultiBeam v8 SoG edition for Windows

Author	Message
Rasputin42 Volunteer tester Send message Joined: 25 Jul 08 Posts: 412 Credit: 5,834,661 RAC: 0	Message 1773418 - Posted: 23 Mar 2016, 10:52:55 UTC - in response to Message 1773375. Did you experience issues with default settings before such tune attempts? Yes, I did. The GPU utilization was very spiky and low. I discovered it was classified as "low performance", which it is not. This set the period iterations num to 500, which made it unbelievably slow. I got some performance out of it by setting P.I.N. nearly to zero. I also tried the most recent driver: no change. No driver restarts. ID: 1773418 ·

Marco Franceschini Volunteer tester Send message Joined: 4 Jul 01 Posts: 54 Credit: 69,877,354 RAC: 135	Message 1773450 - Posted: 23 Mar 2016, 13:04:11 UTC - in response to Message 1773418. Last modified: 23 Mar 2016, 13:10:09 UTC Did you experience issues with default settings before such tune attempts? Yes, I did. The GPU utilization was very spiky and low. I discovered it was classified as "low performance", which it is not. This set the period iterations num to 500, which made it unbelievably slow. I got some performance out of it by setting P.I.N. nearly to zero. I also tried the most recent driver: no change. No driver restarts. My GTX 660M/GT820M/GT 640/GT 730 are all classified as "low performance" too. ID: 1773450 ·

Rasputin42 Volunteer tester Send message Joined: 25 Jul 08 Posts: 412 Credit: 5,834,661 RAC: 0	Message 1773456 - Posted: 23 Mar 2016, 13:31:49 UTC My GTX 660M/GT820M/GT 640/GT 730 are all classified as "low performance" too. So what did you do? ID: 1773456 ·

Marco Franceschini Volunteer tester Send message Joined: 4 Jul 01 Posts: 54 Credit: 69,877,354 RAC: 135	Message 1773478 - Posted: 23 Mar 2016, 15:22:07 UTC So far I'm in the process of recompiling FFTW 3.3 with the Intel C++ Compiler and researching this "issue" (my GPUs are all under a TFLOPS in single precision). ID: 1773478 ·

Mike Volunteer tester Send message Joined: 17 Feb 01 Posts: 34255 Credit: 79,922,639 RAC: 80	Message 1773480 - Posted: 23 Mar 2016, 15:26:50 UTC - in response to Message 1773450. Did you experience issues with default settings before such tune attempts? Yes, I did. The GPU utilization was very spiky and low. I discovered it was classified as "low performance", which it is not. This set the period iterations num to 500, which made it unbelievably slow. I got some performance out of it by setting P.I.N. nearly to zero. I also tried the most recent driver: no change. No driver restarts. My GTX 660M/GT820M/GT 640/GT 730 are all classified as "low performance" too. Because they are. With each crime and every kindness we birth our future. ID: 1773480 ·

Joe Januzzi Volunteer tester Send message Joined: 13 Apr 03 Posts: 54 Credit: 307,134,110 RAC: 492	Message 1773541 - Posted: 23 Mar 2016, 20:50:10 UTC - in response to Message 1773480. FYI and a little help: I have been using only the "mb_cmdline_win_x86_SSE3_OpenCL_NV.txt" file (V. 3366).

-sbs 192 -instances_per_device 3 -period_iterations_num 20 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 16 -oclfft_tune_cw 16

WU times are fast (for me). When I added the "MultiBeam_NV_config.xml" file to the mix, my WU times got slower by 60+ seconds. My goal was to use the same parameters for both setups, so there would be minimal speed changes, if any (I think). I also had the same slowdown on version 3401. So I have 2 questions:
1. Is my "MultiBeam_NV_config.xml" file or setup wrong?
2. Should there be any time difference in using the "MultiBeam_NV_config.xml" file?
Thanks
Joe

WU times before adding the "MultiBeam_NV_config.xml" file are shown on the first line of data; the second line is with the file added. Note: I could have shown more WUs as samples, but they mostly showed the same time difference ratio for both versions.

GTX 980 (d0)                       GTX 980 (d2)                       GTX 780 (d3)
AR      Elapsed  CPU  WU#          AR      Elapsed  CPU  WU#          AR      Elapsed  CPU  WU#
0.4224  881      424  2101999550   0.4225  878      463  2101726988   0.4222  1067     605  2102034291
0.4224  967      541  2102376628   0.4225  957      447  2102379836   0.4221  1129     727  2102368697

http://setiathome.berkeley.edu/workunit.php?wuid=2101999550
http://setiathome.berkeley.edu/workunit.php?wuid=2101726988
http://setiathome.berkeley.edu/workunit.php?wuid=2102034291
http://setiathome.berkeley.edu/workunit.php?wuid=2102376628
http://setiathome.berkeley.edu/workunit.php?wuid=2102379836
http://setiathome.berkeley.edu/workunit.php?wuid=2102368697

Note: I ran the file with and without using ";;; GTX xxx" in the file with no speed difference, so I left it in (shown in red).

MultiBeam_NV_config.xml

;;; GTX 980
<device0>
  <period_iterations_num>20</period_iterations_num>
  <spike_fft_thresh>4096</spike_fft_thresh>
  <sbs>192</sbs>
  <oclfft_plan>
    <size>256</size>
    <global_radix>256</global_radix>
    <local_radix>16</local_radix>
    <workgroup_size>256</workgroup_size>
    <max_local_size>512</max_local_size>
    <localmem_banks>16</localmem_banks>
    <localmem_coalesce_width>16</localmem_coalesce_width>
  </oclfft_plan>
</device0>
;;; GTX 780
<device1>
  <period_iterations_num>20</period_iterations_num>
  <spike_fft_thresh>4096</spike_fft_thresh>
  <sbs>192</sbs>
  <oclfft_plan>
    <size>256</size>
    <global_radix>256</global_radix>
    <local_radix>16</local_radix>
    <workgroup_size>256</workgroup_size>
    <max_local_size>512</max_local_size>
    <localmem_banks>16</localmem_banks>
    <localmem_coalesce_width>16</localmem_coalesce_width>
  </oclfft_plan>
</device1>
;;; GTX 980
<device2>
  <period_iterations_num>20</period_iterations_num>
  <spike_fft_thresh>4096</spike_fft_thresh>
  <sbs>192</sbs>
  <oclfft_plan>
    <size>256</size>
    <global_radix>256</global_radix>
    <local_radix>16</local_radix>
    <workgroup_size>256</workgroup_size>
    <max_local_size>512</max_local_size>
    <localmem_banks>16</localmem_banks>
    <localmem_coalesce_width>16</localmem_coalesce_width>
  </oclfft_plan>
</device2>
;;; GTX 960
<device3>
  <period_iterations_num>20</period_iterations_num>
  <spike_fft_thresh>4096</spike_fft_thresh>
  <sbs>192</sbs>
  <oclfft_plan>
    <size>256</size>
    <global_radix>256</global_radix>
    <local_radix>16</local_radix>
    <workgroup_size>256</workgroup_size>
    <max_local_size>512</max_local_size>
    <localmem_banks>16</localmem_banks>
    <localmem_coalesce_width>16</localmem_coalesce_width>
  </oclfft_plan>
</device3>

app_info.xml changes in red (only OpenCL_r3366_SoG shown)

<app>
  <name>setiathome_v8</name>
</app>
<file_info>
  <name>MB8_win_x86_SSE3_OpenCL_NV_r3366_SoG.exe</name>
  <executable/>
</file_info>
<file_info>
  <name>libfftw3f-3-3-4_x86.dll</name>
  <executable/>
</file_info>
<file_ref>
  <file_name>MultiBeam_Kernels_r3366.cl</file_name>
</file_ref>
<file_info>
  <name>mb_cmdline_win_x86_SSE3_OpenCL_NV.txt</name>
</file_info>
<file_info>
  <name>MultiBeam_NV_config.xml</name>
</file_info>
<app_version>
  <app_name>setiathome_v8</app_name>
  <version_num>800</version_num>
  <platform>windows_intelx86</platform>
  <avg_ncpus>0.04</avg_ncpus>
  <max_ncpus>0.2</max_ncpus>
  <plan_class>opencl_nvidia_SoG</plan_class>
  <cmdline></cmdline>
  <coproc>
    <type>CUDA</type>
    <count>1</count>
  </coproc>
  <file_ref>
    <file_name>MB8_win_x86_SSE3_OpenCL_NV_r3366_SoG.exe</file_name>
    <main_program/>
  </file_ref>
  <file_ref>
    <file_name>libfftw3f-3-3-4_x86.dll</file_name>
  </file_ref>
  <file_ref>
    <file_name>mb_cmdline_win_x86_SSE3_OpenCL_NV.txt</file_name>
    <open_name>mb_cmdline.txt</open_name>
  </file_ref>
  <file_ref>
    <file_name>MultiBeam_NV_config.xml</file_name>
  </file_ref>
</app_version>

mb_cmdline_win_x86_SSE3_OpenCL_NV.txt

-instances_per_device 3 -tune 1 64 1 4

Real Join Date: Joe Januzzi (ID 253343) 29 Sep 1999, 22:30:36 UTC Try to learn something new everyday. ID: 1773541 ·
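To put a number on the slowdown reported in this post, here is a minimal Python sketch that compares the two rows of elapsed times. The figures are copied from the table; the column labels, the assumption that the times are in seconds, and the helper name `slowdown` are only illustrative.

```python
# Elapsed times copied from the table in the post (assumed to be seconds).
# Keys are the table's column labels; values are (without config, with config).
elapsed = {
    "GTX 980 (d0)": (881, 967),
    "GTX 980 (d2)": (878, 957),
    "GTX 780 (d3)": (1067, 1129),
}


def slowdown(before, after):
    """Return (absolute change in seconds, percent change) from before to after."""
    delta = after - before
    return delta, 100.0 * delta / before


report = {gpu: slowdown(b, a) for gpu, (b, a) in elapsed.items()}

for gpu, (delta, pct) in report.items():
    print(f"{gpu}: {delta:+d} s ({pct:+.1f}%)")
```

With only one work unit per column in each row, this is a rough indication rather than a benchmark; the post itself notes that more samples showed a similar ratio.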

Raistmer Volunteer developer Volunteer tester Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1773567 - Posted: 23 Mar 2016, 23:01:31 UTC - in response to Message 1773418. Did you experience issues with default settings before such tune attempts? Yes, I did. The GPU utilization was very spiky and low. I discovered it was classified as "low performance", which it is not. This set the period iterations num to 500, which made it unbelievably slow. I got some performance out of it by setting P.I.N. nearly to zero. I also tried the most recent driver: no change. No driver restarts. Don't you see some contradiction between your posts? "Spiky and low GPU usage" is not a usability issue. Lags and driver restarts are. The defaults are chosen to allow operation without driver restarts in unattended mode for most cards. If a user wants to optimize, no problem, he can do that (and any lags that then appear are his own choice). So, if r3401 is slower than the previous build with defaults, well, that's because it's the stock RC and should work on most cards available. The previous revision did not on low-performance GPUs, and that required separate tuning for them. And regarding peak app performance: do you see slower r3401 operation with the same tuning line as for the previous build? How do you compare builds? ID: 1773567 ·

Raistmer Volunteer developer Volunteer tester Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1773569 - Posted: 23 Mar 2016, 23:16:30 UTC - in response to Message 1773541. Last modified: 23 Mar 2016, 23:17:47 UTC When I added the "MultiBeam_NV_config.xml" file to the mix, my WU times got slower by 60+ seconds. My goal was to use the same parameters for both setups, so there would be minimal speed changes, if any (I think). I also had the same slowdown on version 3401. So I have 2 questions: 1. Is my "MultiBeam_NV_config.xml" file or setup wrong? 2. Should there be any time difference in using the "MultiBeam_NV_config.xml" file? The "config" file is provided for multi-GPU hosts, to supply separate tuning for different devices. If the tuning provided is the same for all of them, then using the "config" file is unnecessary and the cmdline (or any other available method of supplying command-line params) is enough. Beyond that, no matter which way you supply a param to the app, the app reacts to the particular param value, not to the way the param was passed to it (the config does take precedence, though, and overrides common param values). Regarding performance changes between builds: 1) defaults are changed. 2) work splitting for PulseFind is changed. If you see a slowdown because of "low performance GPU detected" and know that the GPU can perform OK with a lower -period_iterations_num N value, then set it lower and the app will obey. Otherwise, try changing the -sbs N setting and the values of the newly added settings: -pref_wg_size N (was 128 before, changed to 64 with r3401 for ATi, left at 32 for NV) and -pref_wg_num_per_cu N (default is 4; a different algorithm was used before, so there is no direct comparison with the older default). ID: 1773569 ·
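The precedence rule described in this reply (per-device config values override the common command-line values) can be modelled with a short Python sketch. This is only an illustration of the rule as stated, not the app's actual parsing code; the function names and the per-device value of 50 for device 3 are hypothetical.

```python
def parse_cmdline(text):
    """Parse '-name v1 v2 ...' switches into {name: [values]}."""
    params = {}
    current = None
    for token in text.split():
        # A dash followed by a letter starts a new switch; anything else
        # (including a negative number) is a value of the current switch.
        if token.startswith("-") and token[1:2].isalpha():
            current = token[1:]
            params[current] = []
        elif current is not None:
            params[current].append(token)
    return params


def effective_params(common, per_device, device):
    """Start from the common params, then let the device's own entries win."""
    merged = dict(common)
    merged.update(per_device.get(device, {}))
    return merged


# Common switches, taken from the tuning line quoted earlier in the thread.
common = parse_cmdline("-sbs 192 -period_iterations_num 20 -tune 1 64 1 4")

# Hypothetical per-device override: only device 3 gets a different value.
per_device = {3: {"period_iterations_num": ["50"]}}

for dev in (0, 3):
    print(dev, effective_params(common, per_device, dev))
```

Under this model, a config file that repeats the common values for every device changes nothing, which matches the remark that it is unnecessary in that case.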

Joe Januzzi Volunteer tester Send message Joined: 13 Apr 03 Posts: 54 Credit: 307,134,110 RAC: 492	Message 1773595 - Posted: 24 Mar 2016, 1:55:13 UTC - in response to Message 1773569. Raistmer, Thanks again for all your help. Joe Regarding performance changes between builds: 1) defaults are changed. 2) work splitting for PulseFind is changed. I'm guilty of this: not draining the cache before build changes. That could explain a lot. I will drain it before going back to r3401. Otherwise, try changing the -sbs N setting and the values of the newly added settings: -pref_wg_size N (was 128 before, changed to 64 with r3401 for ATi, left at 32 for NV) and -pref_wg_num_per_cu N (default is 4; a different algorithm was used before, so there is no direct comparison with the older default). I will make some changes when I'm back on r3401. The config file will be the last thing to be added, once I get r3401 tuned for best performance. The config file will be tested to see if I can fine-tune my GTX 960 separately from my other video cards. No speed change = no config file. Real Join Date: Joe Januzzi (ID 253343) 29 Sep 1999, 22:30:36 UTC Try to learn something new everyday. ID: 1773595 ·

Joe Januzzi Volunteer tester Send message Joined: 13 Apr 03 Posts: 54 Credit: 307,134,110 RAC: 492	Message 1774262 - Posted: 26 Mar 2016, 21:57:17 UTC - in response to Message 1773595. -pref_wg_num_per_cu N (default is 4; different algorithm used before so no direct comparison with older default). I was wondering if there is a set range (like 4 to ??)? So far I have tried 4, 6, 10 and 20. None of the numbers so far seem to make any difference in speed. It's also harder to track any changes for "-pref_wg_num_per_cu N", because it doesn't say anything in the stderr.txt file. I think after "20", I'll go back to the default. So far this is my best setup for version 3401_SoG. Still trying different params. Joe

-sbs 192 -instances_per_device 3 -period_iterations_num 20 -pref_wg_size 32 -pref_wg_num_per_cu 4 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 16 -oclfft_tune_cw 16

http://setiathome.berkeley.edu/workunit.php?wuid=2105819825

Stderr output
<core_client_version>7.4.42</core_client_version>
<![CDATA[
<stderr_txt>
Running on device number: 0
Maximum single buffer size set to:192MB
Number of app instances per device set to:3
Number of period iterations for PulseFind set to:20
Preferred workgroup size set to 32. <--- Is this right?
SpikeFind FFT size threshold override set to:4096
TUNE: kernel 1 now has workgroup size of (64,1,4)
oclFFT global radix override set to:256
oclFFT local radix override set to:16
oclFFT max WG size override set to:256
oclFFT max local FFT size override set to:512
oclFFT number of local memory banks set to:16
oclFFT minimal memory coalesce width set to:16
Priority of worker thread raised successfully
Priority of process adjusted successfully, below normal priority class used
OpenCL platform detected: Intel(R) Corporation
OpenCL platform detected: NVIDIA Corporation
BOINC assigns device 0
Info: BOINC provided OpenCL device ID used
Build features: SETI8 Non-graphics OpenCL USE_OPENCL_NV OCL_ZERO_COPY SIGNALS_ON_GPU OCL_CHIRP3 FFTW USE_SSE3 x86
CPUID: Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
Cache: L1=64K L2=256K
CPU features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3 SSSE3 SSE4.1 SSE4.2 AVX
OpenCL-kernels filename : MultiBeam_Kernels_r3401.cl
ar=0.427216 NumCfft=195899 NumGauss=1105829004 NumPulse=226440719403 NumTriplet=452848448667
Currently allocated 293 MB for GPU buffers
In v_BaseLineSmooth: NumDataPoints=1048576, BoxCarLength=8192, NumPointsInChunk=32768

Real Join Date: Joe Januzzi (ID 253343) 29 Sep 1999, 22:30:36 UTC Try to learn something new everyday. ID: 1774262 ·
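Since the post relies on stderr.txt to see which switches the app acknowledged, a small Python sketch can pull the "set to" lines out of the log so that a missing entry stands out. The sample lines are copied from the stderr output in the post; the regular expression, the function name, and the assumption that each message sits on its own line are illustrative guesses about the log layout, not a documented format.

```python
import re

# Matches lines such as "Maximum single buffer size set to:192MB"
# and "Preferred workgroup size set to 32."
SETTING = re.compile(r"^(?P<name>.+?) set to[: ]\s*(?P<value>\d+)")


def applied_settings(stderr_text):
    """Return {setting description: integer value} for every 'set to' line."""
    found = {}
    for line in stderr_text.splitlines():
        match = SETTING.match(line.strip())
        if match:
            found[match.group("name")] = int(match.group("value"))
    return found


sample = """\
Maximum single buffer size set to:192MB
Number of app instances per device set to:3
Number of period iterations for PulseFind set to:20
Preferred workgroup size set to 32.
SpikeFind FFT size threshold override set to:4096
TUNE: kernel 1 now has workgroup size of (64,1,4)
oclFFT global radix override set to:256
"""

settings = applied_settings(sample)
for name, value in settings.items():
    print(f"{name}: {value}")
```

A switch that was passed on the command line but has no corresponding line in this output may simply not be recognised by the build in use.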

Mike Volunteer tester Send message Joined: 17 Feb 01 Posts: 34255 Credit: 79,922,639 RAC: 80	Message 1774267 - Posted: 26 Mar 2016, 22:16:29 UTC Changing -pref_wg_num_per_cu N just makes it slower. At least on my GPU. It might be different on NV cards, not sure at the moment. As soon as I have more time I will test it on NV as well. With each crime and every kindness we birth our future. ID: 1774267 ·

Joe Januzzi Volunteer tester Send message Joined: 13 Apr 03 Posts: 54 Credit: 307,134,110 RAC: 492	Message 1774268 - Posted: 26 Mar 2016, 22:22:17 UTC - in response to Message 1774267. Thanks Mike for the info. Joe Real Join Date: Joe Januzzi (ID 253343) 29 Sep 1999, 22:30:36 UTC Try to learn something new everyday. ID: 1774268 ·

Raistmer Volunteer developer Volunteer tester Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1774269 - Posted: 26 Mar 2016, 22:23:25 UTC - in response to Message 1774262. -pref_wg_num_per_cu N was introduced in r3410, so it is not in the current RC. ID: 1774269 ·

Sleepy Volunteer tester Send message Joined: 21 May 99 Posts: 219 Credit: 98,947,784 RAC: 28,360	Message 1775367 - Posted: 31 Mar 2016, 17:43:58 UTC During crunching, I sometimes get a driver reset. BOINC switches to another pair of WUs (I am crunching 2 WUs at a time), which follow their own fate, whatever it is. When crunching comes back to the WU(s) that caused the reset, the driver resets again at roughly the same completion percentage and the cycle starts over. I eventually have to manually abort the "offending" WU(s) before the system efficiency degrades, especially if another similarly resetting WU comes along and increases the oscillations. I am experiencing it on both my systems, on a GTX660 and a GTX 650Ti. I am running the 3401 version, which should be the latest one. I have also tried the default settings of the application, but the effects stay the same. I think, but have not checked yet, that it is WU related in some form. I cannot be more precise at the moment. Is anyone having the same problems, or does anyone have a solution? Cheers, Sleepy ID: 1775367 ·

Zalster Volunteer tester Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242	Message 1775371 - Posted: 31 Mar 2016, 18:07:32 UTC - in response to Message 1775367. Sleepy, Have you checked to see what the true angle range is on those work units? Just curious. ID: 1775371 ·

Sleepy Volunteer tester Send message Joined: 21 May 99 Posts: 219 Credit: 98,947,784 RAC: 28,360	Message 1775374 - Posted: 31 Mar 2016, 18:14:03 UTC - in response to Message 1775371. This is what I wanted to do, but I am quite busy at the moment and cannot dig that deep. But yes, this was my idea as well, still to be confirmed. If you look at my PCs you will find my aborted MB WUs, for as long as they are still listed. Sleepy ID: 1775374 ·

Zalster Volunteer tester Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242	Message 1778532 - Posted: 12 Apr 2016, 23:23:28 UTC - in response to Message 1775374. Raistmer, will your app work with the new GBT data? ID: 1778532 ·

Mike Volunteer tester Send message Joined: 17 Feb 01 Posts: 34255 Credit: 79,922,639 RAC: 80	Message 1778584 - Posted: 13 Apr 2016, 3:42:18 UTC - in response to Message 1778532. Raistmer, will your app work with the new GBT data? I'm not Raistmer, but yes, it will. With each crime and every kindness we birth our future. ID: 1778584 ·

Zalster Volunteer tester Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242	Message 1778773 - Posted: 13 Apr 2016, 17:52:26 UTC So I was rewriting a portion of the app_info and afterwards got the following messages in the startup log:

4/13/2016 11:10:28 AM | SETI@home | [error] State file error: duplicate app version: setiathome_v8 windows_x86_64 800
4/13/2016 11:10:28 AM | SETI@home | [error] State file error: duplicate app version: setiathome_v8 windows_intelx86 800

When I went looking I saw these 2 areas and am thinking they are causing the error report. Should I remove 1 of them?

<app_version>
  <app_name>setiathome_v8</app_name>
  <version_num>800</version_num>
  <platform>windows_x86_64</platform>
  <api_version>7.5.0</api_version>
  <file_ref>
    <file_name>MB8_win_x64_AVX_VS2010_r3330.exe</file_name>
    <main_program/>
  </file_ref>
  <file_ref>
    <file_name>libfftw3f-3-3-4_x64.dll</file_name>
  </file_ref>
  <file_ref>
    <file_name>mb_cmdline_win_x64_AVX_VS2010.txt</file_name>
    <open_name>mb_cmdline.txt</open_name>
  </file_ref>
</app_version>
<app_version>
  <app_name>setiathome_v8</app_name>
  <version_num>800</version_num>
  <platform>windows_intelx86</platform>
  <api_version>7.5.0</api_version>
  <file_ref>
    <file_name>MB8_win_x64_AVX_VS2010_r3330.exe</file_name>
    <main_program/>
  </file_ref>
  <file_ref>
    <file_name>libfftw3f-3-3-4_x64.dll</file_name>
  </file_ref>
  <file_ref>
    <file_name>mb_cmdline_win_x64_AVX_VS2010.txt</file_name>
    <open_name>mb_cmdline.txt</open_name>
  </file_ref>
</app_version>

ID: 1778773 ·

Jimbocous Volunteer tester Send message Joined: 1 Apr 13 Posts: 1853 Credit: 268,616,081 RAC: 1,349	Message 1778789 - Posted: 13 Apr 2016, 19:29:52 UTC - in response to Message 1778773. So was rewriting a portion of the app_info and afterwards got the following message in start up log 4/13/2016 11:10:28 AM | SETI@home | [error] State file error: duplicate app version: setiathome_v8 windows_x86_64 800 4/13/2016 11:10:28 AM | SETI@home | [error] State file error: duplicate app version: setiathome_v8 windows_intelx86 800 When I went looking saw these 2 areas and am thinking they are causing the error report? Should I remove 1 of them? Yeah, I'd think so. Error message seems pretty clear ... ID: 1778789 ·
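One way to hunt for the duplicate entries named in the error messages is to scan app_info.xml for repeated <app_version> blocks. The Python sketch below assumes, without confirmation from this thread, that the client treats two entries as duplicates when app name, platform, version number and plan class all match; the function name and the trimmed-down sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_app_versions(app_info_xml):
    """Return the (app_name, platform, version_num, plan_class) keys seen more than once."""
    root = ET.fromstring(app_info_xml)
    counts = Counter()
    for av in root.iter("app_version"):
        key = (
            av.findtext("app_name", "").strip(),
            av.findtext("platform", "").strip(),
            av.findtext("version_num", "").strip(),
            av.findtext("plan_class", "").strip(),  # empty when no plan class is set
        )
        counts[key] += 1
    return [key for key, n in counts.items() if n > 1]


# Hypothetical, trimmed-down app_info.xml: the x86_64 entry appears twice.
SAMPLE = """
<app_info>
  <app_version>
    <app_name>setiathome_v8</app_name>
    <version_num>800</version_num>
    <platform>windows_x86_64</platform>
  </app_version>
  <app_version>
    <app_name>setiathome_v8</app_name>
    <version_num>800</version_num>
    <platform>windows_intelx86</platform>
  </app_version>
  <app_version>
    <app_name>setiathome_v8</app_name>
    <version_num>800</version_num>
    <platform>windows_x86_64</platform>
  </app_version>
</app_info>
"""

dupes = duplicate_app_versions(SAMPLE)
print(dupes)
```

Under that assumption, the two blocks quoted in the question differ in <platform>, so each would have to collide with some other entry elsewhere in the file rather than with each other; running a check like this over the whole file would show which ones.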
