OpenCL NV MultiBeam v8 SoG edition for Windows
Author | Message |
---|---|
Joined: 16 Jun 01 Posts: 6324 Credit: 106,370,077 RAC: 121 |
Hardly. And there is definitely no sense in doing that until BOINC is able to run a different number of tasks on different GPUs from the same vendor. Can it? |
Grumpy Swede (I stand with Ukraine) Joined: 1 Nov 08 Posts: 8923 Credit: 49,849,242 RAC: 65 |
Well, I really do not have the time any longer to properly analyze this latest SoG version (r3401). However, after having run around 400 tasks over the last 24 hours or so, I think I can say that this version is indeed faster than the one I ran before (r3366) for around 13,000 WUs. Same settings as with r3366. Maybe some day (probably not until I retire in 5 years or so) I will have the time to do some serious testing. Edit: I will continue to let it run. |
Grumpy Swede (I stand with Ukraine) Joined: 1 Nov 08 Posts: 8923 Credit: 49,849,242 RAC: 65 |
Best settings so far for my GTX 980: `-cpu_lock -sbs 256 -period_iterations_num 20 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -instances_per_device 4` Dropping -period_iterations_num from the default 50 to 20 increased the speed considerably. |
Grumpy Swede (I stand with Ukraine) Joined: 1 Nov 08 Posts: 8923 Credit: 49,849,242 RAC: 65 |
I will now try 3 SoGs at a time instead of 4, and add one more CPU core/thread, making it 3 SoG tasks and 3 CPU WUs at a time. Same settings as before, except for "-instances_per_device 3". |
Grumpy Swede (I stand with Ukraine) Joined: 1 Nov 08 Posts: 8923 Credit: 49,849,242 RAC: 65 |
Geez, it is very hard to say whether 3 SoGs and 3 CPUs is faster or slower than 4 SoGs and 2 CPUs. I need to let this run for a week or so before I can say for sure. Yeah well, who's in a hurry? |
Joined: 16 Jun 01 Posts: 6324 Credit: 106,370,077 RAC: 121 |
Indeed, the performance variation from AR changes is bigger than any possible change in performance from switching between 3 and 4 tasks per GPU. So really good statistics, or some offline tests in a controlled environment, are required for that. |
Joe Januzzi Joined: 13 Apr 03 Posts: 54 Credit: 307,134,110 RAC: 492 |
“Dropping -period_iterations_num to 20, from default 50, increased the speed considerably.” Dropping it to 20 helped my speed too! Thanks, Tutankhamon, for the info. I had the -v 8 switch running without knowing it for about a week. The -v 8 switch was at the tail end of the command line, which I didn't see because my screen was too small :-( So now I'm backtracking a little bit, hopefully with better data this time. Here's my “mb_cmdline_win_x86_SSE3_OpenCL_NV.txt” file: `-sbs 192 -instances_per_device 3 -period_iterations_num 20 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 16 -oclfft_tune_cw 16` Real Join Date: Joe Januzzi (ID 253343) 29 Sep 1999, 22:30:36 UTC. Try to learn something new everyday. |
Rasputin42 Joined: 25 Jul 08 Posts: 412 Credit: 5,834,661 RAC: 0 |
The r3401 version is not working well for me. I guess it is no good for cards with few compute units (2 in my case). One of the test WUs (from Lunatics) does not even run at all (no error, but no CPU or GPU usage). I tried all sorts of tweaking and different drivers, but performance is bad. r3366 works fine. |
Joined: 16 Jun 01 Posts: 6324 Credit: 106,370,077 RAC: 121 |
What exactly did you try? r3401 is currently an RC build, so any usability degradation not solved on beta will remain after release. Did you check the system log for driver restart events? EDIT: on beta I see this completed result: "Defaults scaling is disabled, basic defaults will be used. Tuning on user's discretion. Number of period iterations for PulseFind set to:5". Such tuning is definitely not correct for a low-performance card with a small number of CUs. You are purposely worsening app usability with such tuning. For low-performance GPUs the default value is 500; for mid-range and high-end GPUs the default is 50. So a value of 5 can be a complete no-go for your device. Did you experience issues with the default settings before such tuning attempts? |
Rasputin42 Joined: 25 Jul 08 Posts: 412 Credit: 5,834,661 RAC: 0 |
“Did you experience issues with default settings before such tune attempts?” Yes, I did. The GPU utilization was very spiky and low. I discovered it was classified as "low performance", which it is not. This set the period iterations num to 500, which made it unbelievably slow. I got some performance out of it by setting P.I.N. nearly to zero. I also tried the most recent driver: no change. No driver restarts. |
Marco Franceschini Joined: 4 Jul 01 Posts: 54 Credit: 69,877,354 RAC: 135 |
“Did you experience issues with default settings before such tune attempts?” My GTX 660M/GT 820M/GT 640/GT 730 are all classified as "low performance" too. |
Rasputin42 Joined: 25 Jul 08 Posts: 412 Credit: 5,834,661 RAC: 0 |
“My GTX 660M/GT 820M/GT 640/GT 730 are all classified as "low performance" too.” So what did you do? |
Marco Franceschini Joined: 4 Jul 01 Posts: 54 Credit: 69,877,354 RAC: 135 |
So far I'm in the process of recompiling FFTW 3.3 with the Intel C++ Compiler and researching this "issue" (all my GPUs are below one TFLOPS in single precision). |
Joined: 17 Feb 01 Posts: 33258 Credit: 79,922,639 RAC: 80 |
“Did you experience issues with default settings before such tune attempts?” Because they are. With each crime and every kindness we birth our future. |
Joe Januzzi Joined: 13 Apr 03 Posts: 54 Credit: 307,134,110 RAC: 492 |
FYI, and a request for a little help: I have been using only the “mb_cmdline_win_x86_SSE3_OpenCL_NV.txt” file (r3366): `-sbs 192 -instances_per_device 3 -period_iterations_num 20 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 16 -oclfft_tune_cw 16` WU times are fast (for me). When I added the “MultiBeam_NV_config.xml” file to the mix, my WU times got slower by 60+ seconds. My goal was to use the same parameters in both setups, so there would be minimal speed changes, if any (I think). I also had the same slowdown on version 3401. So I have 2 questions: 1. Is my “MultiBeam_NV_config.xml” file or setup wrong? 2. Should there be any time difference when using the “MultiBeam_NV_config.xml” file? Thanks, Joe
WU times before adding the “MultiBeam_NV_config.xml” file are listed first for each card, and times with the file added second. Note: I could have shown more WUs as samples, but they mostly had the same time difference ratio for both versions.
GTX 980 (d0): without config file AR 0.4224, elapsed 881 s, CPU 424 s, WU 2101999550; with config file AR 0.4224, elapsed 967 s, CPU 541 s, WU 2102376628.
GTX 980 (d2): without config file AR 0.4225, elapsed 878 s, CPU 463 s, WU 2101726988; with config file AR 0.4225, elapsed 957 s, CPU 447 s, WU 2102379836.
GTX 780 (d3): without config file AR 0.4222, elapsed 1067 s, CPU 605 s, WU 2102034291; with config file AR 0.4221, elapsed 1129 s, CPU 727 s, WU 2102368697.
http://setiathome.berkeley.edu/workunit.php?wuid=2101999550 http://setiathome.berkeley.edu/workunit.php?wuid=2101726988 http://setiathome.berkeley.edu/workunit.php?wuid=2102034291 http://setiathome.berkeley.edu/workunit.php?wuid=2102376628 http://setiathome.berkeley.edu/workunit.php?wuid=2102379836 http://setiathome.berkeley.edu/workunit.php?wuid=2102368697
Note: I ran the file with and without using ";;; GTX xxx" in the file with no speed difference, so I left it in (shown in red).
MultiBeam_NV_config.xml:
`;;; GTX 980 <device0> <period_iterations_num>20</period_iterations_num> <spike_fft_thresh>4096</spike_fft_thresh> <sbs>192</sbs> <oclfft_plan> <size>256</size> <global_radix>256</global_radix> <local_radix>16</local_radix> <workgroup_size>256</workgroup_size> <max_local_size>512</max_local_size> <localmem_banks>16</localmem_banks> <localmem_coalesce_width>16</localmem_coalesce_width> </oclfft_plan> </device0>`
`;;; GTX 780 <device1> <period_iterations_num>20</period_iterations_num> <spike_fft_thresh>4096</spike_fft_thresh> <sbs>192</sbs> <oclfft_plan> <size>256</size> <global_radix>256</global_radix> <local_radix>16</local_radix> <workgroup_size>256</workgroup_size> <max_local_size>512</max_local_size> <localmem_banks>16</localmem_banks> <localmem_coalesce_width>16</localmem_coalesce_width> </oclfft_plan> </device1>`
`;;; GTX 980 <device2> <period_iterations_num>20</period_iterations_num> <spike_fft_thresh>4096</spike_fft_thresh> <sbs>192</sbs> <oclfft_plan> <size>256</size> <global_radix>256</global_radix> <local_radix>16</local_radix> <workgroup_size>256</workgroup_size> <max_local_size>512</max_local_size> <localmem_banks>16</localmem_banks> <localmem_coalesce_width>16</localmem_coalesce_width> </oclfft_plan> </device2>`
`;;; GTX 960 <device3> <period_iterations_num>20</period_iterations_num> <spike_fft_thresh>4096</spike_fft_thresh> <sbs>192</sbs> <oclfft_plan> <size>256</size> <global_radix>256</global_radix> <local_radix>16</local_radix> <workgroup_size>256</workgroup_size> <max_local_size>512</max_local_size> <localmem_banks>16</localmem_banks> <localmem_coalesce_width>16</localmem_coalesce_width> </oclfft_plan> </device3>`
app_info.xml, changes in red (only OpenCL_r3366_SoG shown):
`<app> <name>setiathome_v8</name> </app> <file_info> <name>MB8_win_x86_SSE3_OpenCL_NV_r3366_SoG.exe</name> <executable/> </file_info> <file_info> <name>libfftw3f-3-3-4_x86.dll</name> <executable/> </file_info> <file_ref> <file_name>MultiBeam_Kernels_r3366.cl</file_name> </file_ref> <file_info> <name>mb_cmdline_win_x86_SSE3_OpenCL_NV.txt</name> </file_info> <file_info> <name>MultiBeam_NV_config.xml</name> </file_info> <app_version> <app_name>setiathome_v8</app_name> <version_num>800</version_num> <platform>windows_intelx86</platform> <avg_ncpus>0.04</avg_ncpus> <max_ncpus>0.2</max_ncpus> <plan_class>opencl_nvidia_SoG</plan_class> <cmdline></cmdline> <coproc> <type>CUDA</type> <count>1</count> </coproc> <file_ref> <file_name>MB8_win_x86_SSE3_OpenCL_NV_r3366_SoG.exe</file_name> <main_program/> </file_ref> <file_ref> <file_name>libfftw3f-3-3-4_x86.dll</file_name> </file_ref> <file_ref> <file_name>mb_cmdline_win_x86_SSE3_OpenCL_NV.txt</file_name> <open_name>mb_cmdline.txt</open_name> </file_ref> <file_ref> <file_name>MultiBeam_NV_config.xml</file_name> </file_ref> </app_version>`
mb_cmdline_win_x86_SSE3_OpenCL_NV.txt: `-instances_per_device 3 -tune 1 64 1 4` |
Joined: 16 Jun 01 Posts: 6324 Credit: 106,370,077 RAC: 121 |
“Did you experience issues with default settings before such tune attempts?” Don't you see some contradiction between your posts? "Spiky and low GPU usage" is not a usability issue. Lags and driver restarts are. The defaults are chosen to allow operation without driver restarts in unattended mode on most cards. If a user wants to optimize, no problem, he can do that (and any lags that then appear are his own choice). So if r3401 is slower than the previous build with defaults, well, that is because it's a stock RC and should work on most available cards. The previous revision did not work on low-performance GPUs, and that required separate tuning for them. And regarding peak app performance: do you see slower r3401 operation with the same tuning line as for the previous build? How do you compare builds? |
Joined: 16 Jun 01 Posts: 6324 Credit: 106,370,077 RAC: 121 |
The "config" file is provided so that multi-GPU hosts can supply separate tuning for different devices. If the provided tuning is the same for all of them, then the "config" file is redundant and the cmdline (or any other available method of supplying command-line params) is enough. Beyond that, no matter which way you supply a param to the app, the app will react to the particular param value, not to the way the param was passed to it (config has precedence, though, to override common param values). Regarding performance changes between builds: 1) defaults are changed; 2) work splitting for PulseFind is changed. If you see a slowdown because of "low performance GPU detected" and know that the GPU can perform OK with a lower -period_iterations_num N value, then set it lower and the app will obey. Otherwise, try changing the -sbs N setting and the values of the newly added settings: -pref_wg_size N (was 128 before, changed to 64 with r3401 for ATi, left at 32 for NV) and -pref_wg_num_per_cu N (default is 4; a different algorithm was used before, so there is no direct comparison with the older default). |
Joe Januzzi Joined: 13 Apr 03 Posts: 54 Credit: 307,134,110 RAC: 492 |
Raistmer, thanks again for all your help. Joe. “Regarding performance changes between builds:” I'm guilty of this: not draining the cache before build changes. That could explain a lot. I will drain it before going back to r3401. “In other case try to change -sbs N setting and values of newly added settings:” I will make some changes when I'm back on r3401. The config file will be the last thing to be added, once I get r3401 tuned for best performance. The config file will be tested to see if I can fine-tune my GTX 960 separately from my other video cards. No speed change = no config file. |
Joe Januzzi Joined: 13 Apr 03 Posts: 54 Credit: 307,134,110 RAC: 492 |
“-pref_wg_num_per_cu N (default is 4; different algorithm used before so no direct comparison with older default).” I was wondering if there is a set range (like 4 to ??). So far I have tried 4, 6, 10 and 20. None of the numbers so far seems to make any difference to speed. It's also harder to track any changes for “-pref_wg_num_per_cu N”, because it doesn't say anything about it in the stderr.txt file. I think after “20” I'll go back to the default. So far this is my best setup for version 3401_SoG. Still trying different params. Joe
`-sbs 192 -instances_per_device 3 -period_iterations_num 20 -pref_wg_size 32 -pref_wg_num_per_cu 4 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 16 -oclfft_tune_cw 16`
http://setiathome.berkeley.edu/workunit.php?wuid=2105819825
Stderr output:
`<core_client_version>7.4.42</core_client_version>`
`<![CDATA[`
`<stderr_txt>`
`Running on device number: 0`
`Maximum single buffer size set to:192MB`
`Number of app instances per device set to:3`
`Number of period iterations for PulseFind set to:20`
`Preferred workgroup size set to 32.` <--- Is this right?
`SpikeFind FFT size threshold override set to:4096`
`TUNE: kernel 1 now has workgroup size of (64,1,4)`
`oclFFT global radix override set to:256`
`oclFFT local radix override set to:16`
`oclFFT max WG size override set to:256`
`oclFFT max local FFT size override set to:512`
`oclFFT number of local memory banks set to:16`
`oclFFT minimal memory coalesce width set to:16`
`Priority of worker thread raised successfully`
`Priority of process adjusted successfully, below normal priority class used`
`OpenCL platform detected: Intel(R) Corporation`
`OpenCL platform detected: NVIDIA Corporation`
`BOINC assigns device 0`
`Info: BOINC provided OpenCL device ID used`
`Build features: SETI8 Non-graphics OpenCL USE_OPENCL_NV OCL_ZERO_COPY SIGNALS_ON_GPU OCL_CHIRP3 FFTW USE_SSE3 x86`
`CPUID: Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz`
`Cache: L1=64K L2=256K`
`CPU features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3 SSSE3 SSE4.1 SSE4.2 AVX`
`OpenCL-kernels filename : MultiBeam_Kernels_r3401.cl`
`ar=0.427216 NumCfft=195899 NumGauss=1105829004 NumPulse=226440719403 NumTriplet=452848448667`
`Currently allocated 293 MB for GPU buffers`
`In v_BaseLineSmooth: NumDataPoints=1048576, BoxCarLength=8192, NumPointsInChunk=32768` |
Joined: 17 Feb 01 Posts: 33258 Credit: 79,922,639 RAC: 80 |
Changing -pref_wg_num_per_cu N just makes it slower, at least on my GPU. It might be different on NV cards; not sure at the moment. As soon as I have more time, I will test it on NV as well. |
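
Editor's note: the elapsed times Joe Januzzi posted for runs with and without MultiBeam_NV_config.xml can be turned into a rough per-card slowdown figure. The Python sketch below is only an illustration. It assumes the elapsed values are in seconds and that each "without" run can be paired with the "with" run on the same card, even though the work units differ and the AR values are only approximately equal. The names `samples`, `slowdown_percent` and `results` are hypothetical and not part of any SETI@home or BOINC tool.

```python
# Elapsed times (assumed to be seconds) copied from the timing figures in the thread.
# Each tuple is (without config file, with config file) for one card.
# Pairing the runs this way is an assumption: the work units are different.
samples = {
    "GTX 980 (d0)": (881, 967),
    "GTX 980 (d2)": (878, 957),
    "GTX 780 (d3)": (1067, 1129),
}


def slowdown_percent(before, after):
    """Relative change in elapsed time, in percent (positive means slower)."""
    return (after - before) / before * 100.0


# Per-card slowdown after adding the config file.
results = {card: slowdown_percent(b, a) for card, (b, a) in samples.items()}

for card, pct in results.items():
    before, after = samples[card]
    print(f"{card}: {before} s -> {after} s ({after - before:+d} s, {pct:+.1f}%)")
```

With only one pair of work units per card, the result is an indication at best. As noted earlier in the thread, AR-driven variation can be larger than the effect being measured, so many more samples or an offline benchmark would be needed for a firm conclusion.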
©2022 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.