OpenCL NV MultiBeam v8 SoG edition for Windows
Author | Message |
---|---|
Rasputin42 Joined: 25 Jul 08 Posts: 412 Credit: 5,834,661 RAC: 0 |
Did you experience issues with default settings before such tune attempts? Yes, I did. GPU utilization was very spiky and low. I discovered that the GPU was classified as "low performance", which it is not. This set the period iterations number (-period_iterations_num) to 500, which made it unbelievably slow. I got some performance back by setting -period_iterations_num close to zero. I also tried the most recent driver, with no change. No driver restarts. |
Marco Franceschini Joined: 4 Jul 01 Posts: 54 Credit: 69,877,354 RAC: 135 |
Did you experience issues with default settings before such tune attempts? My GTX 660M/GT 820M/GT 640/GT 730 are all classified as "low performance" too. |
Rasputin42 Joined: 25 Jul 08 Posts: 412 Credit: 5,834,661 RAC: 0 |
My GTX 660M/GT 820M/GT 640/GT 730 are all classified as "low performance" too. So what did you do? |
Marco Franceschini Joined: 4 Jul 01 Posts: 54 Credit: 69,877,354 RAC: 135 |
So far I'm in the process of recompiling FFTW 3.3 with the Intel C++ Compiler and researching this "issue" (all my GPUs are under 1 TFLOPS in single precision). |
Mike Joined: 17 Feb 01 Posts: 34346 Credit: 79,922,639 RAC: 80 |
Did you experience issues with default settings before such tune attempts? Because they are. With each crime and every kindness we birth our future. |
Joe Januzzi Joined: 13 Apr 03 Posts: 54 Credit: 307,134,110 RAC: 492 |
FYI and a little help: I've been using the "mb_cmdline_win_x86_SSE3_OpenCL_NV.txt" file only (r3366):

-sbs 192 -instances_per_device 3 -period_iterations_num 20 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 16 -oclfft_tune_cw 16

WU times are fast (for me). When I added the "MultiBeam_NV_config.xml" file to the mix, my WU times increased by 60+ seconds. My goal was to use the same parameters for both setups, so there would be minimal speed differences, if any (I think). I also had the same slowdown on version r3401.

So I have 2 questions:
1. Is my "MultiBeam_NV_config.xml" file or setup wrong?
2. Should there be any time difference when using the "MultiBeam_NV_config.xml" file?

Thanks, Joe

For each card, the first row shows WU times before adding the "MultiBeam_NV_config.xml" file and the second row shows times with the file added. Note: I could have shown more WUs as samples, but they mostly showed the same time difference ratio for both versions.

Card | Setup | AR | Elapsed (s) | CPU (s) | WU#
---|---|---|---|---|---
GTX 980 (d0) | cmdline only | 0.4224 | 881 | 424 | 2101999550
GTX 980 (d0) | + config file | 0.4224 | 967 | 541 | 2102376628
GTX 980 (d2) | cmdline only | 0.4225 | 878 | 463 | 2101726988
GTX 980 (d2) | + config file | 0.4225 | 957 | 447 | 2102379836
GTX 780 (d3) | cmdline only | 0.4222 | 1067 | 605 | 2102034291
GTX 780 (d3) | + config file | 0.4221 | 1129 | 727 | 2102368697

http://setiathome.berkeley.edu/workunit.php?wuid=2101999550
http://setiathome.berkeley.edu/workunit.php?wuid=2101726988
http://setiathome.berkeley.edu/workunit.php?wuid=2102034291
http://setiathome.berkeley.edu/workunit.php?wuid=2102376628
http://setiathome.berkeley.edu/workunit.php?wuid=2102379836
http://setiathome.berkeley.edu/workunit.php?wuid=2102368697

Note: I ran the file with and without the ";;; GTX xxx" lines in it with no speed difference, so I left them in.
MultiBeam_NV_config.xml:

;;; GTX 980
<device0>
  <period_iterations_num>20</period_iterations_num>
  <spike_fft_thresh>4096</spike_fft_thresh>
  <sbs>192</sbs>
  <oclfft_plan>
    <size>256</size>
    <global_radix>256</global_radix>
    <local_radix>16</local_radix>
    <workgroup_size>256</workgroup_size>
    <max_local_size>512</max_local_size>
    <localmem_banks>16</localmem_banks>
    <localmem_coalesce_width>16</localmem_coalesce_width>
  </oclfft_plan>
</device0>
;;; GTX 780
<device1>
  <period_iterations_num>20</period_iterations_num>
  <spike_fft_thresh>4096</spike_fft_thresh>
  <sbs>192</sbs>
  <oclfft_plan>
    <size>256</size>
    <global_radix>256</global_radix>
    <local_radix>16</local_radix>
    <workgroup_size>256</workgroup_size>
    <max_local_size>512</max_local_size>
    <localmem_banks>16</localmem_banks>
    <localmem_coalesce_width>16</localmem_coalesce_width>
  </oclfft_plan>
</device1>
;;; GTX 980
<device2>
  <period_iterations_num>20</period_iterations_num>
  <spike_fft_thresh>4096</spike_fft_thresh>
  <sbs>192</sbs>
  <oclfft_plan>
    <size>256</size>
    <global_radix>256</global_radix>
    <local_radix>16</local_radix>
    <workgroup_size>256</workgroup_size>
    <max_local_size>512</max_local_size>
    <localmem_banks>16</localmem_banks>
    <localmem_coalesce_width>16</localmem_coalesce_width>
  </oclfft_plan>
</device2>
;;; GTX 960
<device3>
  <period_iterations_num>20</period_iterations_num>
  <spike_fft_thresh>4096</spike_fft_thresh>
  <sbs>192</sbs>
  <oclfft_plan>
    <size>256</size>
    <global_radix>256</global_radix>
    <local_radix>16</local_radix>
    <workgroup_size>256</workgroup_size>
    <max_local_size>512</max_local_size>
    <localmem_banks>16</localmem_banks>
    <localmem_coalesce_width>16</localmem_coalesce_width>
  </oclfft_plan>
</device3>

app_info.xml changes (only the OpenCL_r3366_SoG section shown):

<app>
  <name>setiathome_v8</name>
</app>
<file_info>
  <name>MB8_win_x86_SSE3_OpenCL_NV_r3366_SoG.exe</name>
  <executable/>
</file_info>
<file_info>
  <name>libfftw3f-3-3-4_x86.dll</name>
  <executable/>
</file_info>
<file_ref>
  <file_name>MultiBeam_Kernels_r3366.cl</file_name>
</file_ref>
<file_info>
  <name>mb_cmdline_win_x86_SSE3_OpenCL_NV.txt</name>
</file_info>
<file_info>
  <name>MultiBeam_NV_config.xml</name>
</file_info>
<app_version>
  <app_name>setiathome_v8</app_name>
  <version_num>800</version_num>
  <platform>windows_intelx86</platform>
  <avg_ncpus>0.04</avg_ncpus>
  <max_ncpus>0.2</max_ncpus>
  <plan_class>opencl_nvidia_SoG</plan_class>
  <cmdline></cmdline>
  <coproc>
    <type>CUDA</type>
    <count>1</count>
  </coproc>
  <file_ref>
    <file_name>MB8_win_x86_SSE3_OpenCL_NV_r3366_SoG.exe</file_name>
    <main_program/>
  </file_ref>
  <file_ref>
    <file_name>libfftw3f-3-3-4_x86.dll</file_name>
  </file_ref>
  <file_ref>
    <file_name>mb_cmdline_win_x86_SSE3_OpenCL_NV.txt</file_name>
    <open_name>mb_cmdline.txt</open_name>
  </file_ref>
  <file_ref>
    <file_name>MultiBeam_NV_config.xml</file_name>
  </file_ref>
</app_version>

mb_cmdline_win_x86_SSE3_OpenCL_NV.txt:

-instances_per_device 3 -tune 1 64 1 4

Real Join Date: Joe Januzzi (ID 253343) 29 Sep 1999, 22:30:36 UTC
Try to learn something new everyday. |
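Joe's table can be checked with simple arithmetic. A minimal Python sketch, using only the elapsed times he posted (assumed to be in seconds, as BOINC reports them), that computes the absolute and relative slowdown per card after the config file was added:

```python
# Elapsed times (seconds) from Joe's table: (cmdline only, cmdline + MultiBeam_NV_config.xml)
elapsed = {
    "GTX 980 (d0)": (881, 967),
    "GTX 980 (d2)": (878, 957),
    "GTX 780 (d3)": (1067, 1129),
}


def slowdown(before: int, after: int) -> tuple[int, float]:
    """Return the increase in seconds and the increase as a percentage of `before`."""
    delta = after - before
    return delta, 100.0 * delta / before


for card, (before, after) in elapsed.items():
    delta, pct = slowdown(before, after)
    print(f"{card}: +{delta} s ({pct:.1f}% slower)")
```

On these numbers every card lost between 62 and 86 seconds per WU (roughly 6 to 10 percent), which matches the "60+ seconds" Joe describes. Three samples per setup are of course too few to separate a real effect from WU-to-WU variation.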
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Did you experience issues with default settings before such tune attempts? Don't you see a contradiction between your posts? "Spiky and low GPU usage" is not a usability issue. Lags and driver restarts are. The defaults were chosen to allow operation without driver restarts in unattended mode on most cards. If a user wants to optimize, no problem, he can do that (and any lags that then appear are his own choice). So if r3401 is slower than the previous build with defaults, that's because it's a stock RC and should work on most cards available. The previous revision did not work on low-performance GPUs, which required separate tuning. And regarding peak app performance: do you see slower r3401 operation with the same tuning line as for the previous build? How do you compare builds? |
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
The "config" file is provided for multi-GPU hosts to supply separate tuning for different devices. If the tuning is the same for all of them, using the "config" file is unnecessary, and the command line (or any other available method of supplying command-line params) is enough. Beyond that, it doesn't matter which way you supply a param to the app: the app reacts to the param's value, not to how it was passed (the config file does take precedence, though, overriding common param values). Regarding performance changes between builds: 1) the defaults have changed; 2) work splitting for PulseFind has changed. If you see a slowdown because of "low performance GPU detected" and know that your GPU can perform OK with a lower -period_iterations_num N value, then set it lower and the app will obey. Otherwise, try changing the -sbs N setting and the values of the newly added settings: -pref_wg_size N (was 128 before, changed to 64 in r3401 for ATi, left at 32 for NV) and -pref_wg_num_per_cu N (default is 4; a different algorithm was used before, so there is no direct comparison with the older default). |
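The precedence Raistmer describes can be pictured as a simple merge: built-in defaults first, then the common command line, then the per-device entry from the config file. This is a hypothetical Python model, not the app's actual code; `resolve_params` and the dictionary layout are illustrative, the value 500 for a "low performance" GPU is the figure Rasputin42 reported earlier in the thread, and the device 3 override of 40 is made up for the example.

```python
def resolve_params(defaults: dict, cmdline: dict, device_config: dict) -> dict:
    """Hypothetical model of parameter precedence for one device.

    Later sources override earlier ones:
    built-in defaults < common command line < per-device config entry.
    """
    effective = dict(defaults)
    effective.update(cmdline)        # params from mb_cmdline*.txt apply to every device
    effective.update(device_config)  # a <deviceN> entry overrides the common value
    return effective


# Assumption: a GPU flagged as "low performance" starts with period_iterations_num = 500,
# as reported earlier in this thread.
low_perf_defaults = {"period_iterations_num": 500}

cmdline = {"period_iterations_num": 20, "sbs": 192}
config = {
    0: {},                             # no device-specific entry: cmdline values are used
    3: {"period_iterations_num": 40},  # hypothetical extra tuning for one card
}

for dev in (0, 3):
    print(dev, resolve_params(low_perf_defaults, cmdline, config[dev]))
```

The sketch also shows Raistmer's point about redundancy: if every `<deviceN>` entry just repeats the command-line values, the merged result is identical and the config file adds nothing.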
Joe Januzzi Joined: 13 Apr 03 Posts: 54 Credit: 307,134,110 RAC: 492 |
Raistmer, thanks again for all your help. Joe Regarding performance changes between builds: I'm guilty of this, not draining my cache before build changes. That could explain a lot. I will drain it before going back to r3401. In other case try to change -sbs N setting and values of newly added settings: I will make some changes when I'm back on r3401. The config file will be the last thing added, once I get r3401 tuned for best performance. The config file will be tested to see if I can fine-tune my GTX 960 separately from my other video cards. No speed change = no config file. |
Joe Januzzi Joined: 13 Apr 03 Posts: 54 Credit: 307,134,110 RAC: 492 |
-pref_wg_num_per_cu N (default is 4; different algorithm used before so no direct comparison with older default). I was wondering if there is a set range (like 4 to ??)? So far I've tried 4, 6, 10 and 20. None of the values so far seem to make any difference in speed. It's also harder to track changes for "-pref_wg_num_per_cu N", because nothing about it appears in the stderr.txt file. I think after "20" I'll go back to the default. So far this is my best setup for version 3401_SoG. Still trying different params. Joe

-sbs 192 -instances_per_device 3 -period_iterations_num 20 -pref_wg_size 32 -pref_wg_num_per_cu 4 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 16 -oclfft_tune_cw 16

http://setiathome.berkeley.edu/workunit.php?wuid=2105819825

Stderr output:

<core_client_version>7.4.42</core_client_version>
<![CDATA[
<stderr_txt>
Running on device number: 0
Maximum single buffer size set to:192MB
Number of app instances per device set to:3
Number of period iterations for PulseFind set to:20
Preferred workgroup size set to 32.   <--- Is this right?
SpikeFind FFT size threshold override set to:4096
TUNE: kernel 1 now has workgroup size of (64,1,4)
oclFFT global radix override set to:256
oclFFT local radix override set to:16
oclFFT max WG size override set to:256
oclFFT max local FFT size override set to:512
oclFFT number of local memory banks set to:16
oclFFT minimal memory coalesce width set to:16
Priority of worker thread raised successfully
Priority of process adjusted successfully, below normal priority class used
OpenCL platform detected: Intel(R) Corporation
OpenCL platform detected: NVIDIA Corporation
BOINC assigns device 0
Info: BOINC provided OpenCL device ID used
Build features: SETI8 Non-graphics OpenCL USE_OPENCL_NV OCL_ZERO_COPY SIGNALS_ON_GPU OCL_CHIRP3 FFTW USE_SSE3 x86
CPUID: Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz
Cache: L1=64K L2=256K
CPU features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3 SSSE3 SSE4.1 SSE4.2 AVX
OpenCL-kernels filename : MultiBeam_Kernels_r3401.cl
ar=0.427216 NumCfft=195899 NumGauss=1105829004 NumPulse=226440719403 NumTriplet=452848448667
Currently allocated 293 MB for GPU buffers
In v_BaseLineSmooth: NumDataPoints=1048576, BoxCarLength=8192, NumPointsInChunk=32768 |
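Joe's observation that -pref_wg_num_per_cu leaves no trace in stderr.txt suggests a quick check: compare the switches on the command line against the confirmation lines the app prints. A minimal Python sketch; the switch-to-message table is copied from the r3401 stderr excerpt above and is an assumption, not a complete list of what any build echoes, and `unconfirmed_switches` is a hypothetical helper name.

```python
# Confirmation text each switch produced in the r3401 stderr excerpt above
# (assumed mapping; other builds may word these lines differently or not echo them).
ECHO_PREFIX = {
    "-sbs": "Maximum single buffer size set to:",
    "-instances_per_device": "Number of app instances per device set to:",
    "-period_iterations_num": "Number of period iterations for PulseFind set to:",
    "-pref_wg_size": "Preferred workgroup size set to",
    "-spike_fft_thresh": "SpikeFind FFT size threshold override set to:",
    "-tune": "TUNE: kernel",
    "-oclfft_tune_gr": "oclFFT global radix override set to:",
    "-oclfft_tune_lr": "oclFFT local radix override set to:",
    "-oclfft_tune_wg": "oclFFT max WG size override set to:",
    "-oclfft_tune_ls": "oclFFT max local FFT size override set to:",
    "-oclfft_tune_bn": "oclFFT number of local memory banks set to:",
    "-oclfft_tune_cw": "oclFFT minimal memory coalesce width set to:",
}


def unconfirmed_switches(cmdline: str, stderr_text: str) -> list[str]:
    """Return switches in cmdline with no known confirmation line in stderr_text."""
    switches = [tok for tok in cmdline.split() if tok.startswith("-")]
    missing = []
    for sw in switches:
        prefix = ECHO_PREFIX.get(sw)
        # Unknown switches are reported too: we have no text to look for.
        if prefix is None or prefix not in stderr_text:
            missing.append(sw)
    return missing


cmd = "-sbs 192 -pref_wg_size 32 -pref_wg_num_per_cu 4"
log = "Maximum single buffer size set to:192MB\nPreferred workgroup size set to 32.\n"
print(unconfirmed_switches(cmd, log))  # ['-pref_wg_num_per_cu']
```

A flagged switch is only a hint: a build may accept a parameter silently, or, as turned out to be the case for -pref_wg_num_per_cu in r3401, not support it at all.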
Mike Joined: 17 Feb 01 Posts: 34346 Credit: 79,922,639 RAC: 80 |
Changing -pref_wg_num_per_cu N just makes it slower, at least on my GPU. It might be different on NV cards; I'm not sure at the moment. As soon as I have more time, I will test it on NV as well. |
Joe Januzzi Joined: 13 Apr 03 Posts: 54 Credit: 307,134,110 RAC: 492 |
|
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
-pref_wg_num_per_cu N was introduced in r3410, so it isn't in the current RC. |
Sleepy Joined: 21 May 99 Posts: 219 Credit: 98,947,784 RAC: 28,360 |
During crunching, I sometimes get a driver reset. BOINC switches to another pair of WUs (I am crunching 2 WUs at a time), which then follow their own fate, whatever it is. When crunching comes back to the WU(s) that caused the reset, the driver resets again at roughly the same completion percentage, and the cycle starts over. I eventually have to manually abort the "offending" WU(s) before system efficiency degrades, especially if another similar resetting WU comes along and makes the oscillations worse. I am experiencing this on both my systems, on a GTX 660 and a GTX 650 Ti. I am running version 3401, which should be the latest one. I have also tried the application's default settings, but the effects stay the same. I think, though I have not checked yet, that it is WU-related in some way, but I cannot be more precise at the moment. Is anyone having the same problems, or does anyone have a solution? Cheers, Sleepy |
Zalster Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
Sleepy, have you checked what the true angle is on those work units? Just curious. |
Sleepy Joined: 21 May 99 Posts: 219 Credit: 98,947,784 RAC: 28,360 |
This is what I wanted to do, but I am quite busy at the moment and cannot dig that deep. But yes, this was my idea as well, still to be confirmed. If you look at my computers' tasks, you will find my aborted MB WUs there for as long as they are still listed. Sleepy |
Zalster Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
Raistmer, will your app work with the new GBT data? |
Mike Joined: 17 Feb 01 Posts: 34346 Credit: 79,922,639 RAC: 80 |
Raistmer, will your app work with the new GBT data? I'm not Raistmer, but yes, it will. |
Zalster Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
So I was rewriting a portion of the app_info, and afterwards I got the following messages in the startup log:
4/13/2016 11:10:28 AM | SETI@home | [error] State file error: duplicate app version: setiathome_v8 windows_x86_64 800
4/13/2016 11:10:28 AM | SETI@home | [error] State file error: duplicate app version: setiathome_v8 windows_intelx86 800
When I went looking, I saw these 2 areas, and I think they are causing the error report. Should I remove one of them?
|
Jimbocous Joined: 1 Apr 13 Posts: 1856 Credit: 268,616,081 RAC: 1,349 |
So was rewriting a portion of the app_info and afterwards got the following message in start up log Yeah, I'd think so. The error message seems pretty clear... |
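Following Jimbocous's point, a minimal Python sketch for spotting a duplicate before restarting BOINC: it parses app_info.xml with the standard library and reports `<app_version>` entries that share the same key. Assumption: the key used here is (app_name, platform, version_num, plan_class); the log lines above print only the first three, so the client's exact matching rule may differ. The sample file is trimmed and hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_app_versions(app_info_xml: str) -> list[tuple[str, ...]]:
    """Return keys that occur in more than one <app_version> block.

    Assumed key: (app_name, platform, version_num, plan_class).
    """
    root = ET.fromstring(app_info_xml)
    keys = [
        tuple(
            (av.findtext(tag) or "").strip()
            for tag in ("app_name", "platform", "version_num", "plan_class")
        )
        for av in root.iter("app_version")
    ]
    return [key for key, count in Counter(keys).items() if count > 1]


# Trimmed, hypothetical example: two identical SoG entries plus one with another plan class.
SAMPLE = """<app_info>
  <app_version>
    <app_name>setiathome_v8</app_name>
    <version_num>800</version_num>
    <platform>windows_intelx86</platform>
    <plan_class>opencl_nvidia_SoG</plan_class>
  </app_version>
  <app_version>
    <app_name>setiathome_v8</app_name>
    <version_num>800</version_num>
    <platform>windows_intelx86</platform>
    <plan_class>opencl_nvidia_SoG</plan_class>
  </app_version>
  <app_version>
    <app_name>setiathome_v8</app_name>
    <version_num>800</version_num>
    <platform>windows_intelx86</platform>
    <plan_class>cuda50</plan_class>
  </app_version>
</app_info>"""

print(duplicate_app_versions(SAMPLE))
# [('setiathome_v8', 'windows_intelx86', '800', 'opencl_nvidia_SoG')]
```

Removing one block of each reported pair (and keeping a backup of the original file) is the obvious next step; if the client still complains, its key is probably narrower than the one assumed here.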
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.