Tuning up twin GTX 750 Ti's

Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1847460 - Posted: 9 Feb 2017, 4:35:20 UTC

OK, I now have two GTX 750 Ti's running on my HP Z400 Workstation. The GPUs are reportedly running at 99% load.

One is full-height/full-length, and the new one is a low-profile, basically tiny version of the first one I bought.

I have read about "command line" parameters and so forth, but I am not clear on exactly what goes where, in which named text files.

I am also wondering whether I can approach the performance a GTX 750 Ti gets under Linux while running a Windows box.

So far, I get work unit turnaround of 5-12 minutes for some types of data and up to 30 minutes for another type.

I keep running into the implication that my GPUs could run 2 or even 3 WUs in parallel with little or no loss of throughput. My experiments a couple of years ago showed that 2 WUs running at the same time on my GTX 750 Ti each ran half as fast. Is this basically only something you can do with the Lunatics setup? How?

Which parameters go in which named files, located in which directories?

Or failing that, a URL that is clearer than some of the threads out here have been :(

Thanks,
Tom
A proud member of the OFA (Old Farts Association).
ID: 1847460
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 1847544 - Posted: 9 Feb 2017, 15:51:30 UTC
Last modified: 9 Feb 2017, 16:01:45 UTC

This goes into the mb_cmdline_win_*.txt file for OpenCL MultiBeam, probably called something like mb_cmdline-8.22_windows_intel__opencl_nvidia_SoG.txt or similar.
Default directory: C:\ProgramData\BOINC\projects\setiathome.berkeley.edu\
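For instance, a minimal mb_cmdline file is just a single line of switches; a hypothetical example (values illustrative, not a recommendation):

-v 1 -sbs 192 -period_iterations_num 50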

The MultiBeam OpenCL application is intended to process SETI@home MultiBeam v6, v7, v8, and forthcoming "large" tasks.

Source code repository: https://setisvn.ssl.berkeley.edu/svn/branches/sah_v7_opt
Build from SVN revision: 3330
Date of revision commit: 2016/01/09 14:55:56

Available command line switches:

-v N : sets the app's level of verbosity. N is an integer number.
       The default corresponds to -v 1.

    -v 0 disables almost all output.
         Levels 2 to 5 are reserved for increasing verbosity; higher levels are reserved
         for specific uses.
    -v 2 enables output of all signals.
    -v 6 enables printing of delays where sleep loops are used.
    -v 7 enables oclFFT config printing for oclFFT fine-tuning.

-period_iterations_num N : Splits the single PulseFind kernel call into N calls for the longest PulseFind calls. Can be used to reduce GUI lag or to prevent driver restarts. Can affect performance; experimentation required. The default value for v6/v7/v8 tasks is N=20. N should be a positive integer.

-spike_fft_thresh N : Sets the threshold FFT size at which the switch between different SpikeFind algorithms occurs.

-sbs N : Sets the maximum single buffer size for GPU memory allocations, in MB. N should be a positive integer. Can affect performance and the application's total memory requirements; experimentation required.

-hp : Runs the application process at higher priority (normal priority class and above-normal thread priority). Can be used to increase GPU load; experimentation is required for a particular GPU/CPU/GPU driver combo.

-cpu_lock : Limits the number of CPUs available to a particular app instance; the app will also attempt to bind different instances to different CPU cores. Can increase performance under some specific conditions, but can decrease it in others; experimentation required.

-total_GPU_instances_num N : For use together with -cpu_lock on multi-vendor GPU hosts.
Set N to the total number of simultaneously running GPU OpenCL SETI apps on the host (across all used GPUs of all vendors). The app needs this number to properly select a logical CPU for execution in affinity-management (-cpu_lock) mode. Should not exceed 64.
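For example, on a host running 2 instances on each of 2 GPUs (4 GPU apps in total), a hypothetical combination would be:

-cpu_lock -total_GPU_instances_num 4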

-cpu_lock_fixed_cpu N : Also enables CPUlock, but binds all app instances to the same N-th CPU (N=0, 1, ..., number of CPUs - 1).

-no_cpu_lock : Disables affinity management (the opposite of the -cpu_lock option). For the ATi version, CPUlock affinity management is enabled by default.

-use_sleep : Adds additional Sleep() calls to yield the CPU to other processes. Can affect performance; experimentation required.

-use_sleep_ex N : Enables use_sleep and sets the argument of the Sleep() call to N: Sleep(N).

-no_caching : Disables binary caching of CL files.

The following options are already obsolete. They have not been tested for proper operation
with the latest builds and are listed only for completeness:

-gpu_lock : Enables the old-style GPU lock. Use the -instances_per_device N switch to provide the number of instances to run.

-instances_per_device N : Sets the allowed number of simultaneously executing GPU app instances (shared with AstroPulse app instances).
N is the integer number of allowed instances; should not exceed 64.

These 2 options used together provide a BOINC-independent way to limit the number of simultaneously executing GPU apps. Each SETI OpenCL GPU application with these switches enabled will create/check global mutexes and suspend its process execution if the limit is reached. A waiting process consumes zero CPU/GPU and a rather low amount of memory while it waits to continue execution.
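A hypothetical example, allowing at most 2 simultaneous instances per device via the old-style lock:

-gpu_lock -instances_per_device 2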

-tune N Mx My Mz : Allows the user to fine-tune the kernel launch sizes of the most important kernels.

      N            - kernel ID (see below)
      Mx My Mz     - workgroup size of the kernel. For 1D workgroups, Mx is the size of
                     the first dimension and the other two should be My=Mz=1.

      N should be one of the values from this list:
          TRIPLET_HD5_WG=1

For best tuning results it's recommended to launch the app under a profiler to see how a particular workgroup-size choice affects a particular kernel. This option is mostly for developers and hardcore optimization enthusiasts wanting the absolute max from their setups. No big changes in speed are expected, but if you see a big positive change over the default, please report it.
Usage example: -tune 1 2 1 64 (sets a workgroup size of 128 (2x1x64) for the TripletFind_HD5 kernels).


This class of options tunes oclFFT performance:
-oclfft_tune_gr N : Global radix
-oclfft_tune_lr N : Local radix
-oclfft_tune_wg N : Workgroup size
-oclfft_tune_ls N : Max size of local memory FFT
-oclfft_tune_bn N : Number of local memory banks
-oclfft_tune_cw N : Memory coalesce width

For examples of app_info.xml entries, look in the text file with the .aistub extension provided in the corresponding package.

Command line switches can be used either in app_info.xml or in mb_cmdline.txt. Params in mb_cmdline*.txt will override switches in the <cmdline> tag of app_info.xml.
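As a hypothetical illustration, a <cmdline> entry inside an app_info.xml <app_version> block could look like this (switch values are examples only; see the .aistub files for complete entries):

<app_version>
	...
	<cmdline>-sbs 192 -period_iterations_num 50 -spike_fft_thresh 2048</cmdline>
	...
</app_version>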

For device-specific settings in multi-GPU systems it's possible to override some of the command-line options via an application config file.

Name of this config file:
MultiBeam_<vendor>_config.xml where vendor can be ATi, NV or iGPU.
File structure:
<deviceN>
	<period_iterations_num>N</period_iterations_num>
	<spike_fft_thresh>N</spike_fft_thresh>
	<sbs>N</sbs>
	<oclfft_plan>
		<size>N</size>
		<global_radix>N</global_radix>
		<local_radix>N</local_radix>
		<workgroup_size>N</workgroup_size>
		<max_local_size>N</max_local_size>
		<localmem_banks>N</localmem_banks>
		<localmem_coalesce_width>N</localmem_coalesce_width>
	</oclfft_plan>
	<no_caching>
</deviceN>

where deviceN is the particular OpenCL device N, starting with 0; multiple sections are allowed, one per device.
The other fields are the corresponding command-line options to override for this particular device.
All or some sections can be omitted.

Don't forget to re-check the mapping of device numbers to physical cards after a driver update or a physical slot change; both can change the cards' enumeration order.
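As a hypothetical illustration for a two-GPU host, a MultiBeam_NV_config.xml that gives the second device a larger PulseFind split (values illustrative):

<device0>
	<period_iterations_num>50</period_iterations_num>
</device0>
<device1>
	<period_iterations_num>100</period_iterations_num>
</device1>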

Best usage tips:
For best performance when running multiple instances it is important to free 2 CPU cores. Freeing at least 1 CPU core is a necessity to get enough GPU usage.

If you experience screen lag or driver restarts, increase -period_iterations_num in app_info.xml or mb_cmdline*.txt.
Freeing CPU core(s) becomes more important the more instances you run.
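As a concrete illustration (assuming an 8-thread CPU): in BOINC's computing preferences, setting "Use at most 75% of the CPUs" leaves 2 threads free to feed the GPUs.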

============= ATi specific info ===============

Known issues:

- With 12.x Catalyst drivers, GPU usage can be low if the CPU is fully used by other loads. App performance can be increased in this case by using the -cpu_lock switch.

- Catalyst 12.11 beta and 13.1 have a broken OpenCL compiler that will result in driver restarts or invalid results. These drivers can still be used if the kernel binaries are precompiled under an older Catalyst driver: delete all *.bin* files from the SETI project directory, revert to Catalyst 12.8 or 12.10 (or upgrade to Catalyst 13.2 or later), process at least one task (check that the *.bin* files were generated again), and then (if needed) update to Catalyst 13.1.
- New builds with versioned binary caches require an additional step: rename the old bin_* files to match the name (the driver-version part of the name) of the newly generated ones.

App instances:

On high end cards (HD 5850/5870, 6950/6970, 7950/7970, R9 280X/290X) running 3 instances should be fastest.
HD 7950/7970 and R9 280X/290X can handle 4 instances very well; testing required. Remember to keep CPU cores free.

On mid range cards (HD 5770, 6850/6870, 7850/7870) best performance should come from running 2 instances.

Suggested command line switches:
_______________________________

When running 3 instances, -sbs 192 is the best option for speed on a 1 GB GPU.
If only 2 instances are running, set -sbs 256 for maximum speedup.
Users with a 3 GB GPU should set -sbs 244 - 280 for best speed when running 3 or more instances.
This might require some fine tuning.

One instance requires approx. 500 MB of VRAM (depending on the -sbs value used).

Entry level cards HD x3xx / x4xx R7 230/240/250
-spike_fft_thresh 2048 -tune 1 2 1 16

Mid range cards x5xx / x6xx / x7xx / R9 260 / 270
-spike_fft_thresh 2048 -tune 1 64 1 4

High end cards x8xx / x9xx / R9 280x / 290x
-spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64


============= Intel specific info =============

Suggested command line switches:

HD 2500
-spike_fft_thresh 2048 -tune 1 2 1 16
(*requires testing)

HD 4000
-spike_fft_thresh 2048 -tune 1 64 1 4
(*requires testing)

HD 4200 / HD 4600 / HD 5xxx
-spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 512
(*requires testing)


============= NV specific info ================

Known issues:

With NV drivers past 267.xx, GPU usage can be low if the CPU is fully used by other loads.
App performance can be increased in this case by using the -cpu_lock switch, and CPU time savings are possible with the -use_sleep switch.

Suggested command line switches:

Entry Level cards NV x20 x30 x40
-sbs 128 -spike_fft_thresh 2048 -tune 1 2 1 16
(*requires testing)

Mid range cards x50 x60 x70
-sbs 192 -spike_fft_thresh 2048 -tune 1 64 1 4
(*requires testing)

High end cards x8x 780TI Titan / Titan Z
-sbs 256 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64
(*requires testing)
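Putting the mid-range suggestions together for the GTX 750 Ti's discussed in this thread, a hypothetical starting command line might be (values illustrative; experiment on your own host):

-sbs 192 -spike_fft_thresh 2048 -tune 1 64 1 4 -period_iterations_num 50
(*requires testing)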

===============================================

FAQ:

To reduce screen lag, especially when running VLARs, it is important to add
-period_iterations_num and an -sbs value to the mb_cmdline*.txt file.
Especially on slow GPUs, you need to INCREASE -period_iterations_num to REDUCE screen lag.
Note that each GPU has different characteristics, so experimentation is required.

Examples:
 -period_iterations_num  80 -sbs 128
 -period_iterations_num 100 -sbs 192
 -period_iterations_num 150 -sbs 256
 -period_iterations_num 200 -sbs 256

-period_iterations_num can be increased up to 1000, but this will significantly slow down processing;
this value increases the wait loop between PulseFind kernels.
A bigger -sbs value increases VRAM consumption but should also speed up computing.

ID: 1847544
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 1847546 - Posted: 9 Feb 2017, 15:55:51 UTC
Last modified: 9 Feb 2017, 16:01:33 UTC

This goes into the ap_cmdline_win_*.txt file for OpenCL AstroPulse, probably called something like ap_cmdline-710_windows_intel__opencl_nvidia.txt or similar.
Default directory: C:\ProgramData\BOINC\projects\setiathome.berkeley.edu\

The AstroPulse OpenCL application is currently available in 3 editions: for AMD/ATi, nVidia, and Intel GPUs.
It is intended to process SETI@home AstroPulse v7 tasks.

Source code repository: https://setisvn.ssl.berkeley.edu/svn/branches/sah_v7_opt
Build revision: 2742
Date of revision commit: 2014/11/02 19:58:47

Available command line switches:
-v N : sets the app's level of verbosity. N is an integer number. -v 0 disables almost all output. The default corresponds to -v 1.
	Levels 2 to 5 are reserved for increasing verbosity; higher levels are reserved for specific uses.
	-v 2 enables output of all signals.
	-v 3 additionally to level 2 enables output of simulated signals corresponding to the current threshold level (to easily detect near-threshold validation issues).
	-v 6 enables printing of delays where sleep loops are used.
	-v 7 enables oclFFT config printing for oclFFT fine-tuning.
	-v 8 enables printing of allocated GPU memory for different parts of the algorithm.

-ffa_block N : Sets how many of the FFA's different period iterations will be processed per kernel call. N should be an even integer less than 32768.
Increasing this param's value will increase the app's GPU memory consumption.

-ffa_block_fetch N : Sets how many of the FFA's different period iterations will be processed per "fetch" kernel call (the longest kernel in the FFA).
N should be a positive integer and a divisor of the -ffa_block value.

-unroll N : Sets the number of data chunks processed per kernel call in the main application loop. N should be an integer; the minimum possible value is 2.
Increasing this param's value will increase the app's GPU memory consumption.

-skip_ffa_precompute : Skips the FFA pre-compute kernel call. Affects performance; experimentation is required to see whether it increases or decreases performance on a particular GPU/CPU combo.

-exit_check : Checks more often for exit requests from BOINC. Use this option if you experience problems with slow app suspend/exit.
Can decrease performance, though.

-use_sleep : Adds additional Sleep() calls to yield the CPU to other processes. Can affect performance; experimentation required.

-initial_ffa_sleep N M : In the PC-FFA, sleeps N ms for the short FFA and M ms for the large one before looking for results. Can decrease CPU usage.
Affects performance; experimentation is required for a particular CPU/GPU/GPU driver combo. N and M should be non-negative integers.
An approximation of useful values can be obtained by running the app with the -v 6 and -use_sleep switches enabled and analyzing the stderr.txt log file.
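For example (hypothetical workflow; the actual values come from your own log): first run with

-use_sleep -v 6

then, based on the delays reported in stderr.txt, set something like

-initial_ffa_sleep 20 40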

-initial_single_pulse_sleep N : In the SingleFind search, sleeps N ms before looking for results. Can decrease CPU usage.
Affects performance; experimentation is required for a particular CPU/GPU/GPU driver combo. N should be a positive integer.
An approximation of useful values can be obtained by running the app with the -v 6 and -use_sleep switches enabled and analyzing the stderr.txt log file.

-sbs N : Sets the maximum single buffer size for GPU memory allocations, in MB. N should be a positive integer.
For now, if other options require a bigger buffer than this option allows, a warning will be issued but the memory allocation attempt will still be made.

-hp : Runs the application process at higher priority (normal priority class and above-normal thread priority).
Can be used to increase GPU load; experimentation is required for a particular GPU/CPU/GPU driver combo.

-cpu_lock : Enables the CPUlock feature. Limits the number of CPUs available to a particular app instance; the app will also attempt to bind different instances to different CPU cores.
Can increase performance under some specific conditions, but can decrease it in others; experimentation required.
Currently this option allows the GPU app to use only a single logical CPU.
Different instances will use different CPUs as long as there are enough CPUs in the system.
To use CPUlock in round-robin mode, the GPUlock feature will be enabled. Use the -instances_per_device N option if multiple instances per GPU device are needed.

-cpu_lock_fixed_cpu N : Also enables CPUlock, but binds all app instances to the same N-th CPU (N=0, 1, ..., number of CPUs - 1).

-gpu_lock : Enables the old-style GPU lock. Use the -instances_per_device N switch to provide the number of instances to run.

-instances_per_device N : Sets the allowed number of simultaneously executing GPU app instances per GPU device (shared with MultiBeam app instances).
N is the integer number of allowed instances.

These 2 options used together provide a BOINC-independent way to limit the number of simultaneously
executing GPU apps. Each SETI OpenCL GPU application with these switches enabled will create/check global mutexes and suspend its process
execution if the limit is reached. A waiting process consumes zero CPU/GPU and a rather low amount of memory while it waits to continue execution.

-disable_slot N : Can be used to exclude the N-th GPU (starting from zero) from usage.
An untested and obsolete feature; use BOINC's own abilities to exclude GPUs instead.

Advanced options for developers (some reading of the app code and understanding of the algorithms used is recommended before use; these are not fool-proof even to the same degree as the options above):
-tune N Mx My Mz : Allows the user to fine-tune the kernel launch sizes of the most important kernels.
	N - kernel ID (see below)
	Mx My Mz - workgroup size of the kernel. For 1D workgroups, Mx is the size of the first dimension and the other two should be My=Mz=1.
	N should be one of the values from this list:
	FFA_FETCH_WG=1,
	FFA_COMPARE_WG=2
	For best tuning results it's recommended to launch the app under a profiler to see how a particular workgroup-size choice affects a particular kernel.
	This option is mostly for developers and hardcore optimization enthusiasts wanting the absolute max from their setups.
	No big changes in speed are expected, but if you see a big positive change over the default, please report it.
	Usage example: -tune 2 32 1 1 (sets a workgroup size of 32 for the 1D FFA comparison kernel).

-oclFFT_plan A B C : Overrides the defaults for FFT 32k plan generation. Read the oclFFT code and the explanations in its comments before any tweaking.
	A - global radix
	B - local radix
	C - max size of the workgroup used by the oclFFT kernel generation algorithm

Usage examples:
-oclFFT_plan 64 8 256 (this corresponds to the old defaults);
-oclFFT_plan 0 0 0 (this effectively means the option is not used; hardwired defaults are in play).

These switches can also be placed in the file called ap_cmdline.txt.

For device-specific settings in multi-GPU systems it's possible to override some of the command-line options via an application config file.
Name of this config file:
AstroPulse_<vendor>_config.xml where vendor can be ATi, NV or iGPU.
File structure:
<deviceN>
	<unroll>N</unroll>
	<ffa_block>N</ffa_block>
	<ffa_block_fetch>N</ffa_block_fetch>
	<oclfft_plan>
		<size>32768</size>
		<global_radix>N</global_radix>
		<local_radix>N</local_radix>
		<workgroup_size>N</workgroup_size>
	</oclfft_plan>
	<tune>
		<tune_kernel_index>N</tune_kernel_index>
		<tune_workgroup_size_x>N</tune_workgroup_size_x>
		<tune_workgroup_size_y>N</tune_workgroup_size_y>
		<tune_workgroup_size_z>N</tune_workgroup_size_z>
	</tune>
	<initial_ffa_sleep_short>N</initial_ffa_sleep_short>
	<initial_ffa_sleep_large>N</initial_ffa_sleep_large>
	<initial_single_pulse_sleep>N</initial_single_pulse_sleep>
	<sbs>N</sbs>
	<skip_ffa_precompute>
	<no_defaults_scaling>
</deviceN>

where deviceN is the particular OpenCL device N, starting with 0; multiple sections are allowed, one per device.
The other fields are the corresponding command-line options to override for this particular device.
All or some sections can be omitted.
AstroPulse uses only one FFT size in its GPU-based calculations, so the <size> field in <oclfft_plan> is fixed at 32768 and can be omitted; it is there to be used in MultiBeam.
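As a hypothetical illustration, an AstroPulse_NV_config.xml overriding only the FFA-related options for the first device (values illustrative):

<device0>
	<unroll>12</unroll>
	<ffa_block>12288</ffa_block>
	<ffa_block_fetch>6144</ffa_block_fetch>
</device0>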

For examples of app_info.xml entries, look in the text file with the .aistub extension provided in the corresponding package.

Known issues:
- With 12.x Catalyst drivers, GPU usage can be low if the CPU is fully used by other loads.
If you see low GPU usage on zero-blanked tasks, try to free one or more CPU cores. *
- For overflowed tasks, the found signal sequence does not always match the CPU version.

Best usage tips:

For best performance when running multiple instances it is important to free 2 CPU cores.
Freeing at least 1 CPU core is a necessity to get enough GPU usage.*

* As an alternative solution, try the -cpu_lock / -cpu_lock_fixed_cpu N options.
This might only work on fast multicore CPUs.

Command line parameters:
Command line switches can be used either in app_info.xml or in ap_cmdline_win_x86_SSE2_OpenCL_ATI.txt.
Params in ap_cmdline*.txt will override switches in the <cmdline> tag of app_info.xml.
_______________________

High end cards (more than 30 compute units)

-unroll 18 -ffa_block 16384 -ffa_block_fetch 8192

* Bigger unroll values (below 20) don't necessarily result in better run times.

Mid range cards (12 - 24 compute units)

-unroll 12 -ffa_block 12288 -ffa_block_fetch 6144

Entry level GPUs (less than 6 compute units)

-unroll 4 -ffa_block 2048 -ffa_block_fetch 1024

-tune switch

possible values:

-tune 1 256 1 1
-tune 1 128 2 1
-tune 1 64 4 1
-tune 1 32 8 1
-tune 1 16 16 1

Intensive testing showed -tune 1 64 4 1 and -tune 1 32 8 1 to be fastest on the HD 7970 and R9 280X.
Further testing is required for other GPUs.

-oclFFT_plan switch

Use at your own risk!
------------------------

FFTs are processed with 8-point FFT kernels by default.
Using different FFT kernel planning can speed up processing significantly.
In most cases 16-point FFT kernels are fastest for AstroPulse v7.

-oclFFT_plan 256 16 256

Example:

High end cards
-unroll 18 -oclFFT_plan 256 16 256 -ffa_block 16384 -ffa_block_fetch 8192 -tune 1 64 4 1 -tune 2 64 4 1

Mid range cards
-unroll 12 -oclFFT_plan 256 16 256 -ffa_block 12288 -ffa_block_fetch 6144 -tune 1 64 4 1 -tune 2 64 4 1

Your mileage might vary.
-----------------------------------------------------

App instances.
______________

On the HD 7950/7970 and R9 280X, running 2 instances should be fastest.
On the R9 290X, running 3 instances should be easily possible; further testing required.

On mid range cards (HD 5770, 6850/6870, 7850/7870 and R8) best performance should come from running 2 instances.

If you experience screen lag, reduce the unroll factor and the ffa_block_fetch value.

Addendum:
_________

Running multiple cards in a system requires freeing another CPU core.

ID: 1847546
Profile Tom M
Volunteer tester

Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1847563 - Posted: 9 Feb 2017, 17:36:49 UTC

Thank you for the information. I also received a PM on this from someone running multiple GTX 750 Ti's.

He is using:

===============================================================

"app_config.xml"

<app_config>
	<app>
		<name>setiathome_v8</name>
		<gpu_versions>
			<gpu_usage>0.5</gpu_usage>
			<cpu_usage>0.34</cpu_usage>
		</gpu_versions>
	</app>
	<app>
		<name>astropulse_v7</name>
		<max_concurrent>1</max_concurrent>
		<gpu_versions>
			<gpu_usage>0.5</gpu_usage>
			<cpu_usage>1</cpu_usage>
		</gpu_versions>
	</app>
</app_config>

=========================================================================

in "mb_cmdline-8.22_windows_intel__opencl_nvidia_SoG.txt"

-use_sleep_ex 60 -sleep_quantum 5 -high_prec_timer -high_perf -tune 1 64 1 4 -tune 2 64 1 4 -period_iterations_num 15 -sbs 384 -spike_fft_thresh 4096 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -hp

=======================================================================

These will run 2 WUs on each GPU, but will not dedicate the CPUs to them; the CPU time I give to other projects. Purists will say I am not fully using the GPUs for SETI, and they are correct: I only get 95-99% busy on the GPUs for SETI. Each 3 WUs will need 1 CPU or so (depending on the CPU speed), so watch for a while and see what YOUR system needs.
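To spell out the arithmetic (one plausible reading of these settings): <gpu_usage>0.5</gpu_usage> tells BOINC each task occupies half a GPU, so two GPUs run 2 x 2 = 4 GPU tasks at once; at <cpu_usage>0.34</cpu_usage> each, BOINC budgets about 4 x 0.34 = 1.36 CPU cores for them, which matches the rough "each 3 WUs need about 1 CPU" rule above.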
------------------------------------------------------------------------------------------------------------------------------------

So I have now installed the above, and my GPUs are each running 2 processes.

The question is whether they will process 2 WUs at a time at the same pace that they used to run 1 WU (which has been in the 5-15 minute range).

Thank you.

Tom Miller
A proud member of the OFA (Old Farts Association).
ID: 1847563
rob smith
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22189
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1847571 - Posted: 9 Feb 2017, 17:55:19 UTC

If you are getting >90% GPU use, then you are doing about as well as is possible without really hitting the computational performance of the GPU.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1847571
