OpenCL NV MultiBeam v8 SoG edition for Windows

Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6324
Credit: 106,370,077
RAC: 121
Russia
Message 1771227 - Posted: 12 Mar 2016, 18:49:36 UTC - in response to Message 1771199.  


Raistmer,
Do you think that some day “<instances_per_device>N</instances_per_device>” could be added to the MultiBeam_NV_config.xml file?

Hardly.
And there is definitely no sense in doing that until BOINC is able to run a different number of tasks on different GPUs of the same vendor. Can it?
ID: 1771227 · Report as offensive
Grumpy Swede (I stand with Ukraine)
Volunteer tester
Joined: 1 Nov 08
Posts: 8922
Credit: 49,849,242
RAC: 65
Sweden
Message 1771284 - Posted: 13 Mar 2016, 0:12:29 UTC
Last modified: 13 Mar 2016, 0:25:08 UTC

Well, I really do not have the time any longer to properly analyze this latest SoG version (r3401). However, after having run around 400 tasks in the last 24 hours or so, I think I can say that this version is indeed faster than the one I ran before (r3366), which I used for around 13000 WUs.

Same settings as with r3366.

Maybe some day (probably not until I retire in 5 years or so) I will have the time to do some serious testing.

Edit: I will continue to let it run.
ID: 1771284 · Report as offensive
Grumpy Swede (I stand with Ukraine)
Volunteer tester
Joined: 1 Nov 08
Posts: 8922
Credit: 49,849,242
RAC: 65
Sweden
Message 1772694 - Posted: 19 Mar 2016, 22:00:40 UTC

Best settings so far, for my GTX980:

-cpu_lock -sbs 256 -period_iterations_num 20 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -instances_per_device 4

Dropping -period_iterations_num to 20, from default 50, increased the speed considerably.
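
To compare tuning lines like the one above between hosts, it can help to split a line into options and values first. What follows is only a minimal sketch in Python: `parse_cmdline` is a hypothetical helper, not part of the app, and it assumes every option starts with "-" followed by a letter while values never do.

```python
def parse_cmdline(line):
    """Split an app tuning line into {option: [values]}.

    Assumption: options start with '-' followed by a letter, and every
    token up to the next option is a value of the current option.
    """
    options = {}
    current = None
    for token in line.split():
        if token.startswith("-") and token[1:2].isalpha():
            current = token[1:]
            options[current] = []
        elif current is not None:
            options[current].append(token)
    return options


# The GTX 980 line quoted in this post.
gtx980 = parse_cmdline(
    "-cpu_lock -sbs 256 -period_iterations_num 20 -spike_fft_thresh 4096 "
    "-tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 "
    "-oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 "
    "-oclfft_tune_cw 64 -instances_per_device 4"
)
print(gtx980["period_iterations_num"])
print(gtx980["tune"])
```

Two parsed lines can then be compared key by key to see exactly which settings differ between two hosts.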
ID: 1772694 · Report as offensive
Grumpy Swede (I stand with Ukraine)
Volunteer tester
Joined: 1 Nov 08
Posts: 8922
Credit: 49,849,242
RAC: 65
Sweden
Message 1772834 - Posted: 20 Mar 2016, 16:31:19 UTC

I will now try 3 SoGs at a time, instead of 4, and add one more CPU core/thread.
Making it 3 SoGs at a time, and 3 CPU WUs.

Same settings as before, except for "-instances_per_device 3"
ID: 1772834 · Report as offensive
Grumpy Swede (I stand with Ukraine)
Volunteer tester
Joined: 1 Nov 08
Posts: 8922
Credit: 49,849,242
RAC: 65
Sweden
Message 1773039 - Posted: 21 Mar 2016, 14:33:19 UTC

Geez, it is very hard to say whether 3 SoGs and 3 CPU WUs is faster or slower than 4 SoGs and 2 CPU WUs.

I need to let this run for a week or so, before I can say for sure. Yeah well, who's in a hurry?
ID: 1773039 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6324
Credit: 106,370,077
RAC: 121
Russia
Message 1773210 - Posted: 22 Mar 2016, 7:51:13 UTC - in response to Message 1773039.  

Indeed, the performance variation from AR (angle range) changes is bigger than any possible change in performance from switching between 3 and 4 tasks per GPU. So really good statistics, or some offline tests in a controlled environment, are required for that.
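
One way to keep AR from masking the effect is to compare throughput only between tasks of similar AR. The sketch below is an illustration, not a tool used in this thread: `throughput_by_ar` is a hypothetical helper, the binning by rounded AR is an assumption, and the sample numbers are placeholders rather than measurements.

```python
from collections import defaultdict
from statistics import mean


def throughput_by_ar(results, ar_digits=1):
    """Return {(ar_bin, tasks_per_gpu): tasks finished per hour}.

    results: iterable of (ar, elapsed_seconds, tasks_per_gpu) tuples.
    Tasks are only compared within the same rounded-AR bin, so AR-driven
    run-time differences do not mask the effect of tasks_per_gpu.
    """
    groups = defaultdict(list)
    for ar, elapsed, tasks_per_gpu in results:
        groups[(round(ar, ar_digits), tasks_per_gpu)].append(elapsed)
    return {
        key: key[1] * 3600.0 / mean(times)
        for key, times in groups.items()
    }


# Placeholder numbers for illustration only, not measurements from this thread.
sample = [
    (0.42, 900.0, 3), (0.43, 900.0, 3),
    (0.42, 1200.0, 4), (0.43, 1200.0, 4),
]
for (ar_bin, n), per_hour in sorted(throughput_by_ar(sample).items()):
    print(f"AR~{ar_bin} with {n} tasks/GPU: {per_hour:.1f} tasks/hour")
```

With N tasks running at once, throughput is N divided by the mean elapsed time, so a longer per-task time at 4 tasks per GPU is not necessarily a loss.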
ID: 1773210 · Report as offensive
Joe Januzzi
Volunteer tester
Joined: 13 Apr 03
Posts: 54
Credit: 307,134,110
RAC: 492
United States
Message 1773305 - Posted: 22 Mar 2016, 21:55:46 UTC - in response to Message 1772694.  

Dropping -period_iterations_num to 20, from default 50, increased the speed considerably.

Dropping it to 20 helped my speed too! Thanks, Tutankhamon, for the info.

I had the -v 8 switch running without knowing it for about a week. It was at the tail end of the command line, where I didn't see it because my screen was too small :-(
So now I'm backtracking a little bit. Hopefully with better data this time.

Here's my “mb_cmdline_win_x86_SSE3_OpenCL_NV.txt” file.

-sbs 192 -instances_per_device 3 -period_iterations_num 20 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 16 -oclfft_tune_cw 16

Real Join Date:
Joe Januzzi (ID 253343) 29 Sep 1999, 22:30:36 UTC
Try to learn something new everyday.
ID: 1773305 · Report as offensive
Rasputin42
Volunteer tester

Joined: 25 Jul 08
Posts: 412
Credit: 5,834,661
RAC: 0
United States
Message 1773310 - Posted: 22 Mar 2016, 22:24:28 UTC

The r3401 version is not working well for me.
I guess it is no good for cards with few compute units (2 in my case).
One of the test WUs (from Lunatics) does not even run at all (no error, but no CPU or GPU usage).
I tried all sorts of tweaking and different drivers, but performance is bad.

The r3366 works fine.
ID: 1773310 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6324
Credit: 106,370,077
RAC: 121
Russia
Message 1773375 - Posted: 23 Mar 2016, 5:53:24 UTC - in response to Message 1773310.  
Last modified: 23 Mar 2016, 6:01:25 UTC


I tried all sorts of tweaking

What exactly did you try?
r3401 is currently an RC build, so any usability degradation not solved on beta will remain after release. Did you check the system log for driver restart events?
EDIT: on beta I see this completed result:

Defaults scaling is disabled, basic defaults will be used. Tuning on user's discretion.
Number of period iterations for PulseFind set to:5

Such tuning is definitely not correct for a low-performance card with a small number of CUs. You are purposely worsening app usability with such tuning.
For low-performance GPUs the default value is 500; for mid-range and high-end GPUs the default is 50.
So a value of 5 can be a complete no-go for your device.
Did you experience issues with default settings before such tune attempts?
ID: 1773375 · Report as offensive
Rasputin42
Volunteer tester

Joined: 25 Jul 08
Posts: 412
Credit: 5,834,661
RAC: 0
United States
Message 1773418 - Posted: 23 Mar 2016, 10:52:55 UTC - in response to Message 1773375.  

Did you experience issues with default settings before such tune attempts?


Yes, I did. The GPU utilization was very spiky and low.

I discovered it was classified as "low performance", which it is not. This set the period iterations num to 500, which made it unbelievably slow.

I got some performance out of it by setting P.I.N. nearly to zero.

I also tried the most recent driver-- no change.

No driver restarts.
ID: 1773418 · Report as offensive
Marco Franceschini
Volunteer tester
Joined: 4 Jul 01
Posts: 54
Credit: 69,877,354
RAC: 135
Italy
Message 1773450 - Posted: 23 Mar 2016, 13:04:11 UTC - in response to Message 1773418.  
Last modified: 23 Mar 2016, 13:10:09 UTC

Did you experience issues with default settings before such tune attempts?


Yes, I did. The GPU utilization was very spiky and low.

I discovered it was classified as "low performance", which it is not. This set the period iterations num to 500, which made it unbelievably slow.

I got some performance out of it by setting P.I.N. nearly to zero.

I also tried the most recent driver-- no change.

No driver restarts.


My GTX 660M/GT 820M/GT 640/GT 730 are all classified as "low performance" too.
ID: 1773450 · Report as offensive
Rasputin42
Volunteer tester

Joined: 25 Jul 08
Posts: 412
Credit: 5,834,661
RAC: 0
United States
Message 1773456 - Posted: 23 Mar 2016, 13:31:49 UTC

My GTX 660M/GT 820M/GT 640/GT 730 are all classified as "low performance" too.


So what did you do?
ID: 1773456 · Report as offensive
Marco Franceschini
Volunteer tester
Joined: 4 Jul 01
Posts: 54
Credit: 69,877,354
RAC: 135
Italy
Message 1773478 - Posted: 23 Mar 2016, 15:22:07 UTC

So far I'm in the process of recompiling FFTW 3.3 with the Intel C++ Compiler and researching this "issue" (all my GPUs are under 1 TFLOPS in single precision).
ID: 1773478 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 33256
Credit: 79,922,639
RAC: 80
Germany
Message 1773480 - Posted: 23 Mar 2016, 15:26:50 UTC - in response to Message 1773450.  

Did you experience issues with default settings before such tune attempts?


Yes, I did. The GPU utilization was very spiky and low.

I discovered it was classified as "low performance", which it is not. This set the period iterations num to 500, which made it unbelievably slow.

I got some performance out of it by setting P.I.N. nearly to zero.

I also tried the most recent driver-- no change.

No driver restarts.


My GTX 660M/GT 820M/GT 640/GT 730 are all classified as "low performance" too.


Because they are.
With each crime and every kindness we birth our future.
ID: 1773480 · Report as offensive
Joe Januzzi
Volunteer tester
Joined: 13 Apr 03
Posts: 54
Credit: 307,134,110
RAC: 492
United States
Message 1773541 - Posted: 23 Mar 2016, 20:50:10 UTC - in response to Message 1773480.  

FYI and a little help:
I have been using only the “mb_cmdline_win_x86_SSE3_OpenCL_NV.txt” file (v. 3366).
-sbs 192 -instances_per_device 3 -period_iterations_num 20 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 16 -oclfft_tune_cw 16

WU's times are fast (for me).

When I added the “MultiBeam_NV_config.xml” file to the mix, my WU times got slower by 60+ seconds.
My goal was to use the same parameters for both setups, so that there would be minimal speed changes, if any (I think). I also had the same slowdown on version 3401.

So I have 2 questions:
1. Is my “MultiBeam_NV_config.xml” file or setup wrong?
2. Should there be any time difference in using the “MultiBeam_NV_config.xml” file?
Thanks
Joe

WU times before adding the “MultiBeam_NV_config.xml” file are shown on the first line of data; the second line is with the file added.
Note: I could have shown more WUs as samples, but they mostly showed the same time-difference ratio for both versions.

(Elapsed and CPU times in seconds.)
GTX 980 (d0)                        GTX 980 (d2)                        GTX 780 (d3)
AR      Elapsed  CPU  WU#           AR      Elapsed  CPU  WU#           AR      Elapsed  CPU  WU#
0.4224  881      424  2101999550    0.4225  878      463  2101726988    0.4222  1067     605  2102034291
0.4224  967      541  2102376628    0.4225  957      447  2102379836    0.4221  1129     727  2102368697
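
As a quick worked check of the "60+ seconds" figure, the elapsed times in the two data lines above can be subtracted per device. This is just arithmetic on the numbers in the table; the device labels are shorthand for the three column groups.

```python
# (device label, elapsed seconds without config file, elapsed seconds with it),
# copied from the two data lines of the table above.
runs = [
    ("GTX 980 (d0)", 881, 967),
    ("GTX 980 (d2)", 878, 957),
    ("GTX 780 (d3)", 1067, 1129),
]

slowdowns = []
for device, before, after in runs:
    delta = after - before
    percent = 100.0 * delta / before
    slowdowns.append((device, delta, percent))
    print(f"{device}: +{delta} s ({percent:.1f}% slower)")
```

Each pair here is a single WU at nearly the same AR, so the differences are an indication rather than a statistic.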

http://setiathome.berkeley.edu/workunit.php?wuid=2101999550
http://setiathome.berkeley.edu/workunit.php?wuid=2101726988
http://setiathome.berkeley.edu/workunit.php?wuid=2102034291
http://setiathome.berkeley.edu/workunit.php?wuid=2102376628
http://setiathome.berkeley.edu/workunit.php?wuid=2102379836
http://setiathome.berkeley.edu/workunit.php?wuid=2102368697

Note: I ran the file with and without using ";;; GTX xxx" in the file with no speed difference, so I left it in (shown in red).
MultiBeam_NV_config.xml
;;; GTX 980
<device0>
<period_iterations_num>20</period_iterations_num>
<spike_fft_thresh>4096</spike_fft_thresh>
<sbs>192</sbs>
<oclfft_plan>
<size>256</size>
<global_radix>256</global_radix>
<local_radix>16</local_radix>
<workgroup_size>256</workgroup_size>
<max_local_size>512</max_local_size>
<localmem_banks>16</localmem_banks>
<localmem_coalesce_width>16</localmem_coalesce_width>
</oclfft_plan>
</device0>
;;; GTX 780
<device1>
<period_iterations_num>20</period_iterations_num>
<spike_fft_thresh>4096</spike_fft_thresh>
<sbs>192</sbs>
<oclfft_plan>
<size>256</size>
<global_radix>256</global_radix>
<local_radix>16</local_radix>
<workgroup_size>256</workgroup_size>
<max_local_size>512</max_local_size>
<localmem_banks>16</localmem_banks>
<localmem_coalesce_width>16</localmem_coalesce_width>
</oclfft_plan>
</device1>
;;; GTX 980
<device2>
<period_iterations_num>20</period_iterations_num>
<spike_fft_thresh>4096</spike_fft_thresh>
<sbs>192</sbs>
<oclfft_plan>
<size>256</size>
<global_radix>256</global_radix>
<local_radix>16</local_radix>
<workgroup_size>256</workgroup_size>
<max_local_size>512</max_local_size>
<localmem_banks>16</localmem_banks>
<localmem_coalesce_width>16</localmem_coalesce_width>
</oclfft_plan>
</device2>
;;; GTX 960
<device3>
<period_iterations_num>20</period_iterations_num>
<spike_fft_thresh>4096</spike_fft_thresh>
<sbs>192</sbs>
<oclfft_plan>
<size>256</size>
<global_radix>256</global_radix>
<local_radix>16</local_radix>
<workgroup_size>256</workgroup_size>
<max_local_size>512</max_local_size>
<localmem_banks>16</localmem_banks>
<localmem_coalesce_width>16</localmem_coalesce_width>
</oclfft_plan>
</device3>
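
Since the four <deviceN> blocks above look identical, a quick script can confirm that. This is a rough sketch under stated assumptions: `load_device_settings` is a hypothetical helper, lines starting with ";;;" are treated as comments, and a root element is added because the file has none. How the app itself parses the file is not known here. The sample text is a shortened excerpt of the file above.

```python
import xml.etree.ElementTree as ET


def load_device_settings(text):
    """Parse a MultiBeam_NV_config.xml-style text into {device: {param: value}}.

    Assumptions: lines starting with ';;;' are comments, and the
    <deviceN> blocks have no common root element, so one is added here
    to make the text well-formed for ElementTree.
    """
    body = "\n".join(
        line for line in text.splitlines()
        if not line.lstrip().startswith(";;;")
    )
    root = ET.fromstring(f"<config>{body}</config>")
    settings = {}
    for device in root:
        params = {}
        for node in device.iter():
            # Keep only leaf elements such as <sbs> or <global_radix>.
            if node is not device and len(node) == 0:
                params[node.tag] = (node.text or "").strip()
        settings[device.tag] = params
    return settings


# Shortened excerpt of the file above (two devices, a few params each).
sample = """;;; GTX 980
<device0>
<period_iterations_num>20</period_iterations_num>
<sbs>192</sbs>
<oclfft_plan>
<size>256</size>
<global_radix>256</global_radix>
</oclfft_plan>
</device0>
;;; GTX 780
<device1>
<period_iterations_num>20</period_iterations_num>
<sbs>192</sbs>
<oclfft_plan>
<size>256</size>
<global_radix>256</global_radix>
</oclfft_plan>
</device1>
"""
devices = load_device_settings(sample)
all_identical = len({tuple(sorted(p.items())) for p in devices.values()}) == 1
print(all_identical)
```

If every device ends up with the same values, the per-device file adds nothing over a single command line.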


app_info.xml changes in red (only OpenCL_r3366_SoG shown)
<app>
<name>setiathome_v8</name>
</app>
<file_info>
<name>MB8_win_x86_SSE3_OpenCL_NV_r3366_SoG.exe</name>
<executable/>
</file_info>
<file_info>
<name>libfftw3f-3-3-4_x86.dll</name>
<executable/>
</file_info>
<file_ref>
<file_name>MultiBeam_Kernels_r3366.cl</file_name>
</file_ref>
<file_info>
<name>mb_cmdline_win_x86_SSE3_OpenCL_NV.txt</name>
</file_info>
<file_info>
<name>MultiBeam_NV_config.xml</name>
</file_info>

<app_version>
<app_name>setiathome_v8</app_name>
<version_num>800</version_num>
<platform>windows_intelx86</platform>
<avg_ncpus>0.04</avg_ncpus>
<max_ncpus>0.2</max_ncpus>
<plan_class>opencl_nvidia_SoG</plan_class>
<cmdline></cmdline>
<coproc>
<type>CUDA</type>
<count>1</count>
</coproc>
<file_ref>
<file_name>MB8_win_x86_SSE3_OpenCL_NV_r3366_SoG.exe</file_name>
<main_program/>
</file_ref>
<file_ref>
<file_name>libfftw3f-3-3-4_x86.dll</file_name>
</file_ref>
<file_ref>
<file_name>mb_cmdline_win_x86_SSE3_OpenCL_NV.txt</file_name>
<open_name>mb_cmdline.txt</open_name>
</file_ref>
<file_ref>
<file_name>MultiBeam_NV_config.xml</file_name>
</file_ref>

</app_version>

mb_cmdline_win_x86_SSE3_OpenCL_NV.txt
-instances_per_device 3 -tune 1 64 1 4





ID: 1773541 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6324
Credit: 106,370,077
RAC: 121
Russia
Message 1773567 - Posted: 23 Mar 2016, 23:01:31 UTC - in response to Message 1773418.  

Did you experience issues with default settings before such tune attempts?


Yes, I did. The GPU utilization was very spiky and low.

I discovered it was classified as "low performance", which it is not. This set the period iterations num to 500, which made it unbelievably slow.

I got some performance out of it by setting P.I.N. nearly to zero.

I also tried the most recent driver-- no change.

No driver restarts.


Don't you see some contradiction between your posts?
"Spiky and low GPU usage" is not a usability issue. Lags and driver restarts are.
The defaults are chosen to allow operation without driver restarts in unattended mode on most cards. If a user wants to optimize, no problem, he can do that (and any lags that then appear are his own choice). So if r3401 is slower than the previous build with defaults, well, that is because it's a stock RC and should work on most cards available.
The previous revision did not do so on low-performance GPUs, and that required separate tuning for them.
And regarding peak app performance: do you see slower r3401 operation with the same tuning line as for the previous build? How do you compare builds?
ID: 1773567 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6324
Credit: 106,370,077
RAC: 121
Russia
Message 1773569 - Posted: 23 Mar 2016, 23:16:30 UTC - in response to Message 1773541.  
Last modified: 23 Mar 2016, 23:17:47 UTC


When I added the “MultiBeam_NV_config.xml” file to the mix, my WU times got slower by 60+ seconds.
My goal was to use the same parameters for both setups, so that there would be minimal speed changes, if any (I think). I also had the same slowdown on version 3401.

So I have 2 questions:
1. Is my “MultiBeam_NV_config.xml” file or setup wrong?
2. Should there be any time difference in using the “MultiBeam_NV_config.xml” file?

The "config" file is provided for multi-GPU hosts, to supply separate tuning for different devices. If the tuning is the same for all of them, then the "config" file is redundant and the cmdline (or any other available method of supplying command-line params) is enough.

Beyond that, no matter which way you supply a param to the app, the app reacts to the particular param value, not to how the param was passed to it (though the config file takes precedence and overrides common param values).
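
The precedence rule can be pictured as a simple merge. The sketch below only illustrates that rule and is not the app's actual code: `effective_params` and the dictionary layout are hypothetical, and the `device3` override value is made up for the example.

```python
def effective_params(cmdline_params, config_params, device):
    """Merge common command-line params with per-device config overrides.

    cmdline_params: {param: value} applied to every device.
    config_params:  {device_name: {param: value}} per-device values.
    Per-device values win, mirroring the precedence described above.
    """
    merged = dict(cmdline_params)
    merged.update(config_params.get(device, {}))
    return merged


common = {"sbs": "192", "period_iterations_num": "20"}
per_device = {"device3": {"period_iterations_num": "50"}}  # hypothetical override

print(effective_params(common, per_device, "device0"))
print(effective_params(common, per_device, "device3"))
```

A device with no entry in the config simply keeps the common values, which is why an all-identical config changes nothing.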

Regarding performance changes between builds:
1) defaults are changed.
2) work splitting for PulseFind is changed.

If you see a slowdown because of "low performance GPU detected" and know that the GPU can perform OK with a lower -period_iterations_num N value, then set it lower and the app will obey.

Otherwise, try changing the -sbs N setting and the values of the newly added settings:
-pref_wg_size N (was 128 before, changed to 64 with r3401 for ATi, left at 32 for NV)
-pref_wg_num_per_cu N (default is 4; different algorithm used before so no direct comparison with older default).
ID: 1773569 · Report as offensive
Joe Januzzi
Volunteer tester
Joined: 13 Apr 03
Posts: 54
Credit: 307,134,110
RAC: 492
United States
Message 1773595 - Posted: 24 Mar 2016, 1:55:13 UTC - in response to Message 1773569.  

Raistmer,
Thanks again for all your help.
Joe

Regarding performance changes between builds:
1) defaults are changed.
2) work splitting for PulseFind is changed.

I'm guilty of this. Not draining cache before build changes. That could explain a lot. Will drain before going back to r3401.

Otherwise, try changing the -sbs N setting and the values of the newly added settings:
-pref_wg_size N (was 128 before, changed to 64 with r3401 for ATi, left at 32 for NV)
-pref_wg_num_per_cu N (default is 4; different algorithm used before so no direct comparison with older default).

Will make some changes, when I'm back on r3401.

The config file will be the last thing added, once I get r3401 tuned for best performance. I will test whether the config file lets me fine-tune my GTX 960 separately from my other video cards. No speed change = no config file.

ID: 1773595 · Report as offensive
Joe Januzzi
Volunteer tester
Joined: 13 Apr 03
Posts: 54
Credit: 307,134,110
RAC: 492
United States
Message 1774262 - Posted: 26 Mar 2016, 21:57:17 UTC - in response to Message 1773595.  

-pref_wg_num_per_cu N (default is 4; different algorithm used before so no direct comparison with older default).

I was wondering if there is a set range (like 4 to ??)?
So far I have tried 4, 6, 10 and 20. None of these numbers seems to make any difference to speed. It's also harder to track changes for “-pref_wg_num_per_cu N”, because nothing about it appears in the stderr.txt file. I think after “20” I'll go back to the default.

So far this is my best set-up for version 3401_SoG. Still trying different params.
Joe

-sbs 192 -instances_per_device 3 -period_iterations_num 20 -pref_wg_size 32 -pref_wg_num_per_cu 4 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 16 -oclfft_tune_cw 16


http://setiathome.berkeley.edu/workunit.php?wuid=2105819825
Stderr output

<core_client_version>7.4.42</core_client_version>
<![CDATA[
<stderr_txt>
Running on device number: 0
Maximum single buffer size set to:192MB
Number of app instances per device set to:3
Number of period iterations for PulseFind set to:20
Preferred workgroup size set to 32. <--- Is this right?
SpikeFind FFT size threshold override set to:4096
TUNE: kernel 1 now has workgroup size of (64,1,4)
oclFFT global radix override set to:256
oclFFT local radix override set to:16
oclFFT max WG size override set to:256
oclFFT max local FFT size override set to:512
oclFFT number of local memory banks set to:16
oclFFT minimal memory coalesce width set to:16
Priority of worker thread raised successfully
Priority of process adjusted successfully, below normal priority class used
OpenCL platform detected: Intel(R) Corporation
OpenCL platform detected: NVIDIA Corporation
BOINC assigns device 0
Info: BOINC provided OpenCL device ID used

Build features: SETI8 Non-graphics OpenCL USE_OPENCL_NV OCL_ZERO_COPY SIGNALS_ON_GPU OCL_CHIRP3 FFTW USE_SSE3 x86
CPUID: Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz

Cache: L1=64K L2=256K

CPU features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3 SSSE3 SSE4.1 SSE4.2 AVX
OpenCL-kernels filename : MultiBeam_Kernels_r3401.cl
ar=0.427216 NumCfft=195899 NumGauss=1105829004 NumPulse=226440719403 NumTriplet=452848448667
Currently allocated 293 MB for GPU buffers
In v_BaseLineSmooth: NumDataPoints=1048576, BoxCarLength=8192, NumPointsInChunk=32768
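
To check which overrides the app actually reported, the "set to" lines of such a stderr excerpt can be collected with a small script. A rough sketch follows: the regular expression is an assumption based only on the wording of the lines above, and `applied_settings` is a hypothetical helper.

```python
import re

# Matches lines such as "Number of period iterations for PulseFind set to:20"
# and "Preferred workgroup size set to 32."
SET_TO = re.compile(r"^(?P<name>.+?) set to:?\s*(?P<value>\d+)")


def applied_settings(stderr_text):
    """Return {description: int value} for every '... set to ...' line."""
    found = {}
    for line in stderr_text.splitlines():
        match = SET_TO.match(line.strip())
        if match:
            found[match.group("name")] = int(match.group("value"))
    return found


# Lines copied from the stderr output above.
log = """Maximum single buffer size set to:192MB
Number of app instances per device set to:3
Number of period iterations for PulseFind set to:20
Preferred workgroup size set to 32.
SpikeFind FFT size threshold override set to:4096
TUNE: kernel 1 now has workgroup size of (64,1,4)
oclFFT global radix override set to:256"""

settings = applied_settings(log)
for name, value in settings.items():
    print(f"{name}: {value}")
```

Nothing in the excerpt refers to -pref_wg_num_per_cu, which matches the observation that this option leaves no trace in stderr.txt.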

ID: 1774262 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 33256
Credit: 79,922,639
RAC: 80
Germany
Message 1774267 - Posted: 26 Mar 2016, 22:16:29 UTC

Changing -pref_wg_num_per_cu N just makes it slower.
At least on my GPU.
It might be different on NV cards, not sure atm.
As soon as I have more time I will test it on NV as well.
ID: 1774267 · Report as offensive