Vega Frontier Edition - MB Options Tuning

Message boards : Number crunching : Vega Frontier Edition - MB Options Tuning
RueiKe (Project Donor)
Volunteer tester
Joined: 14 Feb 16
Posts: 270
Credit: 104,999,559
RAC: 230,352
Taiwan
Message 1894772 - Posted: 12 Oct 2017, 4:57:47 UTC

I have decided to take some time on my new system to study the effects of MB command line options on compute performance. I will be starting with app revision r3584 and will use a newer long version guppi WU.
WU: blc04_2bit_blc04_guppi_57898_17662_DIAG_KIC8462852_OFF_0020.12892.818.17.26.125.vlar
Initial Command Options:  MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3584.exe -v 1 -instances_per_device 1 -sbs 1024 -period_iterations_num 1 -tt 500 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -hp -high_perf -no_use_sleep -no_defaults_scaling

The results obtained are likely to be relevant only to the Vega FE or similar GPUs. Any suggestions on the approach taken are welcome.
YouTube Channel: Rick's Performance Computing

RueiKe (Project Donor)
Volunteer tester
Joined: 14 Feb 16
Posts: 270
Credit: 104,999,559
RAC: 230,352
Taiwan
Message 1894773 - Posted: 12 Oct 2017, 5:01:54 UTC

Here are my first DOE results exploring the effects of -tt and -sbs. They also include a test of the effect of the -no_use_sleep and -no_defaults_scaling options.
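
For anyone who wants to set up a similar grid, here is a minimal Python sketch of a full-factorial DOE over -sbs and -tt. The fixed options are copied from the initial command line in the first post (with -no_use_sleep and -no_defaults_scaling left out, since they are treated as separate factors). The level values and the helper names `full_factorial` and `command_line` are my own assumptions for illustration, not the ones actually used in these runs.

```python
from itertools import product

# Options held constant, copied from the initial command line in the first post.
BASE = (
    "MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3584.exe -v 1 -instances_per_device 1 "
    "-period_iterations_num 1 -spike_fft_thresh 4096 -tune 1 64 1 4 "
    "-oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 "
    "-oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -hp -high_perf"
)


def full_factorial(levels):
    """Yield one dict per run, covering every combination of factor levels."""
    names = list(levels)
    for combo in product(*(levels[name] for name in names)):
        yield dict(zip(names, combo))


def command_line(run):
    """Append the factor settings of one run to the fixed base options."""
    return BASE + "".join(f" -{name} {value}" for name, value in run.items())


# Hypothetical levels, for illustration only.
levels = {"sbs": [512, 1024, 2048], "tt": [250, 500, 1000]}
runs = list(full_factorial(levels))

for run in runs:
    print(command_line(run))
```

Each printed line can then be pasted into the benchmark configuration as one cell of the DOE.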

YouTube Channel: Rick's Performance Computing

RueiKe (Project Donor)
Volunteer tester
Joined: 14 Feb 16
Posts: 270
Credit: 104,999,559
RAC: 230,352
Taiwan
Message 1894774 - Posted: 12 Oct 2017, 5:06:51 UTC

Here is a sensitivity analysis of the FFT-related parameters.

YouTube Channel: Rick's Performance Computing

Mike (Project Donor)
Volunteer tester
Joined: 17 Feb 01
Posts: 30617
Credit: 57,715,808
RAC: 30,059
Germany
Message 1894800 - Posted: 12 Oct 2017, 9:41:38 UTC

You need to check your results better, Rick.

TUNE:incorrect tune params: 1 (128,1,4)

That would be a work group size of 512; your GPU only supports 256.
One needs to understand the params first to make such tests.

oclfft_tune_bn 128 would mean 128 memory banks whilst the GPU only has 64.
With each crime and every kindness we birth our future.

RueiKe (Project Donor)
Volunteer tester
Joined: 14 Feb 16
Posts: 270
Credit: 104,999,559
RAC: 230,352
Taiwan
Message 1894805 - Posted: 12 Oct 2017, 11:17:12 UTC - in response to Message 1894800.  

Thanks for the feedback. This one was only a sensitivity analysis, using + and - for each nominal. It was what I could do without understanding the parameters. I think it is useful in that it does show that -tune is not optimal. Ideally, a DOE design with full understanding would be great. If you could help me understand how to better plan the DOE or where I could get additional details, I am willing to dedicate system time to fully explore all parameters. I would especially like to explore -tune.
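
To make the approach concrete, here is a minimal Python sketch of a one-factor-at-a-time sensitivity plan around the nominal FFT options. The nominal values come from the command line in the first post. Two things are my assumptions: that "+" and "-" mean doubling and halving, and the helper name `one_factor_at_a_time`. The only limit in the table is the one stated in the reply above (64 memory banks); limits for the other options are not known here.

```python
# Nominal values taken from the command line in the first post.
NOMINAL = {
    "oclfft_tune_gr": 256,
    "oclfft_tune_lr": 16,
    "oclfft_tune_wg": 256,
    "oclfft_tune_ls": 512,
    "oclfft_tune_bn": 64,
    "oclfft_tune_cw": 64,
}

# Upper limits mentioned in this thread (the GPU has 64 memory banks).
MAX_ALLOWED = {"oclfft_tune_bn": 64}


def one_factor_at_a_time(nominal, limits):
    """Return the baseline run plus one run per factor per direction."""
    runs = [dict(nominal)]  # baseline first
    for name, value in nominal.items():
        # Assumption: "-" halves the nominal value and "+" doubles it.
        for candidate in (value // 2, value * 2):
            if candidate > limits.get(name, candidate):
                continue  # skip values above a known hardware limit
            run = dict(nominal)
            run[name] = candidate
            runs.append(run)
    return runs


runs = one_factor_at_a_time(NOMINAL, MAX_ALLOWED)
for run in runs:
    print(" ".join(f"-{name} {value}" for name, value in run.items()))
```

Filtering against known limits before running avoids spending hours on cells that the app rejects as incorrect parameters.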

You need to check your results better, Rick.

TUNE:incorrect tune params: 1 (128,1,4)

That would be a work group size of 512; your GPU only supports 256.
One needs to understand the params first to make such tests.

oclfft_tune_bn 128 would mean 128 memory banks whilst the GPU only has 64.

YouTube Channel: Rick's Performance Computing

Mike (Project Donor)
Volunteer tester
Joined: 17 Feb 01
Posts: 30617
Credit: 57,715,808
RAC: 30,059
Germany
Message 1894817 - Posted: 12 Oct 2017, 12:08:15 UTC
Last modified: 12 Oct 2017, 12:19:58 UTC

There is no such thing as optimal.
I provide the best settings in the readme, which of course can be optimized further, especially with kernel target time tuning.
But you need to understand that each GPU/host combination reacts a little differently to tuning params.
Also, you will get different results with each different task you test.
Just test the same task on different days and you will get slightly different results.
It took me months to understand oclfft tuning; you can just try to fine-tune some params.
Each test will be different with a different type of task, i.e. guppi or Arecibo, and angle range.

The tune switch is rather easy.

-tune 1 64 1 4 means kernel 1 will be split into chunks of 64 x 1 x 4 = WG size 256.

So a large number of combinations is possible:

1 1 16
1 2 16
1 3 16
1 4 16
.......... 128 2 1 or whatever; it must not be bigger than 256 for AMD.

Maybe your card acts better in big chunks.

Like 1 128 1 2 or 1 1 2 128 or similar.

Needless to say, I have tested them all and already provide the best config.
You won't find anything new in this case.
Also, I have to admit that your times are very good already; you are just still using the wrong -tt value.
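
The work group rule above can be sketched in a few lines of Python. This is only an illustration: the 256 limit for AMD is taken from this post, while restricting candidates to power-of-two triples whose product is exactly 256, and the helper names `is_valid_tune` and `tune_candidates`, are assumptions made here to keep the list short.

```python
MAX_WG_SIZE = 256  # work group size limit for AMD quoted above


def wg_size(x, y, z):
    """Work group size implied by '-tune <kernel> x y z'."""
    return x * y * z


def is_valid_tune(x, y, z, max_wg=MAX_WG_SIZE):
    """True if the chunk dimensions fit within the work group limit."""
    return wg_size(x, y, z) <= max_wg


def tune_candidates(target=MAX_WG_SIZE):
    """All power-of-two (x, y, z) triples whose product equals target."""
    powers = [1 << i for i in range(target.bit_length())]
    return [
        (x, y, z)
        for x in powers
        for y in powers
        for z in powers
        if x * y * z == target
    ]


print(is_valid_tune(64, 1, 4))   # 64 x 1 x 4 = 256, within the limit
print(is_valid_tune(128, 1, 4))  # 128 x 1 x 4 = 512, over the limit

for x, y, z in tune_candidates():
    print(f"-tune 1 {x} {y} {z}")
```

The second check reproduces the rejected setting from the earlier sensitivity run, where 128 x 1 x 4 gives a work group size of 512.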
With each crime and every kindness we birth our future.

Brent Norman
Volunteer tester
Joined: 1 Dec 99
Posts: 1824
Credit: 108,261,061
RAC: 457,612
Canada
Message 1894831 - Posted: 12 Oct 2017, 12:55:31 UTC - in response to Message 1894817.  

Also, I have to admit that your times are very good already
I had to go take a look at my times. My 1060s and 980 are in that time frame, but they are 'sauced' up. I know it's not apples to apples, but my cards are a whole lot cheaper, and so is the OS :D

Shaggie76 (Project Donor)
Joined: 9 Oct 09
Posts: 243
Credit: 86,989,423
RAC: 232,363
Canada
Message 1894832 - Posted: 12 Oct 2017, 13:02:38 UTC

Fascinating analysis! Thank you!

RueiKe (Project Donor)
Volunteer tester
Joined: 14 Feb 16
Posts: 270
Credit: 104,999,559
RAC: 230,352
Taiwan
Message 1894837 - Posted: 12 Oct 2017, 13:23:33 UTC - in response to Message 1894831.  

Also, I have to admit that your times are very good already
I had to go take a look at my times. My 1060s and 980 are in that time frame, but they are 'sauced' up. I know it's not apples to apples, but my cards are a whole lot cheaper, and so is the OS :D


Yep, this build is certainly not cost effective for SETI. It is actually my main workstation and just does SETI/LHC part time.
YouTube Channel: Rick's Performance Computing

RueiKe (Project Donor)
Volunteer tester
Joined: 14 Feb 16
Posts: 270
Credit: 104,999,559
RAC: 230,352
Taiwan
Message 1894838 - Posted: 12 Oct 2017, 13:25:40 UTC - in response to Message 1894832.  

Fascinating analysis! Thank you!

Thanks for the feedback! Still lots of work to do. Each DOE takes many hours to run. I am running a 3 parameter interaction DOE now and it is looking interesting. I should be able to post those results in my morning.
YouTube Channel: Rick's Performance Computing

RueiKe (Project Donor)
Volunteer tester
Joined: 14 Feb 16
Posts: 270
Credit: 104,999,559
RAC: 230,352
Taiwan
Message 1894952 - Posted: 12 Oct 2017, 23:15:35 UTC

Here are the results for period_iterations_num vs pref_wg_size vs pref_wg_num_per_cu.


YouTube Channel: Rick's Performance Computing

RueiKe (Project Donor)
Volunteer tester
Joined: 14 Feb 16
Posts: 270
Credit: 104,999,559
RAC: 230,352
Taiwan
Message 1895028 - Posted: 13 Oct 2017, 13:55:50 UTC

Here is my first look at the "-tune" parameter. I need some help understanding what is going on, but I suspect that kernel=1 is the only kernel whose tuning has any effect. The runs that tune other kernels leave kernel=1 unspecified, and they probably come out best because the current tune parameters are not optimized for my GPU. I plan a follow-up DOE focusing on more cells with higher Mz values for kernel=1. Let me know of any recommendations.


YouTube Channel: Rick's Performance Computing

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 5824
Credit: 76,288,795
RAC: 54,939
Russia
Message 1895052 - Posted: 13 Oct 2017, 16:02:56 UTC - in response to Message 1895028.  

As I recall, a few -tune lines can be provided, one for each kernel. But I am not sure more than one is implemented for MB.
SETI apps news
We're not gonna fight them. We're gonna transcend them.

RueiKe (Project Donor)
Volunteer tester
Joined: 14 Feb 16
Posts: 270
Credit: 104,999,559
RAC: 230,352
Taiwan
Message 1895182 - Posted: 14 Oct 2017, 0:37:58 UTC - in response to Message 1895052.  

As I recall, a few -tune lines can be provided, one for each kernel. But I am not sure more than one is implemented for MB.


Thanks for the feedback. It looks like the application is accepting multiple kernel settings, but with no effect for those above 1.
YouTube Channel: Rick's Performance Computing

RueiKe (Project Donor)
Volunteer tester
Joined: 14 Feb 16
Posts: 270
Credit: 104,999,559
RAC: 230,352
Taiwan
Message 1895183 - Posted: 14 Oct 2017, 0:42:40 UTC

Here is my next level look into "-tune" parameters. I focused on large Mz values based on the results of the first DOE. Here is a sample of the command line arguments used in the BenchCfg file:
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3584.exe -v 1 -instances_per_device 1 -sbs 2048 -period_iterations_num 1 -tt 500 -spike_fft_thresh 4096 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -hp -high_perf -no_defaults_scaling -pref_wg_size 256 -pref_wg_num_per_cu 4 -tune 1 1 1 256



YouTube Channel: Rick's Performance Computing

RueiKe (Project Donor)
Volunteer tester
Joined: 14 Feb 16
Posts: 270
Credit: 104,999,559
RAC: 230,352
Taiwan
Message 1895404 - Posted: 15 Oct 2017, 0:06:21 UTC

Here are the results of a verification run on optimization results so far using 3 conditions:

Original Command Line Options:
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3584.exe -v 1 -instances_per_device 1 -sbs 1024 -period_iterations_num 1 -tt 500 -no_defaults_scaling -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -hp -high_perf -no_use_sleep
Optimized Command Line Options:
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3584.exe -v 1 -instances_per_device 1 -sbs 2048 -period_iterations_num 1 -tt 500 -spike_fft_thresh 4096 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -hp -high_perf -no_defaults_scaling -pref_wg_size 256 -pref_wg_num_per_cu 4 -tune 1 4 1 64
Optimized without -tune Command Line Options:
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3584.exe -v 1 -instances_per_device 1 -sbs 2048 -period_iterations_num 1 -tt 500 -spike_fft_thresh 4096 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -hp -high_perf -pref_wg_size 256 -pref_wg_num_per_cu 4




Seems like the optimization on a guppi unit resulted in a degradation for Arecibo WUs. I will redo the optimization DOEs using both types of WUs. Let me know of any other recommendations.
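
Since the command lines are long, it helps to see exactly which options changed between two conditions. Here is a minimal Python sketch that parses two of the command lines above and lists the differences. The helper names `parse_options` and `diff_options` are hypothetical, and the parser assumes every token starting with "-" is an option name, which holds here because all values are positive numbers.

```python
ORIGINAL = (
    "MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3584.exe -v 1 -instances_per_device 1 "
    "-sbs 1024 -period_iterations_num 1 -tt 500 -no_defaults_scaling "
    "-spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 "
    "-oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 "
    "-oclfft_tune_bn 64 -oclfft_tune_cw 64 -hp -high_perf -no_use_sleep"
)
OPTIMIZED = (
    "MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3584.exe -v 1 -instances_per_device 1 "
    "-sbs 2048 -period_iterations_num 1 -tt 500 -spike_fft_thresh 4096 "
    "-oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 "
    "-oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -hp -high_perf "
    "-no_defaults_scaling -pref_wg_size 256 -pref_wg_num_per_cu 4 -tune 1 4 1 64"
)


def parse_options(command_line):
    """Map each option name to its values joined as one string ('' for flags)."""
    tokens = command_line.split()[1:]  # drop the executable name
    options = {}
    current = None
    for token in tokens:
        if token.startswith("-"):
            current = token
            options[current] = []
        elif current is not None:
            options[current].append(token)
    return {name: " ".join(values) for name, values in options.items()}


def diff_options(a, b):
    """Return {option: (value_in_a, value_in_b)}; None means the option is absent."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


changes = diff_options(parse_options(ORIGINAL), parse_options(OPTIMIZED))
for name, (before, after) in changes.items():
    print(f"{name}: {before!r} -> {after!r}")
```

The same two helpers can be pointed at the "without -tune" command line to confirm which options separate it from the optimized one.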
YouTube Channel: Rick's Performance Computing

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 5824
Credit: 76,288,795
RAC: 54,939
Russia
Message 1895473 - Posted: 15 Oct 2017, 7:30:12 UTC - in response to Message 1895404.  


Seems like the optimization on a guppi unit resulted in a degradation for Arecibo WUs. I will redo the optimization DOEs using both types of WUs. Let me know of any other recommendations.


As the processing chain depends on AR (angle range), it's recommended to use the PG* set of tasks for benchmarking.
Maybe with the additional inclusion of a GUPPI VLAR.
SETI apps news
We're not gonna fight them. We're gonna transcend them.

RueiKe (Project Donor)
Volunteer tester
Joined: 14 Feb 16
Posts: 270
Credit: 104,999,559
RAC: 230,352
Taiwan
Message 1895482 - Posted: 15 Oct 2017, 8:53:46 UTC - in response to Message 1895473.  


Seems like the optimization on a guppi unit resulted in a degradation for Arecibo WUs. I will redo the optimization DOEs using both types of WUs. Let me know of any other recommendations.


As the processing chain depends on AR (angle range), it's recommended to use the PG* set of tasks for benchmarking.
Maybe with the additional inclusion of a GUPPI VLAR.


Thanks for the recommendation. Is that the set of 4 WUs beginning with PG that was downloaded with MB_Bench? I will give it a try after my current effort. Currently, I am using the most degraded Arecibo WU with the most improved guppi to find the best condition.

After this, I plan to analyze the SOG version of r3584. Does SOG change the strategy for optimization?
YouTube Channel: Rick's Performance Computing

RueiKe (Project Donor)
Volunteer tester
Joined: 14 Feb 16
Posts: 270
Credit: 104,999,559
RAC: 230,352
Taiwan
Message 1895483 - Posted: 15 Oct 2017, 9:01:55 UTC

Here is the first combined Arecibo/Green Bank DOE. SBS vs TT shows no significant difference in optimal conditions.


YouTube Channel: Rick's Performance Computing

Mike (Project Donor)
Volunteer tester
Joined: 17 Feb 01
Posts: 30617
Credit: 57,715,808
RAC: 30,059
Germany
Message 1895486 - Posted: 15 Oct 2017, 10:12:39 UTC
Last modified: 15 Oct 2017, 10:13:32 UTC

After this, I plan to analyze the SOG version of r3584. Does SOG change the strategy for optimization?


No. In my tests SoG was always slower on AMD GPUs, but you have a much faster GPU, so it is worth a try.
With each crime and every kindness we birth our future.