Vega Frontier Edition - MB Options Tuning

RueiKe Volunteer tester Send message Joined: 14 Feb 16 Posts: 492 Credit: 378,512,430 RAC: 785	Message 1894772 - Posted: 12 Oct 2017, 4:57:47 UTC I have decided to take some time on my new system to study the effects of MB command line options on compute performance. I will be starting with app revision r3584 and will use a newer long version guppi WU.
WU: blc04_2bit_blc04_guppi_57898_17662_DIAG_KIC8462852_OFF_0020.12892.818.17.26.125.vlar
Initial Command Options:
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3584.exe -v 1 -instances_per_device 1 -sbs 1024 -period_iterations_num 1 -tt 500 -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -hp -high_perf -no_use_sleep -no_defaults_scaling
The results obtained are likely to be relevant for only the VegaFE or similar GPU. Any suggestions on the approach taken are welcome. GitHub: Ricks-Lab Instagram: ricks_labs ID: 1894772 ·

RueiKe Volunteer tester Send message Joined: 14 Feb 16 Posts: 492 Credit: 378,512,430 RAC: 785	Message 1894773 - Posted: 12 Oct 2017, 5:01:54 UTC Here are my first DOE results exploring the effects of tt and sbs. It also includes a test of the effect of -no_use_sleep and -no_defaults_scaling options. GitHub: Ricks-Lab Instagram: ricks_labs ID: 1894773 ·
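For anyone who wants to set up a similar two-factor sweep, here is a minimal Python sketch of a full-factorial grid over -sbs and -tt. The factor levels are placeholders chosen for illustration (the levels actually used in the DOE are not stated in the post), and the fixed options are a subset copied from the initial command line. How the generated lines are fed to the benchmark is left out.

```python
from itertools import product

APP = "MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3584.exe"
# Options held constant during the sweep (subset of the initial command line).
FIXED = ("-v 1 -instances_per_device 1 -period_iterations_num 1 "
         "-spike_fft_thresh 4096 -hp -high_perf")

# Hypothetical factor levels, for illustration only.
SBS_LEVELS = [512, 1024, 2048]
TT_LEVELS = [250, 500, 1000]

def factorial_runs(sbs_levels, tt_levels):
    """Return one command line per (sbs, tt) combination."""
    return [f"{APP} {FIXED} -sbs {sbs} -tt {tt}"
            for sbs, tt in product(sbs_levels, tt_levels)]

runs = factorial_runs(SBS_LEVELS, TT_LEVELS)
for line in runs:
    print(line)
```

With three levels per factor this yields nine runs; adding a third factor multiplies the run count again, which is why each DOE takes hours.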

RueiKe Volunteer tester Send message Joined: 14 Feb 16 Posts: 492 Credit: 378,512,430 RAC: 785	Message 1894774 - Posted: 12 Oct 2017, 5:06:51 UTC Here is a sensitivity analysis of FFT-related parameters. GitHub: Ricks-Lab Instagram: ricks_labs ID: 1894774 ·

Mike Volunteer tester Send message Joined: 17 Feb 01 Posts: 34258 Credit: 79,922,639 RAC: 80	Message 1894800 - Posted: 12 Oct 2017, 9:41:38 UTC You need to check your results better, Rick. TUNE:incorrect tune params: 1 (128,1,4) That would be a work group size of 512; your GPU only has 256. One needs to understand the params first to make such tests. oclfft_tune_bn 128 would mean 128 memory banks, whilst the GPU only has 64. With each crime and every kindness we birth our future. ID: 1894800 ·

RueiKe Volunteer tester Send message Joined: 14 Feb 16 Posts: 492 Credit: 378,512,430 RAC: 785	Message 1894805 - Posted: 12 Oct 2017, 11:17:12 UTC - in response to Message 1894800. You need to check your results better, Rick. TUNE:incorrect tune params: 1 (128,1,4) That would be a work group size of 512; your GPU only has 256. One needs to understand the params first to make such tests. oclfft_tune_bn 128 would mean 128 memory banks, whilst the GPU only has 64. Thanks for the feedback. This one was only a sensitivity analysis, using + and - for each nominal. It was what I could do without understanding the parameters. I think it is useful in that it does show that -tune is not optimal. Ideally, a DOE design with full understanding would be great. If you could help me understand how to better plan the DOE or where I could get additional details, I am willing to dedicate system time to fully explore all parameters. I would especially like to explore -tune. GitHub: Ricks-Lab Instagram: ricks_labs ID: 1894805 ·
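The "+ and - for each nominal" approach is a one-factor-at-a-time design, which could be sketched in Python roughly as follows. The nominal values are the -oclfft_tune_* settings from the initial command line. The step size (halving and doubling) is an assumption: the post does not state it, although doubling 64 would give the 128 values flagged above.

```python
# Nominal values from the initial command line in this thread.
NOMINAL = {
    "-oclfft_tune_gr": 256,
    "-oclfft_tune_lr": 16,
    "-oclfft_tune_wg": 256,
    "-oclfft_tune_ls": 512,
    "-oclfft_tune_bn": 64,
    "-oclfft_tune_cw": 64,
}

def sensitivity_runs(nominal, factor=2):
    """One-factor-at-a-time design: set each option to nominal/factor and
    nominal*factor while every other option stays at its nominal value.
    The factor of 2 is an assumed step, not taken from the thread."""
    runs = []
    for name, value in nominal.items():
        for level in (value // factor, value * factor):
            settings = dict(nominal)
            settings[name] = level
            runs.append((name, level, settings))
    return runs

def to_args(settings):
    """Render a settings dict as a command line fragment."""
    return " ".join(f"{k} {v}" for k, v in settings.items())

for name, level, settings in sensitivity_runs(NOMINAL):
    print(f"{name}={level}: {to_args(settings)}")
```

A design like this has no knowledge of hardware limits, so the high level of some options can land outside what the GPU supports, as with -oclfft_tune_bn 128.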

Mike Volunteer tester Send message Joined: 17 Feb 01 Posts: 34258 Credit: 79,922,639 RAC: 80	Message 1894817 - Posted: 12 Oct 2017, 12:08:15 UTC Last modified: 12 Oct 2017, 12:19:58 UTC There is no such thing as optimal. I provide the best settings in the readme, which of course can be optimized, especially with kernel target time tuning. But you need to understand that each GPU/host combination reacts a little differently to tuning params. Also, you will get different results with each different task you are testing. Just test the same task on different days and you will get slightly different results. It took me months to understand oclfft tuning; you can just try to fine-tune some params. Each test will be different with a different type of task, i.e. guppi or Arecibo, and angle range. The tune switch is rather easy. -tune 1 64 1 4 means kernel 1 will be split into chunks of 64 x 1 x 4 = WG size 256. So a big number of combinations is possible: 1 1 16, 1 2 16, 1 3 16, 1 4 16, .......... 128 2 1, or whatever; it must not be bigger than 256 for AMD. Maybe your card acts better with big chunks, like 1 128 1 2 or 1 1 2 128 or similar. Needless to say, I have tested them all and provide the best config already. You won't find anything new in this case. Also, I have to admit that your times are very good already; you just still use the wrong -tt value. With each crime and every kindness we birth our future. ID: 1894817 ·
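The rule Mike describes (the three numbers after the kernel index multiply to the work group size, which must not exceed 256 on AMD) can be checked before a run with a small Python sketch. The helper names are made up for illustration, and restricting the candidate list to power-of-two dimensions is an assumption made to keep the list short, not something the app requires.

```python
from itertools import product

MAX_WG_SIZE = 256  # limit quoted for AMD GPUs in this thread

def wg_size(x, y, z):
    """Work group size implied by '-tune <kernel> x y z'."""
    return x * y * z

def is_valid_tune(x, y, z, limit=MAX_WG_SIZE):
    """True if the split fits within the work group size limit."""
    return wg_size(x, y, z) <= limit

def full_size_candidates(limit=MAX_WG_SIZE):
    """All power-of-two (x, y, z) splits whose product equals the limit.
    Power-of-two dimensions are an assumption made for brevity."""
    powers = [2 ** i for i in range(limit.bit_length())]  # 1, 2, ..., 256
    return [dims for dims in product(powers, repeat=3)
            if wg_size(*dims) == limit]

print(wg_size(64, 1, 4), is_valid_tune(64, 1, 4))    # 256 True
print(wg_size(128, 1, 4), is_valid_tune(128, 1, 4))  # 512 False
for x, y, z in full_size_candidates():
    print(f"-tune 1 {x} {y} {z}")
```

Filtering a DOE through a check like is_valid_tune would have rejected the 128 1 4 case that produced the "incorrect tune params" message.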

Brent Norman Volunteer tester Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835	Message 1894831 - Posted: 12 Oct 2017, 12:55:31 UTC - in response to Message 1894817. Also, I have to admit that your times are very good already I had to go take a look at my times. My 1060s and 980 are in that time frame, but they are 'sauced' up. I know it's not apples to apples, but my cards are a whole lot cheaper, and so is the OS :D ID: 1894831 ·

Shaggie76 Send message Joined: 9 Oct 09 Posts: 282 Credit: 271,858,118 RAC: 196	Message 1894832 - Posted: 12 Oct 2017, 13:02:38 UTC Fascinating analysis! Thank you! ID: 1894832 ·

RueiKe Volunteer tester Send message Joined: 14 Feb 16 Posts: 492 Credit: 378,512,430 RAC: 785	Message 1894837 - Posted: 12 Oct 2017, 13:23:33 UTC - in response to Message 1894831. Also, I have to admit that your times are very good already I had to go take a look at my times. My 1060s and 980 are in that time frame, but they are 'sauced' up. I know it's not apples to apples, but my cards are a whole lot cheaper, and so is the OS :D Yep, this build is certainly not cost-effective for SETI. It is actually my main workstation and just does SETI/LHC part time. GitHub: Ricks-Lab Instagram: ricks_labs ID: 1894837 ·

RueiKe Volunteer tester Send message Joined: 14 Feb 16 Posts: 492 Credit: 378,512,430 RAC: 785	Message 1894838 - Posted: 12 Oct 2017, 13:25:40 UTC - in response to Message 1894832. Fascinating analysis! Thank you! Thanks for the feedback! Still lots of work to do. Each DOE takes many hours to run. I am running a 3 parameter interaction DOE now and it is looking interesting. I should be able to post those results in my morning. GitHub: Ricks-Lab Instagram: ricks_labs ID: 1894838 ·

RueiKe Volunteer tester Send message Joined: 14 Feb 16 Posts: 492 Credit: 378,512,430 RAC: 785	Message 1894952 - Posted: 12 Oct 2017, 23:15:35 UTC Here are the results for period_iterations_num vs pref_wg_size vs pref_wg_size GitHub: Ricks-Lab Instagram: ricks_labs ID: 1894952 ·

RueiKe Volunteer tester Send message Joined: 14 Feb 16 Posts: 492 Credit: 378,512,430 RAC: 785	Message 1895028 - Posted: 13 Oct 2017, 13:55:50 UTC Here is my first look at the "-tune" parameter. I need some help understanding what is going on, but what I suspect is that tuning of kernel=1 is the only one that has any effect. The runs tuning other parameters leave kernel=1 unspecified, and are probably best because the current tune parameters are not optimized for my GPU. I plan a follow-up DOE focusing on more cells with higher Mz values for kernel=1. Let me know of any recommendations. GitHub: Ricks-Lab Instagram: ricks_labs ID: 1895028 ·

Raistmer Volunteer developer Volunteer tester Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1895052 - Posted: 13 Oct 2017, 16:02:56 UTC - in response to Message 1895028. As I recall, a few -tune lines can be provided, one for each kernel. But I'm not sure more than 1 is implemented for MB. SETI apps news We're not gonna fight them. We're gonna transcend them. ID: 1895052 ·

RueiKe Volunteer tester Send message Joined: 14 Feb 16 Posts: 492 Credit: 378,512,430 RAC: 785	Message 1895182 - Posted: 14 Oct 2017, 0:37:58 UTC - in response to Message 1895052. As I recall, a few -tune lines can be provided, one for each kernel. But I'm not sure more than 1 is implemented for MB. Thanks for the feedback. It looks like the application is accepting multiple kernel settings, but with no effect for those above 1. GitHub: Ricks-Lab Instagram: ricks_labs ID: 1895182 ·

RueiKe Volunteer tester Send message Joined: 14 Feb 16 Posts: 492 Credit: 378,512,430 RAC: 785	Message 1895183 - Posted: 14 Oct 2017, 0:42:40 UTC Here is my next level look into "-tune" parameters. I focused on large Mz values based on the results of the first DOE. Here is a sample of the command line arguments used in the BenchCfg file:
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3584.exe -v 1 -instances_per_device 1 -sbs 2048 -period_iterations_num 1 -tt 500 -spike_fft_thresh 4096 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -hp -high_perf -no_defaults_scaling -pref_wg_size 256 -pref_wg_num_per_cu 4 -tune 1 1 1 256
GitHub: Ricks-Lab Instagram: ricks_labs ID: 1895183 ·

RueiKe Volunteer tester Send message Joined: 14 Feb 16 Posts: 492 Credit: 378,512,430 RAC: 785	Message 1895404 - Posted: 15 Oct 2017, 0:06:21 UTC Here are the results of a verification run on optimization results so far using 3 conditions:
Original Command Line Options:
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3584.exe -v 1 -instances_per_device 1 -sbs 1024 -period_iterations_num 1 -tt 500 -no_defaults_scaling -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -hp -high_perf -no_use_sleep
Optimized Command Line Options:
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3584.exe -v 1 -instances_per_device 1 -sbs 2048 -period_iterations_num 1 -tt 500 -spike_fft_thresh 4096 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -hp -high_perf -no_defaults_scaling -pref_wg_size 256 -pref_wg_num_per_cu 4 -tune 1 4 1 64
Optimized without -tune Command Line Options:
MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3584.exe -v 1 -instances_per_device 1 -sbs 2048 -period_iterations_num 1 -tt 500 -spike_fft_thresh 4096 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -hp -high_perf -pref_wg_size 256 -pref_wg_num_per_cu 4
Seems like the optimization on a guppi unit resulted in a degradation for Arecibo WUs. I will redo the optimization DOEs using both types of WUs. Let me know of any other recommendations. GitHub: Ricks-Lab Instagram: ricks_labs ID: 1895404 ·
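Since the three command lines above differ in only a few options, a short Python sketch can make the differences explicit. The parser is a simplification written for this post, not part of the app or MB_Bench: it assumes every token starting with "-" is an option name and every other token is a value, which holds here because no value is negative.

```python
APP = "MB8_win_x86_SSE2_OpenCL_ATi_HD5_r3584.exe"
FFT = ("-oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 "
       "-oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64")

original = (f"{APP} -v 1 -instances_per_device 1 -sbs 1024 "
            f"-period_iterations_num 1 -tt 500 -no_defaults_scaling "
            f"-spike_fft_thresh 4096 -tune 1 64 1 4 {FFT} "
            f"-hp -high_perf -no_use_sleep")
optimized = (f"{APP} -v 1 -instances_per_device 1 -sbs 2048 "
             f"-period_iterations_num 1 -tt 500 -spike_fft_thresh 4096 {FFT} "
             f"-hp -high_perf -no_defaults_scaling -pref_wg_size 256 "
             f"-pref_wg_num_per_cu 4 -tune 1 4 1 64")
no_tune = (f"{APP} -v 1 -instances_per_device 1 -sbs 2048 "
           f"-period_iterations_num 1 -tt 500 -spike_fft_thresh 4096 {FFT} "
           f"-hp -high_perf -pref_wg_size 256 -pref_wg_num_per_cu 4")

def parse_options(cmdline):
    """Map each '-option' to the list of values that follow it.
    The first token (the executable name) is skipped."""
    options = {}
    current = None
    for token in cmdline.split()[1:]:
        if token.startswith("-"):
            current = token
            options[current] = []
        elif current is not None:
            options[current].append(token)
    return options

def diff_options(a, b):
    """Return {option: (values_in_a, values_in_b)} for options that differ.
    None means the option is absent from that command line."""
    pa, pb = parse_options(a), parse_options(b)
    return {k: (pa.get(k), pb.get(k))
            for k in sorted(set(pa) | set(pb)) if pa.get(k) != pb.get(k)}

for label, a, b in [("original vs optimized", original, optimized),
                    ("optimized vs no -tune", optimized, no_tune)]:
    print(label)
    for option, (va, vb) in diff_options(a, b).items():
        print(f"  {option}: {va} -> {vb}")
```

Run on the lines as posted, the second comparison shows that the "without -tune" condition also lacks -no_defaults_scaling, so that condition changes two options rather than one.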

Raistmer Volunteer developer Volunteer tester Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1895473 - Posted: 15 Oct 2017, 7:30:12 UTC - in response to Message 1895404. Seems like the optimization on a guppi unit resulted in a degradation for Arecibo WUs. I will redo the optimization DOEs using both types of WUs. Let me know of any other recommendations. As the processing chain depends on AR (angle range), it's recommended to use the PG* set of tasks for benchmarking. Maybe with the additional inclusion of GUPPI VLAR. SETI apps news We're not gonna fight them. We're gonna transcend them. ID: 1895473 ·

RueiKe Volunteer tester Send message Joined: 14 Feb 16 Posts: 492 Credit: 378,512,430 RAC: 785	Message 1895482 - Posted: 15 Oct 2017, 8:53:46 UTC - in response to Message 1895473. Seems like the optimization on a guppi unit resulted in a degradation for Arecibo WUs. I will redo the optimization DOEs using both types of WUs. Let me know of any other recommendations. As the processing chain depends on AR (angle range), it's recommended to use the PG* set of tasks for benchmarking. Maybe with the additional inclusion of GUPPI VLAR. Thanks for the recommendation. Is that the set of 4 WUs beginning with PG that was downloaded with MB_Bench? I will give it a try after my current effort. Currently, I am using the most degraded Arecibo WU with the most improved guppi to find the best condition. After this, I plan to analyze the SOG version of r3584. Does SOG change the strategy for optimization? GitHub: Ricks-Lab Instagram: ricks_labs ID: 1895482 ·

RueiKe Volunteer tester Send message Joined: 14 Feb 16 Posts: 492 Credit: 378,512,430 RAC: 785	Message 1895483 - Posted: 15 Oct 2017, 9:01:55 UTC Here is the first Arecibo/Green Bank combined DOE. SBS vs TT shows no significant difference in optimal conditions. GitHub: Ricks-Lab Instagram: ricks_labs ID: 1895483 ·

Mike Volunteer tester Send message Joined: 17 Feb 01 Posts: 34258 Credit: 79,922,639 RAC: 80	Message 1895486 - Posted: 15 Oct 2017, 10:12:39 UTC Last modified: 15 Oct 2017, 10:13:32 UTC After this, I plan to analyze the SOG version of r3584. Does SOG change the strategy for optimization? No, in my tests SoG was always slower on AMD GPUs, but you have a much faster GPU, so it is worth a try. With each crime and every kindness we birth our future. ID: 1895486 ·
