RX 480 OpenCL Question

Author	Message
Darrell Volunteer tester Joined: 14 Mar 03 Posts: 267 Credit: 1,418,681 RAC: 0	Message 1861461 - Posted: 14 Apr 2017, 16:18:55 UTC
Upgraded my HD5850 to a new MSI RX480 8GB Armor OC. BOINC correctly reports that the card has 8 GB, but OpenCL from the AMD drivers only reports 3 GB.
These are the command line parameters I am currently using:
    -v 1 -pref_wg_size 256 -sbs 1280 -hp -instances_per_device 1 -no_cpu_lock -high_perf -tune 1 64 1 4 -period_iterations_num 15 -tt 500 -spike_fft_thresh 4096 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 1024 -oclfft_tune_bn 64 -oclfft_tune_cw 64
I have tried pushing -sbs to 1536, but that causes the units to be postponed due to an error in the program.
Has anyone found a driver version that correctly reports the amount of memory? Any recommendations on tuning this beast would be greatly appreciated.
P.S. I am only running one instance at a time because my CPU is just an Athlon II X2 250. I have a Phenom II X6 1100T on its way via a slow boat from China.
ID: 1861461 ·

Jord Volunteer tester Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3	Message 1861474 - Posted: 14 Apr 2017, 17:36:09 UTC - in response to Message 1861461.
I doubt it has much impact, and it isn't a problem with the drivers you use; it's more likely the GPU application that does this. It does the same for my RX470. Not that you can ever fill 8GB of video RAM here at SETI...
ID: 1861474 ·

Darrell Volunteer tester Joined: 14 Mar 03 Posts: 267 Credit: 1,418,681 RAC: 0	Message 1861505 - Posted: 14 Apr 2017, 23:41:42 UTC - in response to Message 1861474.
With -sbs set to 1280 each task takes 1.57 GB of VRAM, but if you turn on verbose logging, a few pulse finds give the following type of message:
    WARNING: total WG number (119) less than optimal (612) for complete CUs load. Try to increase -sbs N value
    PulseFind geometry: NDRange={4,17,448}, WG={4,1,64},single_period_size=2.61MB, WG num=119, CU num=36
But increasing -sbs by a step of 256 to the next level, 1536, produces this message:
    ERROR: OpenCL kernel/call 'RepackInput_kernel' call failed (-4) in file ..\autocorr.cpp near line 694.
    Waiting 30 sec before restart...
Increasing -sbs has increased the percentage of core time used. When I get the Phenom processor I'll be able to run more tasks at a time on the GPU, hopefully without as big an impact as running multiple tasks on a single core has now. Of course, being attached to five projects that use the GPU, if you set BOINC to run two tasks at a time, it doesn't give you two tasks from one project; it gives you two tasks, each from a different project.
ID: 1861505 ·
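
The "WG num=119" figure in the warning can be reproduced from the geometry line itself: it is the number of workgroups needed to cover the NDRange in each dimension, multiplied together. The sketch below is only an illustration of that arithmetic; the function names are made up for this example and are not part of the SETI@home app.

```python
import math
import re

def workgroup_count(ndrange, wg):
    """Workgroups needed to cover the NDRange: product of ceil(global / local)."""
    return math.prod(-(-g // l) for g, l in zip(ndrange, wg))

def parse_geometry(line):
    """Extract NDRange, WG size and CU count from a 'PulseFind geometry' line."""
    nd = re.search(r"NDRange=\{(\d+),(\d+),(\d+)\}", line)
    wg = re.search(r"WG=\{(\d+),(\d+),(\d+)\}", line)
    cu = re.search(r"CU num=(\d+)", line)
    return (tuple(int(v) for v in nd.groups()),
            tuple(int(v) for v in wg.groups()),
            int(cu.group(1)))

# Log line as posted in the thread.
line = ("PulseFind geometry: NDRange={4,17,448}, WG={4,1,64},"
        "single_period_size=2.61MB, WG num=119, CU num=36")

ndrange, wg, cus = parse_geometry(line)
count = workgroup_count(ndrange, wg)
print(count)                   # 119, matching "WG num=119"
print(round(count / cus, 2))   # 3.31 workgroups per compute unit
```

The "optimal" figure of 612 happens to equal 17 x 36 (17 workgroups per compute unit), but how the app derives it is not stated in the thread. Separately, error code -4 in the OpenCL headers is CL_MEM_OBJECT_ALLOCATION_FAILURE, which is consistent with a buffer that could not be allocated at the larger -sbs value.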

HAL9000 Volunteer tester Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57	Message 1861762 - Posted: 16 Apr 2017, 0:38:42 UTC
With my 8GB R9 390X I am currently using -sbs 2048 with no issues. I did go as high as -sbs 3072 without any issues, but there were no apparent gains beyond 2048. When I tried -sbs 4096 the app failed to start, likely because it is only a 32-bit app.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
ID: 1861762 ·

Darrell Volunteer tester Joined: 14 Mar 03 Posts: 267 Credit: 1,418,681 RAC: 0	Message 1861983 - Posted: 17 Apr 2017, 4:32:01 UTC - in response to Message 1861762. Last modified: 17 Apr 2017, 4:53:33 UTC
These are the current command line parameters:
    -v 1 -pref_wg_size 256 -sbs 1408 -hp -instances_per_device 2 -no_cpu_lock -high_perf -no_use_sleep -tune 1 256 1 1 -period_iterations_num 9 -tt 400 -spike_fft_thresh 4096 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 1024 -oclfft_tune_bn 64 -oclfft_tune_cw 64
The adjustments have allowed a higher CPU percentage when running two tasks on a single core, and the tasks are now using 3.3 GB of VRAM. I know that increasing -sbs by another 128 to 1536 will cause errors in the program, but I still might try bumping it by another 64 to 1472. There's a little bit of screen lag, especially when a task starts, but I can live with it. The major benefit of the adjustments is that the AMD video driver has stopped resetting. The Mx, My, and Mz values of the -tune option look like they set the dimensions of a three-dimensional array, which is why I went with 256x1x1.
    4/16/2017 8:26:37 PM | | OpenCL: AMD/ATI GPU 0: Radeon (TM) RX 480 Graphics (driver version 2348.3, device version OpenCL 2.0 AMD-APP (2348.3), 8192MB, 8192MB available, 5949 GFLOPS peak)
    4/16/2017 8:26:37 PM | | [coproc] No NVIDIA library found
    4/16/2017 8:26:37 PM | | [coproc] calInit() returned 1
    4/16/2017 8:26:37 PM | | [coproc] clGetDeviceInfo failed to get CL_DEVICE_SIMD_PER_COMPUTE_UNIT_AMD for device 0
Is the above BOINC message anything to be concerned about?
ID: 1861983 ·

Gasper Sedej Volunteer tester Joined: 11 Apr 17 Posts: 1 Credit: 446,275 RAC: 0	Message 1862332 - Posted: 19 Apr 2017, 12:29:43 UTC - in response to Message 1861983.
Hello. I am trying to figure out where to put all those parameters. boinc, boinccmd and boincmgr do not accept them. Also, is this for Windows or Linux? I am trying to use my RX 480 on Ubuntu Linux (OpenCL is working too); the log says "Requesting new tasks for CPU and AMD/ATI GPU", but I only get CPU tasks.
ID: 1862332 ·

HAL9000 Volunteer tester Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57	Message 1862366 - Posted: 19 Apr 2017, 15:38:09 UTC - in response to Message 1862332. Last modified: 19 Apr 2017, 15:39:10 UTC
> Hello. I am trying to figure out where to put all those parameters. boinc, boinccmd and boincmgr do not accept them. Also, is this for Windows or Linux? I am trying to use my RX 480 on Ubuntu Linux (OpenCL is working too); the log says "Requesting new tasks for CPU and AMD/ATI GPU", but I only get CPU tasks.
As per ReadMe_MultiBeam_OpenCL.txt:
    Command line switches can be used either in app_info.xml or mb_cmdline.txt. Params in mb_cmdline.txt will override switches in <cmdline> tag of app_info.xml.
I believe it is the same for both Windows and Linux.
ID: 1862366 ·
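
As a rough illustration of the mb_cmdline.txt route, the sketch below assembles a list of switches into a single line and writes it to a file. The helper names and the idea of passing in a project directory are assumptions made for this example; the real file in your project directory may carry an app-specific name, so check what is actually installed before overwriting anything.

```python
from pathlib import Path

def build_cmdline(switches):
    """Join (switch, values) pairs into one line, e.g. ('-tune', [1, 64, 1, 4])."""
    parts = []
    for name, values in switches:
        parts.append(name)
        parts.extend(str(v) for v in values)
    return " ".join(parts)

def write_cmdline(project_dir, switches, filename="mb_cmdline.txt"):
    """Write the switches as a single line and return the path written.

    project_dir is a placeholder for wherever the SETI@home project
    files live on your machine; it is not a fixed BOINC location.
    """
    path = Path(project_dir) / filename
    path.write_text(build_cmdline(switches) + "\n", encoding="utf-8")
    return path

# A few of the switches quoted in this thread.
switches = [
    ("-sbs", [1280]),
    ("-instances_per_device", [1]),
    ("-tune", [1, 64, 1, 4]),
    ("-hp", []),
]
print(build_cmdline(switches))
```

Keeping the switches as structured pairs makes it easier to change a single value, such as -sbs, between test runs without retyping the whole line.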

Mike Volunteer tester Joined: 17 Feb 01 Posts: 34257 Credit: 79,922,639 RAC: 80	Message 1862390 - Posted: 19 Apr 2017, 20:41:13 UTC
You can also put them in app_config.xml.
With each crime and every kindness we birth our future.
ID: 1862390 ·

Darrell Volunteer tester Joined: 14 Mar 03 Posts: 267 Credit: 1,418,681 RAC: 0	Message 1862883 - Posted: 22 Apr 2017, 10:47:55 UTC - in response to Message 1861983.
Latest parameters:
    -v 1 -pref_wg_size 256 -sbs 1408 -hp -instances_per_device 2 -no_cpu_lock -high_perf -no_use_sleep -tune 1 256 1 1 -period_iterations_num 9 -tt 400 -spike_fft_thresh 4096 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 2048 -oclfft_tune_bn 64 -oclfft_tune_cw 64
A strange anomaly has appeared. BOINC is now reporting this:
    4/22/2017 12:02:00 AM | | OpenCL: AMD/ATI GPU 0: Radeon (TM) RX 480 Graphics (driver version 2348.3, device version OpenCL 2.0 AMD-APP (2348.3), 7536MB, 7536MB available, 5949 GFLOPS peak)
    4/22/2017 12:02:00 AM | | [coproc] No NVIDIA library found
    4/22/2017 12:02:00 AM | | [coproc] calInit() returned 1
    4/22/2017 12:02:00 AM | | [coproc] clGetDeviceInfo failed to get CL_DEVICE_SIMD_PER_COMPUTE_UNIT_AMD for device 0
For some reason, I have lost 656MB. I upgraded the driver from 17.4.1 to 17.4.3, but the OpenCL driver version is the same in both.
ID: 1862883 ·
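
The 656MB figure follows directly from the two BOINC startup lines posted in this thread (8192MB before, 7536MB after). A small sketch of pulling those numbers out of the log text, assuming the "NNNNMB, NNNNMB available" layout shown in the posts; the function name is made up for this example:

```python
import re

def reported_memory_mb(log_line):
    """Return (total_mb, available_mb) from a BOINC 'OpenCL:' startup line."""
    m = re.search(r"(\d+)MB, (\d+)MB available", log_line)
    if m is None:
        raise ValueError("no memory figures found in line")
    return int(m.group(1)), int(m.group(2))

# The two log lines as posted on 16 Apr and 22 Apr 2017.
before = ("OpenCL: AMD/ATI GPU 0: Radeon (TM) RX 480 Graphics (driver version 2348.3, "
          "device version OpenCL 2.0 AMD-APP (2348.3), 8192MB, 8192MB available, 5949 GFLOPS peak)")
after = ("OpenCL: AMD/ATI GPU 0: Radeon (TM) RX 480 Graphics (driver version 2348.3, "
         "device version OpenCL 2.0 AMD-APP (2348.3), 7536MB, 7536MB available, 5949 GFLOPS peak)")

lost = reported_memory_mb(before)[0] - reported_memory_mb(after)[0]
print(lost)  # 656
```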

HAL9000 Volunteer tester Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57	Message 1863029 - Posted: 22 Apr 2017, 20:28:21 UTC - in response to Message 1862883.
> [...] For some reason, I have lost 656MB. Upgraded driver from 17.4.1 to 17.4.3, but the OpenCL driver version is the same in both.
I have seen the amount of memory reported by BOINC change with the driver several times over the years. Since it didn't seem to actually affect anything, I didn't worry about it.
ID: 1863029 ·

Karsten Vinding Volunteer tester Joined: 18 May 99 Posts: 239 Credit: 25,201,931 RAC: 11	Message 1866074 - Posted: 7 May 2017, 18:55:22 UTC - in response to Message 1863029.
I also own an RX480 (XFX RX480 RS). It's running with an FX 8150 @ 4.6GHz on an ASUS Sabertooth 990 FX. I have been trying to find good settings for it, but hard information on the subject has been difficult to come by.
I saw your settings and decided to try them. Sadly, they seem to make my crunching time longer than with my previous settings. Only by about 30 seconds or so, but that is something when it takes 6-7 minutes to crunch a WU.
My normal settings are these:
    -sbs 1408 -period_iterations_num 1 -tt 300 -hp -high_prec_timer -high_perf -no_cpu_lock -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64
Would it be possible for you to try these and see what results they give? I must point out that I normally only crunch one WU at a time, as I find that crunching two at a time more than doubles the time per WU, and that's no good for throughput.
I am hoping that some other people chime in, so that we can find the best settings for this particular GPU.
ID: 1866074 ·

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1866101 - Posted: 7 May 2017, 20:44:17 UTC - in response to Message 1866074.
> have found it hard to actually find hard info in that regard.
ReadMe files in the project directory, plus http://lunatics.kwsn.info/index.php/board,1.0.html
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1866101 ·

Karsten Vinding Volunteer tester Joined: 18 May 99 Posts: 239 Credit: 25,201,931 RAC: 11	Message 1866246 - Posted: 8 May 2017, 14:43:00 UTC - in response to Message 1866101.
Thanks Raistmer. The settings I use are very much based on the suggested settings for a high-end AMD/ATI card in the readme files in the project directory. But they are not very up to date, and I don't know exactly where my RX480 stands in comparison to, e.g., an R9 290X.
I also tried following your "Some considerations regarding OpenCL MultiBeam app tuning from algorithm view" posts, and that made me change some settings. I have brought the crunching time down somewhat. But I'm certain there are even better settings that could be based on the card's capabilities, if one knows the limitations and strengths of a given card's architecture. For me it will mostly be trial and error.
ID: 1866246 ·

Karsten Vinding Volunteer tester Joined: 18 May 99 Posts: 239 Credit: 25,201,931 RAC: 11	Message 1866247 - Posted: 8 May 2017, 15:47:23 UTC - in response to Message 1866246. Last modified: 8 May 2017, 15:48:46 UTC
I have been playing a bit with the settings, trying to run more than one WU at a time. Every time I try this, the performance takes a huge dive. A WU that normally takes 7 minutes suddenly takes 45 minutes (22.5 minutes each on average for 2 WUs at a time). I have allocated a whole CPU core per GPU WU, and have even tried running with only GPU crunching.
The GPU shows full utilisation, but the memory load on the card is 0 most of the time. As soon as I go back to 1 WU at a time, my performance goes back to the normal times.
@Raistmer, is this due to the scheduling bug you sometimes refer to? I wonder why it seems to work for others with the same card, but not for me.
ID: 1866247 ·
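
To put the slowdown in throughput terms, the figures above (one WU in 7 minutes versus two WUs finishing together in 45 minutes) can be converted to tasks per hour. This is only a sketch of the arithmetic, with a hypothetical helper name:

```python
def tasks_per_hour(wall_minutes, concurrent):
    """Tasks completed per hour when `concurrent` tasks finish every `wall_minutes`."""
    return concurrent * 60.0 / wall_minutes

single = tasks_per_hour(7, 1)    # one WU every 7 minutes
double = tasks_per_hour(45, 2)   # two WUs every 45 minutes

print(round(single, 2))           # 8.57
print(round(double, 2))           # 2.67
print(round(single / double, 2))  # 3.21
```

On these numbers, running two at a time delivers roughly a third of the single-task throughput, which is far worse than the break-even point where two concurrent tasks would each take twice as long.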

Mike Volunteer tester Joined: 17 Feb 01 Posts: 34257 Credit: 79,922,639 RAC: 80	Message 1866256 - Posted: 8 May 2017, 16:52:33 UTC
How many CPU cores are free on your FX CPU?
ID: 1866256 ·

Darrell Volunteer tester Joined: 14 Mar 03 Posts: 267 Credit: 1,418,681 RAC: 0	Message 1866265 - Posted: 8 May 2017, 18:10:41 UTC
Hi Karsten, I think I know why running multiple units at a time on your RX480 is not working as you would expect. Looking at your task details, the optimized app that you are running is overriding the -sbs 1408 setting: instead of setting the single buffer allocation size to 1408MB, it is setting it to 3072MB. I.e., it is using all of the OpenCL memory for each task, so when you try to run more than one task at a time, the GPU has to do a complete swap-out of memory when switching between tasks, which causes the increased run times you have noticed. The 3072MB of OpenCL memory is the maximum total for all tasks combined.
If you look at my tasks, please understand that I'm using the stock apps supplied by SETI. I'm also attached to multiple GPU projects, and when that is the case, BOINC will very seldom run two or more tasks from the same project at a time, and some projects' tasks do not share the GPU very well.
Currently I'm running just one task at a time with the following command line parameters:
    -v 1 -pref_wg_size 256 -sbs 1472 -hp -instances_per_device 1 -no_cpu_lock -high_perf -no_use_sleep -tune 1 256 1 1 -period_iterations_num 8 -tt 300 -spike_fft_thresh 4096 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 4096 -oclfft_tune_bn 64 -oclfft_tune_cw 64
With your optimized app overriding -sbs, this leaves you with just the -tune, -spike_fft_thresh, and -oclfft_tune parameters to fine-tune your tasks. Since there is no really good information about what they do and what to expect when you modify them, it is going to be a matter of chance when you do, and will most likely result in just a few seconds of change, which may not be noticeable.
P.S. Please note that everything on my system except the GPU is nine-year-old technology.
ID: 1866265 ·
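
The reasoning in the post above amounts to a simple budget check: the per-task allocation times the number of concurrent tasks has to stay within the 3072MB that OpenCL reports. The sketch below is a simplification under stated assumptions. It treats the single buffer size as the whole per-task footprint, which the thread does not confirm (a task at -sbs 1280 was reported to use about 1.57 GB), and `headroom_mb` is a hypothetical safety margin, not an app setting.

```python
def fits_on_device(per_task_mb, instances, device_mb, headroom_mb=0):
    """True if `instances` tasks of `per_task_mb` each fit within `device_mb`.

    headroom_mb is a hypothetical safety margin, not an app setting.
    """
    return per_task_mb * instances + headroom_mb <= device_mb

OPENCL_GLOBAL_MB = 3072  # "Total device global memory" as reported in the thread

for per_task in (1408, 3072):
    for instances in (1, 2):
        ok = fits_on_device(per_task, instances, OPENCL_GLOBAL_MB)
        print(f"{per_task}MB x {instances}: {'fits' if ok else 'over budget'}")
```

Of the four combinations, only 3072MB x 2 exceeds the budget, which is the case the post suspects. Whether the app really allocates 3072MB per task is disputed in the replies that follow.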

Karsten Vinding Volunteer tester Joined: 18 May 99 Posts: 239 Credit: 25,201,931 RAC: 11	Message 1866267 - Posted: 8 May 2017, 18:27:23 UTC - in response to Message 1866256.
I leave one core free for every GPU task, plus one extra. So when running 2 GPU tasks, 3 cores are free. But as I said, I have tried running with only GPU tasks (by setting "On multiprocessor systems, use at most" to 25%, so that only the GPU tasks are running), and it does not help. The GPU tasks crawl along even though the CPU is completely idle apart from the two GPU apps. There is something funny/weird going on :)
ID: 1866267 ·

Karsten Vinding Volunteer tester Joined: 18 May 99 Posts: 239 Credit: 25,201,931 RAC: 11	Message 1866269 - Posted: 8 May 2017, 18:29:50 UTC - in response to Message 1866265. Last modified: 8 May 2017, 18:32:08 UTC
Thanks for your answer, Darrell. I don't think your answer is right, but I can't be sure. I have GPU-Z set up so I can monitor memory usage on the GPU, and I can see it go up and down as I increase or decrease -sbs. But you could be on to something; I just don't know enough about it to be absolutely certain.
From one of the slow WUs:
    Used GPU device parameters are:
    Number of compute units: 36
    Single buffer allocation size: 1408MB
    Total device global memory: 3072MB
    max WG size: 256
    local mem type: Real
    LotOfMem path: yes
    LowPerformanceGPU path: no
    HighPerformanceGPU path: yes
    period_iterations_num=1
It looks to me like it's using the 1408MB setting.
ID: 1866269 ·

Darrell Volunteer tester Joined: 14 Mar 03 Posts: 267 Credit: 1,418,681 RAC: 0	Message 1866274 - Posted: 8 May 2017, 19:12:38 UTC - in response to Message 1866269.
Looking at your most recent completed task (after scrolling through four pages of queued ones):
    Maximum single buffer size set to:1408MB
    Number of period iterations for PulseFind set to:1
    Target kernel sequence time set to 300ms
    System timer will be set in high resolution mode
    High-performance path selected. If GUI lags occur consider to remove -high_perf option from tuning line
    CPU affinity adjustment disabled
    SpikeFind FFT size threshold override set to:4096
    TUNE: kernel 1 now has workgroup size of (64,1,4)
    oclFFT global radix override set to:256
    oclFFT local radix override set to:16
    oclFFT max WG size override set to:256
    oclFFT max local FFT size override set to:512
    oclFFT number of local memory banks set to:64
    oclFFT minimal memory coalesce width set to:64
    Maximum single buffer size set to:2560MB <---------- ok for one task at a time, not for two at a time
    Priority of worker thread raised successfully
Also, further down in the task description:
    Credit multiplier is : 2.85
    WU true angle range is : 0.010252
    Used GPU device parameters are:
    Number of compute units: 36
    Single buffer allocation size: 3072MB <-- something is setting this to use all OpenCL memory, fine for one task only, maybe. Leaves no overhead for the GPU OpenCL system(?)
    Total device global memory: 3072MB
    max WG size: 256
    local mem type: Real
    LotOfMem path: yes
    LowPerformanceGPU path: no
    HighPerformanceGPU path: yes
Other than the number of period iterations setting, the only real difference I can see is that you are using an optimized app and I'm using a stock app.
ID: 1866274 ·
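
The comparison made by eye in the post above can be automated by pulling the buffer figures out of a task's stderr text. The sketch below assumes the message wording shown in the posts and uses a trimmed sample built from those lines; the function name and the returned dictionary keys are made up for this example.

```python
import re

def buffer_report(stderr_text):
    """Collect requested and allocated buffer sizes (MB) from task stderr text."""
    requested = [int(v) for v in re.findall(
        r"Maximum single buffer size set to:\s*(\d+)MB", stderr_text)]
    allocated = re.search(r"Single buffer allocation size:\s*(\d+)MB", stderr_text)
    total = re.search(r"Total device global memory:\s*(\d+)MB", stderr_text)
    return {
        "requested_mb": requested,
        "allocated_mb": int(allocated.group(1)) if allocated else None,
        "device_mb": int(total.group(1)) if total else None,
    }

# Selected lines from the task output quoted in the thread.
sample = """\
Maximum single buffer size set to:1408MB
Number of period iterations for PulseFind set to:1
Maximum single buffer size set to:2560MB
Used GPU device parameters are:
Number of compute units: 36
Single buffer allocation size: 3072MB
Total device global memory: 3072MB
"""

report = buffer_report(sample)
# Flag the task if the allocated size differs from the first requested size.
overridden = bool(report["requested_mb"]) and \
    report["allocated_mb"] != report["requested_mb"][0]
print(report)
print(overridden)  # True
```

For this sample the first requested size is 1408MB while the allocation is 3072MB, so the task is flagged. Running the same check on the slow task quoted earlier, which reported a 1408MB allocation, would not flag it, which is why the two posters reach different conclusions.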

Darrell Volunteer tester Joined: 14 Mar 03 Posts: 267 Credit: 1,418,681 RAC: 0	Message 1866278 - Posted: 8 May 2017, 19:19:39 UTC Last modified: 8 May 2017, 19:23:28 UTC
I'm trying your command line settings and will see what happens. It may take a while, as I have to wait for a SETI task to come up in rotation. [Note: attached to multiple GPU projects]
ID: 1866278 ·
