RX 480 OpenCL Question

Brent Norman (Volunteer tester, Canada)
Message 1866281 - Posted: 8 May 2017, 19:28:27 UTC - in response to Message 1866274.  
Last modified: 8 May 2017, 19:32:49 UTC

Looking at your most recent completed task (after scrolling through four pages of queued ones):
It's much easier to look at 'Valid' tasks (EDIT: or 'Pending'), not sorted by name, since the newer ones are usually near the top.
ID: 1866281

Darrell (Volunteer tester, United States)
Message 1866283 - Posted: 8 May 2017, 20:05:52 UTC

Ok, got a Seti_Beta SoG task to try your settings on first. I was watching the GPU D3D dedicated memory in HWiNFO64, which is where the OpenCL memory lies. The task had a rough time starting; I thought it was due to Number of period iterations being set so low, but it turns out it just had to do some binary kernel compiling. From the task with your command line settings:

Running on device number: 0
Maximum single buffer size set to:1408MB
Number of period iterations for PulseFind set to:1
Target kernel sequence time set to 300ms
System timer will be set in high resolution mode
High-performance path selected. If GUI lags occur consider to remove -high_perf option from tuning line
CPU affinity adjustment disabled
SpikeFind FFT size threshold override set to:4096
TUNE: kernel 1 now has workgroup size of (64,1,4)
oclFFT global radix override set to:256
oclFFT local radix override set to:16
oclFFT max WG size override set to:256
oclFFT max local FFT size override set to:512
oclFFT number of local memory banks set to:64
oclFFT minimal memory coalesce width set to:64
Priority of worker thread raised successfully
Priority of process adjusted successfully, high priority class used

WARNING: can't open binary kernel file for oclFFT plan: C:\ProgramData\BOINC/projects/setiweb.ssl.berkeley.edu_beta\MB_clFFTplan_Ellesmere_65536_gr256_lr16_wg256_tw0_ls512_bn64_cw64_r3584.bin_23483, continue with recompile...
WARNING: can't open binary kernel file for oclFFT plan: C:\ProgramData\BOINC/projects/setiweb.ssl.berkeley.edu_beta\MB_clFFTplan_Ellesmere_131072_gr256_lr16_wg256_tw0_ls512_bn64_cw64_r3584.bin_23483, continue with recompile...
ar=0.446658 NumCfft=191499 NumGauss=1057770764 NumPulse=226404193095 NumTriplet=452842013561
Currently allocated 1509 MB for GPU buffers <--------- ok for two at a time ---------- I wonder, is this on the CPU for feeding the GPU?

Credit multiplier is : 2.85
WU true angle range is : 0.446658
Used GPU device parameters are:
Number of compute units: 36
Single buffer allocation size: 1408MB <-----Stock app uses -sbs setting, so I could easily run two tasks
Total device global memory: 3072MB
max WG size: 256
local mem type: Real
LotOfMem path: yes
LowPerformanceGPU path: no
HighPerformanceGPU path: yes
period_iterations_num=1
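
For reference, the settings echoed in the stderr above would come from a command-line tuning file (the app's mb_cmdline txt file, or the <cmdline> element in app_info.xml on the anonymous platform) roughly along these lines; the flag names are taken from the SoG ReadMe as I understand them, so treat it as a sketch rather than Karsten's exact line:

-sbs 1408 -period_iterations_num 1 -tt 300 -high_perf -spike_fft_thresh 4096 -tune 1 64 1 4 -oclfft_tune_gr 256 -oclfft_tune_lr 16 -oclfft_tune_wg 256 -oclfft_tune_ls 512 -oclfft_tune_bn 64 -oclfft_tune_cw 64 -hp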

Task took 480.17 seconds on GPU and 380.97 seconds of CPU time, which is about the same as other SoGs using my parameters.

So there must be something in the anonymous optimized app that you are using that is setting the single buffer allocation to use all of the OpenCL memory, which may be fine for one task at a time, but no more than that.
ID: 1866283

Darrell (Volunteer tester, United States)
Message 1866284 - Posted: 8 May 2017, 20:07:19 UTC - in response to Message 1866281.  

Looking at your most recent completed task (after scrolling through four pages of queued ones):
It's much easier to look at 'Valid' tasks (EDIT: or 'Pending'), not sorted by name, since the newer ones are usually near the top.


You are right, I keep forgetting about that, thank you.
ID: 1866284

Darrell (Volunteer tester, United States)
Message 1866286 - Posted: 8 May 2017, 20:47:52 UTC

Ok Karsten, got a Seti SoG task, and with your command line parameters I got the same readings as on the Seti_Beta SoG task. I really wasn't sure my system would handle the low Number of period iterations setting, but it is, maybe; four of six Milkyway tasks have errored out since I started using them, so I might have to restart BOINC and/or the system.

Just noticed this with the two tasks:

Fftlength=32,pass=3:Tune: sum=11719.9(ms); min=168.7(ms); max=171.1(ms); mean=169.9(ms); s_mean=170; sleep=165(ms); delta=6005; N=69; usual

So with your settings, max, mean, and s_mean are now really close; with my settings they weren't.
ID: 1866286

Karsten Vinding (Volunteer tester, Denmark)
Message 1866287 - Posted: 8 May 2017, 20:50:34 UTC - in response to Message 1866274.  
Last modified: 8 May 2017, 21:06:55 UTC

OK, I see what you are getting at now.

I did set sbs to 2560 after experimenting today, to see if it brought more performance, and I thought it did.

There is something going on with mem, could it be this:
LotOfMem path: yes
???

But if I look at this slightly older WU, which I ran with my previous settings but with the same application as now, I don't see it setting 3072MB for the task. And I couldn't run two apps at the same time back then either.

https://setiathome.berkeley.edu/result.php?resultid=5681716921

Snippets from the result, in case it gets removed from the server before you see it.

Maximum single buffer size set to:768MB

Used GPU device parameters are:
Number of compute units: 36
Single buffer allocation size: 768MB
Total device global memory: 3072MB
max WG size: 256
local mem type: Real
LotOfMem path: yes
LowPerformanceGPU path: no
HighPerformanceGPU path: yes
period_iterations_num=1

This is from one that was run yesterday, with slightly updated settings:

https://setiathome.berkeley.edu/result.php?resultid=5717671910

Maximum single buffer size set to:1408MB

Used GPU device parameters are:
Number of compute units: 36
Single buffer allocation size: 1408MB
Total device global memory: 3072MB
max WG size: 256
local mem type: Real
LotOfMem path: yes
LowPerformanceGPU path: no
HighPerformanceGPU path: yes
period_iterations_num=1

It doesn't seem quite logical what is happening.
ID: 1866287

Karsten Vinding (Volunteer tester, Denmark)
Message 1866288 - Posted: 8 May 2017, 20:55:42 UTC - in response to Message 1866286.  

Quote:
So with your settings, max, mean, and s_mean are now really close; with my settings they weren't.
Quote end.

Is this a good thing or a bad thing?
ID: 1866288

Darrell (Volunteer tester, United States)
Message 1866295 - Posted: 8 May 2017, 22:16:33 UTC - in response to Message 1866288.  

From reading what Raistmer has said in his link, once mean and s_mean are basically equal, increasing the difference between -period_iterations_num and -tt will have no more effect; you either try to increase -sbs, or run more than one task at a time. So with -sbs set to 1408MB, we can increase it to 1472MB (it likes increases of 64MB). However, that is the max, because another increase of 64MB to 1536MB will cause the app to crash due to exceeding a programming limit.

The next thing that he recommends in his link is to try to run more than one task at a time, which you cannot do until you find out what is causing the single allocation buffer to be set to 3072MB (i.e. all of the OpenCL memory). I have never used an optimized app via the anonymous app_info file, so I do not know if there is a setting in there that is causing the single allocation buffer to get set to 3072. Until we find out what is doing that, you cannot run more than one task at a time. That just leaves the -spike_threshold, -tune, and -oclfft settings to play with.
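To put the -sbs headroom above in concrete terms (a sketch using the same flag the stderr output already shows):

-sbs 1408   current setting (1408 MB single buffer)
-sbs 1472   the one remaining safe step (1408 + 64)
-sbs 1536   one step too far; the app crashes on the programming limit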
ID: 1866295

Brent Norman (Volunteer tester, Canada)
Message 1866309 - Posted: 8 May 2017, 23:04:56 UTC - in response to Message 1866295.  

Quote: I have never used an optimized app via the anonymous app_info file, so I do not know if there is a setting in there that is causing the single allocation buffer to get set to 3072.

FYI: The Anonymous platform apps are just different versions of the stock apps and use the same tuning parameters.
ID: 1866309

Darrell (Volunteer tester, United States)
Message 1866316 - Posted: 9 May 2017, 0:22:13 UTC - in response to Message 1866309.  
Last modified: 9 May 2017, 0:37:34 UTC

Hi Brent, no they are not. They take the stock apps and recompile them using all the compiler optimization settings possible to get the fastest code possible; that is why they are called optimized. Then you bring them in using app_info.xml, where you can fine-tune them with the command line parameters. There is nothing wrong with this; that is what the anonymous platform was put in for.

Karsten's problem is that something is overriding the -sbs setting of 1408MB and causing his tasks to use 3072MB in the single allocation buffer, i.e. all the available OpenCL memory, preventing him from running more than one task at a time. I was just pointing out that the only basic difference is that he is using an optimized app and I'm not, and I did not know if there is a setting in the app_info.xml that is causing this.

The major difference between our systems is that he has a newer AMD FX processor and I have an AMD Phenom II. We both have an RX 480 card, though probably not the same brand and model. His processor causes the Multi_beam Kernel r3557.cl to be used and mine uses Multi_beam Kernel r3584.cl; this should not be the cause of the problem. And until we find the cause of a single task using all of the OpenCL memory, he will not be able to run more than one task at a time.

P.S. Checking again, he has an eight-core processor and I have only six; oops, really big difference.
ID: 1866316

Darrell (Volunteer tester, United States)
Message 1866330 - Posted: 9 May 2017, 1:37:51 UTC
Last modified: 9 May 2017, 1:41:09 UTC

Karsten, let's check our app_config.xml files; mine, for one task at a time on a six-core processor, is:

<app_config>
  <project_max_concurrent>1</project_max_concurrent>
  <app>
    <name>astropulse_v7</name>
    <max_concurrent>1</max_concurrent>
    <gpu_versions>
      <gpu_usage>1</gpu_usage>
      <cpu_usage>0.17</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>setiathome_v8</name>
    <max_concurrent>1</max_concurrent>
    <gpu_versions>
      <gpu_usage>1</gpu_usage>
      <cpu_usage>0.17</cpu_usage>
    </gpu_versions>
  </app>
</app_config>

I take it that for your eight-core processor the cpu_usage line is set to 0.125, or rounded to 0.13?
ID: 1866330

Wiggo (Australia)
Message 1866368 - Posted: 9 May 2017, 6:25:06 UTC - in response to Message 1866330.  

Karsten, let's check our app_config.xml files; mine, for one task at a time on a six-core processor, is:
[app_config.xml snipped; see the previous post]
I take it that for your eight-core processor the cpu_usage line is set to 0.125, or rounded to 0.13?

I'm not an expert in AMD/ATi but these should be set at 1.

Cheers.
ID: 1866368

Grant (SSSF) (Volunteer tester, Australia)
Message 1866369 - Posted: 9 May 2017, 6:42:27 UTC - in response to Message 1866368.  

I take it that for your eight-core processor the cpu_usage line is set to 0.125, or rounded to 0.13?

I'm not an expert in AMD/ATi but these should be set at 1.

For 1 CPU thread per GPU WU, which is best for SoG.
0.17 means with 6 GPU WUs running, they only get 1 CPU thread to share between them. With less than 6 GPU WUs running, no CPU cores are reserved specifically for GPU WUs; they'll continue to crunch CPU WUs as well as try to meet the GPU WU demands.
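Put as numbers (assuming BOINC simply sums the per-task reservations and only whole threads end up reserved): 2 GPU WUs x 0.17 = 0.34 of a CPU, so no core is set aside and all 6 threads keep crunching CPU WUs, while 2 GPU WUs x 1.0 = 2 CPUs, so two threads are left free just to feed the GPU tasks.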
Grant
Darwin NT
ID: 1866369

Karsten Vinding (Volunteer tester, Denmark)
Message 1866386 - Posted: 9 May 2017, 10:18:11 UTC - in response to Message 1866330.  

I run mine with <cpu_usage>1</cpu_usage>, as this makes the BOINC client allocate one whole CPU core per GPU task, which will speed up crunching times in many cases.

I'm at work now (lunch break), but will follow up when I get home later in the day.

It would be nice to get to the bottom of this :)
ID: 1866386

Darrell (Volunteer tester, United States)
Message 1866426 - Posted: 10 May 2017, 3:40:42 UTC

Hi Wiggo and Grant, please check my understanding of how the settings work. In the computing preferences, under usage limits, "Use at most X% of the CPUs" is for non-GPU tasks and is calculated as:

For my 6 core processor:
100% / 6 = 16.66667% rounded to 17%, this means each CPU task gets 1 core.
17% x 3 = 51%, setting the usage limit to this means 3 CPU tasks running, each getting 1 core.

Now for GPU tasks, the setting of GPU and CPU from the BOINC wiki:

gpu_usage
The number of GPU instances (possibly fractional) used by GPU versions of this app. For example, .5 means that two jobs of this application can run at once on a single GPU.
cpu_usage
The number of CPU instances (possibly fractional) used by GPU versions of this app.

So this calculates as:

gpu_usage 1 / 1 = 1: one task on the GPU; cpu_usage 1 / 6 = 0.17: each GPU task gets 1 CPU core
gpu_usage 1 / 2 = 0.5: two tasks on the GPU; cpu_usage 1 / 6 = 0.17: each GPU task gets 1 CPU core
gpu_usage 1 / 3 = 0.33: three tasks on the GPU; cpu_usage 1 / 6 = 0.17: each GPU task gets 1 CPU core

So with the preference set to 51%, gpu_usage set to 0.5, and cpu_usage set to 0.17, I have 3 CPU tasks running, each on its own core, and 2 GPU tasks each sharing half of the GPU and each being fed by 1 core. Thus BOINC is using 5 cores, leaving one core open to run everything else.

This is my understanding of the settings and is what I observe happening on the CPU using Sysinternal's - Process Explorer.

This still does not explain what is causing some of Karsten's tasks to override the -sbs setting and use all of the OpenCL memory. After comparing our app_config.xml files to make sure they are correct, I was going to ask him to post his app_info.xml so that someone with knowledge of its correct settings could check whether something in that file might be causing his problem. Because until he stops a single task from sucking up all of the OpenCL memory, he won't be able to run more than one task at a time on his RX 480.

P.S. Will the bbs formatting codes that show in the preview box go away and leave your posts looking nice when you hit the post button?
ID: 1866426

Brent Norman (Volunteer tester, Canada)
Message 1866437 - Posted: 10 May 2017, 4:20:59 UTC - in response to Message 1866426.  

"Use at most X% of the CPUs" means total amount, including GPU tasks using CPU.
50% means use at most 3 threads (cores) of the 6 available.

1.0 GPU / 1.0 CPU is 1 task per card and reserves 1 CPU thread
1.0 GPU / 1.5 CPU is 1 task per card and reserves 1.5 CPU threads
* That is 1 thread for 1 card (since it is < 2), 3 threads for 2 cards (1.5 + 1.5 = 3)
0.5 GPU / 1.0 CPU is 2 tasks per card and reserves 2 CPU threads
0.3 GPU / 1.0 CPU is 3 tasks per card and reserves 3 CPU threads

Forget about the 1/6 of a CPU. BOINC looks at threads, not the complete CPU.

I find it easier to drop the CPU percentage and just use the app_config. Say 0.5 GPU / 1.5 CPU is 2 GPU tasks, and 3 CPU tasks concurrent.
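
As a rough sketch of that last example in app_config.xml form (app name taken from Darrell's file above; adjust or add the other app blocks to taste):

<app_config>
  <app>
    <name>setiathome_v8</name>
    <gpu_versions>
      <!-- 0.5 GPU = 2 tasks per card; 1.5 CPU each = 3 threads reserved for those 2 tasks -->
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>1.5</cpu_usage>
    </gpu_versions>
  </app>
</app_config>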
ID: 1866437

rob smith (Volunteer moderator, Volunteer tester, United Kingdom)
Message 1866439 - Posted: 10 May 2017, 4:53:08 UTC

But there is no real point in having more than 1 CPU thread dedicated to a GPU task as the CPU part of the application is not multi-threaded.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1866439

Brent Norman (Volunteer tester, Canada)
Message 1866442 - Posted: 10 May 2017, 4:59:50 UTC - in response to Message 1866439.  

That is true Rob, I'm just saying that a CPU reservation of over 1.0 will leave some room for your OS and whatever else you might have/want.
ID: 1866442

Wiggo (Australia)
Message 1866448 - Posted: 10 May 2017, 5:48:14 UTC

I'm sorry Darrell (it's good to see someone else spell that name properly), but you're confusing 1 setting with another.

Try it my way and see what happens (as this is what I use for my Nvidia cards). ;-)

Cheers.
ID: 1866448

Darrell (Volunteer tester, United States)
Message 1866452 - Posted: 10 May 2017, 6:45:39 UTC - in response to Message 1866448.  

I did try cpu_usage set to 1, and BOINC started shutting down CPU tasks to try to run the GPU tasks. If I can get my screen capture to work, and then figure out how to show the image in a post, I can show you that with the preference set to 51%, gpu_usage set to 0.5, and cpu_usage set to 0.17, Process Explorer shows the three CPU tasks each taking ~16.5% of the CPU (i.e. 1 core each) and the two GPU tasks each taking ~16.5% of the CPU (i.e. 1 core each). (Note: the GPU tasks only use a full core, 16.5%, when starting and ending their run; during the run they are usually below 5%.)

Einstein, Milkyway, Seti, and Seti_Beta are set for two tasks at a time (gpu_usage 0.5, cpu_usage 0.17) because they share the GPU well.

Collatz, Moo! Wrapper, and PrimeGrid are set for one task at a time (gpu_usage 1, cpu_usage 0.17): Collatz and PrimeGrid tasks do not share the GPU very well, and Moo is set to one because it stresses any GPU it runs on.

But this distracts from the problem I was trying to help Karsten with, and that is figuring out what is causing each of his tasks to use all of the available OpenCL memory.
ID: 1866452

Wiggo (Australia)
Message 1866456 - Posted: 10 May 2017, 7:25:40 UTC

I'll try to explain this a different way then.

The BOINC Manager's "Use at most X% of the CPUs" here is set to 100% (to use all 4 of my CPU cores), but this is an entirely different setting from what goes into the app_config.xml (or, in my case, the app_info.xml). Each of my 3 rigs has dual Nvidia GPUs (I don't use the on-die Intel GPUs), and while each GPU is doing a SoG or OpenCL AP task (with the settings I provided), my CPU runs 2 CPU tasks while the other 2 cores support the GPU tasks.
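
In numbers for that 4-core example (assuming whole-thread reservations): 2 GPU tasks x 1.0 cpu_usage = 2 threads reserved, leaving 4 - 2 = 2 threads for CPU tasks; at 2 tasks per GPU it becomes 4 x 1.0 = 4 reserved threads, so no CPU tasks run until the GPU work runs out.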

When I run out of GPU work during the weekly outage my CPU will revert to running 4 CPU tasks until I get more GPU tasks.

Now if I were to run 2 tasks per GPU then my CPU wouldn't run any CPU tasks until I ran out of GPU tasks.

If I were to use your settings then I'd be overcommitting my CPU (trying to run 3, or maybe even 4, tasks), and the tasks running on it would slow down greatly; it could even cause lockups.

I hope that this helps explain the difference between those 2 settings.

Cheers.
ID: 1866456