Script for Affinity & Priority Management



Message boards : Number crunching : Script for Affinity & Priority Management

Karsten Vinding
Volunteer tester
Joined: 18 May 99
Posts: 140
Credit: 16,681,905
RAC: 2,874
Denmark
Message 1346212 - Posted: 13 Mar 2013, 16:37:55 UTC - in response to Message 1345984.
Last modified: 13 Mar 2013, 17:10:53 UTC

Never mind; read my next post.
____________

Karsten Vinding
Volunteer tester
Joined: 18 May 99
Posts: 140
Credit: 16,681,905
RAC: 2,874
Denmark
Message 1346218 - Posted: 13 Mar 2013, 17:09:23 UTC - in response to Message 1346212.
Last modified: 13 Mar 2013, 17:18:44 UTC

After looking at Raistmer's results, I decided to try 8 active cores with no affinity set, and with the GPU task's affinity set to core 0 (using Process Lasso).

I got this:



With 7 active cores, CPU affinity set to cores 1-7, and the GPU task on core 0, I got this:



There is a _small_ difference, probably 3-4%.

This is a much better result than what I saw yesterday, probably because yesterday I locked the GPU task to core 1 instead of core 0, most likely because I misunderstood something in the process.

I haven't seen a heavily blanked WU yet, or whether the GPU usage bug rears its head again, but if this keeps looking this good, I will probably be crunching on all cylinders again, and be happy doing it :)

It must be said that these are AP WUs being crunched; I don't know how MB will react.
____________

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3501
Credit: 47,709,280
RAC: 45,715
Russia
Message 1346228 - Posted: 13 Mar 2013, 17:55:27 UTC

My benchmark was on the MB app.
But many more statistics are required, because the real problem with a fully loaded CPU was intermittent slowdowns: from time to time the app's execution time increased considerably, by several times. That was called the "low GPU usage bug". If app performance remains steady, then yes, full CPU load with such affinity settings would be an attractive config.
Also, I currently run (mostly AP) with 1 free CPU core + 2 AP tasks at once (on a Q9450 + HD6950 host), and the host's RAC is going up like a rocket, with still no sign of stabilizing. Perhaps each GPU app instance should be pinned to a different CPU core, not both to core 0. How to test this with Process Lasso is not quite clear.
cov_route, could you please post an example of your script config for this particular case? 4 CPU cores, 2 GPU app instances, each instance pinned to its own core, CPU apps' affinity unspecified.

____________

cov_route
Joined: 13 Sep 12
Posts: 295
Credit: 7,414,021
RAC: 12,698
Canada
Message 1346348 - Posted: 13 Mar 2013, 23:23:19 UTC
Last modified: 13 Mar 2013, 23:24:35 UTC

Raistmer: change the names of the apps as required. The CPU app must be listed first, the GPU app second. Each GPU task will get its own core (0-3); the CPU tasks will all get cores 0,1,2,3. All priorities are below normal.

$pri_names = @( `
    "Idle",`
    "BelowNormal",`
    "Normal",`
    "AboveNormal",`
    "High",`
    "RealTime")
$proc_names = @( `
    "AK_v8b2_win_sse3_amd",`
    "MB7_win_x86_SSE_OpenCL_ATi_HD5_r1764")
$pri_actv = @( `
    @(1,1), `
    @(1,1))
$pri_inac = @( `
    @(1,1), `
    @(1,1))
$aff_actv = @( `
    @(15), `
    @(1,2,4,8))
$aff_inac = @( `
    @(15), `
    @(1,2,4,8))
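The $aff_actv and $aff_inac values read like standard Windows processor affinity bitmasks, where bit n selects core n: 15 is binary 1111 (cores 0-3), while 1, 2, 4 and 8 each have a single bit set, one core apiece. That reading is an assumption based on the numbers and the description above, not on the script's source. A small Python sketch of the conversion (the function names are hypothetical):

```python
def mask_to_cores(mask: int) -> list:
    """Return the zero-based core indices whose bits are set in an affinity mask."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def cores_to_mask(cores) -> int:
    """Build an affinity mask with one bit set per zero-based core index."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return mask


# CPU app: one mask shared by all of its tasks, 15 == 0b1111 -> cores 0-3.
cpu_mask = 15
# GPU app: one single-bit mask per task instance (assumed order: instance 0 first).
gpu_masks = [1, 2, 4, 8]

print(mask_to_cores(cpu_mask))                # [0, 1, 2, 3]
print([mask_to_cores(m) for m in gpu_masks])  # [[0], [1], [2], [3]]
print(cores_to_mask([0, 1, 2, 3]))            # 15
```

For the 2-instance case only the first two GPU masks (cores 0 and 1) would come into play, under the same ordering assumption.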

If you want CPU priority idle and GPU priority below normal, use:
$pri_actv = @( `
    @(0,0), `
    @(1,1))
$pri_inac = @( `
    @(0,0), `
    @(1,1))

For extra fanciness, if you want the CPU priority to be below normal except when a CPU task shares a core with the GPU job (in which case it should be idle), use:
$pri_actv = @( `
    @(1,0), `
    @(1,1))
$pri_inac = @( `
    @(1,0), `
    @(1,1))
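The numbers in the priority pairs are indexes into $pri_names (0 = Idle, 1 = BelowNormal, and so on). Going by the last example, the first number appears to apply when a CPU task has a core to itself and the second when it shares a core with a GPU task. That interpretation, the Python function name and its parameters are assumptions for illustration, and the sketch ignores whatever distinguishes $pri_actv from $pri_inac:

```python
PRI_NAMES = ["Idle", "BelowNormal", "Normal", "AboveNormal", "High", "RealTime"]


def cpu_task_priority(pair, shares_core_with_gpu: bool) -> str:
    """Hypothetical lookup: pick a priority name from a (normal, shared) index pair.

    Assumption: pair[0] applies when the CPU task does not share a core with a
    GPU task, pair[1] when it does.
    """
    normal_idx, shared_idx = pair
    return PRI_NAMES[shared_idx if shares_core_with_gpu else normal_idx]


fancy = (1, 0)  # the "extra fanciness" CPU setting above
print(cpu_task_priority(fancy, shares_core_with_gpu=False))  # BelowNormal
print(cpu_task_priority(fancy, shares_core_with_gpu=True))   # Idle
```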

cov_route
Joined: 13 Sep 12
Posts: 295
Credit: 7,414,021
RAC: 12,698
Canada
Message 1346717 - Posted: 14 Mar 2013, 23:24:05 UTC

Here is a fix for a possible bug that prevents setting priorities, plus added configuration instructions:

https://www.box.com/s/ir2hx9e08k88kz3tq77p

Karsten Vinding
Volunteer tester
Joined: 18 May 99
Posts: 140
Credit: 16,681,905
RAC: 2,874
Denmark
Message 1346757 - Posted: 15 Mar 2013, 5:03:33 UTC - in response to Message 1346717.

Well, sadly, today the GPU load according to GPU-Z was down again to between 85 and 92%, and never reached 100%.

Removing one CPU thread helped a little, but to get back to 100% GPU usage I had to go back to crunching on 6 cores and give the GPU 2 free cores. Now I'm at 99-100% again.

It's a little frustrating that the system behaves so differently from day to day, and I don't see any pattern to it.
____________

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,090,328
RAC: 39,525
Argentina
Message 1346768 - Posted: 15 Mar 2013, 5:45:47 UTC - in response to Message 1346757.

Well, sadly, today the GPU load according to GPU-Z was down again to between 85 and 92%, and never reached 100%.

Removing one CPU thread helped a little, but to get back to 100% GPU usage I had to go back to crunching on 6 cores and give the GPU 2 free cores. Now I'm at 99-100% again.

It's a little frustrating that the system behaves so differently from day to day, and I don't see any pattern to it.

What about the crunching times? I mean, is that extra 15% of GPU usage worth losing 2 CPU cores for?
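The question comes down to simple arithmetic. A rough Python sketch, assuming GPU output scales linearly with GPU utilization and every CPU core produces the same amount of work; the per-device rates are hypothetical placeholders, not measurements from this thread. Freeing k cores pays off only when the GPU's extra output (utilization gain × GPU rate) exceeds the k × CPU rate that is lost, so the break-even point is a GPU that is k / gain times faster than a single core:

```python
def host_throughput(cpu_cores: int, cpu_rate: float, gpu_rate: float, gpu_util: float) -> float:
    """Work per hour: CPU cores at a fixed per-core rate plus one GPU at a utilization in 0..1."""
    return cpu_cores * cpu_rate + gpu_rate * gpu_util


def break_even_gpu_ratio(cores_freed: int, util_gain: float) -> float:
    """GPU-to-single-core speed ratio above which freeing cores for the GPU pays off."""
    return cores_freed / util_gain


# Utilizations follow the GPU-Z readings above; the rates are hypothetical.
cpu_rate, gpu_rate = 1.0, 20.0
all_cores = host_throughput(8, cpu_rate, gpu_rate, 0.85)
two_freed = host_throughput(6, cpu_rate, gpu_rate, 1.00)

print(f"8 cores at 85% GPU: {all_cores:.1f}, 6 cores at 100% GPU: {two_freed:.1f}")
print(f"break-even GPU/core ratio: {break_even_gpu_ratio(2, 0.15):.1f}")
```

With these placeholder rates, freeing two cores wins (26.0 vs 25.0); with a GPU less than about 13 times faster than one core, running all 8 cores would come out ahead instead.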
____________

Mike
Project donor
Volunteer tester
Joined: 17 Feb 01
Posts: 24535
Credit: 33,859,930
RAC: 23,612
Germany
Message 1346787 - Posted: 15 Mar 2013, 7:53:46 UTC

If you read the ReadMe text of the OpenCL apps: it's a known fact that with new drivers it's necessary to free at least one CPU core.
I also have to free 2 cores on my FX 8150.

Yes, it is worth it.
I did long-running tests and it is confirmed.

____________

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3501
Credit: 47,709,280
RAC: 45,715
Russia
Message 1346790 - Posted: 15 Mar 2013, 8:03:08 UTC - in response to Message 1346757.


It's a little frustrating that the system behaves so differently from day to day, and I don't see any pattern to it.

Do you account for the different input data patterns the app itself processes? It's a known fact that not all input data can be processed equally well by the current GPU apps.

____________

Karsten Vinding
Volunteer tester
Joined: 18 May 99
Posts: 140
Credit: 16,681,905
RAC: 2,874
Denmark
Message 1346865 - Posted: 15 Mar 2013, 15:05:58 UTC - in response to Message 1346790.


Do you account for the different input data patterns the app itself processes? It's a known fact that not all input data can be processed equally well by the current GPU apps.


I don't, and I don't think I can.

It's not a criticism of the apps or the work the developers do.

It's frustration over the fact that you cannot get your crunching up to its _full_ potential, because of some obscure bug or decision made by the driver developers.

I have to run with two cores disabled to get the highest throughput on average. But I'm convinced that if the AMD (and nVidia) developers worked on this, the problem could be removed, or at least reduced to a point where we could use all cores and still have full GPU utilization.
____________

Claggy
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4141
Credit: 33,589,261
RAC: 26,873
United Kingdom
Message 1346884 - Posted: 15 Mar 2013, 15:39:49 UTC - in response to Message 1346865.
Last modified: 15 Mar 2013, 15:43:55 UTC


Do you account for the different input data patterns the app itself processes? It's a known fact that not all input data can be processed equally well by the current GPU apps.


I don't, and I don't think I can.

He means the Angle Range (or % of Blanking for AP) of the WU in question; you can find that out by looking at the result once you've reported it.

For AP, GPU load depends heavily on the % of Blanking: the more Blanking, the lower the GPU load and the longer the WU will take. (The Blanking portion of the task is carried out on the CPU, which slows everything down.)

Claggy

cov_route
Joined: 13 Sep 12
Posts: 295
Credit: 7,414,021
RAC: 12,698
Canada
Message 1346890 - Posted: 15 Mar 2013, 16:18:27 UTC

It may (or may not) be possible to wring the last 10% of performance out of a GPU with the task assigned to a single core; I don't know. On my machine I run like that, with all cores processing CPU jobs, and GPU utilization is consistently 96-100%. But it's only one (slightly dated) machine. (Note that I had to experiment to find the right app and driver combination, r1764 + 12.8, to make it work so well, so it's not just a matter of core config.)

If things work out, the real benefit will be fixing the GPU utilization bug, not optimization. As things are now, under the default settings a user might see their GPU drop to very low utilization, like 10% (I've had that). And many or most of those users will never know it, because they don't check and don't know HOW to check. They just expect SAH to work out of the box.

If the GPU app's default settings could keep utilization above, say, 70% all the time, that would be a big positive change, in my opinion.

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,090,328
RAC: 39,525
Argentina
Message 1346906 - Posted: 15 Mar 2013, 16:51:17 UTC - in response to Message 1346787.

If you read the ReadMe text of the OpenCL apps: it's a known fact that with new drivers it's necessary to free at least one CPU core.
I also have to free 2 cores on my FX 8150.

Yes, it is worth it.
I did long-running tests and it is confirmed.

I know that, but they are testing a different approach that uses affinities to avoid reserving cores. And while I'm skeptical about this approach, I think GPU utilization alone is not a good measure of its efficacy...
____________

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3501
Credit: 47,709,280
RAC: 45,715
Russia
Message 1347260 - Posted: 16 Mar 2013, 11:30:27 UTC - in response to Message 1346890.
Last modified: 16 Mar 2013, 11:30:41 UTC


If the GPU app's default settings could keep utilization above, say, 70% all the time, that would be a big positive change, in my opinion.

Agreed. So as not to slow down improvements and testing because of my own lack of time, I will soon provide CPUlock with changed logic (1 core instead of 2).
Then everyone can test this approach more easily, just by setting the corresponding switch. And if testing is positive, this switch will be on by default in the next stock update.
____________

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3501
Credit: 47,709,280
RAC: 45,715
Russia
Message 1348089 - Posted: 18 Mar 2013, 14:29:11 UTC

Here is a build with the new CPUlock behavior. Also, a new option was added to bind to the same CPU instead of different ones. See the ReadMe for usage info.

Please test whether it helps GPU load on a fully loaded CPU (the switches must be used; CPUlock is OFF by default for now).



https://dl.dropbox.com/u/60381958/AP6_win_x86_SSE2_OpenCL_ATI_r1785.7z
____________

cov_route
Joined: 13 Sep 12
Posts: 295
Credit: 7,414,021
RAC: 12,698
Canada
Message 1348111 - Posted: 18 Mar 2013, 15:40:21 UTC - in response to Message 1348089.

Thank you Raistmer, I will get to it tonight.

cov_route
Joined: 13 Sep 12
Posts: 295
Credit: 7,414,021
RAC: 12,698
Canada
Message 1348327 - Posted: 19 Mar 2013, 1:36:39 UTC

I tried using -cpu_lock with 2 instances and it ran with affinity 1,2,3,4. Are there other switches I need to set?

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3501
Credit: 47,709,280
RAC: 45,715
Russia
Message 1348433 - Posted: 19 Mar 2013, 10:33:35 UTC - in response to Message 1348327.

I tried using -cpu_lock with 2 instances and it ran with affinity 1,2,3,4. Are there other switches I need to set?

Yes, per the ReadMe, other switches are needed too.
But here is a simplified version: https://dl.dropbox.com/u/60381958/AP6_win_x86_SSE2_OpenCL_ATI_r1786.7z
Now -cpu_lock needs only -instances_per_device N if there is more than 1 task per GPU, and -cpu_lock_fixed_cpu N doesn't need any additional switches at all.
Try this binary.
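As described in this thread, -cpu_lock pins each GPU task instance to its own CPU (with -instances_per_device N when there is more than one task per GPU), while -cpu_lock_fixed_cpu N binds every instance to CPU N. The Python sketch below models only that described outcome; the function, its parameters, and the instance-to-CPU mapping (instance k to CPU k) are hypothetical and not taken from the app's source:

```python
from typing import Optional


def gpu_task_affinity(instance: int, n_cpus: int, fixed_cpu: Optional[int] = None) -> int:
    """Return a single-CPU affinity mask for one GPU task instance.

    Hypothetical model of the two switches:
    - fixed_cpu given -> every instance binds to that CPU (like -cpu_lock_fixed_cpu N)
    - fixed_cpu None  -> instance k binds to CPU k, wrapping around (like -cpu_lock)
    """
    cpu = fixed_cpu if fixed_cpu is not None else instance % n_cpus
    if not 0 <= cpu < n_cpus:
        raise ValueError(f"CPU {cpu} is out of range for {n_cpus} CPUs")
    return 1 << cpu


# The case asked about earlier: 4 CPU cores, 2 GPU task instances.
print([gpu_task_affinity(i, 4) for i in range(2)])               # [1, 2]
print([gpu_task_affinity(i, 4, fixed_cpu=3) for i in range(2)])  # [8, 8]
```

How the real app learns its own instance index is not covered here.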

____________

cov_route
Joined: 13 Sep 12
Posts: 295
Credit: 7,414,021
RAC: 12,698
Canada
Message 1348939 - Posted: 21 Mar 2013, 2:28:38 UTC

Couldn't get on last night to post my findings.

r1786 doesn't work on my machine; it exits after a few seconds. However, I was able to observe the affinity behaviour during that brief time.

-cpu_lock doesn't seem to work for me. I always get 1,2,3,4. Tried with and without -instances_per_device.

-cpu_lock_fixed_cpu does work.

Here is part of the bench output for -cpu_lock:

AP6_win_x86_SSE2_OpenCL_ATI_r1786.exe -cpu_lock / short_ap_21oc08ab_B2_P0_00081_20081130_08605.wu :
AppName: AP6_win_x86_SSE2_OpenCL_ATI_r1786.exe
AppArgs: -cpu_lock
TaskName: short_ap_21oc08ab_B2_P0_00081_20081130_08605.wu
Started at : 21:28:37.231
Ended at : 21:28:42.237
4.912 secs Elapsed
0.641 secs CPU time

ref-AP6_win_x86_SSE2_OpenCL_ATI_r1761.exe-short_ap_21oc08ab_B2_P0_00081_20081130_08605.wu.res: <ap_signal>40,<pulses>30,<best_pulses>10
result-AP6_win_x86_SSE2_OpenCL_ATI_r1786.exe-short_ap_21oc08ab_B2_P0_00081_20081130_08605.wu.res: <ap_signal>0,<pulses>0,<best_pulses>0
All Signals: Weakly similar or Different.
Pulses: pulse at signal 0 has no match (direction -->)
Weakly similar or Different.
Best Pulses: Weakly similar or Different.

-(.\testDatas\ref\ref-AP6_win_x86_SSE2_OpenCL_ATI_r1761.exe-short_ap_21oc08ab_B2_P0_00081_20081130_08605.wu.res)-
Reportable Single Pulses: 0 [OK], 0 above threshold*THRESHOLD_FUDGE
Reportable Repeating Pulses: 30 [Weak]
Single Pulses (Best): 0 [OK], 0 above threshold*THRESHOLD_FUDGE

-(.\testDatas\result-AP6_win_x86_SSE2_OpenCL_ATI_r1786.exe-short_ap_21oc08ab_B2_P0_00081_20081130_08605.wu.res)-
Reportable Single Pulses: 0 [OK], 0 above threshold*THRESHOLD_FUDGE
Reportable Repeating Pulses: 0 [Weak]
Single Pulses (Best): 0 [OK], 0 above threshold*THRESHOLD_FUDGE


[ stderr ]
21:28:37 (3304): Can't open init data file - running in standalone mode
CPU affinity adjustment enabled
21:28:37 (3304): Can't open init data file - running in standalone mode
Priority of worker thread raised successfully
Priority of process adjusted successfully, below normal priority class used
21:28:37 (3304): Can't open init data file - running in standalone mode
OpenCL platform detected: Advanced Micro Devices, Inc.
WARNING: BOINC supplied wrong platform!
BOINC assigns device 0
WARNING: BOINC failed to provide OpenCL device, using own enumeration abilities
Used GPU device parameters are:
Number of compute units: 6
Single buffer allocation size: 256MB
max WG size: 256
Info: CPU affinity mask used: 0

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3501
Credit: 47,709,280
RAC: 45,715
Russia
Message 1348979 - Posted: 21 Mar 2013, 6:13:31 UTC - in response to Message 1348939.

OK, thanks for the info. I thought it should work in such a config.
Please also try this command line:
-cpu_lock -gpu_lock -instances_num 1

____________


Copyright © 2014 University of California