Script for Affinity & Priority Management

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1349057 - Posted: 21 Mar 2013, 12:33:10 UTC - in response to Message 1348979.  

Ok, thanks for the info. I thought it should work in such a config.
Please also try this cmd line:
-cpu_lock -gpu_lock -instances_num 1

No need to try this. I tried that rev on my own host - it looks like some debug code was left active, so the app exits before processing. I will check the code after I get home.

SETI apps news
We're not gonna fight them. We're gonna transcend them.

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1349729 - Posted: 23 Mar 2013, 9:29:58 UTC

There is an interesting post about peak performance of AMD GPUs: http://devgurus.amd.com/message/1287981#1287981

In particular, they give some advice on reaching peak performance, and the first point correlates well with the results of this thread:

Second, the NUMA. We found out that the "manual usage" of libnuma library (malloc and thread affinity) can dramatically increase performance. Devices in PCIe slots are controlled by different NUMA nodes. It is important to init and run command queue on the same NUMA node with the device it controls (maybe, because cl_command_queue runs in a separate thread). Also, clEnqueueTransfer calls can work faster if host data and a device belong to one NUMA node.


And another one about AMD drivers:
The second part of the post is about AMD drivers. We are chasing the peak performance, and it can be achieved in one case: when we have kernels on a device running on peak performance without gaps. As we see, all drivers after 12.6 (starting from 12.8 and up to 13.2) are not able to run kernels in a proper way. On the other hand, 12.6 drivers allow transfer overlap on two calls (clEnqueueWriteBuffer and clEnqueueReadBuffer) on PCI Express 2.0, but they don't work with overlap on PCIe 3.0. So, unfortunately, we have to state that at the moment AMD do not have appropriate drivers for a full-fledged work with OpenCL on a modern hardware.
Our tests show that NVidia GPUs don't have any problems at all, they are just slow...

See you at the GTC 2013!

From Russia with love,
Pavel,
Anton.


(the bolded part is particularly important)
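
For illustration only, here is a minimal sketch of what that "manual usage" of libnuma could look like on Linux (not their actual code; the node number, buffer size and device choice are placeholders, error handling is omitted; build with -lnuma -lOpenCL):

// Sketch: bind the host thread and its allocations to the NUMA node that
// (we assume) owns the GPU's PCIe slot, then create the command queue and
// do the transfer from that thread. Node 0 is a placeholder.
#include <numa.h>
#include <CL/cl.h>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available\n");
        return 1;
    }

    const int node = 0;            // placeholder: node that hosts the GPU
    numa_run_on_node(node);        // run this thread on that node's CPUs
    numa_set_preferred(node);      // prefer memory allocations from that node

    const size_t size = 1 << 20;
    void* host_buf = numa_alloc_onnode(size, node);  // node-local host buffer

    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
    // Queue created from the NUMA-bound thread, per the quoted advice.
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, nullptr);

    cl_mem dev_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, nullptr, nullptr);
    // Host side of the transfer lives on the same node as the device.
    clEnqueueWriteBuffer(queue, dev_buf, CL_TRUE, 0, size, host_buf, 0, nullptr, nullptr);

    clReleaseMemObject(dev_buf);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    numa_free(host_buf, size);
    return 0;
}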
SETI apps news
We're not gonna fight them. We're gonna transcend them.

cov_route
Joined: 13 Sep 12
Posts: 342
Credit: 10,270,618
RAC: 0
Canada
Message 1349776 - Posted: 23 Mar 2013, 13:18:13 UTC
Last modified: 23 Mar 2013, 13:51:12 UTC

About NUMA: I can set my memory controller to either ganged or unganged mode, which strikes me as a NUMA-related thing. I wonder if that makes any difference (I did some quick speed tests and didn't see any, but I didn't look into the GPU utilization bug).

I wonder if there is any way to discover the NUMA node of a PCI-E slot and a core in order to match them up. I guess I could try to discover the matchup by running tests with different core-specific affinities. More testing!

Per that last quote, are we looking at a downgrade to 12.6? Even more tests, I think.

Edit: Actually, on a one-socket consumer-level board I would think there is no NUMA (IOW all the same node).
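
For what it's worth, on Linux the kernel exposes the node of a PCI-E device directly in sysfs; a quick illustrative sketch (the bus address is a placeholder for the GPU's address as reported by lspci, and -1 means the platform reports no NUMA information):

// Sketch: read the NUMA node of a PCI-E device from Linux sysfs.
#include <cstdio>
#include <fstream>
#include <string>

int main() {
    const std::string pci_addr = "0000:01:00.0";   // placeholder GPU address
    std::ifstream f("/sys/bus/pci/devices/" + pci_addr + "/numa_node");
    int node = -1;
    if (f >> node)
        printf("PCI device %s reports NUMA node %d\n", pci_addr.c_str(), node);
    else
        printf("Could not read numa_node for %s\n", pci_addr.c_str());
    return 0;
}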

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1349787 - Posted: 23 Mar 2013, 14:12:42 UTC - in response to Message 1349776.  


Edit: Actually, on a one-socket consumer-level board I would think there is no NUMA (IOW all the same node).

Perhaps. The last thing I read (quite a few years ago) about multicore CPUs was that core 0 always processes hardware interrupts (so it doesn't matter which PCI-E slot it is). Maybe a multi-socket mobo is indeed required to notice this difference.

SETI apps news
We're not gonna fight them. We're gonna transcend them.

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1349820 - Posted: 23 Mar 2013, 17:00:34 UTC - in response to Message 1349776.  

About NUMA: I can set my memory controller to either ganged or unganged mode, which strikes me as a NUMA-related thing. I wonder if that makes any difference (I did some quick speed tests and didn't see any, but I didn't look into the GPU utilization bug).

I wonder if there is any way to discover the NUMA node of a PCI-E slot and a core in order to match them up. I guess I could try to discover the matchup by running tests with different core-specific affinities. More testing!

Per that last quote, are we looking at a downgrade to 12.6? Even more tests, I think.

Edit: Actually, on a one-socket consumer-level board I would think there is no NUMA (IOW all the same node).

In Windows you have two ways to check whether you have more than one NUMA node (a small programmatic sketch follows the list as well).
1) Use Task Manager.
http://www.benjaminathawes.com/blog/Lists/Photos/TaskManagerNUMA.png
2) Download Coreinfo and run it from a command prompt.
http://technet.microsoft.com/en-us/sysinternals/cc835722.aspx
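
The programmatic sketch mentioned above, for anyone who prefers a quick console check (a minimal example using the standard Win32 call; a single-node machine reports a highest node number of 0, i.e. one node):

// Sketch: report how many NUMA nodes Windows sees.
#include <windows.h>
#include <cstdio>

int main() {
    ULONG highest = 0;
    if (GetNumaHighestNodeNumber(&highest))
        printf("NUMA nodes: %lu\n", highest + 1);
    else
        printf("GetNumaHighestNodeNumber failed: %lu\n", GetLastError());
    return 0;
}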

We do have some users with multi-socket systems running several GPUs, so this kind of information could be relevant to them.

As the number of processor cores goes up, I could see future designs incorporating separate memory buses in a standard single-socket processor, instead of just increasing the number of memory channels as they currently tend to do. So it may become important to more and more users in the near future.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1349828 - Posted: 23 Mar 2013, 17:25:41 UTC - in response to Message 1349820.  

Thanks for the info.
Apparently on single-node hosts there will be no such menu item at all (my host has no 'CPU history' menu item like the one shown in the screenshot).

And here is one more update for the -cpu_lock feature: https://dl.dropbox.com/u/60381958/AP6_win_x86_SSE2_OpenCL_ATI_r1791.7z I hope this time it will work as intended.
If you run more than 1 instance per GPU, use the corresponding switch to inform the app about that fact.

SETI apps news
We're not gonna fight them. We're gonna transcend them.

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1349842 - Posted: 23 Mar 2013, 17:44:03 UTC - in response to Message 1349828.  

Thanks for the info.
Apparently on single-node hosts there will be no such menu item at all (my host has no 'CPU history' menu item like the one shown in the screenshot).

You need to be viewing the performance graphs to see the CPU history menu (unless your host has a single-core CPU).
On my host it shows only the option to display one graph per core or a combined graph, with no NUMA options... (on Win Vista and 7)

cov_route
Joined: 13 Sep 12
Posts: 342
Credit: 10,270,618
RAC: 0
Canada
Message 1350286 - Posted: 24 Mar 2013, 21:34:59 UTC

r1791 runs fine and -cpu_lock works. For a single instance it always ties to core 0. With two instances and -instances_per_device 2 and -gpu_lock, one instance ran on core 0 and the other on core 1.

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1350471 - Posted: 25 Mar 2013, 11:02:21 UTC - in response to Message 1350286.  

r1791 runs fine and -cpu_lock works. For a single instance it always ties to core 0. With two instances and -instances_per_device 2 and -gpu_lock, one instance ran on core 0 and the other on core 1.


That's exactly how I wanted it to run.
And what if -gpu_lock is omitted? -cpu_lock -instances_per_device 2 should be enough; adding -gpu_lock does no harm, it's just unneeded.
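
For anyone curious how a -cpu_lock style option can work in principle, a rough sketch of the general mechanism on Windows (this is not the app's actual implementation; instance_num is assumed to come from whatever instance bookkeeping the app already does):

// Sketch only: pin the current process to one logical CPU chosen by an
// instance index, roughly the behaviour observed above (instance 0 -> core 0,
// instance 1 -> core 1). NOT the app's real code; instance_num is assumed
// to be supplied by the app's own instance accounting.
#include <windows.h>
#include <cstdio>

bool lock_to_core(int instance_num) {
    DWORD_PTR process_mask = 0, system_mask = 0;
    if (!GetProcessAffinityMask(GetCurrentProcess(), &process_mask, &system_mask))
        return false;

    // Count the logical CPUs available to this process.
    int cpu_count = 0;
    for (DWORD_PTR m = process_mask; m != 0; m >>= 1)
        cpu_count += (int)(m & 1);
    if (cpu_count == 0)
        return false;

    // Instance 0 -> bit 0, instance 1 -> bit 1, wrapping if there are
    // more instances than cores.
    DWORD_PTR mask = (DWORD_PTR)1 << (instance_num % cpu_count);
    if ((mask & process_mask) == 0)
        return false;   // that core is not available to this process

    return SetProcessAffinityMask(GetCurrentProcess(), mask) != 0;
}

int main() {
    const int instance_num = 0;   // placeholder: 0 or 1 in the test above
    if (lock_to_core(instance_num))
        printf("Instance %d locked to its core\n", instance_num);
    else
        printf("Affinity lock failed\n");
    return 0;
}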

SETI apps news
We're not gonna fight them. We're gonna transcend them.