Script for Affinity & Priority Management

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1349057 - Posted: 21 Mar 2013, 12:33:10 UTC - in response to Message 1348979.  

Ok, thanks for the info. I thought it should work in such a config.
Please also try this cmd line:
-cpu_lock -gpu_lock -instances_num 1

No need to try this. I tried that rev on my own host - it looks like some debug code was left active, so the app exits before processing. I will check the code after I get home.

SETI apps news
We're not gonna fight them. We're gonna transcend them.

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1349729 - Posted: 23 Mar 2013, 9:29:58 UTC

There is an interesting post about peak performance of AMD GPUs: http://devgurus.amd.com/message/1287981#1287981

In particular, they give some advice on reaching peak performance, and the first point correlates well with the results of this thread:

Second, the NUMA. We found out that the "manual usage" of libnuma library (malloc and thread affinity) can dramatically increase performance. Devices in PCIe slots are controlled by different NUMA nodes. It is important to init and run command queue on the same NUMA node with the device it controls (maybe, because cl_command_queue runs in a separate thread). Also, clEnqueueTransfer calls can work faster if host data and a device belong to one NUMA node.


And another one about AMD drivers:
The second part of the post is about AMD drivers. We are chasing the peak performance, and it can be achieved in one case: when we have kernels on a device running on peak performance without gaps. As we see, all drivers after 12.6 (starting from 12.8 and up to 13.2) are not able to run kernels in a proper way. On the other hand, 12.6 drivers allow transfer overlap on two calls (clEnqueueWriteBuffer and clEnqueueReadBuffer) on PCI Express 2.0, but they don't work with overlap on PCIe 3.0. So, unfortunately, we have to state that at the moment AMD do not have appropriate drivers for a full-fledged work with OpenCL on a modern hardware.
Our tests show that NVidia GPUs don't have any problems at all, they are just slow...

See you at the GTC 2013!

From Russia with love,
Pavel,
Anton.


(the bolded part is particularly important)
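
For illustration only, here is a minimal sketch of what that "manual usage" of libnuma could look like on Linux (not their actual code; the node number, buffer size and device choice are placeholders, error handling is omitted; build with -lnuma -lOpenCL):

// Sketch: bind the host thread and its allocations to the NUMA node that
// (we assume) owns the GPU's PCIe slot, then create the command queue and
// do the transfer from that thread. Node 0 is a placeholder.
#include <numa.h>
#include <CL/cl.h>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available\n");
        return 1;
    }

    const int node = 0;            // placeholder: node that hosts the GPU
    numa_run_on_node(node);        // run this thread on that node's CPUs
    numa_set_preferred(node);      // prefer memory allocations from that node

    const size_t size = 1 << 20;
    void* host_buf = numa_alloc_onnode(size, node);  // node-local host buffer

    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
    // Queue created from the NUMA-bound thread, per the quoted advice.
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, nullptr);

    cl_mem dev_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, nullptr, nullptr);
    // Host side of the transfer lives on the same node as the device.
    clEnqueueWriteBuffer(queue, dev_buf, CL_TRUE, 0, size, host_buf, 0, nullptr, nullptr);

    clReleaseMemObject(dev_buf);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    numa_free(host_buf, size);
    return 0;
}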
SETI apps news
We're not gonna fight them. We're gonna transcend them.

cov_route
Joined: 13 Sep 12
Posts: 342
Credit: 10,270,618
RAC: 0
Canada
Message 1349776 - Posted: 23 Mar 2013, 13:18:13 UTC
Last modified: 23 Mar 2013, 13:51:12 UTC

About NUMA: I can set my memory controller to either ganged or unganged mode, which strikes me as a NUMA-related thing. I wonder if that makes any difference (I did some quick speed tests and didn't see any, but I didn't look into the GPU utilization bug).

I wonder if there is any way to discover the NUMA node of a PCI-E slot and a core in order to match them up. I guess I could try to discover the matchup by running tests with different core-specific affinities. More testing!

Per that last quote, are we looking at a downgrade to 12.6? Even more tests, I think.

Edit: Actually, on a one-socket consumer-level board I would think there is no NUMA (IOW all the same node).
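
For what it's worth, on Linux the kernel exposes the node of a PCI-E device directly in sysfs; a quick illustrative sketch (the bus address is a placeholder for the GPU's address as reported by lspci, and -1 means the platform reports no NUMA information):

// Sketch: read the NUMA node of a PCI-E device from Linux sysfs.
#include <cstdio>
#include <fstream>
#include <string>

int main() {
    const std::string pci_addr = "0000:01:00.0";   // placeholder GPU address
    std::ifstream f("/sys/bus/pci/devices/" + pci_addr + "/numa_node");
    int node = -1;
    if (f >> node)
        printf("PCI device %s reports NUMA node %d\n", pci_addr.c_str(), node);
    else
        printf("Could not read numa_node for %s\n", pci_addr.c_str());
    return 0;
}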

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1349787 - Posted: 23 Mar 2013, 14:12:42 UTC - in response to Message 1349776.  


Edit: Actually, on a one-socket consumer-level board I would think there is no NUMA (IOW all the same node).

Perhaps. The last thing I read (quite a few years ago) about multicore CPUs was that core 0 always processes hardware interrupts (so it doesn't matter which PCI-E slot it is). Maybe a multi-socket mobo is indeed required to notice this difference.

SETI apps news
We're not gonna fight them. We're gonna transcend them.

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1349820 - Posted: 23 Mar 2013, 17:00:34 UTC - in response to Message 1349776.  

About NUMA: I can set my memory controller to either ganged or unganged mode, which strikes me as a NUMA-related thing. I wonder if that makes any difference (I did some quick speed tests and didn't see any, but I didn't look into the GPU utilization bug).

I wonder if there is any way to discover the NUMA node of a PCI-E slot and a core in order to match them up. I guess I could try to discover the matchup by running tests with different core-specific affinities. More testing!

Per that last quote, are we looking at a downgrade to 12.6? Even more tests, I think.

Edit: Actually, on a one-socket consumer-level board I would think there is no NUMA (IOW all the same node).

In Windows you have two ways to check whether you have more than one NUMA node (a small programmatic sketch follows the list as well).
1) Use Task Manager.
http://www.benjaminathawes.com/blog/Lists/Photos/TaskManagerNUMA.png
2) Download Coreinfo and run it from a command prompt.
http://technet.microsoft.com/en-us/sysinternals/cc835722.aspx
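
The programmatic sketch mentioned above, for anyone who prefers a quick console check (a minimal example using the standard Win32 call; a single-node machine reports a highest node number of 0, i.e. one node):

// Sketch: report how many NUMA nodes Windows sees.
#include <windows.h>
#include <cstdio>

int main() {
    ULONG highest = 0;
    if (GetNumaHighestNodeNumber(&highest))
        printf("NUMA nodes: %lu\n", highest + 1);
    else
        printf("GetNumaHighestNodeNumber failed: %lu\n", GetLastError());
    return 0;
}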

We do have some users with multi-socket systems running several GPUs, so this kind of information could be relevant to them.

As the number of processor cores goes up, I could see future designs incorporating separate memory buses in a standard single-socket processor, instead of just increasing the number of memory channels as they currently tend to do. So it may become important to more and more users in the near future.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1349828 - Posted: 23 Mar 2013, 17:25:41 UTC - in response to Message 1349820.  

Thanks for the info.
Apparently on single-node hosts there will be no such menu item at all (my host has no 'CPU history' menu item like the one shown in the screenshot).

And here is one more update for the -cpu_lock feature: https://dl.dropbox.com/u/60381958/AP6_win_x86_SSE2_OpenCL_ATI_r1791.7z I hope this time it will work as intended.
If you run more than 1 instance per GPU, use the corresponding switch to inform the app about that fact.

SETI apps news
We're not gonna fight them. We're gonna transcend them.

Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1349842 - Posted: 23 Mar 2013, 17:44:03 UTC - in response to Message 1349828.  

Thanks for the info.
Apparently on single-node hosts there will be no such menu item at all (my host has no 'CPU history' menu item like the one shown in the screenshot).

You need to be viewing the performance graphs to see the CPU history menu (unless your host has a single-core CPU).
On my host it shows only the option to display one graph per core or a combined graph, with no NUMA options... (on Win Vista and 7)

cov_route
Joined: 13 Sep 12
Posts: 342
Credit: 10,270,618
RAC: 0
Canada
Message 1350286 - Posted: 24 Mar 2013, 21:34:59 UTC

r1791 runs fine and -cpu_lock works. For a single instance it always ties to core 0. With two instances and -instances_per_device 2 and -gpu_lock, one instance ran on core 0 and the other on core 1.

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1350471 - Posted: 25 Mar 2013, 11:02:21 UTC - in response to Message 1350286.  

r1791 runs fine and -cpu_lock works. For a single instance it always ties to core 0. With two instances and -instances_per_device 2 and -gpu_lock, one instance ran on core 0 and the other on core 1.


That's exactly how I wanted it to run.
And what if -gpu_lock is omitted? -cpu_lock -instances_per_device 2 should be enough; adding -gpu_lock does no harm, it's just unneeded.
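
For anyone curious how a -cpu_lock style option can work in principle, a rough sketch of the general mechanism on Windows (this is not the app's actual implementation; instance_num is assumed to come from whatever instance bookkeeping the app already does):

// Sketch only: pin the current process to one logical CPU chosen by an
// instance index, roughly the behaviour observed above (instance 0 -> core 0,
// instance 1 -> core 1). NOT the app's real code; instance_num is assumed
// to be supplied by the app's own instance accounting.
#include <windows.h>
#include <cstdio>

bool lock_to_core(int instance_num) {
    DWORD_PTR process_mask = 0, system_mask = 0;
    if (!GetProcessAffinityMask(GetCurrentProcess(), &process_mask, &system_mask))
        return false;

    // Count the logical CPUs available to this process.
    int cpu_count = 0;
    for (DWORD_PTR m = process_mask; m != 0; m >>= 1)
        cpu_count += (int)(m & 1);
    if (cpu_count == 0)
        return false;

    // Instance 0 -> bit 0, instance 1 -> bit 1, wrapping if there are
    // more instances than cores.
    DWORD_PTR mask = (DWORD_PTR)1 << (instance_num % cpu_count);
    if ((mask & process_mask) == 0)
        return false;   // that core is not available to this process

    return SetProcessAffinityMask(GetCurrentProcess(), mask) != 0;
}

int main() {
    const int instance_num = 0;   // placeholder: 0 or 1 in the test above
    if (lock_to_core(instance_num))
        printf("Instance %d locked to its core\n", instance_num);
    else
        printf("Affinity lock failed\n");
    return 0;
}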

SETI apps news
We're not gonna fight them. We're gonna transcend them.