Update for GPU AP (both NV and ATi) to rev560

red-ray · Joined: 24 Jun 99 · Posts: 308 · Credit: 9,029,848 · RAC: 0
Message 1215159 - Posted: 7 Apr 2012, 16:13:01 UTC - in response to Message 1215135. Last modified: 7 Apr 2012, 16:54:18 UTC

> The remaining doubt is about the Nvidia drivers: should I keep using the old 266.58 that was needed by the NV-r521 to not clog the CPUs? (In fact, 266.58 is OK and I have no reason to upgrade it, but one of my hosts has 2 560Ti's and the older driver for them is 266.66, which is said to have some issues...)

> 270.xx and up suffer from excess CPU usage. AFAIK it is still not fixed in the latest drivers. I use 263.xx for crunching with a GTX250 and it goes very well.

> What about using a GTX 680 when the only drivers are 301.10? How high is "excessive CPU usage" as a % of a CPU on a 2.66GHz Core 2 Quad please?

> It will use a full core. That's why it's called the 100% CPU bug. It took AMD almost 6 months to fix this bug, so you will have to see how long it takes for Nvidia to resolve this.

100%! What is the split between Interrupt, DPC, Kernel and User time? To me this sounds like a synchronisation issue and I suspect the code could be changed to get round this. Has the OpenCL threading model been changed?

Do you really mean a "full core"? This would mean 2 CPUs would be 100% loaded on an i7, which I suspect is not the case.

Mike (Volunteer tester) · Joined: 17 Feb 01 · Posts: 34258 · Credit: 79,922,639 · RAC: 80
Message 1215160 - Posted: 7 Apr 2012, 16:25:22 UTC - in response to Message 1215159. Last modified: 7 Apr 2012, 16:26:35 UTC

> The remaining doubt is about the Nvidia drivers: should I keep using the old 266.58 that was needed by the NV-r521 to not clog the CPUs? (In fact, 266.58 is OK and I have no reason to upgrade it, but one of my hosts has 2 560Ti's and the older driver for them is 266.66, which is said to have some issues...)

> 270.xx and up suffer from excess CPU usage. AFAIK it is still not fixed in the latest drivers. I use 263.xx for crunching with a GTX250 and it goes very well.

> What about using a GTX 680 when the only drivers are 301.10? How high is "excessive CPU usage" as a % of a CPU on a 2.66GHz Core 2 Quad please?

> It will use a full core. That's why it's called the 100% CPU bug. It took AMD almost 6 months to fix this bug, so you will have to see how long it takes for Nvidia to resolve this.

> 100%! What is the split between Interrupt, DPC, Kernel and User time? To me this sounds like a synchronisation issue and I suspect the code could be changed to get round this. Has the OpenCL threading model been changed? Do you really mean a "full core"? This would mean 2 CPUs would be 100% loaded on an i7, which I suspect is not the case.

Raistmer tried to fix it with different functions, without success. It has to be fixed by Nvidia.

On the other hand, the benefit of the GPU app outweighs that anyway. Running 2 APs in less than 2 hours instead of 1 in 6 - 7 hours is not too shabby, I imagine.

With each crime and every kindness we birth our future.

X-Files 27 · Joined: 17 May 99 · Posts: 104 · Credit: 111,191,433 · RAC: 0
Message 1215168 - Posted: 7 Apr 2012, 17:00:04 UTC. Last modified: 7 Apr 2012, 17:10:21 UTC

This system doesn't suffer from excessive CPU usage:
Win7, i5-2500k
GTX 570
296.35 (Quadro driver using modded inf)

I think at most it's using 20% CPU.

Jord (Volunteer tester) · Joined: 9 Jun 99 · Posts: 15184 · Credit: 4,362,181 · RAC: 3
Message 1215193 - Posted: 7 Apr 2012, 17:55:49 UTC - in response to Message 1214708.

Tested it on my ATI HD6850, 2GB, Catalyst 12.3.

The latest AP on ATI finished all right, faster than I am used to as well. {grin}

I checked with GPU-Z, but a little late to be honest. However, what I saw was a load of 25% alternating with no load, CPU load max 4%. The AP was 7.23 percent blanked. Will wait for the next.

Mike (Volunteer tester) · Joined: 17 Feb 01 · Posts: 34258 · Credit: 79,922,639 · RAC: 80
Message 1215279 - Posted: 7 Apr 2012, 21:16:17 UTC - in response to Message 1215193.

> Tested it on my ATI HD6850, 2GB, Catalyst 12.3. The latest AP on ATI finished all right, faster than I am used to as well. {grin} I checked with GPU-Z, but a little late to be honest. However, what I saw was a load of 25% alternating with no load, CPU load max 4%. The AP was 7.23 percent blanked. Will wait for the next.

You should increase the unroll factor to at least 10 to get better speed on your card. Also FFA_block and FFA_block_fetch.

TRuEQ & TuVaLu (Volunteer tester) · Joined: 4 Oct 99 · Posts: 505 · Credit: 69,523,653 · RAC: 10
Message 1215316 - Posted: 7 Apr 2012, 22:06:41 UTC. Last modified: 7 Apr 2012, 22:07:21 UTC

I've read somewhere that DATA_CHUNK_UNROLL should be set to half of the max number of compute units.

Can anyone confirm this, please?

//TRuEQ

TRuEQ & TuVaLu (Volunteer tester) · Joined: 4 Oct 99 · Posts: 505 · Credit: 69,523,653 · RAC: 10
Message 1215318 - Posted: 7 Apr 2012, 22:11:47 UTC

I also saw that the sbs can be set in

<cmdline>-instances_per_device 2 -unroll 9 -ffa_block 8192 -ffa_block_fetch 4096 -sbs 256 -hp</cmdline>

The default is 128. I saw somewhere that it isn't implemented yet, but in my stderr it looks like it is set to 256... Can someone explain this, please?

<core_client_version>7.0.24</core_client_version>
<![CDATA[
<stderr_txt>
Number of app instances per device setted to:2
DATA_CHUNK_UNROLL setted to:9
FFA thread block override value:8192
FFA thread fetchblock override value:4096
Maximum single buffer size setted to:256MB
Running on device number: 0
DATA_CHUNK_UNROLL at default:9
Priority of worker thread raised successfully
Priority of process adjusted successfully, high priority class used
OpenCL platform detected: Advanced Micro Devices, Inc.
BOINC assigns 0 device, slots 0 to 1 (including) will be checked
Used slot is 0; Used GPU device parameters are:
Number of compute units: 18
Single buffer allocation size: 256MB
max WG size: 256
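One way to compare what was requested on the command line with what the app actually reports is to pull the values out of the stderr text. The following is only a minimal sketch: it assumes the line wording shown in the stderr quoted above (other app revisions may word things differently), and the function name `parse_ap_stderr` and the dictionary keys are made up for illustration.

```python
import re

# Patterns copied from the stderr wording quoted in the post above.
# Assumption: other revisions of the app may print these lines differently.
PATTERNS = {
    "instances_per_device": r"Number of app instances per device setted to:\s*(\d+)",
    "unroll": r"DATA_CHUNK_UNROLL setted to:\s*(\d+)",
    "ffa_block": r"FFA thread block override value:\s*(\d+)",
    "ffa_block_fetch": r"FFA thread fetchblock override value:\s*(\d+)",
    "sbs_mb": r"Maximum single buffer size setted to:\s*(\d+)MB",
    "compute_units": r"Number of compute units:\s*(\d+)",
}


def parse_ap_stderr(text):
    """Return the integer settings found in the stderr text (hypothetical helper)."""
    found = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            found[key] = int(match.group(1))
    return found


# Lines taken from the stderr in the post above.
SAMPLE = """Number of app instances per device setted to:2
DATA_CHUNK_UNROLL setted to:9
FFA thread block override value:8192
FFA thread fetchblock override value:4096
Maximum single buffer size setted to:256MB
Number of compute units: 18
"""

settings = parse_ap_stderr(SAMPLE)
print(settings)
```

With the sample above, `settings["unroll"]` is 9 and `settings["compute_units"]` is 18, so the unroll value here is exactly half the number of compute units.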

Jord (Volunteer tester) · Joined: 9 Jun 99 · Posts: 15184 · Credit: 4,362,181 · RAC: 3
Message 1215322 - Posted: 7 Apr 2012, 22:24:20 UTC - in response to Message 1215279.

> You should increase unroll factor to 10 at least to get better speed on your card. Also FFA_block and FFA_block_fetch.

It doesn't happen often that someone talks to me in riddles, but now it's happened. :-)
Care to elaborate, please? Just think I am a newbie. And be gentle with that whip. Nice....

Mike (Volunteer tester) · Joined: 17 Feb 01 · Posts: 34258 · Credit: 79,922,639 · RAC: 80
Message 1215323 - Posted: 7 Apr 2012, 22:25:53 UTC - in response to Message 1215316.

> I've read somewhere that DATA_CHUNK_UNROLL should be set to half of the max number of compute units. Can anyone confirm this, please? //TRuEQ

That's the safe value, yes. But for fast cards it can be increased above the number of CUs, as long as it doesn't result in invalids or screen lags.

Unroll 10 - 12 should work on high-end cards. Some cards can cope with 16.

Claggy (Volunteer tester) · Joined: 5 Jul 99 · Posts: 4654 · Credit: 47,537,079 · RAC: 4
Message 1215324 - Posted: 7 Apr 2012, 22:26:56 UTC - in response to Message 1215322. Last modified: 7 Apr 2012, 22:31:40 UTC

> You should increase unroll factor to 10 at least to get better speed on your card. Also FFA_block and FFA_block_fetch.

> It doesn't happen often that someone talks to me in riddles, but now it's happened. :-) Care to elaborate, please? Just think I am a newbie. And be gentle with that whip. Nice....

From the Lunatics Installer 0.40 Release Notes:

AP/MB:
-instances_per_device N how many tasks you want to run in parallel. Inverse of <count>.
-hp gives the app high priority
-no_cpu_lock prevents the app from using only a specific CPU core

AP only:
-v505 to process AP 5.05 tasks
-sbs 128 is the max size of single buffer that can be used in program. Lower limit is 128MB, upper - max size allowed particular card. [Note: not active yet]
-unroll 4 Optimal at half the number of Compute Units of the GPU. Lower values also reduce VRAM use. Decrease if you experience lags.
-ffa_block 2048 defines how many different periods GPU will process per single kernel call
-ffa_block_fetch 1024 defines how many threads will be used in FFA initial fetch kernel

ffa_block should be divisible by ffa_block_fetch. Going too high will result in premature 30/30 exit errors.

Claggy
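The two rules in the release notes above (unroll at half the number of compute units, and ffa_block divisible by ffa_block_fetch) can be checked before editing the command line. This is a rough sketch only: the helper names `suggest_unroll` and `build_cmdline` are hypothetical, the defaults are simply the example values shown in the release notes, and nothing here is part of the app or of BOINC.

```python
def suggest_unroll(compute_units):
    """Half the compute units, following the release notes' hint (at least 1)."""
    if compute_units < 1:
        raise ValueError("compute_units must be at least 1")
    return max(1, compute_units // 2)


def build_cmdline(compute_units, ffa_block=2048, ffa_block_fetch=1024,
                  unroll=None, instances_per_device=1, high_priority=False):
    """Assemble a <cmdline> string from the options quoted in the release notes."""
    # The release notes say ffa_block should be divisible by ffa_block_fetch.
    if ffa_block % ffa_block_fetch != 0:
        raise ValueError("ffa_block should be divisible by ffa_block_fetch")
    if unroll is None:
        unroll = suggest_unroll(compute_units)
    parts = [
        "-instances_per_device", str(instances_per_device),
        "-unroll", str(unroll),
        "-ffa_block", str(ffa_block),
        "-ffa_block_fetch", str(ffa_block_fetch),
    ]
    if high_priority:
        parts.append("-hp")
    return " ".join(parts)


# A 12-compute-unit card with the larger block sizes suggested in the thread.
print(build_cmdline(12, ffa_block=8192, ffa_block_fetch=4096))
```

Passing `unroll` explicitly overrides the half-the-CUs suggestion, which matches the advice elsewhere in the thread that faster cards may tolerate higher values as long as there are no invalids or screen lags.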

Mike (Volunteer tester) · Joined: 17 Feb 01 · Posts: 34258 · Credit: 79,922,639 · RAC: 80
Message 1215326 - Posted: 7 Apr 2012, 22:29:22 UTC - in response to Message 1215322. Last modified: 7 Apr 2012, 22:31:23 UTC

> You should increase unroll factor to 10 at least to get better speed on your card. Also FFA_block and FFA_block_fetch.

> It doesn't happen often that someone talks to me in riddles, but now it's happened. :-) Care to elaborate, please? Just think I am a newbie. And be gentle with that whip. Nice....

Sorry, I didn't intend to riddle.

You can edit your appinfo command line param to

-unroll 10 -ffa_block 8192 -ffa_block_fetch 4096 -no_CPU_lock

That should speed up your card and give better run times.

Raistmer (Volunteer developer, Volunteer tester) · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
Message 1215350 - Posted: 7 Apr 2012, 22:48:22 UTC - in response to Message 1215127.

> I'm using 268.36 on my almost new notebook and have just finished the first AP. All went fine and the CPU use was rather low, mostly under 5%. The stderr output contains some stuff that I wonder about: is there something I should do, or is that all normal info? http://setiathome.berkeley.edu/result.php?resultid=2387164861

Your stderr is completely fine; the task was just restarted many times.

Jord (Volunteer tester) · Joined: 9 Jun 99 · Posts: 15184 · Credit: 4,362,181 · RAC: 3
Message 1215364 - Posted: 7 Apr 2012, 23:03:50 UTC - in response to Message 1215326.

> Sorry i didnt intend to riddle.

LOL, well thanks for the explanation. You too Claggy!
All lines in app_info.xml adjusted.

TRuEQ & TuVaLu (Volunteer tester) · Joined: 4 Oct 99 · Posts: 505 · Credit: 69,523,653 · RAC: 10
Message 1215531 - Posted: 8 Apr 2012, 7:44:55 UTC - in response to Message 1215323.

> I've read somewhere that DATA_CHUNK_UNROLL should be set to half of the max number of compute units. Can anyone confirm this, please? //TRuEQ

> That's the safe value, yes. But for fast cards it can be increased above the number of CUs, as long as it doesn't result in invalids or screen lags. Unroll 10 - 12 should work on high-end cards. Some cards can cope with 16.

Then I understand.
Thank you, Mike.

Mike (Volunteer tester) · Joined: 17 Feb 01 · Posts: 34258 · Credit: 79,922,639 · RAC: 80
Message 1215533 - Posted: 8 Apr 2012, 7:46:36 UTC - in response to Message 1215364.

> Sorry i didnt intend to riddle.

> LOL, well thanks for the explanation. You too Claggy! All lines in app_info.xml adjusted.

You have a typo in your appinfo. Please check.

Jord (Volunteer tester) · Joined: 9 Jun 99 · Posts: 15184 · Credit: 4,362,181 · RAC: 3
Message 1215553 - Posted: 8 Apr 2012, 8:23:08 UTC - in response to Message 1215533. Last modified: 8 Apr 2012, 8:35:43 UTC

DATA_CHUNK_UNROLL setted to:10
FFA thread block override value:8192
FFA thread fetchblock override value:4096
CPU affinity ajustment will be skipped
Maximum single buffer size setted to:128MB

better?
Although, it's "set", not "setted". The verb goes set, set, set. Very easy that one. :P

Also adjusted -unroll to 6, instead of 10. It appears my GPU only has 12 Compute Units, so if half is optimal, then half it is.

> -unroll 4 Optimal at half the number of Compute Units of the GPU. Lower values also reduce VRAM use. Decrease if you experience lags.

Mike (Volunteer tester) · Joined: 17 Feb 01 · Posts: 34258 · Credit: 79,922,639 · RAC: 80
Message 1215557 - Posted: 8 Apr 2012, 8:38:46 UTC - in response to Message 1215553.

> DATA_CHUNK_UNROLL setted to:10
> FFA thread block override value:8192
> FFA thread fetchblock override value:4096
> CPU affinity ajustment will be skipped
> Maximum single buffer size setted to:128MB
>
> better? Although, it's "set", not "setted". The verb goes set, set, set. Very easy that one. :P
>
> Also adjusted -unroll to 6, instead of 10. It appears my GPU only has 12 Compute Units, so if half is optimal, then half it is.
>
> -unroll 4 Optimal at half the number of Compute Units of the GPU. Lower values also reduce VRAM use. Decrease if you experience lags.

Yep, much better.
Nice run time as well. Looks good.

Raistmer (Volunteer developer, Volunteer tester) · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
Message 1215566 - Posted: 8 Apr 2012, 9:16:47 UTC - in response to Message 1215159.

> 100%! What is the split between Interrupt, DPC, Kernel and User time? To me this sounds like a synchronisation issue and I suspect the code could be changed to get round this. Has the OpenCL threading model been changed? Do you really mean a "full core"? This would mean 2 CPUs would be 100% loaded on an i7, which I suspect is not the case.

Ray, I would be happy if you could provide this info ;)
And yes, I suspect a sync issue too.
Yes, perhaps NV changed the OpenCL sync model to a spinloop. AFAIK CUDA supports 2 sync models and they are configurable via the CUDA API. So far nobody has discovered a way to configure this for the NV OpenCL implementation, and such a config is not part of the OpenCL standard either.
There was a trick offered on the NV forums: create many fake contexts (it was supposed that the OpenCL sync model corresponds to CUDA's "auto" mode, which switches between sync strategies based on the number of contexts vs. the number of CPUs). I implemented fake context creation. It did not help.
If you have some ideas on how to tell NV's OpenCL runtime to use yield instead of a spinloop, please let me know.

And to be precise, it should be the "whole logical CPU usage bug" instead of the "whole physical core usage bug".
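For readers unfamiliar with the spinloop vs. yield distinction discussed above, here is a toy illustration of why a spinning wait occupies a whole logical CPU while a yielding wait barely touches it. It is only a sketch under loud assumptions: it has nothing to do with NVIDIA's actual runtime, "yield" is approximated with `time.sleep`, and the GPU's completion is faked with a timer-based `is_done` callback. All names are hypothetical.

```python
import time


def make_deadline(seconds):
    """Fake 'GPU finished' check: becomes true once the given time has passed."""
    deadline = time.monotonic() + seconds
    return lambda: time.monotonic() >= deadline


def spin_wait(is_done):
    """Busy-wait: poll continuously, keeping one logical CPU fully occupied."""
    polls = 0
    while not is_done():
        polls += 1
    return polls


def sleep_wait(is_done, poll_interval=0.005):
    """Yielding wait: give the CPU back to the OS between polls."""
    polls = 0
    while not is_done():
        polls += 1
        time.sleep(poll_interval)
    return polls


spin_polls = spin_wait(make_deadline(0.05))
sleep_polls = sleep_wait(make_deadline(0.05))
print("spin polls:", spin_polls)
print("sleep polls:", sleep_polls)
```

Both waits finish after roughly the same wall-clock time, but the spinning version performs vastly more polls, which is the CPU time the "100% CPU bug" burns. The trade-off is latency: the yielding wait may notice completion up to one poll interval late.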

Raistmer (Volunteer developer, Volunteer tester) · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
Message 1215567 - Posted: 8 Apr 2012, 9:18:50 UTC - in response to Message 1215168. Last modified: 8 Apr 2012, 9:19:08 UTC

> This system doesn't suffer from excessive CPU usage: Win7, i5-2500k, GTX 570, 296.35 (Quadro driver using modded inf). I think at most it's using 20% CPU.

Good to know this! Now it would be interesting to know what caused that: your driver version, your OS version or your GPU version... or smth we can't even imagine ;)

Raistmer (Volunteer developer, Volunteer tester) · Joined: 16 Jun 01 · Posts: 6325 · Credit: 106,370,077 · RAC: 121
Message 1215568 - Posted: 8 Apr 2012, 9:23:39 UTC - in response to Message 1215316.

> I've read somewhere that DATA_CHUNK_UNROLL should be set to half of the max number of compute units. Can anyone confirm this, please? //TRuEQ

The best value depends on the number of memory channels too.
There are 32 workitems per "unroll" (I'm writing from memory, not looking at the code, and I wrote that code long ago, so mistakes are possible). A wavefront supports 64. Each compute unit needs at least 2 wavefronts for full load (actually, more would be nicer).

