Message boards : Number crunching : Update for GPU AP (both NV and ATi) to rev560
red-ray Send message Joined: 24 Jun 99 Posts: 308 Credit: 9,029,848 RAC: 0 |
The remaining doubt is about the Nvidia drivers, should I keep using the old 266.58 that was needed by the NV-r521 to not clog the CPUs? 100%! What is the split between Interrupt, DPC, Kernel and User time? To me this sounds like a synchronisation issue, and I suspect the code could be changed to get round this. Has the OpenCL threading model been changed? Do you really mean a "full core"? That would mean 2 CPUs would be 100% loaded on an i7, which I suspect is not the case. |
Mike Send message Joined: 17 Feb 01 Posts: 34258 Credit: 79,922,639 RAC: 80 |
The remaining doubt is about the Nvidia drivers, should I keep using the old 266.58 that was needed by the NV-r521 to not clog the CPUs? Raistmer tried to fix it with different functions, without success. It has to be fixed by Nvidia. On the other hand, the GPU app's benefit outweighs it anyway. Running 2 APs in less than 2 hours, instead of 1 in 6-7 hours, is not too shabby, I imagine. With each crime and every kindness we birth our future. |
X-Files 27 Send message Joined: 17 May 99 Posts: 104 Credit: 111,191,433 RAC: 0 |
This system doesn't suffer from excessive CPU usage: Win7, i5-2500K, GTX 570, driver 296.35 (Quadro driver using a modded INF). I think at most it's using 20% CPU. |
Jord Send message Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3 |
Tested it on my ATI HD6850, 2GB, Catalyst 12.3. Latest AP on ATI finished all right, faster than I am used to as well. {grin} I checked with GPU-Z, but a little late, to be honest. However, what I saw was a load of 25% alternated with no load, CPU load max 4%. The AP was 7.23 percent blanked. Will wait for the next. |
Mike Send message Joined: 17 Feb 01 Posts: 34258 Credit: 79,922,639 RAC: 80 |
Tested it on my ATI HD6850, 2GB, Catalysts 12.3 You should increase the unroll factor to at least 10 to get better speed on your card. Also increase FFA_block and FFA_block_fetch. With each crime and every kindness we birth our future. |
TRuEQ & TuVaLu Send message Joined: 4 Oct 99 Posts: 505 Credit: 69,523,653 RAC: 10 |
I've read somewhere that DATA_CHUNK_UNROLL should be set to half of the max number of compute units. Can anyone confirm this, please? //TRuEQ TRuEQ & TuVaLu |
TRuEQ & TuVaLu Send message Joined: 4 Oct 99 Posts: 505 Credit: 69,523,653 RAC: 10 |
I also saw that sbs can be set in the command line: <cmdline>-instances_per_device 2 -unroll 9 -ffa_block 8192 -ffa_block_fetch 4096 -sbs 256 -hp</cmdline> (the default is 128). I saw somewhere that it isn't implemented yet, but in my stderr it looks like it is set to 256... Can someone explain this, please?

<core_client_version>7.0.24</core_client_version>
<![CDATA[
<stderr_txt>
Number of app instances per device setted to:2
DATA_CHUNK_UNROLL setted to:9
FFA thread block override value:8192
FFA thread fetchblock override value:4096
Maximum single buffer size setted to:256MB
Running on device number: 0
DATA_CHUNK_UNROLL at default:9
Priority of worker thread raised successfully
Priority of process adjusted successfully, high priority class used
OpenCL platform detected: Advanced Micro Devices, Inc.
BOINC assigns 0 device, slots 0 to 1 (including) will be checked
Used slot is 0; Used GPU device parameters are:
Number of compute units: 18
Single buffer allocation size: 256MB
max WG size: 256

TRuEQ & TuVaLu |
Jord Send message Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3 |
You should increase unroll factor to 10 at least to get better speed on your card. It doesn't happen often that someone talks to me in riddles, but now it's happened. :-) Care to elaborate, please? Just think I am a newbie. And be gentle with that whip. Nice.... |
Mike Send message Joined: 17 Feb 01 Posts: 34258 Credit: 79,922,639 RAC: 80 |
I've read somewhere that DATA_CHUNK_UNROLL should be setted to: half of the Max compute number units: That's the safe value, yes. But for fast cards it can be increased above the number of CUs, as long as it doesn't result in invalids or screen lags. Unroll 10-12 should work on high-end cards. Some cards can cope with 16. With each crime and every kindness we birth our future. |
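As a rough illustration only, Mike's rule of thumb above (half the compute units as the safe starting value, with roughly 16 as the most some cards can cope with) could be sketched as a tiny helper. The function name and the clamping are mine, not anything from the app itself; actual tuning still means testing on your own card:

```c
/* Rule of thumb from this thread, not an official formula:
   start at half the GPU's compute units ("the safe value"),
   never below 1, and cap at 16, the most the post says some
   cards can cope with. */
static int suggested_unroll(int compute_units) {
    int unroll = compute_units / 2;
    if (unroll < 1)
        unroll = 1;
    if (unroll > 16)
        unroll = 16;
    return unroll;
}
```

For the HD 7000-class card in the stderr above (18 compute units) this gives 9, matching the `-unroll 9` setting shown there.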
Claggy Send message Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
You should increase unroll factor to 10 at least to get better speed on your card. From the Lunatics Installer 0.40 Release Notes: AP/MB: Claggy |
Mike Send message Joined: 17 Feb 01 Posts: 34258 Credit: 79,922,639 RAC: 80 |
You should increase unroll factor to 10 at least to get better speed on your card. Sorry, I didn't intend to riddle. You can edit your app_info command line params to -unroll 10 -ffa_block 8192 -ffa_block_fetch 4096 -no_CPU_lock. That should speed up your card and give better run times. With each crime and every kindness we birth our future. |
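For anyone unsure where those switches go: they live in the <cmdline> element of the AP entry in app_info.xml, as in the example quoted earlier in this thread. A hedged sketch follows; the app_name value is a placeholder, and the surrounding file_ref/coproc elements (omitted here) must match your existing app_info.xml:

```xml
<!-- Fragment of app_info.xml; structure shortened, names are placeholders. -->
<app_version>
    <app_name>astropulse_v505</app_name>
    <!-- ... file_ref / coproc entries as in your existing app_info.xml ... -->
    <cmdline>-unroll 10 -ffa_block 8192 -ffa_block_fetch 4096 -no_CPU_lock</cmdline>
</app_version>
```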
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
I'm using 268,36 on my almost new notebook and have just finished the first AP, all went fine and the CPU use was rather low, mostly under 5%. Your stderr is completely fine; just many restarts were done for this task. |
Jord Send message Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3 |
Sorry i didnt intend to riddle. LOL, well thanks for the explanation. You too, Claggy! All lines in app_info.xml adjusted. |
TRuEQ & TuVaLu Send message Joined: 4 Oct 99 Posts: 505 Credit: 69,523,653 RAC: 10 |
I've read somewhere that DATA_CHUNK_UNROLL should be setted to: half of the Max compute number units: Then I understand. Thank you, Mike. TRuEQ & TuVaLu |
Mike Send message Joined: 17 Feb 01 Posts: 34258 Credit: 79,922,639 RAC: 80 |
Sorry i didnt intend to riddle. You have a typo in your app_info. Please check. With each crime and every kindness we birth our future. |
Jord Send message Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3 |
DATA_CHUNK_UNROLL setted to:10
FFA thread block override value:8192
FFA thread fetchblock override value:4096
CPU affinity ajustment will be skipped
Maximum single buffer size setted to:128MB

better? Although, it's "set", not "setted". The verb goes set, set, set. Very easy, that one. :P

Also adjusted -unroll to 6, instead of 10. It appears my GPU only has 12 Compute Units, so if half is optimal, then half it is.

-unroll 4    Optimal at half the number of Compute Units of the GPU. Lower values also reduce VRAM use. Decrease if you experience lags. |
Mike Send message Joined: 17 Feb 01 Posts: 34258 Credit: 79,922,639 RAC: 80 |
DATA_CHUNK_UNROLL setted to:10
FFA thread block override value:8192
FFA thread fetchblock override value:4096
CPU affinity ajustment will be skipped
Maximum single buffer size setted to:128MB

Yep, much better. Nice run time as well. Looks good. With each crime and every kindness we birth our future. |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Ray, I would be happy if you could provide this info ;) And yes, I suspect a sync issue too. Perhaps NV changed the OpenCL sync model to a spinloop. AFAIK CUDA supports 2 sync models, and they are configurable via the CUDA API. So far nobody has discovered a configuration possibility for NV's OpenCL implementation, and such a config is not part of the OpenCL standard either. There was a trick offered on the NV forums to create many fake contexts (it was supposed that the OpenCL sync model corresponds to CUDA's "auto" mode, which switches between sync strategies based on the number of contexts vs the number of CPUs). I implemented fake context creation. It did not help. If you have some ideas on how to tell NV's OpenCL runtime to use yield instead of a spinloop, please let me know. And to be precise, it should be the "whole logical CPU usage bug" rather than the "whole physical core usage bug". |
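For context, the CUDA-side knob Raistmer is referring to is cudaSetDeviceFlags from the CUDA runtime API; a minimal sketch of the choices is below (CUDA toolkit required to build, and it must be called before the context is created). The point of the thread is precisely that NV's OpenCL runtime exposes no equivalent:

```c
#include <cuda_runtime.h>

int main(void) {
    /* Busy-wait while the host waits for the GPU: lowest latency,
       but it burns a whole logical CPU -- the behaviour being
       complained about here. */
    /* cudaSetDeviceFlags(cudaDeviceScheduleSpin); */

    /* Yield the CPU while waiting: slightly higher latency, much
       lower CPU load -- what Raistmer would like NV's OpenCL
       runtime to do. */
    cudaSetDeviceFlags(cudaDeviceScheduleYield);

    /* Default: the runtime picks spin or yield based on the number
       of active contexts vs. logical CPUs -- the "auto" mode the
       fake-context trick tried to exploit. */
    /* cudaSetDeviceFlags(cudaDeviceScheduleAuto); */
    return 0;
}
```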
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
This system doesn't suffer from excessive CPU usage: Good to know! Now it's interesting to find out what caused that: your driver version, your OS version or your GPU... or something we can't even imagine ;) |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
I've read somewhere that DATA_CHUNK_UNROLL should be setted to: half of the Max compute number units: The best value depends on the number of memory channels too. There are 32 workitems per "unroll" (I'm writing from memory, not looking at the code, and I wrote that code long ago, so mistakes are possible). A wavefront holds 64. Each compute unit needs at least 2 wavefronts for full load (actually more would be nicer). |
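Purely to illustrate the arithmetic in those numbers (which Raistmer himself hedges as from memory): with 32 workitems per unroll step, 64-wide wavefronts, and at least 2 wavefronts per compute unit for full load, the quantities relate as below. The constants and function names are mine, for illustration only, and this is not a tuning formula:

```c
/* Figures quoted (from memory) in the post above. */
enum {
    WORKITEMS_PER_UNROLL = 32,  /* workitems per "unroll" step   */
    WAVEFRONT_SIZE       = 64,  /* workitems per wavefront       */
    MIN_WAVEFRONTS_PER_CU = 2   /* minimum for full CU load      */
};

/* Whole wavefronts produced by a given -unroll value. */
static int wavefronts_for_unroll(int unroll) {
    return (unroll * WORKITEMS_PER_UNROLL) / WAVEFRONT_SIZE;
}

/* Workitems needed to keep every compute unit fully loaded. */
static int workitems_for_full_load(int compute_units) {
    return compute_units * MIN_WAVEFRONTS_PER_CU * WAVEFRONT_SIZE;
}
```

For example, -unroll 10 yields 5 wavefronts per pass, while an 18-CU card needs 2304 in-flight workitems for full load, which is why memory channels and other in-flight work matter as much as the unroll value alone.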
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.