Update for GPU AP (both NV and ATi) to rev560

Profile red-ray
Joined: 24 Jun 99
Posts: 308
Credit: 9,029,848
RAC: 0
United Kingdom
Message 1215159 - Posted: 7 Apr 2012, 16:13:01 UTC - in response to Message 1215135.  
Last modified: 7 Apr 2012, 16:54:18 UTC

The remaining doubt is about the Nvidia drivers: should I keep using the old 266.58 that the NV-r521 needed to avoid clogging the CPUs?
(In fact, 266.58 is OK and I have no reason to upgrade it, but one of my hosts has 2 560Ti's and the older driver for them is 266.66, which is said to have some issues...)

270.xx and up suffer from excess CPU usage. AFAIK it is still not fixed in the latest drivers.
I use 263.xx for crunching with GTX250 and it goes very well.

What about using a GTX 680 when the only drivers are 301.10?

How high is "excessive CPU usage" as a % of a CPU on a 2.66GHz Core 2 Quad please?


It will use a full core.
That's why it's called the 100% CPU bug.

It took AMD almost 6 months to fix this bug, so you will have to see how long it takes Nvidia to resolve this.



100%! What is the split between Interrupt, DPC, Kernel and User time?
To me this sounds like a synchronisation issue and I suspect the code could be changed to get round this. Has the OpenCL threading model been changed?

Do you really mean a "full core"? This would mean 2 CPUs would be 100% loaded on an i7, which I suspect is not the case.
ID: 1215159 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34258
Credit: 79,922,639
RAC: 80
Germany
Message 1215160 - Posted: 7 Apr 2012, 16:25:22 UTC - in response to Message 1215159.  
Last modified: 7 Apr 2012, 16:26:35 UTC

The remaining doubt is about the Nvidia drivers: should I keep using the old 266.58 that the NV-r521 needed to avoid clogging the CPUs?
(In fact, 266.58 is OK and I have no reason to upgrade it, but one of my hosts has 2 560Ti's and the older driver for them is 266.66, which is said to have some issues...)

270.xx and up suffer from excess CPU usage. AFAIK it is still not fixed in the latest drivers.
I use 263.xx for crunching with GTX250 and it goes very well.

What about using a GTX 680 when the only drivers are 301.10?

How high is "excessive CPU usage" as a % of a CPU on a 2.66GHz Core 2 Quad please?


It will use a full core.
That's why it's called the 100% CPU bug.

It took AMD almost 6 months to fix this bug, so you will have to see how long it takes Nvidia to resolve this.



100%! What is the split between Interrupt, DPC, Kernel and User time?
To me this sounds like a synchronisation issue and I suspect the code could be changed to get round this. Has the OpenCL threading model been changed?

Do you really mean a "full core"? This would mean 2 CPUs would be 100% loaded on an i7, which I suspect is not the case.


Raistmer tried to fix it with different functions without success.
It has to be fixed by Nvidia.

On the other hand, the benefit of the GPU app outweighs that anyway.
Running 2 APs in less than 2 hours instead of 1 in 6 - 7 hours is not too shabby, I imagine.


With each crime and every kindness we birth our future.
ID: 1215160 · Report as offensive
Profile X-Files 27
Joined: 17 May 99
Posts: 104
Credit: 111,191,433
RAC: 0
Canada
Message 1215168 - Posted: 7 Apr 2012, 17:00:04 UTC
Last modified: 7 Apr 2012, 17:10:21 UTC

This system doesn't suffer from excessive CPU usage:
Win7, i5-2500k
GTX 570
296.35 (quadro driver using modded inf)




I think at most it's using 20% CPU.
ID: 1215168 · Report as offensive
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 1215193 - Posted: 7 Apr 2012, 17:55:49 UTC - in response to Message 1214708.  

Tested it on my ATI HD6850, 2GB, Catalyst 12.3.
Latest AP on ATI finished all right, and faster than I am used to as well. {grin}

I checked with GPU-Z, but a little late to be honest. However, what I saw was a load of 25% alternating with no load, CPU load max 4%. The AP was 7.23 percent blanked.

Will wait for the next.
ID: 1215193 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34258
Credit: 79,922,639
RAC: 80
Germany
Message 1215279 - Posted: 7 Apr 2012, 21:16:17 UTC - in response to Message 1215193.  

Tested it on my ATI HD6850, 2GB, Catalyst 12.3.
Latest AP on ATI finished all right, and faster than I am used to as well. {grin}

I checked with GPU-Z, but a little late to be honest. However, what I saw was a load of 25% alternating with no load, CPU load max 4%. The AP was 7.23 percent blanked.

Will wait for the next.


You should increase the unroll factor to at least 10 to get better speed on your card.
Also increase FFA_block and FFA_block_fetch.



With each crime and every kindness we birth our future.
ID: 1215279 · Report as offensive
Profile TRuEQ & TuVaLu
Volunteer tester
Joined: 4 Oct 99
Posts: 505
Credit: 69,523,653
RAC: 10
Sweden
Message 1215316 - Posted: 7 Apr 2012, 22:06:41 UTC
Last modified: 7 Apr 2012, 22:07:21 UTC

I've read somewhere that DATA_CHUNK_UNROLL should be set to half of the maximum number of compute units.

Can anyone confirm this please??

//TRuEQ
TRuEQ & TuVaLu
ID: 1215316 · Report as offensive
Profile TRuEQ & TuVaLu
Volunteer tester
Joined: 4 Oct 99
Posts: 505
Credit: 69,523,653
RAC: 10
Sweden
Message 1215318 - Posted: 7 Apr 2012, 22:11:47 UTC

I also saw that -sbs can be set in <cmdline>-instances_per_device 2 -unroll 9 -ffa_block 8192 -ffa_block_fetch 4096 -sbs 256 -hp</cmdline>

The default is 128.

I saw somewhere that it isn't implemented yet.

But in my stderr it looks like it is set to 256....
Can someone explain this please?


<core_client_version>7.0.24</core_client_version>
<![CDATA[
<stderr_txt>
Number of app instances per device setted to:2
DATA_CHUNK_UNROLL setted to:9
FFA thread block override value:8192
FFA thread fetchblock override value:4096
Maximum single buffer size setted to:256MB
Running on device number: 0
DATA_CHUNK_UNROLL at default:9
Priority of worker thread raised successfully
Priority of process adjusted successfully, high priority class used
OpenCL platform detected: Advanced Micro Devices, Inc.
BOINC assigns 0 device, slots 0 to 1 (including) will be checked
Used slot is 0; Used GPU device parameters are:
Number of compute units: 18
Single buffer allocation size: 256MB
max WG size: 256
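
To check which values the app actually reports, a stderr excerpt like the one above can be read mechanically. The following is a minimal Python sketch and not part of the AP application: `parse_stderr` and `STDERR_SAMPLE` are hypothetical names, and the "label: number" line format is assumed from this excerpt only.

```python
import re

# Lines copied from the stderr excerpt quoted in this post.
STDERR_SAMPLE = """\
Number of app instances per device setted to:2
DATA_CHUNK_UNROLL setted to:9
FFA thread block override value:8192
FFA thread fetchblock override value:4096
Maximum single buffer size setted to:256MB
Number of compute units: 18
Single buffer allocation size: 256MB
max WG size: 256
"""


def parse_stderr(text):
    """Map each 'label: number' line to its integer value (an 'MB' suffix is dropped)."""
    values = {}
    for line in text.splitlines():
        match = re.match(r"^(.*?):\s*(\d+)\s*(?:MB)?$", line.strip())
        if match:
            values[match.group(1).strip()] = int(match.group(2))
    return values


settings = parse_stderr(STDERR_SAMPLE)
print(settings["Maximum single buffer size setted to"])
print(settings["Single buffer allocation size"])
```

Both buffer lines report 256 in this excerpt, which is what prompts the question; the sketch only shows what the log says, not whether the option has any effect inside the app.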


TRuEQ & TuVaLu
ID: 1215318 · Report as offensive
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 1215322 - Posted: 7 Apr 2012, 22:24:20 UTC - in response to Message 1215279.  

You should increase the unroll factor to at least 10 to get better speed on your card.
Also increase FFA_block and FFA_block_fetch.

It doesn't happen often that someone talks to me in riddles, but now it's happened. :-)

Care to elaborate, please?
Just think I am a newbie. And be gentle with that whip. Nice....
ID: 1215322 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34258
Credit: 79,922,639
RAC: 80
Germany
Message 1215323 - Posted: 7 Apr 2012, 22:25:53 UTC - in response to Message 1215316.  

I've read somewhere that DATA_CHUNK_UNROLL should be set to half of the maximum number of compute units.

Can anyone confirm this please??

//TRuEQ


That's the safe value, yes.

But for fast cards it can be increased above the number of CUs,
as long as it doesn't result in invalids or screen lag.

Unroll 10 - 12 should work on high end cards.
Some cards can cope with 16.
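
The "safe value" rule above is simple enough to write down. This is a minimal Python sketch under the assumption that "half" means integer halving; `safe_unroll` is a hypothetical helper name, not something the app provides.

```python
def safe_unroll(compute_units):
    """Return the conservative -unroll value: half the compute units, at least 1."""
    return max(1, compute_units // 2)


# 18 compute units is the figure reported in the stderr earlier in this thread.
print(safe_unroll(18))
```

Higher values such as the 10 - 12 mentioned above are a matter of trial on the individual card, so the helper only gives a starting point.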



With each crime and every kindness we birth our future.
ID: 1215323 · Report as offensive
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1215324 - Posted: 7 Apr 2012, 22:26:56 UTC - in response to Message 1215322.  
Last modified: 7 Apr 2012, 22:31:40 UTC

You should increase the unroll factor to at least 10 to get better speed on your card.
Also increase FFA_block and FFA_block_fetch.

It doesn't happen often that someone talks to me in riddles, but now it's happened. :-)

Care to elaborate, please?
Just think I am a newbie. And be gentle with that whip. Nice....


From the Lunatics Installer 0.40 Release Notes:

AP/MB:
-instances_per_device N: how many tasks you want to run in parallel. Inverse of <count>.

-hp: gives the app high priority.

-no_cpu_lock: prevents the app from using only a specific CPU core.

AP only:
-v505: to process AP 5.05 tasks.

-sbs 128: the maximum size of a single buffer that can be used in the program. The lower limit is 128MB; the upper limit is the maximum size the particular card allows. [Note: not active yet]

-unroll 4: optimal at half the number of Compute Units of the GPU. Lower values also reduce VRAM use. Decrease if you experience lags.

-ffa_block 2048: defines how many different periods the GPU will process per single kernel call.

-ffa_block_fetch 1024: defines how many threads will be used in the FFA initial fetch kernel.

ffa_block should be divisible by ffa_block_fetch. Going too high will result in premature 30/30 exit errors.
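
The divisibility rule in the last note can be checked before touching the command line. A minimal Python sketch follows; `check_ffa` is a hypothetical name, and the sketch checks only divisibility, not the "too high" limit, which the notes do not quantify.

```python
def check_ffa(ffa_block, ffa_block_fetch):
    """Return True when ffa_block is a whole multiple of ffa_block_fetch."""
    if ffa_block_fetch <= 0:
        raise ValueError("ffa_block_fetch must be a positive integer")
    return ffa_block % ffa_block_fetch == 0


# The release-note defaults and the larger pair used elsewhere in this thread.
print(check_ffa(2048, 1024))
print(check_ffa(8192, 4096))
```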


Claggy
ID: 1215324 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34258
Credit: 79,922,639
RAC: 80
Germany
Message 1215326 - Posted: 7 Apr 2012, 22:29:22 UTC - in response to Message 1215322.  
Last modified: 7 Apr 2012, 22:31:23 UTC

You should increase the unroll factor to at least 10 to get better speed on your card.
Also increase FFA_block and FFA_block_fetch.

It doesn't happen often that someone talks to me in riddles, but now it's happened. :-)

Care to elaborate, please?
Just think I am a newbie. And be gentle with that whip. Nice....


Sorry, I didn't intend to riddle.

You can edit the command line param in your app_info.xml to -unroll 10 -ffa_block 8192 -ffa_block_fetch 4096 -no_CPU_lock.

That should speed up your card and give better run times.
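
For anyone who prefers to generate the line rather than type it, here is a minimal Python sketch that assembles a `<cmdline>` element from the values above. `build_cmdline` is a hypothetical helper; the flag is spelled `-no_cpu_lock` as in the Lunatics release notes, and whether the app treats the flag name case-sensitively is not something this thread establishes. Where the element belongs inside app_info.xml is not shown here.

```python
import xml.etree.ElementTree as ET


def build_cmdline(unroll, ffa_block, ffa_block_fetch, no_cpu_lock=False):
    """Return a <cmdline> element, as a string, holding the given AP options."""
    parts = [
        "-unroll", str(unroll),
        "-ffa_block", str(ffa_block),
        "-ffa_block_fetch", str(ffa_block_fetch),
    ]
    if no_cpu_lock:
        parts.append("-no_cpu_lock")
    element = ET.Element("cmdline")
    element.text = " ".join(parts)
    return ET.tostring(element, encoding="unicode")


line = build_cmdline(10, 8192, 4096, no_cpu_lock=True)
print(line)
```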


With each crime and every kindness we birth our future.
ID: 1215326 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1215350 - Posted: 7 Apr 2012, 22:48:22 UTC - in response to Message 1215127.  

I'm using 268,36 on my almost new notebook and have just finished the first AP, all went fine and the CPU use was rather low, mostly under 5%.

The Stderr output contains some stuff that I wonder about, is there something I should do or is that all normal info?

http://setiathome.berkeley.edu/result.php?resultid=2387164861


Your stderr is completely fine; the task was just restarted many times.
ID: 1215350 · Report as offensive
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 1215364 - Posted: 7 Apr 2012, 23:03:50 UTC - in response to Message 1215326.  

Sorry, I didn't intend to riddle.

LOL, well thanks for the explanation. You too Claggy!

All lines in app_info.xml adjusted.
ID: 1215364 · Report as offensive
Profile TRuEQ & TuVaLu
Volunteer tester
Joined: 4 Oct 99
Posts: 505
Credit: 69,523,653
RAC: 10
Sweden
Message 1215531 - Posted: 8 Apr 2012, 7:44:55 UTC - in response to Message 1215323.  

I've read somewhere that DATA_CHUNK_UNROLL should be set to half of the maximum number of compute units.

Can anyone confirm this please??

//TRuEQ


That's the safe value, yes.

But for fast cards it can be increased above the number of CUs,
as long as it doesn't result in invalids or screen lag.

Unroll 10 - 12 should work on high end cards.
Some cards can cope with 16.


Then I understand.

Thank you Mike
TRuEQ & TuVaLu
ID: 1215531 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34258
Credit: 79,922,639
RAC: 80
Germany
Message 1215533 - Posted: 8 Apr 2012, 7:46:36 UTC - in response to Message 1215364.  

Sorry, I didn't intend to riddle.

LOL, well thanks for the explanation. You too Claggy!

All lines in app_info.xml adjusted.


You have a typo in your appinfo.
Please check.



With each crime and every kindness we birth our future.
ID: 1215533 · Report as offensive
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 1215553 - Posted: 8 Apr 2012, 8:23:08 UTC - in response to Message 1215533.  
Last modified: 8 Apr 2012, 8:35:43 UTC

DATA_CHUNK_UNROLL setted to:10
FFA thread block override value:8192
FFA thread fetchblock override value:4096
CPU affinity ajustment will be skipped
Maximum single buffer size setted to:128MB

better?

Although, it's "set", not "setted".
The verb goes set, set, set. Very easy that one. :P

Also adjusted -unroll to 6, instead of 10. It appears my GPU only has 12 Compute Units, so if half is optimal, then half it is.
-unroll 4 Optimal at half the number of Compute Units of the GPU. Lower values also reduce VRAM use. Decrease if you experience lags.

ID: 1215553 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34258
Credit: 79,922,639
RAC: 80
Germany
Message 1215557 - Posted: 8 Apr 2012, 8:38:46 UTC - in response to Message 1215553.  

DATA_CHUNK_UNROLL setted to:10
FFA thread block override value:8192
FFA thread fetchblock override value:4096
CPU affinity ajustment will be skipped
Maximum single buffer size setted to:128MB

better?

Although, it's "set", not "setted".
The verb goes set, set, set. Very easy that one. :P

Also adjusted -unroll to 6, instead of 10. It appears my GPU only has 12 Compute Units, so if half is optimal, then half it is.
-unroll 4 Optimal at half the number of Compute Units of the GPU. Lower values also reduce VRAM use. Decrease if you experience lags.


Yep, much better.

Nice run time as well.
Looks good.



With each crime and every kindness we birth our future.
ID: 1215557 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1215566 - Posted: 8 Apr 2012, 9:16:47 UTC - in response to Message 1215159.  


100%! What is the split between Interrupt, DPC, Kernel and User time?
To me this sounds like a synchronisation issue and I suspect the code could be changed to get round this. Has the OpenCL threading model been changed?

Do you really mean a "full core"? This would mean 2 CPUs would be 100% loaded on an i7, which I suspect is not the case.


Ray, I would be happy if you could provide this info ;)
And yes, I suspect a sync issue too.
Yes, perhaps NV changed the OpenCL sync model to a spin loop.
AFAIK CUDA supports 2 sync models, and they are configurable via the CUDA API.
So far nobody has discovered a way to configure this in NV's OpenCL implementation, and such a setting is not part of the OpenCL standard either.
There was a trick offered on the NV forums: create many fake contexts (it was supposed that the OpenCL sync model corresponds to CUDA's "auto" mode, which switches between sync strategies based on the number of contexts vs the number of CPUs). I implemented fake context creation. It did not help.
If you have any ideas on how to tell NV's OpenCL runtime to use yield instead of a spin loop, please let me know.
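
The difference between a spin loop and a yielding (blocking) wait can be illustrated outside OpenCL. The following is a minimal Python sketch, not NVIDIA's runtime and not the AP code; `cpu_cost_of_wait`, `spin_wait`, and `blocking_wait` are hypothetical names. Both waits last about the same wall-clock time, but the spinning one keeps a logical CPU busy for the whole wait.

```python
import threading
import time


def cpu_cost_of_wait(wait_fn, delay=0.2):
    """Return the CPU seconds this process burns while waiting `delay` seconds."""
    done = threading.Event()
    timer = threading.Timer(delay, done.set)  # stands in for "the GPU finished"
    start = time.process_time()
    timer.start()
    wait_fn(done)
    cost = time.process_time() - start
    timer.join()
    return cost


def spin_wait(event):
    # Busy-poll: the waiting thread never gives up the CPU.
    while not event.is_set():
        pass


def blocking_wait(event):
    # Sleep inside the OS until the event is signalled.
    event.wait()


spin_cost = cpu_cost_of_wait(spin_wait)
block_cost = cpu_cost_of_wait(blocking_wait)
print(f"spin: {spin_cost:.3f}s CPU, blocking: {block_cost:.3f}s CPU")
```

If the driver really does spin while a kernel runs, that would be consistent with the symptom described in this thread, but the sketch cannot confirm what the driver does.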

And to be precise, it should be the "whole logical CPU usage bug" instead of the "whole physical core usage bug".
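
As a rough worked example of what "one whole logical CPU" means as a percentage of the machine, here is a small Python sketch. `one_cpu_share` is a hypothetical name, and the logical CPU counts used (4 for a Core 2 Quad, 8 for an i7 with Hyper-Threading enabled) are assumptions about typical parts, not measurements from this thread.

```python
def one_cpu_share(logical_cpus):
    """Percent of total CPU capacity used by one fully loaded logical CPU."""
    if logical_cpus <= 0:
        raise ValueError("logical_cpus must be positive")
    return 100.0 / logical_cpus


for name, cpus in (("Core 2 Quad", 4), ("i7 with HT", 8)):
    print(f"{name}: {one_cpu_share(cpus):.1f}% of total CPU")
```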
ID: 1215566 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1215567 - Posted: 8 Apr 2012, 9:18:50 UTC - in response to Message 1215168.  
Last modified: 8 Apr 2012, 9:19:08 UTC

This system doesn't suffer from excessive CPU usage:
Win7, i5-2500k
GTX 570
296.35 (quadro driver using modded inf)




I think at most it's using 20% CPU.


Good to know this!
Now it would be interesting to know what caused that: your driver version, your OS version, or your GPU version... or something we can't even imagine ;)
ID: 1215567 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1215568 - Posted: 8 Apr 2012, 9:23:39 UTC - in response to Message 1215316.  

I've read somewhere that DATA_CHUNK_UNROLL should be set to half of the maximum number of compute units.

Can anyone confirm this please??

//TRuEQ


The best value depends on the number of memory channels too.
There are 32 workitems per "unroll" (I am writing from memory without looking at the code, and I wrote that code long ago, so mistakes are possible). A wavefront holds 64 workitems. Each compute unit needs at least 2 wavefronts for full load (actually, more would be nicer).
ID: 1215568 · Report as offensive


 