@NVidia-developers: Low background GPU load RAISES crunching throughput significantly

Message boards : Number crunching : @NVidia-developers: Low background GPU load RAISES crunching throughput significantly
Ulrich Metzner
Volunteer tester
Joined: 3 Jul 02
Posts: 1256
Credit: 13,565,513
RAC: 13
Germany
Message 1566380 - Posted: 3 Sep 2014, 7:10:09 UTC
Last modified: 3 Sep 2014, 7:10:54 UTC

I already posted this in another thread (http://setiathome.berkeley.edu/forum_thread.php?id=75549), but I wanted to draw more attention from the developers to this observation, which is why I am posting a new thread.

I have an interesting observation for the developers of the OpenCL apps on NVidia GPUs:

AP app: r2399 with use_sleep active; GPUs: GT 430 and GT 640 on driver 340.52 under Win XP Pro.

If I run only one AP unit per GPU, I top out at exactly 50% GPU load, which is why I run 2 WUs per GPU.

Now, with the last single AP unit running alone on the main GPU, I saw the usual 47-50% GPU load. And now for the wow: the moment I started DVB-C streaming, the GPU load went up to nearly 100%. And no, it's not the streaming itself; with only one WU running, the calculation is genuinely crunching significantly faster when some low "background load" runs in parallel on the same GPU. The moment I stop the streaming, the GPU load drops below 50% and the crunching is SLOWER! I'm stunned! :? :? :?
Aloha, Uli

ID: 1566380
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1566419 - Posted: 3 Sep 2014, 9:31:27 UTC - in response to Message 1566380.  

My guess is that in your particular setup (with limited knowledge of your setup and current AP apps), the single task loading is just below the threshold to trigger the higher power state, which is also one parameter among many that dictates boost functionality.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1566419
Mike
Volunteer tester
Joined: 17 Feb 01
Posts: 34257
Credit: 79,922,639
RAC: 80
Germany
Message 1566420 - Posted: 3 Sep 2014, 9:35:17 UTC

I already suggested bigger unroll values, but that causes stuttering while streaming video.
He doesn't like that.


With each crime and every kindness we birth our future.
ID: 1566420
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1566421 - Posted: 3 Sep 2014, 9:42:59 UTC - in response to Message 1566420.  

I already suggested bigger unroll values, but that causes stuttering while streaming video.
He doesn't like that.


Yeah, the high and increasing driver latencies in Windows are a pain. Probably neither the OpenCL nor the Cuda apps will improve single-task utilisation until we both start making more use of the streaming functionality to hide those latencies.

Hmmm, this might be running into driver latency issues similar to those I see on larger GPUs with Cuda MB. There *should* be a happy middle unroll setting that maintains loading without overcommitting. Finding it (if it exists) could be the hard part, though, and will depend on how Raistmer implemented the unroll.
ID: 1566421
Mike
Volunteer tester
Joined: 17 Feb 01
Posts: 34257
Credit: 79,922,639
RAC: 80
Germany
Message 1566422 - Posted: 3 Sep 2014, 9:50:21 UTC
Last modified: 3 Sep 2014, 9:50:44 UTC

I can get my card to 90% utilisation on single tasks, so that's not really the problem.
He is using two different GPUs, a 430 and a 640, IIRC.
That makes it more complicated to find the sweet spot.


ID: 1566422
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13733
Credit: 208,696,464
RAC: 304
Australia
Message 1566423 - Posted: 3 Sep 2014, 9:52:55 UTC - in response to Message 1566421.  
Last modified: 3 Sep 2014, 9:55:08 UTC

Yeah, the high and increasing driver latencies in Windows are a pain. Probably neither the OpenCL nor the Cuda apps will improve single-task utilisation until we both start making more use of the streaming functionality to hide those latencies.


DirectX12 is meant to give a significant performance boost by reducing CPU load to allow GPU load to increase (for iGPU gaming). Would part of it possibly be due to reducing driver latencies?
Grant
Darwin NT
ID: 1566423
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1566424 - Posted: 3 Sep 2014, 9:55:05 UTC - in response to Message 1566422.  

He is using two different GPUs, a 430 and a 640, IIRC.
That makes it more complicated to find the sweet spot.


I see. Yes, on MB, to support mixed-GPU configurations, I needed to add the ability for advanced users to configure the application per device by PCI(e) bus+slot ID. Even then, I would expect the VRAM on those GPUs to be an additional complicating bottleneck.

So not an easy setup ;)
ID: 1566424
Mike
Volunteer tester
Joined: 17 Feb 01
Posts: 34257
Credit: 79,922,639
RAC: 80
Germany
Message 1566425 - Posted: 3 Sep 2014, 9:59:50 UTC - in response to Message 1566424.  

He is using two different GPUs, a 430 and a 640, IIRC.
That makes it more complicated to find the sweet spot.


I see. Yes, on MB, for configuration of mixed GPUs, I needed to add the ability for advanced users to configure the application by device PCI(e) bus+slot ID. Even then, I would expect the VRAM on those GPUs to be an additional complicating bottleneck.

So not an easy setup ;)


Yep, and all this in conjunction with just a dual-core CPU.


ID: 1566425
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1566426 - Posted: 3 Sep 2014, 10:00:08 UTC - in response to Message 1566423.  

Yeah, the high and increasing driver latencies in Windows are a pain. Probably neither the OpenCL nor the Cuda apps will improve single-task utilisation until we both start making more use of the streaming functionality to hide those latencies.


DirectX12 is meant to give a significant performance boost by reducing CPU load to allow GPU load to increase (for iGPU gaming). Would part of it possibly be due to reducing driver latencies?


Absolutely. It's said that DirectX12 will use a closer-to-hardware approach similar to AMD's Mantle (which, it's also been said, Intel may adopt for its iGPUs). At least some portion of that would be hardware specific, but streamlining of the driver architecture etc. should get those latencies down.

On Windows, Cuda and OpenCL on NVidia use low-level DirectX calls underneath for much of their functionality, so any improvements there should help.
ID: 1566426
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13733
Credit: 208,696,464
RAC: 304
Australia
Message 1566428 - Posted: 3 Sep 2014, 10:04:50 UTC - in response to Message 1566426.  

On Windows, Cuda and OpenCL on NVidia use low-level DirectX calls underneath for much of their functionality, so any improvements there should help.

Bring on DX12 then.
Would be nice to see just what Maxwell is capable of.
Grant
Darwin NT
ID: 1566428
Ulrich Metzner
Volunteer tester
Joined: 3 Jul 02
Posts: 1256
Credit: 13,565,513
RAC: 13
Germany
Message 1566470 - Posted: 3 Sep 2014, 13:48:24 UTC - in response to Message 1566419.  

My guess is that in your particular setup (with limited knowledge of your setup and current AP apps), the single task loading is just below the threshold to trigger the higher power state, which is also one parameter among many that dictates boost functionality.

It is not the power state; I can watch the GPUs switch to lower power states after ~10-15 seconds when I stop crunching completely.

With driver version 337.88 I could get to 90-95% load per GPU, with or without use_sleep. Even without use_sleep, a single AP WU uses only ~10-15% of a CPU core; with use_sleep, nearly nothing.

With driver 340.52 I'm not able to get over 50% with use_sleep, even with ridiculously high unroll and fftxx values. Without use_sleep I get >90% GPU load out of the box, but a single AP WU then hogs a complete CPU core for feeding and leaves the CPU WUs starving; with use_sleep it uses almost no CPU. That's why I switched to 2 AP WUs per GPU with this driver.
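The feeding trade-off described here, a busy-wait loop versus sleep-based polling, can be sketched in toy form. This is an illustrative Python sketch of the general technique only, not the actual AP app code; the function names are invented for illustration:

```python
import threading
import time

def wait_busy(done):
    """Busy-wait polling: checks continuously and so hogs a full CPU core
    while waiting (the behaviour seen without use_sleep)."""
    while not done():
        pass

def wait_sleep(done, interval=0.001):
    """Sleep-based polling: yields the CPU between checks, so the waiting
    thread uses almost no CPU time (the idea behind use_sleep). The cost
    is up to `interval` of extra latency before each completion is seen."""
    while not done():
        time.sleep(interval)

# Example: wait for a flag that another thread sets after 50 ms.
flag = threading.Event()
threading.Timer(0.05, flag.set).start()
start = time.monotonic()
wait_sleep(flag.is_set)
elapsed = time.monotonic() - start  # roughly 0.05 s plus polling latency
```

The sleep interval is the tuning knob: too long and the GPU sits idle between kernels, too short and the CPU overhead creeps back up.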

It is as if the low background GPU load "triggers" something in the code/driver that also speeds up the crunching process or shortens the sleep pauses.

If all else fails, I'll switch back to driver 337.88 on this computer, since it is an XP box anyway and DX12 is of no concern for this classic.

I didn't want to rant in any way, just to make the observation known to the developers. Maybe it points someone in the right direction for getting the same functionality as in CUDA MB7 into AP. BTW: the possibility to configure different GPUs depending on their socket address is a real killer feature! *doublethumbsup* ;)

Thanks everyone!
Aloha, Uli

ID: 1566470
Mike
Volunteer tester
Joined: 17 Feb 01
Posts: 34257
Credit: 79,922,639
RAC: 80
Germany
Message 1566473 - Posted: 3 Sep 2014, 14:07:39 UTC
Last modified: 3 Sep 2014, 14:10:15 UTC

Claggy mentioned in another thread that this issue is caused by driver 340.52.

This is not app related; it's more driver/OS/host dependent.

http://setiathome.berkeley.edu/forum_thread.php?id=75309


ID: 1566473
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1566509 - Posted: 3 Sep 2014, 16:02:12 UTC - in response to Message 1566470.  

The possibility to configure different GPUs dependent on their socket address is a real killer feature! *doublethumbsup* ;)

Thanks everyone!


Install a second BOINC client and enjoy.
http://vyper.kafit.se/wp/index.php/2011/02/04/running-different-nvidia-architectures-most-optimal-at-setihome/
ID: 1566509
Ulrich Metzner
Volunteer tester
Joined: 3 Jul 02
Posts: 1256
Credit: 13,565,513
RAC: 13
Germany
Message 1566526 - Posted: 3 Sep 2014, 16:58:14 UTC - in response to Message 1566509.  
Last modified: 3 Sep 2014, 17:29:50 UTC

Install a second BOINC client and enjoy.
http://vyper.kafit.se/wp/index.php/2011/02/04/running-different-nvidia-architectures-most-optimal-at-setihome/

Yes, I know about this possibility, but to be honest, it's too much of a hassle for me. The way the X41 executables handle it is the elegant way to go! ;)

[edit]
I rolled back the driver to version 337.88, and everything will be fine come the next AP frenzy. :D
Aloha, Uli

ID: 1566526
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1566585 - Posted: 3 Sep 2014, 19:12:09 UTC - in response to Message 1566526.  

The really elegant way would be to add such an ability to app_config.xml rather than rely on the science apps.
app_config can already pass a cmdline to the app. All it needs is a way to distinguish the compute devices it governs.
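For reference, a per-app-version command line in app_config.xml already looks something like this (the app name, plan class, and option values here are illustrative, not taken from this thread):

```xml
<app_config>
  <app_version>
    <app_name>astropulse_v6</app_name>
    <plan_class>opencl_nvidia_100</plan_class>
    <cmdline>-unroll 12 -use_sleep</cmdline>
  </app_version>
</app_config>
```

The missing piece Raistmer points at is that one cmdline applies to every GPU running that app version; there is no per-device selector.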
ID: 1566585
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1566589 - Posted: 3 Sep 2014, 19:20:47 UTC - in response to Message 1566585.  
Last modified: 3 Sep 2014, 19:24:59 UTC

The really elegant way would be to add such an ability to app_config.xml rather than rely on the science apps.
app_config can already pass a cmdline to the app. All it needs is a way to distinguish the compute devices it governs.

Within <app> or <app_version>, add something like the following, where N would be the device ID:
<ati_dev>N</ati_dev>
<cuda_dev>N</cuda_dev>
<intel_dev>N</intel_dev>
Then each card could have a completely separate configuration.
Maybe they can squeak that in for 7.4.x.
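Put together, that proposal might look like this for a mixed GT 640 / GT 430 box. This is hypothetical syntax: the <cuda_dev> tag did not exist in BOINC's app_config.xml at the time, and the app name, device IDs, and unroll values are illustrative:

```xml
<app_config>
  <app_version>
    <app_name>astropulse_v6</app_name> <!-- illustrative app name -->
    <cuda_dev>0</cuda_dev>             <!-- hypothetical tag: e.g. the GT 640 -->
    <cmdline>-unroll 12</cmdline>
  </app_version>
  <app_version>
    <app_name>astropulse_v6</app_name>
    <cuda_dev>1</cuda_dev>             <!-- hypothetical tag: e.g. the GT 430 -->
    <cmdline>-unroll 4</cmdline>
  </app_version>
</app_config>
```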
SETI@home classic workunits: 93,865 · CPU time: 863,447 hours
Join the BP6/VP6 User Group (http://tinyurl.com/8y46zvu)
ID: 1566589
Ulrich Metzner
Volunteer tester
Joined: 3 Jul 02
Posts: 1256
Credit: 13,565,513
RAC: 13
Germany
Message 1566590 - Posted: 3 Sep 2014, 19:22:09 UTC - in response to Message 1566585.  

The really elegant way would be to add such an ability to app_config.xml rather than rely on the science apps.
app_config can already pass a cmdline to the app. All it needs is a way to distinguish the compute devices it governs.

+1 , no wait, +100 for that suggestion! ;)
Aloha, Uli

ID: 1566590

©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.