Message boards : Number crunching : @NVidia-developers: Low background GPU load RAISES crunching throughput significantly
Ulrich Metzner (Joined: 3 Jul 02, Posts: 1256, Credit: 13,565,513, RAC: 13)
I already posted this in the other thread (http://setiathome.berkeley.edu/forum_thread.php?id=75549), but I wanted to get more attention from the developers for this observation, which is why I'm posting a new thread. I have an interesting observation for the developers of OpenCL on NVidia: AP app r2399 with use_sleep active; GPUs: GT 430, GT 640 running driver 340.52 on Win XP Pro. If I run only one AP unit per GPU, I get at most exactly 50% GPU load, which is why I run 2 WUs per GPU. On the last single AP left running alone on the main GPU, I had the 47-50% GPU load. And now for the WOW: the moment I started DVB-C streaming, the GPU load went up to nearly 100%. And no, it's not because of the streaming; running only one WU, the calculation really crunches significantly faster if there is some low "background load" running in parallel on the same GPU. The moment I stop the streaming, the GPU load drops below 50% and the crunching is SLOWER! I'm stunned! :? :? :?

Aloha, Uli
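For anyone wanting to reproduce the "2 WUs per GPU" setup described above: the usual mechanism is app_config.xml in the project directory. A minimal sketch, assuming the AstroPulse app name is astropulse_v6 (check client_state.xml for the exact name on your host):

```xml
<app_config>
   <app>
      <name>astropulse_v6</name>
      <gpu_versions>
          <!-- 0.5 GPUs per task lets BOINC schedule 2 tasks per GPU -->
          <gpu_usage>0.5</gpu_usage>
          <cpu_usage>0.1</cpu_usage>
      </gpu_versions>
   </app>
</app_config>
```

The client rereads this file on "Read config files", so no restart is needed to try different values.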
jason_gee (Joined: 24 Nov 06, Posts: 7489, Credit: 91,093,184, RAC: 0)
My guess is that in your particular setup (with limited knowledge of your setup and current AP apps), the single task loading is just below the threshold to trigger the higher power state, which is also one parameter among many that dictates boost functionality.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
Mike (Joined: 17 Feb 01, Posts: 34258, Credit: 79,922,639, RAC: 80)
I already suggested bigger unroll values, but this causes stuttering while streaming video. He doesn't like this.

With each crime and every kindness we birth our future.
jason_gee (Joined: 24 Nov 06, Posts: 7489, Credit: 91,093,184, RAC: 0)
> I already suggested bigger unroll values but this causes stuttering while streaming video.

Yeah, the high and increasing driver latencies in Windows are a pain. Probably neither OpenCL nor Cuda will improve single-task utilisation until we both start making more use of the streaming functionality to hide them. Hmmm, might be running into similar driver latency issues as I see on larger GPUs with Cuda MB. There *should* be a happy middle unroll setting that would maintain loading but not overcommit. Finding that (if available) could be the hard part though, and will depend on how Raistmer implemented the unroll.
Mike (Joined: 17 Feb 01, Posts: 34258, Credit: 79,922,639, RAC: 80)
I can get my card to 90% utilisation on single tasks. That's not really the problem. He is using two different GPUs, a 430 and a 640 IIRC. That makes it more complicated to find the sweet spot.
Grant (SSSF) (Joined: 19 Aug 99, Posts: 13736, Credit: 208,696,464, RAC: 304)
> Yeah, the high and increasing driver latencies in Windows are a pain. Probably neither OpenCL nor Cuda will improve single task utilisation until we both start making more use of the streaming functionality to hide them.

DirectX12 is meant to give a significant performance boost by reducing CPU load to allow GPU load to increase (for iGPU gaming). Would part of it possibly be due to reducing driver latencies?

Grant
Darwin NT
jason_gee (Joined: 24 Nov 06, Posts: 7489, Credit: 91,093,184, RAC: 0)
> He is using two different GPU`s 430 and 640 IIRC.

I see. Yes, on MB, for configuration of mixed GPUs, I needed to add the ability for advanced users to configure the application by device PCI(e) bus+slot ID. Even then, I would expect the VRAM on those GPUs to be an additional complicating bottleneck. So not an easy setup ;)
Mike (Joined: 17 Feb 01, Posts: 34258, Credit: 79,922,639, RAC: 80)
> He is using two different GPU`s 430 and 640 IIRC.

Yep, and this in conjunction with just a dual-core CPU.
jason_gee (Joined: 24 Nov 06, Posts: 7489, Credit: 91,093,184, RAC: 0)
> Yeah, the high and increasing driver latencies in Windows are a pain. Probably neither OpenCL nor Cuda will improve single task utilisation until we both start making more use of the streaming functionality to hide them.

Absolutely. It's said that DirectX12 will be using a closer-to-hardware approach similar to AMD's Mantle (it's also been said Intel may move that way for its iGPUs). At least some portion of that would be hardware spec, but probably streamlining of the driver architecture etc. would get those latencies down. Cuda and OpenCL with NV on Windows use low-level DirectX calls underneath for much functionality, so any improvements there should help.
Grant (SSSF) (Joined: 19 Aug 99, Posts: 13736, Credit: 208,696,464, RAC: 304)
> Cuda and OpenCL with NV, on Windows, underneath, use low level directX calls for much functionality, so any improvements there should help.

Bring on DX12 then. Would be nice to see just what Maxwell is capable of.
Ulrich Metzner (Joined: 3 Jul 02, Posts: 1256, Credit: 13,565,513, RAC: 13)
> My guess is that in your particular setup (with limited knowledge of your setup and current AP apps), the single task loading is just below the threshold to trigger the higher power state, which is also one parameter among many that dictates boost functionality.

It is not the power state; I can see the GPUs switching to lower power states after ~10-15 seconds when I stop crunching completely. With driver version 337.88 I could get to 90-95% load per GPU, with or without use_sleep. Even without use_sleep, a single AP WU uses only ~10-15% of a CPU core; with use_sleep, nearly nothing. With driver 340.52 I'm not able to get over 50% with use_sleep, even with ridiculously high unroll and fftxx values. Without use_sleep I get >90% GPU load out of the box, but a single AP WU hogs a complete CPU core for feeding and leaves CPU WUs starving; with use_sleep it uses almost no CPU. That's why I switched to 2 AP per GPU for this driver. It is like the low background GPU load "triggers" something in the code/driver that also speeds up the crunching process/shortens the sleep pauses. If all else fails, I'll switch back to driver 337.88 for this computer, because it is an XP box anyway and DX12 is of no concern for this classic. I didn't want to rant in any way, just to make the observation known to the developers. Maybe it points someone in the right direction for getting the same functionality as CUDA MB7 into AP. BTW: The possibility to configure different GPUs depending on their socket address is a real killer feature! *doublethumbsup* ;)

Thanks everyone! Aloha, Uli
Mike (Joined: 17 Feb 01, Posts: 34258, Credit: 79,922,639, RAC: 80)
Claggy mentioned in another thread that this issue is caused by driver 340.52. It is not app related; it's more driver/OS/host dependent. http://setiathome.berkeley.edu/forum_thread.php?id=75309
Raistmer (Joined: 16 Jun 01, Posts: 6325, Credit: 106,370,077, RAC: 121)
> The possibility to configure different GPUs dependent on their socket address is a real killer feature! *doublethumbsup* ;)

Install a second BOINC client and enjoy. http://vyper.kafit.se/wp/index.php/2011/02/04/running-different-nvidia-architectures-most-optimal-at-setihome/
Ulrich Metzner (Joined: 3 Jul 02, Posts: 1256, Credit: 13,565,513, RAC: 13)
> Install a second BOINC client and enjoy.

Yes, I know about this possibility, but to be honest, it's too much of a hassle for me. The way the X41 executables handle it is the elegant way to go! ;)

[edit] I rolled back the driver to version 337.88 and everything will be fine on the next AP frenzy. :D

Aloha, Uli
Raistmer (Joined: 16 Jun 01, Posts: 6325, Credit: 106,370,077, RAC: 121)
The really elegant way would be to add such an ability to app_config.xml and not rely on the science apps. app_config can already pass a cmdline to the app. All it needs is a way to distinguish the compute devices it governs.
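For context, the cmdline pass-through mentioned here already works per app version. A sketch, assuming the app name astropulse_v6, the plan class opencl_nvidia_100, and that your installed AP build accepts the -unroll, -ffa_block and -use_sleep options (the values are purely illustrative):

```xml
<app_config>
   <app_version>
      <app_name>astropulse_v6</app_name>
      <plan_class>opencl_nvidia_100</plan_class>
      <!-- one cmdline for ALL devices; there is no per-GPU selector here -->
      <cmdline>-unroll 10 -ffa_block 8192 -use_sleep</cmdline>
   </app_version>
</app_config>
```

The limitation is visible in the sketch: the cmdline applies to every GPU the app runs on, which is exactly what makes mixed setups like a GT 430 + GT 640 awkward.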
HAL9000 (Joined: 11 Sep 99, Posts: 6534, Credit: 196,805,888, RAC: 57)
> Really elegant way would be to add such ability to app_config.xml and not to relay on science apps.

Within <app> or <app_version>, add something like:

<ati_dev>N</ati_dev>
<cuda_dev>N</cuda_dev>
<intel_dev>N</intel_dev>

where N would be the device ID. Then each card could have a completely separate configuration. Maybe they can squeak that in for 7.4.x.

SETI@home classic workunits: 93,865 | CPU time: 863,447 hours
Join the BP6/VP6 User Group (http://tinyurl.com/8y46zvu)
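If this proposal were adopted, a mixed GT 430 / GT 640 host could tune each card independently. A hypothetical sketch only: the <cuda_dev> tag below is the proposed element from this post and does not exist in any released BOINC client; app name, plan class, and unroll values are likewise illustrative:

```xml
<app_config>
   <!-- proposed: one entry per device, selected by hypothetical <cuda_dev> -->
   <app_version>
      <app_name>astropulse_v6</app_name>
      <plan_class>opencl_nvidia_100</plan_class>
      <cuda_dev>0</cuda_dev>          <!-- GT 640: can afford a larger unroll -->
      <cmdline>-unroll 12</cmdline>
   </app_version>
   <app_version>
      <app_name>astropulse_v6</app_name>
      <plan_class>opencl_nvidia_100</plan_class>
      <cuda_dev>1</cuda_dev>          <!-- GT 430: smaller unroll to avoid stutter -->
      <cmdline>-unroll 6</cmdline>
   </app_version>
</app_config>
```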
Ulrich Metzner (Joined: 3 Jul 02, Posts: 1256, Credit: 13,565,513, RAC: 13)
> Really elegant way would be to add such ability to app_config.xml and not to relay on science apps.

+1... no wait, +100 for that suggestion! ;)

Aloha, Uli
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.