Message boards : Number crunching : @NVidia-developers: Low background GPU load RAISES crunching throughput significantly
Ulrich Metzner (Joined: 3 Jul 02, Posts: 1256, Credit: 13,565,513, RAC: 13)
I already posted this in the other thread (http://setiathome.berkeley.edu/forum_thread.php?id=75549), but I wanted to get more attention from the developers for this observation, which is why I'm posting a new thread. I have an interesting observation for the developers of OpenCL on NVidia: AP app r2399 with use_sleep active; GPUs: GT 430, GT 640 running driver 340.52 on Win XP Pro. If I run only one AP unit per GPU, I get at most exactly 50% GPU load, which is why I run 2 WUs per GPU. On the last single AP left running alone on the main GPU, I had the 47-50% GPU load. And now for the WOW: the moment I started DVB-C streaming, the GPU load went up to nearly 100%. And no, it's not because of the streaming; running only one WU, the calculation really crunches significantly faster if there is some low "background load" running in parallel on the same GPU. The moment I stop the streaming, the GPU load drops below 50% and the crunching is SLOWER! I'm stunned! :? :? :?

Aloha, Uli
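For anyone wanting to reproduce the "2 WUs per GPU" setup described above: the usual mechanism is app_config.xml in the project directory. A minimal sketch, assuming the AstroPulse app name is astropulse_v6 (check client_state.xml for the exact name on your host):

```xml
<app_config>
   <app>
      <name>astropulse_v6</name>
      <gpu_versions>
          <!-- 0.5 GPUs per task lets BOINC schedule 2 tasks per GPU -->
          <gpu_usage>0.5</gpu_usage>
          <cpu_usage>0.1</cpu_usage>
      </gpu_versions>
   </app>
</app_config>
```

The client rereads this file on "Read config files", so no restart is needed to try different values.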
jason_gee (Joined: 24 Nov 06, Posts: 7489, Credit: 91,093,184, RAC: 0)
My guess is that in your particular setup (with limited knowledge of your setup and current AP apps), the single task loading is just below the threshold to trigger the higher power state, which is also one parameter among many that dictates boost functionality.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
Mike (Joined: 17 Feb 01, Posts: 34258, Credit: 79,922,639, RAC: 80)
I already suggested bigger unroll values, but this causes stuttering while streaming video. He doesn't like this.

With each crime and every kindness we birth our future.
jason_gee (Joined: 24 Nov 06, Posts: 7489, Credit: 91,093,184, RAC: 0)
> I already suggested bigger unroll values but this causes stuttering while streaming video.

Yeah, the high and increasing driver latencies in Windows are a pain. Probably neither OpenCL nor Cuda will improve single-task utilisation until we both start making more use of the streaming functionality to hide them. Hmmm, might be running into similar driver latency issues as I see on larger GPUs with Cuda MB. There *should* be a happy middle unroll setting that would maintain loading but not overcommit. Finding that (if available) could be the hard part though, and will depend on how Raistmer implemented the unroll.
Mike (Joined: 17 Feb 01, Posts: 34258, Credit: 79,922,639, RAC: 80)
I can get my card to 90% utilisation on single tasks. That's not really the problem. He is using two different GPUs, a 430 and a 640 IIRC. That makes it more complicated to find the sweet spot.
Grant (SSSF) (Joined: 19 Aug 99, Posts: 13736, Credit: 208,696,464, RAC: 304)
> Yeah, the high and increasing driver latencies in Windows are a pain. Probably neither OpenCL nor Cuda will improve single task utilisation until we both start making more use of the streaming functionality to hide them.

DirectX12 is meant to give a significant performance boost by reducing CPU load to allow GPU load to increase (for iGPU gaming). Would part of it possibly be due to reducing driver latencies?

Grant
Darwin NT
jason_gee (Joined: 24 Nov 06, Posts: 7489, Credit: 91,093,184, RAC: 0)
> He is using two different GPU`s 430 and 640 IIRC.

I see. Yes, on MB, for configuration of mixed GPUs, I needed to add the ability for advanced users to configure the application by device PCI(e) bus+slot ID. Even then, I would expect the VRAM on those GPUs to be an additional complicating bottleneck. So not an easy setup ;)
Mike (Joined: 17 Feb 01, Posts: 34258, Credit: 79,922,639, RAC: 80)
> He is using two different GPU`s 430 and 640 IIRC.

Yep, and this in conjunction with just a dual-core CPU.
jason_gee (Joined: 24 Nov 06, Posts: 7489, Credit: 91,093,184, RAC: 0)
> Yeah, the high and increasing driver latencies in Windows are a pain. Probably neither OpenCL nor Cuda will improve single task utilisation until we both start making more use of the streaming functionality to hide them.

Absolutely. It's said that DirectX12 will be using a closer-to-hardware approach similar to AMD's Mantle (it's also been said Intel may move that way for its iGPUs). At least some portion of that would be hardware spec, but probably streamlining of the driver architecture etc. would get those latencies down. Cuda and OpenCL with NV on Windows use low-level DirectX calls underneath for much functionality, so any improvements there should help.
Grant (SSSF) (Joined: 19 Aug 99, Posts: 13736, Credit: 208,696,464, RAC: 304)
> Cuda and OpenCL with NV, on Windows, underneath, use low level directX calls for much functionality, so any improvements there should help.

Bring on DX12 then. Would be nice to see just what Maxwell is capable of.
Ulrich Metzner (Joined: 3 Jul 02, Posts: 1256, Credit: 13,565,513, RAC: 13)
> My guess is that in your particular setup (with limited knowledge of your setup and current AP apps), the single task loading is just below the threshold to trigger the higher power state, which is also one parameter among many that dictates boost functionality.

It is not the power state; I can see the GPUs switching to lower power states after ~10-15 seconds when I stop crunching completely. With driver version 337.88 I could get to 90-95% load per GPU, with or without use_sleep. Even without use_sleep, a single AP WU uses only ~10-15% of a CPU core; with use_sleep, nearly nothing. With driver 340.52 I'm not able to get over 50% with use_sleep, even with ridiculously high unroll and fftxx values. Without use_sleep I get >90% GPU load out of the box, but a single AP WU hogs a complete CPU core for feeding and leaves CPU WUs starving; with use_sleep it uses almost no CPU. That's why I switched to 2 AP per GPU for this driver. It is like the low background GPU load "triggers" something in the code/driver that also speeds up the crunching process/shortens the sleep pauses. If all else fails, I'll switch back to driver 337.88 for this computer, because it is an XP box anyway and DX12 is of no concern for this classic. I didn't want to rant in any way, just to make the observation known to the developers. Maybe it points someone in the right direction for getting the same functionality as CUDA MB7 into AP. BTW: The possibility to configure different GPUs depending on their socket address is a real killer feature! *doublethumbsup* ;)

Thanks everyone! Aloha, Uli
Mike (Joined: 17 Feb 01, Posts: 34258, Credit: 79,922,639, RAC: 80)
Claggy mentioned in another thread that this issue is caused by driver 340.52. It is not app related; it's more driver/OS/host dependent. http://setiathome.berkeley.edu/forum_thread.php?id=75309
Raistmer (Joined: 16 Jun 01, Posts: 6325, Credit: 106,370,077, RAC: 121)
> The possibility to configure different GPUs dependent on their socket address is a real killer feature! *doublethumbsup* ;)

Install a second BOINC client and enjoy. http://vyper.kafit.se/wp/index.php/2011/02/04/running-different-nvidia-architectures-most-optimal-at-setihome/
Ulrich Metzner (Joined: 3 Jul 02, Posts: 1256, Credit: 13,565,513, RAC: 13)
> Install a second BOINC client and enjoy.

Yes, I know about this possibility, but to be honest, it's too much of a hassle for me. The way the X41 executables handle it is the elegant way to go! ;)

[edit] I rolled back the driver to version 337.88 and everything will be fine on the next AP frenzy. :D

Aloha, Uli
Raistmer (Joined: 16 Jun 01, Posts: 6325, Credit: 106,370,077, RAC: 121)
The really elegant way would be to add such an ability to app_config.xml and not rely on the science apps. app_config can already pass a cmdline to the app. All it needs is a way to distinguish the compute devices it governs.
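For context, the cmdline pass-through mentioned here already works per app version. A sketch, assuming the app name astropulse_v6, the plan class opencl_nvidia_100, and that your installed AP build accepts the -unroll, -ffa_block and -use_sleep options (the values are purely illustrative):

```xml
<app_config>
   <app_version>
      <app_name>astropulse_v6</app_name>
      <plan_class>opencl_nvidia_100</plan_class>
      <!-- one cmdline for ALL devices; there is no per-GPU selector here -->
      <cmdline>-unroll 10 -ffa_block 8192 -use_sleep</cmdline>
   </app_version>
</app_config>
```

The limitation is visible in the sketch: the cmdline applies to every GPU the app runs on, which is exactly what makes mixed setups like a GT 430 + GT 640 awkward.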
HAL9000 (Joined: 11 Sep 99, Posts: 6534, Credit: 196,805,888, RAC: 57)
> Really elegant way would be to add such ability to app_config.xml and not to relay on science apps.

Within <app> or <app_version>, add something like:

<ati_dev>N</ati_dev>
<cuda_dev>N</cuda_dev>
<intel_dev>N</intel_dev>

where N would be the device ID. Then each card could have a completely separate configuration. Maybe they can squeak that in for 7.4.x.

SETI@home classic workunits: 93,865 | CPU time: 863,447 hours
Join the BP6/VP6 User Group (http://tinyurl.com/8y46zvu)
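If this proposal were adopted, a mixed GT 430 / GT 640 host could tune each card independently. A hypothetical sketch only: the <cuda_dev> tag below is the proposed element from this post and does not exist in any released BOINC client; app name, plan class, and unroll values are likewise illustrative:

```xml
<app_config>
   <!-- proposed: one entry per device, selected by hypothetical <cuda_dev> -->
   <app_version>
      <app_name>astropulse_v6</app_name>
      <plan_class>opencl_nvidia_100</plan_class>
      <cuda_dev>0</cuda_dev>          <!-- GT 640: can afford a larger unroll -->
      <cmdline>-unroll 12</cmdline>
   </app_version>
   <app_version>
      <app_name>astropulse_v6</app_name>
      <plan_class>opencl_nvidia_100</plan_class>
      <cuda_dev>1</cuda_dev>          <!-- GT 430: smaller unroll to avoid stutter -->
      <cmdline>-unroll 6</cmdline>
   </app_version>
</app_config>
```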
Ulrich Metzner (Joined: 3 Jul 02, Posts: 1256, Credit: 13,565,513, RAC: 13)
> Really elegant way would be to add such ability to app_config.xml and not to relay on science apps.

+1... no wait, +100 for that suggestion! ;)

Aloha, Uli
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.