GPU Utilization: Thermal Throttling?

Shaggie76
Message 1804650 - Posted: 24 Jul 2016, 14:23:20 UTC

After updating to Lunatics 0.44 I noticed it wasn't keeping the GPU busy enough, so I tried a few things.

First I tried running 4 tasks concurrently -- this seemed to help a bit, but then my overall RAC seemed to suffer.

Second I tried setting the process priority to Above Normal (in mbcuda.cfg). My theory was that with a hyper-threaded CPU, the process that keeps the GPU fed might be held back when the CPU is saturated. This seemed to help, but it's hard to say.
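For reference, this is roughly what the setting looks like in mbcuda.cfg (a sketch from memory; the exact key name and accepted values may differ in your Lunatics build):

```
[mbcuda]
; raise the app's process priority so a saturated CPU is less likely
; to starve the GPU feeder (key name assumed, not verified)
processpriority = abovenormal
```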

What I found fascinating was that during the last 12 hours of crunching 1 task, GPU utilization is low for certain work-units but the GPU temperature is higher:

[Graph: GPU utilization and GPU temperature over the last 12 hours]

Doesn't that look like it's thermally throttling? Note that I removed my CPU temp from this graph for clarity; it shows a steady 75-80C for the same duration, so this isn't some external factor like my air-conditioning cycling.
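One way to check directly is to ask the driver why clocks are being pulled back. A minimal sketch using the pynvml bindings (pip package nvidia-ml-py3); device index 0 and the constant names are assumptions that may vary by version:

```
# Minimal throttle-reason check via NVML.
# Device index 0 and the constant names are assumptions; they may
# differ across pynvml versions.
import pynvml as nv

nv.nvmlInit()
h = nv.nvmlDeviceGetHandleByIndex(0)

temp = nv.nvmlDeviceGetTemperature(h, nv.NVML_TEMPERATURE_GPU)
sm_mhz = nv.nvmlDeviceGetClockInfo(h, nv.NVML_CLOCK_SM)
reasons = nv.nvmlDeviceGetCurrentClocksThrottleReasons(h)

print(f"temp={temp}C sm_clock={sm_mhz}MHz reasons=0x{reasons:x}")
if reasons & nv.nvmlClocksThrottleReasonSwPowerCap:
    print("clocks reduced: software power cap")
if reasons & nv.nvmlClocksThrottleReasonHwSlowdown:
    print("clocks reduced: hardware slowdown (thermal or power brake)")

nv.nvmlShutdown()
```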
jason_gee, Volunteer developer / Volunteer tester
Message 1804669 - Posted: 24 Jul 2016, 15:22:44 UTC - in response to Message 1804650.  

Yes. In effect, any clock rate above the base clock is GPU Boost doing its job: dynamically overclocking in response to a large array of driving parameters.

[Sorry for length of mental dump]

A more accurate interpretation than 'thermal throttling' would be 'reduced dynamic overclock', which could be temperature related, but also power, voltage, etc. (something like 23 parameters IIRC, most of them hidden and stability related).

If you see things drop below the specified base clock, then you could consider that a problem somewhere (in-system or external).
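As a concrete check, something like this would flag dips below the specified base clock (a sketch; BASE_MHZ is an example figure for a GTX 980 and should be replaced with your card's spec):

```
# Watch the SM clock and flag drops below the card's specified base
# clock. BASE_MHZ is an example (GTX 980 base clock); use your spec.
import time
import pynvml as nv

BASE_MHZ = 1126

nv.nvmlInit()
h = nv.nvmlDeviceGetHandleByIndex(0)
try:
    while True:
        mhz = nv.nvmlDeviceGetClockInfo(h, nv.NVML_CLOCK_SM)
        note = "  <-- below base clock, worth investigating" if mhz < BASE_MHZ else ""
        print(f"{time.strftime('%H:%M:%S')} SM clock {mhz} MHz{note}")
        time.sleep(5)
finally:
    nv.nvmlShutdown()
```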

I heard somewhere recently (forget where) that some people running water cooling flash their devices to disable GPU Boost, though in general setting the clock lower (deeper into the stability region than GPU Boost would) is advisable anyway. My understanding is that one reason they do this is to present a more or less constant load, which simplifies fan/pump configuration and stabilises the acoustics.

In general, the 'baseline' CUDA applications should reach 90-100% with 2-3 tasks (GPU and system dependent). Any more than that will increase switching overheads and CPU cost.

I keep my heavily CPU-bottlenecked Windows 7 Core2Duo system with a GTX 980 in that state, so as to represent something close to worst-case imbalance.

Saturating the CPU threads/cores to the point that the GPU drivers choke in the DPC queue would most likely be visible using LatencyMon or a similar tool while running. Tuning the system (chipset drivers, BIOS, amount of OC, and possibly other components such as RAM latency) will change that balance and potentially move the bottleneck around. In the case of the baseline CUDA v8 apps on Windows Vista onwards, the latencies involved are higher than under the old XP SP2 regime, but lower than on modern Mac OS X. Running multiple instances hides those latencies somewhat, but the best case down the road will be single-instance binaries that use all the allocated hardware effectively, reducing switching overhead to an absolute minimum while also hiding the latencies.
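If you want to experiment with that balance by hand, a sketch along these lines raises the CUDA app's priority and pins it to a core pair (the process name here is hypothetical; substitute your actual binary, and note the priority constant is Windows-only):

```
# Sketch: raise the CUDA app's priority and pin it to a core pair so a
# saturated CPU is less likely to stall the GPU feeding path.
# GPU_APP is a hypothetical binary name; substitute your actual app.
import psutil

GPU_APP = "Lunatics_x41zc_win32_cuda50.exe"  # hypothetical

for p in psutil.process_iter(["name"]):
    if p.info["name"] == GPU_APP:
        p.nice(psutil.ABOVE_NORMAL_PRIORITY_CLASS)  # Windows-only constant
        p.cpu_affinity([0, 1])  # keep it on the first core pair
        print(f"adjusted pid {p.pid}")
```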

Newer techniques have been explored, pushing slowly (about as quickly as the gaming industry) toward a better state. Most of the same considerations are relevant to the motivations behind DirectX 12 and Vulkan as well (with the same amount of complicated work to get to the end goal of properly heterogeneous systems + applications).
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
Shaggie76
Message 1804677 - Posted: 24 Jul 2016, 15:41:02 UTC

Some digging turned up that the default fan-control curve is set very conservatively -- it doesn't want to run the fans loud even when the GPU is getting warm.

I've also learned, to my horror, that you can't customize the fan curve without either a) leaving a process running or b) flashing the BIOS on the card.

I'll leave MSI Afterburner running for the next stress-test cycle -- with a slightly more aggressive curve it's running the GPU 15C cooler and the fan is barely above 50%.
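For illustration, a fan curve is just a piecewise-linear map from temperature to fan duty; the points below are made-up examples of a "more aggressive" curve, not Afterburner's actual defaults:

```
# Toy fan curve: piecewise-linear map from GPU temperature to fan duty.
# The curve points are made-up examples of an aggressive curve, not
# Afterburner's actual defaults.
def fan_percent(temp_c, curve=((40, 30), (60, 55), (70, 75), (80, 100))):
    if temp_c <= curve[0][0]:
        return curve[0][1]
    for (t0, f0), (t1, f1) in zip(curve, curve[1:]):
        if temp_c <= t1:
            return f0 + (f1 - f0) * (temp_c - t0) / (t1 - t0)
    return curve[-1][1]  # pin at 100% above the last point

print(fan_percent(65))  # 65.0 -- i.e. 65% duty at 65C on this curve
```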
RueiKe, Volunteer tester
Message 1804700 - Posted: 24 Jul 2016, 17:11:28 UTC

When running my Nano cards with the original air cooling, they were definitely thermally throttled, but what I observed is very different from what you show: when throttling is taking place, voltage and frequency are reduced and loading actually goes up. Perhaps what you are seeing is the GPU getting a task it's not very efficient at; I see loading go down whenever I get an AP task.
GitHub: Ricks-Lab
Instagram: ricks_labs
Grant (SSSF), Volunteer tester
Message 1804789 - Posted: 25 Jul 2016, 7:34:31 UTC - in response to Message 1804700.  

When throttling is taking place, voltage and frequency are reduced and loading actually goes up. Perhaps what you are seeing is the GPU getting a task it's not very efficient at. I see loading go down whenever I get an AP task.

I've noticed that when the GPU load drops and the memory-controller load increases, the power consumption increases.
When the memory-controller load drops off and the GPU load increases, the power consumption drops off. This has been most noticeable since the introduction of the guppies.
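That correlation is easy to capture; a minimal logging sketch with pynvml (device 0 assumed; NVML reports board power in milliwatts):

```
# Log GPU core load, memory-controller load, and board power together
# so the correlation can be graphed (device 0 assumed).
import time
import pynvml as nv

nv.nvmlInit()
h = nv.nvmlDeviceGetHandleByIndex(0)
print("time,gpu_util_%,mem_util_%,power_W")
for _ in range(720):  # ~1 hour at 5-second intervals
    u = nv.nvmlDeviceGetUtilizationRates(h)  # .gpu / .memory percentages
    w = nv.nvmlDeviceGetPowerUsage(h) / 1000.0  # NVML reports milliwatts
    print(f"{time.strftime('%H:%M:%S')},{u.gpu},{u.memory},{w:.1f}")
    time.sleep(5)
nv.nvmlShutdown()
```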
Grant
Darwin NT
Shaggie76
Message 1804813 - Posted: 25 Jul 2016, 13:16:25 UTC

Last 12 hours:

[Graph: GPU utilization and GPU temperature, last 12 hours]

So it isn't thermal throttling because the GPU didn't get anywhere near as warm this time.

The last few tasks were 19no10ac.22340.1703.5.32.199_0 and 19no10ab.18988.885.9.36.154_0.

I guess now I need to look into what "pulsefind: blocks per SM 4 (Fermi or newer default)" is all about, since Fermi is 2 generations older than what I've got.
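For what it's worth, that line appears to echo a pulsefind tuning knob in the same mbcuda.cfg; a sketch (key name from memory and not verified):

```
[mbcuda]
; pulsefind work distribution: CUDA blocks launched per SM
; (key name assumed; 4 is the reported Fermi-or-newer default)
pfblockspersm = 4
```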
Richard Haselgrove, Volunteer tester
Message 1804823 - Posted: 25 Jul 2016, 14:16:19 UTC

Wasn't all this discussed in some (credible) detail, a few weeks ago?

When running pulse-finding (dominant in VLARs, hence guppis), the code can't be fully parallelised, and all kernels run on a single SM. So the first SM (where GPU loadings are measured) is very, very busy, and reported loadings are high. But the other SMs are nearly idle, so power consumption and temperature (averaged over the whole GPU die) are much lower.

Conversely, when other signal types are being searched for, parallelisation allows more SMs to be used: the average load per SM is lower, but the total load summed over all SMs is higher, causing higher energy use and thus higher temperatures.

The effects will be more noticeable the larger the number of SMs your GPU has (i.e. the more expensive it is).
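A toy model makes the arithmetic concrete, using the NVML-style definition of utilization (the fraction of time any kernel is resident on the GPU): a serial pulsefind kernel can keep the load meter high while most of the die idles, and power tracks the number of busy SMs. All numbers below are illustrative, not measured:

```
# Toy model: the NVML-style load meter tracks the fraction of time any
# kernel is resident, while power scales roughly with busy SMs.
# All numbers are illustrative, not measured.
IDLE_W, PER_SM_W = 40, 8  # made-up idle and per-busy-SM board power

def report(busy_sms, resident_fraction):
    load_pct = resident_fraction * 100      # what the load meter shows
    power_w = IDLE_W + PER_SM_W * busy_sms  # rough board draw
    return load_pct, power_w

# Pulsefind-like: 1 of 16 SMs busy, a kernel resident almost always.
print(report(busy_sms=1, resident_fraction=0.98))   # -> (98.0, 48)
# Parallel search: 14 of 16 SMs busy, kernels resident a bit less often.
print(report(busy_sms=14, resident_fraction=0.85))  # -> (85.0, 152)
```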
jason_gee, Volunteer developer / Volunteer tester
Message 1804824 - Posted: 25 Jul 2016, 14:43:37 UTC - in response to Message 1804823.  

Wasn't all this discussed in some (credible) detail, a few weeks ago?


Yes, however as the parties involved (that I recall) well know, the issues of concurrency versus computational efficiency versus communications (memory access) complexity become exceedingly complicated, especially when you throw in some 4 or 5 generations of disparate hardware, underlying OS differences, and new generations of GPU rolling out faster than drivers can be made for them.

IMO what's happened is that we've caught up to game development with respect to current and predicted future needs. So expect the landscape to be no less volatile and confusing for most people than what's going on with VR, DX12 and Vulkan.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
Shaggie76
Message 1804912 - Posted: 26 Jul 2016, 1:05:23 UTC - in response to Message 1804823.  

This is interesting, and has me wondering about the trade-offs of running multiple tasks on one faster card vs running several cheaper cards with fewer SMs, so that when the work goes narrow there are fewer wasted resources.

Evidently the Pascal chips have improved context-switching; it's too bad there isn't a benchmark.
Grant (SSSF), Volunteer tester
Message 1804959 - Posted: 26 Jul 2016, 7:24:10 UTC - in response to Message 1804912.  
Last modified: 26 Jul 2016, 7:25:31 UTC

Evidently the Pascal chips have improved context-switching;

Significantly improved.

From a recent article at AnandTech, Asynchronous Concurrent Compute: Pascal Gets More Flexible (and the following page on pre-emption) is worth a read.
Grant
Darwin NT
jason_gee, Volunteer developer / Volunteer tester
Message 1804973 - Posted: 26 Jul 2016, 11:36:46 UTC - in response to Message 1804959.  

Evidently the Pascal chips have improved context-switching;

Significantly improved.

From a recent article at AnandTech, Asynchronous Concurrent Compute: Pascal Gets More Flexible (and the following page on pre-emption) is worth a read.


What I'm reading there accurately reflects the behaviour I see on a GTX 980 when using multiple streams on Windows: driving the device(s) to high load without bogging down the machine is a fiddly balancing act.

That hardware scheduling refinement in Pascal is potentially a very big thing for us -- looks like I need to try to get hold of one (preferably a 1060/1070).

The missing part, raised by the question in the article ('if dynamic scheduling is so great, why didn't NVIDIA do this sooner?'), is that the transistor budget for it is high, and it wasn't really necessary until commercial VR. The latency levels needed for VR not to make you motion sick are much shorter than for conventional displays.

For our purposes it just means we should see better loading without choking the host system as much.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
