Limiting GPU temperature on NVIDIA graphics card on Linux


Bent Jakobsen
Joined: 27 Jul 04
Posts: 2
Credit: 77,199
RAC: 0
Denmark
Message 996464 - Posted: 15 May 2010, 15:10:19 UTC

I was looking for a way to limit the maximum temperature that my NVIDIA card reaches when crunching GPU work units through BOINC (here on the milkyway project).

The reason that I wanted to limit the maximum temperature is twofold:

1. To limit the wear and tear on the remaining components
2. To limit the noise from the graphics card.

This should, from my point of view, be done through BOINC.
But I did not find anything to use for this purpose on Linux, so I wrote this bash script.

I'll just upload it, so if anyone wants to use it, just do :)

It is not perfect and not optimized, so feel free to improve it. And please do remember that you use it at your own risk, so don't come running to me when your graphics card gets toasted.

Best regards
Bent


-----BASH CODE BEGIN-----
#!/bin/bash

#--------------------------------------------------------------------------------------------------
# Version: 001
#
# Description: Ensures that the NVIDIA GPU temperature stays within a certain range when running the milkyway GPU application
#
# Operating System: This is made for Linux.
#
# Notes:
# - This example is made for only one GPU (gpu:0)
# - Requires to be run as root or an elevated account that is allowed to stop and restart processes
#
# Requirements/commands used:
# nvidia-settings
# awk
# sed
# pslist
# grep
# cat
# kill
# sleep
# echo
# cut
#
# NOTE: Please do check that you have all these commands before running it.
#
# Disclaimer:
# I assume no responsibility; use it at your own risk ...
# So don't complain about anything that may arise from using this....
# It is your own fault ;)
#
# And do please note that this script might have some unexpected bugs, as it is only some bash-code
# which I have thrown together...
#
# License: Free - Do please modify if you want to do so
#
# Todo: The bash code should be optimized, and check for any unexpected failures...
# --------------------------------------------------------------------------------------------------

GPUMAX=67
GPURESUME=56
SLEEPTIME=2

# GPUMAX: Maximum temperature
# GPURESUME: Temperature where we can resume computing
# SLEEPTIME: Number of seconds between measurements


# Read the core temperature of gpu:0 as a whole number (the trailing "." is stripped)
read_gpu_temp() {
    nvidia-settings -q '[gpu:0]/GPUCoreTemp' | grep '):' | awk '{print $4}' | sed 's/\.//'
}

# Loop forever; stop the script with Ctrl-C
while true; do
    GPU0Temp=$( read_gpu_temp )
    if [ -z "$GPU0Temp" ]; then
        # Without a reading the numeric tests below would fail, so try again later
        echo "GPU0 (NO READING)"
        sleep "$SLEEPTIME"
        continue
    fi

    # Reset per-iteration state so values from an earlier pass are never reused
    MILKYWAYPID=""
    STATUS=""
    MPID=$( pslist milkyway_0.24_x )
    if [ -z "$MPID" ]; then
        echo "GPU0 (NO MILKYWAY) = $GPU0Temp"
    else
        # Assumes a single milkyway process whose PID is the first field
        MILKYWAYPID=$( echo "$MPID" | cut -d " " -f 1 )
        # STATUS is non-empty only when /proc reports the process as stopped
        STATUS=$( grep "State:" "/proc/$MILKYWAYPID/status" | grep "stopped" )
    fi

    if [ "$GPU0Temp" -gt "$GPUMAX" ] && [ -n "$MILKYWAYPID" ]; then
        # Temperature greater than allowed, so pause the GPU process
        kill -STOP "$MILKYWAYPID"
        RESUME=0
        while [ "$RESUME" -eq 0 ]; do
            GPU0Temp1=$( read_gpu_temp )
            if [ -n "$GPU0Temp1" ] && [ "$GPU0Temp1" -lt "$GPURESUME" ]; then
                RESUME=1
            fi
            echo "GPU0 (STOPPED) = $GPU0Temp1"
            sleep "$SLEEPTIME"
        done
        kill -CONT "$MILKYWAYPID"
    else
        if [ -n "$MILKYWAYPID" ]; then
            if [ "$GPU0Temp" -lt "$GPURESUME" ]; then
                # Cool enough: resume the GPU process if it has been stopped
                if [ -n "$STATUS" ]; then
                    kill -CONT "$MILKYWAYPID"
                    echo "GPU0 (RESUMED) = $GPU0Temp"
                fi
            elif [ -z "$STATUS" ]; then
                echo "GPU0 (RUNNING) = $GPU0Temp"
            else
                echo "GPU0 (COOLING) = $GPU0Temp"
            fi
        fi
        sleep "$SLEEPTIME"
    fi
done
-----BASH CODE END-----


____________

jravin
Joined: 25 Mar 02
Posts: 930
Credit: 98,075,777
RAC: 86,037
United States
Message 996602 - Posted: 17 May 2010, 6:22:39 UTC - in response to Message 996464.
Last modified: 17 May 2010, 6:23:03 UTC

EVGA has a tool for Windows, "Precision", that lets you control the fan on your cards; check their website to see if they have a version for Linux.
____________

Bent Jakobsen
Joined: 27 Jul 04
Posts: 2
Credit: 77,199
RAC: 0
Denmark
Message 996679 - Posted: 17 May 2010, 14:54:38 UTC - in response to Message 996602.

Hi jravin,

Thanks for your message.

First, I cannot find a Linux version of Precision. I could perhaps find another tool to do the same if I wanted to, but basically I don't want to.

Allow me to try to explain.

Your way, as I see it, is to reduce the NVIDIA slowdown threshold temperature, thereby limiting my already "slow" GeForce GTX 285 card. This would be a good fix for a situation where we have a lot of "badly behaved" applications and want to ensure that, no matter what, the temperature does not rise above a certain level, at the expense of the noise level, and at the expense of the graphics card if the fan fails.

But from my point of view, it is the controlling application (read: BOINC) that should ensure that a "badly behaved" application like milkyway is not allowed to run when the environment is not within the acceptable limits specified by the local administrator (read: me).

BOINC currently does not allow such control and is therefore, in a way, a "badly behaved" application itself.

Therefore, if I want to run milkyway (or any other BOINC-based GPU application), I have to take responsibility for the running environment myself. That is why I made the script: milkyway is paused when the temperature rises above the specified maximum, and it is allowed to continue computing once the temperature drops below a lower resume temperature.
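
The pause/resume rule described here is a two-threshold (hysteresis) scheme: stop above the maximum, resume only below a lower temperature, and do nothing in between. A minimal sketch of just that decision follows. The function name `next_action` and the state labels "running" and "stopped" are made up for this illustration; the thresholds mirror GPUMAX and GPURESUME from the script in the first post.

```bash
#!/bin/bash
# Thresholds taken from the script in the first post
GPUMAX=67
GPURESUME=56

# next_action TEMP STATE  -> prints "stop", "resume", or "none"
# STATE is "running" or "stopped" (hypothetical labels for this sketch)
next_action() {
    local temp=$1 state=$2
    if [ "$temp" -gt "$GPUMAX" ] && [ "$state" = "running" ]; then
        echo "stop"      # too hot: pause the GPU process
    elif [ "$temp" -lt "$GPURESUME" ] && [ "$state" = "stopped" ]; then
        echo "resume"    # cooled down far enough: continue computing
    else
        echo "none"      # between the thresholds: leave the process alone
    fi
}

next_action 70 running   # prints stop
next_action 60 stopped   # prints none
next_action 55 stopped   # prints resume
```

The gap between the two thresholds is what keeps the process from being stopped and resumed every few seconds when the temperature hovers around a single limit.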

So you see we are actually looking at the same issue but from two different points of view.

Best regards

Bent
____________

woodenboatguy
Joined: 10 Nov 00
Posts: 368
Credit: 3,969,364
RAC: 0
Canada
Message 996855 - Posted: 18 May 2010, 3:07:23 UTC - in response to Message 996679.

I have a number of GTX 285s. I say just let 'er rip. I get up to the high 80s or low 90s (°C) and employ a solution someone here suggested: I have a fan mounted within the box blowing down the length of the three cards, thereby increasing flow across the intake fans on the top and middle cards. It improved temps by 3-4 °C.

Smoke 'em if you got 'em I say. What else are you going to do with all your money?!!

Regards,
____________

jravin
Joined: 25 Mar 02
Posts: 930
Credit: 98,075,777
RAC: 86,037
United States
Message 996874 - Posted: 18 May 2010, 4:21:53 UTC - in response to Message 996679.

NO! The point of Precision (or a similar tool for Linux) is that you control the fan speed, so you can increase it to cut the temperature. Thus, you don't throttle back the card to avoid high temps; you up the fan speed for more cooling.
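
As a rough illustration of the fan-speed approach, here is a sketch of a linear fan curve that maps a temperature to a fan percentage. The function name `fan_percent` and all four numbers (40% at 50 °C, 100% at 80 °C) are assumptions chosen for the example, not recommendations for any particular card.

```bash
#!/bin/bash
# Hypothetical linear fan curve: f_low % at or below t_low, f_high % at or
# above t_high, and a straight line in between. All values are illustrative.
fan_percent() {
    local temp=$1
    local t_low=50 t_high=80 f_low=40 f_high=100
    if [ "$temp" -le "$t_low" ]; then
        echo "$f_low"
    elif [ "$temp" -ge "$t_high" ]; then
        echo "$f_high"
    else
        # Integer interpolation between the two end points
        echo $(( f_low + (temp - t_low) * (f_high - f_low) / (t_high - t_low) ))
    fi
}

fan_percent 65   # prints 70
```

The sketch only computes a target value. Actually applying it on Linux depends on the driver: with the proprietary NVIDIA driver, manual fan control is typically exposed through nvidia-settings attributes once the Coolbits option is enabled, but the attribute names have varied between driver versions, so check what your own installation reports before relying on any of them.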
____________

jravin
Joined: 25 Mar 02
Posts: 930
Credit: 98,075,777
RAC: 86,037
United States
Message 997225 - Posted: 20 May 2010, 0:39:00 UTC

I tried a Google search for "linux fan speed control" and came up with this, which may be what you want:

http://www.linuxhardware.org/nvclock/

Good luck!

Jon
____________

Al
Project donor
Joined: 3 Apr 99
Posts: 481
Credit: 52,113,922
RAC: 23,518
United States
Message 997293 - Posted: 20 May 2010, 5:45:26 UTC

I've been wondering for a while whether there is a Windows app out there that would let me control temps, as opposed to Precision, which lets me control speeds. I want to set a temp in the program and have it set the fan speed to whatever it takes to hold that temp (as best as possible; there's always a margin of error). As far as I know, there isn't such a program out there. Does anyone know of one that works?
____________

w1hue
Project donor
Volunteer tester
Joined: 4 Aug 00
Posts: 48
Credit: 1,741,669
RAC: 1,320
United States
Message 997656 - Posted: 21 May 2010, 21:53:39 UTC - in response to Message 997293.

Have you tried TThrottle? (Go to http://www.efmer.eu/boinc/.) I have used it to keep CPU temps under control and it works great. It is also supposed to work with GPUs, but my old NVIDIA GeForce 210 never gets hotter than 65 °C (well within its limits) when running SETI.

____________

Copyright © 2014 University of California