Seeing Linux GPU temperatures and getting alerts when things go south

Profile Joseph Stateson Project Donor
Volunteer tester
Joined: 27 May 99
Posts: 309
Credit: 70,759,933
RAC: 3
United States
Message 2023598 - Posted: 18 Dec 2019, 18:09:24 UTC
Last modified: 18 Dec 2019, 18:56:40 UTC

I use BoincTasks for my remote systems, and that tool does not report temperatures from Linux like it does for Windows.
For some time I have been using BoincTasks (BT) and have configured several "rules" to send a text message to my phone
if it detects a stuck task or a temperature that is too high. That was not possible on Linux until now.

I have a Python script at https://github.com/JStateson/BoincTasks that runs as a service under systemd
and reports temperatures to BoincTasks. In addition, if the NVIDIA driver recommends a reboot to recover a "lost" GPU,
the script sends a text message alerting me and turns off GPU usage in BOINC.

Anyone is welcome to use this tool, and I would appreciate suggestions for improvement. You may already be using an
excellent temperature checking and reporting program. This script allows temps to show up on the BoincTasks
display, which is convenient for me.
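
For anyone who wants a feel for the idea without reading the repository, here is a minimal sketch of one polling pass. It is not the script from the link above. It assumes nvidia-smi and boinccmd are on the PATH, that the driver's "lost" message contains the text "GPU is lost", and the 85 C limit and all function names are made up for illustration. Sending the text message and reporting to BoincTasks are left out.

```python
import subprocess

TEMP_LIMIT_C = 85  # hypothetical alert threshold, not taken from the real script

# Ask the driver for one "index, temperature" CSV line per GPU.
QUERY = ["nvidia-smi", "--query-gpu=index,temperature.gpu",
         "--format=csv,noheader,nounits"]


def parse_temps(text):
    """Turn 'index, temp' CSV lines into {gpu_index: temp_c}; skip anything else."""
    temps = {}
    for line in text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) == 2 and parts[0].isdigit() and parts[1].isdigit():
            temps[int(parts[0])] = int(parts[1])
    return temps


def too_hot(temps, limit=TEMP_LIMIT_C):
    """Return the sorted indexes of GPUs at or above the limit."""
    return sorted(i for i, t in temps.items() if t >= limit)


def gpu_lost(text):
    """Assumed wording: the driver output mentions 'GPU is lost'."""
    return "GPU is lost" in text


def check_once():
    """Run one poll; turn off BOINC GPU use if a GPU is reported lost."""
    result = subprocess.run(QUERY, capture_output=True, text=True)
    if gpu_lost(result.stdout + result.stderr):
        # Stop BOINC from using GPUs until someone can reboot the host.
        subprocess.run(["boinccmd", "--set_gpu_mode", "never"])
    return too_hot(parse_temps(result.stdout))


if __name__ == "__main__":
    print("GPUs over limit:", check_once())
```

Under systemd this would be called in a loop with a sleep between passes. boinccmd may need to be run from the BOINC data directory (or given the RPC password) before the client accepts the command.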
ID: 2023598 · Report as offensive
wujj123456

Joined: 5 Sep 04
Posts: 40
Credit: 20,877,975
RAC: 219
China
Message 2023666 - Posted: 19 Dec 2019, 5:19:35 UTC

Nice one. I am curious how many of you have ever run into temperature problems... I use a high-airflow case and so far haven't really seen any problem, even with open-air-cooled GPUs stacked next to each other. In all the years of gaming and running BOINC on and off, I've never had a GPU shut down. I did have one card outright burn out in an HTPC case two years ago, but it was a busted capacitor.
ID: 2023666 · Report as offensive
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 2023667 - Posted: 19 Dec 2019, 5:29:55 UTC - in response to Message 2023666.  

I've replaced 7 cards over the years. Most of those were due to burnout and were before the hybrids came out. Now I have almost all hybrids, and that doesn't happen as often as it used to.
ID: 2023667 · Report as offensive
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 2024512 - Posted: 23 Dec 2019, 1:28:30 UTC

Hi,

Whenever I develop a new, improved version of the software I run into temperature problems. I'm running on air and flying in the vicinity of the upper limits of cooling.

Sometimes it just happens that, even though my system has been running OK for some hours, a temperature catastrophe hits when I'm away from my computer. One of my GPUs goes south and does not recover. It will either run slow or rapidly destroy my work queue.

I could take a look at your solution.

--
Petri
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 2024512 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2024515 - Posted: 23 Dec 2019, 1:34:54 UTC
Last modified: 23 Dec 2019, 1:37:11 UTC

I agree with Z, the hybrids are perfect for crunching. With them you can easily keep the temps within a safe range, even in hot, high-humidity places like the one I live in.

The only problem I have with the hybrids is their pumps. I had 2 of them fail, about 1 per year. They are hard to find here, but once replaced everything goes back to working fine.
ID: 2024515 · Report as offensive
Profile petri33
Volunteer tester

Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 2024517 - Posted: 23 Dec 2019, 1:41:12 UTC - in response to Message 2024515.  

I agree with Z, the hybrids are perfect for crunching. With them you can easily keep the temps within a safe range, even in hot, high-humidity places like the one I live in.

The only problem I have with the hybrids is their pumps. I had 2 of them fail, about 1 per year. They are hard to find here, but once replaced everything goes back to working fine.


:)

It is neither humid nor warm here. Still my GPUs go south. Or maybe because of that.
ID: 2024517 · Report as offensive

©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.