Cuda runtime, memory related failure, threadsafe temporary exit..

ncoded.com Project Donor
Joined: 16 Aug 16
Posts: 4
Credit: 25,131,308
RAC: 2
United Kingdom
Message 1847321 - Posted: 8 Feb 2017, 13:17:16 UTC
Last modified: 8 Feb 2017, 13:18:10 UTC

Hi,

We have been running SETI on two GPUs, a GTX 970 and a GTX 750 Ti, for a few days.

We just got the message "Cuda runtime, memory related failure, threadsafe temporary exit".

It happened on the machine called E5-2683V3-1, which has 16 GB of DDR4 and a single V3 Xeon (14 cores / 28 threads).

The machine is running Windows 10 64-bit.

Any ideas? This is the first time we have come across this on any of our machines.

Thanks
ID: 1847321
ncoded.com Project Donor
Joined: 16 Aug 16
Posts: 4
Credit: 25,131,308
RAC: 2
United Kingdom
Message 1847337 - Posted: 8 Feb 2017, 15:37:31 UTC - in response to Message 1847321.  
Last modified: 8 Feb 2017, 15:40:20 UTC

Just as an update on this issue: within an hour of the problem detailed above, a second machine (i7-4790-2), also with two GPUs (both GTX 970s), started getting computation errors under a completely different project, Collatz.

The reason I mention this is that both machines had the same symptom, the screen going blank temporarily, which indicates some kind of driver issue; all four GPUs use exactly the same NVIDIA driver.
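
One way to check whether the blank screens really were driver resets is to look in the Windows System event log. The sketch below is only a rough illustration, not something we have run on these machines: it assumes that a display driver reset is logged under Event ID 4101 (which, as far as I know, is the "display driver stopped responding and has successfully recovered" entry) and that `wevtutil` text output starts each entry with a line beginning `Event[`. The function names are made up for this example.

```python
import subprocess

# Assumption: Windows logs a display driver reset (TDR) as Event ID 4101
# in the System log.
TDR_EVENT_ID = 4101


def build_query(event_id=TDR_EVENT_ID, max_events=10):
    """Build a wevtutil command listing the newest matching System events."""
    return [
        "wevtutil", "qe", "System",
        f"/q:*[System[(EventID={event_id})]]",  # XPath filter on the event ID
        f"/c:{max_events}",                     # limit the number of events
        "/rd:true",                             # newest events first
        "/f:text",                              # plain-text output
    ]


def count_events(text):
    """Count entries in wevtutil text output.

    Assumption: each entry starts with a line such as 'Event[0]:'.
    """
    return sum(1 for line in text.splitlines()
               if line.strip().startswith("Event["))


def main():
    # Windows only; reading the System log may need an elevated prompt.
    result = subprocess.run(build_query(), capture_output=True, text=True)
    if result.returncode != 0:
        print("wevtutil failed:", result.stderr.strip())
        return
    print(f"Found {count_events(result.stdout)} recent driver reset event(s)")
    print(result.stdout)


if __name__ == "__main__":
    main()
```

If entries show up with timestamps matching the blank screens on both machines, that would support the driver theory rather than a problem in either project's application.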

We have not noticed any driver updates in the last day or so. The only common factor between both these boxes is the driver as mentioned, and that both are new installs of the latest BOINC software.

Obviously, if there were some kind of issue between the latest BOINC software and the NVIDIA driver, we would expect there to be many similar reports.

It does seem odd that the driver problem is showing up as CUDA errors with SETI and as computation errors with Collatz.

Our other two machines have different GPUs (GeForce 210) with completely different drivers; they are running SETI and, so far, do not seem to be showing any problems.
ID: 1847337
rob smith · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22188
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1847343 - Posted: 8 Feb 2017, 16:44:23 UTC

Your recent additional notes point toward a driver issue:
BOINC doesn't do any computation; it only manages the applications. You've had the same issue with different applications and different hardware.
What you are describing is very typical of a driver halt/re-start cycle.
It wouldn't be the first time that there have been issues with the "latest" drivers when doing computational work - modern drivers are heavily focused on games, with computation a poor third. I would make sure the drivers came from Nvidia itself. I've been using driver version 368.81 for some time, and it appears to be stable on my Windows 7 PC, which has a pair of GTX 970s.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1847343
ncoded.com Project Donor
Joined: 16 Aug 16
Posts: 4
Credit: 25,131,308
RAC: 2
United Kingdom
Message 1847384 - Posted: 8 Feb 2017, 20:56:53 UTC - in response to Message 1847343.  

Hi Bob,

Thanks for the information.

We did a restart on each machine and everything seems to be okay now on both projects.

If it continues, we will look at older drivers; right now we are using the latest driver, 376.19, which in truth has been okay for the last few weeks apart from these issues today.
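
For anyone wanting to confirm which driver each box is actually running before and after a rollback, here is a minimal sketch. It assumes `nvidia-smi` is on the PATH (it normally ships with the NVIDIA driver) and uses its `--query-gpu` CSV output; the helper names `parse_gpu_lines` and `driver_versions` are invented for this example.

```python
import subprocess

# Assumption: nvidia-smi is installed and reachable on PATH.
QUERY = ["nvidia-smi", "--query-gpu=name,driver_version",
         "--format=csv,noheader"]


def parse_gpu_lines(output):
    """Turn 'name, driver_version' CSV lines into (name, version) tuples."""
    gpus = []
    for line in output.splitlines():
        if not line.strip():
            continue
        # Split on the last comma so a comma in the GPU name cannot break it.
        name, version = (part.strip() for part in line.rsplit(",", 1))
        gpus.append((name, version))
    return gpus


def driver_versions(gpus):
    """Return the sorted set of distinct driver versions reported."""
    return sorted({version for _, version in gpus})


def main():
    output = subprocess.run(QUERY, capture_output=True, text=True,
                            check=True).stdout
    gpus = parse_gpu_lines(output)
    for name, version in gpus:
        print(f"{name}: driver {version}")
    print("Driver version(s) in use:", ", ".join(driver_versions(gpus)))


if __name__ == "__main__":
    main()
```

Running it on each machine gives a quick record to compare against a version reported as stable elsewhere in this thread, such as 368.81.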
ID: 1847384
