Nvidia driver kernel something stopped responding something recovered

Message boards : Number crunching : Nvidia driver kernel something stopped responding something recovered
Profile JakeTheDog
Joined: 3 Nov 13
Posts: 153
Credit: 2,585,912
RAC: 0
United States
Message 1792647 - Posted: 1 Jun 2016, 18:34:36 UTC
Last modified: 1 Jun 2016, 18:47:50 UTC

I think I downloaded two GPU SoG tasks several hours before the servers went down early on the morning of May 31 for that extended outage.
https://setiathome.berkeley.edu/result.php?resultid=4960012801
https://setiathome.berkeley.edu/result.php?resultid=4960012803

I was about 50% through one of them when the screen started going black for a few seconds, followed by a Windows taskbar message along the lines of "Nvidia driver kernel something stopped responding and recovered." The task list kept saying the task was postponed for 30 seconds, and the event log sometimes said the GPU was not detected. I suspended that task, and the same thing then happened to the other GPU task, which had not yet started. Sorry, but I didn't copy down the exact messages from the logs.

I was worried that my GPU or PSU was broken. I reinstalled the drivers and also rolled them back, using DDU to clean them up. I tried some benchmark tests, but the tasks still behaved the same. In CPUID HWMonitor, temperatures were OK; I'm not familiar with the voltages, but they didn't go very high compared to heavy gaming. Heavy gaming, which draws more power, did not trigger the issue.

The servers came back up, so I decided to reset the project and download new GPU tasks. The new GPU tasks are running OK. The two I reset are still in my account's task list. Do they stay there until the deadline?

Were those two tasks somehow corrupted, possibly because of the server crash? Or was it random corruption? Is it SoG related (although I have completed many of those successfully)? Hopefully nothing is wrong with my hardware.

Windows 7 64-bit
i5-3570K, not overclocked right now
MSI GTX 650 Ti Boost
16GB RAM
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1792666 - Posted: 1 Jun 2016, 19:30:30 UTC - in response to Message 1792647.  

You suffered a TDR (Timeout Detection and Recovery) fault in Windows, and unfortunately the tasks that were running when it happened are toast. Any time the Nvidia driver disappears while crunching, the running task errors out. The tasks will simply time out after the deadline, but until then they will sit as "ghosts" in your In Progress tasks. If you go over to the Nvidia graphics card forums and search for TDR faults, you will find a registry hack that extends the timeout used by the detection algorithm, which reduces how often the driver has to re-initialize.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1792672 - Posted: 1 Jun 2016, 20:11:56 UTC - in response to Message 1792647.  

Profile JakeTheDog
Joined: 3 Nov 13
Posts: 153
Credit: 2,585,912
RAC: 0
United States
Message 1792680 - Posted: 1 Jun 2016, 21:08:57 UTC

The TDR registry value already seems to be set to 8 by default. I'm not familiar with the settings files in BOINC; I'll take another look at that OpenCL thread if the driver crash happens again.
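For anyone who wants to check the current setting without opening regedit, here is a minimal Python sketch. It assumes the value in question is the TdrDelay DWORD under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers, which is the name Microsoft documents for the timeout in seconds; the thread itself never names the key, and read_tdr_delay is only an illustrative helper name.

```python
import sys

# Assumption: this is the key Microsoft documents for the TDR settings.
TDR_KEY = r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers"


def read_tdr_delay():
    """Return TdrDelay in seconds, or None if it is unset or this is not Windows."""
    if sys.platform != "win32":
        return None
    import winreg  # Windows-only standard library module

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, TDR_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, "TdrDelay")
            return int(value)
    except FileNotFoundError:
        # Key or value is missing: Windows uses its built-in timeout.
        return None


if __name__ == "__main__":
    delay = read_tdr_delay()
    if delay is None:
        print("TdrDelay is not set explicitly (or this is not Windows)")
    else:
        print(f"TdrDelay = {delay} s")
```

If this prints that the value is not set on a Windows machine, Windows falls back to its built-in timeout, which Microsoft documents as 2 seconds.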
The_Matrix
Volunteer tester
Joined: 17 Nov 03
Posts: 414
Credit: 5,827,850
RAC: 0
Germany
Message 1799378 - Posted: 29 Jun 2016, 15:34:33 UTC
Last modified: 29 Jun 2016, 15:34:54 UTC

Strange thing: I changed the drivers and lowered the PCIe bus frequency, but the SoG workunits still cause trouble since the newest Lunatics 0.45 beta release.

Crunching stops, and the display driver keeps recovering.

Known issue?
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1799385 - Posted: 29 Jun 2016, 15:59:31 UTC - in response to Message 1799378.  

http://setiathome.berkeley.edu/forum_thread.php?id=79760
SETI apps news
We're not gonna fight them. We're gonna transcend them.
Profile JakeTheDog
Joined: 3 Nov 13
Posts: 153
Credit: 2,585,912
RAC: 0
United States
Message 1800168 - Posted: 2 Jul 2016, 20:01:31 UTC
Last modified: 2 Jul 2016, 20:02:22 UTC

If anyone's interested, this happened to me again. I tried two things at the same time, so I don't know which one fixed it.

My TDR was already 8, so I decided to increase it to 10 (https://support.microsoft.com/en-gb/kb/2665946). I restarted BOINC but did not reboot the computer, and it kept happening.
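As a rough illustration of that registry change, the sketch below builds the text of a .reg file that sets the timeout to 10 seconds. It assumes the value is the TdrDelay DWORD under the GraphicsDrivers key that Microsoft documents; tdr_reg_file is a made-up helper name, and the script only prints the text. Saving it, importing it with administrator rights, and rebooting are left to the reader.

```python
# Assumption: TdrDelay under this key is the timeout being raised in the post.
TDR_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"


def tdr_reg_file(delay_seconds: int) -> str:
    """Return the text of a .reg file that sets TdrDelay to delay_seconds."""
    if not 1 <= delay_seconds <= 0xFFFFFFFF:
        raise ValueError("delay_seconds must fit in a positive 32-bit DWORD")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{TDR_KEY}]",
        # .reg files store DWORD values as eight hexadecimal digits.
        f'"TdrDelay"=dword:{delay_seconds:08x}',
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    print(tdr_reg_file(10))
```

Generating the text instead of writing to the registry directly keeps the change reviewable before anything is applied.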

From Raistmer's thread, I added -sbs 256 -period_iterations_num 100 to some txt files, though I'm not sure they were the right ones: mb_cmdline-8.12_windows_intel__opencl_nvidia_sah.txt and mb_cmdline-8.12_windows_intel__opencl_nvidia_SoG.txt. Again, I restarted BOINC without rebooting the computer, and it still happened. This time my computer crashed and rebooted on its own.
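The edit described above can be sketched in Python as follows. The option string and the two file names are taken from this post; the directory is deliberately a parameter because the location of the project folder is not given in the thread, and write_cmdline_files is a hypothetical helper. Whether these are the right files for a given app version, and whether a trailing newline matters to the app, are assumptions.

```python
from pathlib import Path

# Options and file names exactly as quoted in the post.
OPTIONS = "-sbs 256 -period_iterations_num 100"
CMDLINE_FILES = (
    "mb_cmdline-8.12_windows_intel__opencl_nvidia_sah.txt",
    "mb_cmdline-8.12_windows_intel__opencl_nvidia_SoG.txt",
)


def write_cmdline_files(project_dir, options=OPTIONS, names=CMDLINE_FILES):
    """Overwrite each command-line file in project_dir with the option string."""
    project_dir = Path(project_dir)
    written = []
    for name in names:
        path = project_dir / name
        # Replaces any existing contents; back the file up first if unsure.
        path.write_text(options + "\n", encoding="ascii")
        written.append(path)
    return written
```

Calling write_cmdline_files with the path of the SETI@home project folder would rewrite both files in one step; BOINC would then need a restart, as in the post.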

After the reboot, the GPU task is running OK so far. So I'm not sure which change fixed it, but raising TDR to 10 followed by a reboot (needed because the registry changed) may have done it for me.

©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.