Q about nvidia crash & recovery.

52 Aces
Joined: 7 Jan 02
Posts: 497
Credit: 14,261,068
RAC: 67
United States
Message 941901 - Posted: 22 Oct 2009, 4:27:06 UTC

I'm running an EVGA NVIDIA GTS 250 (512 MB, driver 191.07).

Sometimes when I launch a game on my crunch box while SETI is running, the graphics card blanks out, I get a message that the driver has recovered, and the current CUDA WU terminates with a "Computational Error."

All that is fine. What isn't fine is from then on out, at least until I reboot, CUDA jobs that used to run in 20 minutes will now run slow-mo taking over 2 hours.

Has anyone seen this, and do you know of any way other than rebooting the box to get the NVIDIA card to re-initialize properly? A full exit of BOINC and its services does not resolve it; this is buried someplace. I've tried the obvious things inside the NVIDIA applet (and EVGA Precision) and in Windows proper.

    Boinc Messages:
    10/21/2009 9:08:53 PM NVIDIA GPU has become unusable; disabling tasks
    10/21/2009 9:08:55 PM NVIDIA GPU has become usable; enabling tasks

    Windows Event Viewer:
    Display driver nvlddmkm stopped responding and has successfully recovered.
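
A minimal sketch (not from the original posts) of how BOINC message lines like the two above could be scanned for these GPU reset events. It assumes the lines were copied out of the Messages tab as plain text in the `MM/DD/YYYY H:MM:SS AM/PM text` form shown here and that the system locale spells the marker as `AM`/`PM`; the function name `find_gpu_resets` is made up for illustration.

```python
from datetime import datetime

# Message texts as they appear in the post above.
UNUSABLE = "NVIDIA GPU has become unusable"
USABLE = "NVIDIA GPU has become usable"


def find_gpu_resets(lines):
    """Return (went_down, came_back) timestamp pairs found in the lines."""
    resets = []
    down_at = None
    for line in lines:
        # Expected shape: "<date> <time> <AM/PM> <message text>"
        parts = line.strip().split(" ", 3)
        if len(parts) < 4:
            continue
        try:
            stamp = datetime.strptime(" ".join(parts[:3]), "%m/%d/%Y %I:%M:%S %p")
        except ValueError:
            continue  # not a timestamped message line
        text = parts[3]
        if text.startswith(UNUSABLE):
            down_at = stamp
        elif text.startswith(USABLE) and down_at is not None:
            resets.append((down_at, stamp))
            down_at = None
    return resets


log = [
    "10/21/2009 9:08:53 PM NVIDIA GPU has become unusable; disabling tasks",
    "10/21/2009 9:08:55 PM NVIDIA GPU has become usable; enabling tasks",
]

for down, up in find_gpu_resets(log):
    gap = (up - down).total_seconds()
    print(f"GPU dropped out at {down} and came back {gap:.0f} s later")
```

Any task that starts after such a pair is a candidate for the slow runs described above.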



Thanks in advance & cheers.

ID: 941901
Gundolf Jahn
Joined: 19 Sep 00
Posts: 3184
Credit: 446,358
RAC: 0
Germany
Message 941928 - Posted: 22 Oct 2009, 6:42:28 UTC - in response to Message 941901.  

What isn't fine is from then on out, at least until I reboot, CUDA jobs that used to run in 20 minutes will now run slow-mo taking over 2 hours.

That's because all tasks are running in CPU-fallback mode until you reboot and thus reinitialise your graphics device.

As far as I know, there's no other way than rebooting.

Regards,
Gundolf
Computers aren't everything in life. (Just kidding)

SETI@home classic workunits 3,758
SETI@home classic CPU time 66,520 hours
ID: 941928
Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 941930 - Posted: 22 Oct 2009, 6:50:55 UTC

A full power cycle of the hardware, i.e. a reboot or power down/power up, is the only way to reinitialize stuck hardware.
ID: 941930
Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 941966 - Posted: 22 Oct 2009, 16:25:28 UTC - in response to Message 941901.  

You didn't say what BOINC version you're running, but I suspect it's 6.10.14,
as I've had 'NVIDIA GPU has become unusable; disabling tasks' as well.
Mine was on Collatz Conjecture with a bit of IE browsing and downloading,
but it went away when I upgraded to 6.10.15. The changes are:

Rom 19 October 2009
- client: Use is_remote_desktop() instead of the various GPU functions to determine when the client software has been switched into Remote Desktop mode and shutsdown GPU apps. This will prevent App crashes

Claggy
ID: 941966
52 Aces
Joined: 7 Jan 02
Posts: 497
Credit: 14,261,068
RAC: 67
United States
Message 942025 - Posted: 22 Oct 2009, 19:48:31 UTC - in response to Message 941966.  

Thanks all, and Claggy.

Yes, I was on 6.10.14 and just updated to 6.10.15! Thanks, good find.


One other item I stumbled on last night while trying to solve this: I'm using a Gigabyte (P55-UD2) motherboard, which comes with its own clocking utility (EasyTune6) to ease configuration of the Award BIOS settings. Although not all of its settings are part of the BIOS, it has a tab called "Graphics," and sure enough the values it showed were the slow ones that only appear after a GPU crash and recovery. Although I could not set the values to the OC levels, I *COULD* raise them to the original out-of-the-box levels. So after a crash, GPU WUs will take 22 minutes instead of 20 (which is much better than 2 hours).

Classic: those non-BIOS settings don't survive a reboot (but of course do survive a GPU crash and driver recovery). I might be able to auto-load a profile file, but I'll worry about that later. I thought I'd share this info now, as here was a set of settings that really don't belong where they were and that somehow inserted themselves AHEAD of everything NVIDIA ships (albeit only in the scenario of a GPU crash).
ID: 942025
X-Files 27
Joined: 17 May 99
Posts: 104
Credit: 111,191,433
RAC: 0
Canada
Message 942030 - Posted: 22 Oct 2009, 20:32:39 UTC - in response to Message 941928.  

What isn't fine is from then on out, at least until I reboot, CUDA jobs that used to run in 20 minutes will now run slow-mo taking over 2 hours.

It's because the card is running in 2D mode.

I always have this error with this setup:
GPU0: GTX295
-> SLI
GPU2: GTX295
-> PhysX
GPU1: GTX260


But when I use this setup, there's no crashing anymore:
GPU0: GTX295
-> PhysX
GPU2: GTX295
-> Extend monitor
GPU1: GTX260
ID: 942030
Misfit
Volunteer tester
Joined: 21 Jun 01
Posts: 21804
Credit: 2,815,091
RAC: 0
United States
Message 942044 - Posted: 22 Oct 2009, 22:21:11 UTC - in response to Message 941966.  

You didn't say what BOINC version you're running, but I suspect it's 6.10.14,
as I've had 'NVIDIA GPU has become unusable; disabling tasks' as well.
Mine was on Collatz Conjecture with a bit of IE browsing and downloading,
but it went away when I upgraded to 6.10.15. The changes are:

Rom 19 October 2009
- client: Use is_remote_desktop() instead of the various GPU functions to determine when the client software has been switched into Remote Desktop mode and shutsdown GPU apps. This will prevent App crashes

Claggy

Looks like I'll have to upgrade. I had the exact same crash (trashed 3 GPUGrid units) yesterday while gaming. Currently using 6.6.36
me@rescam.org
ID: 942044
52 Aces
Joined: 7 Jan 02
Posts: 497
Credit: 14,261,068
RAC: 67
United States
Message 942067 - Posted: 23 Oct 2009, 0:11:09 UTC - in response to Message 942044.  

Looks like I'll have to upgrade. I had the exact same crash (trashed 3 GPUGrid units) yesterday while gaming. Currently using 6.6.36


Lucky you, looks like 6.10.16 just got released.
ID: 942067
Misfit
Volunteer tester
Joined: 21 Jun 01
Posts: 21804
Credit: 2,815,091
RAC: 0
United States
Message 942721 - Posted: 25 Oct 2009, 8:28:09 UTC - in response to Message 942067.  
Last modified: 25 Oct 2009, 8:28:57 UTC

I've upgraded. I was gaming with BOINC completely shut down. Still had the video driver crash.
Display driver nvlddmkm stopped responding and has successfully recovered. (Event ID 4101)
This has happened with both the current drivers and the previous release. I never suffered a video crash with SETI CUDA. The problems started a few days into GPUGrid (I was gaming and crunching GPUGrid at the same time), so I'm wondering if a file somewhere has been corrupted.
me@rescam.org
ID: 942721
jenesuispasbavard
Volunteer tester
Joined: 13 Sep 05
Posts: 49
Credit: 12,385,974
RAC: 0
United States
Message 942828 - Posted: 26 Oct 2009, 0:50:43 UTC
Last modified: 26 Oct 2009, 0:52:35 UTC

It's happened to me before: the GPU clocks go from 550/1375/900 MHz (core/shaders/memory) to 383/767/301 MHz and stay there, which is why WUs take considerably longer. Unfortunately, the only solution I know of is to restart. The drivers do this when the card gets too hot and/or you overclock too far.

You can use GPU-Z to check whether the clocks go down (under the Sensors tab).
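
To put numbers on that drop, here is a small sketch using only the clock values quoted above. It does not query the card: the readings would have to be typed in by hand from GPU-Z's Sensors tab. The helper names `clock_ratios` and `looks_throttled`, and the 0.9 threshold, are arbitrary choices for illustration.

```python
# Clocks quoted in the post, in MHz.
NORMAL = {"core": 550, "shaders": 1375, "memory": 900}
AFTER_CRASH = {"core": 383, "shaders": 767, "memory": 301}


def clock_ratios(normal, current):
    """Return current/normal for each clock domain."""
    return {name: current[name] / normal[name] for name in normal}


def looks_throttled(normal, current, threshold=0.9):
    """True if any clock sits below `threshold` times its normal value.

    The 0.9 default is an assumption, not a value from the driver.
    """
    return any(r < threshold for r in clock_ratios(normal, current).values())


for name, ratio in clock_ratios(NORMAL, AFTER_CRASH).items():
    print(f"{name}: {ratio:.0%} of normal")

print("throttled:", looks_throttled(NORMAL, AFTER_CRASH))
```

With the figures from the post, the core ends up at roughly 70% of its normal clock, the shaders at about 56%, and the memory at about a third.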
ID: 942828
