Q about nvidia crash & recovery.



Message boards : Number crunching : Q about nvidia crash & recovery.

52 Aces (Project donor)
Joined: 7 Jan 02
Posts: 497
Credit: 13,336,385
RAC: 3,340
United States
Message 941901 - Posted: 22 Oct 2009, 4:27:06 UTC

I'm running an EVGA NVIDIA GTS 250 (512 MB, driver 191.07).

Sometimes if I launch a game on my crunch box while SETI is running, the graphics card blanks-n-tanks, I get a message that the driver has recovered, and the current CUDA WU terminates with a "Computational Error."

All that is fine. What isn't fine is from then on out, at least until I reboot, CUDA jobs that used to run in 20 minutes will now run slow-mo taking over 2 hours.

Has anyone seen this, and do you know of any process other than rebooting the box to get the NVIDIA card to re-initialize properly? A full exit of BOINC and its services does not resolve it; this is buried someplace deeper. I've tried the obvious things inside the NVIDIA applet (and EVGA Precision) and in Windows proper.


    BOINC messages:
    10/21/2009 9:08:53 PM NVIDIA GPU has become unusable; disabling tasks
    10/21/2009 9:08:55 PM NVIDIA GPU has become usable; enabling tasks

    Windows Event Viewer:
    Display driver nvlddmkm stopped responding and has successfully recovered.
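
For anyone who wants to spot these driver resets in a saved copy of the BOINC messages, here is a minimal Python sketch. It assumes the timestamp format matches the two lines quoted above; the function names are made up for illustration and are not part of BOINC.

```python
from datetime import datetime

# Sample lines copied from the BOINC messages quoted above.
LOG_LINES = [
    "10/21/2009 9:08:53 PM NVIDIA GPU has become unusable; disabling tasks",
    "10/21/2009 9:08:55 PM NVIDIA GPU has become usable; enabling tasks",
]


def parse_gpu_events(lines):
    """Return (timestamp, state) for each GPU usable/unusable message."""
    events = []
    for line in lines:
        parts = line.strip().split(" ", 3)  # date, time, AM/PM, message
        if len(parts) < 4:
            continue
        try:
            # Assumed format: month/day/year, 12-hour clock with AM/PM.
            stamp = datetime.strptime(" ".join(parts[:3]), "%m/%d/%Y %I:%M:%S %p")
        except ValueError:
            continue  # not a timestamped message line
        message = parts[3]
        if "GPU has become unusable" in message:
            events.append((stamp, "unusable"))
        elif "GPU has become usable" in message:
            events.append((stamp, "usable"))
    return events


def outage_seconds(events):
    """Pair each 'unusable' event with the next 'usable' one."""
    outages = []
    down_since = None
    for stamp, state in events:
        if state == "unusable":
            down_since = stamp
        elif state == "usable" and down_since is not None:
            outages.append((stamp - down_since).total_seconds())
            down_since = None
    return outages


if __name__ == "__main__":
    for seconds in outage_seconds(parse_gpu_events(LOG_LINES)):
        print(f"GPU was unusable for {seconds:.0f} s")
```

A short gap between the two messages only shows that the driver came back. It says nothing about whether the card returned to its normal clocks.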



Thanks in advance & cheers.

Gundolf Jahn
Joined: 19 Sep 00
Posts: 3184
Credit: 359,338
RAC: 33
Germany
Message 941928 - Posted: 22 Oct 2009, 6:42:28 UTC - in response to Message 941901.

    What isn't fine is from then on out, at least until I reboot, CUDA jobs that used to run in 20 minutes will now run slow-mo taking over 2 hours.

That's because all tasks are running in CPU-fallback mode until you reboot and thus reinitialise your graphics device.

As far as I know, there's no other way than rebooting.

Regards,
Gundolf
____________
Computers aren't everything in life. (Just kidding)

SETI@home classic workunits 3,758
SETI@home classic CPU time 66,520 hours

Ageless
Joined: 9 Jun 99
Posts: 12327
Credit: 2,632,050
RAC: 1,146
Netherlands
Message 941930 - Posted: 22 Oct 2009, 6:50:55 UTC

A full power cycle of the hardware, i.e. a reboot or power down/power up, is the only way to reinitialize stuck hardware.
____________
Jord

Fighting for the correct use of the apostrophe, together with Weird Al Yankovic

Claggy (Project donor)
Volunteer tester
Joined: 5 Jul 99
Posts: 4141
Credit: 33,602,411
RAC: 27,511
United Kingdom
Message 941966 - Posted: 22 Oct 2009, 16:25:28 UTC - in response to Message 941901.

You didn't say what BOINC version you're running, but I suspect it's 6.10.14,
as I've had 'NVIDIA GPU has become unusable; disabling tasks' as well.
Mine happened on Collatz Conjecture with a bit of IE browsing and downloading,
but it went away when I upgraded to 6.10.15. The relevant change is:

Rom 19 October 2009
- client: Use is_remote_desktop() instead of the various GPU functions to determine when the client software has been switched into Remote Desktop mode and shutsdown GPU apps. This will prevent App crashes

Claggy

52 Aces (Project donor)
Joined: 7 Jan 02
Posts: 497
Credit: 13,336,385
RAC: 3,340
United States
Message 942025 - Posted: 22 Oct 2009, 19:48:31 UTC - in response to Message 941966.

Thanks all, and Claggy.

Yes, I was on .14 and just updated to 6.10.15! Thanks, good find.


One other item I stumbled on last night while trying to solve this: I'm using a Gigabyte (P55-UD2) motherboard, and it comes with its own clocking utility (EasyTune6) to ease configuration of the Award BIOS settings. Although not all of its settings are part of the BIOS, it has a tab called "Graphics," and sure enough the values it showed were the slow ones that only appear after a GPU crash & recover. Although I could not set the values to the OC levels, I *COULD* raise them to the original out-of-the-box levels. So after a crash, GPU WUs take 22 minutes instead of 20 (which is much better than 2 hours).

Classic: those non-BIOS settings don't survive a reboot (though of course they do survive a GPU crash & driver recover), but I might be able to auto-load a profile file. I'll worry about that later. I thought I'd share this now, since this is a set of settings that really doesn't belong where it is, and that somehow inserted itself AHEAD of everything NVIDIA ships (albeit only in the scenario of a GPU crash).

X-Files 27
Joined: 17 May 99
Posts: 100
Credit: 107,862,964
RAC: 0
Canada
Message 942030 - Posted: 22 Oct 2009, 20:32:39 UTC - in response to Message 941928.

    What isn't fine is from then on out, at least until I reboot, CUDA jobs that used to run in 20 minutes will now run slow-mo taking over 2 hours.

It's because the card is running in 2D mode.

I always have this error with this setup:
GPU0: GTX295
-> SLI
GPU2: GTX295
-> PhysX
GPU1: GTX260


But when I use this setup, there's no crashing anymore:
GPU0: GTX295
-> PhysX
GPU2: GTX295
-> Extend monitor
GPU1: GTX260

Misfit
Volunteer tester
Joined: 21 Jun 01
Posts: 21790
Credit: 2,510,901
RAC: 0
United States
Message 942044 - Posted: 22 Oct 2009, 22:21:11 UTC - in response to Message 941966.

    You didn't say what BOINC version you're running, but I suspect it's 6.10.14,
    as I've had 'NVIDIA GPU has become unusable; disabling tasks' as well.
    Mine happened on Collatz Conjecture with a bit of IE browsing and downloading,
    but it went away when I upgraded to 6.10.15. The relevant change is:

    Rom 19 October 2009
    - client: Use is_remote_desktop() instead of the various GPU functions to determine when the client software has been switched into Remote Desktop mode and shutsdown GPU apps. This will prevent App crashes

    Claggy

Looks like I'll have to upgrade. I had the exact same crash (trashed 3 GPUGrid units) yesterday while gaming. Currently using 6.6.36
____________

Join BOINC Synergy!

52 Aces (Project donor)
Joined: 7 Jan 02
Posts: 497
Credit: 13,336,385
RAC: 3,340
United States
Message 942067 - Posted: 23 Oct 2009, 0:11:09 UTC - in response to Message 942044.

    Looks like I'll have to upgrade. I had the exact same crash (trashed 3 GPUGrid units) yesterday while gaming. Currently using 6.6.36


Lucky you, looks like 6.10.16 just got released.

Misfit
Volunteer tester
Joined: 21 Jun 01
Posts: 21790
Credit: 2,510,901
RAC: 0
United States
Message 942721 - Posted: 25 Oct 2009, 8:28:09 UTC - in response to Message 942067.
Last modified: 25 Oct 2009, 8:28:57 UTC

I've upgraded. I was gaming with BOINC completely shut down. Still had the video driver crash.
Display driver nvlddmkm stopped responding and has successfully recovered. (Event ID 4101)
This has happened with the current drivers and latest previous drivers. I never suffered a video crash with SETI CUDA. The problems started a few days into GPU Grid (was gaming and crunching Grid at the same time.) So I'm wondering if a file somewhere has been corrupted.
____________

Join BOINC Synergy!

Chirag Patel
Volunteer tester
Joined: 13 Sep 05
Posts: 48
Credit: 8,442,837
RAC: 8,672
India
Message 942828 - Posted: 26 Oct 2009, 0:50:43 UTC
Last modified: 26 Oct 2009, 0:52:35 UTC

It's happened to me before: the GPU clocks drop from 550/1375/900 MHz (core/shaders/memory) to 383/767/301 MHz and stay there, which is why WUs take considerably longer. Unfortunately, the only solution I know of is to restart. The drivers do this when the card gets too hot and/or you overclock too far.

You can use GPU-Z to check whether the clocks go down (under the Sensors tab).
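
A rough sketch of that check in Python, assuming you read the clocks off GPU-Z by hand and type them in. The 5% tolerance and the function names are my own choices for illustration, and the numbers are just the ones quoted above.

```python
# Clocks in MHz for each domain; values quoted from the post above.
NORMAL = {"core": 550, "shader": 1375, "memory": 900}
AFTER_CRASH = {"core": 383, "shader": 767, "memory": 301}


def clock_ratios(normal, current):
    """Return current/normal for each clock domain."""
    return {name: current[name] / normal[name] for name in normal}


def is_throttled(normal, current, tolerance=0.05):
    """True if any clock is more than `tolerance` below its normal value."""
    ratios = clock_ratios(normal, current)
    return any(ratio < 1.0 - tolerance for ratio in ratios.values())


if __name__ == "__main__":
    for name, ratio in clock_ratios(NORMAL, AFTER_CRASH).items():
        print(f"{name}: {ratio:.0%} of normal")
    print("throttled:", is_throttled(NORMAL, AFTER_CRASH))
```

With these numbers the memory clock falls the furthest, to about a third of its normal value.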


Copyright © 2014 University of California