Q about nvidia crash & recovery.
Author | Message |
---|---|
52 Aces Send message Joined: 7 Jan 02 Posts: 497 Credit: 14,261,068 RAC: 67 |
I'm running an EVGA NVIDIA GTS 250 (512 MB, driver 19107). Sometimes, if I launch a game on my crunch box while SETI is running, the graphics card blanks-n-tanks, I get a message that the driver has recovered, and the current CUDA WU terminates with a "Computational Error." All of that is fine. What isn't fine is that from then on, at least until I reboot, CUDA jobs that used to run in 20 minutes run in slow motion and take over 2 hours. Has anyone seen this, and do you know of any way other than rebooting the box to get the NVIDIA card to re-initialize properly? A full exit of BOINC and its services does not resolve it, so this is buried someplace deeper. I've tried the obvious things in the NVIDIA applet, in EVGA Precision, and in Windows proper.
BOINC Messages:
10/21/2009 9:08:53 PM NVIDIA GPU has become unusable; disabling tasks
10/21/2009 9:08:55 PM NVIDIA GPU has become usable; enabling tasks
Windows Event Viewer:
Display driver nvlddmkm stopped responding and has successfully recovered.
|
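For anyone who wants to spot these resets without watching the Messages tab, here is a minimal Python sketch that pairs each "unusable" line with the following "usable" line. It assumes the log text has the same timestamp and wording as the two BOINC messages quoted above; the function name `gpu_recovery_events` is made up for illustration and is not part of BOINC.

```python
import re
from datetime import datetime

# Matches lines shaped like the two BOINC messages quoted in the post above.
# Assumption: timestamps look like "10/21/2009 9:08:53 PM".
PATTERN = re.compile(
    r"(\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M)\s+"
    r"NVIDIA GPU has become (unusable|usable)"
)

def gpu_recovery_events(log_text):
    """Return (lost_at, back_at) datetime pairs, one per unusable -> usable transition."""
    events = []
    lost_at = None
    for stamp, state in PATTERN.findall(log_text):
        when = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
        if state == "unusable":
            lost_at = when
        elif lost_at is not None:
            events.append((lost_at, when))
            lost_at = None
    return events

sample = (
    "10/21/2009 9:08:53 PM NVIDIA GPU has become unusable; disabling tasks\n"
    "10/21/2009 9:08:55 PM NVIDIA GPU has become usable; enabling tasks\n"
)
for lost, back in gpu_recovery_events(sample):
    print(f"GPU reset at {lost}, back after {(back - lost).total_seconds():.0f} s")
```

Any task started after such a pair is a candidate for the slow runs described above, so it is worth checking run times from that point on.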
Gundolf Jahn Send message Joined: 19 Sep 00 Posts: 3184 Credit: 446,358 RAC: 0 |
"What isn't fine is from then on out, at least until I reboot, CUDA jobs that used to run in 20 minutes will now run slow-mo taking over 2 hours." That's because all tasks are running in CPU-fallback mode until you reboot and thus reinitialise your graphics device. As far as I know, there's no other way than rebooting. Regards, Gundolf. Computers aren't everything in life. (Just kidding.) SETI@home classic workunits: 3,758. SETI@home classic CPU time: 66,520 hours |
Jord Send message Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3 |
A full power cycle of the hardware, i.e. a reboot or a power down/power up, is the only way to reinitialize stuck hardware. |
Claggy Send message Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
You didn't say what BOINC version you're running, but I suspect it's 6.10.14, as I've had 'NVIDIA GPU has become unusable; disabling tasks' as well. Mine was on Collatz Conjecture with a bit of IE browsing and downloading, but it went away when I upgraded to 6.10.15. The relevant change is: "Rom 19 October 2009 - client: Use is_remote_desktop() instead of the various GPU functions to determine when the client software has been switched into Remote Desktop mode and shutsdown GPU apps. This will prevent App crashes" Claggy |
52 Aces Send message Joined: 7 Jan 02 Posts: 497 Credit: 14,261,068 RAC: 67 |
Thanks all, and Claggy. Yes, I was on .14 and have just updated to 6.10.15! Thanks, good find. One other item I stumbled on last night while trying to solve this: I'm using a Gigabyte (P55-UD2) motherboard, and it comes with its own clocking utility (EasyTune6) to ease configuration of the Award BIOS settings. Not all of its settings are part of the BIOS; it has a tab called "Graphics," and sure enough the values it showed were the slow ones that only appear after a GPU crash and recovery. Although I could not set the values to the OC levels, I *could* raise them to the original out-of-the-box levels. So after a crash, GPU WUs take 22 minutes instead of 20 (which is much better than 2 hours). As is typical, those non-BIOS settings don't survive a reboot (though they do, of course, survive a GPU crash and driver recovery), but I might be able to auto-load a profile file; I'll worry about that later. I thought I'd share this now, since these settings really don't belong where they were and somehow inserted themselves ahead of everything NVIDIA ships (albeit only in the scenario of a GPU crash). |
X-Files 27 Send message Joined: 17 May 99 Posts: 104 Credit: 111,191,433 RAC: 0 |
"What isn't fine is from then on out, at least until I reboot, CUDA jobs that used to run in 20 minutes will now run slow-mo taking over 2 hours." It's because the card is running in 2D mode. I always get this error with this setup: GPU0: GTX295 -> SLI; GPU2: GTX295 -> PhysX; GPU1: GTX260. But with this setup there is no crashing anymore: GPU0: GTX295 -> PhysX; GPU2: GTX295 -> Extend monitor; GPU1: GTX260 |
Misfit Send message Joined: 21 Jun 01 Posts: 21804 Credit: 2,815,091 RAC: 0 |
"You didn't say what Boinc version you're running, but i suspect it's 6.10.14," Looks like I'll have to upgrade. I had the exact same crash (trashed 3 GPUGrid units) yesterday while gaming. Currently using 6.6.36. me@rescam.org |
52 Aces Send message Joined: 7 Jan 02 Posts: 497 Credit: 14,261,068 RAC: 67 |
"Looks like I'll have to upgrade. I had the exact same crash (trashed 3 GPUGrid units) yesterday while gaming. Currently using 6.6.36" Lucky you, looks like 6.10.16 just got released. |
Misfit Send message Joined: 21 Jun 01 Posts: 21804 Credit: 2,815,091 RAC: 0 |
I've upgraded. I was gaming with BOINC completely shut down and still had the video driver crash: "Display driver nvlddmkm stopped responding and has successfully recovered." (Event ID 4101). This has happened with the current drivers and with the latest previous drivers. I never suffered a video crash with SETI CUDA. The problems started a few days into GPUGrid (I was gaming and crunching GPUGrid at the same time). So I'm wondering if a file somewhere has been corrupted. me@rescam.org |
jenesuispasbavard Send message Joined: 13 Sep 05 Posts: 49 Credit: 12,385,974 RAC: 0 |
It's happened to me before: the GPU clocks drop from 550/1375/900 MHz (core/shaders/memory) to 383/767/301 MHz and stay there, which is why WUs take considerably longer. Unfortunately, the only solution I know of is to restart. The drivers do this when the card gets too hot and/or you overclock too far. You can use GPU-Z to check whether the clocks have gone down (under the Sensors tab). |
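To put numbers on that clock drop, here is a small Python sketch that compares a set of readings against the expected clocks. The two clock triples are the ones quoted in the post above (they belong to that poster's card, not necessarily a GTS 250); the 10% tolerance and both function names are assumptions made for illustration. Reading the clocks is not automated here: the values would be typed in by hand from a tool such as GPU-Z.

```python
# Clock triples are (core, shader, memory) in MHz, taken from the post above.
NORMAL = (550, 1375, 900)
AFTER_RESET = (383, 767, 301)

def is_throttled(current, expected, tolerance=0.9):
    """True if any clock is below `tolerance` times its expected value.

    The 0.9 default is an arbitrary illustrative threshold.
    """
    return any(c < e * tolerance for c, e in zip(current, expected))

def slowdown_factors(current, expected):
    """Expected/current ratio for each clock domain, rounded to 2 decimals."""
    return tuple(round(e / c, 2) for c, e in zip(current, expected))

print(is_throttled(AFTER_RESET, NORMAL))      # True
print(slowdown_factors(AFTER_RESET, NORMAL))  # (1.44, 1.79, 2.99)
```

By this rough measure the memory clock takes the biggest hit, at about a third of its normal speed. The ratios alone do not account for the 20 minutes to 2 hours slowdown reported at the top of the thread, so they should be read as a symptom check rather than a run-time prediction.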
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.