Safe despite errors? (nvlddmkm)

[ue] Alex
Joined: 3 Apr 99
Posts: 9
Credit: 1,026,736
RAC: 0
Finland
Message 850490 - Posted: 7 Jan 2009, 17:39:34 UTC
Last modified: 7 Jan 2009, 17:39:57 UTC

I am getting constant nvlddmkm errors. My graphics card, a GTX 280, stops functioning because of that driver, starts again, and the process repeats until my whole system reboots.

All I want to confirm is whether this is caused by the data in the WU itself, a driver issue, a graphics card issue, or my optimized client... and will this damage my card?

Basically, is the GPU client safe to run?
ID: 850490
Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 850497 - Posted: 7 Jan 2009, 17:49:04 UTC

It is the opinion of many here that, since CUDA is returning invalid work, it should not be used. The final answer to your question is, of course, up to you.
Boinc....Boinc....Boinc....Boinc....
ID: 850497
[ue] Alex
Joined: 3 Apr 99
Posts: 9
Credit: 1,026,736
RAC: 0
Finland
Message 850498 - Posted: 7 Jan 2009, 17:50:59 UTC - in response to Message 850497.  
Last modified: 7 Jan 2009, 17:51:06 UTC

Yes, I looked through these forums. I was just wondering if anyone had an educated best guess. I know running this kind of stuff 24 hours a day is taxing on any system.
I think I'll go back to CPUs for a while.

Thanks
ID: 850498
David @ TPS
Joined: 30 Sep 04
Posts: 70
Credit: 11,323,275
RAC: 0
United States
Message 850500 - Posted: 7 Jan 2009, 17:54:02 UTC - in response to Message 850490.  
Last modified: 7 Jan 2009, 17:55:35 UTC

....... Basically is the GPU client safe to run?



Not safe for others who have CPUs, in my opinion.

...... as I am setting No New Tasks on some boxes.........

(Have 10 day cache, so if it gets fixed before then I can re-load)
ID: 850500
Golden_Frog
Volunteer tester
Joined: 28 Oct 99
Posts: 27
Credit: 1,650,057
RAC: 0
United States
Message 850506 - Posted: 7 Jan 2009, 18:00:03 UTC
Last modified: 7 Jan 2009, 18:01:33 UTC

My best bet is that the video driver is corrupted. I had the same issue with the driver crashing on my 8800GS box. Even after a driver sweep I couldn't get it fixed. I ended up reformatting and downgrading from Vista 64-bit to Vista 32-bit. This seems to have fixed the issue, as I have been crunching error-free for 2 days now.
ID: 850506
skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 850508 - Posted: 7 Jan 2009, 18:02:15 UTC - in response to Message 850490.  
Last modified: 7 Jan 2009, 18:13:57 UTC

Safe for your PC? Your SETI work? Any time your PC restarts suddenly, it's bad.

Possible solution http://www.eggheadcafe.com/software/aspnet/29415832/display-driver-nvlddmkm-s.aspx?

As always, Google is your friend


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope
ID: 850508
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 850564 - Posted: 7 Jan 2009, 20:47:34 UTC - in response to Message 850490.  

I am getting constant nvlddmkm errors. My graphics card, a GTX 280, stops functioning because of that driver, starts again, and the process repeats until my whole system reboots.

All I want to confirm is whether this is caused by the data in the WU itself, a driver issue, a graphics card issue, or my optimized client... and will this damage my card?

Basically, is the GPU client safe to run?

"Safe" is, as always, relative.

If, in your particular environment, it causes you to lose your video card, and you do any other work on that machine that could be lost in a crash, then I would likely recommend against it until you find the root cause and fix it. I'd start with drivers.

The other question, which some have raised, is "safety" with regard to the work being done.

Matt reports that 3% of the results being returned are from CUDA, with 97% logically being returned by CPU apps.

I don't know what percentage of the CUDA work is valid, and what percentage isn't, and I don't think anyone else does either.

But, in order for a "bad" CUDA result to make it to the science database, it has to match up with another, identically bad CUDA result. In other words, not only bad results, but bad in the same way.

The odds of a work unit being assigned to two CUDA hosts are 3% * 3%, or about 0.09%. If we assume that every CUDA result is bad (which probably isn't true) and that bad CUDA results are consistent, then the "threat" is 0.09%.

The rest get caught and filtered out by the validator.
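
The estimate above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, assuming two results per work unit and independent assignment of hosts; the function name and its parameters are hypothetical and are not part of any BOINC or SETI@home code.

```python
def worst_case_bad_pair_rate(cuda_share, bad_rate=1.0, agree_rate=1.0):
    """Estimate the fraction of work units validated from two matching bad CUDA results.

    cuda_share: fraction of all results returned by CUDA hosts (0.03 in the post).
    bad_rate:   assumed fraction of CUDA results that are bad (1.0 = worst case).
    agree_rate: assumed chance that two bad CUDA results match each other.
    """
    both_cuda = cuda_share * cuda_share   # both results come from CUDA hosts
    both_bad = bad_rate * bad_rate        # and both of them are bad
    return both_cuda * both_bad * agree_rate


if __name__ == "__main__":
    # Worst case from the post: every CUDA result is bad and bad results agree.
    print(f"worst case: {worst_case_bad_pair_rate(0.03):.2%}")
    # Hypothetical milder assumption: only half of the CUDA results are bad.
    print(f"milder case: {worst_case_bad_pair_rate(0.03, bad_rate=0.5):.4%}")
```

Setting `bad_rate` or `agree_rate` below 1.0 can only shrink the figure, which is why the 0.09% value is a pessimistic bound under these assumptions.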

Since that is the most pessimistic number, and we probably are getting valid CUDA results, I'd say that CUDA is fairly safe from a science standpoint.

... and that the "bad" results are very helpful to the project as a whole right now.

-- Ned
ID: 850564
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 851226 - Posted: 9 Jan 2009, 9:42:55 UTC - in response to Message 850508.  

Safe for your PC? Your SETI work? Any time your PC restarts suddenly, it's bad.

Possible solution http://www.eggheadcafe.com/software/aspnet/29415832/display-driver-nvlddmkm-s.aspx?

As always, Google is your friend

I checked this solution; it's not the case on my host, at least with the 181.20 driver. It updates nvlddmkm.sys in system32/drivers properly. Maybe nVidia has already repaired this driver installer bug.
ID: 851226
David
Joined: 8 Aug 00
Posts: 20
Credit: 301,705
RAC: 0
Message 867445 - Posted: 20 Feb 2009, 21:17:36 UTC - in response to Message 851226.  

I have also been having this issue for quite some time, and it only BSODs with the nvlddmkm.sys error when BOINC is running.

This error still occurs even with the latest Nvidia drivers, Forceware 182.06 WHQL (I've tried many versions).

Any resolution for this by chance?

Thanks!

ID: 867445
Imannotu
Joined: 9 Aug 08
Posts: 1
Credit: 205,956
RAC: 0
United States
Message 867468 - Posted: 20 Feb 2009, 23:11:56 UTC
Last modified: 20 Feb 2009, 23:19:00 UTC

OK, I know this sounds crazy, but I actually had this same driver crash happen in video games, so I searched around game makers' websites and found out it was a problem with overheating. I use EVGA Precision (because EVGA made my card) and found that this crash only happened when my card reached 80-90 Celsius; then either the driver or the computer would crash. I found out this was a defense mechanism put in place by Nvidia so that, in the case of overheating, your computer would stop its GPU-intensive activities and allow the fan time to cool your card.

So for a while I turned up the fan and kept the temperature low manually. But just when I thought I was going out of my mind, Nvidia and subsequently EVGA released a new driver that solved the fan regulation issue. Now my card stays at 50-63 on full load and has no problems.

I also had this problem with BOINC, so I turned the fan up and that solved the issue. However, if you are not able to control the fan, you could also try lowering the load on your card by turning down the % of GPU BOINC can use. And as always, update to the newest drivers.

Hope this helps.
ID: 867468
Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 867475 - Posted: 21 Feb 2009, 0:01:32 UTC - in response to Message 867468.  

Hi, IMHO most graphics cards run too hot, too easily.
When they are also being used at 100% and maybe overclocked, they certainly get too hot unless they are actively cooled and the case has sufficient airflow. Otherwise, it's doubtful that they will have a long life.
(I've seen cards with burnt memory chips, for example.)

I'm sure there are backplanes for PCI Express too, the only way to run more than 3 cards at a time.
I'm pretty sure of the future of parallel computing on graphics cards; everything is difficult in the beginning, with a lot of failures too.
But look at the number 1 in RAC: sure, a powerful CPU, a Q9770 @ 4 GHz, and 6 nVIDIA GTX 295 cards.
The biggest part of the computation comes from these graphics cards.
Probably 7-8K for the CPU and ~16K for the 6 GTX 295, a beautiful piece of hardware, IMHO. :)

ID: 867475
David
Joined: 8 Aug 00
Posts: 20
Credit: 301,705
RAC: 0
Message 867756 - Posted: 21 Feb 2009, 21:35:45 UTC - in response to Message 867475.  


I hate to say it, but in this case it IS NOT that the cards are running too hot.

This will happen when BOINC has just barely started to crunch numbers (I do not have the screensaver enabled). The cards' heat levels stay pretty much near nominal levels.

Again, it only happens with BOINC. I can run massively graphics-intensive games with NO crashes...ever.

So I've narrowed it down to BOINC, and no heat issues.
ID: 867756
Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 867790 - Posted: 21 Feb 2009, 23:03:09 UTC - in response to Message 867756.  
Last modified: 21 Feb 2009, 23:04:39 UTC

The GTX 200 series of cards run very hot by themselves. The added problem is that they blow the hot air out the backplate, heating up half the memory that's there. And that goes pretty quickly with CUDA. If you do not have sufficient airflow behind your computer, you're prone to heat related crashes.

Run GPU-Z to check your actual temperatures. If need be, run it with logging on.
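
Once a temperature log exists, a short script can summarize it. The sketch below is a hedged illustration only: it assumes the log is comma-separated text with a header row, and the column name is passed in by the caller, since the actual GPU-Z log layout may differ. The function name is hypothetical, and the 105 C default is simply the safe-mode figure quoted elsewhere in this thread for the GTX 280.

```python
import csv
import io


def summarize_temperatures(log_text, column, limit_c=105.0):
    """Return (min, max, readings_at_or_above_limit) for one temperature column.

    log_text: contents of a comma-separated log with a header row (assumed layout).
    column:   header name of the temperature column (must match the log exactly).
    limit_c:  temperature treated as the danger threshold, in degrees C.
    """
    reader = csv.DictReader(io.StringIO(log_text), skipinitialspace=True)
    temps = []
    for row in reader:
        value = (row.get(column) or "").strip()
        if not value:
            continue              # blank or missing cell
        try:
            temps.append(float(value))
        except ValueError:
            continue              # skip rows that are not numeric
    if not temps:
        raise ValueError(f"no readings found in column {column!r}")
    over = sum(1 for t in temps if t >= limit_c)
    return min(temps), max(temps), over


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = "time,gpu_temp_c\n12:00:00,69\n12:00:01,81\n12:00:02,84\n"
    low, high, over = summarize_temperatures(sample, "gpu_temp_c")
    print(f"min {low} C, max {high} C, readings at or above limit: {over}")
```

If the maximum stays well below the limit while the crashes continue, heat becomes a less likely explanation.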
ID: 867790
David
Joined: 8 Aug 00
Posts: 20
Credit: 301,705
RAC: 0
Message 868292 - Posted: 22 Feb 2009, 23:39:07 UTC - in response to Message 867790.  

Again, it's not the GPU's fault. Nor are the GPU temps anything that would not be expected under a heavy load.

For the GTX 280 at idle you can expect a temperature of 50-55 degrees C. Pretty normal. At 100%, the temperatures normally settle at 85 degrees C, but nowhere near the 105 degrees C threshold for the GPU to jump into safe mode.

My GPUs with BOINC running usually run between 69 degrees C and 81 degrees C under full load. The fans don't even kick up to 100%, but settle in at around 51%.

This is WELL within norms.

Again, I play extremely graphics-intensive games that frequently push the GPUs to the limit and have the fans running at 100%, but it NEVER crashes. EVER.

Let me repeat this again, the crash occurs ONLY when BOINC is running tasks.

It never crashes any other time....NEVER.

The crash is directly related to BOINC.


ID: 868292
Misfit
Volunteer tester
Joined: 21 Jun 01
Posts: 21804
Credit: 2,815,091
RAC: 0
United States
Message 868297 - Posted: 22 Feb 2009, 23:49:18 UTC

ID: 868297
David
Joined: 8 Aug 00
Posts: 20
Credit: 301,705
RAC: 0
Message 868302 - Posted: 23 Feb 2009, 0:01:12 UTC - in response to Message 868297.  


Again, none of these apply in this case, as this issue only occurs with BOINC running tasks.

Also, I long ago installed the latest drivers, and patches.

After a lot of trial and error, testing, etc...I've tracked this down to ONLY BOINC running tasks, and no other app or game on my system EVER causes this issue.

Why do I have to keep repeating this?
ID: 868302
Misfit
Volunteer tester
Joined: 21 Jun 01
Posts: 21804
Credit: 2,815,091
RAC: 0
United States
Message 868307 - Posted: 23 Feb 2009, 0:19:50 UTC - in response to Message 868302.  
Last modified: 23 Feb 2009, 0:30:14 UTC

Why do I have to keep repeating this?

BOINC message board <-- try there. Otherwise your only solution is to stop using BOINC. No more need for repetition.

Edit: The latest driver versions were released recently; since you updated long ago, you'll need to update again.
me@rescam.org
ID: 868307
David
Joined: 8 Aug 00
Posts: 20
Credit: 301,705
RAC: 0
Message 868311 - Posted: 23 Feb 2009, 0:35:06 UTC - in response to Message 868307.  


Brilliant. The suggestion is to stop using BOINC. Wow.....

I am running the latest drivers, I've said that over and over again as well.

Geesh...sad.
ID: 868311
Fish
Volunteer tester
Joined: 4 Oct 00
Posts: 35
Credit: 2,051,424
RAC: 0
United States
Message 868314 - Posted: 23 Feb 2009, 0:44:05 UTC

The only combo that has worked for me in 64bit without any issues... 6.5.0/181.22



Fish
ID: 868314
Misfit
Volunteer tester
Joined: 21 Jun 01
Posts: 21804
Credit: 2,815,091
RAC: 0
United States
Message 868317 - Posted: 23 Feb 2009, 0:47:41 UTC - in response to Message 868311.  

Brilliant. The suggestion is to stop using BOINC. Wow.....

I am running the latest drivers, I've said that over and over again as well.

Geesh...sad.

to quote:
Also, I long ago installed the latest drivers, and patches.

So in your mind, any date after Feb 18, when the latest drivers were made available, is LONG AGO.

Brilliant that you finally posted at the BOINC board. Sad that you didn't appreciate the help people were trying to provide for you here. :(
me@rescam.org
ID: 868317

©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.