Some puzzle...


Message boards : Number crunching : Some puzzle...

Author Message
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3588
Credit: 48,740,604
RAC: 24,422
Russia
Message 1320735 - Posted: 28 Dec 2012, 9:50:36 UTC

About my unstable NV host again...
It has now entered a period of increased instability again.
Blue screens or reboots happen almost immediately after login, or even before login (BOINC runs as a service).
And this happens under both installed OSes, Win2003 Server x64 and Win7 x64.

I booted Win7 into safe mode and moved BOINC's data folder, so I was able to reboot into Win2003 Server without a BSoD.
Then I started testing the GPU with MSI Afterburner.
The burn test (like FurMark) ran ~10 mins, GPU temp rose above 70C, GPU load was 98% or more, one CPU core was completely busy... and no BSoDs/restarts.

But when I restored the BOINC setup (which is configured to run 1 CPU core + GPU), a BSoD happened almost immediately.

So, the puzzle is: in what way does the system load from FurMark/MSI Afterburner differ so radically from the BOINC load?

IMHO power draw from the PSU should be even higher with the burn-in test...
Unfortunately, I can't measure power directly, but GPU temperature was lower with the CUDA app....


____________

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 5079
Credit: 74,107,638
RAC: 5,940
Australia
Message 1320740 - Posted: 28 Dec 2012, 10:16:53 UTC - in response to Message 1320735.
Last modified: 28 Dec 2012, 10:23:46 UTC

Assuming the card or anything else isn't broken: if you have applied Windows updates since June/July this year, then you have a fairly major technology mismatch (as far as Cuda is concerned) between Windows and the old driver. There are substantial changes to texture/font cache management, most of which would be resolved by using the newest WHQL driver [clean install advanced option] & the x41zc public beta application. These synchronisation issues aren't 'correctable' using the old setup, as they are deemed critical security issues (hence the BSOD), and are a function of the evolving landscape of gpgpu technology.

Happy new year,
Jason
____________
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3588
Credit: 48,740,604
RAC: 24,422
Russia
Message 1320750 - Posted: 28 Dec 2012, 11:08:42 UTC

In general, auto-update is disabled there on both OSes, but I can't be sure when it was last updated manually.
Because Win7 x64 is not the "production" OS there, I can experiment with it a little.

Until now it looked like a purely hardware issue (leaving the CPU completely idle usually decreased the frequency of BSoDs). But since it handles burn-in GPU tests quite OK...
____________

Claggy
Project donor
Volunteer tester
Joined: 5 Jul 99
Posts: 4209
Credit: 34,468,854
RAC: 18,804
United Kingdom
Message 1320754 - Posted: 28 Dec 2012, 11:31:50 UTC - in response to Message 1320748.

The latest 310.70 NV driver has been getting good comments.
You might try a clean install of that on the Win 7 OS, as Jason suggested.

I updated my Win 7 rig (my daily driver) to it a few days ago, and it seems to be working very well. I am still running the x41z app, but will be updating that soon as well.

Running legacy GPUs on the 304.xx and later drivers introduces quite a slowdown on the x41 Cuda32 and Cuda42 apps, while the Cuda5 app is even slower; the Cuda22 and Cuda23 apps don't seem to be affected (at least on my 9800GTX+ Win Vista x64 host; I haven't managed to get anyone with a GTX2** GPU to do similar benches yet).
That's why my 9800GTX+ runs the 301.42 Cuda42 drivers. My last posted bench: Message 1284483

Claggy

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3588
Credit: 48,740,604
RAC: 24,422
Russia
Message 1320778 - Posted: 28 Dec 2012, 12:37:03 UTC - in response to Message 1320754.

Since I'm going to do a driver update for Win7 anyway, I can do the tests on the GTX260.

____________

TRuEQ & TuVaLu
Volunteer tester
Joined: 4 Oct 99
Posts: 479
Credit: 19,963,078
RAC: 19,958
Sweden
Message 1320789 - Posted: 28 Dec 2012, 13:18:50 UTC

Does the 260 comp work with stock app without problem???

Might try driver 306.97

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3588
Credit: 48,740,604
RAC: 24,422
Russia
Message 1320796 - Posted: 28 Dec 2012, 13:59:05 UTC

Completely disabling the CPU again makes the host much more stable. Looks like it's a hardware problem after all. Need some burn-in CPU tests to check.

____________

zoom314
Project donor
Joined: 30 Nov 03
Posts: 46757
Credit: 36,999,451
RAC: 3,420
United States
Message 1320962 - Posted: 28 Dec 2012, 21:26:09 UTC
Last modified: 28 Dec 2012, 21:27:15 UTC

The cpu as an i/o device for the gpu, sigh.

I'd tried 310.70, the beta version, and I'd gotten a BSOD; it might have been a problem with my hardware before I thoroughly cleaned out the PC. I'm running 306.97 x64 on Win 7 Pro x64 with x41zc, and I see regular slowdowns from around 10 minutes to about 18-20 minutes per WU crunched; temps fall from the low-to-mid 70s to the mid 60s when this happens. I'm also using BOINC 6.10.58 x64 and BoincTasks 1.44 x64. I don't know if the author sees this as important or not, but it should be looked into.

All this happens on an EVGA GTX590 Classified (model #1598, in fact). I run from 7pm to 7am in the winter and 8pm to 8am the rest of the time, with the fan at 100% (not a mere 95%) using Precision X 3.04. I do have clean 12V power going to the PCIe bus, as I have an EVGA Power Booster x1 PCIe card in place, so there is plenty of power going to the GTX590. I get driver crashes once a day when doing Seti, but the card just picks up and keeps going. This isn't a complaint...
____________
My Facebook, War Commander, 2015

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3588
Credit: 48,740,604
RAC: 24,422
Russia
Message 1321002 - Posted: 28 Dec 2012, 22:43:27 UTC - in response to Message 1320962.
Last modified: 28 Dec 2012, 22:44:08 UTC

I get driver crashes once a day when doing Seti, but the card just picks up and just keeps going.


Did you try increasing the watchdog timer value via the Windows registry?
A driver restart can be caused simply by a kernel call (or sequence of kernel calls) that runs too long. If so, increasing that timer value will either solve the problem or make the driver restarts less frequent.
____________

zoom314
Project donor
Joined: 30 Nov 03
Posts: 46757
Credit: 36,999,451
RAC: 3,420
United States
Message 1321018 - Posted: 28 Dec 2012, 23:01:09 UTC - in response to Message 1321002.

I get driver crashes once a day when doing Seti, but the card just picks up and just keeps going.


Did you try increasing the watchdog timer value via the Windows registry?
A driver restart can be caused simply by a kernel call (or sequence of kernel calls) that runs too long. If so, increasing that timer value will either solve the problem or make the driver restarts less frequent.

If this is the DCI value of 7, then what do you suggest, Raistmer? Would 60 be all right?

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers\DCI\

This is the only 'timeout' that I see in this area.
____________
My Facebook, War Commander, 2015

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 5079
Credit: 74,107,638
RAC: 5,940
Australia
Message 1321072 - Posted: 29 Dec 2012, 1:08:00 UTC
Last modified: 29 Dec 2012, 1:10:12 UTC

Two things apply there:

1) TDR only applies to GPUs with an active display connected. So if the issue is TDR related (at all), it should only show on the particular GPUs with a monitor connected. x41zc has built-in settings for controlling process priority and the pf values (pfblockspersm, pfperiodsperlaunch), either globally or for individual GPUs. To use them, create an mbcuda.cfg text file in the project directory and reference it in the app_info.xml, as per the provided example mbcuda.cfg. As the default settings are conservative (for Pre-Fermi: belownormal, pfblockspersm=1, pf=100), I doubt this is an issue unless there is something particularly unusual about the particular system, but a stripped-down example to reduce the settings while retaining abovenormal process priority would look like this for global control:

[mbcuda]
processpriority = abovenormal
pfblockspersm = 1
pfperiodsperlaunch = 20


or for a specific GPU (Cuda 3.2 build or higher required), with slot and bus determined from the stderr device listing:

[mbcuda]
processpriority = abovenormal
pfblockspersm = 1
pfperiodsperlaunch = 100

[bus1slot0]
processpriority = abovenormal
pfblockspersm = 1
pfperiodsperlaunch = 10


2) The TDR period is pretty long by default on XP, like 10 seconds or something. As others haven't reported this showing up with default settings on newer OSes, which have a much shorter TDR timeout period, I would recommend investigating all hardware and BIOS settings in detail, as well as applying the reduced settings from 1) while diagnosing.
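
To make the global vs. per-GPU layout from point 1 concrete, here is a minimal Python sketch that reads an mbcuda.cfg-style text and works out the settings for one GPU. The precedence rule (a [busXslotY] section overriding [mbcuda]) is an assumption based on the example above, not something confirmed from the x41zc source, and the function name effective_settings is made up for illustration.

```python
import configparser

# Same layout as the per-GPU example above.
EXAMPLE_CFG = """
[mbcuda]
processpriority = abovenormal
pfblockspersm = 1
pfperiodsperlaunch = 100

[bus1slot0]
processpriority = abovenormal
pfblockspersm = 1
pfperiodsperlaunch = 10
"""


def effective_settings(cfg_text, bus, slot):
    """Return the settings for one GPU as a dict of strings.

    Assumed precedence (hypothetical): start from the global [mbcuda]
    section, then let a matching [busXslotY] section override it.
    """
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)

    settings = {}
    if parser.has_section("mbcuda"):
        settings.update(parser["mbcuda"])

    section = f"bus{bus}slot{slot}"
    if parser.has_section(section):
        settings.update(parser[section])
    return settings


if __name__ == "__main__":
    # GPU on bus 1, slot 0 has its own section in the example.
    print(effective_settings(EXAMPLE_CFG, 1, 0))
    # A GPU with no section of its own falls back to the global values.
    print(effective_settings(EXAMPLE_CFG, 2, 0))
```

With the example text, the GPU at bus 1 / slot 0 ends up with pfperiodsperlaunch = 10, while any other GPU keeps the global 100.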

HTH
Jason
____________
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin

zoom314
Project donor
Joined: 30 Nov 03
Posts: 46757
Credit: 36,999,451
RAC: 3,420
United States
Message 1321084 - Posted: 29 Dec 2012, 1:41:47 UTC
Last modified: 29 Dec 2012, 1:42:15 UTC

In Windows 7 x64 the Timeout is set at 7; I set it to 60. I also did the following:

what i did was adding to the registry (using "regedit") the following DWORDS: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers\ [added "TdrLevel=0" and "TdrDelay=10"] && HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers\Timeout [changed "Timeout" value to 0x60]


I put both newly created 64 bit Dwords in the same folder as DCI Timeout; afterwards I rebooted the PC into Windows 7 Pro x64. If one has a 32-bit Windows OS, one would by default use 32-bit Dwords. Whether this is the right place or not, I don't know.
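
One thing worth double-checking with values like "60" and "0x60": regedit's DWORD edit box has Hexadecimal selected by default, so the same typed digits can mean two different numbers. A small Python sketch of the arithmetic (the helper name regedit_value is made up for illustration):

```python
def regedit_value(typed, base="hex"):
    """Interpret digits typed into a DWORD edit box under the given base."""
    return int(typed, 16 if base == "hex" else 10)


if __name__ == "__main__":
    print(regedit_value("60", "hex"))      # 96: "60" read as hexadecimal
    print(regedit_value("60", "decimal"))  # 60: "60" read as decimal
    print(hex(60))                         # 0x3c: decimal 60 written in hex
```

So a Timeout shown as 0x60 is 96 in decimal, while a decimal 60 would be shown as 0x3c.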

http://stackoverflow.com/questions/10272513/cuda-nvidia-driver-crash-while-running
____________
My Facebook, War Commander, 2015

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 5079
Credit: 74,107,638
RAC: 5,940
Australia
Message 1321205 - Posted: 29 Dec 2012, 9:14:50 UTC - in response to Message 1320796.

Completely disabling the CPU again makes the host much more stable. Looks like it's a hardware problem after all. Need some burn-in CPU tests to check.


As well, as a side note on the original thread issues:
I had a stark reminder today on my i5 w/GTX560ti, in the form of a BSOD, that it needed a cleanout & reapplication of heatsink goo. As it uses the stock heatsink, which IMO is too small, any kind of paste tends to dry out over a few months, so it needs a good going-over now and then. Combined with several months of dust bunnies, that was enough to account for its only issues.
____________
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3588
Credit: 48,740,604
RAC: 24,422
Russia
Message 1323615 - Posted: 2 Jan 2013, 14:56:31 UTC - in response to Message 1321084.

In Windows 7 x64 the Timeout is set at 7; I set it to 60. I also did the following:

what i did was adding to the registry (using "regedit") the following DWORDS: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers\ [added "TdrLevel=0" and "TdrDelay=10"] && HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers\Timeout [changed "Timeout" value to 0x60]


I put both newly created 64 bit Dwords in the same folder as DCI Timeout; afterwards I rebooted the PC into Windows 7 Pro x64. If one has a 32-bit Windows OS, one would by default use 32-bit Dwords. Whether this is the right place or not, I don't know.

http://stackoverflow.com/questions/10272513/cuda-nvidia-driver-crash-while-running


Here is what AMD recommends to do to disable watchdog timer under Vista:


Under Windows Vista, to prevent long programs from causing a dialog to be displayed
indicating that the display driver has stopped responding, disable the Vista Timeout Detection
and Recovery (TDR) feature, which is trying to detect hangs in graphics hardware. To do this,
use regedit.exe to create the following REG_DWORD entry in the registry, and set its value to 0:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers\TdrLevel
This avoids the constant polling by the driver and the kernel to prevent long work units from
monopolizing the device. (To restore default functionality, set the TdrLevel to 3.)
Note that Microsoft strongly discourages disabling this feature, and only recommends doing
so for debugging purposes. Do so at your own risk.


But, as Jason stated, try to tune the app first. This measure is just a way to check whether an overly long kernel call is behind the problem or not.
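
For anyone who would rather lengthen the timeout than switch TDR off, here is a minimal Python sketch that only builds the text of a .reg file for REG_DWORD values under the GraphicsDrivers key. It does not touch the registry itself. The function name make_reg_file and the chosen value (TdrDelay of 10 seconds, as mentioned earlier in the thread) are illustrative assumptions, not a recommendation.

```python
# Key name as spelled in the AMD text above (note the "s" in GraphicsDrivers).
GRAPHICS_DRIVERS_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
)


def make_reg_file(values):
    """Build .reg file text that sets REG_DWORD values under GraphicsDrivers.

    `values` maps value names (e.g. "TdrDelay") to unsigned 32-bit integers.
    """
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{GRAPHICS_DRIVERS_KEY}]",
    ]
    for name, number in values.items():
        if not 0 <= number <= 0xFFFFFFFF:
            raise ValueError(f"{name}: {number} does not fit in a DWORD")
        # .reg files write DWORDs as eight lowercase hex digits.
        lines.append(f'"{name}"=dword:{number:08x}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative only: a 10-second TDR delay.
    print(make_reg_file({"TdrDelay": 10}))
```

The printed text can be saved with a .reg extension and reviewed before importing; importing needs administrator rights, and a reboot is normally required before the driver picks up the change.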
____________

zoom314
Project donor
Joined: 30 Nov 03
Posts: 46757
Credit: 36,999,451
RAC: 3,420
United States
Message 1323622 - Posted: 2 Jan 2013, 15:29:43 UTC - in response to Message 1323615.

I went with what I'd found, and I've not had one video driver crash since, so I won't be disabling anything; I'm happy with where the PC is set right now.
____________
My Facebook, War Commander, 2015


Copyright © 2014 University of California