Some puzzle...

Message boards : Number crunching : Some puzzle...
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1320735 - Posted: 28 Dec 2012, 9:50:36 UTC

About my unstable NV host again...
It has now entered a period of increased instability again.
Blue screens or reboots happen almost immediately after login, or even before login (BOINC runs as a service).
This happens under both installed OSes, Win2003 Server x64 and Win7 x64.

I booted Win7 into safe mode and moved BOINC's data folder, so I was able to reboot into Win2003 Server without a BSoD.
Then I started testing the GPU with MSI Afterburner.
The burn test (like FurMark) ran for ~10 minutes; GPU temperature rose above 70C, GPU load was 98% or more, one CPU core was completely busy... and there were no BSoDs or restarts.

But when I restored the BOINC setup (configured to run 1 CPU core + GPU), a BSoD happened almost immediately.

So, the puzzle is: in what way does the system load from FurMark/MSI Afterburner differ so radically from the BOINC load?

IMHO the power draw from the PSU should be even higher with the burn-in test...
Unfortunately, I can't measure power directly, but GPU temperature was lower with the CUDA app....


SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1320735
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1320740 - Posted: 28 Dec 2012, 10:16:53 UTC - in response to Message 1320735.  
Last modified: 28 Dec 2012, 10:23:46 UTC

Assuming the card or anything else isn't broken: if you have applied Windows updates since June/July this year, then you have a fairly major technology mismatch (as far as Cuda is concerned) between Windows and the old driver. There are substantial changes to texture/font cache management, most of which would be resolved by using the newest WHQL driver [clean install advanced option] & the x41zc public beta application. These synchronisation issues aren't 'correctable' with the old setup, as they are deemed critical security issues (hence the BSOD), and are a function of the evolving landscape of GPGPU technology.

Happy new year,
Jason
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1320740
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1320748 - Posted: 28 Dec 2012, 11:01:37 UTC
Last modified: 28 Dec 2012, 11:09:50 UTC

The latest 310.70 NV driver has been getting good comments.
You might try a clean install of that on the Win 7 OS, as Jason suggested.

I updated my Win 7 rig (my daily driver) to it a few days ago, and it seems to be working very well. I am still running the x41z app, but will be updating that soon as well.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1320748
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1320750 - Posted: 28 Dec 2012, 11:08:42 UTC

In general, auto-update is disabled there on both OSes, but I can't be sure when they were last updated manually.
Since Win7 x64 is not the "production" OS there, I can experiment with it a little.

Until now it looked like a purely hardware issue (leaving the CPU completely idle usually decreased the frequency of BSoDs). But since it handles GPU burn-in tests quite OK...
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1320750
Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1320754 - Posted: 28 Dec 2012, 11:31:50 UTC - in response to Message 1320748.  

Running legacy GPUs on the 304.xx and later drivers introduces quite a slowdown on the x41 Cuda32 and Cuda42 apps, and the Cuda5 app is even slower; the Cuda22 and Cuda23 apps don't seem to be affected
(at least on my 9800GTX+ Win Vista x64 host; I haven't managed to get anyone with a GTX2** GPU to do similar benches yet). That's why my 9800GTX+ runs the 301.42 Cuda42 drivers; my last posted bench: Message 1284483

Claggy
ID: 1320754
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1320778 - Posted: 28 Dec 2012, 12:37:03 UTC - in response to Message 1320754.  

Since I'm going to update the driver on Win7 anyway, I can run the tests on a GTX260.

SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1320778
Profile TRuEQ & TuVaLu
Volunteer tester
Joined: 4 Oct 99
Posts: 505
Credit: 69,523,653
RAC: 10
Sweden
Message 1320789 - Posted: 28 Dec 2012, 13:18:50 UTC

Does the GTX260 machine work with the stock app without problems?

You might try driver 306.97.
ID: 1320789
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1320796 - Posted: 28 Dec 2012, 13:59:05 UTC

Completely disabling CPU work again makes the host much more stable. Looks like it's a hardware problem after all. I need some CPU burn-in tests to check.

SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1320796
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1320958 - Posted: 28 Dec 2012, 21:14:56 UTC - in response to Message 1320754.  

I believe Jason recommends 2.3 for 200 series cards.

"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1320958
Profile zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65709
Credit: 55,293,173
RAC: 49
United States
Message 1320962 - Posted: 28 Dec 2012, 21:26:09 UTC
Last modified: 28 Dec 2012, 21:27:15 UTC

The cpu as an i/o device for the gpu, sigh.

I tried 310.70, the beta version, and got a BSOD. It might have been a problem with my hardware before I thoroughly cleaned out the PC. I'm now running 306.97 x64 on Win 7 Pro x64 with x41zc. I see regular slowdowns, from around 10 minutes to about 18-20 minutes per WU crunched, and temps fall from the low-to-mid 70s to the mid 60s when this happens. I'm also using BOINC 6.10.58 x64 and BoincTasks 1.44 x64. I don't know if the author sees this as important or not, but it should be looked into. All this happens on an EVGA GTX590 Classified (model #1598, in fact). I run from 7pm to 7am in the winter and 8pm to 8am the rest of the time, with the fan at 100% (not a mere 95%) using Precision X 3.04. I do have clean 12V power going to the PCIe bus, as I have an EVGA Power Booster x1 PCIe card in place, so plenty of power is going to the GTX590. I get driver crashes once a day when doing Seti, but the card just picks up and just keeps going. This isn't a complaint...
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 1320962
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1321002 - Posted: 28 Dec 2012, 22:43:27 UTC - in response to Message 1320962.  
Last modified: 28 Dec 2012, 22:44:08 UTC

"I get driver crashes once a day when doing Seti, but the card just picks up and just keeps going."


Did you try increasing the watchdog timer value via the Windows registry?
A driver restart can be caused simply by a too lengthy kernel call (or sequence of kernel calls). If so, increasing the timer value will solve the problem, or at least make driver restarts less frequent.
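
As a rough illustration of why a long kernel call matters, the Python sketch below splits a batch of GPU work into launches short enough to stay under a watchdog limit. It is not code from any SETI@home app: the function name, the per-unit timing and the safety margin are all hypothetical, and the 2000 ms default only mirrors the stock Windows TdrDelay of 2 seconds.

```python
import math


def plan_launches(total_units, ms_per_unit, watchdog_ms=2000, safety=0.5):
    """Split total_units of GPU work into launches that each finish
    well inside the watchdog limit. All inputs are illustrative."""
    # Time budget per launch, kept as whole milliseconds
    budget_ms = int(watchdog_ms * safety)
    # At least one unit per launch, even if a single unit exceeds the budget
    units_per_launch = max(1, budget_ms // ms_per_unit)
    launches = math.ceil(total_units / units_per_launch)
    return units_per_launch, launches


if __name__ == "__main__":
    # 1000 units at 20 ms each would run for 20 s as a single launch
    per_launch, launches = plan_launches(1000, 20)
    print(f"{per_launch} units per launch, {launches} launches")
```

Raising the watchdog value enlarges the budget per launch; shrinking the launch size reaches the same goal from the application side.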
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1321002
Profile zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65709
Credit: 55,293,173
RAC: 49
United States
Message 1321018 - Posted: 28 Dec 2012, 23:01:09 UTC - in response to Message 1321002.  

If this is the DCI value of 7, then what do you suggest, Raistmer? Would 60 be alright?

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers\DCI\

This is the only 'timeout' that I see in this area.
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 1321018
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1321072 - Posted: 29 Dec 2012, 1:08:00 UTC
Last modified: 29 Dec 2012, 1:10:12 UTC

Two things apply there:

1) TDR only applies to GPUs with an active display connected. So if the issue is TDR related (at all), it should only show on particular GPUs with a monitor connected. There are inbuilt settings in x41zc for controlling both process priority and the pf* launch settings (pfblockspersm, pfperiodsperlaunch), either globally or for individual GPUs. To use them you create a mbcuda.cfg text file in the project directory & reference it in the app_info.xml, as per the provided example mbcuda.cfg. As the default settings are conservative (for Pre-Fermi belownormal, pfblockspersm=1, pf=100), I doubt this is an issue unless there is something particularly unusual about the particular system, but a stripped down example to reduce the settings while retaining abovenormal process priority would look like this for global control:

[mbcuda]
processpriority = abovenormal
pfblockspersm = 1
pfperiodsperlaunch = 20


or for a specific GPU (Cuda 3.2 build or higher required), with slot and bus determined from the stderr device listing:
[mbcuda]
processpriority = abovenormal
pfblockspersm = 1
pfperiodsperlaunch = 100
[bus1slot0]
processpriority = abovenormal
pfblockspersm = 1
pfperiodsperlaunch = 10


2) The TDR period is pretty long by default on XP, something like 10 seconds. As this hasn't been reported by others to manifest with default settings on newer OSes, which have a much shorter TDR timeout period, I would recommend investigating/diagnosing all hardware and BIOS settings in detail, as well as applying the reduced settings from 1) while diagnosing.
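
A minimal sketch of how the per-GPU layout from 1) could be read, using Python's standard configparser. The rule that a [busNslotM] section overrides the global [mbcuda] values is an assumption made for illustration, not a description of how x41zc actually parses the file, and effective_settings is a hypothetical helper.

```python
import configparser

# Same layout as the per-GPU example in the post
EXAMPLE_CFG = """\
[mbcuda]
processpriority = abovenormal
pfblockspersm = 1
pfperiodsperlaunch = 100
[bus1slot0]
processpriority = abovenormal
pfblockspersm = 1
pfperiodsperlaunch = 10
"""


def effective_settings(cfg_text, bus, slot):
    """Return the settings for one GPU: global [mbcuda] values,
    overridden by its [busNslotM] section when one exists.
    The override rule is an assumption, not confirmed app behaviour."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    settings = dict(parser["mbcuda"]) if parser.has_section("mbcuda") else {}
    section = f"bus{bus}slot{slot}"
    if parser.has_section(section):
        settings.update(parser[section])
    return settings


if __name__ == "__main__":
    print(effective_settings(EXAMPLE_CFG, 1, 0))  # GPU with its own section
    print(effective_settings(EXAMPLE_CFG, 0, 0))  # GPU using global values
```

All values come back as strings, so a real reader would still need to convert pfblockspersm and pfperiodsperlaunch to integers.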

HTH
Jason
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1321072
Profile zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65709
Credit: 55,293,173
RAC: 49
United States
Message 1321084 - Posted: 29 Dec 2012, 1:41:47 UTC
Last modified: 29 Dec 2012, 1:42:15 UTC

In Windows 7 x64 the Timeout is set at 7; I set it to 60. I also did the following:

What I did was add the following DWORDs to the registry (using "regedit"): HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers\ [added "TdrLevel=0" and "TdrDelay=10"] && HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers\Timeout [changed "Timeout" value to 0x60]


I put both newly created 64 bit Dwords in the same folder as the DCI Timeout; afterwards I rebooted the PC into Windows 7 Pro x64. If one has a 32-bit Windows OS, one would by default use 32-bit Dwords. Whether this is the right place or not, I don't know.

http://stackoverflow.com/questions/10272513/cuda-nvidia-driver-crash-while-running
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 1321084
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1321205 - Posted: 29 Dec 2012, 9:14:50 UTC - in response to Message 1320796.  

"Completely disabling CPU work again makes the host much more stable. Looks like it's a hardware problem after all. I need some CPU burn-in tests to check."


As well, as a side note on the original thread issues:
I had a stark reminder today on my i5 w/GTX560ti, with a BSOD, that it needed a cleanout & reapplication of heatsink goo. As it uses the stock heatsink which IMO is too small, any kind of paste tends to dry out over a few months, so needing a good going through. Combined with several months of dust bunnies that was enough for its only issues.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1321205
kittyman (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1321214 - Posted: 29 Dec 2012, 10:12:13 UTC - in response to Message 1321205.  

It's winter here now, and the crunchers heat my house...
But during the summer months, any time I get a rig that starts to act up in any way.......the first thing I do is shut it down and clean the kitty furs out of the heat sinks. Many times, that is all that is wrong.
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1321214
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1323615 - Posted: 2 Jan 2013, 14:56:31 UTC - in response to Message 1321084.  

Here is what AMD recommends to do to disable watchdog timer under Vista:


Under Windows Vista, to prevent long programs from causing a dialog to be displayed
indicating that the display driver has stopped responding, disable the Vista Timeout Detection
and Recovery (TDR) feature, which is trying to detect hangs in graphics hardware. To do this,
use regedit.exe to create the following REG_DWORD entry in the registry, and set its value to 0:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers\TdrLevel
This avoids the constant polling by the driver and the kernel to prevent long work units from
monopolizing the device. (To restore default functionality, set the TdrLevel to 3.)
Note that Microsoft strongly discourages disabling this feature, and only recommends doing
so for debugging purposes. Do so at your own risk.


But, as Jason stated, try tuning the app first. This measure is just to check whether a too long kernel call is part of the problem or not.
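
For anyone who prefers a .reg file over editing by hand, here is a hedged Python sketch that only builds the file text for the TdrLevel and TdrDelay REG_DWORD values under the GraphicsDrivers key quoted above. tdr_reg_file is a hypothetical helper; it does not touch the registry itself, and the output should be reviewed before importing it with regedit.

```python
def tdr_reg_file(tdr_level=None, tdr_delay=None):
    """Build .reg file text that sets TdrLevel and/or TdrDelay.
    Values left as None are not written. Nothing is applied here;
    the caller decides whether to save and import the text."""
    key = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
    lines = ["Windows Registry Editor Version 5.00", "", f"[{key}]"]
    if tdr_level is not None:
        # REG_DWORD values are written as 8 hex digits
        lines.append(f'"TdrLevel"=dword:{tdr_level:08x}')
    if tdr_delay is not None:
        lines.append(f'"TdrDelay"=dword:{tdr_delay:08x}')
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    # Lengthen the timeout to 10 seconds instead of disabling detection
    print(tdr_reg_file(tdr_delay=10))
```

Passing tdr_level=0 produces the entry from the AMD text, and tdr_level=3 the entry that restores the default behaviour.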
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1323615
Profile zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65709
Credit: 55,293,173
RAC: 49
United States
Message 1323622 - Posted: 2 Jan 2013, 15:29:43 UTC - in response to Message 1323615.  

I went with what I found and haven't had a single video driver crash since, so I won't disable anything further; I'm happy with how the PC is set up now.
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 1323622

©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.