CUDA WU completing too fast and not validating

Questions and Answers : GPU applications : CUDA WU completing too fast and not validating
agnawt
Joined: 25 Jun 07
Posts: 15
Credit: 4,838,223
RAC: 0
Israel
Message 915843 - Posted: 8 Jul 2009, 20:59:46 UTC
Last modified: 8 Jul 2009, 21:10:27 UTC

Hi all,

I have temporarily suspended all CUDA WUs until I can figure out what is going on with my graphics card. The CUDA WUs start processing, reach 1% or so, then suddenly jump to 100% after around 45 seconds. They appear to complete without errors, but once uploaded they eventually fail validation.

I have never seen a client-side "Computation error" except for one CUDA WU that I terminated through the task manager after it was paused. For some reason that paused WU prevented GPU-Z from starting up.

Here is some GFX card info (all stock clocks):



I am running Windows 7 Ultimate x64 RC1 with a Q6700 @ 3.0 GHz and 4 GB of DDR2-800. I also have the latest Lunatics apps installed to maximize performance.

As far as I can tell, all CPU WUs run without any issues. My system is otherwise *rock solid*, and I never experience any suspicious problems. I did once see a CUDA WU complete with no issues. That was when I had under-clocked the GPU to 525/1295 MHz and the RAM to 825 MHz. Unfortunately I have not seen that one miracle repeat itself, even with my card continuously under-clocked.

I have tried disabling Aero, because it used to raise hell with Gelato (yeah, I know, it was pre-CUDA) getting exclusive access to the GPU. Unfortunately that had no effect.

I immediately began to suspect that the card had memory issues, so I ran the memtestG80 app on a 256MB segment with Aero on and off. All tests pass without any errors (hundreds of iterations) except the "Random blocks" test, which reports absurd numbers such as 1543229761 errors (219 ms), and the count climbs even higher after each iteration. I am hoping there is something wrong with the test; I find it hard to believe that it can address that many random memory blocks in an iteration lasting about two tenths of a second. Then again, I am not well versed in CUDA and its capabilities with 128 stream processors connected via a wide bus to amazingly fast DRAM, so I will leave speculation to those who know more :-)
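One way to sanity-check the reported error count is to compare it with the size of the tested region. The following is a minimal sketch, assuming the 256MB segment means 256 MiB; how memtestG80 actually counts errors (per byte, per 32-bit word, or per bit) is not stated in this thread, so all three candidates are shown as assumptions.

```python
# Sanity check on the "Random blocks" error count quoted in the post.
# Assumptions (not confirmed in the thread): the segment is 256 MiB, and
# errors may be counted per byte, per 32-bit word, or per bit.

REGION_BYTES = 256 * 1024 * 1024   # tested segment size
REPORTED_ERRORS = 1_543_229_761    # figure quoted in the post
ITERATION_SECONDS = 0.219          # 219 ms per iteration


def errors_per_unit(errors, region_bytes):
    """Divide the reported error count by the number of each candidate unit."""
    units = {
        "byte": region_bytes,
        "32-bit word": region_bytes // 4,
        "bit": region_bytes * 8,
    }
    return {name: errors / count for name, count in units.items()}


ratios = errors_per_unit(REPORTED_ERRORS, REGION_BYTES)
for name, ratio in ratios.items():
    print(f"errors per {name}: {ratio:.3f}")

# Implied error rate if the whole count really accrued in one iteration.
print(f"errors per second: {REPORTED_ERRORS / ITERATION_SECONDS:.3e}")
```

With these figures the count exceeds the number of bytes and of 32-bit words in the region, but not the number of bits. So the counter may tally individual bit flips, may revisit each location several times per iteration, or may itself be unreliable; the thread does not settle which.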

In the meantime I will also be swapping my 525W PSU for another 550W model just in case the 12V rails have gone flaky.

In which direction should I look besides what I have already done? Please feel free to request additional relevant system info if it could help with a diagnosis. I would post a link to all my bad CUDA WUs, but that system is currently disabled.

[edit]
I just noticed the instructions about posting in this sub-forum regarding modified app versions. I will stress that this problem happened before I installed the optimized apps and is one of the reasons why I installed them in the first place.
[/edit]
ID: 915843
Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 915868 - Posted: 8 Jul 2009, 21:46:06 UTC - in response to Message 915843.  
Last modified: 8 Jul 2009, 21:49:09 UTC

Try different drivers, like 185.85, and make sure BOINC isn't installed as a service. ;-)

Claggy
ID: 915868
agnawt
Joined: 25 Jun 07
Posts: 15
Credit: 4,838,223
RAC: 0
Israel
Message 915870 - Posted: 8 Jul 2009, 22:03:45 UTC - in response to Message 915868.  

I actually had 185.85 installed before I upgraded to 186.18 (was hoping it would make things better with this issue). BOINC is not installed as a service.
ID: 915870
Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 915904 - Posted: 8 Jul 2009, 23:15:57 UTC - in response to Message 915870.  

I'm thinking you might have a hardware problem. What were your GPU temps with CUDA running? My 9800GTX+ temps are 63°C to 65°C, 45°C idle.
You could try running Furmark over an extended period of time and see what that brings up.

The next things you could try are another card, the same card in a different BOINC PC, or a fresh install of XP or Vista on a spare hard drive to see if it works then.

Claggy
ID: 915904
agnawt
Joined: 25 Jun 07
Posts: 15
Credit: 4,838,223
RAC: 0
Israel
Message 915960 - Posted: 9 Jul 2009, 1:07:24 UTC - in response to Message 915904.  
Last modified: 9 Jul 2009, 1:08:16 UTC

I was getting around 80°C running Furmark in Xtreme Burning Mode, which reached a maximum of 83°C after 25 minutes and seemed to settle there for the next 15 minutes. Definitely warm, but not terrible imo. It could indicate a power issue. My case's cooling is excellent and ambient temps are around 23-27°C. This is a very fresh (and careful) install of Windows 7, so I don't think it's an OS issue. I could be wrong, of course; I might go there if all else fails.

I will swap the PSU as I said. Are there reliable error and stability tests for CUDA 2.2? If not, I will try compiling and running some CUDA samples. I might even write a simple (yet vicious) error checker of my own. If I still have problems after all that, I will check whether I have the same issue with a spare 8800GT.
ID: 915960
Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 916234 - Posted: 9 Jul 2009, 18:59:15 UTC

Here's another CUDA user, Simon, whose Intel i7 with three GTX 295s isn't validating. He has hundreds of tasks that are invalid or validation inconclusive.

Pending tasks for computer 4813080

Claggy
ID: 916234
Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 916339 - Posted: 9 Jul 2009, 22:16:29 UTC - in response to Message 916234.  

{shrug}

Underpowered machine perhaps? What kind of PSU does one need for 8 CPUs and 6 GPUs? A 1,200W?
ID: 916339
Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 916360 - Posted: 10 Jul 2009, 0:01:30 UTC
Last modified: 10 Jul 2009, 0:01:54 UTC

ID: 916360
agnawt
Joined: 25 Jun 07
Posts: 15
Credit: 4,838,223
RAC: 0
Israel
Message 917456 - Posted: 13 Jul 2009, 21:15:12 UTC
Last modified: 13 Jul 2009, 21:40:11 UTC

In the end it seems to be a problem with the 9800GTX card. I tested another card, an 8800GT, in the problem PC and all seems to be good. Clearly it was not in any way the fault of the SETI CUDA app. That became even more apparent when only a third of the CUDA SDK demos could run without failing, with other interesting errors along the way, such as my screen "rotting" pixel by pixel until I rebooted, among other craziness.

Tested that same 9800GTX in another machine with a different motherboard, CPU and PSU and still had the same problem. I'd like to find out exactly what the issue is, but at the moment I'll chalk it up to bad DRAM although it could be anything on the card. That or maybe 12V @ 30amps is not enough for the greedy bastard.

I have never had any issues whatsoever with any demanding game/app etc. Never a pixel out of place. I know that CUDA is rightly very sensitive to whatever the problem is compared to when only graphical processing is occurring. There could be hundreds of errors every second and I would never notice the pixel or 1000 of them with a shade shifted one bit to the left or right etc or some slightly bad normal mapping. Still, it is annoying.

I remember there was something in Rivatuner that allowed you to unlock Shader Processors. Might try to disable/enable such blocks to see if the GPU itself is bad. Think that was back in the GF 7 era - dunno if it applies to the unified shader architectures.

In closing: this is 100% *not* a seti@home issue ;-) Thanks for all the feedback and suggestions anyway. Trying to find a bright side: now I have a good excuse for a newer card with CUDA compute capability 1.3.
ID: 917456
Gundolf Jahn
Joined: 19 Sep 00
Posts: 3184
Credit: 446,358
RAC: 0
Germany
Message 917461 - Posted: 13 Jul 2009, 21:38:54 UTC - in response to Message 917456.  

agnawt wrote:

...I'd like to find out exactly what the issue is, but at the moment I'll chalk it up to bad DRAM although it could be anything on the card...

Did you check the capacitors on the card? See this message.

Gruß,
Gundolf
ID: 917461
agnawt
Joined: 25 Jun 07
Posts: 15
Credit: 4,838,223
RAC: 0
Israel
Message 917463 - Posted: 13 Jul 2009, 21:47:34 UTC - in response to Message 917461.  

I did not - did not even consider it. Last time I replaced a bad capacitor on a pc component was a 486 motherboard with a Cyrix sx-33 CPU :-), and that was done by a pro with all the right tools.

They are all hidden inside the massive case, although some are visible along the far edge. Thanks for the idea, although I won't be opening it unless EVGA tell me to shove my warranty (if I still have one).
ID: 917463
Simon
Joined: 13 Aug 99
Posts: 11
Credit: 18,874,447
RAC: 0
United Kingdom
Message 917488 - Posted: 14 Jul 2009, 0:19:29 UTC - in response to Message 916234.  
Last modified: 14 Jul 2009, 1:16:16 UTC

Claggy wrote:

Here's another Cuda user called Simon whose Intel i7 with 3 GTX 295's isn't validating, he's got 100's of tasks that are invalid or validation inconclusive.

Pending tasks for computer 4813080

Claggy



Hi,

Thanks for the PM, it looks like your suggestion is correct.

I added a third card (bought second-hand) last month, and looking at the stderr output, all the errors are related to one device, dev4. If the devices are numbered dev0/1/2/3/4/5, that is the first GPU on the new GTX 295.

Here is a sample with a couple of points highlighted,

setiathome_CUDA: CUDA Device 4 specified, checking...
Device 4: GeForce GTX 295 is okay
SETI@home using CUDA accelerated device GeForce GTX 295
V10 modification by Raistmer
Priority of worker thread rised successfully
Priority of process adjusted successfully
Total GPU memory 939261952 free GPU memory 889634560
setiathome_enhanced 6.02 Visual Studio/Microsoft C++

Build features: Non-graphics VLAR autokill enabled FFTW x86
CPUID: Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz

Cache: L1=64K L2=256K

CPU features: FPU TSC PAE CMPXCHG8B APIC SYSENTER MTRR CMOV/CCMP MMX FXSAVE/FXRSTOR SSE SSE2 HT SSE3
libboinc: 6.4.5

Work Unit Info:
...............
WU true angle range is : 0.433579
SETI@Home Informational message -9 result_overflow
NOTE: The number of results detected exceeds the storage space allocated.

Flopcounter: 204568130.735477

Spike count: 23
Pulse count: 8
Triplet count: 0
Gaussian count: 0

Wall-clock time elapsed since last restart: 41.5 seconds
called boinc_finish

</stderr_txt>
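To check whether bad results cluster on one device, the relevant lines of each task's stderr can be extracted and tallied. The following is a minimal sketch, not part of any BOINC or SETI@home tool: the function names are hypothetical, the patterns are copied from the sample above, and the card mapping assumes devices are enumerated card by card with two GPUs per GTX 295, which the thread suggests but does not confirm.

```python
import re
from collections import Counter

# Patterns copied from the stderr sample quoted in this post.
DEVICE_RE = re.compile(r"CUDA Device (\d+) specified")
WALL_RE = re.compile(r"Wall-clock time elapsed since last restart: ([\d.]+) seconds")


def summarize(stderr_text):
    """Extract device index, overflow flag, and wall-clock time from one task."""
    device = DEVICE_RE.search(stderr_text)
    wall = WALL_RE.search(stderr_text)
    return {
        "device": int(device.group(1)) if device else None,
        "overflow": "result_overflow" in stderr_text,
        "wall_seconds": float(wall.group(1)) if wall else None,
    }


def overflow_counts(stderr_texts):
    """Count result_overflow tasks per CUDA device index."""
    counts = Counter()
    for text in stderr_texts:
        info = summarize(text)
        if info["overflow"] and info["device"] is not None:
            counts[info["device"]] += 1
    return counts


def card_and_gpu(device_index, gpus_per_card=2):
    """Assumed mapping: devices are numbered card by card (hypothetical)."""
    return divmod(device_index, gpus_per_card)


SAMPLE = """setiathome_CUDA: CUDA Device 4 specified, checking...
Device 4: GeForce GTX 295 is okay
SETI@Home Informational message -9 result_overflow
Wall-clock time elapsed since last restart: 41.5 seconds
called boinc_finish
"""

info = summarize(SAMPLE)
print(info)
print(overflow_counts([SAMPLE]))
print(card_and_gpu(info["device"]))  # (card index, GPU index on that card)
```

A single result_overflow can also come from a healthy card processing a noisy work unit, so one hit proves little. What points at hardware here is many such results, each finishing in well under a minute, all reported by the same device.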


I was due to pull it apart at the weekend to re-site one of the radiators, so it looks like I will be pulling the card out as well. I will try to disable that card in Device Manager or shut it down until it's replaced.

Cheers, Simon.

PS. Disabling the GTX from Device Manager seems to have isolated the problem for now. Temps are consistent across the cards at 60°C, so it seems as though a hardware problem is causing this spate of bad results.
ID: 917488
Simon
Joined: 13 Aug 99
Posts: 11
Credit: 18,874,447
RAC: 0
United Kingdom
Message 917497 - Posted: 14 Jul 2009, 1:29:29 UTC - in response to Message 916339.  

Jord wrote:

{shrug}

Underpowered machine perhaps? What kind of PSU does one need for 8 CPUs and 6 GPUs? A 1,200W?


Hi,

I'm currently using a 1500 W unit now that the three cards are installed, but I used to run the pair with a 1000 W unit.

I took the decision when the third card went in to turn off hyperthreading. I felt it was too much to ask a quad core to run 8 tasks and still efficiently schedule work for the 6 GPUs, so it's 'only' 4 CPUs and 6 GPUs, although it's now back to 4 GPUs until I can sort out the existing fault (see above).

Cheers, Simon.

ID: 917497
Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 917585 - Posted: 14 Jul 2009, 12:16:33 UTC - in response to Message 917497.  
Last modified: 14 Jul 2009, 12:18:10 UTC

Hi, all the quads I use have 550 W PSUs. The only one with an overclocked QX9650 (@ 3.6 GHz) and an MSI 9800GTX (not overclocked) also has a 550 W PSU.

I can clearly see on a watt-meter that the system (CPU plus RAM, HD, and DVD on an ASUS P5E motherboard) uses 150 W, and when a CUDA task (GPUGRID and SETI) kicks in it draws 285 W.
The efficiency factor went up to 0.98 when CUDA is active.
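The power figures quoted in this thread reduce to simple arithmetic. The sketch below is only indicative: the watt-meter readings and the 12 V @ 30 A rail mentioned earlier come from different machines, wall readings include PSU losses, and the 12 V rail also feeds components other than the graphics card. The function names are made up for illustration.

```python
# Figures quoted in this thread; wall readings include PSU losses.
BASE_WALL_WATTS = 150   # watt-meter reading without a CUDA task
CUDA_WALL_WATTS = 285   # watt-meter reading with a CUDA task on the 9800GTX
RAIL_VOLTS = 12.0       # 12 V rail mentioned earlier in the thread
RAIL_AMPS = 30.0        # rated current of that rail


def rail_capacity_watts(volts, amps):
    """Maximum DC power a rail can deliver: P = V * I."""
    return volts * amps


def added_draw_watts(base, loaded):
    """Extra wall power attributed to the added load."""
    return loaded - base


rail = rail_capacity_watts(RAIL_VOLTS, RAIL_AMPS)
extra = added_draw_watts(BASE_WALL_WATTS, CUDA_WALL_WATTS)
print(f"12 V rail capacity: {rail:.0f} W")
print(f"extra wall draw with CUDA active: {extra} W")
```

On these numbers one card adds well under half of what a single 30 A rail can supply, which is why a per-card fault looks more likely than a plain shortage of power, though it does not rule out a degraded PSU.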

It could be that 3 9800GTX cards cause an unbalanced load on the PSU.
I'm waiting on another 9800GTX and will test whether it works better alone in another PC, or plugged into the PC that already has one 9800GTX card.

If it's only 1 card, could be the card itself. (Or the PSU, MoBo?)
ID: 917585
Simon
Joined: 13 Aug 99
Posts: 11
Credit: 18,874,447
RAC: 0
United Kingdom
Message 918181 - Posted: 15 Jul 2009, 19:31:33 UTC - in response to Message 917585.  
Last modified: 15 Jul 2009, 19:32:09 UTC

Hi,

I think I may have found the problem with my 295: it was a corrupt BIOS on one of my cards. I don't know how it got that way, because it was OK after the initial install. It took me five attempts to reflash it back to normal, and I only managed it after forcing the installed BIOS to be wiped clean.

Here is a link to the GPU-Z output. The problem cards are the middle pair, which show incorrectly reported memory capacity and bus width along with a missing BIOS version.
http://www.sjbentley.btinternet.co.uk/295.jpg

and back to normal
http://www.sjbentley.btinternet.co.uk/2952.jpg

I hope to be able to test it as soon as my system manages to connect and download some WUs.

Cheers, Simon.
ID: 918181


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.