GTX 295 coming back with computation errors


Spatzthecat
Joined: 31 Jul 03
Posts: 14
Credit: 6,514,745
RAC: 220
United Kingdom
Message 949994 - Posted: 26 Nov 2009, 21:42:31 UTC
Last modified: 26 Nov 2009, 21:43:11 UTC

Hi, I've recently installed a BFG GeForce GTX 295 H2OC and am getting computation errors on Device 1; Device 0 is OK. I've tried various drivers, but with no change.

I'm running CUDA 2.3 and the latest BOINC.
Can you help?
Regards
____________

Ageless
Joined: 9 Jun 99
Posts: 12284
Credit: 2,574,709
RAC: 731
Netherlands
Message 950007 - Posted: 26 Nov 2009, 22:11:49 UTC - in response to Message 949994.

See this FAQ for options. I'd skip 1 and 2 and go straight for 3.
____________
Jord

Fighting for the correct use of the apostrophe, together with Weird Al Yankovic

Fred W
Volunteer tester
Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 950019 - Posted: 26 Nov 2009, 23:04:39 UTC
Last modified: 26 Nov 2009, 23:05:15 UTC

Hope you have more luck than I had with my XFX 295, which has exactly the same symptoms (by the way, is yours the older two-card version or the later single-board one?). MemtestG80 invariably shows a memory fault within 500 iterations on Device 1 of my GTX (walking ones, IIRC), but both Scan (the supplier) and XFX say there is nothing wrong with it, and it cost me £20.00 to get it back. Scan admitted that they tested it only by running games on it, but I would have thought that XFX would have more advanced tools (I made it clear that it failed only when running demanding CUDA apps).

I have found that by manually setting the fan to 100%, underclocking by 10% (using EVGA Precision) and running with the side off the case, I can successfully run Milkyway CUDA without errors. This keeps the GPU temps in the low 70s C, whereas S@H CUDA takes the temps over 80 C, so it may be that the memory is temperature sensitive. Note that I had 6 months of running with no problems before these issues began.

Do let us know how you get on - and I hope yours did not come from Scan Computers, as you will get little help from them with this issue.

F.
____________

Spatzthecat
Joined: 31 Jul 03
Posts: 14
Credit: 6,514,745
RAC: 220
United Kingdom
Message 950131 - Posted: 27 Nov 2009, 9:55:29 UTC - in response to Message 950019.

Thank you all for responding.

It is the single-PCB version. I have another, which has been returned because it was leaking and will be replaced as and when, so there will be two in the machine at some stage.

I don't think temperature is the problem: working flat out, it tops out at about 51 C. I am using an AquaComputer Aquaduct 720 XT for the cooling, with a Heat Killer V3 CPU block. The GPU has a Danger Den block, pre-fitted by the manufacturer.
The rest of the computer is: i7 975, Asus P6T7 WS motherboard, Thermaltake ToughPower 1500 W PSU, 6 GB Kingston HyperX 1600 MHz memory, Lian Li PC P80B case.

____________

Fred W
Volunteer tester
Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 950500 - Posted: 28 Nov 2009, 15:18:15 UTC - in response to Message 950131.

With a rig like that, I assume that you are overclocking the CPU? If so, and this may seem bizarre, try reducing the CPU overclock to see if that reduces the errors. I would also try MemtestG80; BFG may be more sympathetic on RMA than XFX...

F.
____________

Spatzthecat
Joined: 31 Jul 03
Posts: 14
Credit: 6,514,745
RAC: 220
United Kingdom
Message 950776 - Posted: 29 Nov 2009, 12:16:50 UTC - in response to Message 950500.

I'm not really overclocking at the moment; I wanted to get everything running as it comes out of the box first. So 3.53 GHz is the highest the CPU has been clocked, and that is just the Asus Turbo (200 MHz). The errors occur even at the stock 3.33 GHz.
I purchased the cards from Overclockers.co.uk, who to date have been very good with any problems.
I will try MemtestG80 to see if it reports errors and will let you know.
Many thanks for the input.
____________

Spatzthecat
Joined: 31 Jul 03
Posts: 14
Credit: 6,514,745
RAC: 220
United Kingdom
Message 950782 - Posted: 29 Nov 2009, 12:57:32 UTC

Tried running MemtestG80 but get: This application has failed to start because cudart.dll was not found. Re-installing the application may fix this problem.

Do I need to re-install the CUDA drivers, or is the message referring to something else?
Regards
____________

Fred W
Volunteer tester
Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 950783 - Posted: 29 Nov 2009, 13:10:29 UTC - in response to Message 950782.
Last modified: 29 Nov 2009, 13:10:56 UTC

You need to copy the cufft.dll and cudart.dll files from your setiathome.berkeley.edu folder to the folder where you installed MemtestG80. Note that I have found it does not work with the later v2.3 DLLs; it needs the older versions: cufft.dll file size = 1148 kB, cudart.dll file size = 275 kB.
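
For anyone who would rather not compare file sizes by eye, here is a minimal Python sketch of the same check. It assumes the kB figures quoted above are the rounded-up sizes Windows Explorer shows, and that a size match is only a rough hint of the DLL version, not proof. The folder path and the helper names (`size_kb`, `check_dlls`) are made up for illustration.

```python
from pathlib import Path

# Sizes quoted in this thread for the older (pre-v2.3) DLLs, in kB.
# Assumption: these are rounded-up figures, as Windows Explorer displays them.
EXPECTED_KB = {"cudart.dll": 275, "cufft.dll": 1148}

def size_kb(path: Path) -> int:
    """File size in kB, rounded up to the next whole kB."""
    return -(-path.stat().st_size // 1024)

def check_dlls(folder: Path) -> dict:
    """Map each DLL name to 'missing', 'older version' or 'other version (N kB)'."""
    report = {}
    for name, expected in EXPECTED_KB.items():
        dll = folder / name
        if not dll.is_file():
            report[name] = "missing"
        elif size_kb(dll) == expected:
            report[name] = "older version"
        else:
            report[name] = f"other version ({size_kb(dll)} kB)"
    return report

if __name__ == "__main__":
    # Hypothetical install folder; change it to wherever MemtestG80 lives.
    folder = Path(r"C:\Program Files (x86)\MemtestG80")
    for name, status in check_dlls(folder).items():
        print(f"{name}: {status}")
```

Anything reported as "missing" or "other version" would be worth sorting out before running MemtestG80.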

F.
____________

Spatzthecat
Joined: 31 Jul 03
Posts: 14
Credit: 6,514,745
RAC: 220
United Kingdom
Message 950785 - Posted: 29 Nov 2009, 13:45:03 UTC - in response to Message 950783.

I don't appear to have cufft.dll in the folder.

Where can I get the older versions (cufft.dll file size = 1148 kB, cudart.dll file size = 275 kB)?
Regards
____________

Fred W
Volunteer tester
Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 950796 - Posted: 29 Nov 2009, 15:00:42 UTC - in response to Message 950785.

If you are crunching CUDA WUs then cudart.dll and cufft.dll must be on your machine. If you have only the v2.3 versions (larger file sizes), the older versions are available from NVIDIA by downloading the CUDA v2.2 SDK and extracting these two files. Alternatively, you can PM me your email address and I will send them over.

F.
____________

Gundolf Jahn
Joined: 19 Sep 00
Posts: 3184
Credit: 357,953
RAC: 37
Germany
Message 950807 - Posted: 29 Nov 2009, 16:53:37 UTC - in response to Message 950796.

If you are crunching CUDA WU's then cudart.dll and cufft.dll must be on your machine.

And if they are not, that would explain why your CUDA tasks error out.

Regards,
Gundolf
____________
Computers aren't everything in life. (Just kidding)

SETI@home classic workunits 3,758
SETI@home classic CPU time 66,520 hours

perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 15,318,105
RAC: 11,735
United States
Message 950812 - Posted: 29 Nov 2009, 17:20:38 UTC - in response to Message 950807.

Gundolf Jahn, you may have found it. This is the error message he is getting...

SETI@home using CUDA accelerated device GeForce GTX 295
Restarted at 1.00 percent.
Cuda error 'cufftExecC2C' in file 'c:/sw/gpgpu/seti/seti_boinc/client/cuda/cudaAcc_fft.cu' in line 63 : unspecified launch failure.
Cuda error 'cudaAcc_GetPowerSpectrum_kernel' in file 'c:/sw/gpgpu/seti/seti_boinc/client/cuda/cudaAcc_PowerSpectrum.cu' in line 56 : unspecified launch failure.
Cuda error 'cudaAcc_GetPowerSpectrum_kernel' in file 'c:/sw/gpgpu/seti/seti_boinc/client/cuda/cudaAcc_PowerSpectrum.cu' in line 56 : unspecified launch failure.
Cuda error 'cudaAcc_summax32_kernel' in file 'c:/sw/gpgpu/seti/seti_boinc/client/cuda/cudaAcc_summax.cu' in line 147 : unspecified launch failure.
Cuda error 'cudaAcc_summax32_kernel' in file 'c:/sw/gpgpu/seti/seti_boinc/client/cuda/cudaAcc_summax.cu' in line 147 : unspecified launch failure.
Cuda error 'cudaMemcpy(PowerSpectrumSumMax, dev_PowerSpectrumSumMax, cudaAcc_NumDataPoints / fftlen * sizeof(*dev_PowerSpectrumSumMax), cudaMemcpyDeviceToHost)' in file 'c:/sw/gpgpu/seti/seti_boinc/client/cuda/cudaAcc_summax.cu' in line 160 : unspecified launch fail
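
As a rough aid for reading output like this, here is a small Python sketch that pulls the failing call, source file and line number out of each "Cuda error" line. The regular expression is inferred only from the lines quoted above, and the function name `parse_cuda_errors` is made up for illustration. Once a kernel launch has failed, later CUDA calls in the same process usually report the same error, so the first entry is normally the one to look at.

```python
import re
from collections import Counter

# Pattern inferred from the error lines quoted in this thread (an assumption,
# not a documented format).
CUDA_ERROR = re.compile(r"Cuda error '(.+?)' in file '(.+?)' in line (\d+) : (.*)")

def parse_cuda_errors(stderr_text):
    """Return (call, source file name, line number, message) tuples in log order."""
    found = []
    for line in stderr_text.splitlines():
        m = CUDA_ERROR.search(line)
        if m:
            call, path, line_no, message = m.groups()
            found.append((call, path.rsplit("/", 1)[-1], int(line_no), message.strip()))
    return found

sample = """\
Restarted at 1.00 percent.
Cuda error 'cufftExecC2C' in file 'c:/sw/gpgpu/seti/seti_boinc/client/cuda/cudaAcc_fft.cu' in line 63 : unspecified launch failure.
Cuda error 'cudaAcc_summax32_kernel' in file 'c:/sw/gpgpu/seti/seti_boinc/client/cuda/cudaAcc_summax.cu' in line 147 : unspecified launch failure.
Cuda error 'cudaAcc_summax32_kernel' in file 'c:/sw/gpgpu/seti/seti_boinc/client/cuda/cudaAcc_summax.cu' in line 147 : unspecified launch failure.
"""
errors = parse_cuda_errors(sample)
print("first failure:", errors[0][0])   # first failure: cufftExecC2C
print(Counter(e[0] for e in errors).most_common())
```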
____________


PROUD MEMBER OF Team Starfire World BOINC

Spatzthecat
Joined: 31 Jul 03
Posts: 14
Credit: 6,514,745
RAC: 220
United Kingdom
Message 951052 - Posted: 30 Nov 2009, 12:23:54 UTC - in response to Message 950812.

Thanks for your efforts, guys. I am now running CUDA 2.2 with BOINC 6.10.18. If I enable multi-GPU mode, only one work unit runs at a time, and the results come back OK.
If I select "Do not use multi-GPU", then two units run but only one comes back OK.
I have run MemtestG80 in both multi-GPU and non-multi-GPU mode and am getting many hundreds of errors.
Regards
____________

Spatzthecat
Joined: 31 Jul 03
Posts: 14
Credit: 6,514,745
RAC: 220
United Kingdom
Message 951054 - Posted: 30 Nov 2009, 12:46:52 UTC - in response to Message 951052.

This is part of the test output, showing the first set of errors:

Test iteration 20 (GPU 0, 128 MiB): 0 errors so far
Moving Inversions (ones and zeros): 0 errors (15 ms)
Memtest86 Walking 8-bit: 0 errors (156 ms)
True Walking zeros (8-bit): 0 errors (79 ms)
True Walking ones (8-bit): 0 errors (78 ms)
Moving Inversions (random): 0 errors (15 ms)
Memtest86 Walking zeros (32-bit): 0 errors (313 ms)
Memtest86 Walking ones (32-bit): 0 errors (312 ms)
Random blocks: 0 errors (63 ms)
Memtest86 Modulo-20: 0 errors (937 ms)
Logic (one iteration): 0 errors (16 ms)
Logic (4 iterations): 0 errors (78 ms)
Logic (shared memory, one iteration): 0 errors (31 ms)
Logic (shared-memory, 4 iterations): 192 errors (125 ms)

Test iteration 21 (GPU 0, 128 MiB): 192 errors so far
Moving Inversions (ones and zeros): 0 errors (16 ms)
Memtest86 Walking 8-bit: 0 errors (172 ms)
True Walking zeros (8-bit): 0 errors (78 ms)
True Walking ones (8-bit): 0 errors (94 ms)
Moving Inversions (random): 0 errors (15 ms)
Memtest86 Walking zeros (32-bit): 0 errors (313 ms)
Memtest86 Walking ones (32-bit): 0 errors (297 ms)
Random blocks: 0 errors (78 ms)
Memtest86 Modulo-20: 0 errors (922 ms)
Logic (one iteration): 0 errors (31 ms)
Logic (4 iterations): 0 errors (78 ms)
Logic (shared memory, one iteration): 0 errors (31 ms)
Logic (shared-memory, 4 iterations): 0 errors (110 ms)
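
With 500 iterations the log gets long, so here is a minimal Python sketch for picking out only the tests that reported errors. The two regular expressions are inferred from the excerpt above and may not fit other MemtestG80 versions; the function name `failing_tests` is made up for illustration.

```python
import re
from collections import Counter

# Line formats inferred from the log excerpt in this thread (an assumption).
ITERATION = re.compile(r"^\s*Test iteration (\d+) \(GPU (\d+), (\d+) MiB\)")
RESULT = re.compile(r"^\s*(.+): (\d+) errors \((\d+) ms\)\s*$")

def failing_tests(log_text):
    """Return (errors per test name, list of (iteration, test name, errors))."""
    totals = Counter()
    hits = []
    iteration = None
    for line in log_text.splitlines():
        m = ITERATION.match(line)
        if m:
            iteration = int(m.group(1))
            continue
        m = RESULT.match(line)
        if m and int(m.group(2)) > 0:
            name, errors = m.group(1), int(m.group(2))
            totals[name] += errors
            hits.append((iteration, name, errors))
    return totals, hits

sample = """\
Test iteration 20 (GPU 0, 128 MiB): 0 errors so far
Logic (shared memory, one iteration): 0 errors (31 ms)
Logic (shared-memory, 4 iterations): 192 errors (125 ms)
Test iteration 21 (GPU 0, 128 MiB): 192 errors so far
Logic (shared-memory, 4 iterations): 0 errors (110 ms)
"""
totals, hits = failing_tests(sample)
print(hits)  # [(20, 'Logic (shared-memory, 4 iterations)', 192)]
```

Run against a full log, `totals` shows whether the errors cluster in one test (here the shared-memory logic test) or are spread across the memory pattern tests.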
____________

Profile Ageless
Joined: 9 Jun 99
Posts: 12284
Credit: 2,574,709
RAC: 731
Netherlands
Message 951083 - Posted: 30 Nov 2009, 14:36:48 UTC - in response to Message 951054.

Test iteration 20 (GPU 0, 128 MiB)

Why is it testing 128MB only? Or does it test memory per 128MB portion?

Did you copy the cufft.dll and cudart.dll files you need for MemtestG80 to your ..\BOINC\projects\setiathome.berkeley.edu\ directory already and run one GPU task?
____________
Jord

Fred W
Volunteer tester
Joined: 13 Jun 99
Posts: 2524
Credit: 11,954,210
RAC: 0
United Kingdom
Message 951097 - Posted: 30 Nov 2009, 15:07:15 UTC - in response to Message 951054.

I have these 2 batch files on my desktop for running MemtestG80 with the GTX295:

MemtestG80_0.bat contains

@echo off
rem Test GPU 0 (the default device): 768 MB, 500 iterations, output saved to a log.
echo Testing Core 0: 500 iterations...
cd /d "C:\Program Files (x86)\MemtestG80"
memtestG80 768 500 --bancomm > core0_log.txt
notepad "C:\Program Files (x86)\MemtestG80\core0_log.txt"


MemtestG80_1.bat contains

@echo off
rem Test GPU 1 (selected with -g 1): 768 MB, 500 iterations, output saved to a log.
echo Testing Core 1: 500 iterations...
cd /d "C:\Program Files (x86)\MemtestG80"
memtestG80 768 500 -g 1 -b > core1_log.txt
notepad "C:\Program Files (x86)\MemtestG80\core1_log.txt"

This tests 768 MB of memory on each GPU (the max you can get away with). It runs 500 iterations, so it takes a little while to complete.

Just double-click on both desktop icons very quickly and then wait for the test to run.
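
If double-clicking both icons quickly enough is awkward, the two runs could also be started together from a script. The Python sketch below is only an illustration: it uses the options shown in the batch files above (`-g` and `-b`), assumes that `-g 0` selects the first GPU, and assumes the executable is called memtestG80.exe and lives in the folder used above. The function names are made up.

```python
import subprocess
from pathlib import Path

# Assumed install folder, taken from the batch files above; adjust as needed.
MEMTEST_DIR = Path(r"C:\Program Files (x86)\MemtestG80")

def build_command(gpu, megabytes=768, iterations=500):
    """Command line for one GPU, using only options shown in this thread."""
    exe = str(MEMTEST_DIR / "memtestG80.exe")  # assumed executable name
    return [exe, str(megabytes), str(iterations), "-g", str(gpu), "-b"]

def run_both(gpus=(0, 1)):
    """Start one MemtestG80 per GPU at the same time, then wait for all of them."""
    jobs = []
    for gpu in gpus:
        log = open(MEMTEST_DIR / f"core{gpu}_log.txt", "w")
        proc = subprocess.Popen(build_command(gpu), cwd=MEMTEST_DIR,
                                stdout=log, stderr=subprocess.STDOUT)
        jobs.append((gpu, proc, log))
    exit_codes = {}
    for gpu, proc, log in jobs:
        exit_codes[gpu] = proc.wait()  # blocks until that GPU's test finishes
        log.close()
    return exit_codes

# Show the arguments that would be passed for GPU 1.
print(build_command(1)[1:])
# On the machine with the GTX 295, call run_both() to start both tests.
```

Both processes start within a fraction of a second of each other, so the two GPUs are loaded together, which is the point of the quick double-click.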

It's curious that the section of the log that you posted indicates a problem with GPU0 when you have experienced problems with GPU1. However, if you follow the above, then the two logs produced will give a better indication of where the problem is.

At first glance your GTX295 would seem to be a good candidate for RMA, but try the above and see what we can see from that.

F.
____________

Spatzthecat
Joined: 31 Jul 03
Posts: 14
Credit: 6,514,745
RAC: 220
United Kingdom
Message 951099 - Posted: 30 Nov 2009, 15:11:14 UTC - in response to Message 951083.

I reinstalled CUDA 2.2 and the DLL files were present, so I copied those to the directory where I installed MemtestG80 and ran the test.

I have it running with multi-GPU mode enabled. I am getting good results from this one GPU; however, for some reason I am getting 1 CPU unit where I should be getting 8.

Regards
____________

Spatzthecat
Joined: 31 Jul 03
Posts: 14
Credit: 6,514,745
RAC: 220
United Kingdom
Message 951102 - Posted: 30 Nov 2009, 15:16:27 UTC - in response to Message 951097.

Hi, I don't have MemtestG80_0.bat or MemtestG80_1.bat files, and no desktop icons for either.
Regards
____________

Ageless
Joined: 9 Jun 99
Posts: 12284
Credit: 2,574,709
RAC: 731
Netherlands
Message 951103 - Posted: 30 Nov 2009, 15:19:19 UTC - in response to Message 951102.

I think Fred made them himself. You can do so as well. :-)
____________
Jord

Ageless
Joined: 9 Jun 99
Posts: 12284
Credit: 2,574,709
RAC: 731
Netherlands
Message 951105 - Posted: 30 Nov 2009, 15:21:21 UTC - in response to Message 951099.
Last modified: 30 Nov 2009, 15:21:41 UTC

...however for some reason I am getting 1 CPU unit where I should be getting 8.

Probably due to the download problem we all have. The guys at SETI should come into the office in an hour or so; just wait for them to fix the problem with the download servers.
____________
Jord


Copyright © 2014 University of California