CUDA problems (Lunatics) / zero status but no finished file


Helsionium
Joined: 24 Dec 06
Posts: 145
Credit: 33,654,064
RAC: 24,931
Austria
Message 1312721 - Posted: 8 Dec 2012, 21:36:37 UTC

Hello there,

ever since I replaced one of my two 560 Ti GPUs with a GTX 660, a relatively large number of tasks (I guess 20-30%) restart at least once (exited with zero status but no 'finished' file) after encountering this error:

Error on call (cudaMemcpy(PowerSpectrumSumMax, dev_PowerSpectrumSumMax, (cudaAcc_NumDataPoints / fftlen) * sizeof(*dev_PowerSpectrumSumMax), cudaMemcpyDeviceToHost)), file c:/[Projects]/X_CudaMB/client/cuda/cudaAcc_summax.cu, line 239: unknown error

Here is a stderr output from one task that restarted five times:
http://files.helsionium.eu/stderr-2739587852.txt
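
For anyone who wants to tally such failures across a stderr file rather than eyeball them, here is a minimal Python sketch. The pattern is modelled only on the single error line quoted above, so it is an assumption that other builds log in the same shape; `summarise_cuda_errors` is a hypothetical helper name, not part of BOINC or the Lunatics app.

```python
import re
from collections import Counter

# Pattern modelled on the one error line quoted in this thread.
# Assumption: other failed calls are logged in the same
# "Error on call (...), file ..., line N: message" shape.
ERROR_RE = re.compile(
    r"Error on call \((?P<call>\w+)\(.*\)\), "
    r"file (?P<file>\S+), line (?P<line>\d+): (?P<msg>.+)"
)


def summarise_cuda_errors(stderr_text):
    """Count failed CUDA calls by (call, source file name, line, message)."""
    counts = Counter()
    for raw in stderr_text.splitlines():
        match = ERROR_RE.search(raw)
        if match is None:
            continue
        # Keep only the file name, not the build machine's full path.
        source = match.group("file").rsplit("/", 1)[-1]
        key = (
            match.group("call"),
            source,
            int(match.group("line")),
            match.group("msg").strip(),
        )
        counts[key] += 1
    return counts
```

Feeding it the text of a saved stderr file gives one counter entry per distinct failing call, which makes it easier to see whether every restart comes from the same `cudaMemcpy` at the same source line.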

It seems this problem only affects tasks that run on the new 660. What could be causing it? The same system ran fine with the two 560 Tis before. This problem isn't particularly grave - all of the affected workunits eventually complete and validate but I guess it's still having an impact on overall crunching performance, aside from being a major nuisance.

I'm using BOINC 7.0.28 on Windows 7 64-bit, the Lunatics x41g optimized app and the beta driver 310.70 (same problem with the latest WHQL driver, though).

Thank you in advance,
Helsionium
____________

jason_gee (Project donor)
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 4959
Credit: 73,034,652
RAC: 14,258
Australia
Message 1312737 - Posted: 8 Dec 2012, 22:13:42 UTC - in response to Message 1312721.
Last modified: 8 Dec 2012, 22:18:17 UTC

Hi Helsionium,
I'd start with the usual checks for the sake of elimination: temperature below 70 degrees C, hard drive integrity, and DPC latency ( http://www.thesycon.de/deu/latency_check.shtml ). Also, if you have enthusiast RAM set to an XMP profile, check that your i7/mobo hasn't optimistically set the command rate to T1 (it should be T2, 2 cycles, except in the rarest situations).

[Edit:] Oh, also with enthusiast RAM, Vdd should be ~80% of Vram, which often seems to be set wrong if 'auto' is used in the BIOS settings.

Jason
____________
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin

Helsionium
Joined: 24 Dec 06
Posts: 145
Credit: 33,654,064
RAC: 24,931
Austria
Message 1313022 - Posted: 9 Dec 2012, 13:22:08 UTC

Alright, I checked:

- Hard drive integrity: OK (and no processes interfering with BOINC or SAH)
- DPC latency: OK, consistently below 500 microseconds, sometimes slightly above 500 if BOINC is active.
- Command rate: OK, set to T2.

About the temperature: is 70°C already too hot? Both GPUs and the CPU typically run at 68-74 degrees under full load. According to the information I had, this is well within their specifications.

About RAM Vdd: I can't really check this now, as I want to avoid restarting the machine unless absolutely necessary. Can this be a problem with RAM? Why would that affect only the tasks running on the 660 GPU and not the older, but almost equally fast 560 Ti?

The power supply should be OK as well. The system ran just fine with two overclocked 560 Tis which used more power.
____________

jason_gee (Project donor)
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 4959
Credit: 73,034,652
RAC: 14,258
Australia
Message 1313037 - Posted: 9 Dec 2012, 13:50:24 UTC - in response to Message 1313022.
Last modified: 9 Dec 2012, 13:50:39 UTC

Sounds mostly good. For Kepler GPUs I would regard 70°C as on the edge, where the boost clock starts throttling. Though I'm not familiar with the 660 variant myself, the 680 definitely behaves better when kept below those figures.

... About RAM Vdd: I can't really check this now, as I want to avoid restarting the machine unless absolutely necessary. Can this be a problem with RAM? Why would that affect only the tasks running on the 660 GPU and not the older, but almost equally fast 560 Ti?


It's 'possible'. A few of the Windows updates since June/July, along with driver refinements, involved 'synchronisation', which is an egghead way of saying the card talking to the machine talking to RAM under supervision of the OS & drivers. Those areas, including the DMA engines, are the most fundamental technology changes since pre-Fermi, along with multithreaded drivers, and are where BIOS defaults have been found wanting in some cases.

I would suggest backing off frequencies a bit (video memory clocks, system RAM timings) and upping the GPU core voltage to see if things change in any way, then seeing what stress tests like OCCT say with respect to artefacts. If things don't change, that would be a good indication to look elsewhere; if they do, you're on the right path.

If you're comfortable with manual app_info manipulation, one thing you can do is update the application to x41zb from my GPUUG sponsored site ( http://jgopt.org under construction). It's not a fix if there are other systemic issues, but will take some variables out of the mix if further diagnosis needs to be made.

Jason
____________

Helsionium
Joined: 24 Dec 06
Posts: 145
Credit: 33,654,064
RAC: 24,931
Austria
Message 1313134 - Posted: 9 Dec 2012, 16:09:47 UTC

The problem persists even after I did this:

- reduced CPU speed from 4.3 GHz to 4.0 GHz
- reduced RAM frequency from 1600 to 1333
- disabled XMP
- installed x41zb
- tested GPU with OCCT, no errors, no artifacts

I can't change the GPU core voltage on the 660, the software doesn't allow that. I didn't find any options to adjust Vdd in the BIOS. The 660 isn't overclocked (past its factory-overclocking) and the software doesn't allow me to go lower than the default clock rates, so I can't do this. The error is probably not related to temperature, though. There was one instance of the error shortly after reboot, when temperatures were still much lower.

Just shortly before the installation of the new 660, I disabled hyperthreading because it was recommended by some in this forum. Could that be related?
____________

jason_gee (Project donor)
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 4959
Credit: 73,034,652
RAC: 14,258
Australia
Message 1313151 - Posted: 9 Dec 2012, 17:02:58 UTC - in response to Message 1313134.
Last modified: 9 Dec 2012, 17:09:53 UTC

Just shortly before the installation of the new 660, I disabled hyperthreading because it was recommended by some in this forum. Could that be related?
Could be. It certainly messes with synchronisation, but look at this telling pending one on your system:
http://setiathome.berkeley.edu/result.php?resultid=2741753543

It started on the 660 (with zb), the memcopy errored out, and the task restarted on the 560 Ti & completed. With all those variables eliminated, it's narrowing down to either a specific 660 issue or something connected to it (like the driver or slot). Can you swap the cards in the slots & see if the problem follows the card, clears, or stays with the slot? Each possibility says something.
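
The three outcomes of the swap test can be written down as a small decision helper. This is only a sketch: the post does not spell out what each outcome implies, so the suggested suspects in the return strings are assumptions, and `interpret_swap` is a hypothetical name.

```python
def interpret_swap(card_before, slot_before, failing_after):
    """Classify the result of swapping two cards between two slots.

    card_before / slot_before: the card and slot that showed errors
    before the swap. failing_after: a (card, slot) tuple for the
    combination that errors after the swap, or None if errors stopped.
    The suspects named below are assumptions, not from the thread.
    """
    if failing_after is None:
        return "cleared: suspect seating or contacts disturbed by the swap"
    card_after, slot_after = failing_after
    if card_after == card_before:
        return "follows the card: suspect the card itself or its settings"
    if slot_after == slot_before:
        return "stays with the slot: suspect the slot, board or BIOS"
    return "inconclusive: a different card in a different slot now fails"
```

For example, `interpret_swap("GTX 660", "slot 1", ("GTX 660", "slot 2"))` returns the "follows the card" branch.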

[Edit:] Also, beyond that, bumping the process priority on the 660 may bring a change. Either zb's inbuilt global or card-specific priority could be used, or some outboard tool.
____________

Tron
Joined: 16 Aug 09
Posts: 180
Credit: 2,236,055
RAC: 0
United States
Message 1313159 - Posted: 9 Dec 2012, 17:21:59 UTC

Similar problem on my Linux box, I suspect.

Disabling the "SLI" and "multiGPU" options eliminated the memcopy errors I was getting. It's worth a try.

arkayn (Project donor)
Volunteer tester
Joined: 14 May 99
Posts: 3618
Credit: 48,497,808
RAC: 38,831
United States
Message 1313162 - Posted: 9 Dec 2012, 17:25:30 UTC - in response to Message 1313134.
Last modified: 9 Dec 2012, 17:28:01 UTC

MSI Afterburner can adjust the voltage settings.
http://download1.msi.com/files/downloads/uti_exe/vga/MSIAfterburnerSetup230.zip

I have my Galaxy GTX 660 at +100 Core Voltage, +75 Core Clock and +200 Memory Clock.
With the fan speed set to 70% it runs at 57°C while running 3 tasks at a time.

I also have the config file set to
processpriority = high
pfblockspersm = 15
pfperiodsperlaunch = 200
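
To sanity-check settings like these before restarting BOINC, a minimal Python sketch can read the `key = value` lines into a dictionary. The three keys are taken from the post; the handling of `;` comments and `[section]` headers is an assumption about the file format, and `parse_settings` is a hypothetical helper.

```python
def parse_settings(text):
    """Parse simple 'key = value' lines into a dict.

    Assumptions (not confirmed in the thread): ';' starts a comment
    and lines like '[section]' are headers that can be skipped.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split(";", 1)[0].strip()
        if not line or line.startswith("[") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        # Store whole numbers as int, everything else as text.
        settings[key.lower()] = int(value) if value.isdigit() else value
    return settings
```

Calling `parse_settings` on the three lines above yields `processpriority` as text and the two `pf...` values as integers, which makes a mistyped number easy to spot.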
____________

jason_gee (Project donor)
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 4959
Credit: 73,034,652
RAC: 14,258
Australia
Message 1313171 - Posted: 9 Dec 2012, 17:52:17 UTC - in response to Message 1313159.

Similar problem on my Linux box, I suspect.

Disabling the "SLI" and "multiGPU" options eliminated the memcopy errors I was getting. It's worth a try.


Possible as well, on top of the multiple physical cards with default below-normal priority.

The multiGPU mode there is probably akin to the Windows nVidia control panel global setting 'Multiple Display Performance Mode' or 'Compatibility Mode', as opposed to 'Single Display Performance' mode. That sounds like synchronisation again, and by rights it *should* have no influence on memory transfers either, unless there's some real underlying problem.

At least in this case the newer build seems to be recovering instead of erroring out the task altogether. Of course it'd be nicer if it never happened ;)
____________

Cliff Harding (Project donor)
Volunteer tester
Joined: 18 Aug 99
Posts: 964
Credit: 50,670,641
RAC: 40,287
United States
Message 1313173 - Posted: 9 Dec 2012, 17:54:29 UTC
Last modified: 9 Dec 2012, 17:57:56 UTC

If your card is an EVGA product, I suggest that you download the EVGA Precision X v3.0.4 software from their web site. You can then adjust whatever you want for your card. You mentioned that your card temperature is approx. 70°C; my card is running 2 tasks at approx. 67°C. I have noticed that if I keep Precision X running minimized, the temperatures do not rise into the 70°C range, whereas without the app running they rise to 70+°C. I also noticed that a lot of my errors were resolved when the app was running and keeping the card cool. This may or may not resolve your issue, just my thoughts.

[edit] BTW, I'm running BOINC 7.0.40, Lunatics v0.40 (_x41g), Nvidia 310.33 [/edit]
____________


I don't buy computers, I build them!!

Helsionium
Joined: 24 Dec 06
Posts: 145
Credit: 33,654,064
RAC: 24,931
Austria
Message 1313292 - Posted: 9 Dec 2012, 21:42:31 UTC

Thank you all for your opinions.

I used an old version of MSI Afterburner that couldn't make use of all the functions. With the new one I can indeed adjust the voltage (but can still not turn the fan speed of my 660 above 74% which is strange). But if, as you say, this is already a rather high temperature I'd rather not fiddle with the voltage control unless necessary. However, according to several internet sources, the temperature is pretty much the expected temperature of my card - the Zotac GTX 660 Amp! Edition - under maximum load, maybe a few degrees higher, but that's expected too (multi-GPU system, very demanding CUDA load instead of the game-based benchmarks)

I never enabled SLI to begin with, and also switched to "single display performance mode", but that too didn't help. I've set the process priority to high via the mbcuda.cfg file.

I'm out of time now, but tomorrow I will try switching the cards around and find out if hyperthreading affects the problem or not. The good thing is that the error is relatively benign and that all of the affected tasks I followed eventually successfully validated.
____________

Tron
Joined: 16 Aug 09
Posts: 180
Credit: 2,236,055
RAC: 0
United States
Message 1313355 - Posted: 9 Dec 2012, 23:18:57 UTC

I never enabled SLI to begin with


Neither had I, but it was enabled nonetheless. Check that it is, in fact, off.

Cliff Harding (Project donor)
Volunteer tester
Joined: 18 Aug 99
Posts: 964
Credit: 50,670,641
RAC: 40,287
United States
Message 1313383 - Posted: 10 Dec 2012, 0:43:10 UTC - in response to Message 1313292.

Thank you all for your opinions.

I used an old version of MSI Afterburner that couldn't make use of all the functions. With the new one I can indeed adjust the voltage (but can still not turn the fan speed of my 660 above 74% which is strange).


If I remember correctly Nvidia has an upper limit on the fan speed for the GTX660. My EVGA GTX660SC can only be manually adjusted to 74% using Precision X. BTW, I use hyper-threading with no problems. I do keep 1 core free for cpu->GPU processing (load/unload).
____________

Tron
Joined: 16 Aug 09
Posts: 180
Credit: 2,236,055
RAC: 0
United States
Message 1313408 - Posted: 10 Dec 2012, 3:35:48 UTC

Disregard my advice, my host just spat out another error matching the description. lol, first one in 3 days though, compared to 10-20 per day.

Helsionium
Joined: 24 Dec 06
Posts: 145
Credit: 33,654,064
RAC: 24,931
Austria
Message 1313465 - Posted: 10 Dec 2012, 9:22:44 UTC

So, some new observations:

- Enabling hyperthreading didn't fix or reduce the errors, but I left it on anyway.
- The error rate seems to be lower with x41g so I reverted back. Might be a coincidence, but at the very least, x41zb is certainly not better in that regard.
- Number of tasks per card doesn't matter. I ran only one task per card over the whole night as opposed to my standard two, but the error rate was not reduced.
- I swapped the cards, the error still only affects the 660.
- A temperature issue can now definitely be ruled out: after swapping the cards, the 660 runs at a temperature of 58-61°C.

Also, I didn't find the SLI option to disable it. I didn't connect the cards with an SLI bridge, so it shouldn't be available, right?
____________

jason_gee (Project donor)
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 4959
Credit: 73,034,652
RAC: 14,258
Australia
Message 1313523 - Posted: 10 Dec 2012, 14:26:50 UTC

Right, it followed the card. Sounds more & more like you need a smidgen of core voltage. Kepler GPUs are traditionally volt-starved out of the box, so that's not surprising.
____________

Helsionium
Joined: 24 Dec 06
Posts: 145
Credit: 33,654,064
RAC: 24,931
Austria
Message 1313537 - Posted: 10 Dec 2012, 15:08:29 UTC

OK, just to be sure, since I don't have any experience with the GTX 660 - the current core voltage is 1.175 V, core clock 1163 MHz, memory clock 1652 MHz, temperature 58-60°C. Is it safe to increase the voltage and what amount would you suggest?
____________

jason_gee (Project donor)
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 4959
Credit: 73,034,652
RAC: 14,258
Australia
Message 1313543 - Posted: 10 Dec 2012, 15:35:39 UTC - in response to Message 1313537.
Last modified: 10 Dec 2012, 15:50:26 UTC

OK, just to be sure, since I don't have any experience with the GTX 660 - the current core voltage is 1.175 V, core clock 1163 MHz, memory clock 1652 MHz, temperature 58-60°C. Is it safe to increase the voltage and what amount would you suggest?


Hmmm, 1.175 V is the max on an unmodified 680, so it's a good point to sit back & question things, & test the waters to see what's going on.

Hold off on the voltage changes until you know more about what the max available setting is. I would instead back off the core clock a little (say ~1050 MHz) for the sake of looking at the memory in isolation, taking the chance of a core issue out of the question first...

Then I'd step the memory speed back & ease it up again step by step, actually trying *higher* memory speeds until it starts to show artefacts in some scanning tool (then back off two 10 MHz notches further from 'clean'). It's possible there is some sort of memory issue going on that a small boost can sometimes overcome. That would actually make the most sense if you see changes in the memory transfer stall behaviour.
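
The stepping procedure above can be sketched as a simple loop. This is one reading of the advice, not a tool from the thread: `shows_artefacts` is a hypothetical callback standing in for a person setting the clock and watching a scanning tool, and the code never touches the hardware itself.

```python
def find_memory_clock(start_mhz, max_mhz, shows_artefacts,
                      step_mhz=10, margin_steps=2):
    """Walk the memory clock upward and return a clock with safety margin.

    start_mhz: the backed-off clock to begin from.
    max_mhz: the highest clock you are willing to try.
    shows_artefacts: hypothetical callback, True if the given clock
        produces artefacts in a scanning tool.
    Returns the last clean clock minus margin_steps notches (never
    below start_mhz), or None if even start_mhz was not clean.
    """
    last_clean = None
    clock = start_mhz
    while clock <= max_mhz:
        if shows_artefacts(clock):
            break
        last_clean = clock
        clock += step_mhz
    if last_clean is None:
        return None
    return max(start_mhz, last_clean - margin_steps * step_mhz)
```

With the defaults, the result sits two 10 MHz notches below the last clean setting, matching the margin suggested in the post.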

Jason

[Edit:] Potential side issues: your PCIe slots are clocked at a fixed 100 MHz, right? BIOS [& chipset drivers] up to date as well?
____________

jason_gee (Project donor)
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 4959
Credit: 73,034,652
RAC: 14,258
Australia
Message 1313562 - Posted: 10 Dec 2012, 16:14:38 UTC
Last modified: 10 Dec 2012, 16:15:19 UTC

As another last-minute elimination idea before going to bed: clearing the CUDA compute cache could possibly take some old-driver-generated junk out. That can be an issue sometimes if early retail box drivers were used in the past, and it sometimes isn't cleared by a driver update etc. (even with the clean install advanced option).

The path to navigate to, with BOINC stopped, in a Windows Explorer window is:
%APPDATA%\NVIDIA\ComputeCache , and delete all the content there (or the whole ComputeCache folder)

Just takes another variable out of the mix.
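
The clean-up described above can also be scripted. The following Python sketch empties a folder while keeping the folder itself; it deliberately takes the path as an argument rather than hard-coding one, and `clear_directory` is a hypothetical helper name. As in the post, BOINC should be stopped first.

```python
import shutil
from pathlib import Path


def clear_directory(cache_dir):
    """Delete everything inside cache_dir but keep the folder itself.

    Returns the number of top-level entries removed, or 0 if the
    folder does not exist.
    """
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return 0
    removed = 0
    for entry in cache_dir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove a sub-folder and its contents
        else:
            entry.unlink()        # remove a plain file or a symlink
        removed += 1
    return removed
```

Call it with the ComputeCache folder named in the post; the driver should rebuild the cache the next time a CUDA application runs.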
____________

Helsionium
Joined: 24 Dec 06
Posts: 145
Credit: 33,654,064
RAC: 24,931
Austria
Message 1313597 - Posted: 10 Dec 2012, 17:56:23 UTC

Unfortunately I corrupted my BIOS due to a very badly designed BIOS update procedure. Since I probably will not find a way to recover it, I must wait a few days for a replacement motherboard. I'm thankful for all your ideas and will continue with your suggestions as soon as possible.
____________


Copyright © 2014 University of California