CUDA problems (Lunatics) / zero status but no finished file |
Message boards : Number crunching : CUDA problems (Lunatics) / zero status but no finished file
| Author | Message |
|---|---|
|
Hello there, Error on call (cudaMemcpy(PowerSpectrumSumMax, dev_PowerSpectrumSumMax, (cudaAcc_NumDataPoints / fftlen) * sizeof(*dev_PowerSpectrumSumMax), cudaMemcpyDeviceToHost)), file c:/[Projects]/X_CudaMB/client/cuda/cudaAcc_summax.cu, line 239: unknown error Here is a stderr output from one task that restarted five times: http://files.helsionium.eu/stderr-2739587852.txt It seems this problem only affects tasks that run on the new 660. What could be causing it? The same system ran fine with the two 560 Tis before. This problem isn't particularly grave - all of the affected workunits eventually complete and validate but I guess it's still having an impact on overall crunching performance, aside from being a major nuisance. I'm using BOINC 7.0.28 on Windows 7 64-bit, the Lunatics x41g optimized app and the beta driver 310.70 (same problem with the latest WHQL driver, though). Thank you in advance, Helsionium ____________ | |
| ID: 1312721 · | |
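For anyone comparing several stderr files, the restarts can be tallied with a short script. This is only a rough sketch: the pattern is modelled on the single error line quoted in the post above, other builds of the app may word it differently, and the name `summarize_cuda_errors` is made up for illustration.

```python
import re
from collections import Counter

# Pattern modelled on the one error line quoted in this thread (assumption:
# other app builds or error types may be formatted differently).
ERROR_RE = re.compile(
    r"Error on call \((?P<call>cuda\w+)\(.*\), "
    r"file (?P<file>.+?), line (?P<line>\d+): (?P<reason>.*)"
)


def summarize_cuda_errors(stderr_text):
    """Count failing CUDA calls by (call name, source file, line number)."""
    counts = Counter()
    for match in ERROR_RE.finditer(stderr_text):
        # Keep only the file name, not the build machine's full path.
        source = match.group("file").rsplit("/", 1)[-1]
        counts[(match.group("call"), source, int(match.group("line")))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_errors.py stderr.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for key, count in summarize_cuda_errors(handle.read()).most_common():
            print(count, *key)
```

Running it over the stderr of tasks from each card would show whether the failures always come from the same call and source line.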
|
Hi Helsionium, | |
| ID: 1312737 · | |
|
Alright, I checked: | |
| ID: 1313022 · | |
|
Sounds mostly good. For Kepler GPUs I would regard 70C as on the edge, where the boost clock starts throttling. Though not familiar with the 660 variant myself, the 680 definitely behaves itself better when kept below those figures. ... About RAM Vdd: I can't really check this now, as I want to avoid restarting the machine unless absolutely necessary. Can this be a problem with RAM? Why would that affect only the tasks running on the 660 GPU and not the older, but almost equally fast 560 Ti? It's 'possible'. A few of the Windows updates since June/July, along with driver refinements, involved 'synchronisation', which is an egghead way of saying card talking to machine talking to RAM under supervision of the OS & drivers. Those areas, including DMA engines, are the most fundamental technology changes since pre-Fermi, along with multithreaded drivers, and are where BIOS defaults have been found wanting in some cases. I would suggest backing off frequencies a bit (video memory clocks, system RAM timings) and upping the GPU core voltage to see if things change in any way, then seeing what stress tests like OCCT say with respect to artefacts. If things don't change, that'd be a good indication to look elsewhere. If they do, then you're on the right path. If you're comfortable with manual app_info manipulation, one thing you can do is update the application to x41zb from my GPUUG sponsored site ( http://jgopt.org under construction). It's not a fix if there are other systemic issues, but will take some variables out of the mix if further diagnosis needs to be made. Jason ____________ "It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change." Charles Darwin | |
| ID: 1313037 · | |
|
The problem persists even after I did this: | |
| ID: 1313134 · | |
Just shortly before the installation of the new 660, I disabled hyperthreading because it was recommended by some in this forum. Could that be related? Could be. It certainly messes with synchronisation, but look at this telling pending one on your system: http://setiathome.berkeley.edu/result.php?resultid=2741753543 It started on the 660 (with zb), the memcopy errored out, and it restarted on the 560 Ti & completed. With all those variables eliminated, it's narrowing to either a specific 660 issue or something connected to it (like driver or slot). Can you swap the cards in the slots & see if it follows the card, clears, or stays with the slot? Each possibility says something. [Edit:] Also, beyond that, bumping the process priority on the 660 may see a change. Either zb's inbuilt global or card-specific priority could be used, or some outboard tool. ____________ "It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change." Charles Darwin | |
| ID: 1313151 · | |
|
Similar problem on my linux box I suspect. | |
| ID: 1313159 · | |
|
MSI Afterburner can adjust the voltage settings. | |
| ID: 1313162 · | |
Similar problem on my linux box I suspect. Possible as well, on top of the multiple physical cards with default belownormal priority. multiGPU mode there is probably akin to the Windows nVidia control panel global setting 'Multiple Display Performance Mode' or 'Compatibility Mode'... as opposed to 'Single Display Performance' mode. That sounds like synchronisation again, and by rights that *should* have no influence on memory transfers either, unless there's some real underlying problem. At least in this case the newer build seems to be recovering instead of erroring out the task altogether. Of course it'd be nicer if it never happened ;) ____________ "It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change." Charles Darwin | |
| ID: 1313171 · | |
|
If your card is an EVGA product, I suggest that you download the EVGA Precision X v3.0.4 software from their web site. You can then adjust whatever you want for your card. You mentioned that your card temp is approx 70C; my card is running 2 tasks at approx 67C. I have noticed that, with an EVGA card, if you keep Precision X running minimized the temps do not rise into the 70C range, whereas without the app running they rise to 70+C. I also noticed that a lot of my errors were resolved when the app was running and keeping the card cool. This may or may not resolve your issue, just my thoughts. | |
| ID: 1313173 · | |
|
Thank you all for your opinions. | |
| ID: 1313292 · | |
I never enabled SLI to begin with Neither had I, but it was enabled nonetheless. Check that it is, in fact, off. | |
| ID: 1313355 · | |
Thank you all for your opinions. If I remember correctly Nvidia has an upper limit on the fan speed for the GTX660. My EVGA GTX660SC can only be manually adjusted to 74% using Precision X. BTW, I use hyper-threading with no problems. I do keep 1 core free for cpu->GPU processing (load/unload). ____________ I don't buy computers, I build them!! | |
| ID: 1313383 · | |
|
Disregard my advice, my host just spat out another error matching the description. lol, first one in 3 days though, compared to 10-20 per day | |
| ID: 1313408 · | |
|
So, some new observations: | |
| ID: 1313465 · | |
|
Right, it followed the card. Sounds more & more like you need a smidgeon of core voltage. Kepler GPUs are traditionally volt-starved out of the box, so not surprising. | |
| ID: 1313523 · | |
|
OK, just to be sure, since I don't have any experience with the GTX 660 - the current core voltage is 1.175 V, core clock 1163 MHz, memory clock 1652 MHz, temperature 58-60°C. Is it safe to increase the voltage and what amount would you suggest? | |
| ID: 1313537 · | |
OK, just to be sure, since I don't have any experience with the GTX 660 - the current core voltage is 1.175 V, core clock 1163 MHz, memory clock 1652 MHz, temperature 58-60°C. Is it safe to increase the voltage and what amount would you suggest? Hmmm, 1.175 V is the max on an unmodified 680, so it's a good point to sit back, question things and test the water to see what's going on. Hold the voltage changes until you know more about what the max available setting is. I would instead back off the core clock a little (say ~1050 MHz) for the sake of looking at the memory in isolation, taking the chance of a core issue out of the question first... Then I'd step back the memory speed and ease back up step by step, actually trying *higher* memory speeds until it starts to show artefacts in some scanning tool (then back off two 10 MHz notches further from 'clean'). It's possible there is some sort of memory issue going on that sometimes a small boost can overcome. That would actually make the most sense if you see changes in the memory transfer stall behaviour. Jason [Edit:] Potential side issues: your PCIe slots are clocked at a fixed 100 MHz, right? BIOS [& chipset drivers] up to date as well? ____________ "It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change." Charles Darwin | |
| ID: 1313543 · | |
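The memory-clock procedure described in the post above can be written down as a small search loop. This is only a sketch of one reading of it: `shows_artefacts` is a hypothetical callback standing in for "apply the clock with your tuning tool and run an artefact scan by hand", and taking "two 10 MHz notches further from clean" to mean two steps below the highest clean clock is an interpretation, not something the post spells out.

```python
def pick_memory_clock(start_mhz, max_mhz, shows_artefacts,
                      step_mhz=10, safety_notches=2):
    """Step the memory clock upwards until artefacts appear, then back off.

    shows_artefacts(clock_mhz) is a hypothetical callback that returns True
    if an artefact scan fails at that clock. Returns the suggested clock in
    MHz, or None if even the starting clock is not clean.
    """
    last_clean = None
    clock = start_mhz
    while clock <= max_mhz:
        if shows_artefacts(clock):
            break  # first clock that is no longer clean
        last_clean = clock
        clock += step_mhz
    if last_clean is None:
        return None
    # Back off a safety margin below the highest clean clock, but never
    # below the (already reduced) starting point.
    return max(start_mhz, last_clean - safety_notches * step_mhz)
```

In practice each call to `shows_artefacts` is a manual step (set the clock, run the scanner for a while, note the result), so the loop is mainly a checklist for keeping track of where 'clean' ended.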
|
As another last minute elimination idea before going to bed, clearing the cuda computecache could possibly take some old-driver generated junk out. That can be an issue sometimes if early retail box drivers were used in the past, and sometimes isn't cleared by driver update etc (even with clean install advanced option) | |
| ID: 1313562 · | |
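For reference, the compute cache mentioned above normally lives in a per-user folder. The sketch below only works out where to look and deletes nothing. It assumes the default locations given in NVIDIA's CUDA documentation (`%APPDATA%\NVIDIA\ComputeCache` on Windows, `~/.nv/ComputeCache` on Linux) and the `CUDA_CACHE_PATH` override; the function name is made up for illustration, and your driver version may differ.

```python
import os
import sys
from pathlib import Path


def default_compute_cache_dir(platform=sys.platform, env=None):
    """Return the likely CUDA compute cache folder, or None if unknown.

    Assumes NVIDIA's documented defaults; nothing is created or deleted.
    """
    env = os.environ if env is None else env
    override = env.get("CUDA_CACHE_PATH")
    if override:
        return Path(override)  # an explicit override wins over the defaults
    if platform.startswith("win"):
        appdata = env.get("APPDATA")
        return Path(appdata) / "NVIDIA" / "ComputeCache" if appdata else None
    home = env.get("HOME")
    return Path(home) / ".nv" / "ComputeCache" if home else None


if __name__ == "__main__":
    print(default_compute_cache_dir())
```

Shut down BOINC before emptying that folder by hand; the cache is rebuilt the next time a CUDA app starts.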
|
Unfortunately I corrupted my BIOS due to a very badly designed BIOS update procedure. Since I probably will not find a way to recover, I must wait a few days for a replacement motherboard. I'm thankful for all your ideas and will continue with your suggestions as soon as possible. | |
| ID: 1313597 · | |
| Copyright © 2013 University of California |