CUDA problems (Lunatics) / zero status but no finished file
Author | Message |
---|---|
Helsionium Joined: 24 Dec 06 Posts: 156 Credit: 86,214,817 RAC: 43 |
Hello there, ever since I replaced one of my two 560 Ti GPUs with a GTX 660, a relatively large number of tasks (I guess 20-30%) restart at least once (exited with zero status but no 'finished' file) after encountering this error: Error on call (cudaMemcpy(PowerSpectrumSumMax, dev_PowerSpectrumSumMax, (cudaAcc_NumDataPoints / fftlen) * sizeof(*dev_PowerSpectrumSumMax), cudaMemcpyDeviceToHost)), file c:/[Projects]/X_CudaMB/client/cuda/cudaAcc_summax.cu, line 239: unknown error Here is the stderr output from one task that restarted five times: http://files.helsionium.eu/stderr-2739587852.txt It seems this problem only affects tasks that run on the new 660. What could be causing it? The same system ran fine with the two 560 Tis before. This problem isn't particularly grave: all of the affected workunits eventually complete and validate, but I guess it's still having an impact on overall crunching performance, aside from being a major nuisance. I'm using BOINC 7.0.28 on Windows 7 64-bit, the Lunatics x41g optimized app and the beta driver 310.70 (same problem with the latest WHQL driver, though). Thank you in advance, Helsionium |
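To put a number on how often the error occurs, the stderr output can be scanned for the error line quoted in the post above. The following is a minimal Python sketch, not part of the Lunatics tooling: the regular expression is modelled only on that single quoted line, and the function name `count_cuda_errors` and the sample text are made up for illustration. Real stderr files may contain other error formats that it does not catch.

```python
import re
from collections import Counter

# Pattern modelled on the single error line quoted in the post above.
# Assumption: other failing CUDA calls are logged in the same layout.
ERROR_RE = re.compile(
    r"Error on call \((?P<call>\w+)\(.*\), file (?P<file>.+?), "
    r"line (?P<line>\d+): (?P<msg>.*)"
)

def count_cuda_errors(stderr_text):
    """Count error lines, grouped by (CUDA call, source line, message)."""
    counts = Counter()
    for line in stderr_text.splitlines():
        match = ERROR_RE.search(line)
        if match:
            key = (match["call"], int(match["line"]), match["msg"].strip())
            counts[key] += 1
    return counts

# Illustrative input: the quoted error line repeated twice between
# placeholder lines standing in for the rest of the log.
ERROR_LINE = (
    "Error on call (cudaMemcpy(PowerSpectrumSumMax, dev_PowerSpectrumSumMax, "
    "(cudaAcc_NumDataPoints / fftlen) * sizeof(*dev_PowerSpectrumSumMax), "
    "cudaMemcpyDeviceToHost)), file c:/[Projects]/X_CudaMB/client/cuda/"
    "cudaAcc_summax.cu, line 239: unknown error"
)
SAMPLE = "\n".join(["(other output)", ERROR_LINE, "(other output)", ERROR_LINE])

for key, n in count_cuda_errors(SAMPLE).items():
    print(key, n)
```

Running this over the stderr of several tasks, and noting which GPU each task ran on, would give a rough error count per card to compare before and after each change.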
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Hi Helsionium, I'd start with the usual checks for the sake of elimination: temperature below 70 degrees C, hard drive integrity, and DPC latency ( http://www.thesycon.de/deu/latency_check.shtml ). Also, if you have enthusiast RAM set to an XMP profile, check that your i7/mobo hasn't optimistically set the command rate to T1 latency (it should be T2, 2 cycles, except in the rarest situation). [Edit:] Oh, also with enthusiast RAM, Vdd should be ~80% of Vram, which often seems to be set wrong if 'auto' is used in the BIOS settings. Jason "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Helsionium Joined: 24 Dec 06 Posts: 156 Credit: 86,214,817 RAC: 43 |
Alright, I checked: - Hard drive integrity: OK (and no processes interfering with BOINC or SAH) - DPC latency: OK, consistently below 500 microseconds, sometimes slightly above 500 if BOINC is active. - Command rate: OK, set to T2. About the temperature: is 70°C already too hot? Both GPUs and the CPU typically run at 68-74 degrees under full load. According to the information I had, this is well within their specifications. About RAM Vdd: I can't really check this now, as I want to avoid restarting the machine unless absolutely necessary. Can this be a problem with RAM? Why would that affect only the tasks running on the 660 GPU and not the older, but almost equally fast 560 Ti? The power supply should be OK as well. The system ran just fine with two overclocked 560 Tis which used more power. |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Sounds mostly good. For Kepler GPUs I would regard 70C as on the edge, where the boost clock starts throttling. Though I'm not familiar with the 660 variant myself, the 680 definitely behaves better when kept below those figures. "... About RAM Vdd: I can't really check this now, as I want to avoid restarting the machine unless absolutely necessary. Can this be a problem with RAM? Why would that affect only the tasks running on the 660 GPU and not the older, but almost equally fast 560 Ti?" It's 'possible'. A few of the Windows updates since June/July, along with driver refinements, involved 'synchronisation', which is an egghead way of saying card talking to machine talking to RAM under supervision of the OS & drivers. Those areas, including DMA engines, are the most fundamental technology changes since pre-Fermi, along with multithreaded drivers, and are where BIOS defaults have been found wanting in some cases. I would suggest backing off frequencies a bit (video memory clocks, system RAM timings) and upping the GPU core voltage to see if things change in any way, then seeing what stress tests like OCCT say with respect to artefacts. If things don't change, well, it'd be a good indication to look elsewhere. If they do, then it'd be on the right path. If you're comfortable with manual app_info manipulation, one thing you can do is update the application to x41zb from my GPUUG sponsored site ( http://jgopt.org under construction). It's not a fix if there are other systemic issues, but it will take some variables out of the mix if further diagnosis needs to be made. Jason |
Helsionium Joined: 24 Dec 06 Posts: 156 Credit: 86,214,817 RAC: 43 |
The problem persists even after I did this: - reduced CPU speed from 4.3 GHz to 4.0 GHz - reduced RAM frequency from 1600 to 1333 - disabled XMP - installed x41zb - tested GPU with OCCT, no errors, no artifacts. I can't change the GPU core voltage on the 660; the software doesn't allow that. I didn't find any options to adjust Vdd in the BIOS. The 660 isn't overclocked (past its factory overclock) and the software doesn't allow me to go lower than the default clock rates, so I can't back those off either. The error is probably not related to temperature, though. There was one instance of the error shortly after reboot, when temperatures were still much lower. Shortly before the installation of the new 660, I disabled hyperthreading because it was recommended by some in this forum. Could that be related? |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"Just shortly before the installation of the new 660, I disabled hyperthreading because it was recommended by some in this forum. Could that be related?" Could be. It certainly messes with synchronisation, but look at this telling pending one on your system: http://setiathome.berkeley.edu/result.php?resultid=2741753543 It started on the 660 (with zb), the memcopy errored out, and it restarted on the 560 Ti & completed. With all those variables eliminated, it's narrowing to either a specific 660 issue or something connected to it (like driver or slot). Can you swap the cards in the slots & see if it follows the card, clears, or stays with the slot? Each possibility says something. [Edit:] Also, beyond that, bumping the process priority on the 660 may bring a change. Either zb's inbuilt global or card-specific priority could be used, or some outboard tool. |
Tron Joined: 16 Aug 09 Posts: 180 Credit: 2,250,468 RAC: 0 |
Similar problem on my Linux box, I suspect. Disabling the "SLI" and "multiGPU" options eliminated the memcopy errors I was getting. It's worth a try. |
arkayn Joined: 14 May 99 Posts: 4438 Credit: 55,006,323 RAC: 0 |
MSI Afterburner can adjust the voltage settings. http://download1.msi.com/files/downloads/uti_exe/vga/MSIAfterburnerSetup230.zip I have my Galaxy GTX 660 at +100 Core Voltage, +75 Core Clock and +200 Memory Clock. With fan speed set to 70% it runs at 57C running 3 tasks at a time. I also have the config file set to: processpriority = high; pfblockspersm = 15; pfperiodsperlaunch = 200 |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"Similar problem on my linux box I suspect." Possible as well, on top of the multiple physical cards with default belownormal priority. multiGPU mode there is probably akin to the Windows nVidia control panel global setting 'Multiple Display Performance Mode' or 'Compatibility Mode', as opposed to 'Single Display Performance' mode. That sounds like synchronisation again, and by rights that *should* have no influence on memory transfers either, unless there's some real underlying problem. At least in this case the newer build seems to be recovering instead of erroring out the task altogether. Of course it'd be nicer if it never happened ;) |
Cliff Harding Joined: 18 Aug 99 Posts: 1432 Credit: 110,967,840 RAC: 67 |
If your card is an EVGA product, I suggest that you download the EVGA Precision X v3.0.4 software from their web site. You can then adjust whatever you want for your card. You mentioned that your card temp is approx 70C; my card is running 2 tasks at approx 67C. I have noticed that if you keep Precision X running minimized, the temps will not rise into the 70C range; without the app running, the temps will rise to 70+C. I also noticed that a lot of my errors were resolved when the app was running and keeping the card cool. This may or may not resolve your issue, just my thoughts. [edit] BTW, I'm running BOINC 7.0.40, Lunatics v0.40 (_x41g), Nvidia 310.33 [/edit] I don't buy computers, I build them!! |
Helsionium Joined: 24 Dec 06 Posts: 156 Credit: 86,214,817 RAC: 43 |
Thank you all for your opinions. I used an old version of MSI Afterburner that couldn't make use of all the functions. With the new one I can indeed adjust the voltage (but still cannot turn the fan speed of my 660 above 74%, which is strange). But if, as you say, this is already a rather high temperature, I'd rather not fiddle with the voltage control unless necessary. However, according to several internet sources, this is pretty much the expected temperature of my card (the Zotac GTX 660 Amp! Edition) under maximum load, maybe a few degrees higher, but that's expected too (multi-GPU system, very demanding CUDA load instead of the game-based benchmarks). I never enabled SLI to begin with, and also switched to "single display performance mode", but that didn't help either. I've set the process priority to high via the mbcuda.cfg file. I'm out of time now, but tomorrow I will try switching the cards around and find out whether hyperthreading affects the problem or not. The good thing is that the error is relatively benign and that all of the affected tasks I followed eventually validated successfully. |
Tron Joined: 16 Aug 09 Posts: 180 Credit: 2,250,468 RAC: 0 |
"I never enabled SLI to begin with" Neither had I, but it was enabled nonetheless. Check that it is, in fact, off. |
Cliff Harding Joined: 18 Aug 99 Posts: 1432 Credit: 110,967,840 RAC: 67 |
"Thank you all for your opinions." If I remember correctly, Nvidia has an upper limit on the fan speed for the GTX 660. My EVGA GTX660SC can only be manually adjusted to 74% using Precision X. BTW, I use hyper-threading with no problems. I do keep 1 core free for cpu->GPU processing (load/unload). |
Tron Joined: 16 Aug 09 Posts: 180 Credit: 2,250,468 RAC: 0 |
Disregard my advice, my host just spat out another error matching the description. lol. First one in 3 days though, compared to 10-20 per day. |
Helsionium Joined: 24 Dec 06 Posts: 156 Credit: 86,214,817 RAC: 43 |
So, some new observations: - Enabling hyperthreading didn't fix or reduce the errors; I still left it on anyway. - The error rate seems to be lower with x41g so I reverted back. Might be a coincidence, but at the very least, x41zb is certainly not better in that regard. - Number of tasks per card doesn't matter. I ran only one task per card over the whole night as opposed to my standard two, but the error rate was not reduced. - I swapped the cards, the error still only affects the 660. - A temperature issue can now definitely be ruled out: after swapping the cards, the 660 runs at a temperature of 58-61°C. Also, I didn't find the SLI option to disable it. I didn't connect the cards with an SLI bridge, so it shouldn't be available, right? |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Right, it followed the card. Sounds more & more like you need a smidgeon of core voltage. Kepler GPUs are traditionally volt-starved out of the box, so that's not surprising. |
Helsionium Joined: 24 Dec 06 Posts: 156 Credit: 86,214,817 RAC: 43 |
OK, just to be sure, since I don't have any experience with the GTX 660 - the current core voltage is 1.175 V, core clock 1163 MHz, memory clock 1652 MHz, temperature 58-60°C. Is it safe to increase the voltage and what amount would you suggest? |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"OK, just to be sure, since I don't have any experience with the GTX 660 - the current core voltage is 1.175 V, core clock 1163 MHz, memory clock 1652 MHz, temperature 58-60°C. Is it safe to increase the voltage and what amount would you suggest?" Hmmm, 1.175 V is the max on an unmodified 680, so it's a good point to sit back & question things, & test the water to see what's going on. Hold the voltage changes until you know more about what the max available setting is. I would instead back off the core clock a little (say ~1050 MHz) for the sake of looking at the memory in isolation, taking the chance of a core issue out of the question first. Then I'd step back the memory speed & ease back up step by step to actually try *higher* memory speeds until it starts to show artefacts in some scanning tool (then back off two 10 MHz notches further from 'clean'). It's possible there is some sort of memory issue going on that a small boost can sometimes overcome. That would actually make the most sense if you see changes in the memory transfer stall behaviour. Jason [Edit:] Potential side issues: your PCIe slots are clocked at a fixed 100 MHz, right? BIOS [& chipset drivers] up to date as well? |
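The memory-clock procedure described in the post above (start low, step up until artefacts appear, then settle two notches below the last clean setting) can be written down as a small search loop. This is only a Python sketch of the bookkeeping, under stated assumptions: `shows_artefacts` is a hypothetical callback standing in for the manual work of setting the clock in a tuning tool and running a scanning tool at that setting, and the function name and parameters are invented for illustration. Nothing here talks to the GPU.

```python
def pick_memory_clock(start_mhz, limit_mhz, shows_artefacts,
                      step_mhz=10, safety_notches=2):
    """Step the memory clock upward from start_mhz in step_mhz increments.

    Stops at the first clock where shows_artefacts(clock) is true or when
    limit_mhz would be exceeded. Returns the last clean clock minus
    safety_notches * step_mhz, or None if even start_mhz showed artefacts.
    """
    last_clean = None
    clock = start_mhz
    while clock <= limit_mhz:
        if shows_artefacts(clock):
            break  # first bad setting found; stop stepping up
        last_clean = clock
        clock += step_mhz
    if last_clean is None:
        return None
    # Back off the safety margin from the highest clean setting.
    return last_clean - safety_notches * step_mhz

# Illustrative run with a made-up card that starts to artefact at 1700 MHz.
fake_scan = lambda mhz: mhz >= 1700
print(pick_memory_clock(1600, 1800, fake_scan))
```

In practice each call to the callback is a full scan at that clock, so the loop mostly serves as a checklist: one variable changed per step, with the core clock held at its reduced value throughout.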
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
As another last-minute elimination idea before going to bed, clearing the CUDA compute cache could possibly take some old-driver-generated junk out. That can be an issue sometimes if early retail box drivers were used in the past, and sometimes it isn't cleared by a driver update (even with the clean install advanced option). The path to navigate to, with BOINC stopped, in a Windows Explorer window is: %APPDATA%\ComputeCache , and delete all the content there (or the whole ComputeCache folder). It just takes another variable out of the mix. |
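For anyone who prefers to script the cache clean-up described above, here is a minimal Python sketch. It assumes BOINC has already been stopped and takes the cache folder as a parameter rather than hard-coding it: the post gives the location as %APPDATA%\ComputeCache, while NVIDIA's CUDA documentation lists the Windows default as %APPDATA%\NVIDIA\ComputeCache, so it is worth checking which folder actually exists on the machine. The function name `clear_compute_cache` is made up for illustration.

```python
import shutil
from pathlib import Path

def clear_compute_cache(cache_dir):
    """Delete everything inside cache_dir, keeping the folder itself.

    Returns the number of top-level entries removed, or 0 if the folder
    does not exist. Run this only while BOINC is stopped.
    """
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return 0
    removed = 0
    for entry in cache_dir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)   # real sub-folder: remove recursively
        else:
            entry.unlink()         # plain file or symlink
        removed += 1
    return removed

# Example call (not executed here), using the path quoted in the post:
#   import os
#   clear_compute_cache(Path(os.environ["APPDATA"]) / "ComputeCache")
```

The driver rebuilds the cache the next time a CUDA application starts, so the first tasks after the clean-up may take slightly longer to begin.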
Helsionium Joined: 24 Dec 06 Posts: 156 Credit: 86,214,817 RAC: 43 |
Unfortunately I corrupted my BIOS due to a very badly designed BIOS update procedure. Since I probably will not find a way to recover it, I must wait a few days for a replacement motherboard. I'm thankful for all your ideas and will continue with your suggestions as soon as possible. |