I am getting errors from my GPU-tasks. Is CUDA to blame?
Author | Message |
---|---|
musicplayer Joined: 17 May 10 Posts: 2430 Credit: 926,046 RAC: 0 |
Am I doing something wrong right now, or is it on your end? Possibly there is some testing under way? I am getting a couple of computation errors on my tasks right now. The version of BOINC Manager I am using is 6.10.58. The tasks which get computation errors are the Seti@home Enhanced 6.10 (cuda_fermi) tasks, which should be using 0.49 CPU + 1.00 nVidia GPU (from memory; since they will not run, they are listed as "Waiting to Run" anyway). The Seti@home Enhanced 6.03 tasks still take much longer to complete, but apparently these tasks are successful even when apparently being run by means of CUDA instead of using the processor. One or more of my CUDA tasks are waiting to run while others are running. I am using one card, but apparently up to 5 tasks with status 0.49 CPUs (Lower than Normal priority in Windows Task Manager) are running simultaneously, apparently using my graphics card, based on the sound from the monitor fan. If not running, a task may instead be listed as "Waiting to Run" and may be a little slow to get started. Anyway, it apparently does not matter whether CUDA has been enabled or not; the tasks run anyway as Seti@home Enhanced 6.10 (cuda_fermi) using my nVidia GTX 480 card. These tasks are not .vlar tasks, for that matter. Should I just watch it, or should I try adjusting my parameter settings or possibly web settings? Any suggestions are welcome. Thanks! |
Link Joined: 18 Sep 03 Posts: 834 Credit: 1,807,369 RAC: 0 |
"Any suggestions are welcome." Reboot the system perhaps? |
musicplayer Joined: 17 May 10 Posts: 2430 Credit: 926,046 RAC: 0 |
Bump. Not my first day here. Please check out my earlier post here: http://setiathome.berkeley.edu/forum_thread.php?id=68947&postid=1268876 Yesterday I was installing Windows on a new partition using an existing Windows installation which is probably the best I have. A shortcut to an existing BOINC Manager 6.10.58, which I had installed in another location, showed up on the Start menu. I actually noticed this already then and suspected a mix-up between BOINC Manager 6.10.58 and 6.12.34. Now I am back on a working partition which should be up and running. My best guess is that it is not my fault. I should mention that the nVidia graphics driver being used apparently is version 260.99. Also I had a PrimeGrid Genefer task running here by means of CUDA for roughly half an hour a little earlier on. At least the card or processor was able to run that particular task. As usual, a combination of monitor card and processor is still needed in order to run these tasks. |
musicplayer Joined: 17 May 10 Posts: 2430 Credit: 926,046 RAC: 0 |
Too late to edit my previous post, but the PrimeGrid Genefer task which runs by means of CUDA only is cuda32_13. It only runs when CUDA is enabled; otherwise it is "Waiting to Run" when not suspended. Apparently this task runs at "Normal" priority as seen in Windows Task Manager. My guess is that the PrimeGrid tasks get ahead in the queue because of their shorter report deadline, so other tasks, particularly the Seti@home tasks, may be listed as "Waiting to Run" if they are competing against each other for running time. And now, with CUDA once more enabled, the Seti@home Enhanced 6.10 (cuda_fermi) tasks which I just received are "Ready to Start", but still not running. No file transfers. They have all completely downloaded. Also they are not listed as "Suspended" in my BOINC Manager task list. But they are not running, although I have both the door and the window open here and an electric fan blowing cold air (I checked it out) turned on at double strength (two button knobs) as well. But still, if I again try restarting that PrimeGrid Genefer task, which by the way is a World Record task, my guess is that this task will once again start running. And in fact, no surprise at all, this task does commence when changed from "Suspended" to "Running". |
musicplayer Joined: 17 May 10 Posts: 2430 Credit: 926,046 RAC: 0 |
I suspended five .vlar tasks running using Seti@home Enhanced 6.03 (the usual way), leaving three PrimeGrid tasks running on the CPU. The 11 Seti@home Enhanced 6.10 (cuda_fermi) tasks I just received, which were "Ready to Start" even with CUDA enabled, immediately started running once more with CUDA enabled, but still using 0.49 CPUs, with no mention of "+ 1.00 nVidia GPU". At least one of the tasks got a computation error once again. No .vlar tasks among the latter 11 tasks, though. Edit on this: Restarting the five .vlar tasks one at a time leaves me now with one such task waiting to run and four tasks running. Still three tasks running at 0.49 CPUs and also those three PrimeGrid CPU tasks. Enabling CUDA leaves the "Waiting to Run" task in the same state. |
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
"At least one of the tasks got a computation error once again." Restart your PC, or free up some GPU memory: Cuda error 'cudaMalloc((void**) &dev_PowerSpectrum' in file 'd:/Projects/SETI/seti_boinc/client/cuda/cudaAcceleration.cu' in line 298 : out of memory. And while you're at it, don't SPAM the Panic Mode Server Problem thread; make your own thread to describe your problems. Claggy |
LadyL Joined: 14 Sep 11 Posts: 1679 Credit: 5,230,097 RAC: 0 |
"At least one of the tasks got a computation error once again." "Restart your PC, or free up some GPU memory:" With that much VRAM it shouldn't matter, but apparently it does. Some programs are known to hog VRAM. I'd reboot to clear the RAM and run GPU-Z to see how much free VRAM there is. Regarding tasks running in parallel, a screenshot of BOINC Manager would help. PrimeGrid may be detecting a lack of free VRAM (as Lunatics apps do) and keeping the tasks on hold. And do start a new thread! I'm not the Pope. I don't speak Ex Cathedra! |
musicplayer Joined: 17 May 10 Posts: 2430 Credit: 926,046 RAC: 0 |
Following up on this, you may remember that I recently posted in Questions and Answers about a possible sound problem. http://setiathome.berkeley.edu/forum_thread.php?id=68854 Suddenly there was no sound from my external Sound Blaster speakers, and my first suspicion fell on an Adobe Flash Player software update (Update Service 11.3 r300), because a dialog box came up on my screen telling me that the software had a problem and needed to be closed. I almost forgot about the problem, because I have a small USB plug (Logitech G330) for my headphones which lets me attach the left and right plugs on my headphones to a USB port at the rear of the PC while at the same time connecting my Sound Blaster speakers to the RealTek sound system which is integrated on the motherboard. But why not check the task list for the computer I am currently using here. http://setiathome.berkeley.edu/results.php?hostid=5648308 Notice the Seti@home Enhanced 6.10 (cuda_fermi) tasks. Most of them are ending up in error even though the nVidia driver version being used is 260.99. Since I ran a couple of these tasks with CUDA enabled and, as usual, had my numbers recorded in Seti@home-MapView to have a closer look at a later time, I did not expect anything in particular to happen, since I also chose to run a PrimeGrid Genefer World Record task by means of CUDA (cuda32_13). This task ran without problems for a little more than half an hour a little earlier on today. So what may have happened? Did a virus strike me, or something else? Except for the sound problem and the apparent result problem with the Seti@home Enhanced 6.09 CUDA tasks, I cannot see any other particular problem with the current installation. I do have quite a bit of software in my possession. Still, I would like some advice on troubleshooting, that is, what software I might use for this problem and whether there might be a related hardware problem as well. Thanks! |
LadyL Joined: 14 Sep 11 Posts: 1679 Credit: 5,230,097 RAC: 0 |
We had a bit of reshuffling of threads, so I'll reiterate. Your last handful of CUDA tasks have been erroring out with: Cuda error 'cudaMalloc((void**) &dev_PowerSpectrum' in file 'd:/Projects/SETI/seti_boinc/client/cuda/cudaAcceleration.cu' in line 298 : out of memory. setiathome_CUDA: CUDA runtime ERROR in device memory allocation (Step 1 of 3). Falling back to HOST CPU processing... Plainly speaking, for some reason you don't have enough free VRAM at task startup. CPU fallback is far too slow to allow completion within the time allotted to the task, so you get -177 errors. A reboot will clear the VRAM. If you then still have problems, you need to check what graphics-intensive application might be grabbing (and hogging) the VRAM. I'm not familiar with PrimeGrid, but I do hope their apps properly release VRAM when they get preempted. I'm not the Pope. I don't speak Ex Cathedra! |
musicplayer Joined: 17 May 10 Posts: 2430 Credit: 926,046 RAC: 0 |
So now I apparently have 17 Seti@home Enhanced 6.10 (cuda_fermi) tasks running at 0.48 CPUs at the same time with the new "setup". I suspended my PrimeGrid tasks for now. But if I restart one or more of these LLR-based tasks (non-CUDA because of the implementation), the number of simultaneously running Seti@home tasks of the above-mentioned type will decrease accordingly. Apparently enabling or disabling the CUDA option does not matter. But is the monitor fan turning up all the time an indication that CUDA possibly is being used behind the scenes? Anyway, that was what I initially thought. I may have been wrong. I will have to keep a watch on this. Time will tell. Apparently the screen is sluggish, an indication that this resource is being used. Also I now get the hourglass for my mouse and I need to wait for a couple of seconds while editing. But is it then possible to disable this activity without suspending the tasks or possibly exiting BOINC Manager completely? Still, processing times may tell a little about the type of task being processed. One type of task returns only spikes, pulses and possibly triplets. A second type of task carries out the Gaussian search, and a third task type, which apparently now runs here as well as the first two types (am I right here?), is now doing what the .vlar tasks earlier were supposed to do: finding spikes, pulses and possibly triplets as well. I may really ask this: Who came up with this idea - was it nVidia, or was it Adobe? |
Eric Korpela Joined: 3 Apr 99 Posts: 1382 Credit: 54,506,847 RAC: 60 |
There's a bug in the current scheduler. Whereas prior schedulers reported "CUDA" as the coprocessor for NVIDIA GPU tasks, the current scheduler suddenly reports "NVIDIA." So your machine thinks all it needs for the task is a fraction of a CPU and no GPU because 6.X clients won't recognize one NVIDIA coprocessor as the equivalent of one CUDA coprocessor. I've contacted David and Rom trying to find out when this change happened and which BOINC clients need CUDA rather than NVIDIA and what we should do about it. Until then I've deprecated all CUDA apps. @SETIEric@qoto.org (Mastodon) |
Jim Bohan Joined: 23 Dec 01 Posts: 58 Credit: 65,355,247 RAC: 6 |
I have a question... about 3-4 days ago my computer and BOINC went bonkers! I was running 10 tasks, mostly CUDA ones. My system has 4 cores (AMD) and one Nvidia 260. How can that happen?? Most of the units that downloaded took about 2 seconds to download, which is not normal, and when they finished there were computation errors. I shut the system down and restarted; they were doing the same thing even after that. I aborted all the current and pending tasks, and tonight I did an update and got more tasks. I just checked the system and I seem to be running about 8 6.03 CUDA processes and one Astropulse. Has the way GPU tasks are processed changed?? In the past I would have 4 normal CPU tasks running and 1 CUDA task. BTW, my computer, which is not a wimp, was maxed out: sluggish cursor, drive light constantly on, when I first started seeing the issues with Seti. If I stopped BOINC everything returned to normal. ??? Member B-52 Stratofortress Association Retired Air Force |
mole Joined: 19 Jan 02 Posts: 5 Credit: 32,233,287 RAC: 15 |
Same here I think. Had a bunch of WU go nuts on a few of my computers. Had up to 30 setiathome_6.09_windows_intelx86__cuda23 running at one time. All ended in computation errors. Started around early part of this week. |
Bernie Vine Joined: 26 May 99 Posts: 9954 Credit: 103,452,613 RAC: 328 |
"I have a question... about 3-4 days ago my computer and BOINC went bonkers! I was running 10 tasks, mostly CUDA ones. My system has 4 cores (AMD) and one Nvidia 260. How can that happen?? Most of the units that downloaded took about 2 seconds to download, which is not normal, and when they finished there were computation errors. I shut the system down and restarted; they were doing the same thing even after that." See Eric's post before yours. |
Gundolf Jahn Joined: 19 Sep 00 Posts: 3184 Credit: 446,358 RAC: 0 |
"Same here I think. Had a bunch of WU go nuts on a few of my computers. Had up to 30 setiathome_6.09_windows_intelx86__cuda23 running at one time. All ended in computation errors. Started around early part of this week." Does anyone ever read answers in the thread before posting in it? Regards, Gundolf |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
"There's a bug in the current scheduler. Whereas prior schedulers reported "CUDA" as the coprocessor for NVIDIA GPU tasks, the current scheduler suddenly reports "NVIDIA." So your machine thinks all it needs for the task is a fraction of a CPU and no GPU because 6.X clients won't recognize one NVIDIA coprocessor as the equivalent of one CUDA coprocessor." David has coded a possible fix in changeset 26042. If that tests out OK, it should be relatively easy to deploy once they get into the lab. |
mole Joined: 19 Jan 02 Posts: 5 Credit: 32,233,287 RAC: 15 |
"Does anyone ever read answers in the thread before posting in it?" Yes. Added my two cents. You are welcome. |
Fred J. Verster Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0 |
Does anyone ever read answers in the thread before posting in it? Is this an example of these server-bugs? <core_client_version>7.0.28</core_client_version> <![CDATA[ <stderr_txt> Cuda error 'Couldn't get cuda device count ' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 146 : no CUDA-capable device is detected. setiathome_CUDA: cudaGetDeviceCount() call failed. setiathome_CUDA: No CUDA devices found setiathome_CUDA: Found 0 CUDA device(s): In cudaAcc_initializeDevice(): Boinc passed DevPref 1 setiathome_CUDA: CUDA Device 1 specified, checking... Device cannot be used Cuda device initialisation retry 1 of 6, waiting 5 secs... Cuda error 'Couldn't get cuda device count ' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 146 : no CUDA-capable device is detected. setiathome_CUDA: cudaGetDeviceCount() call failed. setiathome_CUDA: No CUDA devices found setiathome_CUDA: Found 0 CUDA device(s): In cudaAcc_initializeDevice(): Boinc passed DevPref 1 setiathome_CUDA: CUDA Device 1 specified, checking... Device cannot be used Cuda device initialisation retry 2 of 6, waiting 5 secs... Cuda error 'Couldn't get cuda device count ' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 146 : no CUDA-capable device is detected. setiathome_CUDA: cudaGetDeviceCount() call failed. setiathome_CUDA: No CUDA devices found setiathome_CUDA: Found 0 CUDA device(s): In cudaAcc_initializeDevice(): Boinc passed DevPref 1 setiathome_CUDA: CUDA Device 1 specified, checking... Device cannot be used Cuda device initialisation retry 3 of 6, waiting 5 secs... Cuda error 'Couldn't get cuda device count ' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 146 : no CUDA-capable device is detected. setiathome_CUDA: cudaGetDeviceCount() call failed. 
setiathome_CUDA: No CUDA devices found setiathome_CUDA: Found 0 CUDA device(s): In cudaAcc_initializeDevice(): Boinc passed DevPref 1 setiathome_CUDA: CUDA Device 1 specified, checking... Device cannot be used Cuda device initialisation retry 4 of 6, waiting 5 secs... Cuda error 'Couldn't get cuda device count ' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 146 : no CUDA-capable device is detected. setiathome_CUDA: cudaGetDeviceCount() call failed. setiathome_CUDA: No CUDA devices found setiathome_CUDA: Found 0 CUDA device(s): In cudaAcc_initializeDevice(): Boinc passed DevPref 1 setiathome_CUDA: CUDA Device 1 specified, checking... Device cannot be used Cuda device initialisation retry 5 of 6, waiting 5 secs... Cuda error 'Couldn't get cuda device count ' in file 'c:/[Projects]/X_CudaMB/client/cuda/cudaAcceleration.cu' in line 146 : no CUDA-capable device is detected. setiathome_CUDA: cudaGetDeviceCount() call failed. setiathome_CUDA: No CUDA devices found setiathome_CUDA: Found 0 CUDA device(s): In cudaAcc_initializeDevice(): Boinc passed DevPref 1 setiathome_CUDA: CUDA Device 1 specified, checking... Device cannot be used Cuda initialisation FAILED, Initiating Boinc temporary exit (180 secs) Preemptively Acknowledging temporary exit -> boinc_exit(): requesting safe worker shutdown -> boinc_exit(): received safe worker shutdown acknowledge -> setiathome_CUDA: Found 1 CUDA device(s): Device 1: GeForce GTX 260, 896 MiB, regsPerBlock 16384 computeCap 1.3, multiProcs 27 clockRate = 1242000 In cudaAcc_initializeDevice(): Boinc passed DevPref 1 setiathome_CUDA: CUDA Device 1 specified, checking... Device 1: GeForce GTX 260 is okay SETI@home using CUDA accelerated device GeForce GTX 260 Priority of process raised successfully Priority of worker thread raised successfully Cuda Active: Plenty of total Global VRAM (>300MiB). All early cuFft plans postponed, to parallel with first chirp. 
) _ _ _)_ o _ _ (__ (_( ) ) (_( (_ ( (_ ( not bad for a human... _) Multibeam x41g Preview, Cuda 3.20 Legacy setiathome_enhanced V6 mode. Work Unit Info: ............... WU true angle range is : 2.713896 VRAM: cudaMalloc((void**) &dev_cx_DataArray, 1048576x 8bytes = 8388608bytes, offs256=0, rtotal= 8388608bytes VRAM: cudaMalloc((void**) &dev_cx_ChirpDataArray, 1179648x 8bytes = 9437184bytes, offs256=0, rtotal= 17825792bytes VRAM: cudaMalloc((void**) &dev_flag, 1x 8bytes = 8bytes, offs256=0, rtotal= 17825800bytes VRAM: cudaMalloc((void**) &dev_WorkData, 1179648x 8bytes = 9437184bytes, offs256=0, rtotal= 27262984bytes VRAM: cudaMalloc((void**) &dev_PowerSpectrum, 1048576x 4bytes = 4194304bytes, offs256=0, rtotal= 31457288bytes VRAM: cudaMalloc((void**) &dev_t_PowerSpectrum, 1048584x 4bytes = 1048608bytes, offs256=0, rtotal= 32505896bytes VRAM: cudaMalloc((void**) &dev_GaussFitResults, 1048576x 16bytes = 16777216bytes, offs256=0, rtotal= 49283112bytes VRAM: cudaMalloc((void**) &dev_PoT, 1572864x 4bytes = 6291456bytes, offs256=0, rtotal= 55574568bytes VRAM: cudaMalloc((void**) &dev_PoTPrefixSum, 1572864x 4bytes = 6291456bytes, offs256=0, rtotal= 61866024bytes VRAM: cudaMalloc((void**) &dev_NormMaxPower, 16384x 4bytes = 65536bytes, offs256=0, rtotal= 61931560bytes VRAM: cudaMalloc((void**) &dev_flagged, 1048576x 4bytes = 4194304bytes, offs256=0, rtotal= 66125864bytes VRAM: cudaMalloc((void**) &dev_outputposition, 1048576x 4bytes = 4194304bytes, offs256=0, rtotal= 70320168bytes VRAM: cudaMalloc((void**) &dev_PowerSpectrumSumMax, 262144x 12bytes = 3145728bytes, offs256=0, rtotal= 73465896bytes VRAM: cudaMallocArray( &dev_gauss_dof_lcgf_cache, 1x 8192bytes = 8192bytes, offs256=64, rtotal= 73474088bytes VRAM: cudaMallocArray( &dev_null_dof_lcgf_cache, 1x 8192bytes = 8192bytes, offs256=112, rtotal= 73482280bytes VRAM: cudaMalloc((void**) &dev_find_pulse_flag, 1x 8bytes = 8bytes, offs256=0, rtotal= 73482288bytes VRAM: cudaMalloc((void**) &dev_t_funct_cache, 1966081x 4bytes = 
7864324bytes, offs256=0, rtotal= 81346612bytes cudaAcc_free() called... cudaAcc_free() running... cudaAcc_free() PulseFind freed... cudaAcc_free() Gaussfit freed... cudaAcc_free() AutoCorrelation freed... cudaAcc_free() DONE. Flopcounter: 10431918437522.469000 Spike count: 1 Pulse count: 0 Triplet count: 2 Gaussian count: 0 Worker preemptively acknowledging a normal exit.-> called boinc_finish boinc_exit(): requesting safe worker shutdown -> boinc_exit(): received safe worker shutdown acknowledge -> </stderr_txt> ]]> WUID 1047649350. I already have my doubts....... This can be something completely different.... ;-) |
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
"Does anyone ever read answers in the thread before posting in it?" No, this is not an example of the server bug; it is an example of the Nvidia Sleeping Monitor Bug. Don't people ever read the Stickies? Claggy |
w1hue Joined: 4 Aug 00 Posts: 69 Credit: 5,492,898 RAC: 7 |
What's a 'Stickie'?? |
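
The stderr triage that Claggy and LadyL do by eye in this thread can be roughly sketched in Python. This is only an illustration: the function name `classify_stderr` and the wording of the findings are invented here, and only the matched message fragments are taken from the logs posted above. Other application builds may word their errors differently, so treat the patterns as assumptions rather than a complete list.

```python
import re

# Message fragments copied from the stderr excerpts quoted in this thread.
# Assumption: each error is reported on a single line of the original log.
OOM_RE = re.compile(r"Cuda error 'cudaMalloc.*?out of memory")
FOUND_RE = re.compile(r"Found (\d+) CUDA device\(s\)")
CPU_FALLBACK = "Falling back to HOST CPU processing"
NO_DEVICE = "no CUDA-capable device is detected"


def classify_stderr(stderr_text: str) -> list:
    """Return short findings for one task's stderr text (hypothetical helper)."""
    findings = []
    if OOM_RE.search(stderr_text):
        findings.append("VRAM allocation failed (out of memory)")
    if CPU_FALLBACK in stderr_text:
        findings.append("task fell back to CPU processing")
    if NO_DEVICE in stderr_text:
        # Look at the last reported device count to see whether a retry worked.
        counts = [int(n) for n in FOUND_RE.findall(stderr_text)]
        if counts and counts[-1] > 0:
            findings.append("GPU missing at first, detected on a later attempt")
        else:
            findings.append("no CUDA device detected")
    return findings


# Sample built from the error text LadyL quotes above.
oom_log = (
    "Cuda error 'cudaMalloc((void**) &dev_PowerSpectrum' in file "
    "'d:/Projects/SETI/seti_boinc/client/cuda/cudaAcceleration.cu' "
    "in line 298 : out of memory.\n"
    "setiathome_CUDA: CUDA runtime ERROR in device memory allocation "
    "(Step 1 of 3). Falling back to HOST CPU processing...\n"
)
print(classify_stderr(oom_log))
```

The first two findings correspond to the out-of-memory case LadyL describes. The third corresponds to the pattern in Fred's log, which Claggy attributes to the Nvidia Sleeping Monitor Bug. The scheduler bug Eric describes does not show up in stderr at all, so a scan like this cannot detect it.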