OpenCL (GPU) task stalls, Nvidia driver issue?
Author | Message |
---|---|
Gene Joined: 26 Apr 99 Posts: 150 Credit: 48,393,279 RAC: 118 |
Here's the system description: linux_x64, AMD 4-core FX4300, GTX 750Ti, BOINC version 7.6.22, Nvidia driver 361.45.11; nothing overclocked, configured for 2 concurrent GPU tasks.

Here is the stalled task situation:
- It happens about once per day, i.e. 1 in 90 GPU tasks.
- A running GPU task "stalls", i.e. progress % does NOT advance.
- Time "remaining" slowly builds.
- Elapsed time continues advancing, but "CPU at last checkpoint" is stuck.
- None of the files in /slots/xx show any change in size or access time.
- "nvidia-smi pmon" shows the stalled task with ZERO % utilization (gpu & mem).
- "top" shows the stalled task with 99% CPU share (1 core is configured for each GPU task).
- BOTH opencl_nvidia_100 (AstroPulse) AND opencl_nvidia_SoG have been seen to stall this way, AstroPulse only once (understandably).
- The SECOND concurrent task seems to be unaffected. If it finishes, another GPU task starts and seems to run normally.

The good news -- it is a recoverable situation. Either of the following two procedures will work:
(1) Suspend the offending task; wait a few seconds for unknown boincmgr stuff to happen; resume the offending task. It will pick up at the last CPU checkpoint and, as far as I can tell, run to a normal completion.
(2) Boincmgr -> Suspend GPU; wait a few seconds; Boincmgr -> Use GPU always; and the offending task will resume and run to a normal completion.

Because of the stalls in both AP and MB, I suspect the Nvidia driver rather than a particular application. Is there any diagnostic/debug information that can be captured when this happens next?

Given three reasonable "next step" choices, which might be most informative about the cause, or which might be most likely to "cure" the problem?
(1) Install the latest (367.35) Nvidia driver; or
(2) Back off from 2 concurrent GPU tasks to 1; or
(3) Revert to an earlier Nvidia driver. (I have a few in the "archives".)

I am inclined to try #2 but I can be (easily) persuaded to try something else, especially if it would give useful diagnostic data. Thanks for your attention. |
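
One way to capture the stall symptom described above is to save the text printed by "nvidia-smi pmon" and flag compute processes that report 0% for both sm and mem. The following is a minimal Python sketch, not a tested diagnostic tool. It assumes the usual pmon layout, where a "# gpu pid type sm mem ..." header line names the columns; the function names and the sample text are made up for illustration. A single 0% sample can occur on a healthy task, so several consecutive samples would be needed before calling a task stalled.

```python
def parse_pmon(text):
    """Parse `nvidia-smi pmon` text into a list of dicts keyed by header name.

    Assumption: the first comment line starting with "# gpu" names the columns.
    """
    columns = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            fields = line.lstrip("#").split()
            if fields and fields[0] == "gpu":
                columns = fields  # header line with the column names
            continue  # the units line ("# Idx ...") is skipped
        if columns is None:
            continue
        # The last column (command name) absorbs any remaining text.
        values = line.split(None, len(columns) - 1)
        if len(values) == len(columns):
            rows.append(dict(zip(columns, values)))
    return rows


def idle_compute_pids(rows):
    """Return PIDs of compute ("C") processes reporting 0% sm and 0% mem."""
    idle = []
    for row in rows:
        if row.get("type") != "C":
            continue
        if row.get("sm") == "0" and row.get("mem") == "0":
            idle.append(int(row["pid"]))
    return idle


# Hypothetical sample in the assumed layout; PIDs and names are invented.
SAMPLE = """\
# gpu        pid  type    sm   mem   enc   dec   command
# Idx          #   C/G     %     %     %     %   name
    0       4321     C     0     0     -     -   stalled_app
    0       4322     C    97    41     -     -   running_app
"""

if __name__ == "__main__":
    print(idle_compute_pids(parse_pmon(SAMPLE)))
```

Parsing by header name rather than by fixed position is a deliberate choice here, since the set of columns may differ between driver versions.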
Zalster Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242 |
It's the driver crashing, you are correct. Try upgrading the driver to the newest version from the Nvidia website. I'm running the latest 2 versions and haven't had any problems with them. |
Urs Echternacht Joined: 15 May 99 Posts: 692 Credit: 135,197,781 RAC: 211 |
Hi Gene, you're running anonymous platform with cmdline set in app_info.xml, is that right? What command-line options are set? Have you set no_priority_change in cc_config.xml in the BOINC directory? It could help when running more than one task. <options> ... <no_priority_change>1</no_priority_change> ... </options> _\|/_ U r s |
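
The cc_config.xml change suggested above can also be made programmatically. This is a minimal Python sketch using the standard-library ElementTree. It assumes the file has a <cc_config> root with an <options> child; the function name is hypothetical, and it works on XML text so it can be tried without touching a real BOINC directory.

```python
import xml.etree.ElementTree as ET


def set_no_priority_change(xml_text, enabled=True):
    """Return cc_config XML text with <no_priority_change> set under <options>.

    Existing options are left untouched; missing elements are created.
    """
    root = ET.fromstring(xml_text)
    if root.tag != "cc_config":
        raise ValueError("expected a <cc_config> root element")
    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")
    flag = options.find("no_priority_change")
    if flag is None:
        flag = ET.SubElement(options, "no_priority_change")
    flag.text = "1" if enabled else "0"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Illustrative input only; a real file will likely hold more options.
    print(set_no_priority_change("<cc_config><options></options></cc_config>"))
```

The client only picks up the change after it re-reads its configuration files or is restarted.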
Gene Joined: 26 Apr 99 Posts: 150 Credit: 48,393,279 RAC: 118 |
@Zalster -- Yeah, it's the easiest thing to try. I'll upgrade the driver 361.45 -->> 367.35 (which is the current one on the nvidia.com site, released 7/15/16) later today and should know within a couple of days whether there is any change. Leaving everything else the same. @Urs -- Yes, anonymous platform. The cmd line options are in the app_config.xml as follows: /snip/
<cmdline>-sbs 384 -period_iterations_num 15</cmdline> ... <plan_class>opencl_nvidia_100</plan_class> <cmdline>-unroll 4 -ffa_block 2048 -ffa_block_fetch 1024 -oclFFT_plan 256 16 256 -hp</cmdline> ...
|
Mike Joined: 17 Feb 01 Posts: 34257 Credit: 79,922,639 RAC: 80 |
How many CPU tasks are you running? If you are running more than 2, it's a lack of CPU resources. With each crime and every kindness we birth our future. |
Urs Echternacht Joined: 15 May 99 Posts: 692 Credit: 135,197,781 RAC: 211 |
<plan_class>opencl_nvidia_SoG</plan_class> Hint: also try adding -oclfft_tune_gr 256.

About the -hp option on Linux, I will cite the AstroPulse GPU app Readme:

-hp : Results in bigger priority for application process (normal priority class and above normal thread priority). Can be used to increase GPU load, experimentation required for particular GPU/CPU/GPU driver combo. (not available on 64bit linux or MacOSX) On Linux and MacOSX : Due to OS permission setting rules you have the chance to achieve normal priority by setting <no_priority_change>1</no_priority_change> in <options> section of your BOINCs "cc_config.xml" file. Check BOINC manuals/wiki for details where to find and how to set this up.

The same applies to the setiathome app. So you could leave -hp out; it does nothing. _\|/_ U r s |
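
Whether a priority setting actually took effect can be checked from the nice value of the running process, which is the same number shown in the NI column of "top". Here is a minimal Python sketch for Linux using the standard-library os.getpriority. The helper name nice_value is made up for illustration, and a real check would pass the PID of the GPU task instead of the default.

```python
import os


def nice_value(pid=0):
    """Return the nice value of a process; pid 0 means the calling process."""
    return os.getpriority(os.PRIO_PROCESS, pid)


if __name__ == "__main__":
    # Lower numbers mean higher scheduling priority; 0 is the normal default.
    print(nice_value())
```

Comparing the value before and after changing <no_priority_change> would show whether the setting had any effect on this system.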
Gene Joined: 26 Apr 99 Posts: 150 Credit: 48,393,279 RAC: 118 |
The Nvidia driver was updated to 367.35 on July 27, 08:40 local time, and a GPU task was found in the "stalled" state 5 hours later, at 13:00 local time. So I guess that narrows the possible bugs a little bit.

@Mike -- I have a 4-core CPU and I've configured for max_concurrent_project = 4. In the normal running state the 2 concurrent GPU tasks get their 99% CPU resources (according to "top", which you will be familiar with as a Linux utility) and 2 CPU tasks get full use of the other cores. Even in the "stalled" state the offending task gets its 99% CPU time and continues to accumulate (Linux) CPU time, but makes no progress due to 0% GPU utilization. See comment below re: no_priority_change.

@Urs -- Will remove the -hp cmdline option; it probably got in there long ago from copying somebody's message board suggestion. Evidently harmless, but it's good practice to remove stuff that has no effect. The -oclfft_tune_gr 256 option is worth a benchmark run. There are so many "oclfft" options I haven't had the patience to benchmark the many values and combinations. Some old science fiction dilemma comes to mind - "We're lost in 10 dimensional hyperspace!" <no_priority_change> is now set to 1, still configured for 2 concurrent GPU tasks. I should know within 48 hours whether this changes anything in a positive way. If it does not, then I'll back off to single GPU task mode. |
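
Since the stall persists with the new driver, the suspend/resume workaround from the first post could be scripted rather than done by hand in the manager. This is a minimal Python sketch that assumes boinccmd is on the PATH and accepts "--task <project_url> <task_name> suspend|resume". The function names, the 5-second pause and the task name in the usage example are hypothetical.

```python
import subprocess
import time


def task_command(project_url, task_name, operation):
    """Build the boinccmd argument list for suspending or resuming one task."""
    if operation not in ("suspend", "resume"):
        raise ValueError("operation must be 'suspend' or 'resume'")
    return ["boinccmd", "--task", project_url, task_name, operation]


def nudge_task(project_url, task_name, pause_seconds=5.0, run=subprocess.run):
    """Suspend a task, wait a few seconds, then resume it.

    `run` is injectable so the sequence can be checked without a BOINC client.
    """
    run(task_command(project_url, task_name, "suspend"), check=True)
    time.sleep(pause_seconds)
    run(task_command(project_url, task_name, "resume"), check=True)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    nudge_task(
        "http://setiathome.berkeley.edu/",
        "example_task_name_0",  # hypothetical task name
        pause_seconds=0,
        run=lambda cmd, check: print(" ".join(cmd)),
    )
```

This only automates the recovery; it does not address whatever causes the stall.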
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.