OpenCL (GPU) task stalls, Nvidia driver issue?

Gene Project Donor

Joined: 26 Apr 99
Posts: 150
Credit: 48,393,279
RAC: 118
United States
Message 1805037 - Posted: 26 Jul 2016, 23:54:10 UTC

Here's the system description: linux_x64, AMD 4-core FX4300, GTX 750Ti, boinc version 7.6.22, Nvidia driver 361.45.11; nothing overclocked, configured for 2 concurrent GPU tasks.

Here is the stalled-task situation:
Happens about once per day, i.e. 1 in 90 GPU tasks;
A running GPU task "stalls" i.e. progress % does NOT advance;
Time "remaining" slowly builds;
Elapsed time continues advancing, but "CPU at last checkpoint" is stuck;
None of the files in /slots/xx show any change in size or access time;
"nvidia-smi pmon" shows the stalled task with ZERO % utilization (gpu & mem);
"top" shows the stalled task with 99% CPU share (1 core is configured for each GPU task);
BOTH opencl_nvidia_100 (AstroPulse) AND opencl_nvidia_SoG have been seen to stall this way, AstroPulse only once (understandably);
The SECOND concurrent task seems to be unaffected. If it finishes, another GPU task starts and seems to run normally.
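
The symptom pattern above (a process pinned near 99% CPU in "top" while "nvidia-smi pmon" reports no GPU activity for it) could be checked mechanically. What follows is a minimal Python sketch, not a tested tool. It assumes the usual pmon layout, in which the first "#" header line names the columns (gpu, pid, type, sm, mem, ..., command) and command names contain no spaces. The sample text, PIDs, and application names are invented for illustration and are not captured from this system.

```python
def parse_pmon(text):
    """Parse `nvidia-smi pmon` text into a list of dicts keyed by column name."""
    columns = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # The first header line names the columns; the second gives units.
            if columns is None:
                columns = line.lstrip("#").split()
            continue
        if columns is None:
            continue
        fields = line.split()
        if len(fields) == len(columns):
            rows.append(dict(zip(columns, fields)))
    return rows


def idle_gpu_pids(rows):
    """Return PIDs whose 'sm' and 'mem' utilization are both 0 or unreported ('-')."""
    idle = []
    for row in rows:
        if row.get("pid", "-") == "-":
            continue  # GPU with no process attached
        if row.get("sm") in ("0", "-") and row.get("mem") in ("0", "-"):
            idle.append(int(row["pid"]))
    return idle


# Invented sample in the assumed layout (not real output from this host).
SAMPLE = """\
# gpu        pid  type    sm   mem   enc   dec   command
# Idx          #   C/G     %     %     %     %   name
    0       4242     C     0     0     -     -   hypothetical_app_a
    0       4310     C    97    41     -     -   hypothetical_app_b
"""

print(idle_gpu_pids(parse_pmon(SAMPLE)))
```

A single zero sample proves little, since healthy tasks also pause briefly; the check only becomes meaningful if the same PID stays idle over several samples taken minutes apart.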

The good news -- it is a recoverable situation. Either of the following two procedures will work:
(1) suspend the offending task; wait a few seconds for unknown boincmgr stuff to happen; resume the offending task. It will pick up at the last CPU checkpoint and, as far as I can tell, run to a normal completion.
(2) Boincmgr -> Suspend GPU; wait a few seconds; Boincmgr -> Use GPU always; and the offending task will resume and run to a normal completion.
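
Both recovery procedures can probably be scripted with the boinccmd command-line tool instead of clicking through the Manager. The sketch below only builds the command lines; it uses boinccmd's `--task <url> <task_name> suspend|resume` and `--set_gpu_mode never|always` operations. The task name is a hypothetical placeholder, the 10-second pause is an arbitrary guess at "a few seconds", and the project URL must match whatever the client reports for the project.

```python
import subprocess
import time


def task_cycle_commands(project_url, task_name):
    """Procedure (1): suspend one task, then resume it."""
    base = ["boinccmd", "--task", project_url, task_name]
    return [base + ["suspend"], base + ["resume"]]


def gpu_cycle_commands():
    """Procedure (2): suspend all GPU work, then allow it again."""
    return [["boinccmd", "--set_gpu_mode", "never"],
            ["boinccmd", "--set_gpu_mode", "always"]]


def run_with_pause(commands, pause_seconds=10):
    """Run the commands in order, pausing between consecutive commands."""
    for index, cmd in enumerate(commands):
        if index:
            time.sleep(pause_seconds)
        subprocess.run(cmd, check=True)


# Print the commands rather than running them; the task name is hypothetical.
for cmd in task_cycle_commands("http://setiathome.berkeley.edu/",
                               "HYPOTHETICAL_TASK_NAME_0"):
    print(" ".join(cmd))
```

To actually execute a cycle, pass one of the command lists to `run_with_pause`. boinccmd normally needs to be run from the BOINC data directory (or be given the RPC password) to reach the client.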

Because of the stalls in both AP and MB I am suspecting the nvidia driver rather than a particular application.

Is there any diagnostic/debug information that can be captured when this happens next?
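
One kind of data that could be captured at the moment of a stall is the kernel's view of the stuck process, taken from /proc/<pid>/status on Linux. This is only a sketch of what might be worth saving, not a diagnostic recipe endorsed by the application developers. The sample text and its values are invented; the field names (State, Threads, voluntary_ctxt_switches, nonvoluntary_ctxt_switches) are the standard ones in that file.

```python
def parse_proc_status(text):
    """Turn the 'Key:<tab>value' lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def snapshot(pid):
    """Read a few fields that help tell a busy-spinning process from a sleeping one."""
    with open(f"/proc/{pid}/status") as handle:
        info = parse_proc_status(handle.read())
    wanted = ("Name", "State", "Threads",
              "voluntary_ctxt_switches", "nonvoluntary_ctxt_switches")
    return {key: info.get(key) for key in wanted}


# Invented sample text for illustration only.
SAMPLE_STATUS = (
    "Name:\thypothetical_app\n"
    "State:\tR (running)\n"
    "Threads:\t3\n"
    "voluntary_ctxt_switches:\t120\n"
    "nonvoluntary_ctxt_switches:\t98765\n"
)

print(parse_proc_status(SAMPLE_STATUS)["State"])
```

Two snapshots taken a minute apart would show whether the process is spinning on the CPU (state R, with the voluntary switch count barely moving) or sleeping while it waits on the driver.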

Given three reasonable "next step" choices, which would be most informative about the cause, or most likely to "cure" the problem?
(1) Install the latest (367.35) Nvidia driver; or
(2) Back off from 2 concurrent GPU tasks to 1; or
(3) Revert to an earlier Nvidia driver. (I have a few in the "archives".)

I am inclined to try #2 but I can be (easily) persuaded to try something else, especially if it would give useful diagnostic data.

Thanks for your attention.
ID: 1805037
Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1805041 - Posted: 27 Jul 2016, 0:12:51 UTC - in response to Message 1805037.  

It's the driver crashing, you are correct.

Try upgrading the driver to newest version from the nvidia website.

I'm running the latest 2 versions and haven't had any problems with them.
ID: 1805041
Urs Echternacht
Volunteer tester
Joined: 15 May 99
Posts: 692
Credit: 135,197,781
RAC: 211
Germany
Message 1805098 - Posted: 27 Jul 2016, 12:18:57 UTC - in response to Message 1805037.  
Last modified: 27 Jul 2016, 12:33:56 UTC

Hi Gene,

You're running anonymous platform with cmdline set in app_info.xml, is that right?

What command-line options are set?


Have you set no_priority_change in cc_config.xml in the BOINC directory?
It could help when running more than one task.
    <options>
        ...
        <no_priority_change>1</no_priority_change>
        ...
    </options>
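
For anyone who prefers to script the change rather than edit the file by hand, here is a minimal Python sketch that sets an option inside the `<options>` section of a cc_config.xml document. It assumes the standard `<cc_config>` root element; the helper name `set_cc_option` and the example input are made up for illustration, and the sketch works on a string so that nothing on disk is touched.

```python
import xml.etree.ElementTree as ET


def set_cc_option(xml_text, name, value):
    """Return cc_config XML text with <name>value</name> set inside <options>."""
    root = ET.fromstring(xml_text)
    options = root.find("options")
    if options is None:
        # Create the <options> section if the file does not have one yet.
        options = ET.SubElement(root, "options")
    elem = options.find(name)
    if elem is None:
        elem = ET.SubElement(options, name)
    elem.text = str(value)
    return ET.tostring(root, encoding="unicode")


# Illustrative input: an otherwise empty configuration.
EXAMPLE = "<cc_config><options></options></cc_config>"
updated = set_cc_option(EXAMPLE, "no_priority_change", 1)
print(updated)
```

After writing the result back to cc_config.xml, the client has to re-read its config files (or be restarted) before the setting takes effect.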

_\|/_
U r s
ID: 1805098
Gene Project Donor

Joined: 26 Apr 99
Posts: 150
Credit: 48,393,279
RAC: 118
United States
Message 1805125 - Posted: 27 Jul 2016, 15:25:25 UTC

@Zalster
-- Yeah, it's the easiest thing to try. I'll upgrade driver 361.45 -->> 367.35 (which is the current one on the nvidia.com site, released 7/15/16) later today and should know within a couple of days whether there is any change. Leaving everything else the same.

@Urs
-- Yes, anonymous platform. The cmd line options are in the app_config.xml as follows:
/snip/
    <plan_class>opencl_nvidia_SoG</plan_class>
    <cmdline>-sbs 384 -period_iterations_num 15</cmdline>
    ...
    <plan_class>opencl_nvidia_100</plan_class>
    <cmdline>-unroll 4 -ffa_block 2048 -ffa_block_fetch 1024 -oclFFT_plan 256 16 256 -hp</cmdline>
    ...


(..the second one applies to AstroPulse tasks..)

As noted above, I'll try the driver upgrade first then consider your suggestion regarding no_priority_change. It is currently not set, i.e. =0. Following the age-old admonition to program debuggers: only change one thing at a time!

Also, for other thread browsers... the "stall" condition I am experiencing is NOT just the occasional pause in the progress % that is common in some applications, such as AstroPulse, but continues for many hours for a task that normally completes in 30 minutes or so.
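
The distinction drawn here (a brief pause in progress versus hours without any) can be written down as a simple rule. A minimal Python sketch, assuming progress is sampled periodically as (elapsed_seconds, fraction_done) pairs; the function name and the one-hour threshold are my own illustrative choices (about twice the normal 30-minute run time), not anything built into BOINC.

```python
def is_stalled(samples, min_stall_seconds):
    """Return True if fraction_done has been flat for at least min_stall_seconds.

    samples: list of (elapsed_seconds, fraction_done) tuples, oldest first.
    """
    if not samples:
        return False
    last_time, last_progress = samples[-1]
    # Walk backwards to find how long progress has sat at its latest value.
    flat_since = last_time
    for elapsed, progress in reversed(samples):
        if progress < last_progress:
            break
        flat_since = elapsed
    return last_time - flat_since >= min_stall_seconds


# Invented samples: progress stops at 50% after 20 minutes and stays there.
stuck = [(0, 0.10), (600, 0.40), (1200, 0.50), (4800, 0.50)]
healthy = [(0, 0.10), (600, 0.40), (1200, 0.70)]

print(is_stalled(stuck, 3600), is_stalled(healthy, 3600))
```

A short threshold would flag the normal pauses seen in AstroPulse tasks, so the threshold should be well above the longest pause a healthy task shows.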


ID: 1805125
Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34257
Credit: 79,922,639
RAC: 80
Germany
Message 1805133 - Posted: 27 Jul 2016, 16:28:21 UTC
Last modified: 27 Jul 2016, 16:29:54 UTC

How many CPU tasks are you running?

If you are running more than 2, it's a lack of CPU resources.


With each crime and every kindness we birth our future.
ID: 1805133
Urs Echternacht
Volunteer tester
Joined: 15 May 99
Posts: 692
Credit: 135,197,781
RAC: 211
Germany
Message 1805325 - Posted: 28 Jul 2016, 13:00:11 UTC - in response to Message 1805125.  

<plan_class>opencl_nvidia_SoG</plan_class>
<cmdline>-sbs 384 -period_iterations_num 15</cmdline>
...
Hint: also try adding -oclfft_tune_gr 256

<plan_class>opencl_nvidia_100</plan_class>
<cmdline>-unroll 4 -ffa_block 2048 -ffa_block_fetch 1024 -oclFFT_plan 256 16 256 -hp</cmdline>
...
About the -hp option on Linux, I will quote the AstroPulse GPU app Readme:
-hp : Results in bigger priority for application process (normal priority class and above normal thread priority).
	Can be used to increase GPU load, experimentation required for particular GPU/CPU/GPU driver combo.
  (not available on 64bit linux or MacOSX)

  On Linux and MacOSX :
  Due to OS permission setting rules you have the chance to achieve normal priority by setting <no_priority_change>1</no_priority_change> in <options>
  section of your BOINCs "cc_config.xml" file. Check BOINC manuals/wiki for details where to find and how to set this up.
The same applies to the setiathome app.
So you could leave -hp out; it does nothing on your system.
_\|/_
U r s
ID: 1805325
Gene Project Donor

Joined: 26 Apr 99
Posts: 150
Credit: 48,393,279
RAC: 118
United States
Message 1805511 - Posted: 29 Jul 2016, 4:15:34 UTC

The nvidia driver was updated to 367.35 on July 27, 08:40 local time and a GPU task was found in the "stalled" state a little over 4 hours later at 13:00 local time. So the driver version alone does not explain the stalls, which narrows the possible causes a little.

@Mike
-- I have a 4-core CPU and I've configured for max_concurrent_project = 4. In the normal running state the 2 concurrent GPU tasks get their 99% CPU resources (according to "top" which you will be familiar with as a linux utility) and 2 CPU tasks get full use of the other cores. Even in the "stalled" state the offending task gets its 99% CPU time and continues to accumulate (Linux) CPU time but makes no progress due to 0% GPU utilization. See comment below re: no_priority_change.

@Urs
-- Will remove the -hp cmdline option; it probably got in there long ago when I copied somebody's message board suggestion. Evidently harmless, but it is good practice to remove stuff that has no effect. The -oclfft_tune_gr 256 option is worth a benchmark run. There are so many "oclfft" options that I haven't had the patience to benchmark the many values and combinations. Some old science fiction dilemma comes to mind - "We're lost in 10 dimensional hyperspace!"

<no_priority_change> is now set to 1, with the configuration still at 2 concurrent GPU tasks. I should know within 48 hours whether this changes anything in a positive way. If it does not, then I'll back off to single GPU task mode.
ID: 1805511



 