Windows freeze when CUDA/OpenCL apps get suspended


Message boards : Number crunching : Windows freeze when CUDA/OpenCL apps get suspended

Zmey Petroff
Joined: 27 Apr 00
Posts: 10
Credit: 12,096,591
RAC: 1,050
Russia
Message 1352663 - Posted: 1 Apr 2013, 5:49:46 UTC
Last modified: 1 Apr 2013, 5:52:18 UTC

Hi,

I noticed that my system gets spontaneous freezes when I suspend GPU tasks, or when they get suspended automatically ("Use GPU based on preferences" setting, which suspends GPU tasks when computer is in use).

The freeze looked like a hardware problem (first Windows UI froze, next the keyboard stopped responding to "LED toggle" buttons like NumLock; finally, mouse pointer got frozen on screen and the box went completely unresponsive). Yet, I got exactly the same symptoms after I upgraded the box and the OS (Core2 Q6600 -> Core i5, WinXP 32-bit -> Win7 64-bit, GeForce 8600 -> GT 440).

Finally I found the reason: if I force GPU apps to run always no matter whether the box is used or not, freezes disappear completely.

After some experimentation with OpenCL programming, I noticed that the freezes are most likely caused by the NVIDIA drivers: if I terminate an OpenCL app that is actively running GPU code, I sometimes get the same nasty freezes.

The solution is simple: your app must always shut down GPU computations gracefully (i.e., wait until the kernels are done, read the data back from the GPU, release buffers, contexts, kernels, etc.). For windowed apps this is simple (you get a Windows message about the close/shutdown event and can do the cleanup). Console apps seem to get terminated without any notification, but that's not exactly true: you can install a control handler that will receive console events.

I guess SETI CUDA/OpenCL apps are in fact all console applications on Windows platform.

I checked the SETI CUDA source code for any mention of SetConsoleCtrlHandler() and found none, so here is my proposal: could you please add a handler to the CUDA/OpenCL console apps under Windows, so that they can shut down gracefully? Like this:


// Global flag; cleared by the console control handler.
volatile bool canRun = true;

#ifdef _WIN32
// Avoid closing the console without us handling it properly.
#include <windows.h>

// Handle all types of console events: Ctrl+C, Ctrl+Break,
// close event, system shutdown event.
BOOL WINAPI handlerRoutine( DWORD /* dwCtrlType */ )
{
    canRun = false;   // ask the main loop to stop
    return TRUE;      // report the event as handled
}

// Register the handler; call once during startup.
void installSignalHandler()
{
    SetConsoleCtrlHandler( handlerRoutine, TRUE );
}

#endif

...

// somewhere in init routines
#ifdef _WIN32
installSignalHandler();
#endif


...

// (somewhere in main())

while( canRun && whateverOtherConditions ) // main loop
{
    // do GPU computations
    ...
}

// do cleanup and exit
...
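
One caveat with the snippet above, as far as I understand the Windows documentation: the handler runs on its own thread, and for close/logoff/shutdown events the process may be terminated as soon as the handler returns (or after a short system timeout). So the handler probably has to wait until the main loop has actually finished its cleanup. A minimal sketch of that idea, assuming C++11; ShutdownGate, requestStopAndWait and cleanupDone are illustrative names of mine, and the 4-second wait is a guess, not a documented limit:

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Shared between the main loop and the console control handler thread.
struct ShutdownGate {
    std::atomic<bool> canRun{true};   // polled by the main loop
    std::mutex m;
    std::condition_variable cv;
    bool cleanedUp = false;           // guarded by m

    // Called from the handler: ask the main loop to stop, then wait
    // (bounded) until it reports that GPU cleanup has finished.
    // Returns true if cleanup finished within the timeout.
    bool requestStopAndWait(std::chrono::milliseconds timeout) {
        canRun.store(false);
        std::unique_lock<std::mutex> lock(m);
        return cv.wait_for(lock, timeout, [this] { return cleanedUp; });
    }

    // Called from the main thread once kernels are done and
    // buffers/contexts have been released.
    void cleanupDone() {
        {
            std::lock_guard<std::mutex> lock(m);
            cleanedUp = true;
        }
        cv.notify_all();
    }
};

ShutdownGate gate;

#ifdef _WIN32
#include <windows.h>

// Blocks until the main loop has cleaned up (or the assumed timeout expires).
BOOL WINAPI handlerRoutine(DWORD /* dwCtrlType */)
{
    gate.requestStopAndWait(std::chrono::milliseconds(4000));
    return TRUE;
}
#endif
```

The main loop would then test gate.canRun.load() instead of the plain flag and call gate.cleanupDone() as the last step of its cleanup.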

____________

Wiggo
Joined: 24 Jan 00
Posts: 6445
Credit: 90,098,211
RAC: 73,804
Australia
Message 1352670 - Posted: 1 Apr 2013, 6:17:43 UTC - in response to Message 1352663.
Last modified: 1 Apr 2013, 6:19:07 UTC

Funny that, as I've never come across that problem before.

Plus, with your computers hidden we can't see what apps you are running. If you take a look at my PCs and compare that to what you see with yours, you will see that no one can see any private details of yours to hack, but what we can see may be of help to you.

Cheers.

Zmey Petroff
Joined: 27 Apr 00
Posts: 10
Credit: 12,096,591
RAC: 1,050
Russia
Message 1352674 - Posted: 1 Apr 2013, 6:45:14 UTC - in response to Message 1352670.

Ah, yes, hidden computers... I checked this option long ago but cannot find it now in the account settings.

Anyway, here are the specs of the box in question:

CPU: GenuineIntel Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz [Family 6 Model 42 Stepping 7]
GPU: NVIDIA GeForce GT 440 (2048MB) driver: 310.70
OS: Microsoft Windows 7 Ultimate x64 Edition, Service Pack 1, (06.01.7601.00)


GPU currently runs either this:
SETI@home Enhanced 6.10 windows_intelx86 (cuda_fermi)
or this:
AstroPulse v6 6.04 windows_intelx86 (opencl_nvidia_100)


____________

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3368
Credit: 46,048,045
RAC: 33,607
Russia
Message 1352688 - Posted: 1 Apr 2013, 7:42:10 UTC - in response to Message 1352663.
Last modified: 1 Apr 2013, 7:44:01 UTC

Both CUDA and OpenCL scientific apps run under BOINC's control.
It's the boinc.exe process that spawns them and terminates them.
Both apps are supposed to communicate with BOINC about termination and not terminate until GPU processing is finished. A special BOINC API (assumed to be portable between OSes) is used for this instead of the low-level (non-portable) Windows API.
Could you please check the app's stderr (in BOINC's slot directory) after such a HW freeze and reboot?
Does it contain lines about a termination request and something like "device synched" (the wording differs slightly between the CUDA and OpenCL apps)?
We need to know this to decide whether the synching was missed on your host for some reason, or the current precautions work but are not enough for your config.
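
To illustrate the idea (a rough sketch only, not the actual BOINC API or app code: HostStatus, runChunks and the three callbacks are made-up names standing in for the real calls), the app checks for a stop request between chunks of GPU work and syncs the device before it gives up control:

```cpp
#include <functional>

// Hypothetical stand-in for the status the BOINC client reports to the app.
struct HostStatus {
    bool stopRequested = false;
};

enum class LoopResult { Finished, StoppedOnRequest };

// Runs the work chunk by chunk. Before each chunk the host status is polled;
// on a stop request the device is synchronized first, so no kernel is still
// in flight when the process is finally terminated.
LoopResult runChunks(int totalChunks,
                     const std::function<HostStatus()>& pollStatus,
                     const std::function<void(int)>& runChunk,
                     const std::function<void()>& syncDevice)
{
    for (int i = 0; i < totalChunks; ++i) {
        if (pollStatus().stopRequested) {
            syncDevice();                       // wait for in-flight GPU work
            return LoopResult::StoppedOnRequest;
        }
        runChunk(i);
    }
    syncDevice();                               // normal completion
    return LoopResult::Finished;
}
```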
____________

Wiggo
Joined: 24 Jan 00
Posts: 6445
Credit: 90,098,211
RAC: 73,804
Australia
Message 1352692 - Posted: 1 Apr 2013, 7:53:25 UTC - in response to Message 1352688.

To show your computer/s just go to your online SETI@home preferences and change the setting there. ;-)

Cheers.

Zmey Petroff
Joined: 27 Apr 00
Posts: 10
Credit: 12,096,591
RAC: 1,050
Russia
Message 1352699 - Posted: 1 Apr 2013, 8:42:58 UTC - in response to Message 1352688.

Both apps are supposed to communicate with BOINC about termination and not terminate until GPU processing is finished. A special BOINC API (assumed to be portable between OSes) is used for this instead of the low-level (non-portable) Windows API.
Could you please check the app's stderr (in BOINC's slot directory) after such a HW freeze and reboot?
Does it contain lines about a termination request and something like "device synched" (the wording differs slightly between the CUDA and OpenCL apps)?
We need to know this to decide whether the synching was missed on your host for some reason, or the current precautions work but are not enough for your config.

Thanks for the info, Raistmer! I suspected that BOINC didn't use signals to communicate with spawned processes, but I never gave it any deeper thought.

The problem with HW freezes is that disk caches are left out of sync. After reboot I can see that files which had been opened or written to shortly before the freeze have become corrupt: empty, truncated, or containing all zero bytes. Yet, I will reproduce the freeze when I come home. I hope stderr file will not get too corrupt. :)
____________

Zmey Petroff
Joined: 27 Apr 00
Posts: 10
Credit: 12,096,591
RAC: 1,050
Russia
Message 1352783 - Posted: 1 Apr 2013, 11:59:26 UTC - in response to Message 1352688.

Both CUDA and OpenCL scientific apps run under BOINC's control.
It's the boinc.exe process that spawns them and terminates them.
Both apps are supposed to communicate with BOINC about termination and not terminate until GPU processing is finished. A special BOINC API (assumed to be portable between OSes) is used for this instead of the low-level (non-portable) Windows API.

Sorry for late posting, but I have remembered a couple of HW freezes that occurred at system shutdown time.

In the long run, portable ways of ending GPU tasks may not be enough for Windows. Look:
- You shut down the system. Windows sends the WM_QUERYENDSESSION message to all windowed applications in an unspecified order. I say "unspecified" because this order has already changed a couple of times between versions of Windows.
- Console-based applications without a control handler are terminated as soon as their turn to receive the message comes.
- If BOINC receives WM_QUERYENDSESSION after its tasks have already been terminated, it has no way (portable or not) to shut down the GPU tasks gracefully.

So, while I am all for code portability (seriously), I still insist on a small Windows-specific "trick", just to be on the safe side.
____________

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3368
Credit: 46,048,045
RAC: 33,607
Russia
Message 1352879 - Posted: 1 Apr 2013, 14:59:31 UTC - in response to Message 1352783.

I will try and see what can be done in this direction, but expect some problems from the BOINC API side.
At app startup there is a BOINC API call that configures BOINC's diagnostic subsystem, which in turn installs its own signal handlers. In particular, this makes it impossible to catch exceptions via a try/catch(...) block.
The structured-exception-related option did not help with this: BOINC's control thread intercepts the exception before my code does and terminates the app.
So I'm not sure what the behavior will be if the app tries to install its own signal handlers directly...
____________

Zmey Petroff
Joined: 27 Apr 00
Posts: 10
Credit: 12,096,591
RAC: 1,050
Russia
Message 1356506 - Posted: 13 Apr 2013, 9:21:01 UTC - in response to Message 1352879.

Just a heads up: I could not reproduce the freeze with the "AstroPulse v.6.6.0.4 opencl_nvidia_100" app. It always seems to shut down the computations correctly. The stderr log says:
Termination request detected. GPU device synched, awaiting termination...

Now, I have several "SETI@home Enhanced 6.10 (cuda_fermi)" tasks, and these do cause the familiar freezing trouble.

Here is the complete stderr log of an app that froze my system (I rebooted in safe mode and copied the files out):

setiathome_CUDA: Found 1 CUDA device(s):
Device 1 : GeForce GT 440
totalGlobalMem = -2147483648
sharedMemPerBlock = 49152
regsPerBlock = 32768
warpSize = 32
memPitch = 2147483647
maxThreadsPerBlock = 1024
clockRate = 1620000
totalConstMem = 65536
major = 2
minor = 1
textureAlignment = 512
deviceOverlap = 1
multiProcessorCount = 2
setiathome_CUDA: CUDA Device 1 specified, checking...
Device 1: GeForce GT 440 is okay
SETI@home using CUDA accelerated device GeForce GT 440
setiathome_enhanced 6.09 Visual Studio/Microsoft C++
libboinc: 6.3.22

Work Unit Info:
...............
WU true angle range is : 2.715471
Optimal function choices:
-----------------------------------------------------
name
-----------------------------------------------------
v_BaseLineSmooth (no other)
v_GetPowerSpectrum 0.00012 0.00000
v_ChirpData 0.01288 0.00000
v_Transpose4 0.00363 0.00000
FPU opt folding 0.00131 0.00000



____________

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3368
Credit: 46,048,045
RAC: 33,607
Russia
Message 1356508 - Posted: 13 Apr 2013, 9:27:46 UTC
Last modified: 13 Apr 2013, 9:28:20 UTC

Well, I added the code you proposed to the OpenCL app sources anyway; I don't think it will cause any slowdown.
But the already released OpenCL apps don't contain it. It will be included in the forthcoming OpenCL MB7 app and in new releases of AP6.
Regarding CUDA app issues, contact Jason G. on these boards.
____________

Glenn savill
Joined: 20 Aug 99
Posts: 1498
Credit: 1,147,771
RAC: 14,122
Australia
Message 1356523 - Posted: 13 Apr 2013, 10:24:00 UTC

It might help if he updates the NVIDIA drivers to 314, as he says he is using 310 and that could be the problem.
____________

Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4039
Credit: 32,691,382
RAC: 712
United Kingdom
Message 1356554 - Posted: 13 Apr 2013, 12:27:19 UTC - in response to Message 1356508.

Regarding CUDA app issues, contact Jason G. on these boards.

That output is from the stock CUDA app, which doesn't contain the improved exit code that Jason's apps have. The author of that app is also not Jason, but NVIDIA.
If Zmey Petroff has problems with Jason's x41zc apps, then he should report them to Jason.

Setiathome applications

Windows/x86 6.08 (cuda) 21 Jan 2009, 1:43:04 UTC

Windows/x86 6.09 (cuda23) 9 Dec 2009, 17:26:45 UTC

Windows/x86 6.10 (cuda_fermi) 8 Jun 2010, 22:50:03 UTC


The details of the problems introduced with the 270+ drivers are reported here (04 Sep 2011):

Recent Driver Cuda-safe Project List

Claggy


Copyright © 2014 University of California