Windows freeze when CUDA/OpenCL apps get suspended

Message boards : Number crunching : Windows freeze when CUDA/OpenCL apps get suspended
Zmey Petroff
Joined: 27 Apr 00
Posts: 10
Credit: 12,183,784
RAC: 0
Russia
Message 1352663 - Posted: 1 Apr 2013, 5:49:46 UTC
Last modified: 1 Apr 2013, 5:52:18 UTC

Hi,

I noticed that my system gets spontaneous freezes when I suspend GPU tasks, or when they get suspended automatically ("Use GPU based on preferences" setting, which suspends GPU tasks when computer is in use).

The freeze looked like a hardware problem (first Windows UI froze, next the keyboard stopped responding to "LED toggle" buttons like NumLock; finally, mouse pointer got frozen on screen and the box went completely unresponsive). Yet, I got exactly the same symptoms after I upgraded the box and the OS (Core2 Q6600 -> Core i5, WinXP 32-bit -> Win7 64-bit, GeForce 8600 -> GT 440).

Finally I found the reason: if I force GPU apps to run always no matter whether the box is used or not, freezes disappear completely.

After some experimentation with OpenCL programming, I noticed that the freezes are most likely caused by NVidia drivers: if I terminate an OpenCL app that is actively running GPU code, I sometimes get familiar nasty freezes.

The solution is simple: your app must always shut down GPU computations gracefully (i.e., wait until the kernels are done, read the data back from the GPU, and release buffers, contexts, kernels, etc.). For windowed apps this is simple: you get a Windows message about the close/shutdown event and can do the cleanup there. Console apps seem to get terminated without any notification, but that's not exactly true: you can install a control handler that will receive console events.

I guess SETI CUDA/OpenCL apps are in fact all console applications on Windows platform.

I checked the SETI CUDA source code for any mention of SetConsoleCtrlHandler() and found none, so here is my proposal: could you please add a handler to the CUDA/OpenCL console apps under Windows, so that they can shut down gracefully? Like this:

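// Global flag, cleared by the console control handler. Windows runs
// the handler on a separate thread, so use an atomic, not a plain bool.
#include <atomic>
std::atomic<bool> canRun( true );

#ifdef _WIN32
// Avoid closing the console without us handling it properly.
#include <windows.h>

// Handle all types of console events: Ctrl+C, Ctrl+Break,
// close event, system shutdown event.
BOOL WINAPI handlerRoutine( DWORD /* dwCtrlType */ )
{
    canRun = false;   // ask the main loop to stop
    return TRUE;      // report the event as handled
}

// Register the handler; call once during initialization.
void installSignalHandler()
{
    SetConsoleCtrlHandler( handlerRoutine, TRUE );
}

#endif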

...

// somewhere in init routines
#ifdef _WIN32
    installSignalHandler();
#endif


...

// (somewhere in main())

while( canRun && whateverOtherConditions ) // main loop
{
   // do GPU computations
   ...
}

// do cleanup and exit
...
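
For reference, here is a minimal self-contained sketch of the same pattern. It is not taken from any SETI app: `GpuState`, `processChunk`, `releaseGpuResources` and `runTask` are hypothetical stand-ins for the real GPU work and cleanup. The Windows-only parts are guarded by `_WIN32`, so the loop logic can also be compiled and checked on other platforms.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

#ifdef _WIN32
#include <windows.h>
#endif

// Cleared by the console handler; polled by the main loop.
std::atomic<bool> canRun{true};

// Hypothetical bookkeeping so the sketch can be checked without a GPU.
struct GpuState {
    std::size_t chunksDone = 0;
    bool released = false;
};

// Hypothetical stand-in: launch one batch of kernels and wait for it.
void processChunk(GpuState& gpu) { ++gpu.chunksDone; }

// Hypothetical stand-in: read results back, release buffers/kernels/context.
void releaseGpuResources(GpuState& gpu)
{
    assert(!gpu.released);  // cleanup must happen exactly once
    gpu.released = true;
}

#ifdef _WIN32
BOOL WINAPI handlerRoutine(DWORD /* dwCtrlType */)
{
    canRun = false;
    return TRUE;
}
#endif

// Runs until all chunks are done or a stop has been requested.
// Cleanup always runs, whichever way the loop ends.
GpuState runTask(std::size_t totalChunks)
{
#ifdef _WIN32
    SetConsoleCtrlHandler(handlerRoutine, TRUE);
#endif
    GpuState gpu;
    while (canRun && gpu.chunksDone < totalChunks) {
        processChunk(gpu);
    }
    releaseGpuResources(gpu);
    return gpu;
}
```

The flag is polled only between chunks, so each batch of kernels completes before cleanup starts. A real app would need to keep the chunks short enough to react in the time Windows allows.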

ID: 1352663

Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1352670 - Posted: 1 Apr 2013, 6:17:43 UTC - in response to Message 1352663.  
Last modified: 1 Apr 2013, 6:19:07 UTC

Funny that as I've never come across that problem before.

Plus, with your computers hidden we can't see what apps you are running. If you look at my PCs and compare that with what you see for yours, you will see that no private details are exposed for anyone to hack, but what we can see may be of help to you.

Cheers.
ID: 1352670

Zmey Petroff
Joined: 27 Apr 00
Posts: 10
Credit: 12,183,784
RAC: 0
Russia
Message 1352674 - Posted: 1 Apr 2013, 6:45:14 UTC - in response to Message 1352670.  

Ah, yes, hidden computers... I checked this option long ago but cannot find it now in the account settings.

Anyway, here are the specs of the box in question:

CPU: GenuineIntel Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz [Family 6 Model 42 Stepping 7]
GPU: NVIDIA GeForce GT 440 (2048MB) driver: 310.70
OS: Microsoft Windows 7 Ultimate x64 Edition, Service Pack 1, (06.01.7601.00)


GPU currently runs either this:
SETI@home Enhanced 6.10 windows_intelx86 (cuda_fermi)
or this:
AstroPulse v6 6.04 windows_intelx86 (opencl_nvidia_100)


ID: 1352674

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1352688 - Posted: 1 Apr 2013, 7:42:10 UTC - in response to Message 1352663.  
Last modified: 1 Apr 2013, 7:44:01 UTC

Both the CUDA and OpenCL science apps run under BOINC's control.
It is the boinc.exe process that spawns them and terminates them.
Both apps are supposed to communicate with BOINC about termination and not exit until GPU processing has finished. A special BOINC API (meant to be portable between OSes) is used for this instead of the low-level, non-portable Windows API.
Could you please check the app's stderr (in BOINC's slot directory) after such a HW freeze and reboot?
Does it contain lines about a termination request and something like "device synched" (the wording differs slightly between the CUDA and OpenCL apps)?
We need to know this to decide whether the sync was missed on your host for some reason, or whether the current precautions work but are not enough for your config.
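
To make the request concrete, here is a rough sketch of the polling pattern described above. It deliberately does not call the real BOINC API: `StatusSnapshot`, `TaskResult`, `runWithPolling` and its callbacks are hypothetical stand-ins (the actual API exposes comparable flags, I believe through `boinc_get_status()`), and the log text only imitates what the apps print.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for the flags an app reads from the BOINC client.
struct StatusSnapshot {
    bool quitRequest = false;
};

struct TaskResult {
    int batchesRun = 0;
    std::vector<std::string> log;  // stands in for stderr
};

// Runs up to maxBatches GPU batches, polling the client status before each.
// On a quit request it syncs the device first and only then stops.
TaskResult runWithPolling(int maxBatches,
                          const std::function<StatusSnapshot(int)>& pollStatus,
                          const std::function<void()>& syncDevice)
{
    assert(maxBatches >= 0);
    TaskResult result;
    for (int i = 0; i < maxBatches; ++i) {
        if (pollStatus(i).quitRequest) {
            syncDevice();  // wait for outstanding kernels to finish
            result.log.push_back(
                "Termination request detected. GPU device synched.");
            return result;
        }
        // One batch of kernels would be launched here (omitted).
        ++result.batchesRun;
    }
    syncDevice();  // normal completion: still sync before exiting
    return result;
}
```

If the log line is missing after a freeze, the sync step was presumably never reached. If it is present, the app did its part and the problem lies elsewhere.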
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1352688

Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1352692 - Posted: 1 Apr 2013, 7:53:25 UTC - in response to Message 1352688.  

To show your computer/s just go to your online SETI@home preferences and change the setting there. ;-)

Cheers.
ID: 1352692

Zmey Petroff
Joined: 27 Apr 00
Posts: 10
Credit: 12,183,784
RAC: 0
Russia
Message 1352699 - Posted: 1 Apr 2013, 8:42:58 UTC - in response to Message 1352688.  

Both apps are supposed to communicate with BOINC about termination and not exit until GPU processing has finished. A special BOINC API (meant to be portable between OSes) is used for this instead of the low-level, non-portable Windows API.
Could you please check the app's stderr (in BOINC's slot directory) after such a HW freeze and reboot?
Does it contain lines about a termination request and something like "device synched" (the wording differs slightly between the CUDA and OpenCL apps)?
We need to know this to decide whether the sync was missed on your host for some reason, or whether the current precautions work but are not enough for your config.

Thanks for the info, Raistmer! I suspected that BOINC didn't use signals to communicate with spawned processes, but I never gave it any deeper thought.

The problem with HW freezes is that disk caches are left out of sync. After a reboot I can see that files which had been opened or written to shortly before the freeze have become corrupt: empty, truncated, or containing all zero bytes. Still, I will try to reproduce the freeze when I get home. I hope the stderr file will not get too corrupt. :)
ID: 1352699

Zmey Petroff
Joined: 27 Apr 00
Posts: 10
Credit: 12,183,784
RAC: 0
Russia
Message 1352783 - Posted: 1 Apr 2013, 11:59:26 UTC - in response to Message 1352688.  

Both the CUDA and OpenCL science apps run under BOINC's control.
It is the boinc.exe process that spawns them and terminates them.
Both apps are supposed to communicate with BOINC about termination and not exit until GPU processing has finished. A special BOINC API (meant to be portable between OSes) is used for this instead of the low-level, non-portable Windows API.

Sorry for the late post, but I have remembered a couple of HW freezes that occurred at system shutdown time.

In the long run, portable ways of ending GPU tasks may not be enough for Windows. Look:
- You shut down the system. Windows sends the WM_QUERYENDSESSION message to all windowed applications in an unspecified order. I say "unspecified" because this order has already changed a couple of times between versions of Windows.
- Console-based applications without a control handler are terminated as soon as their turn to receive the message comes.
- If BOINC receives WM_QUERYENDSESSION after its tasks have already been terminated, it has no way (portable or not) to shut down the GPU tasks gracefully.

So, while I am all for code portability (seriously), I still insist on a small Windows-specific "trick", just to be on the safe side.
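
One caveat, sketched below under my own assumptions: as far as I know, for close, logoff and shutdown events Windows terminates the process once the handler returns (or after a system timeout), so a handler that only sets a flag may not leave the main thread enough time to clean up. In this sketch the handler blocks until the main thread reports that cleanup is finished. The names `cleanupDone`, `signalCleanupDone`, `waitForCleanup` and the 4-second limit are hypothetical and come from no existing app.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

#ifdef _WIN32
#include <windows.h>
#endif

std::atomic<bool> canRun{true};        // polled by the main loop
std::atomic<bool> cleanupDone{false};  // set by the main thread after cleanup

// Called by the main thread once GPU buffers and contexts are released.
void signalCleanupDone() { cleanupDone = true; }

// Polls until cleanup is reported or the timeout expires.
// Returns true if cleanup finished in time.
bool waitForCleanup(std::chrono::milliseconds timeout)
{
    assert(timeout.count() >= 0);
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!cleanupDone) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    return true;
}

#ifdef _WIN32
BOOL WINAPI handlerRoutine(DWORD ctrlType)
{
    canRun = false;  // ask the main loop to stop after the current kernel
    switch (ctrlType) {
    case CTRL_CLOSE_EVENT:
    case CTRL_LOGOFF_EVENT:
    case CTRL_SHUTDOWN_EVENT:
        // Hold the handler thread so the main thread can finish cleanup.
        // The 4-second limit is an assumed value, not a documented one.
        waitForCleanup(std::chrono::milliseconds(4000));
        break;
    default:
        break;  // Ctrl+C / Ctrl+Break: the process keeps running anyway
    }
    return TRUE;
}
#endif
```

Polling with a short sleep keeps the sketch portable. On Windows, an event object or a condition variable would do the same job without polling.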
ID: 1352783

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1352879 - Posted: 1 Apr 2013, 14:59:31 UTC - in response to Message 1352783.  

I will try and see what can be done in this direction, but expect some problems from the BOINC API side.
At app startup there is a BOINC API call that configures BOINC's diagnostic subsystem, which in turn installs its own signal handlers. In particular, this makes it impossible to catch exceptions with a try/catch(...) block.
The structured-exception-related option did not help with this: BOINC's control thread intercepts the exception before my code does and terminates the app.
So I'm not sure how the app will behave if it tries to install its own signal handlers directly...
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1352879

Zmey Petroff
Joined: 27 Apr 00
Posts: 10
Credit: 12,183,784
RAC: 0
Russia
Message 1356506 - Posted: 13 Apr 2013, 9:21:01 UTC - in response to Message 1352879.  

Just a heads-up: I could not reproduce the freeze with the "AstroPulse v.6.6.0.4 opencl_nvidia_100" app. It always seems to shut down the computations correctly. The stderr log says:
Termination request detected. GPU device synched, awaiting termination...

However, I have several "SETI@home Enhanced 6.10 (cuda_fermi)" tasks, and these cause the familiar freezes.

Here is the complete stderr log of an app that froze my system (I rebooted in safe mode and copied the files out):

setiathome_CUDA: Found 1 CUDA device(s):
   Device 1 : GeForce GT 440 
           totalGlobalMem = -2147483648 
           sharedMemPerBlock = 49152 
           regsPerBlock = 32768 
           warpSize = 32 
           memPitch = 2147483647 
           maxThreadsPerBlock = 1024 
           clockRate = 1620000 
           totalConstMem = 65536 
           major = 2 
           minor = 1 
           textureAlignment = 512 
           deviceOverlap = 1 
           multiProcessorCount = 2 
setiathome_CUDA: CUDA Device 1 specified, checking...
   Device 1: GeForce GT 440 is okay
SETI@home using CUDA accelerated device GeForce GT 440
setiathome_enhanced 6.09 Visual Studio/Microsoft C++
libboinc: 6.3.22

Work Unit Info:
...............
WU true angle range is :  2.715471
Optimal function choices:
-----------------------------------------------------
name                
-----------------------------------------------------
              v_BaseLineSmooth (no other)
            v_GetPowerSpectrum 0.00012 0.00000 
                   v_ChirpData 0.01288 0.00000 
                  v_Transpose4 0.00363 0.00000 
               FPU opt folding 0.00131 0.00000 



ID: 1356506

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1356508 - Posted: 13 Apr 2013, 9:27:46 UTC
Last modified: 13 Apr 2013, 9:28:20 UTC

Well, I added the code you proposed to the OpenCL app sources anyway; I don't think it will cause any slowdown.
But the already released OpenCL apps don't contain it. It will be included in the forthcoming OpenCL MB7 app and in new releases of AP6.
Regarding CUDA app issues, please contact Jason G. on these boards.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1356508

Darth Beaver
Joined: 20 Aug 99
Posts: 6728
Credit: 21,443,075
RAC: 3
Australia
Message 1356523 - Posted: 13 Apr 2013, 10:24:00 UTC

It might help if he updates the NVIDIA drivers to 314; he says he is using 310, and that could be the problem.
ID: 1356523

Claggy
Volunteer tester
Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1356554 - Posted: 13 Apr 2013, 12:27:19 UTC - in response to Message 1356508.  

Regarding CUDA app issues, please contact Jason G. on these boards.

That output is from the stock CUDA app, which doesn't contain the improved exit code that Jason's apps have. The author of that app is also not Jason, but Nvidia.
If Zmey Petroff has problems with Jason's x41zc apps, then he should report them to Jason.

Setiathome applications

Windows/x86 6.08 (cuda) 21 Jan 2009, 1:43:04 UTC

Windows/x86 6.09 (cuda23) 9 Dec 2009, 17:26:45 UTC

Windows/x86 6.10 (cuda_fermi) 8 Jun 2010, 22:50:03 UTC


The details of the problems introduced with the 270+ drivers are reported here (04 Sep 2011):

Recent Driver Cuda-safe Project List

Claggy
ID: 1356554

 