Windows freeze when CUDA/OpenCL apps get suspended

Zmey Petroff - Joined: 27 Apr 00 - Posts: 10 - Credit: 12,183,784 - RAC: 0
Message 1352663 - Posted: 1 Apr 2013, 5:49:46 UTC - Last modified: 1 Apr 2013, 5:52:18 UTC

Hi,

I noticed that my system freezes spontaneously when I suspend GPU tasks, or when they get suspended automatically (the "Use GPU based on preferences" setting, which suspends GPU tasks when the computer is in use). The freeze looked like a hardware problem: first the Windows UI froze, then the keyboard stopped responding to "LED toggle" keys like NumLock, and finally the mouse pointer froze on screen and the box went completely unresponsive. Yet I got exactly the same symptoms after I upgraded both the box and the OS (Core2 Q6600 -> Core i5, WinXP 32-bit -> Win7 64-bit, GeForce 8600 -> GT 440).

Finally I found the reason: if I force GPU apps to run always, whether the box is in use or not, the freezes disappear completely. After some experimentation with OpenCL programming, I noticed that the freezes are most likely caused by the NVIDIA drivers: if I terminate an OpenCL app that is actively running GPU code, I sometimes get the familiar nasty freezes.

The solution is simple: your app must always shut down GPU computations gracefully (i.e., wait until the kernels are done, read the data back from the GPU, release buffers, contexts, kernels, etc.). For windowed apps this is simple (you get a Windows message about the close/shutdown event and can do cleanup). Console apps seem to get terminated without any notification, but that is not exactly true: you can install a control handler that will receive console events.

I guess the SETI CUDA/OpenCL apps are all console applications on the Windows platform. I checked the SETI CUDA source code for any mention of SetConsoleCtrlHandler() and found none, so here is my proposal: could you please add a handler to the CUDA/OpenCL console apps under Windows, so that they can shut down gracefully? Like this:

    // global flag - can be changed by signal handler.
    volatile bool canRun = true;

    #ifdef _WIN32
    // Avoid closing the console without us handling it properly.
    #include <windows.h>

    // handle all types of console events: Ctrl+C, Ctrl+Break,
    // close event, system shutdown event.
    BOOL WINAPI handlerRoutine( DWORD /* dwCtrlType */ )
    {
        canRun = false;
        return TRUE;
    }

    void installSignalHandler()
    {
        SetConsoleCtrlHandler( handlerRoutine, TRUE );
    }
    #endif

    ...

    // somewhere in init routines
    #ifdef _WIN32
    installSignalHandler();
    #endif

    ...

    // (somewhere in main())
    while( canRun && whateverOtherConditions ) // main loop
    {
        // do GPU computations
        ...
    }

    // do cleanup and exit
    ...

Wiggo - Joined: 24 Jan 00 - Posts: 34744 - Credit: 261,360,520 - RAC: 489
Message 1352670 - Posted: 1 Apr 2013, 6:17:43 UTC - in response to Message 1352663. Last modified: 1 Apr 2013, 6:19:07 UTC

Funny, that, as I've never come across that problem before.

Plus, with your computers hidden we can't see what apps you are running. If you take a look at my PCs and compare that with what you see for yours, you will see that no one can see any private details of yours to hack, but what we can see may be of help to you.

Cheers.

Zmey Petroff - Joined: 27 Apr 00 - Posts: 10 - Credit: 12,183,784 - RAC: 0
Message 1352674 - Posted: 1 Apr 2013, 6:45:14 UTC - in response to Message 1352670.

Ah, yes, hidden computers... I checked this option long ago but cannot find it now in the account settings. Anyway, here are the specs of the box in question:

CPU: GenuineIntel Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz [Family 6 Model 42 Stepping 7]
GPU: NVIDIA GeForce GT 440 (2048MB) driver: 310.70
OS: Microsoft Windows 7 Ultimate x64 Edition, Service Pack 1, (06.01.7601.00)

GPU currently runs either this:
SETI@home Enhanced 6.10 windows_intelx86 (cuda_fermi)
or this:
AstroPulse v6 6.04 windows_intelx86 (opencl_nvidia_100)

Raistmer (Volunteer developer, Volunteer tester) - Joined: 16 Jun 01 - Posts: 6325 - Credit: 106,370,077 - RAC: 121
Message 1352688 - Posted: 1 Apr 2013, 7:42:10 UTC - in response to Message 1352663. Last modified: 1 Apr 2013, 7:44:01 UTC

Both the CUDA and OpenCL science apps run under BOINC's control. It is the boinc.exe process that spawns and terminates them. Both apps are supposed to communicate with BOINC about termination and not exit until GPU processing has finished. A special BOINC API (intended to be portable between OSes) is used for this instead of the low-level, non-portable Windows API.

Could you please check the app's stderr (in BOINC's slot directory) after such a HW freeze and reboot? Does it contain lines about a termination request and something like "device synched" (the wording differs slightly between the CUDA and OpenCL apps)? We need to know this to decide whether the sync was missed on your host for some reason, or whether the current precautions work but are not enough for your config.

We're not gonna fight them. We're gonna transcend them.
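For readers unfamiliar with the pattern described in this post, here is a minimal sketch of a cooperative shutdown loop: the app checks for a termination request between batches of GPU work and drains the device before it stops. All names (`GpuAppHooks`, `runBatches`) are hypothetical stand-ins and are not taken from the SETI@home or BOINC sources. In a real app, `quitRequested` would presumably be backed by the BOINC API and `syncDevice` by something like `cudaDeviceSynchronize()` or `clFinish()`. The log strings only mimic the stderr wording mentioned in this thread.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only. A real app would back
// quitRequested with the BOINC API and syncDevice with a CUDA/OpenCL call.
struct GpuAppHooks {
    std::function<bool()> quitRequested;  // has the client asked us to stop?
    std::function<void()> launchBatch;    // enqueue one batch of GPU work
    std::function<void()> syncDevice;     // block until the GPU is idle
};

// Runs up to maxBatches batches, checking for a termination request before
// each one. Returns the lines the app would write to stderr.
std::vector<std::string> runBatches(const GpuAppHooks& hooks, int maxBatches)
{
    assert(maxBatches >= 0);
    std::vector<std::string> log;
    for (int i = 0; i < maxBatches; ++i) {
        if (hooks.quitRequested()) {
            log.push_back("Termination request detected.");
            hooks.syncDevice();  // never stop with kernels still in flight
            log.push_back("GPU device synched, awaiting termination...");
            return log;
        }
        hooks.launchBatch();
    }
    hooks.syncDevice();  // normal completion also drains the GPU first
    log.push_back("All batches finished.");
    return log;
}
```

The point of the ordering is that the device sync happens before the app reports that it is ready to be terminated, so the client never kills a process that still has GPU work outstanding.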

Wiggo - Joined: 24 Jan 00 - Posts: 34744 - Credit: 261,360,520 - RAC: 489
Message 1352692 - Posted: 1 Apr 2013, 7:53:25 UTC - in response to Message 1352688.

To show your computers, just go to your online SETI@home preferences and change the setting there. ;-)

Cheers.

Zmey Petroff - Joined: 27 Apr 00 - Posts: 10 - Credit: 12,183,784 - RAC: 0
Message 1352699 - Posted: 1 Apr 2013, 8:42:58 UTC - in response to Message 1352688.

> Both apps are supposed to communicate with BOINC about termination and not exit until GPU processing has finished. A special BOINC API (intended to be portable between OSes) is used for this instead of the low-level, non-portable Windows API.
> Could you please check the app's stderr (in BOINC's slot directory) after such a HW freeze and reboot? Does it contain lines about a termination request and something like "device synched" (the wording differs slightly between the CUDA and OpenCL apps)? We need to know this to decide whether the sync was missed on your host for some reason, or whether the current precautions work but are not enough for your config.

Thanks for the info, Raistmer! I suspected that BOINC didn't use signals to communicate with spawned processes, but I never gave it any deeper thought.

The problem with HW freezes is that disk caches are left out of sync. After a reboot I can see that files which had been opened or written to shortly before the freeze have become corrupt: empty, truncated, or containing all zero bytes. Still, I will reproduce the freeze when I get home. I hope the stderr file will not be too corrupted. :)

Zmey Petroff - Joined: 27 Apr 00 - Posts: 10 - Credit: 12,183,784 - RAC: 0
Message 1352783 - Posted: 1 Apr 2013, 11:59:26 UTC - in response to Message 1352688.

> Both the CUDA and OpenCL science apps run under BOINC's control. It is the boinc.exe process that spawns and terminates them. Both apps are supposed to communicate with BOINC about termination and not exit until GPU processing has finished. A special BOINC API (intended to be portable between OSes) is used for this instead of the low-level, non-portable Windows API.

Sorry for the late post, but I have remembered a couple of HW freezes that occurred at system shutdown time. In the long run, portable ways of ending GPU tasks may not be enough for Windows. Look:

- You shut down the system. Windows sends the WM_QUERYENDSESSION message to all windowed applications in unspecified order. I say "unspecified" because this order has already changed a couple of times between versions of Windows.
- Console-based applications without a control handler are terminated as soon as their turn to receive the message comes.
- If BOINC receives WM_QUERYENDSESSION after its tasks have been terminated, it has no way (portable or not) to shut down the GPU tasks gracefully.

So, while I am all for code portability (seriously), I still insist on a small Windows-specific "trick", just to be on the safe side.
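One caveat about the handler approach: as far as Microsoft's console documentation describes it, the handler runs on a separate thread, and for close, logoff and shutdown events the process may be terminated as soon as the handler returns or a system timeout expires. A handler that only sets a flag may therefore leave the main loop no time to clean up. The sketch below, under those assumptions, makes the handler wait (with a bound) until the main loop reports that cleanup is done. The names `g_stopRequested`, `g_cleanupDone`, `requestStopAndWait` and `computeLoop`, and the 4-second bound, are illustrative and not from any SETI@home source.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical names; not taken from the SETI@home or BOINC sources.
std::atomic<bool> g_stopRequested{false};  // set by the handler
std::atomic<bool> g_cleanupDone{false};    // set by the main loop after cleanup

// Platform-independent part: ask the main loop to stop, then wait (bounded)
// until it reports that cleanup has finished. Returns false on timeout.
bool requestStopAndWait(std::chrono::milliseconds timeout)
{
    assert(timeout.count() >= 0);
    g_stopRequested.store(true);
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!g_cleanupDone.load()) {
        if (std::chrono::steady_clock::now() >= deadline)
            return false;  // gave up; cleanup did not finish in time
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    return true;
}

#ifdef _WIN32
// Called by the system on its own thread for console control events.
BOOL WINAPI handlerRoutine(DWORD /* dwCtrlType */)
{
    // Assumption: 4 s stays below the system's timeout for close events.
    requestStopAndWait(std::chrono::milliseconds(4000));
    return TRUE;
}
#endif

// Stand-in for the GPU main loop; returns the number of iterations done.
int computeLoop()
{
    int iterations = 0;
    while (!g_stopRequested.load()) {
        // launch kernels and wait for them here
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
        ++iterations;
    }
    // read results back, release buffers, kernels and contexts here
    g_cleanupDone.store(true);
    return iterations;
}
```

On Windows the handler would be registered once at startup with `SetConsoleCtrlHandler( handlerRoutine, TRUE )`, as in the proposal in the first post. `std::atomic<bool>` is used instead of `volatile bool` because the flag is shared between two threads.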

Raistmer (Volunteer developer, Volunteer tester) - Joined: 16 Jun 01 - Posts: 6325 - Credit: 106,370,077 - RAC: 121
Message 1352879 - Posted: 1 Apr 2013, 14:59:31 UTC - in response to Message 1352783.

I will try and see what can be done in this direction, but I expect some problems from the BOINC API side. At app startup there is a BOINC API call that configures BOINC's diagnostic subsystem, which in turn installs its own signal handlers. In particular, this makes it impossible to catch exceptions with a try/catch(...) block. The structured-exception-related option did not help with this: BOINC's control thread intercepts the exception before my code does and terminates the app. So I'm not sure what the behavior will be if the app tries to install its own signal handlers directly...

Zmey Petroff - Joined: 27 Apr 00 - Posts: 10 - Credit: 12,183,784 - RAC: 0
Message 1356506 - Posted: 13 Apr 2013, 9:21:01 UTC - in response to Message 1352879.

Just a heads up: I could not reproduce the freeze with the "AstroPulse v.6.6.0.4 opencl_nvidia_100" app. It always seems to shut down the computations correctly. The stderr log says:

    Termination request detected. GPU device synched, awaiting termination...

Now, I have several "SETI@home Enhanced 6.10 (cuda_fermi)" apps, which cause the familiar freezing trouble. Here is the complete stderr log of an app that froze my system (I rebooted in safe mode and copied the files out):

    setiathome_CUDA: Found 1 CUDA device(s):
       Device 1 : GeForce GT 440
               totalGlobalMem = -2147483648
               sharedMemPerBlock = 49152
               regsPerBlock = 32768
               warpSize = 32
               memPitch = 2147483647
               maxThreadsPerBlock = 1024
               clockRate = 1620000
               totalConstMem = 65536
               major = 2
               minor = 1
               textureAlignment = 512
               deviceOverlap = 1
               multiProcessorCount = 2
    setiathome_CUDA: CUDA Device 1 specified, checking...
       Device 1: GeForce GT 440 is okay
    SETI@home using CUDA accelerated device GeForce GT 440
    setiathome_enhanced 6.09 Visual Studio/Microsoft C++
    libboinc: 6.3.22
    Work Unit Info:
    ...............
    WU true angle range is : 2.715471
    Optimal function choices:
    -----------------------------------------------------
    name
    -----------------------------------------------------
    v_BaseLineSmooth (no other)
    v_GetPowerSpectrum 0.00012 0.00000
    v_ChirpData 0.01288 0.00000
    v_Transpose4 0.00363 0.00000
    FPU opt folding 0.00131 0.00000

Raistmer (Volunteer developer, Volunteer tester) - Joined: 16 Jun 01 - Posts: 6325 - Credit: 106,370,077 - RAC: 121
Message 1356508 - Posted: 13 Apr 2013, 9:27:46 UTC - Last modified: 13 Apr 2013, 9:28:20 UTC

Well, I added the code you proposed to the OpenCL app sources anyway; I don't think it will cause any slowdown. The already released OpenCL apps do not contain it, though. It will be included in the forthcoming OpenCL MB7 app and in new releases of AP6.

Regarding CUDA app issues, contact Jason G. on these boards.

Darth Beaver - Joined: 20 Aug 99 - Posts: 6728 - Credit: 21,443,075 - RAC: 3
Message 1356523 - Posted: 13 Apr 2013, 10:24:00 UTC

It might help if he updates the NVIDIA drivers to 314, as he says he is using 310 and that could be the problem.

Claggy (Volunteer tester) - Joined: 5 Jul 99 - Posts: 4654 - Credit: 47,537,079 - RAC: 4
Message 1356554 - Posted: 13 Apr 2013, 12:27:19 UTC - in response to Message 1356508.

> Regarding CUDA app issues, contact Jason G. on these boards.

That output is from the stock CUDA app, which doesn't contain the improved exit code that Jason's apps have. The author of that app is also not Jason, but NVIDIA. If Zmey Petroff has problems with Jason's x41zc apps, then he should report them to Jason.

Setiathome applications:
Windows/x86 6.08 (cuda) 21 Jan 2009, 1:43:04 UTC
Windows/x86 6.09 (cuda23) 9 Dec 2009, 17:26:45 UTC
Windows/x86 6.10 (cuda_fermi) 8 Jun 2010, 22:50:03 UTC

The details of the problems introduced with the 270+ drivers are reported here (04 Sep 2011): Recent Driver Cuda-safe Project List

Claggy
