NV AP crash after computation finish fix attempt

Raistmer
Volunteer tester
Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 51661 - Posted: 20 Jul 2014, 23:11:30 UTC

If you are seeing a computation error when a task completes, with something like this in the result's stderr:

class T_GPU_buffer_read_backs: total=6, N=6, <>=1, min=1 max=1
TWIN_FFA OCL_ZERO_COPY USE_OPENCL USE_OPENCL_NV OPENCL_WRITE USE_INCREASED_PRECISION SMALL_CHIRP_TABLE COMBINED_DECHIRP_KERNEL BLANKIT
rev 2488


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00403DF2 read attempt to address 0x005B903C

Engaging BOINC Windows Runtime Debugger...



********************

try running this build instead:
https://www.dropbox.com/s/ca1ppbs7h7q72e3/AP7_win_x86_SSE2_OpenCL_NV_r2559_try_catch_sync_block.7z
News about SETI opt app releases: https://twitter.com/Raistmer
Raistmer
Volunteer tester
Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 51773 - Posted: 31 Jul 2014, 15:38:54 UTC - in response to Message 51661.  
Last modified: 31 Jul 2014, 15:39:09 UTC

Interestingly, those errors have suddenly disappeared:
http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=63590&offset=0&show_names=0&state=6&appid=29

The last one is from 22 July.

Has anyone else experienced this error? Has anyone tried the proposed workaround build yet?
Old man
Volunteer tester
Joined: 22 Sep 09
Posts: 19
Credit: 906,325
RAC: 0
Mongolia
Message 51783 - Posted: 31 Jul 2014, 21:08:59 UTC - in response to Message 51773.  

Interestingly, those errors have suddenly disappeared:
http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=63590&offset=0&show_names=0&state=6&appid=29

The last one is from 22 July.

Has anyone else experienced this error? Has anyone tried the proposed workaround build yet?


Hey. I have installed the
https://www.dropbox.com/s/ca1ppbs7h7q72e3/AP7_win_x86_SSE2_OpenCL_NV_r2559_try_catch_sync_block.7z file. Most of my tasks are running fine now, but today I found one errored task:
http://setiweb.ssl.berkeley.edu/beta/result.php?resultid=17420137

I'm not sure whether the file is in the right place. Where is it supposed to be?
Raistmer
Volunteer tester
Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 51790 - Posted: 1 Aug 2014, 9:37:38 UTC - in response to Message 51783.  


I'm not sure whether the file is in the right place. Where is it supposed to be?


To use this build, you need to go through the anonymous platform mechanism provided by BOINC (that is, create an app_info.xml file using the *.aistub config example provided in the package). Search the SETI main forums for many examples of how to configure the anonymous platform.
Currently your host is still running the binary provided by the SETI server.
Raistmer
Volunteer tester
Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 52096 - Posted: 28 Aug 2014, 15:40:42 UTC
Last modified: 28 Aug 2014, 15:43:28 UTC

This issue is present in 7.03 too:

http://setiweb.ssl.berkeley.edu/beta/result.php?resultid=17553887

rev 2662


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x004041B5 read attempt to address 0x00A9B5DC

Engaging BOINC Windows Runtime Debugger...



********************


And this is how a good result from the same host looks:

rev 2662
GPU device sync requested... ...GPU device synched
18:56:41 (2340): called boinc_finish(0)


So either it crashed even before the final GPU sync, or the system ignored the stderr flush request...
jason_gee
Volunteer tester
Joined: 11 Dec 08
Posts: 198
Credit: 658,573
RAC: 0
Australia
Message 52101 - Posted: 28 Aug 2014, 17:15:43 UTC - in response to Message 52096.  
Last modified: 28 Aug 2014, 17:19:06 UTC

...So either it crashed even before the final GPU sync, or the system ignored the stderr flush request...


Yes, the mechanism is:
- requested shutdown
- fflush(NULL) etc...
-- the above may flush nothing to disk: it actually moves app-local buffers to OS buffers (via MS C runtime helper threads), with the outcome dependent on OS version and disk cache setting (write-through versus write-back caching), or it goes to disk if commit mode was engaged in the MS C runtime.
-- the above has arbitrary timing not under app control (e.g. Win7 default Windows write buffer delayed flushing --> 20+ seconds quite often when unloading programs, Intel RAID chipset driver write-back setting, or other possibilities, including system contention delays, outdated chipset drivers & others)
- OS &/or C runtime buffers still contain some of the stderr (and possibly occasionally some results/state too, unknown)
- BOINC times out and uses TerminateProcess(), which kills the application, its DLL globals including helper threads in the C runtime, and some drivers.
-- this truncates file(s), and if the sync didn't occur it can crash drivers too.
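
The two buffering layers in the list above (C runtime buffer, then OS buffer) can be made explicit in code. The following is only a minimal sketch, not the code used in the AP builds: flush_to_disk is a hypothetical helper name, and it assumes the stream has been redirected to a regular file (committing a console or pipe may simply fail). It uses fflush() for the first layer and _commit() on Windows or fsync() on POSIX for the second.

```cpp
#include <cstdio>
#ifdef _WIN32
#include <io.h>      // _commit
#else
#include <unistd.h>  // fsync
#endif

// Hypothetical helper (not part of BOINC or the AstroPulse sources):
// pushes a stream's data through both buffering layers.
// Returns 0 on success, -1 on failure.
int flush_to_disk(std::FILE* stream) {
    // Layer 1: C runtime buffer -> operating system buffer.
    if (std::fflush(stream) != 0) return -1;
#ifdef _WIN32
    // Layer 2 (Windows): ask the OS to write its buffers for this file to disk.
    return _commit(_fileno(stream));
#else
    // Layer 2 (POSIX): the equivalent request.
    return fsync(fileno(stream));
#endif
}
```

Even when both calls succeed, the timing is still decided by the OS and the drivers, so a helper like this narrows the window described above rather than closing it.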

Reference:
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686714(v=vs.85).aspx
The TerminateProcess function is used to unconditionally cause a process to exit. The state of global data maintained by dynamic-link libraries (DLLs) may be compromised if TerminateProcess is used rather than ExitProcess.
This function stops execution of all threads within the process and requests cancellation of all pending I/O. The terminated process cannot exit until all pending I/O has been completed or canceled.


and
http://msdn.microsoft.com/en-us/library/9yky46tz.aspx
Buffers are normally maintained by the operating system, which determines the optimal time to write the data automatically to disk: when a buffer is full, when a stream is closed, or when a program terminates normally without closing the stream. The commit-to-disk feature of the run-time library lets you ensure that critical data is written directly to disk rather than to the operating-system buffers. Without rewriting an existing program, you can enable this feature by linking the program's object files with COMMODE.OBJ. In the resulting executable file, calls to _flushall write the contents of all buffers to disk. Only _flushall and fflush are affected by COMMODE.OBJ.

Chaos: When the present determines the future, but the approximate present does not approximately determine the future.
Edward Lorenz
Raistmer
Volunteer tester
Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 52102 - Posted: 28 Aug 2014, 17:30:26 UTC - in response to Message 52101.  

1) commode.obj is linked in.
2) Should BOINC delay termination while the app is in a critical section?
jason_gee
Volunteer tester
Joined: 11 Dec 08
Posts: 198
Credit: 658,573
RAC: 0
Australia
Message 52103 - Posted: 28 Aug 2014, 17:46:31 UTC - in response to Message 52102.  
Last modified: 28 Aug 2014, 17:53:31 UTC

1) commode.obj is linked in.
2) Should BOINC delay termination while the app is in a critical section?


1) OK. How much data might potentially be flushed? Still 8+ MiB state/fold buffers? The BOINC client's timeouts may not be enough.

2) The BOINC client doesn't check whether the app is in a critical section; it just has a timeout and then kills the app. Conceivably your app may not even get any time slice (inside or outside of a critical section) and still be killed :-O

So --> the critical section method alone does not work, because it does not account for multithreading properly, and the client as implemented thinks it's the OS. A request->acknowledge protocol/contract scheme needs to be employed there, as we already do with GPUs locally in the app.

i.e. a 'Contract' with two sides like real boss+worker
- client requests an exit, and promises to give enough time for workers to get out
- worker promises to clean up quickly & get out

instead of:
- client requests an exit, then blocks the door with a bulldozer
- worker tries to get out, has 10 seconds including cleanup.
- client burns down the factory with the workers inside.

[Edit:] The BOINC client shouldn't be using aggressive 'old-style' functions/approaches like TerminateProcess(). It should monitor, complain, delay starting new tasks, wait patiently, complain some more, then report exactly what & why it burns the factory down if it has to.
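
The request->acknowledge contract described above might look roughly like the sketch below. Everything in it is hypothetical (ExitContract and worker_left_in_time are illustration names, not BOINC API), and a thread stands in for the science app so that the example is self-contained. The point is the shape of the protocol: the supervisor asks, the worker cleans up and acknowledges, and a timeout only produces a report, never a kill.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical two-sided exit contract; not part of the BOINC client or API.
class ExitContract {
public:
    // Supervisor side: ask the worker to leave.
    void request_exit() {
        std::lock_guard<std::mutex> lock(m_);
        requested_ = true;
    }
    // Worker side: polled at safe points between units of work.
    bool exit_requested() {
        std::lock_guard<std::mutex> lock(m_);
        return requested_;
    }
    // Worker side: cleanup (flush, GPU sync) is done, safe to end the process.
    void acknowledge() {
        std::lock_guard<std::mutex> lock(m_);
        acknowledged_ = true;
        cv_.notify_all();
    }
    // Supervisor side: wait up to `patience`; true if the acknowledgement arrived.
    bool wait_for_ack(std::chrono::milliseconds patience) {
        std::unique_lock<std::mutex> lock(m_);
        return cv_.wait_for(lock, patience, [this] { return acknowledged_; });
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    bool requested_ = false;
    bool acknowledged_ = false;
};

// Demo: returns whether the worker acknowledged within the supervisor's patience.
bool worker_left_in_time(std::chrono::milliseconds cleanup,
                         std::chrono::milliseconds patience) {
    ExitContract contract;
    std::thread worker([&contract, cleanup] {
        while (!contract.exit_requested())
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        std::this_thread::sleep_for(cleanup);  // stands in for flushing and GPU sync
        contract.acknowledge();
    });
    contract.request_exit();
    const bool in_time = contract.wait_for_ack(patience);
    // On a timeout a real supervisor could complain and wait again;
    // this sketch simply keeps waiting instead of terminating the worker.
    worker.join();
    return in_time;
}
```

For example, worker_left_in_time(std::chrono::milliseconds(10), std::chrono::milliseconds(2000)) returns true, while a cleanup that takes longer than the patience interval returns false and the worker is still allowed to finish.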
Raistmer
Volunteer tester
Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 52109 - Posted: 28 Aug 2014, 20:05:46 UTC - in response to Message 52103.  
Last modified: 28 Aug 2014, 20:16:31 UTC

Could you please explain why BOINC would kill the app before boinc_finish(0) is called?

EDIT: to explain my question: there are no crashes during the work itself. If a task switch occurs, it's always a graceful one.
The crash occurs only after the computation has completely finished but before the boinc_finish() call.

EDIT2: well, taking into account that stderr might not be flushed, I can't actually guarantee that the crash occurs before boinc_finish()... Looks like I need to add a Sleep(few seconds) call to be sure the first part of the sync printing sequence is really saved on the HDD...
Mike
Volunteer tester
Joined: 16 Jun 05
Posts: 2530
Credit: 1,074,556
RAC: 0
Germany
Message 52111 - Posted: 28 Aug 2014, 20:29:39 UTC

I'm wondering why it happens only on some hosts.
You never know how long the sleep before shutdown needs to be.
With each crime and every kindness we birth our future.
jason_gee
Volunteer tester
Joined: 11 Dec 08
Posts: 198
Credit: 658,573
RAC: 0
Australia
Message 52112 - Posted: 28 Aug 2014, 20:32:22 UTC - in response to Message 52109.  

Could you please explain why BOINC would kill the app before boinc_finish(0) is called?


I can try, though I'd be explaining faulty design approaches, which can be madness.

There are two ways it can look like boinc_finish() wasn't reached:
1) Actually before boinc_finish()
First, either the app indicated in its state that it was finished, or the client requested a quit (for client/system shutdown, suspend...)
The client enters its timeout loop... (OK)
Next come other client scheduling work, other task(s) finishing, new ones starting, and likely project comms (probably peak system contention).
- The app doesn't get scheduled due to high contention, so it doesn't reach boinc_exit or finish.

2) After the boinc_finish() call, but stderr didn't finish writing its buffer to disk before the text 'called boinc_finish()' was written, and the crash dump gets written there instead due to corrupted DLL global data (cancelled IO buffers). Most likely, with commode linked in, you still get the crash dump successfully.

The second case is an example of bizarre behaviour that says it's doing one thing and is really doing another.

Anyway, neither of those cases, nor killing anything at all (as in life) without very good reason, is ever a successful approach. Trying every other option first, such as actually asking the OS or the app whether it has even had some CPU time to shut down, would be better.

We know there are or have been cases of 'stuck tasks', but brutality doesn't solve that; it only introduces more bizarre behaviour.

So it comes down to this: the client was implemented as though it has control of things that it doesn't; the OS and the user do. The imperative 'do this', 'do that' control approach died out with single-core systems & the OSes that supported them.
jason_gee
Volunteer tester
Joined: 11 Dec 08
Posts: 198
Credit: 658,573
RAC: 0
Australia
Message 52113 - Posted: 28 Aug 2014, 20:36:47 UTC - in response to Message 52111.  

I'm wondering why it happens only on some hosts.
You never know how long the sleep before shutdown needs to be.


Because there are disk, memory, network, and (in the GPU app case) graphics subsystem drivers and hardware involved, as well as other applications. Drivers vary in quality (latency in particular), and firmware can have bugs that introduce delays.

It would only really take a loaded system getting stuck for a short time on a Wi-Fi connection dropping out or being disabled, and then every waiting task is thrown out by a second. Then throw in someone actually using the machine, and, well, you get the picture.
jason_gee
Volunteer tester
Joined: 11 Dec 08
Posts: 198
Credit: 658,573
RAC: 0
Australia
Message 52114 - Posted: 28 Aug 2014, 20:46:40 UTC
Last modified: 28 Aug 2014, 20:57:26 UTC

EDIT: to explain my question: there are no crashes during the work itself. If a task switch occurs, it's always a graceful one.
The crash occurs only after the computation has completely finished but before the boinc_finish() call.

EDIT2: well, taking into account that stderr might not be flushed, I can't actually guarantee that the crash occurs before boinc_finish()... Looks like I need to add a Sleep(few seconds) call to be sure the first part of the sync printing sequence is really saved on the HDD...


Good idea.

Also, you might be able to cap progress in the state at 99.99% until the very last moment, after GPU syncing [flushing,] & cleanup is done.

[Finished file creation too, perhaps. That should avoid BOINC detecting completion too early via the 'exited with no finished file' problem]
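
The progress cap suggested above could be sketched as follows. This is an assumption-laden illustration: reported_fraction and kProgressCap are hypothetical names, and the returned value would presumably be what the app passes to BOINC's boinc_fraction_done(). That call is deliberately left out so the sketch compiles without the BOINC headers.

```cpp
#include <algorithm>

// Hypothetical cap; 0.9999 mirrors the "99.99%" suggested in the post.
constexpr double kProgressCap = 0.9999;

// Returns the fraction the app would report to the client.
// Until GPU sync, flushing and cleanup are done, the value never reaches 1.0.
double reported_fraction(double real_fraction, bool cleanup_done) {
    const double cap = cleanup_done ? 1.0 : kProgressCap;
    return std::min(std::max(real_fraction, 0.0), cap);
}
```

With this shape, a task whose computation has finished keeps reporting 0.9999 until the cleanup flag is set, and only then reports 1.0.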
Urs Echternacht
Volunteer tester
Joined: 18 Jan 06
Posts: 1038
Credit: 18,734,730
RAC: 0
Germany
Message 52116 - Posted: 28 Aug 2014, 23:23:06 UTC

Jason, what you described here sounds like the "finish file present too long" error that happens sporadically on Mac OS X and Linux hosts.

Was the decision to use a hardcoded timeout made during the pure CPU crunching era?
_\|/_
U r s
jason_gee
Volunteer tester
Joined: 11 Dec 08
Posts: 198
Credit: 658,573
RAC: 0
Australia
Message 52118 - Posted: 29 Aug 2014, 0:16:46 UTC - in response to Message 52116.  
Last modified: 29 Aug 2014, 0:45:10 UTC

Jason, what you described here sounds like the "finish file present too long" error that happens sporadically on Mac OS X and Linux hosts.

Was the decision to use a hardcoded timeout made during the pure CPU crunching era?


Quite probably related, yes (though design issues of this nature can have weird and varied effects like that; it has even mutated over time on Windows).

Most likely it was designed and implemented well before my time, though there have been various commits over the years poking at some aspects in 'bandaid fix' form, around the right areas in the client and API.

In those days most hosts would also have been single core, and C runtime libraries single-threaded & statically linked, most often with no dynamic linking (which is mandatory with GPUs nowadays, at least for driver calls, even if stubs can be made static these days). I strongly suspect that increasing issues on Linux and Mac might be related to GCC's move to multithreaded runtimes as well, running afoul of the same system limitations.

I'm not familiar [at all] with the Darwin/FreeBSD/OSX kernel and libraries, but I expect that, like Linux, they are slowly pushing toward desktop responsiveness optimisations, as Windows did from XP onwards. That involves buffering a lot of IO in extra levels and performing it later, when system demand is not high. Android naturally follows a similar model, though it has enough problems with hardware variation to hide a lot of those issues.

[Edit:] The key point to consider is that multithreading in the runtime libraries, drivers and kernels uses 'helper threads' to offload IO tasks for later completion via IO completion routines (callbacks etc.). That's a departure from heavily hardware-interrupt-driven designs into 'deferred procedure call' (software interrupt) territory. In extreme examples that's used to accumulate IO for flash devices to reduce write wear, and to offload tasks to low-power dedicated cores.

Killing the app kills its threads and kills whatever incomplete IO was happening (perhaps including queued/postponed deletion of a finished file), potentially leaving a dangling leak or, worse, freeing some memory under a DMA transfer (like GPU-CPU). That last example was the cause of the CUDA 'sticky downclocks' back in the day, so it is an example of how the same issue has weird side effects.

That's why I suggest that killing processes should be an absolute last resort when all other options fail, whereas in both the BOINC client and API it seems to be standard practice after minimal effort.
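
As a toy model of the 'helper thread' idea above, the sketch below queues writes and lets a background thread complete them later. DeferredWriter is a hypothetical class written for illustration; it does not reproduce how the MS C runtime or any driver is implemented. It only shows why an orderly drain before exit matters: whatever is still queued when the process is killed never completes.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a runtime helper thread that completes writes later.
class DeferredWriter {
public:
    DeferredWriter() : helper_([this] { run(); }) {}
    ~DeferredWriter() { drain(); }  // orderly exit: finish queued work first

    // Returns immediately; the write is only queued, not yet completed.
    void write(std::string line) {
        std::lock_guard<std::mutex> lock(m_);
        pending_.push_back(std::move(line));
        cv_.notify_one();
    }

    // Orderly shutdown: let the helper finish everything queued, then join it.
    void drain() {
        {
            std::lock_guard<std::mutex> lock(m_);
            if (stopping_) return;  // already drained
            stopping_ = true;
        }
        cv_.notify_one();
        helper_.join();
    }

    // Snapshot of the writes the helper has completed so far.
    std::vector<std::string> completed() {
        std::lock_guard<std::mutex> lock(m_);
        return done_;
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(m_);
        for (;;) {
            cv_.wait(lock, [this] { return stopping_ || !pending_.empty(); });
            if (pending_.empty()) return;  // stopping and nothing left to do
            done_.push_back(std::move(pending_.front()));  // the "late" completion
            pending_.pop_front();
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::deque<std::string> pending_;
    std::vector<std::string> done_;
    bool stopping_ = false;
    std::thread helper_;  // declared last so the other members exist before it starts
};
```

Before drain() is called, completed() may hold anywhere from none to all of the queued lines, which is the same uncertainty a process faces when it is terminated from outside instead of being allowed to exit normally.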
Raistmer
Volunteer tester
Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 52170 - Posted: 30 Aug 2014, 20:00:29 UTC
Last modified: 30 Aug 2014, 20:01:06 UTC

<wrong thread>
Raistmer
Volunteer tester
Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 52208 - Posted: 2 Sep 2014, 11:58:00 UTC
Last modified: 2 Sep 2014, 11:59:56 UTC

@all who are experiencing this issue with the AP 7.03 NV app

Please try this build under the anonymous platform and report any tasks that crashed.
https://www.dropbox.com/s/vlm2nxu8jzb1h7n/AP7_win_x86_SSE2_OpenCL_NV_r2672_crash_after_finish_debug.7z?dl=0
Raistmer
Volunteer tester
Joined: 18 Aug 05
Posts: 2423
Credit: 15,878,738
RAC: 0
Russia
Message 52240 - Posted: 4 Sep 2014, 16:38:46 UTC

OK, I have collected the required logs from my own host:

http://setiweb.ssl.berkeley.edu/beta/result.php?resultid=17585594

A putative fix will be available around the weekend.