"Zombie" AP tasks - still alive in AP v7

Jeff Buck Volunteer tester Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
Message 1589880 - Posted: 22 Oct 2014, 2:01:09 UTC

Well, I had another BOINC crash at 4:47 this morning, just 3 days after the last one, but on a different machine, my xw9400. Same apparent trigger, but with a single AP already running on each of the other 3 GPUs, I ended up with 4 AP zombies, tasks 3793460876, 3793445511, 3793445524 and 3793460874. Based on the event log, that last one appears to have been the trigger.

I also found an MB task, 3793085246, with a "boinc_finish_called" file in its slot directory. I suspect that it was just unlucky enough to get caught in its termination phase when the crash happened. In any event, deleting the "finish" file before restarting BOINC also allowed it to restart and then finish again normally.

I got to thinking about the apparent gap of 6+ months between these BOINC crashes and the subsequent 2 crashes in 3 days on different machines. What was different during those 6+ months? One promising theory that I had was that the period roughly coincided with the span where we were processing mostly older 2008 and 2009 data. Then we recently jumped ahead to more recent tapes, 2010-2014. That theory almost works, but I found one flaw in it. In reviewing the BOINC crash occurrences, I found one that I had forgotten about, on June 30. That one turned out to be a 2009 file. So far, it's the only fly in the ointment, though.

Anyway, as long as I've dug this additional info out, I'll go ahead and post it in the hopes that it may yet prove useful when more clues surface. Here is a list of all my BOINC crashes which generated AP zombie tasks. The list shows the date and host ID, followed by the dataset name of the AP task which appears to have triggered the crash. (I didn't capture the stdoutdae file for the December 30, 2013, crash, so I don't know for sure which "zombie" was the last one to start.)
20131230_7057115: ap_16oc13ac_B3_P0_00113_20131229_06439.wu_2 (don't know which of these 2 tasks was trigger)
20131230_7057115: ap_16oc13ad_B6_P1_00200_20131229_05567.wu_1 (don't know which of these 2 tasks was trigger)
20140104_7057115: ap_17oc13ac_B1_P0_00131_20140103_01567.wu_1
20140209_7057115: ap_10ap13aa_B5_P1_00191_20140208_30452.wu_1
20140310_6980751: ap_28my13ad_B1_P0_00265_20140309_08199.wu_2 (morning crash)
20140310_6980751: ap_28my13ad_B3_P1_00211_20140310_25141.wu_1 (evening crash)
20140630_6980751: ap_13mr09ab_B3_P0_00183_20140628_15930.wu_0
20141018_7057115: ap_21no10ab_B0_P0_00325_20141016_19978.wu_2
20141021_6980751: ap_06se14aa_B6_P1_00113_20141020_01141.wu_1

ID: 1589880
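Editor's sketch: the tape-year theory in the post above can be checked mechanically against the list. The layout assumed here ("ap_", two-digit day, two-letter month code, two-digit year) is inferred only from the names in this list, and the function name `tape_year` is hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Returns the four-digit tape year read from an AP task name such as
// "ap_13mr09ab_B3_P0_...", or -1 if the name does not fit the assumed
// "ap_" + DD + two-letter month + YY layout. Assumes all years are 20YY.
inline int tape_year(const std::string& name) {
    if (name.size() < 9 || name.compare(0, 3, "ap_") != 0) return -1;
    auto is_digit = [&](std::size_t i) {
        return std::isdigit(static_cast<unsigned char>(name[i])) != 0;
    };
    auto is_alpha = [&](std::size_t i) {
        return std::isalpha(static_cast<unsigned char>(name[i])) != 0;
    };
    if (!is_digit(3) || !is_digit(4) || !is_alpha(5) || !is_alpha(6) ||
        !is_digit(7) || !is_digit(8)) {
        return -1;
    }
    return 2000 + (name[7] - '0') * 10 + (name[8] - '0');
}
```

Run over the list, this yields 2013 for every crash through March 2014, 2009 for the June 30 task, and 2010 and 2014 for the two October crashes. That agrees with the single 2009 exception described in the post.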

Wiggo Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489
Message 1589953 - Posted: 22 Oct 2014, 4:31:56 UTC
Last modified: 22 Oct 2014, 4:32:52 UTC

I don't wish to rain on your parade but a couple of those tasks have validated with your wingperson and the others are just waiting for your wingpersons to return their results. ;-)

Cheers.

ID: 1589953

Jeff Buck Volunteer tester Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
Message 1589992 - Posted: 22 Oct 2014, 5:45:49 UTC - in response to Message 1589953.

[quote]I don't wish to rain on your parade but a couple of those tasks have validated with your wingperson and the others are just waiting for your wingpersons to return their results. ;-)

Cheers.[/quote]

Dry as a bone here in California. ;^)

If you're referring to the tasks from the last BOINC crash, the reason they're fine is that I deleted all the "finish" files from the slot directories before restarting BOINC. If you look down through the Stderr for each of them, you'll see two calls to boinc_finish, such as these in task 3793460876:

05:29:08 (1344): called boinc_finish(0)
11:40:54 (1236): called boinc_finish(0)

The first was generated when the zombie task originally completed (after continuing to run for about 45 minutes following BOINC's crash at 04:47). The second is for the completion after I discovered the crash, deleted the original finish file and restarted BOINC. The task goes back to the last checkpoint, then generally finishes again in a few minutes.

ID: 1589992
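Editor's sketch: the "two calls to boinc_finish" signature described above can be spotted by counting matching lines in a task's Stderr text. The function names are hypothetical, and the only log format assumed is the "called boinc_finish(" substring shown in the post.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Counts the lines of a Stderr dump that contain "called boinc_finish(".
inline std::size_t count_finish_calls(const std::string& stderr_text) {
    std::istringstream in(stderr_text);
    std::string line;
    std::size_t count = 0;
    while (std::getline(in, line)) {
        if (line.find("called boinc_finish(") != std::string::npos) ++count;
    }
    return count;
}

// More than one call suggests the task completed once as a zombie and
// once more after BOINC was restarted, as described in the post.
inline bool finished_more_than_once(const std::string& stderr_text) {
    return count_finish_calls(stderr_text) > 1;
}
```

A count above one is only a hint. It does not say why the task finished twice.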

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Message 1592392 - Posted: 26 Oct 2014, 10:41:17 UTC

The current function that checks the exit condition looks like:

inline void ExitCheck(){
check_repeat:
    if (boinc_status.quit_request || boinc_status.abort_request || !canRun) {
        /* fprintf(stderr,"DEBUG: polled for exit/suspend request: exit needed. Flags are: boinc_status.quit_request=%d, \
        boinc_status.abort_request=%d, canRun=%d\n", boinc_status.quit_request,boinc_status.abort_request,canRun);*/
        DoSyncExit();
    }else if(boinc_status.suspended){
        Sleep(100); //R:await in sleep 100ms
        /* fprintf(stderr,"DEBUG: polled for exit/suspend request: sleep needed. Flags are: boinc_status.quit_request=%d, \
        boinc_status.abort_request=%d, canRun=%d\n", boinc_status.quit_request,boinc_status.abort_request,canRun);*/
        goto check_repeat;//R: check again if exit required or sleep continues
    }else{
        /* fprintf(stderr,"DEBUG: polled for exit/suspend request: exit NOT needed. Flags are: boinc_status.quit_request=%d, \
        boinc_status.abort_request=%d, canRun=%d\n", boinc_status.quit_request,boinc_status.abort_request,canRun);*/
    }
}

DoSyncExit() calls DoSync(), which in turn is quite verbose about its actions and spams stderr. If there are no corresponding lines in stderr for OpenCL AP/MB (they share the exit check code), then DoSyncExit() was missed. If recent BOINC behavior changes require some different flags to check that I'm not aware of, let me know. Also please describe the exact test case to reproduce this "zombie" issue.

ID: 1592392
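Editor's sketch: the branch order of ExitCheck() can be restated as a pure decision function plus a loop in place of the goto. `StatusFlags`, `ExitAction`, `decide_action` and `wait_while_suspended` are hypothetical names. `StatusFlags` stands in for only the three flags polled in the post; it is not the real BOINC status structure.

```cpp
#include <cassert>

// Hypothetical stand-in for the three status flags polled in the post.
struct StatusFlags {
    bool quit_request;
    bool abort_request;
    bool suspended;
};

enum class ExitAction { Exit, Sleep, Continue };

// Same precedence as ExitCheck(): an exit request (or !can_run) wins over
// suspension, and suspension wins over normal running.
inline ExitAction decide_action(const StatusFlags& s, bool can_run) {
    if (s.quit_request || s.abort_request || !can_run) return ExitAction::Exit;
    if (s.suspended) return ExitAction::Sleep;
    return ExitAction::Continue;
}

// Loop form of the goto: poll again after each sleep until the answer is
// no longer Sleep. `poll` returns the current flags and `sleep_fn` is
// called once per suspended poll; both are assumptions of this sketch.
template <typename Poll, typename SleepFn>
ExitAction wait_while_suspended(Poll poll, SleepFn sleep_fn, bool can_run) {
    for (;;) {
        ExitAction action = decide_action(poll(), can_run);
        if (action != ExitAction::Sleep) return action;
        sleep_fn();
    }
}
```

None of the three flags reports a dead client. That fits the zombie behaviour discussed in this thread: with nothing left to set quit_request or abort_request, the task keeps running.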

Jeff Buck Volunteer tester Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
Message 1592527 - Posted: 26 Oct 2014, 17:07:38 UTC - in response to Message 1592392.
Last modified: 26 Oct 2014, 17:38:23 UTC

[quote]Also please describe the exact test case to reproduce this "zombie" issue.[/quote]

There appear to be two issues at work here. The BOINC crashes themselves seem to be both rare and random. Although every one has been immediately preceded by the start of an AP task, I don't know if they're reproducible and don't have any test case that I could point to. Also, because the zombie AP tasks themselves are the result of the BOINC crash, and not necessarily the cause, I have no specific test case for them.

However, the zombie task situation is reproducible, as I detailed in "Zombie" AP tasks - still alive when BOINC should have killed them. In order to simulate a BOINC crash and create the zombie task situation:

1. Have one or more AP GPU tasks running under the BOINC client and BOINC Manager.
2. Exit the BOINC Manager while leaving the BOINC client running (uncheck the box in the "exit dialog").
3. Change the password in the gui_rpc_auth.cfg file.
4. Attempt to restart the BOINC Manager. You should get a "Connection Error" message stating, "Authorization failed connecting to running client. Make sure you start this program in the same directory as the client." Ignore it.
5. Shut down the BOINC Manager again, but this time tell it to "stop running tasks" (in the "exit dialog").

You should now find your AP tasks running standalone, without benefit of the BOINC client or BOINC Manager. If you let them run to completion they will produce a "finish" file. If you restart BOINC Manager and the BOINC client while the zombie tasks are still running, it will likely attempt to start another instance of the running AP task(s) but will end up putting them into a "waiting to run" status because the zombie tasks have control of the lockfile.

Edit: HAL9000 also simulated a BOINC crash using a taskkill command, as shown in an earlier message in this thread.

ID: 1592527

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Message 1592558 - Posted: 26 Oct 2014, 18:13:29 UTC - in response to Message 1592527.

Hm... your way seems to be more about BOINC development, not the app. But I agree the app should exit if no BOINC client governs it for long enough. Hence killing boinc.exe should stop all its tasks after some (not too big) amount of time. I'll try to reproduce the boinc.exe task kill and then raise the issue on the BOINC dev list if the case is reproducible.

ID: 1592558

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Message 1592565 - Posted: 26 Oct 2014, 18:25:56 UTC - in response to Message 1592558.

Yes, it's reproducible.

ID: 1592565

Jeff Buck Volunteer tester Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
Message 1592567 - Posted: 26 Oct 2014, 18:27:39 UTC - in response to Message 1592558.

[quote]Hm... your way seems to be more about BOINC development, not the app. But I agree the app should exit if no BOINC client governs it for long enough. Hence killing boinc.exe should stop all its tasks after some (not too big) amount of time. I'll try to reproduce the boinc.exe task kill and then raise the issue on the BOINC dev list if the case is reproducible.[/quote]

Yes, it's not the BOINC crash itself that matters so much, actual or simulated. It's the fact that the AP tasks keep running, eventually generate a "finish" file, and then get marked as errors with "Finish file present too long" when BOINC is later restarted.

In my last post in that earlier thread, I also noted my test results on several of my hosts, with different OS/hardware/app combinations. I found that MB tasks running on an ATI GPU also exhibited this same "zombie" behavior.

ID: 1592567

Jeff Buck Volunteer tester Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
Message 1592568 - Posted: 26 Oct 2014, 18:29:50 UTC - in response to Message 1592565.

[quote]Yes, it's reproducible.[/quote]

That's good. Reproducibility is certainly a key element. :^)

ID: 1592568

HAL9000 Volunteer tester Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57
Message 1592589 - Posted: 26 Oct 2014, 19:15:06 UTC - in response to Message 1592558.

[quote]Hm... your way seems to be more about BOINC development, not the app. But I agree the app should exit if no BOINC client governs it for long enough. Hence killing boinc.exe should stop all its tasks after some (not too big) amount of time. I'll try to reproduce the boinc.exe task kill and then raise the issue on the BOINC dev list if the case is reproducible.[/quote]

I was just thinking "oh no!". If there is code telling the science app to exit if it cannot find the BOINC client, would it then also exit when running in offline test mode, such as doing a benchmark/debugging test run? Or does that not apply when there is no parent application?

SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]

ID: 1592589

TBar Volunteer tester Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
Message 1592628 - Posted: 26 Oct 2014, 21:16:40 UTC - in response to Message 1592565.

[quote]Yes, it's reproducible.[/quote]

It's been that way for a long time. My Old Dell had that problem in XP with MBs. After a couple of days the BOINC Manager would crash, leaving the App running. If it ran long enough the task would finish and then be deemed invalid. I never had a task finish valid after the Manager crashed. I stopped using that machine long ago.

ID: 1592628

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Message 1592655 - Posted: 26 Oct 2014, 22:06:08 UTC - in response to Message 1592589.

[quote][quote]Hm... your way seems to be more about BOINC development, not the app. But I agree the app should exit if no BOINC client governs it for long enough. Hence killing boinc.exe should stop all its tasks after some (not too big) amount of time. I'll try to reproduce the boinc.exe task kill and then raise the issue on the BOINC dev list if the case is reproducible.[/quote]

I was just thinking "oh no!". If there is code telling the science app to exit if it cannot find the BOINC client, would it then also exit when running in offline test mode, such as doing a benchmark/debugging test run? Or does that not apply when there is no parent application?[/quote]

Being launched standalone is different from being orphaned/abandoned.

ID: 1592655

BilBg Volunteer tester Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0
Message 1593488 - Posted: 28 Oct 2014, 12:04:33 UTC - in response to Message 1592589.

Apps detect at the start if they are "running in offline test mode".
Look in your 'Testdatas' and you will see:
"Can't open init data file - running in standalone mode"
or
"Can't set up shared mem: -1. Will run in standalone mode."

- ALF - "Find out what you don't do well ..... then don't do it!" :)

ID: 1593488

Claggy Volunteer tester Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4
Message 1595021 - Posted: 31 Oct 2014, 18:28:52 UTC

David has checked in an api fix:

http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=f0c39bdf5117d8f7dd5092033971d7f700bd22dc

API: fix bug where app doesn't exit if client dies while app in critical section

There were two parts to this:
- In the timer thread, we need to check for client death even if we're in a critical section. If both conditions hold, set the no_heartbeat status flag.
- In boinc_end_critical_section(), check no_heartbeat and exit if set.

Also: the various checks in boinc_end_critical_section() (quit, abort, no heartbeat) should be conditioned on options.direct_process_action. Otherwise wrappers that use critical sections won't do the right thing.

Claggy

ID: 1595021
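Editor's sketch: the two parts of the fix described in the commit message can be modelled as a tiny state machine. This is not the BOINC API source. `ApiModel` and its members are hypothetical names that echo the commit message, and the behaviour outside a critical section (act at once) is an assumption of this sketch.

```cpp
#include <cassert>

// Hypothetical model of the state named in the commit message.
struct ApiModel {
    bool in_critical_section = false;
    bool no_heartbeat = false;
    bool direct_process_action = true;  // false models a wrapper
    bool exited = false;                // stands in for the process exiting

    void begin_critical_section() { in_critical_section = true; }

    // Part 1: the timer thread notices client death even inside a critical
    // section. There it only records the fact in no_heartbeat.
    void timer_tick(bool client_alive) {
        if (client_alive || exited) return;
        no_heartbeat = true;
        // Assumption: outside a critical section the app may act at once.
        if (!in_critical_section && direct_process_action) exited = true;
    }

    // Part 2: leaving the critical section acts on the recorded flag, but
    // only when direct_process_action is set.
    void end_critical_section() {
        in_critical_section = false;
        if (direct_process_action && no_heartbeat) exited = true;
    }
};
```

In this model an app whose client dies during a critical section exits as soon as the section ends. A wrapper with direct_process_action cleared only sees the flag and decides for itself.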

HAL9000 Volunteer tester Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57
Message 1595069 - Posted: 31 Oct 2014, 19:42:43 UTC - in response to Message 1595021.

[quote]David has checked in an api fix:

http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=f0c39bdf5117d8f7dd5092033971d7f700bd22dc

API: fix bug where app doesn't exit if client dies while app in critical section

There were two parts to this:
- In the timer thread, we need to check for client death even if we're in a critical section. If both conditions hold, set the no_heartbeat status flag.
- In boinc_end_critical_section(), check no_heartbeat and exit if set.

Also: the various checks in boinc_end_critical_section() (quit, abort, no heartbeat) should be conditioned on options.direct_process_action. Otherwise wrappers that use critical sections won't do the right thing.

Claggy[/quote]

So... now all that is left is to recompile all the apps, verify it works or nothing new is broken, & then deploy. You know, the simple part.

SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]

ID: 1595069

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874
Message 1595087 - Posted: 31 Oct 2014, 20:14:10 UTC - in response to Message 1595069.

[quote]So... now all that is left is to recompile all the apps, verify it works or nothing new is broken, & then deploy. You know, the simple part.[/quote]

Yup.

ID: 1595087

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Message 1595186 - Posted: 31 Oct 2014, 22:30:08 UTC - in response to Message 1595087.

Ah, it was a new bugfix... I missed that from David's post on the mailing list. Well, then this issue will remain until some versioned revision appears that includes that fix.

ID: 1595186

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874
Message 1595188 - Posted: 31 Oct 2014, 22:30:48 UTC - in response to Message 1595186.

[quote]Ah, it was a new bugfix... I missed that from David's post on the mailing list. Well, then this issue will remain until some versioned revision appears that includes that fix.[/quote]

f0c39bdf5117d8f7dd5092033971d7f700bd22dc

ID: 1595188

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Message 1595196 - Posted: 31 Oct 2014, 22:45:46 UTC - in response to Message 1595188.

[quote][quote]Ah, it was a new bugfix... I missed that from David's post on the mailing list. Well, then this issue will remain until some versioned revision appears that includes that fix.[/quote]

f0c39bdf5117d8f7dd5092033971d7f700bd22dc[/quote]

It's not a useful version number to refer to. We discussed that issue already.

ID: 1595196

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874
Message 1595198 - Posted: 31 Oct 2014, 22:51:05 UTC - in response to Message 1595196.

[quote][quote][quote]Ah, it was a new bugfix... I missed that from David's post on the mailing list. Well, then this issue will remain until some versioned revision appears that includes that fix.[/quote]

f0c39bdf5117d8f7dd5092033971d7f700bd22dc[/quote]

It's not a useful version number to refer to. We discussed that issue already.[/quote]

It's a very precise patch number which gives you exactly the code changes you need. You are looking for client version numbers, which are irrelevant to this code path. What you need are API version numbers, which haven't been created and are likely - on current experience - never to exist. We have to live in that environment.

ID: 1595198
