"Zombie" AP tasks - still alive in AP v7

Message boards : Number crunching : "Zombie" AP tasks - still alive in AP v7
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1589880 - Posted: 22 Oct 2014, 2:01:09 UTC

Well, I had another BOINC crash at 4:47 this morning, just 3 days after the last one, but on a different machine, my xw9400. Same apparent trigger, but with a single AP already running on each of the other 3 GPUs, I ended up with 4 AP zombies, tasks 3793460876, 3793445511, 3793445524 and 3793460874. Based on the event log, that last one appears to have been the trigger. I also found an MB task, 3793085246, with a "boinc_finish_called" file in its slot directory. I suspect that it was just unlucky enough to get caught in its termination phase when the crash happened. In any event, deleting the "finish" file before restarting BOINC also allowed it to restart and then finish again normally.

I got to thinking about the apparent gap of 6+ months between these BOINC crashes and the subsequent 2 crashes in 3 days on different machines. What was different during those 6+ months? One promising theory that I had was that the period roughly coincided with the span where we were processing mostly older 2008 and 2009 data. Then we recently jumped ahead to more recent tapes, 2010-2014.

That theory almost works, but I found one flaw in it. In reviewing the BOINC crash occurrences, I found one that I had forgotten about, on June 30. That one turned out to be a 2009 file. So far, it's the only fly in the ointment, though.

Anyway, as long as I've dug this additional info out, I'll go ahead and post it in the hopes that it may yet prove useful when more clues surface. Here is a list of all my BOINC crashes which generated AP zombie tasks. The list shows the date and host ID, followed by the dataset name of the AP task which appears to have triggered the crash. (I didn't capture the stdoutdae file for the December 30, 2013, crash, so I don't know for sure which "zombie" was the last one to start.)

20131230_7057115: ap_16oc13ac_B3_P0_00113_20131229_06439.wu_2 (don't know which of these 2 tasks was trigger)
20131230_7057115: ap_16oc13ad_B6_P1_00200_20131229_05567.wu_1 (don't know which of these 2 tasks was trigger)
20140104_7057115: ap_17oc13ac_B1_P0_00131_20140103_01567.wu_1
20140209_7057115: ap_10ap13aa_B5_P1_00191_20140208_30452.wu_1
20140310_6980751: ap_28my13ad_B1_P0_00265_20140309_08199.wu_2 (morning crash)
20140310_6980751: ap_28my13ad_B3_P1_00211_20140310_25141.wu_1 (evening crash)
20140630_6980751: ap_13mr09ab_B3_P0_00183_20140628_15930.wu_0
20141018_7057115: ap_21no10ab_B0_P0_00325_20141016_19978.wu_2
20141021_6980751: ap_06se14aa_B6_P1_00113_20141020_01141.wu_1
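
To check the "older tapes" theory against a list like this, the recording year can be pulled out of each dataset name. The sketch below assumes the second underscore-separated field has the form DDmmYYxx (day, two-letter month code, two-digit year, two-letter suffix), which is inferred from the names above and from the June 30 entry being described as a 2009 file. The function name tape_year is hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical helper: extract the recording year from an AP dataset name such as
// "ap_13mr09ab_B3_P0_00183_20140628_15930.wu_0".
// Assumes the second '_'-separated field is DDmmYYxx; returns nullopt if it is not.
std::optional<int> tape_year(const std::string& name) {
    const std::size_t first = name.find('_');
    if (first == std::string::npos) return std::nullopt;
    const std::size_t second = name.find('_', first + 1);
    if (second == std::string::npos) return std::nullopt;

    const std::string tape = name.substr(first + 1, second - first - 1);
    if (tape.size() < 6) return std::nullopt;

    const unsigned char y1 = tape[4];
    const unsigned char y2 = tape[5];
    if (!std::isdigit(y1) || !std::isdigit(y2)) return std::nullopt;

    // Two-digit year, assumed to fall in the 2000s.
    return 2000 + (y1 - '0') * 10 + (y2 - '0');
}
```

Applied to the list above, this would give 2013 for the first six entries, 2009 for the June 30 crash, and 2010 and 2014 for the two October crashes.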
ID: 1589880 · Report as offensive
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1589953 - Posted: 22 Oct 2014, 4:31:56 UTC
Last modified: 22 Oct 2014, 4:32:52 UTC

I don't wish to rain on your parade but a couple of those tasks have validated with your wingperson and the others are just waiting for your wingpersons to return their results. ;-)

Cheers.
ID: 1589953 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1589992 - Posted: 22 Oct 2014, 5:45:49 UTC - in response to Message 1589953.  

I don't wish to rain on your parade but a couple of those tasks have validated with your wingperson and the others are just waiting for your wingpersons to return their results. ;-)

Cheers.

Dry as a bone here in California. ;^)

If you're referring to the tasks from the last BOINC crash, the reason they're fine is that I deleted all the "finish" files from the slot directories before restarting BOINC. If you look down through the Stderr for each of them, you'll see two calls to boinc_finish, such as these in task 3793460876:

05:29:08 (1344): called boinc_finish(0)
11:40:54 (1236): called boinc_finish(0)

The first was generated when the zombie task originally completed (after continuing to run for about 45 minutes following BOINC's crash at 04:47). The second is for the completion after I discovered the crash, deleted the original finish file and restarted BOINC. The task goes back to the last checkpoint, then generally finishes again in a few minutes.
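
A quick way to spot tasks that went through this "finished twice" sequence is to count the boinc_finish lines in the Stderr text. This is only a sketch: count_finish_calls is a hypothetical helper, and the marker string is taken from the two lines quoted above.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: count stderr lines that record a call to boinc_finish.
// The marker text matches the lines quoted from task 3793460876.
int count_finish_calls(const std::string& stderr_text) {
    std::istringstream in(stderr_text);
    std::string line;
    int count = 0;
    while (std::getline(in, line)) {
        if (line.find("called boinc_finish(") != std::string::npos) ++count;
    }
    return count;
}
```

A count above one would suggest the task completed once as a zombie, was rolled back to its checkpoint, and then finished again under BOINC.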
ID: 1589992 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1592392 - Posted: 26 Oct 2014, 10:41:17 UTC

The current function that checks the exit condition looks like this:

inline void ExitCheck() {
check_repeat:
    if (boinc_status.quit_request || boinc_status.abort_request || !canRun) {
        /* fprintf(stderr, "DEBUG: polled for exit/suspend request: exit needed. Flags are: boinc_status.quit_request=%d, \
boinc_status.abort_request=%d, canRun=%d\n",
            boinc_status.quit_request, boinc_status.abort_request, canRun); */
        DoSyncExit();
    } else if (boinc_status.suspended) {
        Sleep(100); // R: wait in sleep for 100 ms
        /* fprintf(stderr, "DEBUG: polled for exit/suspend request: sleep needed. Flags are: boinc_status.quit_request=%d, \
boinc_status.abort_request=%d, canRun=%d\n",
            boinc_status.quit_request, boinc_status.abort_request, canRun); */
        goto check_repeat; // R: check again whether exit is required or the sleep continues
    } else {
        /* fprintf(stderr, "DEBUG: polled for exit/suspend request: exit NOT needed. Flags are: boinc_status.quit_request=%d, \
boinc_status.abort_request=%d, canRun=%d\n",
            boinc_status.quit_request, boinc_status.abort_request, canRun); */
    }
}


DoSyncExit() calls DoSync(), which in turn is quite verbose about its actions and spams stderr. If there are no corresponding lines in stderr for the OpenCL AP/MB apps (they share the exit-check code), then DoSyncExit() was never called.

If recent changes in BOINC behavior require checking some different flags that I'm not aware of, let me know.

Also, please describe the exact test case needed to reproduce this "zombie" issue.
ID: 1592392 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1592527 - Posted: 26 Oct 2014, 17:07:38 UTC - in response to Message 1592392.  
Last modified: 26 Oct 2014, 17:38:23 UTC

Also, please describe the exact test case needed to reproduce this "zombie" issue.

There appear to be two issues at work here. The BOINC crashes themselves seem to be both rare and random. Although every one has been immediately preceded by the start of an AP task, I don't know if they're reproducible and don't have any test case that I could point to.

Also, because the zombie AP tasks themselves are the result of the BOINC crash, and not necessarily the cause, I have no specific test case for them. However, the zombie task situation is reproducible, as I detailed in "Zombie" AP tasks - still alive when BOINC should have killed them.

In order to simulate a BOINC crash and create the zombie task situation, you must first have one or more AP GPU tasks running under the BOINC client and BOINC Manager. Then exit the BOINC Manager while leaving the BOINC client running (uncheck the box in the "exit dialog"). Change the password in the gui_rpc_auth.cfg file. Attempt to restart the BOINC Manager. You should get a "Connection Error" message stating, "Authorization failed connecting to running client. Make sure you start this program in the same directory as the client." Ignore it. Now shut down the BOINC Manager again, but this time tell it to "stop running tasks" (in the "exit dialog"). You should now find your AP tasks running standalone, without benefit of the BOINC client or BOINC Manager. If you let them run to completion they will produce a "finish" file. If you restart BOINC Manager and the BOINC client while the zombie tasks are still running, it will likely attempt to start another instance of the running AP task(s) but will end up putting them into a "waiting to run" status because the zombie tasks have control of the lockfile.

Edit: HAL9000 also simulated a BOINC crash using a taskkill command, as shown in an earlier message in this thread.
ID: 1592527 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1592558 - Posted: 26 Oct 2014, 18:13:29 UTC - in response to Message 1592527.  

Hm... your approach seems to be more about BOINC development than about the app.
But I agree the app should exit if no BOINC client has governed it for long enough. Hence killing boinc.exe should stop all its tasks after some (not too long) amount of time.

I'll try to reproduce the boinc.exe task kill and then raise the issue on the BOINC dev list if the case is reproducible.
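
The "exit if no client governs it long enough" idea can be sketched as a small watchdog. This is an illustration only, not BOINC API code: the class name HeartbeatWatchdog, its methods, and any timeout value passed to it are assumptions. The current time is passed in explicitly so the logic can be exercised without waiting.

```cpp
#include <chrono>

// Hypothetical watchdog: remembers when the client was last heard from and reports
// whether the app should consider itself orphaned. Not taken from the BOINC source.
class HeartbeatWatchdog {
public:
    using Clock = std::chrono::steady_clock;

    HeartbeatWatchdog(Clock::duration timeout, Clock::time_point now)
        : timeout_(timeout), last_seen_(now) {}

    // Call whenever a heartbeat from the client arrives.
    void on_heartbeat(Clock::time_point now) { last_seen_ = now; }

    // True once the client has been silent for longer than the timeout.
    bool orphaned(Clock::time_point now) const {
        return now - last_seen_ > timeout_;
    }

private:
    Clock::duration timeout_;
    Clock::time_point last_seen_;
};
```

An app's polling loop could call orphaned(Clock::now()) next to its existing quit/abort checks and shut down cleanly when it returns true.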
ID: 1592558 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1592565 - Posted: 26 Oct 2014, 18:25:56 UTC - in response to Message 1592558.  

Yes, it's reproducible.
ID: 1592565 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1592567 - Posted: 26 Oct 2014, 18:27:39 UTC - in response to Message 1592558.  

Hm... your approach seems to be more about BOINC development than about the app.
But I agree the app should exit if no BOINC client has governed it for long enough. Hence killing boinc.exe should stop all its tasks after some (not too long) amount of time.

I'll try to reproduce the boinc.exe task kill and then raise the issue on the BOINC dev list if the case is reproducible.

Yes, it's not the BOINC crash itself that matters so much, actual or simulated; it's the fact that the AP tasks keep running and eventually generate a "finish" file, and are then marked as errors with "Finish file present too long" when BOINC is later restarted.

In my last post in that earlier thread, I also noted my test results on several of my hosts, with different OS/hardware/app combinations. I found that MB tasks running on an ATI GPU also exhibited this same "zombie" behavior.
ID: 1592567 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1592568 - Posted: 26 Oct 2014, 18:29:50 UTC - in response to Message 1592565.  

Yes, it's reproducible.

That's good. Reproducibility is certainly a key element. :^)
ID: 1592568 · Report as offensive
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1592589 - Posted: 26 Oct 2014, 19:15:06 UTC - in response to Message 1592558.  

Hm... your approach seems to be more about BOINC development than about the app.
But I agree the app should exit if no BOINC client has governed it for long enough. Hence killing boinc.exe should stop all its tasks after some (not too long) amount of time.

I'll try to reproduce the boinc.exe task kill and then raise the issue on the BOINC dev list if the case is reproducible.

I was just thinking "oh no!". If there is code telling the science app to exit when it cannot find the BOINC client, would it also exit when running in offline test mode, such as during a benchmark/debugging test run? Or does that not apply when there is no parent application?
ID: 1592589 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1592628 - Posted: 26 Oct 2014, 21:16:40 UTC - in response to Message 1592565.  

Yes, it's reproducible.

It's been that way for a long time. My old Dell had that problem in XP with MBs. After a couple of days the BOINC Manager would crash, leaving the app running. If it ran long enough the task would finish and then be deemed invalid. I never had a task finish valid after the Manager crashed. I stopped using that machine long ago.
ID: 1592628 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1592655 - Posted: 26 Oct 2014, 22:06:08 UTC - in response to Message 1592589.  

Hm... your approach seems to be more about BOINC development than about the app.
But I agree the app should exit if no BOINC client has governed it for long enough. Hence killing boinc.exe should stop all its tasks after some (not too long) amount of time.

I'll try to reproduce the boinc.exe task kill and then raise the issue on the BOINC dev list if the case is reproducible.

I was just thinking "oh no!". If there is code telling the science app to exit when it cannot find the BOINC client, would it also exit when running in offline test mode, such as during a benchmark/debugging test run? Or does that not apply when there is no parent application?


Being launched standalone is different from being orphaned/abandoned.
ID: 1592655 · Report as offensive
Profile BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1593488 - Posted: 28 Oct 2014, 12:04:33 UTC - in response to Message 1592589.  

Apps detect at the start if they are "running in offline test mode"

Look in your 'Testdatas' and you will see:
"Can't open init data file - running in standalone mode"
or
"Can't set up shared mem: -1. Will run in standalone mode."
 

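The startup check described above might look roughly like the sketch below. It is a simplified illustration, not the app's actual code: the names RunMode, detect_run_mode and should_watch_for_client are hypothetical, and the only thing taken from the messages quoted above is the idea that failing to open the init data file means standalone mode.

```cpp
#include <fstream>
#include <string>

enum class RunMode { Standalone, UnderClient };

// Hypothetical startup check: if the init data file cannot be opened, assume the app
// was not launched through the client and should run in standalone mode.
RunMode detect_run_mode(const std::string& init_data_path) {
    std::ifstream init(init_data_path);
    return init.is_open() ? RunMode::UnderClient : RunMode::Standalone;
}

// A "client has gone away" check only makes sense for tasks that started under the
// client; a standalone benchmark or debugging run never had one.
bool should_watch_for_client(RunMode mode) {
    return mode == RunMode::UnderClient;
}
```

Because the mode is decided once at startup, an offline test run would never be mistaken for an orphaned task later on.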

- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1593488 · Report as offensive
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 1595021 - Posted: 31 Oct 2014, 18:28:52 UTC

David has checked in an api fix:

http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=f0c39bdf5117d8f7dd5092033971d7f700bd22dc

API: fix bug where app doesn't exit if client dies while app in critical section

There were two parts to this:
- In the timer thread, we need to check for client death even if
we're in a critical section.
If both conditions hold, set the no_heartbeat status flag.
- In boinc_end_critical_section(), check no_heartbeat and exit if set.

Also: the various checks in boinc_end_critical_section()
(quit, abort, no heartbeat) should be conditioned on
options.direct_process_action.
Otherwise wrappers that use critical sections won't do the right thing.
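
The logic in that commit message can be modelled in a few lines. This is a simplified, hypothetical model and not the BOINC API source: the struct AppState and the function names are invented for illustration, exit_requested stands in for actually terminating the process, and the behaviour outside a critical section (exit straight away) is an assumption.

```cpp
// Simplified model of the fix described in the commit message (not BOINC code).
struct AppState {
    bool in_critical_section = false;
    bool no_heartbeat = false;          // set by the timer when the client is gone
    bool direct_process_action = true;  // mirrors options.direct_process_action
    bool exit_requested = false;        // stands in for actually exiting the process
};

// Timer-thread step: client death is checked even inside a critical section.
// Inside one, only the flag is set; outside one, this model assumes an immediate exit.
void timer_tick(AppState& s, bool client_alive) {
    if (client_alive) return;
    if (s.in_critical_section) {
        s.no_heartbeat = true;  // remember it; the exit is deferred
    } else if (s.direct_process_action) {
        s.exit_requested = true;
    }
}

void begin_critical_section(AppState& s) { s.in_critical_section = true; }

// On leaving the critical section, act on a heartbeat loss noticed meanwhile,
// but only when the app (rather than a wrapper) handles process actions itself.
void end_critical_section(AppState& s) {
    s.in_critical_section = false;
    if (s.direct_process_action && s.no_heartbeat) s.exit_requested = true;
}
```

In this model, an app whose client dies during a critical section no longer runs on indefinitely: the exit happens as soon as the section ends.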


Claggy
ID: 1595021 · Report as offensive
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1595069 - Posted: 31 Oct 2014, 19:42:43 UTC - in response to Message 1595021.  

David has checked in an api fix:

http://boinc.berkeley.edu/gitweb/?p=boinc-v2.git;a=commit;h=f0c39bdf5117d8f7dd5092033971d7f700bd22dc

API: fix bug where app doesn't exit if client dies while app in critical section

There were two parts to this:
- In the timer thread, we need to check for client death even if
we're in a critical section.
If both conditions hold, set the no_heartbeat status flag.
- In boinc_end_critical_section(), check no_heartbeat and exit if set.

Also: the various checks in boinc_end_critical_section()
(quit, abort, no heartbeat) should be conditioned on
options.direct_process_action.
Otherwise wrappers that use critical sections won't do the right thing.


Claggy

So... now all that is left is to recompile all the apps, verify it works or nothing new is broken, & then deploy. You know, the simple part.
ID: 1595069 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1595087 - Posted: 31 Oct 2014, 20:14:10 UTC - in response to Message 1595069.  

So... now all that is left is to recompile all the apps, verify it works or nothing new is broken, & then deploy. You know, the simple part.

Yup.
ID: 1595087 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1595186 - Posted: 31 Oct 2014, 22:30:08 UTC - in response to Message 1595087.  

Ah, it was a new bugfix... I missed that in David's post on the mailing list.
Well, then this issue will remain until some versioned revision appears that includes that fix.
ID: 1595186 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1595188 - Posted: 31 Oct 2014, 22:30:48 UTC - in response to Message 1595186.  

Ah, it was a new bugfix... I missed that in David's post on the mailing list.
Well, then this issue will remain until some versioned revision appears that includes that fix.

f0c39bdf5117d8f7dd5092033971d7f700bd22dc
ID: 1595188 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1595196 - Posted: 31 Oct 2014, 22:45:46 UTC - in response to Message 1595188.  

Ah, it was a new bugfix... I missed that in David's post on the mailing list.
Well, then this issue will remain until some versioned revision appears that includes that fix.

f0c39bdf5117d8f7dd5092033971d7f700bd22dc


It's not a useful version number to refer to. We discussed that issue already.
ID: 1595196 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1595198 - Posted: 31 Oct 2014, 22:51:05 UTC - in response to Message 1595196.  

Ah, it was a new bugfix... I missed that in David's post on the mailing list.
Well, then this issue will remain until some versioned revision appears that includes that fix.

f0c39bdf5117d8f7dd5092033971d7f700bd22dc

It's not a useful version number to refer to. We discussed that issue already.

It's a very precise patch number which gives you exactly the code changes you need.

You are looking for client version numbers, which are irrelevant to this code path. What you need are API version numbers, which haven't been created and are likely - on current experience - never to exist. We have to live in that environment.
ID: 1595198 · Report as offensive
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.