"Zombie" AP tasks - still alive in AP v7

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1588746 - Posted: 18 Oct 2014, 16:56:09 UTC

As it's been over 6 months since I last saw this problem pop up, my original thread, "Zombie" AP tasks - still alive when BOINC should have killed them, is locked. But since I've now had this happen with AP v7, a new thread is probably called for anyway.

The circumstances are pretty much the same. BOINC crashed on my T7400 at 2:51 AM local time, immediately after starting to process AP v7 task 3789149958. That task continued to run, without benefit of BOINC, until 3:23 AM, when the results were posted and the "finish" file created. Another AP v7 task, 3789137125, was running on a different GPU at the time of the BOINC crash and continued running until 2:56 AM. All MB tasks shut down cleanly when contact with BOINC was lost.

As it had been more than 6 months since I experienced one of these, and since my brain was still in first gear at ~8:00 AM when I discovered the chilly room with the low noise level, I forgot to delete the "finish" files for the 2 AP tasks before I restarted BOINC. Naturally, the tasks errored out with the "finish file present too long" message.

In any event, it appears that whatever changes and enhancements were made to AP for v7, none of them managed to fix the problem of AP tasks not noticing when the BOINC client goes AWOL.
ID: 1588746
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1588757 - Posted: 18 Oct 2014, 17:37:12 UTC - in response to Message 1588746.  

Given that this is a BOINC or OS issue with child processes not getting correctly killed, I don't see why changing a science app would solve this.

I most commonly see this happen on a machine where BOINC has been running for a few weeks. I will close BOINC, or restart the system, & the science apps will not have been killed. I have a batch file to run taskkill /IM appname.exe /T /F for when this happens.
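
For anyone who would rather script that cleanup than keep a batch file around, here is a minimal Python sketch. The function names and the dry_run switch are made up for illustration; only the taskkill switches are taken from the command above, and appname.exe remains a placeholder for the real science app executable.

```python
import subprocess
import sys


def build_taskkill_command(image_name):
    """Return the taskkill argument list for one executable image name."""
    # /IM selects by image name, /T also ends child processes, /F forces termination.
    return ["taskkill", "/IM", image_name, "/T", "/F"]


def kill_leftover_apps(image_names, dry_run=True):
    """Build (and, outside dry-run mode on Windows, run) one command per app."""
    commands = [build_taskkill_command(name) for name in image_names]
    if not dry_run and sys.platform == "win32":
        for command in commands:
            # check=False so a failed kill (for example, no matching
            # process) does not raise an exception.
            subprocess.run(command, check=False)
    return commands
```

With the default dry_run=True nothing is executed, so the returned command lists can be inspected before anything is killed.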
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
ID: 1588757
Jeff Buck
Message 1588772 - Posted: 18 Oct 2014, 18:34:50 UTC - in response to Message 1588757.  

Given that this is a BOINC or OS issue with child processes not getting correctly killed, I don't see why changing a science app would solve this.

Comments from Jason and Richard in my original thread, 194 (0xc2) EXIT_ABORTED_BY_CLIENT - "finish file present too long", raised the issue of a "heartbeat" mechanism that the science apps use to detect whether BOINC is still running. I also found that other tasks that did successfully shut down following a BOINC crash generated a "No heartbeat from core client for 30 sec - exiting" message in the Stderr. These included MB GPU, MB CPU, and AP CPU tasks, leaving only the AP GPU tasks oblivious to the BOINC crash. That's why I think it tends to point to the science app.

I most commonly see this happen on a machine where BOINC has been running for a few weeks. I will close BOINC, or restart the system, & the science apps will not have been killed. I have a batch file to run taskkill /IM appname.exe /T /F for when this happens.

In this instance, BOINC had only been running since 6:00 PM the previous evening. I don't think I ever saw any correlation with BOINC run time on the previous occasions. It always seemed to be a semi-random event, with the only constant being that it always happened (at least according to the Event Log) immediately after a new AP GPU task was started.

BilBg had suggested in the earlier thread that running a batch file at startup could be used to delete any unwanted "finish" files, but that's only useful if a system restart is necessary. In these cases, though, it's only BOINC that crashes and it's necessary to actually remember to delete them, rather than rely on an automatic process.
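
A rough Python sketch of such a cleanup step, to be run by hand (or from a wrapper script) before restarting BOINC. It rests on two assumptions that should be checked against the actual installation: that the "finish" marker is a file named boinc_finish_called, and that it sits in the numbered slots subdirectories of the BOINC data directory. The function name and the dry_run switch are invented for the example.

```python
from pathlib import Path

# Assumption: the "finish" marker written by the science app has this name.
FINISH_FILE_NAME = "boinc_finish_called"


def remove_finish_files(data_dir, dry_run=True):
    """Find "finish" files under <data_dir>/slots/*/ and optionally delete them.

    Intended to be run only while the BOINC client is stopped.
    Returns the list of marker files that were found.
    """
    slots_dir = Path(data_dir) / "slots"
    found = sorted(slots_dir.glob(f"*/{FINISH_FILE_NAME}"))
    if not dry_run:
        for path in found:
            path.unlink()
    return found
```

Calling it once with dry_run=True shows which slots hold a leftover marker; calling it again with dry_run=False removes them. It doesn't solve the problem of remembering to run it, of course.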
ID: 1588772
HAL9000
Message 1588844 - Posted: 19 Oct 2014, 16:46:54 UTC

I can easily reproduce the BOINC crashing issue by simply killing BOINC.
http://www.hal6000.com/seti/images/stuck_sci_apps.png

It has been about 12min since I killed BOINC & the apps are still humming right along. I did this several times & for this last instance I had told BOINC to suspend GPU processing prior to killing BOINC to see if it had any effect. It did not. The CPU & GPU apps look like they would run to completion in this scenario.
Perhaps when the processes are orphaned they are getting a heartbeat source from another application, or the mechanism stops when they are orphaned. That would be an issue for the devs to wrestle with. I just get paid to break things, not fix them.

The apps not stopping when BOINC is told to shut down would, I think, be a separate but possibly related issue.
ID: 1588844
Jeff Buck
Message 1588848 - Posted: 19 Oct 2014, 17:00:43 UTC - in response to Message 1588844.  

Do you have the "Manager exit dialog" enabled in the BOINC Manager options? If not, and if the "Stop running tasks when exiting the BOINC Manager" box wasn't checked the last time the Exit Confirmation dialog was displayed, the tasks will keep running because the BOINC client is still running. Then, when you restart the BOINC Manager, it will pick up the client and the running tasks. In the case of the zombie AP tasks, they keep running standalone, without benefit of the BOINC client, and if the BOINC Manager is restarted while they're still running, it won't actually see them.
ID: 1588848
HAL9000
Message 1588854 - Posted: 19 Oct 2014, 17:10:42 UTC - in response to Message 1588848.  

I normally run BOINC without even using BOINC Manager. Instead I choose to start & stop BOINC from a command line.
When you exit BOINC Manager and leave the BOINC client running, you will see boinc.exe still running. In my image you can clearly see that neither boinc.exe nor boincmgr.exe is running. It also shows the command line I used to kill boinc.exe using taskkill.
ID: 1588854
Jeff Buck
Message 1588863 - Posted: 19 Oct 2014, 17:23:17 UTC - in response to Message 1588854.  

Well, that was weird. I was in the middle of editing my earlier post and I apparently got logged off. In any event, what I was planning to add was that I hadn't scrolled down in your image and therefore hadn't noticed that you used the "taskkill" command, rather than a normal exit from BOINC Manager. I also missed that boinc.exe wasn't still running.

In your scenario, when you restart the BOINC client, does it successfully pick up the orphaned tasks? In my tests (see "Zombie" AP tasks - still alive when BOINC should have killed them), even if I restarted BOINC while the zombie tasks were still running, BOINC wouldn't pick them up. Rebooting the machine was necessary to bring them back under BOINC's control.
ID: 1588863
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1588868 - Posted: 19 Oct 2014, 17:31:26 UTC

Although I've never studied the technical details in the code, it's clear that there are a number of ways in which science applications (setiathome, astropulse) keep in touch with the BOINC core client. Note that the apps never talk directly to BOINC Manager - the core client does both, using two different technologies.

The core client can certainly give direct instructions to applications - in particular, to shut themselves down. [Maybe BOINC itself is shutting down, and the apps need to follow suit - or it's time for another project to have a turn.]

Although I've asked a couple of times, no developer has ever answered my question about how the instruction to shut down is passed. One can imagine the client "sending a message" - which, like a human telephone call, might get missed. Or one can imagine it "raising a flag" - something which remains visible until the application returns from whatever it's been doing, and can pay attention to it. Clearly, there's more scope for the first technique to go wrong if the application doesn't pay enough attention.

The heartbeat mechanism isn't about passing messages, but simply confirming that everything is working normally. There are actually 'old' and 'new' versions of this: the term 'heartbeat' strictly applies only to the old one, in use up to about BOINC v7.0.36: after that, the applications are supposed to monitor the PID (process identifier) of the client. But either should do, and heartbeats should still be used unless both the client and the application understand the new PID mechanism.

But whatever the inter-process communication in use, it's clear that the application - Astropulse, in this case - should take responsibility for shutting itself down if the BOINC client 'goes away' - crashes, or is deliberately killed in the course of performing this sort of test. We have been through a number of changes in this area with Astropulse over the last couple of months, but maybe there's still a need to double-check that the basic procedure works in all cases.
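
To make the 'old' heartbeat idea concrete, here is a minimal Python sketch of a timeout check on the application side. It is not the BOINC implementation: the class and method names are invented, and the 30-second figure is simply borrowed from the "No heartbeat from core client for 30 sec - exiting" message quoted earlier in the thread.

```python
import time

# Assumption: 30 seconds, taken from the Stderr message quoted in this thread.
HEARTBEAT_TIMEOUT = 30.0


class HeartbeatMonitor:
    """Tracks when the last heartbeat from the client was seen."""

    def __init__(self, timeout=HEARTBEAT_TIMEOUT, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock            # injectable so the logic can be tested
        self.last_beat = clock()

    def beat(self):
        """Record that a heartbeat has just arrived from the client."""
        self.last_beat = self.clock()

    def client_lost(self):
        """True once no heartbeat has been seen for longer than the timeout."""
        return self.clock() - self.last_beat > self.timeout
```

An application built this way would call beat() whenever the client checks in and would test client_lost() from its own timer, shutting itself down once it returns True.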
ID: 1588868
HAL9000
Message 1588870 - Posted: 19 Oct 2014, 17:35:02 UTC - in response to Message 1588863.  

The science apps are child processes of BOINC. If you look at the properties for the science apps in Process Explorer, you can see the instance of BOINC they are associated with on the Image tab. Look for Parent: boinc.exe(nnnn), where nnnn is the PID (process ID).
When that parent app is gone, it is gone. The apps shouldn't look for or attach to whatever instance of BOINC they find. That would be very problematic on machines that run multiple instances of BOINC.
ID: 1588870
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1588873 - Posted: 19 Oct 2014, 17:44:46 UTC

There's certainly enough evidence to convince Dr. Anderson to add code to BOINC initialization to avoid that "Finish file present too long" error at that time.

For this case, with BOINC 7.2.42 running on Windows 8.1, the mechanism which should have caused the r2721 NV OpenCL build to recognize that BOINC had died is that BOINC's Process ID should have no longer been valid. The timer thread from the BOINC API code built into the science app checks that PID once per second and the app should shut down if it is missing for 10 seconds.

Jeff, can you supply any more info about the "BOINC crashed on my T7400 at 2:51 AM local time" indications? I have no idea how Win 8.1 might have handled that, but on older Windows I consider it quite possible that a messagebox saying "This application is no longer responding and will be shut down." may very well not free the application PID until after the user reacts. If so, the BOINC developers may need to rethink that method of detecting when the BOINC client has died.

For CPU builds, when the timer thread recognizes that BOINC has died the science application is directly shut down. For GPU builds that would mean no cleanup of the GPU kernels, etc., so the processing loop is run in a "critical section" mode, and flags the timer thread should set are periodically checked to see if the app should clean up and exit. It's possible something in that logic has slipped as a side effect of the changes leading up to r2721.
                                                                  Joe
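
The arrangement described above (a timer thread that checks once per second and gives up after 10 misses, plus a flag that the GPU processing loop polls) might be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the BOINC API code: every name here is hypothetical, and the actual "is the client's PID still valid?" test is left as a caller-supplied function because it is platform specific.

```python
import threading
import time


class ClientWatchdog:
    """Timer-thread sketch: poll a liveness check, raise a flag when it keeps failing."""

    def __init__(self, client_is_alive, interval=1.0, max_missed=10):
        self.client_is_alive = client_is_alive   # callable returning True/False
        self.interval = interval                 # seconds between checks
        self.max_missed = max_missed             # consecutive misses tolerated
        self.client_gone = threading.Event()     # flag polled by the work loop
        self._missed = 0

    def check_once(self):
        """One timer tick: count consecutive misses, set the flag at the limit."""
        if self.client_is_alive():
            self._missed = 0
        else:
            self._missed += 1
            if self._missed >= self.max_missed:
                self.client_gone.set()

    def run(self):
        """Body of the timer thread; stops once the flag has been set."""
        while not self.client_gone.is_set():
            self.check_once()
            time.sleep(self.interval)


def process_chunks(chunks, watchdog, cleanup):
    """GPU-style loop: check the flag between chunks, clean up and stop if set."""
    done = []
    for chunk in chunks:
        if watchdog.client_gone.is_set():
            cleanup()              # e.g. release GPU resources before exiting
            break
        done.append(chunk)         # stand-in for the real work on one chunk
    return done
```

The weak point this makes visible is the one raised above: if the work loop never reaches its flag check, or the flag is never set, the app keeps running no matter what the timer thread has seen.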
ID: 1588873
Jeff Buck
Message 1588881 - Posted: 19 Oct 2014, 17:58:51 UTC - in response to Message 1588870.  

Yes, under normal circumstances, I expect to see all the running tasks shown as children of boinc.exe and, since by default I run BOINC Manager from bootup to shutdown, boinc.exe as a child of boincmgr.exe. However, with the zombie tasks, that association goes away and never comes back. I agree that it could be problematic if it did, which is why it makes sense to me that the science apps should shut down if their parent shuts down, and in my tests, almost all of them do.

I notice in the Process Explorer image you posted that the AP apps appear to be children of Process Explorer itself, so I'm curious as to what happens when you restart the BOINC client? If it doesn't reconnect to the running AP tasks, what happens to those tasks when they finish? Does BOINC successfully recognize the "finish" file when it's created?
ID: 1588881
Jeff Buck
Message 1588885 - Posted: 19 Oct 2014, 18:10:25 UTC - in response to Message 1588873.  

I'll try taking a look at the Windows logs on that machine, probably a bit later today. As I recall, when I dug into them once before for the same sort of BOINC crash, there was little to find other than a few messages indicating Windows Explorer had also had a failure of some sort and had been restarted at about the same time. It wasn't really possible to tell which came first, the BOINC crash or the Explorer problem, but they seemed to be related. In any event, while Explorer restarts cleanly, BOINC doesn't. When it happens in the middle of the night, I don't discover it until the next morning when I realize that the room where that machine is located is much quieter and much cooler than it should be! :^) No message box remaining on the screen, if there even was one to begin with.
ID: 1588885
HAL9000
Message 1588888 - Posted: 19 Oct 2014, 18:15:04 UTC - in response to Message 1588881.  

The image is a bit misleading. In that image the science apps are not actually associated with Process Explorer. If they were, they would be offset in the tree. AP7_r2692_x64_AVX_CPU is being displayed at the same level as the other root processes.
An issue with Process Explorer is that sometimes it shows the - instead of the + for the tree. Under procexp.exe there is procexp64.exe.

Once the task has finished being processed I would expect the science app to close. However, I didn't want to wait a few hours to verify that. I was more interested in creating a reproducible test scenario.
ID: 1588888
Jeff Buck
Message 1588903 - Posted: 19 Oct 2014, 18:42:25 UTC - in response to Message 1588868.  

Thanks for that explanation, Richard. When I was originally running my tests back in March, it appeared that the tasks that did successfully shut down may have been using the "old" heartbeat mechanism, as a message stating "No heartbeat from core client for 30 sec - exiting" would appear in the Stderr.

Also, in my last post in my original "Zombie" thread, I summarized my tests on various systems, as follows:
My original tests were on NVIDIA GPUs running stock apps under Win 8.1. Here are the results from testing on three of my other machines:

6980751 (Win XP, NVIDIA GPUs, Lunatics apps): AP GPU task continues to run w/o BOINC, MB GPU and MB CPU tasks shut down properly
6979886 (Win Vista, NVIDIA GPU, Lunatics apps): AP GPU task continues to run w/o BOINC, MB GPU and MB CPU tasks shut down properly
6912878 (Win 7, ATI GPU, stock apps): AP GPU and MB GPU task continue to run w/o BOINC, MB CPU tasks shut down properly

So it appears that AP GPU tasks fail to shut down under every configuration that I can test and stock MB GPU tasks fail to shut down when running on an ATI GPU (at least under Win 7).

As you can see, I also ran into the problem with stock MB GPU tasks running on an ATI GPU under Win 7. I don't know if that app has changed since then, but inasmuch as I stopped running that box several weeks ago, I can't retest the scenario.
ID: 1588903
Jeff Buck
Message 1589012 - Posted: 19 Oct 2014, 22:58:39 UTC - in response to Message 1588885.  

Okay, finally got around to looking at those Windows logs and, as I suspected, the only relevant entry I could find is this one:

Log Name:      Application
Source:        Microsoft-Windows-Winlogon
Date:          10/18/2014 2:51:45 AM
Event ID:      1002
Task Category: None
Level:         Information
Keywords:      Classic
User:          N/A
Computer:      T7400
Description:
The shell stopped unexpectedly and explorer.exe was restarted.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-Winlogon" Guid="{DBE9B383-7CF3-4331-91CC-A3CB16A3B538}" EventSourceName="Winlogon" />
    <EventID Qualifiers="16384">1002</EventID>
    <Version>0</Version>
    <Level>4</Level>
    <Task>0</Task>
    <Opcode>0</Opcode>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2014-10-18T09:51:45.000000000Z" />
    <EventRecordID>8943</EventRecordID>
    <Correlation />
    <Execution ProcessID="0" ThreadID="0" />
    <Channel>Application</Channel>
    <Computer>T7400</Computer>
    <Security />
  </System>
  <EventData>
    <Data>explorer.exe</Data>
  </EventData>
</Event>

It doesn't indicate why Explorer stopped "unexpectedly", just that it was restarted. That event occurs approximately 2 seconds after the last entries in the BOINC log prior to the apparent BOINC crash:
18-Oct-2014 02:51:43 [SETI@home] Computation for task 24se14aa.26295.17654.438086664195.12.91_0 finished
18-Oct-2014 02:51:43 [SETI@home] [cpu_sched] Preempting 04se14aa.7333.4157.438086664199.12.169_0 (removed from memory)
18-Oct-2014 02:51:43 [SETI@home] Starting task ap_21no10ab_B0_P0_00325_20141016_19978.wu_2
18-Oct-2014 02:51:43 [SETI@home] [cpu_sched] Starting task ap_21no10ab_B0_P0_00325_20141016_19978.wu_2 using astropulse_v7 version 705 (opencl_nvidia_100) in slot 4
18-Oct-2014 08:14:06 [---] Starting BOINC client version 7.2.42 for windows_x86_64

As you can see, the last entries shown are for the start of an AP task. This has been the case for every one of these BOINC crashes that I've encountered.
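
For anyone wanting to spot this kind of silence in their own event log, here is a small Python sketch that looks for long gaps between consecutive entries. It assumes every line starts with a timestamp in the same 'DD-Mon-YYYY HH:MM:SS' form as the excerpt above and that the month abbreviations are English; the function names and the 5-minute threshold are arbitrary choices for the example.

```python
from datetime import datetime, timedelta


def parse_timestamp(line):
    """Parse the leading 'DD-Mon-YYYY HH:MM:SS' stamp of an event log line."""
    return datetime.strptime(line[:20], "%d-%b-%Y %H:%M:%S")


def find_gaps(lines, threshold=timedelta(minutes=5)):
    """Return (last_line_before, first_line_after, gap) for each long silence."""
    gaps = []
    previous = None   # (timestamp, line) of the most recent entry seen
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        stamp = parse_timestamp(line)
        if previous is not None and stamp - previous[0] > threshold:
            gaps.append((previous[1], line, stamp - previous[0]))
        previous = (stamp, line)
    return gaps
```

Run against the five log lines above, it reports a single gap of 5:22:23, between the "Starting task" entry for the AP task at 02:51:43 and the client restart at 08:14:06.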
ID: 1589012
Jeff Buck
Message 1589029 - Posted: 19 Oct 2014, 23:57:22 UTC - in response to Message 1588903.  

Just a quick follow-up on "old" vs. "new" heartbeat mechanism. Back in March, I was still running BOINC 7.2.33 when the "No heartbeat..." message was produced. Currently, on the machine where BOINC crashed yesterday, I'm running BOINC 7.2.42 but got the same message for the MB GPU tasks that were running at the time of the crash, since the x41zc app hasn't changed. However, I notice that the stock MB CPU tasks such as this one included "02:51:48 (4180): BOINC client no longer exists - exiting" in the Stderr. That timestamp appears to be about 5 seconds after the time that I think BOINC crashed.
ID: 1589029
HAL9000
Message 1589039 - Posted: 20 Oct 2014, 0:50:27 UTC - in response to Message 1589012.  

There is an issue with science apps not stopping when they should. However, BOINC & most definitely Windows Explorer should not crash when tasks start.
If I were having crashes that coincided with tasks starting, I would look for some bit of hardware that is having issues. If this last happened 6 months ago when it was changing from cold to warm, and now it is occurring when it is going from warm to cold, I am thinking of something thermal expanding/contracting.
ID: 1589039
Jeff Buck
Message 1589056 - Posted: 20 Oct 2014, 1:27:20 UTC - in response to Message 1589039.  

I think the key words there are "should not", rather than "could not". On my daily driver running Vista, Windows Explorer is actually the most crash-prone application I have. Often, simply right-clicking on a folder in the tree will cause a lockup and crash. Other times, I have no idea what triggers it. In any event, in this case, it appears to be BOINC that crashes first, taking its parent process, explorer.exe, down with it. It would be helpful if the Windows logs provided more info, but they don't (or perhaps I just don't know where to find it).

As to weather, if only it were that easy to correlate the occurrences with something external. However, it's happened once each in December, February, March, and now October on my #1 cruncher. It also happened once on my #2 cruncher in March. And here on the central coast of California, ambient temperatures are more likely to fluctuate dramatically from day to day or from hour to hour, depending on the influence of the marine layer, than from season to season. I certainly can't rule anything out, but since the one consistent element is the start of an AP GPU task, I have to think that that's one piece of the puzzle. (Although I run both MB and AP tasks, I still run hundreds of AP tasks through those two boxes over time, and with the crashes being fairly rare, I'm sure there must be another element required, but at this point I haven't any other clues.)
ID: 1589056
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1589071 - Posted: 20 Oct 2014, 2:07:47 UTC - in response to Message 1589056.  

With the low occurrence of 5 times between 2 machines in a year, it is hard to troubleshoot. If you wanted to install some debug logging software, you could get very detailed crash logs.

Having logs that only state "whatever.exe has stopped unexpectedly" can drive you bonkers. My old i7-860 would sometimes shut down or restart without warning. The only log was "windows shutdown unexpectedly at time stamp". I dealt with that for years before finding out, just about a week ago, that it was a faulty chassis power switch, which would sometimes close when not pushed. I only found out that's what it was when it failed completely.
ID: 1589071
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1589073 - Posted: 20 Oct 2014, 2:28:27 UTC - in response to Message 1589071.  

Yep. Windows 8.1 seems to do a lot more logging than previous versions, but the logs still don't seem to tell me diddly squat! In any event, I don't think the low frequency of occurrence would justify the extra overhead that I would expect a debugging package to introduce. Actually, there's probably some log switch in BOINC that might generate an additional line or two after the "Starting task" message but, again, it's probably not worth the extra overhead without more frequent crashes. Anyway, it's the "Finish file present too long" and zombie task issues that I'm more interested in resolving at this point, rather than the BOINC crashes themselves, since I think there's enough solid evidence to act on there.
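
As a stopgap for the "finish file present too long" errors, the manual cleanup described earlier in the thread (deleting the leftover "finish" files before restarting BOINC) could be scripted roughly like this. This is only a sketch under assumptions: that the finish file is named `boinc_finish_called` and sits in the task's directory under `slots` in the BOINC data directory. The helper names are made up for illustration, and it should only be run while the BOINC client and any zombie science apps are stopped.

```python
from pathlib import Path

# Assumption: the science app marks completion with a file of this name in its slot directory.
FINISH_FILE_NAME = "boinc_finish_called"


def find_finish_files(data_dir):
    """Return leftover finish files under <data_dir>/slots/<n>/, sorted by path."""
    slots = Path(data_dir) / "slots"
    if not slots.is_dir():
        return []
    return sorted(slots.glob(f"*/{FINISH_FILE_NAME}"))


def remove_finish_files(data_dir, dry_run=True):
    """List leftover finish files and delete them unless dry_run is True."""
    found = find_finish_files(data_dir)
    for path in found:
        if not dry_run:
            path.unlink()
    return found
```

Calling `remove_finish_files(r"C:\ProgramData\BOINC")` (a data directory path assumed here, adjust it to your install) only lists what it finds; pass `dry_run=False` to actually delete. The dry-run default is deliberate, since removing a finish file from a task that finished normally is not something you want to do by accident.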
ID: 1589073