"Zombie" AP tasks - still alive when BOINC should have killed them

Jeff Buck
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1495261 - Posted: 25 Mar 2014, 22:18:40 UTC

In an earlier thread, 194 (0xc2) EXIT_ABORTED_BY_CLIENT - "finish file present too long", I had reported on some AP GPU tasks that continued to run even after BOINC had crashed. Last Friday afternoon I also found similar "Zombie" AP tasks that continued to run following a deliberate shutdown of the BOINC Manager and client.

Due to a complete hard drive failure on my T7400, host 7057115, the previous Sunday night, by Friday afternoon I had reloaded a replacement HD from the ground up, with OS (Win8.1), drivers, BOINC, and S@H. I was able to get the scheduler to resend all but 7 (VLARs) of the 140+ "lost" tasks for the host. Once they were all downloaded, I went about trying to reconfigure BOINC and S@H back to what I thought I remembered having on the old HD. As part of that process, I shut down and restarted BOINC several times, to pick up changes to the remote_hosts.cfg file, the gui_rpc_auth.cfg file, the firewall ports, etc.

Following what I think was the third BOINC shutdown, I happened to notice that the 3 GPUs were still under a partial load, and the CPUs were showing intermittent activity, as well. So I fired up Process Explorer and discovered 3 AP GPU tasks chugging along without benefit of BOINC, one on each GPU:



I then restarted BOINC, thinking that it would pick up the "undead" tasks, but Process Explorer still showed them running "outside the box". Notice, also, that there are 9 MB GPU tasks running, 3 per GPU. This would be normal if there weren't any AP GPU tasks running, as I try to run 3 MB tasks or 2 MB + 1 AP per GPU. Now it appeared that each GPU had 3 MB tasks running under BOINC's control along with 1 AP task with no supervision at all:



This time I tried exiting BOINC Manager without stopping the tasks running under the BOINC client and, as expected at this point, found that only boincmgr.exe had disappeared from Process Explorer:



When I restarted the BOINC Manager this time, I noticed something interesting in the task list:



Initially, there were 3 AP tasks and 6 MB tasks shown as "Running", with 3 additional MB tasks "Waiting to run". As I was puzzling over this, all 3 AP tasks changed to "Waiting to run" and the 3 MB tasks changed to "Running", as shown above. However, judging by Process Explorer, all 12 tasks appeared to be consuming resources. Checking the stderr.txt files for the AP tasks, I found 7 or 8 iterations of:

Running on device number: 0
DATA_CHUNK_UNROLL at default:2
DATA_CHUNK_UNROLL at default:2
16:32:27 (1464): Can't acquire lockfile (32) - waiting 35s
16:33:02 (1464): Can't acquire lockfile (32) - exiting
16:33:02 (1464): Error: The process cannot access the file because it is being used by another process. (0x20)


It appeared that BOINC might have been trying to run another instance of each AP task under its control, but couldn't because the tasks running outside its control had a grip on the lockfile.
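The lockfile behaviour described above can be sketched roughly as follows. This is only an illustration, not BOINC's actual code: the function names and the poll interval are made up, and exclusive file creation is used as a portable stand-in for the Windows sharing violation (error 32) seen in the stderr output. Only the 35-second wait is taken from the log.

```python
import os
import time


def acquire_lockfile(path, wait_seconds=35.0, poll_seconds=1.0, sleep=time.sleep):
    """Try to create `path` exclusively, retrying until wait_seconds is used up.

    Returns an open file descriptor on success, or None if some other
    process (for example a task running outside the client's control)
    still holds the lock when the wait runs out.
    """
    waited = 0.0
    while True:
        try:
            # O_EXCL makes the call fail if the lockfile already exists.
            return os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            if waited >= wait_seconds:
                return None  # corresponds to "Can't acquire lockfile - exiting"
            sleep(poll_seconds)
            waited += poll_seconds


def release_lockfile(fd, path):
    """Close and delete the lockfile so another instance can take it."""
    os.close(fd)
    os.remove(path)
```

As long as the first holder never releases the lock, every later attempt waits out the timeout and gives up, which matches the repeated "waiting 35s" then "exiting" pairs in the stderr files.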

At this point, I figured the safest thing to do, before the AP tasks ended with possible "finish file" errors, was to just reboot the machine. This finally succeeded in bringing the AP tasks back under BOINC's control:



All 3 AP tasks eventually completed normally, were quickly validated and have already vanished from my task detail pages (although I can still provide complete Stderr output if anybody wants it).

Later on Friday evening, I decided to see if I could recreate the "Zombie" scenario with some additional AP tasks that were by then running on the GPUs. I tried various sequences of shutting down BOINC Manager, sometimes specifying that the running tasks should stop also, and sometimes letting the client continue to run. In no instance were any AP tasks left alive when I told BOINC to stop the running tasks, so that left me rather stumped.

Sunday morning it occurred to me that there was one scenario that I had experienced during my first round of BOINC-on BOINC-off configuration file changes, prior to discovery of the free-range AP tasks. When I first changed the password in the gui_rpc_auth.cfg file, I had done it with only BOINC Manager shut down, not the BOINC client. When I then tried to restart the BOINC Manager, I got a "Connection Error" message stating, "Authorization failed connecting to running client. Make sure you start this program in the same directory as the client." Well, the directories really were the same, but apparently now the password wasn't. So then I shut the BOINC Manager down once again, this time telling it to stop the running tasks, as well. As far as I knew, it had done that successfully, despite the fact that it said it couldn't connect to the client, because the next restart of BOINC Manager did not experience any problems, connection or otherwise.

It's possible, then, that this was the scenario that left the AP tasks running when BOINC thought that it had killed them. Unfortunately, by the time I came up with this theory on Sunday, AP GPU tasks were a distant memory and, unless I happen to snag an AP resend, it looks like it might be almost a week before I can actually test it out. I did recreate the mismatched password scenario on Sunday evening, but the only tasks running at the time were MB GPU, MB CPU and AP CPU. All shut down nicely when told politely to do so.

Anyway, for now this is just a theory of how these Zombie AP tasks might occur, but, unless somebody else still has an AP task running on an NVIDIA GPU and wants to take a crack at proving/disproving the theory, I'll just have to wait for the next round of APs.
ralph

Joined: 19 Feb 12
Posts: 19
Credit: 31,993,767
RAC: 9
United States
Message 1495639 - Posted: 26 Mar 2014, 17:26:59 UTC - in response to Message 1495261.  
Last modified: 26 Mar 2014, 17:32:11 UTC

I read this post yesterday with some interest, as I have had some problems this past week when shutting down BOINC before a needed reboot. BOINC Manager was not reloading properly: it was loading without projects. I could eventually get BOINC loaded properly, but only with some extra effort.

Whether I am having a similar problem today, I'm not sure, but this morning I shut down BOINC and then rebooted the system.

During the reboot I got the following message (this was repeatable on additional reboots):


* setting up boinc core client: boinc

* setting up scheduling for boinc core client and children:

* speech-dispatcher disabled; edit /etc/default/speech-dispatcher saned disabled; /etc/default/saned restoring resolved state


After that message I can hear the boot continuing, but without any graphical display on the monitor.

I have never seen anything like this before, and I am wondering whether it is a result of locked files persisting during or after the BOINC shutdown and reboot.

I cannot get the computer past this point, but I can boot to another OS on this disk and from other bootable disks, all with proper graphics. I'm on my laptop at this point, contemplating my next move. Computers really are a love/hate relationship!

Ideas anybody?
Sirius B
Volunteer tester
Joined: 26 Dec 00
Posts: 24879
Credit: 3,081,182
RAC: 7
Ireland
Message 1495648 - Posted: 26 Mar 2014, 17:59:00 UTC - in response to Message 1495639.  

Remove the hard drive and attach it to another system as a secondary drive. Go to the BOINC folder, copy all the WUs over to the other system, then delete them from the hard drive.

Reattach the drive and boot up. If the same problem occurs, it could be BOINC; if not, it was the WUs.

To confirm that, copy the WUs back. If the problem occurs again, just repeat the procedure and delete the WUs.
Jeff Buck
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1496418 - Posted: 28 Mar 2014, 2:07:27 UTC - in response to Message 1495261.  

Sunday morning it occurred to me that there was one scenario that I had experienced during my first round of BOINC-on BOINC-off configuration file changes, prior to discovery of the free-range AP tasks. When I first changed the password in the gui_rpc_auth.cfg file, I had done it with only BOINC Manager shut down, not the BOINC client. When I then tried to restart the BOINC Manager, I got a "Connection Error" message stating, "Authorization failed connecting to running client. Make sure you start this program in the same directory as the client." Well, the directories really were the same, but apparently now the password wasn't. So then I shut the BOINC Manager down once again, this time telling it to stop the running tasks, as well. As far as I knew, it had done that successfully, despite the fact that it said it couldn't connect to the client, because the next restart of BOINC Manager did not experience any problems, connection or otherwise.

Well, I didn't have to wait as long as I feared to get an AP GPU task to test my theory with. And my initial test worked just as I thought it might. With BOINC Manager running but unable to connect to the client, which was controlling 7 MB CPU tasks, 8 MB GPU tasks and 1 AP GPU task, I exited BOINC Manager selecting the option to stop the running tasks. Judging by Process Explorer, it took at least 10 seconds for all the tasks to terminate, except for the one AP task, which continued to run all by itself:



When I restarted BOINC Manager, all the previously running tasks restarted, including a "phantom" AP task, while the original AP task continued to run standalone.



Note that the AP task shown as running under the BOINC Manager and client is not using any CPU resources, while the Zombie task is using 6.85%. The phantom task was only shown for about 35 seconds, until BOINC gave up trying to gain control of the lockfile. It vanished from Process Explorer and its status in BOINC Manager changed to "Waiting to run", at which point another MB task kicked off. Then, each time one of the MB tasks finished, BOINC tried again to run the AP task but kept running up against the lockfile issue. I watched that cycle repeat 3 times before shutting everything down and rebooting.

This would seem to indicate that AP GPU tasks may not recognize when the BOINC client shuts down, if the client wasn't running under the control of the BOINC Manager. Since the earlier reported occurrences of this phenomenon resulted from BOINC crashes, I would guess that there's something in the way the dominoes fall in one of those crashes that results in a similar scenario.
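One way a task could notice that its client has gone away is a heartbeat timeout. The sketch below only illustrates that general idea; it assumes the client sends some periodic signal to each task, and the class name, function names, and the 30-second timeout are all hypothetical rather than taken from the BOINC or AstroPulse code.

```python
import time


class HeartbeatWatchdog:
    """Tracks when the controlling client was last heard from."""

    def __init__(self, timeout_seconds=30.0, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_beat = clock()

    def beat(self):
        """Record that a heartbeat just arrived from the client."""
        self._last_beat = self._clock()

    def client_is_gone(self):
        """True once no heartbeat has arrived within the timeout."""
        return self._clock() - self._last_beat > self._timeout


def run_task(watchdog, do_work_chunk, poll_heartbeat, max_chunks):
    """Process work chunks until finished or the client goes silent.

    `poll_heartbeat` is a hypothetical callable returning True when a
    heartbeat has been received since the last poll.
    Returns the number of chunks completed.
    """
    done = 0
    while done < max_chunks:
        if poll_heartbeat():
            watchdog.beat()
        if watchdog.client_is_gone():
            break  # stop here instead of running on without supervision
        do_work_chunk()
        done += 1
    return done
```

If a task checked for the client only rarely, or not at all while busy on the GPU, it would keep running after the client exited, which is at least consistent with what Process Explorer showed.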

In my test, obviously, I used the BOINC Manager to shut down the client, even though they weren't connected (a bit disturbing in itself that this is possible, I think), but I would guess that manually shutting down the BOINC client even without BOINC Manager might have the same effect. However, I don't really know what the "cleanest" way would be to do that. (Task Manager or Process Explorer could kill it, but I'm afraid there might be collateral damage if I tried that.) I should probably leave that to one of the more experienced testers to try.
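For anyone who wants to check for leftover tasks without eyeballing Process Explorer, the test boils down to "is this task's parent a currently running client?". Here is a minimal sketch over a process snapshot. The Proc record and the task executable names are invented for illustration, and getting the real process list is left out. Note that Windows can reuse PIDs, so a parent-PID match is a hint rather than proof.

```python
from collections import namedtuple

# Hypothetical snapshot record: process ID, parent process ID, executable name.
Proc = namedtuple("Proc", "pid ppid name")


def find_orphaned_tasks(procs, is_client, is_task):
    """Return task processes whose parent is not a running client process."""
    client_pids = {p.pid for p in procs if is_client(p)}
    return [p for p in procs if is_task(p) and p.ppid not in client_pids]


# Example snapshot: the client was restarted (PID 100), one MB task belongs
# to it, and one AP task still points at the old, dead client (PID 50).
snapshot = [
    Proc(100, 1, "boinc.exe"),
    Proc(200, 100, "mb_task.exe"),   # made-up task name
    Proc(300, 50, "ap_task.exe"),    # made-up task name
]

orphans = find_orphaned_tasks(
    snapshot,
    is_client=lambda p: p.name == "boinc.exe",
    is_task=lambda p: p.name.endswith("_task.exe"),
)
print(orphans)
```

In this snapshot only the AP task with parent PID 50 is reported, since no running client has that PID.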

By the way, the AP task in my test, 3460974473 is now completed and awaiting validation, if anyone wants to check out the Stderr.
Jeff Buck
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1497217 - Posted: 29 Mar 2014, 19:07:21 UTC

Despite the lack of interest in this problem, I figured I still should go ahead and test it out in my other configurations while the APs were available. I wanted to find out if the OS and GPU type had any bearing on the results.

My original tests were on NVIDIA GPUs running stock apps under Win 8.1. Here are the results from testing on three of my other machines:

6980751 (Win XP, NVIDIA GPUs, Lunatics apps): AP GPU task continues to run w/o BOINC, MB GPU and MB CPU tasks shut down properly
6979886 (Win Vista, NVIDIA GPU, Lunatics apps): AP GPU task continues to run w/o BOINC, MB GPU and MB CPU tasks shut down properly
6912878 (Win 7, ATI GPU, stock apps): AP GPU and MB GPU tasks continue to run w/o BOINC, MB CPU tasks shut down properly

So it appears that AP GPU tasks fail to shut down under every configuration that I can test, and stock MB GPU tasks also fail to shut down when running on an ATI GPU (at least under Win 7).


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.