|
You need to look in the SETI app's stderr.txt file to see why it stopped. That's in one of the numbered folders in the slots folder, which is in the BOINC folder.
Usually you'll see the message "No heartbeat from core client - exiting". That means the science app (SETI) didn't hear from the BOINC client for 30 seconds. It checks this by way of a heartbeat: a message BOINC sends to each active application that essentially says "I'm alive and well". The idea is that a missing heartbeat indicates BOINC isn't running anymore, so the science apps should exit, but only after 30 seconds of silence.
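To make the idea concrete, here's a rough sketch of what an app-side heartbeat check could look like. This is not BOINC's actual code; the class and method names are made up for illustration, and only the 30-second timeout comes from the description above.

```python
import time

HEARTBEAT_TIMEOUT = 30.0  # seconds, per the description above


class HeartbeatWatchdog:
    """Hypothetical app-side watchdog (illustrative, not BOINC's real code)."""

    def __init__(self, timeout=HEARTBEAT_TIMEOUT, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock             # injectable so the logic can be tested
        self.last_seen = clock()       # treat startup as the first heartbeat

    def heartbeat_received(self):
        # Called whenever the client's "I'm alive and well" message arrives.
        self.last_seen = self.clock()

    def client_presumed_dead(self):
        # True once the client has been silent for longer than the timeout.
        return self.clock() - self.last_seen > self.timeout
```

The app would call client_presumed_dead() periodically in its main loop and exit when it returns True.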
If you look through all your log messages, you'll see that when one WU restarts, it's usually 30 seconds after another WU finished and a new WU started. That happens on multiprocessor systems where you run more than one WU at a time.
Looking at the check-in notes and program source shows some cleanup and startup problems were "dealt" with, but not necessarily fixed. I just tested it on a Windows machine and I think there's a problem cleaning up a completed WU, or starting a new one, or a combination of the two.
Specifically, if BOINC has a problem setting up the new WU, it "sleeps" for an interval longer than the heartbeat timeout period. It "sleeps" for 35 seconds, but the heartbeat timeout is only 30 seconds. The end result is that the SETI app working on the other WU thinks BOINC "disappeared" and exits, after putting a "no heartbeat" message into its stderr.txt file. BOINC then responds to that exit with the "Exited with zero status but no 'finished' file" message.
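The arithmetic is simple, but here's a toy simulation of that stall to show where it goes wrong. The 35 and 30 second figures are from above; the function name is made up, and I'm assuming the app gives up as soon as the silence exceeds the timeout (the exact boundary in the real app may differ).

```python
HEARTBEAT_TIMEOUT = 30  # seconds the science app waits for a heartbeat
CLIENT_STALL = 35       # seconds the client "sleeps" after a setup problem


def app_exits_during_stall(stall_seconds, timeout=HEARTBEAT_TIMEOUT):
    """Step through the stall one second at a time.

    Returns the second at which the app gives up, or None if the
    heartbeat would resume before the timeout is exceeded.
    """
    silence = 0
    for second in range(1, stall_seconds + 1):
        silence += 1  # no heartbeat is sent while the client sleeps
        if silence > timeout:
            return second
    return None


print(app_exits_during_stall(CLIENT_STALL))  # 31: the app exits mid-stall
print(app_exits_during_stall(25))            # None: a shorter stall is harmless
```

Any stall shorter than the timeout would be survivable, which is why the 35 versus 30 mismatch matters.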
One thing you can try: keep track of where each WU runs, that is, which one starts in slots/0 and which one starts in slots/1. Then see if you get the restart errors in only one slot. For example, when the WU in slots/1 finishes, the other WU in slots/0 "dies" and has to be restarted. You can check this after the WU gets restarted: just look for the slots/*/stderr.txt file that now has a "no heartbeat" message.
Maybe there's a pattern.
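If you'd rather not open each file by hand, a small script along these lines could do the checking. It's only a sketch: the function name is my own, you have to pass in your own BOINC folder, and it assumes the message text is exactly as quoted above.

```python
from pathlib import Path

NO_HEARTBEAT = "No heartbeat from core client"


def slots_with_no_heartbeat(boinc_dir):
    """Return the names of slot folders whose stderr.txt has the message."""
    hits = []
    for stderr_file in sorted(Path(boinc_dir).glob("slots/*/stderr.txt")):
        # errors="replace" keeps odd bytes in the log from aborting the scan
        text = stderr_file.read_text(errors="replace")
        if NO_HEARTBEAT in text:
            hits.append(stderr_file.parent.name)
    return hits
```

Run it after each restart, e.g. slots_with_no_heartbeat(r"C:\path\to\BOINC") with your real folder in place of that placeholder, and note which slot numbers come back.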
|