Joined: 24 May 99
2004-12-21 22:59:17 [SETI@home] Result 21ap04aa.29482.881.723592.58_1 exited with zero status but no 'finished' file
2004-12-21 22:59:17 [SETI@home] If this happens repeatedly you may need to reset the project.
2004-12-21 22:59:18 [SETI@home] Restarting result 21ap04aa.29482.881.723592.58_1 using setiathome version 4.02
Joined: 19 May 99
This problem has been well known for months. On my dual-processor Macs almost every fourth WU ends with this message (always on the same processor!). Nobody seems to be interested in this topic, although it costs a lot of credit. One example here (my machine is the one that claims 16.46): http://setiweb.ssl.berkeley.edu/workunit.php?wuid=9083091.
This WU was at 12,000 seconds when the WU on the other processor finished. I received the message "...exited with zero status but no 'finished' file" and the WU started counting from 0 seconds again. Percentage and scientific output are not affected by this, just the total time. So, instead of 19,777 seconds, my computer reported only 7,777 seconds, and therefore the claimed credit is less than half of what it should be.
According to the credit rules, the second-lowest claimed credit is granted to all users who returned a valid result. In this case all users will lose approx. 25 credit points due to this error.
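For anyone who wants to check the numbers, here is a rough sketch of the arithmetic. It assumes that claimed credit scales linearly with reported CPU time, which is my assumption and not the official credit formula. The function name is made up for illustration.

```python
# Figures taken from the post above.
SECONDS_BEFORE_RESTART = 12_000  # CPU time lost when the counter reset to 0
SECONDS_REPORTED = 7_777         # CPU time that was actually reported
CLAIMED_CREDIT = 16.46           # credit claimed with the shortened time


def expected_claim(claimed, reported_s, lost_s):
    """Scale a claim up to the full CPU time.

    Assumption: claimed credit is proportional to reported CPU time.
    """
    total_s = reported_s + lost_s
    return claimed * total_s / reported_s


full_claim = expected_claim(CLAIMED_CREDIT, SECONDS_REPORTED, SECONDS_BEFORE_RESTART)
shortfall = full_claim - CLAIMED_CREDIT
reported_fraction = SECONDS_REPORTED / (SECONDS_REPORTED + SECONDS_BEFORE_RESTART)

print(f"reported fraction of CPU time: {reported_fraction:.2f}")
print(f"claim with full CPU time:      {full_claim:.2f}")
print(f"shortfall:                     {shortfall:.2f}")
```

Under that assumption the shortfall comes out a little above 25, which is in line with the figure quoted above.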
Joined: 16 May 99
You need to look in the SETI app's stderr.txt file to see why it stopped. That's in one of the numbered folders in the slots folder, which is in the BOINC folder.
Usually you'll see the message "No heartbeat from core client - exiting". That means the science app (SETI) didn't hear from the BOINC client for 30 seconds. It detects this by way of a heartbeat: a message BOINC sends to each active application that essentially says "I'm alive and well". The idea is that a missing heartbeat indicates BOINC isn't running anymore, so the science apps should exit once 30 seconds have passed without one.
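To make the idea concrete, here is a minimal sketch of a heartbeat check from the science app's side. It is not the actual BOINC code; the class and method names are made up, and only the 30-second timeout comes from the description above.

```python
import time

HEARTBEAT_TIMEOUT_S = 30.0  # timeout quoted in this post


class HeartbeatMonitor:
    """Hypothetical helper: tracks when the client was last heard from."""

    def __init__(self, timeout_s=HEARTBEAT_TIMEOUT_S, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock          # injectable so the logic can be tested
        self._last_beat = clock()

    def beat(self):
        """Record that a heartbeat message just arrived."""
        self._last_beat = self._clock()

    def client_seems_dead(self):
        """True once more than timeout_s has passed since the last heartbeat."""
        return self._clock() - self._last_beat > self._timeout_s
```

An app built this way would call beat() whenever a heartbeat arrives and exit (after writing its "no heartbeat" message) as soon as client_seems_dead() returns True.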
If you look through all your log messages, you'll see that when one WU restarts, it's usually 30 seconds after another WU finished and a new WU started. That's with multi-processor systems where you run more than one WU at a time.
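If you'd rather not eyeball the log, something like the following sketch could pick out restarts that follow a finished WU. The timestamp format and the restart text are taken from the log lines quoted at the top of this thread. The wording of the "finished" line is an assumption on my part, so change FINISHED_MARKER to whatever your log actually prints. The function name and the 40-second window are also just illustrative.

```python
from datetime import datetime

RESTART_MARKER = "exited with zero status but no 'finished' file"
FINISHED_MARKER = "Computation for result"  # ASSUMED wording; adjust to your log


def parse_time(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS' timestamp of a log line."""
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")


def restarts_after_finish(lines, window_s=40):
    """Return (finish_time, restart_time, gap_seconds) for each restart
    that happens within window_s seconds after the latest 'finished' line."""
    pairs = []
    last_finish = None
    for line in lines:
        try:
            stamp = parse_time(line)
        except ValueError:
            continue  # not a timestamped log line
        if FINISHED_MARKER in line:
            last_finish = stamp
        elif RESTART_MARKER in line and last_finish is not None:
            gap = (stamp - last_finish).total_seconds()
            if 0 <= gap <= window_s:
                pairs.append((last_finish, stamp, gap))
    return pairs
```

Feed it the lines of your saved message log. If most of the gaps cluster around 30 seconds, that would fit the heartbeat explanation.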
Looking at the check-in notes and program source shows that some cleanup and startup problems were "dealt" with, but not necessarily fixed. I just tested it on a Windows machine and I think there's a problem cleaning up a completed WU, or starting a new one, or a combination of the two.
Specifically, if BOINC has a problem setting up the new WU, it "sleeps" for an interval longer than the heartbeat timeout period. It "sleeps" for 35 seconds, but the heartbeat timeout is only 30 seconds. The end result is that the SETI app working on the other WU thinks BOINC "disappeared" and exits, after putting a "no heartbeat" message into the stderr.txt file. BOINC then responds to that with the "exited with zero status but no 'finished' file" message.
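Here is a toy simulation of that timing, just to show why 35 versus 30 matters. The 30- and 35-second figures are from above. Everything else is assumed for illustration, in particular that the client sends one heartbeat per second when it is not stalled, and the function name.

```python
HEARTBEAT_TIMEOUT_S = 30  # heartbeat timeout, per this post
CLIENT_SLEEP_S = 35       # how long the client "sleeps", per this post


def app_exit_time(stall_start_s, stall_length_s,
                  timeout_s=HEARTBEAT_TIMEOUT_S, horizon_s=600):
    """Return the second at which the app gives up, or None if it survives.

    Model (an assumption): the client sends one heartbeat per second,
    except during the stall, when it sends none.
    """
    last_beat = 0
    for now in range(1, horizon_s + 1):
        stalled = stall_start_s <= now < stall_start_s + stall_length_s
        if not stalled:
            last_beat = now            # heartbeat received this second
        elif now - last_beat > timeout_s:
            return now                 # app decides the client is gone
    return None


print(app_exit_time(100, CLIENT_SLEEP_S))  # stall longer than the timeout
print(app_exit_time(100, 25))              # stall shorter than the timeout
```

In this model any stall longer than the timeout kills the other WU's app, while a shorter one goes unnoticed.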
One thing you can try: keep track of where each WU runs, that is, which one starts in slots/0 and which one starts in slots/1. Then see if you get the restart errors only in one slot. For example, the WU in slots/1 finishes, and the other WU in slots/0 "dies" and has to be restarted. You can check this after the WU gets restarted: just look for the slots/*/stderr.txt file that now has a "no heartbeat" message.
Maybe there's a pattern.
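A small script can do the slot check for you. This is only a sketch: it assumes the slots/N/stderr.txt layout described above and searches for the "No heartbeat from core client" text. The function name is made up, and you need to pass in the path of your own BOINC folder.

```python
from pathlib import Path

NO_HEARTBEAT = "No heartbeat from core client"


def slots_with_no_heartbeat(boinc_dir):
    """Map slot folder name -> number of 'no heartbeat' messages found
    in that slot's stderr.txt. Slots without the message are omitted."""
    hits = {}
    for stderr_file in sorted(Path(boinc_dir).glob("slots/*/stderr.txt")):
        text = stderr_file.read_text(errors="replace")
        count = text.count(NO_HEARTBEAT)
        if count:
            hits[stderr_file.parent.name] = count
    return hits
```

Run it after each restart and note which slot shows up. If it is always the same one, that matches the "always the same processor" observation.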