194 (0xc2) EXIT_ABORTED_BY_CLIENT - "finish file present too long"

Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1459246 - Posted: 31 Dec 2013, 20:00:01 UTC

Yesterday, for reasons I haven't been able to identify, BOINC apparently crashed on one of my machines (7057115). When I discovered it, about 8 hours later, I simply restarted BOINC. All of the tasks that had been running at the time of the crash appeared to restart fine, but 22 seconds later, the 2 restarted AP tasks (one on each GPU) both failed, with 194 (0xc2) EXIT_ABORTED_BY_CLIENT. The STDERR for both shows:

<message>
finish file present too long
</message>

As luck would have it, one AP restarted at 98.20% and the other at 97.30%. The STDERR for both appears to have been basically complete before the BOINC crash, and they appear to have restarted in their termination phase. This is actually the second time (that I'm aware of) that I've gotten the "194" error under similar circumstances. The previous occurrence was about 6 weeks ago.

I only found one other reference to this sort of problem, in Message 1416127. Jason mentioned trying to document some of these instances, so I guess these are a couple more that can add to the list.

It seems unfortunate that a task can be trashed by the client for a timing issue, when it has basically completed all its useful work. Although, in this case, the situation was caused by a BOINC crash, it seems as though it could also occur at any time that BOINC is shut down and restarted. Not everybody runs 24/7. For me, 5 of my 7 machines shut down every weekday at noon and don't restart until 6 P.M. (to avoid peak period electric rates). I suspect it's only a matter of time before this "bug" hits one or more of them on a restart.
ID: 1459246
Profile skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 1459305 - Posted: 31 Dec 2013, 22:22:54 UTC

FFA thread block override value:12288
FFA thread fetchblock override value:4096


That might be the problem.

FFA thread block override value:12288
FFA thread fetchblock override value:6144

Those are my settings on my R9 290X, which is a much bigger card than yours.


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope
ID: 1459305
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1459328 - Posted: 31 Dec 2013, 23:10:53 UTC - in response to Message 1459305.  

FFA thread block override value:12288
FFA thread fetchblock override value:4096


That might be the problem.

FFA thread block override value:12288
FFA thread fetchblock override value:6144

those are my settings on my R9 290X which is a much bigger card than yours.

Well, I've been running with those settings since early-September and successfully completed nearly 1,000 AP GPU tasks, with only 6 errors (2 of which were my fault and 4 of which were these "194" errors). I'll admit I don't really understand what those settings are doing (the documentation is rather weak), but I think the problem is actually what was identified by Joe Segur in the earlier message I referenced, that:

One of the last things an app does before exiting is write an empty "finish file", and that error indicates the app didn't exit within ten seconds after writing that file. I think the error was likely caused by shutting BOINC down after the file was written but before the app had gotten to its usual exit

That would seem to mean that there's nothing we can do at the host end to avoid getting nailed occasionally.
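
For anyone trying to picture the timing rule Joe Segur describes, here is a minimal Python sketch. It is not the BOINC client's code: the function names and the use of the file's modification time are assumptions made for illustration, the 10-second limit comes from the quoted explanation, and the file name `boinc_finish_called` is the one reported in the slot directories later in this thread.

```python
import os
import time

FINISH_FILE = "boinc_finish_called"  # file name reported in the slot directories
FINISH_TIMEOUT_SECS = 10.0           # "ten seconds", per the quoted explanation


def finish_file_age(slot_dir, now=None):
    """Return the finish file's age in seconds, or None if it is absent."""
    path = os.path.join(slot_dir, FINISH_FILE)
    if not os.path.exists(path):
        return None
    if now is None:
        now = time.time()
    # Assumption: age is measured from the file's modification time.
    return now - os.path.getmtime(path)


def should_abort(slot_dir, app_still_running, now=None):
    """Hypothetical rule: abort if the app outlives its finish file by > 10 s."""
    age = finish_file_age(slot_dir, now)
    if age is None or not app_still_running:
        return False
    return age > FINISH_TIMEOUT_SECS
```

Under a rule like this, a finish file left behind by a crash hours earlier is far past the limit the moment the task is restarted, which would fit the failures seen about 22 seconds after restart.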
ID: 1459328
Profile Fred E.
Volunteer tester

Joined: 22 Jul 99
Posts: 768
Credit: 24,140,697
RAC: 0
United States
Message 1459332 - Posted: 31 Dec 2013, 23:12:12 UTC

I'm not sure about those settings as the cause. I had the same thing happen yesterday. BOINC froze up, I rebooted, BOINC started normally but errored out an Astropulse GPU job after 15-20 seconds. I'm running a different app (r1843) and have different settings for my 670:
FFA thread block override value:6144
FFA thread fetchblock override value:1536

My stderr has very little in it:

http://setiathome.berkeley.edu/result.php?resultid=3306975364
Another Fred
Support SETI@home when you search the Web with GoodSearch or shop online with GoodShop.
ID: 1459332
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1459340 - Posted: 31 Dec 2013, 23:25:43 UTC - in response to Message 1459332.  

I'm not sure about those settings as the cause. I had the same thing happen yesterday. BOINC froze up, I rebooted, BOINC started normally but errored out an Astropulse GPU job after 15-20 seconds. I'm running a different app (r1843) and have different settings for my 670:
FFA thread block override value:6144
FFA thread fetchblock override value:1536

My stderr has very little in it:

http://setiathome.berkeley.edu/result.php?resultid=3306975364

I'd almost bet that your empty STDERR might indicate that your task restarted at an even higher completion rate than my 2 tasks. The reason I say that is that your Run time and CPU time are both 0.00. With my tasks, the one that restarted at 97.30% showed fairly normal Run time, but the one that restarted at 98.20% only showed a Run time of 21.33, the time from the restart. That would indicate to me that the program's termination housekeeping had cleared those timers in the task that was restarted later. Yours may have had everything totally wiped out.
ID: 1459340
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1474737 - Posted: 9 Feb 2014, 17:21:14 UTC

Got up this morning and found that BOINC had crashed on my main cruncher, 7057115, so I decided to do some research before restarting BOINC. Here's what I found (all times are local, U.S. Pacific Standard Time):

The last entry in the stdoutdae.txt is:
09-Feb-2014 01:12:42 [SETI@home] Starting task ap_10ap13aa_B5_P1_00191_20140208_30452.wu_1 using astropulse_v6 version 604 (opencl_nvidia_100) in slot 7

Along with that AP task, there were 2 other AP tasks running, each on a different GPU. In checking the active slot directories, the ones for the 3 AP tasks all have a boinc_finish_called file present.

Here are the last several lines from the stderr.txt file for each of the 3 AP slot directories:

SLOT 0
Found 30 single pulses and 30 repeating pulses, exiting.
  percent blanked: 3.08
class T_remove_radar:	total=2.89e+009,	N=1,	<>=2.89e+009,	min=2.89e+009,	max=2.89e+009
class T_main_loop_L1:	total=4.65e+012,	N=83,	<>=5.61e+010,	min=4.32e+010,	max=7.62e+010
 class T_FFT_forward:	total=2.13e+010,	N=137113,	<>=1.55e+005,	min=1.24e+005,	max=7.76e+006
 class T_remove_radar_randomize:	total=3.81e+011,	N=1369126,	<>=2.78e+005,	min=6.80e+002,	max=1.15e+007
 class T_build_chirp_table:	total=0.00e+000,	N=0,	<>=0.00e+000,	min=1.84e+019,	max=0.00e+000
 class T_DataWrite:	total=5.27e+008,	N=5352,	<>=9.84e+004,	min=3.76e+004,	max=1.10e+006
  class T_DataWrite_ns:	total=0,	N=0,	<>=0,	min=0	max=0
 class T_oclReadBuf:	total=5.21e+006,	N=137113,	<>=3.70e+001,	min=3.20e+001,	max=8.56e+002
   class T_ChirpWrite:	total=0.00e+000,	N=0,	<>=0.00e+000,	min=1.84e+019,	max=0.00e+000
    class T_ChirpWrite_ns:	total=0,	N=0,	<>=0,	min=0	max=0
 class T_dechirp:	total=3.69e+010,	N=137113,	<>=2.69e+005,	min=2.42e+005,	max=6.79e+006
  class Dechirp_ns:	total=0,	N=0,	<>=0,	min=0	max=0
  class Half_ns:	total=0,	N=0,	<>=0,	min=0	max=0
 class T_PC_single_pulse_kernel_FFA_update:	total=3.18e+012,	N=137113,	<>=2.32e+007,	min=2.24e+007,	max=4.08e+007
  class PC_ns:	total=0,	N=0,	<>=0,	min=0	max=0
class T_oclReadBuf:	total=5.21e+006,	N=137113,	<>=3.70e+001,	min=3.20e+001,	max=8.56e+002
class T_oclWriteBuf:	total=5.29e+008,	N=5352,	<>=9.89e+004,	min=3.77e+004,	max=1.10e+006
  class T_FFT_inverse:	total=2.21e+010,	N=137113,	<>=1.61e+005,	min=1.40e+005,	max=6.66e+006
 class T_ffa:	total=9.85e+011,	N=662,	<>=1.49e+009,	min=4.88e+008,	max=8.76e+009
class T_GPU_buffer_read_backs:	total=50,	N=50,	<>=1,	min=1	max=1
USE_OPENCL	OPENCL_WRITE	USE_INCREASED_PRECISION	SMALL_CHIRP_TABLE	COMBINED_DECHIRP_KERNEL	
rev 1316
01:30:25 (3732): called boinc_finish

SLOT 6
    single pulses: 4
repetitive pulses: 4
  percent blanked: 6.35
class T_remove_radar:	total=2.87e+009,	N=1,	<>=2.87e+009,	min=2.87e+009,	max=2.87e+009
class T_main_loop_L1:	total=6.79e+012,	N=111,	<>=6.12e+010,	min=5.57e+010,	max=8.14e+010
 class T_FFT_forward:	total=3.20e+010,	N=182040,	<>=1.76e+005,	min=1.25e+005,	max=6.22e+007
 class T_remove_radar_randomize:	total=1.05e+012,	N=1817736,	<>=5.75e+005,	min=6.80e+002,	max=1.72e+007
 class T_build_chirp_table:	total=0.00e+000,	N=0,	<>=0.00e+000,	min=1.84e+019,	max=0.00e+000
 class T_DataWrite:	total=1.65e+009,	N=13320,	<>=1.24e+005,	min=3.76e+004,	max=1.75e+006
  class T_DataWrite_ns:	total=0,	N=0,	<>=0,	min=0	max=0
 class T_oclReadBuf:	total=7.93e+006,	N=182040,	<>=4.30e+001,	min=3.20e+001,	max=2.74e+005
   class T_ChirpWrite:	total=0.00e+000,	N=0,	<>=0.00e+000,	min=1.84e+019,	max=0.00e+000
    class T_ChirpWrite_ns:	total=0,	N=0,	<>=0,	min=0	max=0
 class T_dechirp:	total=5.06e+010,	N=182040,	<>=2.78e+005,	min=2.41e+005,	max=1.90e+006
  class Dechirp_ns:	total=0,	N=0,	<>=0,	min=0	max=0
  class Half_ns:	total=0,	N=0,	<>=0,	min=0	max=0
 class T_PC_single_pulse_kernel_FFA_update:	total=3.19e+012,	N=182040,	<>=1.75e+007,	min=1.61e+007,	max=1.10e+008
  class PC_ns:	total=0,	N=0,	<>=0,	min=0	max=0
class T_oclReadBuf:	total=7.93e+006,	N=182040,	<>=4.30e+001,	min=3.20e+001,	max=2.74e+005
class T_oclWriteBuf:	total=1.66e+009,	N=13320,	<>=1.24e+005,	min=3.77e+004,	max=1.75e+006
  class T_FFT_inverse:	total=3.01e+010,	N=182040,	<>=1.66e+005,	min=1.40e+005,	max=1.05e+006
 class T_ffa:	total=2.32e+012,	N=1998,	<>=1.16e+009,	min=4.03e+008,	max=1.17e+010
class T_GPU_buffer_read_backs:	total=11,	N=11,	<>=1,	min=1	max=1
USE_OPENCL	OPENCL_WRITE	USE_INCREASED_PRECISION	SMALL_CHIRP_TABLE	COMBINED_DECHIRP_KERNEL	
rev 1316
01:36:19 (1904): called boinc_finish

SLOT 7
    single pulses: 20
repetitive pulses: 30
  percent blanked: 0.00
class T_remove_radar:	total=2.91e+009,	N=1,	<>=2.91e+009,	min=2.91e+009,	max=2.91e+009
class T_main_loop_L1:	total=4.29e+012,	N=111,	<>=3.86e+010,	min=3.82e+010,	max=6.89e+010
 class T_FFT_forward:	total=2.44e+010,	N=182040,	<>=1.34e+005,	min=1.24e+005,	max=8.15e+006
 class T_remove_radar_randomize:	total=1.68e+009,	N=1817736,	<>=9.25e+002,	min=6.80e+002,	max=1.04e+006
 class T_build_chirp_table:	total=0.00e+000,	N=0,	<>=0.00e+000,	min=1.84e+019,	max=0.00e+000
 class T_DataWrite:	total=0.00e+000,	N=0,	<>=0.00e+000,	min=1.84e+019,	max=0.00e+000
  class T_DataWrite_ns:	total=0,	N=0,	<>=0,	min=0	max=0
 class T_oclReadBuf:	total=6.85e+006,	N=182040,	<>=3.70e+001,	min=3.20e+001,	max=1.68e+004
   class T_ChirpWrite:	total=0.00e+000,	N=0,	<>=0.00e+000,	min=1.84e+019,	max=0.00e+000
    class T_ChirpWrite_ns:	total=0,	N=0,	<>=0,	min=0	max=0
 class T_dechirp:	total=4.60e+010,	N=182040,	<>=2.53e+005,	min=2.44e+005,	max=2.54e+006
  class Dechirp_ns:	total=0,	N=0,	<>=0,	min=0	max=0
  class Half_ns:	total=0,	N=0,	<>=0,	min=0	max=0
 class T_PC_single_pulse_kernel_FFA_update:	total=4.16e+012,	N=182040,	<>=2.28e+007,	min=2.25e+007,	max=1.80e+008
  class PC_ns:	total=0,	N=0,	<>=0,	min=0	max=0
class T_oclReadBuf:	total=6.85e+006,	N=182040,	<>=3.70e+001,	min=3.20e+001,	max=1.68e+004
class T_oclWriteBuf:	total=0.00e+000,	N=0,	<>=0.00e+000,	min=1.84e+019,	max=0.00e+000
  class T_FFT_inverse:	total=2.76e+010,	N=182040,	<>=1.52e+005,	min=1.46e+005,	max=2.42e+006
 class T_ffa:	total=2.71e+010,	N=1,	<>=2.71e+010,	min=2.71e+010,	max=2.71e+010
class T_GPU_buffer_read_backs:	total=31,	N=31,	<>=1,	min=1	max=1
USE_OPENCL	OPENCL_WRITE	USE_INCREASED_PRECISION	SMALL_CHIRP_TABLE	COMBINED_DECHIRP_KERNEL	
rev 1316
01:39:35 (2928): called boinc_finish

Note that the "finish" times are all much later than the last entry in the event log. However, I noted that the timestamps for the boinc_task_state.xml files are all 1:12 AM, the same as the last entry in the log.

It seems likely that when I restart BOINC on that machine the 3 AP tasks will fail with the "finish file present too long", since it's now been almost 8 hours since they were written. However, I'm wondering what would happen if I delete those "finish" files before restarting BOINC. Does anybody have any sense of what might happen if I do that?

I'll wait perhaps another hour before restarting, in case anybody wants to weigh in with any suggestions. (By the way, I've made copies of all the files in all three slot directories, in case that might help with later analysis.)
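
The slot-directory check described above can be automated. The following Python sketch walks a slots folder and lists every slot that contains a `boinc_finish_called` file, along with that file's modification time. The function name and the `slots_root` argument are hypothetical; point it at wherever the BOINC data directory keeps its slot directories on your host.

```python
import os

FINISH_FILE = "boinc_finish_called"  # file name found in the AP slot directories


def find_finish_files(slots_root):
    """Return [(slot_name, mtime)] for each slot that holds a finish file.

    slots_root is assumed to be the folder containing the numbered
    slot directories (0, 1, 2, ...).
    """
    found = []
    for name in sorted(os.listdir(slots_root)):
        path = os.path.join(slots_root, name, FINISH_FILE)
        if os.path.isfile(path):
            # mtime is seconds since the epoch; compare it with the last
            # timestamp in the event log to see how stale the file is.
            found.append((name, os.path.getmtime(path)))
    return found
```

On the host described in this post, a scan like this would be expected to report the three AP slots (0, 6 and 7), since those are the ones where the file was found by hand.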
ID: 1474737
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1474738 - Posted: 9 Feb 2014, 17:24:42 UTC
Last modified: 9 Feb 2014, 17:26:25 UTC

I believe it's the same error I described in this message: http://setiathome.berkeley.edu/forum_thread.php?id=73970&postid=1473277
Judging by the answers I received, it seems we have a "hell of a coincidence" that triggers the error.
ID: 1474738
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1474739 - Posted: 9 Feb 2014, 17:28:34 UTC - in response to Message 1474738.  

Yes, I think it's exactly the same error. What led to the restart of your AP task? Was it a BOINC crash or some other cause?
ID: 1474739
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1474746 - Posted: 9 Feb 2014, 17:35:06 UTC - in response to Message 1474739.  
Last modified: 9 Feb 2014, 17:35:24 UTC

My only guess is that it's an AV/Windows update or something similar (maybe Java or Adobe, who knows?) that runs automatically in the background, since the host was running unattended at that hour (the room was empty; I checked with the security camera to be sure).

BOINC just marked the WU as an error and continued to crunch normally (no crash). After that, no other similar error appeared on the host, which is why I call it a "hell of a coincidence bug" that only appears in some very specific situations, as Jason explained in one of his posts.
ID: 1474746
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1474760 - Posted: 9 Feb 2014, 18:00:49 UTC - in response to Message 1474746.  

Okay, thanks, Juan.

At the time BOINC crashed, there were also 11 MB tasks running on the machine, 5 on CPUs and 6 on GPUs (2 on each, along with the 1 AP on each). Looking at the timestamps in the slot directories for those tasks, I don't see anything beyond 1:13 AM, which would appear to indicate that all of those stopped at the same time BOINC did. However, the 3 AP tasks appeared to keep running! At least until they called boinc_finish, but by then there was no BOINC available to answer the call. I find that VERY interesting.
ID: 1474760
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1474762 - Posted: 9 Feb 2014, 18:04:57 UTC - in response to Message 1474760.  

Okay, thanks, Juan.

At the time BOINC crashed, there were also 11 MB tasks running on the machine, 5 on CPUs and 6 on GPUs (2 on each, along with the 1 AP on each). Looking at the timestamps in the slot directories for those tasks, I don't see anything beyond 1:13 AM, which would appear to indicate that all of those stopped at the same time BOINC did. However, the 3 AP tasks appeared to keep running! At least until they called boinc_finish, but by then there was no BOINC available to answer the call. I find that VERY interesting.

I was just thinking the same thing.

Most application testing is done 'standalone', so that precise case of BOINC crashing and the science application not noticing is perhaps hard to test. But the application should notice that BOINC is no longer running, and shut itself down so that everything is consistent on restart.
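
One way an application could notice that the client has gone away is a simple heartbeat timeout. The Python sketch below only illustrates the idea. It is not the BOINC API, and the class name, the 30-second timeout and the injectable clock are all hypothetical.

```python
import time


class HeartbeatMonitor:
    """Tracks when the client was last heard from (illustrative only)."""

    def __init__(self, timeout_secs=30.0, clock=time.monotonic):
        self._timeout = timeout_secs  # hypothetical value, not BOINC's
        self._clock = clock           # injectable so it can be tested
        self._last_beat = clock()

    def beat(self):
        """Record that a heartbeat from the client just arrived."""
        self._last_beat = self._clock()

    def client_seems_gone(self):
        """True once no heartbeat has been seen for longer than the timeout."""
        return self._clock() - self._last_beat > self._timeout
```

An app built this way would call `beat()` whenever it hears from the client and check `client_seems_gone()` in its main loop, exiting cleanly from its last checkpoint instead of running on to `boinc_finish` with nobody listening.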
ID: 1474762
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1474773 - Posted: 9 Feb 2014, 18:19:04 UTC - in response to Message 1474762.  

This is actually the 4th time I've had this happen on that machine. The earlier ones were last Nov 20, Dec 29, and Jan 5. In the first two instances, there were 2 GPU APs running (the machine only had 2 GPUs at the time), while the last one had only a single one. Two of the instances occurred at various times during the night (as did the most recent one), while the other happened in late morning.

I don't allow Windows (8.1) to do any automatic updating, even Windows Defender, so I don't think that could be triggering the actual BOINC crash. However, I don't think I've ever had BOINC crash on that machine except when at least one AP GPU task was running. And I don't recall BOINC crashing on any of my other machines. So perhaps there's some food for thought.
ID: 1474773
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1474776 - Posted: 9 Feb 2014, 18:22:27 UTC
Last modified: 9 Feb 2014, 18:24:20 UTC

Forgot to mention: at the time of the error there were 3 APs running on the 780 FTW of this host and 2 on the CPU (50% of the i5); all the others completed normally. I use below-normal priority and Windows 7 Ultimate 64-bit on this host.

I know it's hard to imagine a background task that uses more than the 2 cores already freed, especially because I was using r2083 at that time. My host has 8 GB of memory and uses no virtual memory (no swap file).
ID: 1474776
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1474780 - Posted: 9 Feb 2014, 18:30:09 UTC

Okay, I think that, in the absence of any other suggestions thus far, I'm going to restart BOINC on that machine. First, though, I'll try deleting the "boinc_finish_called" file from two of the slot directories and leave it alone in the third, just for comparison purposes. Should be interesting!
ID: 1474780
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1474791 - Posted: 9 Feb 2014, 18:48:24 UTC - in response to Message 1474780.  

I'll try deleting the "boinc_finish_called" file from two of the slot directories and leave it alone in the third, just for comparison purposes.

That approach appears to have worked like a charm! The task where I left the "boinc_finish_called" file in place, 3378152671, quickly failed with the expected computation error. The two tasks where I deleted the file, 3378140958 and 3378159065, appear to have finished normally, making a "second" call to boinc_finish after the restart. Of course I won't know for sure if they'll actually validate until a wingman reports on each of those, but it looks promising so far.
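
For anyone who wants to repeat the experiment, here is a rough Python sketch of it. Everything in it is an assumption layered on top of what the post reports: the function name, the `keep_slots` and `dry_run` parameters, and the idea of scripting the cleanup at all. It should only be run while BOINC is stopped, and whether tasks rescued this way go on to validate was still unknown at the time of the post.

```python
import os

FINISH_FILE = "boinc_finish_called"  # file name found in the AP slot directories


def clear_finish_files(slots_root, keep_slots=(), dry_run=True):
    """Remove leftover finish files from slot directories.

    slots_root : folder holding the numbered slot directories (assumed layout)
    keep_slots : slot names to leave untouched, e.g. one kept for comparison
    dry_run    : when True, only report what would be removed
    Returns the list of slot names whose finish file was (or would be) removed.
    """
    affected = []
    for name in sorted(os.listdir(slots_root)):
        if name in keep_slots:
            continue
        path = os.path.join(slots_root, name, FINISH_FILE)
        if os.path.isfile(path):
            if not dry_run:
                os.remove(path)
            affected.append(name)
    return affected
```

Leaving one slot in `keep_slots` reproduces the comparison made above, where the untouched task failed and the two cleaned-up tasks ran to completion.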
ID: 1474791
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1474799 - Posted: 9 Feb 2014, 19:15:29 UTC - in response to Message 1474791.  

I agree - interesting observation, and should be reproducible in testing. It leaves us with two separate questions:

1) Why did BOINC crash, and could it be doing so more often than we've suspected?
2) Why didn't the AP app notice?

BTW, my understanding is that if BOINC crashes, but BOINC Manager stays running, then the Manager will attempt to restart the Client. But in the case that I reported recently, it was Windows Explorer that crashed, taking out both client and manager at the same time, so no automatic restart was possible.
ID: 1474799
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1474811 - Posted: 9 Feb 2014, 19:35:03 UTC - in response to Message 1474799.  
Last modified: 9 Feb 2014, 19:35:38 UTC

I agree - interesting observation, and should be reproducible in testing. It leaves us with two separate questions:

1) Why did BOINC crash, and could it be doing so more often than we've suspected?
2) Why didn't the AP app notice?

BTW, my understanding is that if BOINC crashes, but BOINC Manager stays running, then the Manager will attempt to restart the Client. But in the case that I reported recently, it was Windows Explorer that crashed, taking out both client and manager at the same time, so no automatic restart was possible.


One thing I came across while building MB CUDA for Linux some time back was that some changes were in progress to replace the heartbeat mechanism. At the time, in the boincapi code, whatever the replacement was wasn't working at all, and would cause the app to exit.

So my guess is that whatever mechanism was devised there isn't quite working yet. Noticing multiple other breakages, I reverted to the 7.0.65 boincapi for the private Linux CUDA build.

(Windows builds, of course, use a much older modified boincapi, which I had fully intended to update if those multiple breakages and unresolved legacy problems weren't there. But they are there, so I didn't.)
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1474811
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1474814 - Posted: 9 Feb 2014, 19:51:29 UTC - in response to Message 1474799.  

But in the case that I reported recently, it was Windows Explorer that crashed, taking out both client and manager at the same time, so no automatic restart was possible.

Looks like Explorer was at the root of my crash, too. FWIW, here's a Windows log entry which appears to match the time of the last BOINC log entry:

Log Name:      Application
Source:        Microsoft-Windows-Winlogon
Date:          2/9/2014 1:12:43 AM
Event ID:      1002
Task Category: None
Level:         Information
Keywords:      Classic
User:          N/A
Computer:      T7400
Description:
The shell stopped unexpectedly and explorer.exe was restarted.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-Winlogon" Guid="{DBE9B383-7CF3-4331-91CC-A3CB16A3B538}" EventSourceName="Winlogon" />
    <EventID Qualifiers="16384">1002</EventID>
    <Version>0</Version>
    <Level>4</Level>
    <Task>0</Task>
    <Opcode>0</Opcode>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2014-02-09T09:12:43.000000000Z" />
    <EventRecordID>6305</EventRecordID>
    <Correlation />
    <Execution ProcessID="0" ThreadID="0" />
    <Channel>Application</Channel>
    <Computer>T7400</Computer>
    <Security />
  </System>
  <EventData>
    <Data>explorer.exe</Data>
  </EventData>
</Event>

Don't know if that might be helpful to anybody, or not.
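
The two timestamps in that entry agree once the time zone is taken into account: `SystemTime` is in UTC, and Pacific Standard Time is UTC-8. A short Python check, using a fixed -8 hour offset as a simplifying assumption (no daylight-saving handling) and a hypothetical helper name:

```python
from datetime import datetime, timedelta, timezone

# Fixed offset for Pacific Standard Time; an assumption that ignores DST.
PST = timezone(timedelta(hours=-8), "PST")


def event_time_local(system_time):
    """Convert an event's SystemTime string (UTC, trailing 'Z') to PST."""
    # Drop the 'Z' and the fractional part: the log carries nanosecond
    # digits, which strptime's %f directive cannot parse.
    stamp = system_time.rstrip("Z").split(".")[0]
    utc = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
    return utc.astimezone(PST)


print(event_time_local("2014-02-09T09:12:43.000000000Z"))
# prints 2014-02-09 01:12:43-08:00
```

That is one second after the last BOINC log entry (01:12:42), and it matches the 1:12:43 AM shown in the `Date` field.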
ID: 1474814
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1474816 - Posted: 9 Feb 2014, 20:01:22 UTC - in response to Message 1474811.  

I agree - interesting observation, and should be reproducible in testing. It leaves us with two separate questions:

1) Why did BOINC crash, and could it be doing so more often than we've suspected?
2) Why didn't the AP app notice?

BTW, my understanding is that if BOINC crashes, but BOINC Manager stays running, then the Manager will attempt to restart the Client. But in the case that I reported recently, it was Windows Explorer that crashed, taking out both client and manager at the same time, so no automatic restart was possible.

One thing I came across while building MB CUDA for Linux some time back was that some changes were in progress to replace the heartbeat mechanism. At the time, in the boincapi code, whatever the replacement was wasn't working at all, and would cause the app to exit.

So my guess is that whatever mechanism was devised there isn't quite working yet. Noticing multiple other breakages, I reverted to the 7.0.65 boincapi for the private Linux CUDA build.

(Windows builds, of course, use a much older modified boincapi, which I had fully intended to update if those multiple breakages and unresolved legacy problems weren't there. But they are there, so I didn't.)

David himself reported the heartbeat mechanism as flawed some seven years ago, and flagged it for replacement - the reasoning is in Trac ticket #336.

If there are problems with the PID/app_init.xml replacement, it would be helpful to feed them back in via boinc_alpha.
ID: 1474816
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1474841 - Posted: 9 Feb 2014, 21:49:57 UTC - in response to Message 1474816.  
Last modified: 9 Feb 2014, 21:50:15 UTC

David himself reported the heartbeat mechanism as flawed some seven years ago, and flagged it for replacement - the reasoning is in Trac ticket #336.

If there are problems with the PID/app_init.xml replacement, it would be helpful to feed them back in via boinc_alpha.


BOINC alpha being a mechanism for those participating in BOINC alpha testing to report issues? I'm not a BOINC alpha tester, nor do I particularly want to be, thanks anyway.

The BOINC development documentation says to email the assigned department head, which I'll do again once the problems I already reported are fixed.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1474841


 