194 (0xc2) EXIT_ABORTED_BY_CLIENT - "finish file present too long"

Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1475079 - Posted: 10 Feb 2014, 6:59:15 UTC - in response to Message 1474816.  

Richard Haselgrove wrote:
...
If there are problems with the PID/app_init.xml replacement, it would be helpful to feed them back in via boinc_alpha.

Agreed, but Jeff's case won't help. The project AP OpenCL build is from Rev. 1316 in the SETI repository, dated 25 Jun 2012. The BOINC API change was made 11 Oct 2012 according to one of the last entries in checkin_notes. So the old heartbeat mechanism was in use.
                                                                   Joe
ID: 1475079 · Report as offensive
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1476342 - Posted: 13 Feb 2014, 3:04:20 UTC - in response to Message 1474791.  

The two tasks where I deleted the file, 3378140958 and 3378159065, appear to have finished normally, making a "second" call to boinc_finish after the restart. Of course I won't know for sure if they'll actually validate until a wingman reports on each of those, but it looks promising so far.

Just a quick follow-up to confirm that those two tasks did validate successfully. Deleting the "boinc_finish_called" file from each slot directory before restarting BOINC not only allowed the AP tasks to restart successfully and finish without triggering an error, that action doesn't seem to have caused any problems with the result files, either. Seems like a viable workaround until the underlying source of the problem gets identified and fixed.
ID: 1476342 · Report as offensive
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1482530 - Posted: 27 Feb 2014, 23:05:39 UTC

ID: 1482530 · Report as offensive
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1482735 - Posted: 28 Feb 2014, 9:38:50 UTC
Last modified: 28 Feb 2014, 9:41:54 UTC

Another one. Can I do something to avoid it?

http://setiathome.berkeley.edu/result.php?resultid=3411026804

But you are talking about AP WUs; in my case the error appears on an MB WU.
ID: 1482735 · Report as offensive
William
Volunteer tester
Joined: 14 Feb 13
Posts: 2037
Credit: 17,689,662
RAC: 0
Message 1482736 - Posted: 28 Feb 2014, 9:40:49 UTC

please try if the commode build helps.
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 1482736 · Report as offensive
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1482775 - Posted: 28 Feb 2014, 14:26:06 UTC
Last modified: 28 Feb 2014, 15:02:19 UTC

Done, let's see what we get.

That raises the old question: why? You know my hosts have crunched a lot of WUs for months without producing this error, then out of nowhere it appears... but just in some WUs, without warning.

Yes, I have to agree, software development is not an exact science. Thanks for the tip. :)
ID: 1482775 · Report as offensive
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1482785 - Posted: 28 Feb 2014, 14:53:02 UTC - in response to Message 1482775.  

Done, let's see what we get.

That raises the old question: why? You know my hosts have crunched a lot of WUs for months without producing this error, then out of nowhere it appears... but just in some WUs, without warning.

Yes, I have to agree, software development is not an exact science. Thanks for the tip. :)



Simple. You don't notice your house was built on a swamp until the cracks appear in the walls 5 years after moving in.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1482785 · Report as offensive
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1482788 - Posted: 28 Feb 2014, 15:01:57 UTC

LOL - I like the analogy.
ID: 1482788 · Report as offensive
BilBg
Volunteer tester
Joined: 27 May 07
Posts: 3720
Credit: 9,385,827
RAC: 0
Bulgaria
Message 1484065 - Posted: 3 Mar 2014, 16:35:22 UTC - in response to Message 1476342.  

Deleting the "boinc_finish_called" file from each slot directory before restarting BOINC not only allowed the AP tasks to restart successfully and finish without triggering an error, ...

Since I see no normal reason for the boinc_finish_called file to exist when BOINC is not running,
I suggest using a .bat file (DEL_boinc_finish_called.bat) placed in the BOINC Data directory. Create a shortcut to it and put the shortcut in Startup, and/or run it manually:
@If Exist slots\*.*  Del /s slots\boinc_finish_called  >>DEL_boinc_finish_called.Log
:@Pause


I tested with a copy of slots\ in empty directory:
Deleted file - h:\BOINC-Data\- Bil -\_TTT_\slots\0\boinc_finish_called
Deleted file - h:\BOINC-Data\- Bil -\_TTT_\slots\2\boinc_finish_called
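
For anyone who would rather not use a batch file, the same cleanup could be sketched in Python. This is only an illustration, assuming the BOINC data directory is passed in by the caller; the function name remove_stale_finish_files is made up here, and like the .bat file it should only be run while BOINC is stopped.

```python
from pathlib import Path


def remove_stale_finish_files(data_dir):
    """Delete boinc_finish_called from every slot directory under data_dir.

    Hypothetical helper: run it only while the BOINC client is stopped.
    Returns the list of files that were removed.
    """
    removed = []
    slots = Path(data_dir) / "slots"
    if not slots.is_dir():
        # Nothing to do if this is not a BOINC data directory.
        return removed
    # Each slot is a subdirectory of slots (slots/0, slots/1, ...).
    for finish_file in sorted(slots.glob("*/boinc_finish_called")):
        if finish_file.is_file():
            finish_file.unlink()
            removed.append(finish_file)
    return removed
```

Calling remove_stale_finish_files with the path of the BOINC data directory returns the deleted paths, which can be printed or appended to a log in the same way the .bat file does.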


It seems you do not start BOINC automatically at startup, but those who do will probably need some start_delay in cc_config.xml:
<cc_config>
   <options>
      <start_delay>55</start_delay>
   </options>
</cc_config>

 


- ALF - "Find out what you don't do well ..... then don't do it!" :)
 
ID: 1484065 · Report as offensive
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1487038 - Posted: 10 Mar 2014, 17:31:40 UTC

Discovered this morning that BOINC had crashed during the night on one of my machines (6980751). This is a different machine than the one where I've reported on this problem previously. However, as before, there was an AP running on a GPU at the time BOINC went down. When I checked the slot directory, sure enough, I found a boinc_finish_called file sitting there. I deleted that file before restarting BOINC and the AP restarted (at 96.40%) and finished normally. That task, 3430588916, is now waiting for validation.

I found essentially the same set of circumstances with this BOINC crash as I did with the one on the other machine that I reported in detail on in Message 1474737. That is to say, the last entry in the Event Log prior to the crash is the "Starting task" message for the AP GPU task at 3:00:11 AM. All other running tasks at the time of the crash (7 MB GPU, 4 MB CPU, 3 AP CPU) appear to have exited shortly thereafter, with a "No heartbeat from core client for 30 sec - exiting" message in the Stderr. However, the one AP GPU task apparently continued to run for another 51 minutes before finally calling boinc_finish at 03:51:36.

One interesting difference in this case is that I've just recently started running Lunatics on this particular machine, whereas the previously reported incidents on the other machine were all running the stock AP app. So, apparently, the problem exists in both stock and Lunatics, but only for GPU tasks, not CPU tasks.

In summary, the BOINC crashes seem to occur immediately after an AP GPU task is started, but any AP GPU tasks running at the time of the crash seem to keep running to a normal finish, at which point the call to boinc_finish is left unanswered, since BOINC is by then long gone. Deleting the boinc_finish_called file from the Slot directory for each AP GPU task before restarting BOINC succeeds in allowing the tasks to restart and finish normally (again), without triggering the "finish file present too long" error.
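
The failure mode summarized above can be modeled roughly as follows. This is a toy sketch, not the BOINC client source: the function name, the FINISH_FILE_TIMEOUT value and the returned strings are all assumptions made for illustration. It only shows why a boinc_finish_called file left over from before a crash would lead to an abort once the restarted task keeps running past the timeout.

```python
FINISH_FILE_TIMEOUT = 10.0  # seconds; assumed value, for illustration only


def check_finish_file(first_seen, now, process_running):
    """Toy model of the client's decision for one running task.

    first_seen: time the client first noticed boinc_finish_called in the
                slot directory, or None if the file is absent.
    now: current time in seconds.
    process_running: whether the science app process is still alive.
    """
    if first_seen is None:
        return "keep running"
    if not process_running:
        # The app wrote the file and exited promptly: the normal finish path.
        return "finished normally"
    if now - first_seen > FINISH_FILE_TIMEOUT:
        # File present but the app is still running: the error in this thread.
        return "abort: finish file present too long"
    return "wait for exit"
```

With a stale file noticed at restart (first_seen=0.0) and the task still crunching from its checkpoint a minute later (now=60.0, process_running=True), the model returns the abort. With the file deleted beforehand, first_seen stays None and the task simply keeps running.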
ID: 1487038 · Report as offensive
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1487613 - Posted: 12 Mar 2014, 2:27:51 UTC

Just for the record, I guess, here's another one. Pretty much the same circumstances. BOINC crashed and the last entry in the Event Log was the "Starting task" message for an AP GPU task at 21:54:44. That task, 3431345398, turned out to be 100% blanked (too much RFI) and called boinc_finish just 3 seconds later, at 21:54:47, but BOINC was already gone. Unfortunately, it was almost 11 hours later before I discovered the outage, deleted the boinc_finish_called file from the AP task's slot directory, and got that machine back in business. No tasks lost, just 11 hours of processing time. :-( For what it's worth, this AP task was running on a different GPU than the one involved in the previous crash on that machine.

Clearly there seems to be some connection between these BOINC crashes and the start of an AP GPU task but, despite these two happening within less than 24 hours, they still seem to be pretty rare. (These are the first two on that particular machine.) Between these two most recent crashes, at least 4 AP GPU tasks were successfully processed without any issues, another one has completed since the last crash, and 3 more are running without any problems right now, as I write this. So there must be some combination of circumstances that have to exist to trigger the crash, besides just the start of an AP GPU task. On the other machine where I've seen this happen, weeks or months pass between crashes, with dozens or even hundreds of AP tasks running without incident. It's certainly mystifying!
ID: 1487613 · Report as offensive
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1493996 - Posted: 23 Mar 2014, 13:39:32 UTC - in response to Message 1482736.  

please try if the commode build helps.


@William I tried your suggestion but...

I got another one today, http://setiathome.berkeley.edu/result.php?resultid=3451019700, and I am using the commode special test build, so it doesn't actually fix the problem.
ID: 1493996 · Report as offensive
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1493999 - Posted: 23 Mar 2014, 13:50:15 UTC - in response to Message 1493996.  

please try if the commode build helps.


@William I tried your suggestion but...

I got another one today, http://setiathome.berkeley.edu/result.php?resultid=3451019700, and I am using the commode special test build, so it doesn't actually fix the problem.

I'm not sure whether that's the same problem or not. It looks as if the program reached the normal completion point, but couldn't close down properly for some reason. So, it started again, and crashed on the restart.

Still, lots of lovely debug information logged, so I'll save that for Jason - looks like the replacement wingmate is a very fast host, so it may disappear too soon.
ID: 1493999 · Report as offensive
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1494007 - Posted: 23 Mar 2014, 14:13:00 UTC - in response to Message 1493999.  
Last modified: 23 Mar 2014, 14:17:37 UTC

It looks as if the program reached the normal completion point, but couldn't close down properly for some reason. So, it started again, and crashed on the restart.

That's what I believe is happening when I see this in the logs:
...
boinc_exit(): requesting safe worker shutdown ->
Worker preemptively acknowledging a normal exit.->
called boinc_finish
FILE_LOCK::unlock(): close failed.: No such file or directory
Exit Status: 0
...
Restarted at 79.99 percent,...

<edit> 441 sec of GPU time is the normal total crunching time for a 2.7 AR WU on this host; I checked that against other similar WUs.
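
The pattern in that log excerpt (a "called boinc_finish" line followed later by a "Restarted at" line) could be spotted automatically in saved stderr text. A minimal sketch, assuming the two marker strings appear exactly as in the excerpt above; the function name is made up for illustration:

```python
def finished_then_restarted(stderr_text):
    """Return True if a 'Restarted at' line appears after 'called boinc_finish'."""
    seen_finish = False
    for line in stderr_text.splitlines():
        if "called boinc_finish" in line:
            seen_finish = True
        elif seen_finish and line.lstrip().startswith("Restarted at"):
            # The app reported finishing, yet was started again afterwards.
            return True
    return False
```

A restart that happens before any finish call (an ordinary resume from a checkpoint) is deliberately not flagged.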
ID: 1494007 · Report as offensive
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1494009 - Posted: 23 Mar 2014, 14:16:04 UTC - in response to Message 1493999.  
Last modified: 23 Mar 2014, 14:30:52 UTC

I'm not sure whether that's the same problem or not. It looks as if the program reached the normal completion point, but couldn't close down properly for some reason. So, it started again, and crashed on the restart.

Still, lots of lovely debug information logged, so I'll save that for Jason - looks like the replacement wingmate is a very fast host, so it may disappear too soon.


If not the exact same error, that'll be related. That's a driver crash subsequent to a bus timeout during a memory transfer from the device to the host.



A possible difference here is that the commit mode build is successfully (Yay :D) writing+flushing+committing the full stderr content not visible before.

The error here is consistent with a number of low level possibilities ranging from extreme PCIe bus saturation hitting a (hardware) timeout, through to excessive GPU overclock &/or overheating.

- Callstack -
ChildEBP RetAddr Args to Child
0018ec1c 76241194 000001b0 ffffffff 00000000 041a0048 ntdll!ZwWaitForSingleObject+0x0 { Top of call stack, OS part of driver is waiting }
0018ec34 76241148 000001b0 ffffffff 00000000 0018ec58 kernel32!WaitForSingleObjectEx+0x0
0018ec48 70dfba96 000001b0 ffffffff 0018ed18 7087afc9 kernel32!WaitForSingleObject+0x0
...
0018f4d0 74f063b7 022b9000 42e80000 00000c00 8ab38ce3 nvcuda!cuMemcpyDtoH_v2+0x0
0018f528 74f2803d 022b9000 42e80000 00000c00 00000002 cudart32_50_35!+0x0
0018f630 004462c0 022b9000 42e80000 00000c00 00000002 cudart32_50_35!cudaMemcpy+0x0
0018f66c 00402101 07590100 00020000 00000002 00000003 Lunatics_x41zc_win32_cuda50!+0x0


I would systematically, under full load:
- check all temps,
- check for artefacts, adjust OCs/voltages as appropriate
- check PCIe Bus root port drivers are not labelled Microsoft 2006 (in device manager, system section), e.g. should say latest Intel if an Intel chipset
- examine possible CPU overcommit situations, (reduce CPU load)
- see if there is any PCI latency timer related BIOS setting, and wind it out.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1494009 · Report as offensive
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1494013 - Posted: 23 Mar 2014, 14:31:44 UTC - in response to Message 1494009.  

Not an easy task for me, but I will try to do the tests.

What I can say now:

- GPU temps are OK (75C max on the GPU, at least as reported by EVGA Precision)
- GPU Clock Offset +30 MHz, RAM +100 MHz - GPU Clock 1123 V 1162 (stock)
- PCI Driver 6.1.7601.17514 - 21/06/2006 (Windows said it is updated)
- No CPU WU is crunched on this host - 100% of the CPU is available to feed the GPU
- Will check the latency ASAP

What I don't understand is why the program restarted at 79.99 percent. That will surely produce a bigger file than normal, or am I wrong?
ID: 1494013 · Report as offensive
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1494019 - Posted: 23 Mar 2014, 14:46:37 UTC - in response to Message 1494013.  
Last modified: 23 Mar 2014, 14:50:23 UTC

- PCI Driver 6.1.7601.17514 - 21/06/2006 (windows said is updated)


Yes, it lies. 2006 is most definitely not updated; that is an original MS Win7 driver, and probably a large part of the problem (if not the whole of it). After an OS install, you need to force the Intel drivers manually for many Intel chipset parts so they show recent Intel(R) drivers instead, as the "Intel Inf Update utility" only installs the inf files.

There is a confusing section in Intel's readme about this, but the basics are
- get the zip download version instead of the installer
- unzip it to an easy location to find, such as C:\inteldrv
- manually go through the system devices and force each one that 'smells' like it should be, to one from c:\inteldrv..'All' subfolder.

This process replaces a lot of cruddy generic MS drivers from 2006, potentially to ones from 2011-2014 depending on the Intel Chipset.


What I don't understand is why the program restarted at 79.99 percent. That will surely produce a bigger file than normal, or am I wrong?


The Application probably completed, but due to the congestion some or all of the result, state or finished file operations were blocked too long. Boinc has a way of silently killing things if disk operations etc are ongoing.

Apparently the Boinc client saw that as a premature exit, and restarted the task at the last checkpoint. [..In this case probably further blocking up the pipes, resulting in even higher contention and subsequent crash ]
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1494019 · Report as offensive
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1494023 - Posted: 23 Mar 2014, 14:53:59 UTC

I ran the latency test with the DPC Latency Checker program just now (CPU usage in the range of 20-60%, normal for this host). The absolute maximum was 273 µs, and that happens (to my human eyes) when one task finishes crunching and a new one starts (crunching 3 at a time). Normally the latency is in the range of 100-128 µs.

It's hard to put this host at full load since it only crunches, does normal browsing (looking at SETI pages, for example) and occasionally I watch a DL TV series, so I started the 3 tasks simultaneously just now and the latency stayed in the same range.
ID: 1494023 · Report as offensive
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1494024 - Posted: 23 Mar 2014, 15:01:24 UTC - in response to Message 1494023.  
Last modified: 23 Mar 2014, 15:02:09 UTC

I ran the latency test with the DPC Latency Checker program just now (CPU usage in the range of 20-60%, normal for this host). The absolute maximum was 273 µs, and that happens (to my human eyes) when one task finishes crunching and a new one starts (crunching 3 at a time). Normally the latency is in the range of 100-128 µs.

It's hard to put this host at full load since it only crunches, does normal browsing (looking at SETI pages, for example) and occasionally I watch a DL TV series, so I started the 3 tasks simultaneously just now and the latency stayed in the same range.


Good, could be just the 2006 drivers then.

[Example Only] Here's the ones I had to force update here ( all Intel(R) devices ) :
- G33/G31/P35/P31 Express Chipset PCI express root port
- G33/G31/P35/P31 Express Chipset Processor to I/O controller
- ICH9 Family PCI express root port, 1, 5 & 6 (3 devices)
- SMBus Controller
- LPC Interface Controller
- Lots of Intel(R) USB Universal Host Controllers
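
To find which devices are still on the generic 2006 drivers without clicking through Device Manager one by one, a driver listing exported as CSV could be filtered. A rough sketch, assuming the listing has columns named DeviceName, DriverProviderName and DriverDate (the property names of the Win32_PnPSignedDriver WMI class) and that DriverDate begins with a four-digit year; how the CSV is exported, the date format and the function name are assumptions for illustration:

```python
import csv
import io


def flag_generic_drivers(csv_text, provider="Microsoft", year=2006):
    """Return device names whose driver matches the given provider and year.

    Assumes columns DeviceName, DriverProviderName and DriverDate, with
    DriverDate starting with a four-digit year (e.g. 2006-06-21).
    """
    flagged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if (row["DriverProviderName"].strip() == provider
                and row["DriverDate"].strip().startswith(str(year))):
            flagged.append(row["DeviceName"].strip())
    return flagged
```

Anything the function returns is a candidate for the manual force-update described above; whether a given device should actually get an Intel driver still has to be judged by hand.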
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1494024 · Report as offensive
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1494025 - Posted: 23 Mar 2014, 15:01:48 UTC - in response to Message 1494019.  
Last modified: 23 Mar 2014, 15:02:18 UTC

- PCI Driver 6.1.7601.17514 - 21/06/2006 (Windows said it is updated)

Yes, it lies. 2006 is most definitely not updated; that is an original MS Win7 driver, and probably a large part of the problem (if not the whole of it). After an OS install, you need to force the Intel drivers manually for many Intel chipset parts so they show recent Intel(R) drivers instead, as the "Intel Inf Update utility" only installs the inf files.

There is a confusing section in Intel's readme about this, but the basics are
- get the zip download version instead of the installer
- unzip it to an easy location to find, such as C:\inteldrv
- manually go through the system devices and force each one that 'smells' like it should be, to one from c:\inteldrv..'All' subfolder.

This process replaces a lot of cruddy generic MS drivers from 2006, potentially to ones from 2011-2014 depending on the Intel Chipset.

That will take more time than I have now (it's 12:00AM here and I need to fire up the barbecue; you know, Sunday, shining sky, etc.). I will try to do it later, ASAP.

You give me another reason to hate Windows! You can't even trust it when it says it's updated!

Maybe it's better to install Win 8.1 directly, since it must have the most updated drivers, or not?

Thanks for your tips and usual help.
ID: 1494025 · Report as offensive
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.