194 (0xc2) EXIT_ABORTED_BY_CLIENT - "finish file present too long"

Author	Message
Josef W. Segur Volunteer developer Volunteer tester Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0	Message 1475079 - Posted: 10 Feb 2014, 6:59:15 UTC - in response to Message 1474816. Richard Haselgrove wrote: ... If there are problems with the PID/app_init.xml replacement, it would be helpful to feed them back in via boinc_alpha. Agreed, but Jeff's case won't help. The project AP OpenCL build is from Rev. 1316 in the SETI repository, dated 25 Jun 2012. The BOINC API change was made 11 Oct 2012 according to one of the last entries in checkin_notes. So the old heartbeat mechanism was in use. Joe ID: 1475079 ·

Jeff Buck Volunteer tester Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1476342 - Posted: 13 Feb 2014, 3:04:20 UTC - in response to Message 1474791. The two tasks where I deleted the file, 3378140958 and 3378159065, appear to have finished normally, making a "second" call to boinc_finish after the restart. Of course I won't know for sure if they'll actually validate until a wingman reports on each of those, but it looks promising so far. Just a quick follow-up to confirm that those two tasks did validate successfully. Deleting the "boinc_finish_called" file from each slot directory before restarting BOINC not only allowed the AP tasks to restart successfully and finish without triggering an error, that action doesn't seem to have caused any problems with the result files, either. Seems like a viable workaround until the underlying source of the problem gets identified and fixed. ID: 1476342 ·

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799	Message 1482530 - Posted: 27 Feb 2014, 23:05:39 UTC Got 2 new WUs today with this error on different hosts. http://setiathome.berkeley.edu/result.php?resultid=3409133824 http://setiathome.berkeley.edu/result.php?resultid=3408686062 ID: 1482530 ·

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799	Message 1482735 - Posted: 28 Feb 2014, 9:38:50 UTC Last modified: 28 Feb 2014, 9:41:54 UTC Another one. Can I do something to avoid it? http://setiathome.berkeley.edu/result.php?resultid=3411026804 But you are talking about AP WUs; in my case the error appears on MB WUs. ID: 1482735 ·

William Volunteer tester Send message Joined: 14 Feb 13 Posts: 2037 Credit: 17,689,662 RAC: 0	Message 1482736 - Posted: 28 Feb 2014, 9:40:49 UTC please try if the commode build helps. A person who won't read has no advantage over one who can't read. (Mark Twain) ID: 1482736 ·

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799	Message 1482775 - Posted: 28 Feb 2014, 14:26:06 UTC Last modified: 28 Feb 2014, 15:02:19 UTC Done, let's see what we get. That raises the old question: why? You know my hosts crunched a lot of WUs for months without producing this error, then from nowhere it appears... but just in some WUs, without warning. Yes, I have to agree, software development is not an exact science. Thanks for the tip. :) ID: 1482775 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1482785 - Posted: 28 Feb 2014, 14:53:02 UTC - in response to Message 1482775. Done, let's see what we get. That raises the old question: why? You know my hosts crunched a lot of WUs for months without producing this error, then from nowhere it appears... but just in some WUs, without warning. Yes, I have to agree, software development is not an exact science. Thanks for the tip. :) Simple. You don't notice your house was built on a swamp until the cracks appear in the walls 5 years after moving in. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1482785 ·

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799	Message 1482788 - Posted: 28 Feb 2014, 15:01:57 UTC LOL - I like the analogy. ID: 1482788 ·

BilBg Volunteer tester Send message Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0	Message 1484065 - Posted: 3 Mar 2014, 16:35:22 UTC - in response to Message 1476342. Deleting the "boinc_finish_called" file from each slot directory before restarting BOINC not only allowed the AP tasks to restart successfully and finish without triggering an error, ...
Since I see no normal reason for a boinc_finish_called file to exist when BOINC is not running, I suggest using a .bat file (DEL_boinc_finish_called.bat) placed in the BOINC Data directory. Create a shortcut to it, then put the shortcut in Startup and/or run it manually:
@If Exist slots\. Del /s slots\boinc_finish_called >>DEL_boinc_finish_called.Log
:@Pause
I tested with a copy of slots\ in an empty directory:
Deleted file - h:\BOINC-Data\- Bil -\_TTT_\slots\0\boinc_finish_called
Deleted file - h:\BOINC-Data\- Bil -\_TTT_\slots\2\boinc_finish_called
It seems you do not start BOINC automatically at startup, but those who do will probably need some start_delay:
<cc_config>
  <options>
    <start_delay>55</start_delay>
  </options>
</cc_config>
- ALF - "Find out what you don't do well ..... then don't do it!" :)
ID: 1484065 ·
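The batch file in the post above can also be sketched in Python, for anyone who prefers a script that reports what it removed. This is a minimal sketch, not a BOINC tool: the function name `remove_finish_files` and its `data_dir` argument are made up for illustration, and only the `slots` directory layout and the `boinc_finish_called` file name come from this thread. Unlike `Del /s`, it looks only directly inside each slot directory and does not recurse further. As with the batch file, it is meant to be run only while BOINC is stopped.

```python
from pathlib import Path

# File name reported in this thread; everything else here is illustrative.
FINISH_FILE = "boinc_finish_called"


def remove_finish_files(data_dir):
    """Delete leftover finish files from each slot directory.

    data_dir is the BOINC data directory (the one containing "slots").
    Returns the list of paths that were deleted.
    """
    slots = Path(data_dir) / "slots"
    deleted = []
    if not slots.is_dir():
        # Nothing to do if there is no slots directory at all.
        return deleted
    for slot in sorted(slots.iterdir()):
        candidate = slot / FINISH_FILE
        if candidate.is_file():
            candidate.unlink()
            deleted.append(candidate)
    return deleted
```

Calling `remove_finish_files` with the path of the BOINC data directory returns the deleted paths, which can be printed or appended to a log in the same spirit as DEL_boinc_finish_called.Log.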

Jeff Buck Volunteer tester Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1487038 - Posted: 10 Mar 2014, 17:31:40 UTC Discovered this morning that BOINC had crashed during the night on one of my machines (6980751). This is a different machine than the one where I've reported on this problem previously. However, as before, there was an AP running on a GPU at the time BOINC went down. When I checked the slot directory, sure enough, I found a boinc_finish_called file sitting there. I deleted that file before restarting BOINC and the AP restarted (at 96.40%) and finished normally. That task, 3430588916, is now waiting for validation. I found essentially the same set of circumstances with this BOINC crash as I did with the one on the other machine that I reported in detail on in Message 1474737. That is to say, the last entry in the Event Log prior to the crash is the "Starting task" message for the AP GPU task at 3:00:11 AM. All other running tasks at the time of the crash (7 MB GPU, 4 MB CPU, 3 AP CPU) appear to have exited shortly thereafter, with a "No heartbeat from core client for 30 sec - exiting" message in the Stderr. However, the one AP GPU task apparently continued to run for another 51 minutes before finally calling boinc_finish at 03:51:36. One interesting difference in this case is that I've just recently started running Lunatics on this particular machine, whereas the previously reported incidents on the other machine were all running the stock AP app. So, apparently, the problem exists in both stock and Lunatics, but only for GPU tasks, not CPU tasks. In summary, the BOINC crashes seem to occur immediately after an AP GPU task is started, but any AP GPU tasks running at the time of the crash seem to keep running to a normal finish, at which point the call to boinc_finish is left unanswered, since BOINC is by then long gone. 
Deleting the boinc_finish_called file from the Slot directory for each AP GPU task before restarting BOINC succeeds in allowing the tasks to restart and finish normally (again), without triggering the "finish file present too long" error. ID: 1487038 ·

Jeff Buck Volunteer tester Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1487613 - Posted: 12 Mar 2014, 2:27:51 UTC Just for the record, I guess, here's another one. Pretty much the same circumstances. BOINC crashed and the last entry in the Event Log was the "Starting task" message for an AP GPU task at 21:54:44. That task, 3431345398, turned out to be 100% blanked (too much RFI) and called boinc_finish just 3 seconds later, at 21:54:47, but BOINC was already gone. Unfortunately, it was almost 11 hours later before I discovered the outage, deleted the boinc_finish_called file from the AP task's slot directory, and got that machine back in business. No tasks lost, just 11 hours of processing time. :-( For what it's worth, this AP task was running on a different GPU than the one involved in the previous crash on that machine. Clearly there seems to be some connection between these BOINC crashes and the start of an AP GPU task but, despite these two happening within less than 24 hours, they still seem to be pretty rare. (These are the first two on that particular machine.) Between these two most recent crashes, at least 4 AP GPU tasks were successfully processed without any issues, another one has completed since the last crash, and 3 more are running without any problems right now, as I write this. So there must be some combination of circumstances that have to exist to trigger the crash, besides just the start of an AP GPU task. On the other machine where I've seen this happen, weeks or months pass between crashes, with dozens or even hundreds of AP tasks running without incident. It's certainly mystifying! ID: 1487613 ·

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799	Message 1493996 - Posted: 23 Mar 2014, 13:39:32 UTC - in response to Message 1482736. please try if the commode build helps. @William I tried your suggestion but... I got another one today http://setiathome.berkeley.edu/result.php?resultid=3451019700 and I am using the commode special test builds, so they don't actually fix the problem. ID: 1493996 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1493999 - Posted: 23 Mar 2014, 13:50:15 UTC - in response to Message 1493996. please try if the commode build helps. @William I tried your suggestion but... I got another one today http://setiathome.berkeley.edu/result.php?resultid=3451019700 and I am using the commode special test builds, so they don't actually fix the problem. I'm not sure whether that's the same problem or not. It looks as if the program reached the normal completion point, but couldn't close down properly for some reason. So, it started again, and crashed on the restart. Still, lots of lovely debug information logged, so I'll save that for Jason - looks like the replacement wingmate is a very fast host, so it may disappear too soon. ID: 1493999 ·

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799	Message 1494007 - Posted: 23 Mar 2014, 14:13:00 UTC - in response to Message 1493999. Last modified: 23 Mar 2014, 14:17:37 UTC It looks as if the program reached the normal completion point, but couldn't close down properly for some reason. So, it started again, and crashed on the restart. That's what I believe is happening when I see this in the logs: ... boinc_exit(): requesting safe worker shutdown -> Worker preemptively acknowledging a normal exit.-> called boinc_finish FILE_LOCK::unlock(): close failed.: No such file or directory Exit Status: 0 ... Restarted at 79.99 percent,... <edit> 441 sec of GPU time is the normal total crunching time for a 2.7 AR WU on this host; I checked that on other similar WUs. ID: 1494007 ·
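The pattern in the log excerpt above (a "called boinc_finish" line followed later by a "Restarted at ... percent" line) can be checked mechanically when going through many results. What follows is a minimal sketch, assuming the task's stderr has been saved as plain text. The function name `restarts_after_finish` and the choice of these two marker strings are illustrative and are not part of BOINC or the Lunatics application.

```python
import re

# Matches lines such as "Restarted at 79.99 percent, ..." from the excerpt.
RESTART_RE = re.compile(r"Restarted at\s+([0-9.]+)\s+percent")


def restarts_after_finish(stderr_text):
    """Return the restart percentages logged after a 'called boinc_finish' line.

    An empty list means no restart was seen after the application
    reported that it had finished.
    """
    finished = False
    restarts = []
    for line in stderr_text.splitlines():
        if "called boinc_finish" in line:
            finished = True
            continue
        match = RESTART_RE.search(line)
        if match and finished:
            restarts.append(float(match.group(1)))
    return restarts
```

A non-empty result flags a task that reported finishing and was then restarted from a checkpoint, which is the situation described in this post.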

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1494009 - Posted: 23 Mar 2014, 14:16:04 UTC - in response to Message 1493999. Last modified: 23 Mar 2014, 14:30:52 UTC I'm not sure whether that's the same problem or not. It looks as if the program reached the normal completion point, but couldn't close down properly for some reason. So, it started again, and crashed on the restart. Still, lots of lovely debug information logged, so I'll save that for Jason - looks like the replacement wingmate is a very fast host, so it may disappear too soon.
If not the exact same error, that'll be related. That's a driver crash subsequent to a bus timeout during a memory transfer from the device to the host. A possible difference here is that the commit mode build is successfully (Yay :D) writing+flushing+committing the full stderr content not visible before. The error here is consistent with a number of low-level possibilities, ranging from extreme PCIe bus saturation hitting a (hardware) timeout through to excessive GPU overclock and/or overheating.
- Callstack -
ChildEBP RetAddr Args to Child
0018ec1c 76241194 000001b0 ffffffff 00000000 041a0048 ntdll!ZwWaitForSingleObject+0x0 { Top of call stack, OS part of driver is waiting }
0018ec34 76241148 000001b0 ffffffff 00000000 0018ec58 kernel32!WaitForSingleObjectEx+0x0
0018ec48 70dfba96 000001b0 ffffffff 0018ed18 7087afc9 kernel32!WaitForSingleObject+0x0
...
0018f4d0 74f063b7 022b9000 42e80000 00000c00 8ab38ce3 nvcuda!cuMemcpyDtoH_v2+0x0
0018f528 74f2803d 022b9000 42e80000 00000c00 00000002 cudart32_50_35!+0x0
0018f630 004462c0 022b9000 42e80000 00000c00 00000002 cudart32_50_35!cudaMemcpy+0x0
0018f66c 00402101 07590100 00020000 00000002 00000003 Lunatics_x41zc_win32_cuda50!+0x0
I would systematically, under full load:
- check all temps,
- check for artefacts, adjust OCs/voltages as appropriate
- check PCIe Bus root port drivers are not labelled Microsoft 2006 (in device manager, system section), e.g. should say latest Intel if an Intel chipset
- examine possible CPU overcommit situations (reduce CPU load)
- see if there is any PCI latency timer related BIOS setting, and wind it out.
ID: 1494009 ·

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799	Message 1494013 - Posted: 23 Mar 2014, 14:31:44 UTC - in response to Message 1494009. Not an easy task for me, but I will try to do the tests. What I can say now:
- GPU temps are OK (75C max on the GPU, at least as reported by EVGA Precision)
- GPU Clock Offset +30Mhz RAM + 100 MHz
- GPU Clock 1123 V 1162 (stock)
- PCI Driver 6.1.7601.17514 - 21/06/2006 (Windows said it is updated)
- NO CPU WU is crunched on this host - 100% of the CPU is available to feed the GPU
- Will check the latency ASAP
What I don't understand is why the program Restarted at 79.99 percent. That surely will produce a bigger file than normal, or am I wrong?
ID: 1494013 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1494019 - Posted: 23 Mar 2014, 14:46:37 UTC - in response to Message 1494013. Last modified: 23 Mar 2014, 14:50:23 UTC - PCI Driver 6.1.7601.17514 - 21/06/2006 (Windows said it is updated)
Yes, it lies. 2006 is most definitely not updated, but an original MS Win7 driver, and probably a large part of the problem (if not the whole of it). After OS install, you need to force the Intel drivers manually for many Intel chipset parts so they show recent Intel(R) drivers instead, as the "Intel Inf Update utility" only installs the inf files. There is a confusing section in Intel's readme about this, but the basics are:
- get the zip download version instead of the installer
- unzip it to an easy location to find, such as C:\inteldrv
- manually go through the system devices and force each one that 'smells' like it should be, to one from c:\inteldrv..'All' subfolder.
This process replaces a lot of cruddy generic MS drivers from 2006, potentially with ones from 2011-2014, depending on the Intel chipset.
What I don't understand is why the program Restarted at 79.99 percent. That surely will produce a bigger file than normal, or am I wrong?
The application probably completed, but due to the congestion some or all of the result, state or finished file operations were blocked too long. Boinc has a way of silently killing things if disk operations etc. are ongoing. Apparently the Boinc client saw that as a premature exit, and restarted the task at the last checkpoint. [..In this case probably further blocking up the pipes, resulting in even higher contention and subsequent crash ]
ID: 1494019 ·

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799	Message 1494023 - Posted: 23 Mar 2014, 14:53:59 UTC I ran the latency test with the DPC Latency Checker program just now (CPU usage in the range of 20-60%, normal for this host). The absolute maximum was 273 µs, and that happened (to my human eyes) when one task finished crunching and a new one started (crunching 3 at a time). Normally the latency is in the range of 100-128 µs. It's hard to put this host at full load since it only crunches, does normal browsing (looking at SETI pages, for example) and occasionally I watch a DL TV series, so I started the 3 tasks simultaneously just now and the latency stayed in the same range. ID: 1494023 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1494024 - Posted: 23 Mar 2014, 15:01:24 UTC - in response to Message 1494023. Last modified: 23 Mar 2014, 15:02:09 UTC I ran the latency test with the DPC Latency Checker program just now (CPU usage in the range of 20-60%, normal for this host). The absolute maximum was 273 µs, and that happened (to my human eyes) when one task finished crunching and a new one started (crunching 3 at a time). Normally the latency is in the range of 100-128 µs. It's hard to put this host at full load since it only crunches, does normal browsing (looking at SETI pages, for example) and occasionally I watch a DL TV series, so I started the 3 tasks simultaneously just now and the latency stayed in the same range.
Good, could be just the 2006 drivers then. [Example Only] Here are the ones I had to force update here (all Intel(R) devices):
- G33/G31/P35/P31 Express Chipset PCI express root port
- G33/G31/P35/P31 Express Chipset Processor to I/O controller
- ICH9 Family PCI express root port, 1, 5 & 6 (3 devices)
- SMBus Controller
- LPC Interface Controller
- Lots of Intel(R) USB Universal Host Controllers
ID: 1494024 ·

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799	Message 1494025 - Posted: 23 Mar 2014, 15:01:48 UTC - in response to Message 1494019. Last modified: 23 Mar 2014, 15:02:18 UTC - PCI Driver 6.1.7601.17514 - 21/06/2006 (Windows said it is updated)
Yes, it lies. 2006 is most definitely not updated, but an original MS Win7 driver, and probably a large part of the problem (if not the whole of it). After OS install, you need to force the Intel drivers manually for many Intel chipset parts so they show recent Intel(R) drivers instead, as the "Intel Inf Update utility" only installs the inf files. There is a confusing section in Intel's readme about this, but the basics are:
- get the zip download version instead of the installer
- unzip it to an easy location to find, such as C:\inteldrv
- manually go through the system devices and force each one that 'smells' like it should be, to one from c:\inteldrv..'All' subfolder.
This process replaces a lot of cruddy generic MS drivers from 2006, potentially with ones from 2011-2014, depending on the Intel chipset.
That will take more time than I have now (it 12:00AM here, need to fire up the barbecue, you know: Sunday, shining sky, etc.). I will try to do it later, ASAP. You give me another reason to hate Windows! You can't even trust it when it says it's updated! Maybe it's better to install Win 8.1 directly, since it must have the most up-to-date drivers, or not? Thanks for your tips and usual help.
ID: 1494025 ·
