194 (0xc2) EXIT_ABORTED_BY_CLIENT - "finish file present too long"
Author | Message |
---|---|
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Richard Haselgrove wrote: ... Agreed, but Jeff's case won't help. The project AP OpenCL build is from Rev. 1316 in the SETI repository, dated 25 Jun 2012. The BOINC API change was made 11 Oct 2012 according to one of the last entries in checkin_notes. So the old heartbeat mechanism was in use. Joe |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Jeff Buck wrote: The two tasks where I deleted the file, 3378140958 and 3378159065, appear to have finished normally, making a "second" call to boinc_finish after the restart. Of course I won't know for sure if they'll actually validate until a wingman reports on each of those, but it looks promising so far.
Just a quick follow-up to confirm that those two tasks did validate successfully. Deleting the "boinc_finish_called" file from each slot directory before restarting BOINC not only allowed the AP tasks to restart successfully and finish without triggering an error, but also doesn't seem to have caused any problems with the result files. Seems like a viable workaround until the underlying source of the problem gets identified and fixed. |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Got 2 new WUs today with this error, on different hosts: http://setiathome.berkeley.edu/result.php?resultid=3409133824 http://setiathome.berkeley.edu/result.php?resultid=3408686062 |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Another one. Can I do something to avoid it? http://setiathome.berkeley.edu/result.php?resultid=3411026804 But you are talking about AP WUs; in my case the error appears on MB WUs. |
William Joined: 14 Feb 13 Posts: 2037 Credit: 17,689,662 RAC: 0 |
please try if the commode build helps. A person who won't read has no advantage over one who can't read. (Mark Twain) |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Done, let's see what we get. That raises the old question: why? You know my hosts crunched a lot of WUs for months without producing this error, then out of nowhere it appears... but just in some WUs, without warning. Yes, I have to agree, software development is not an exact science. Thanks for the tip. :) |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
juan BFP wrote: Done, let's see what we get.
Simple. You don't notice your house was built on a swamp until the cracks appear in the walls 5 years after moving in.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
LOL - I like the analogy. |
BilBg Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0 |
Jeff Buck wrote: Deleting the "boinc_finish_called" file from each slot directory before restarting BOINC not only allowed the AP tasks to restart successfully and finish without triggering an error, ...
Since I see no normal reason for the boinc_finish_called file to exist when BOINC is not running, I suggest using a .bat file (DEL_boinc_finish_called.bat) placed in the BOINC Data directory. Create a shortcut to it, then put the shortcut in Startup and/or run it manually:
    @If Exist slots\*.* Del /s slots\boinc_finish_called >>DEL_boinc_finish_called.Log
    :@Pause
I tested with a copy of slots\ in an empty directory:
    Deleted file - h:\BOINC-Data\- Bil -\_TTT_\slots\0\boinc_finish_called
    Deleted file - h:\BOINC-Data\- Bil -\_TTT_\slots\2\boinc_finish_called
It seems you do not start BOINC automatically at startup, but those who do will probably need some start_delay:
    <cc_config>
      <options>
        <start_delay>55</start_delay>
      </options>
    </cc_config>
- ALF - "Find out what you don't do well ..... then don't do it!" :) |
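For anyone not on Windows, the same cleanup can be sketched in Python. This is only an illustration of the workaround described above, not anything shipped with BOINC: the function name `remove_stale_finish_files` is made up for this example, and the data directory you pass in is whatever your own installation uses. As with the batch file, it should only be run while BOINC is stopped.

```python
from pathlib import Path

FINISH_FILE = "boinc_finish_called"  # file name as reported in this thread


def remove_stale_finish_files(data_dir):
    """Delete boinc_finish_called from each slot directory under data_dir.

    Hypothetical helper: returns the list of deleted paths so the caller
    can log them, much like the .Log file written by the batch version.
    """
    slots = Path(data_dir) / "slots"
    removed = []
    if not slots.is_dir():
        return removed  # nothing to do if there is no slots directory
    for slot in sorted(slots.iterdir()):
        finish_file = slot / FINISH_FILE
        if finish_file.is_file():
            finish_file.unlink()
            removed.append(finish_file)
    return removed
```

Calling it with the BOINC Data directory (for example the `h:\BOINC-Data\...` location used in the test above) returns the paths it removed, which can then be printed or appended to a log.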
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Discovered this morning that BOINC had crashed during the night on one of my machines (6980751). This is a different machine than the one where I've reported on this problem previously. However, as before, there was an AP running on a GPU at the time BOINC went down. When I checked the slot directory, sure enough, I found a boinc_finish_called file sitting there. I deleted that file before restarting BOINC and the AP restarted (at 96.40%) and finished normally. That task, 3430588916, is now waiting for validation.
I found essentially the same set of circumstances with this BOINC crash as I did with the one on the other machine that I reported in detail on in Message 1474737. That is to say, the last entry in the Event Log prior to the crash is the "Starting task" message for the AP GPU task at 3:00:11 AM. All other running tasks at the time of the crash (7 MB GPU, 4 MB CPU, 3 AP CPU) appear to have exited shortly thereafter, with a "No heartbeat from core client for 30 sec - exiting" message in the Stderr. However, the one AP GPU task apparently continued to run for another 51 minutes before finally calling boinc_finish at 03:51:36.
One interesting difference in this case is that I've just recently started running Lunatics on this particular machine, whereas the previously reported incidents on the other machine were all running the stock AP app. So, apparently, the problem exists in both stock and Lunatics, but only for GPU tasks, not CPU tasks.
In summary, the BOINC crashes seem to occur immediately after an AP GPU task is started, but any AP GPU tasks running at the time of the crash seem to keep running to a normal finish, at which point the call to boinc_finish is left unanswered, since BOINC is by then long gone. Deleting the boinc_finish_called file from the Slot directory for each AP GPU task before restarting BOINC succeeds in allowing the tasks to restart and finish normally (again), without triggering the "finish file present too long" error. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Just for the record, I guess, here's another one. Pretty much the same circumstances. BOINC crashed and the last entry in the Event Log was the "Starting task" message for an AP GPU task at 21:54:44. That task, 3431345398, turned out to be 100% blanked (too much RFI) and called boinc_finish just 3 seconds later, at 21:54:47, but BOINC was already gone. Unfortunately, it was almost 11 hours later before I discovered the outage, deleted the boinc_finish_called file from the AP task's slot directory, and got that machine back in business. No tasks lost, just 11 hours of processing time. :-(
For what it's worth, this AP task was running on a different GPU than the one involved in the previous crash on that machine. Clearly there seems to be some connection between these BOINC crashes and the start of an AP GPU task but, despite these two happening within less than 24 hours, they still seem to be pretty rare. (These are the first two on that particular machine.) Between these two most recent crashes, at least 4 AP GPU tasks were successfully processed without any issues, another one has completed since the last crash, and 3 more are running without any problems right now, as I write this.
So there must be some combination of circumstances that have to exist to trigger the crash, besides just the start of an AP GPU task. On the other machine where I've seen this happen, weeks or months pass between crashes, with dozens or even hundreds of AP tasks running without incident. It's certainly mystifying! |
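The gaps between the last "Starting task" entry and the orphaned boinc_finish call in these two reports can be checked with simple clock arithmetic. The sketch below is only a worked example: `gap` is a made-up helper name, and it assumes both timestamps are in 24-hour `HH:MM:SS` form and less than a day apart.

```python
from datetime import datetime, timedelta


def gap(start, end):
    """Elapsed time between two HH:MM:SS clock readings.

    Assumption: the readings are less than 24 hours apart, so a negative
    difference means the second reading fell after midnight.
    """
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta < timedelta(0):
        delta += timedelta(days=1)
    return delta


# Timestamps quoted in the two reports above.
first_crash = gap("03:00:11", "03:51:36")   # task kept running without BOINC
second_crash = gap("21:54:44", "21:54:47")  # 100% blanked task
```

With the figures from the posts, `first_crash` comes to 51 minutes 25 seconds and `second_crash` to 3 seconds, which matches the "another 51 minutes" and "just 3 seconds later" descriptions.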
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
William wrote: please try if the commode build helps.
@William I tried your suggestion, but... I got another one today: http://setiathome.berkeley.edu/result.php?resultid=3451019700 And I am using the commode special test builds, so they don't actually fix the problem. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
William wrote: please try if the commode build helps.
I'm not sure whether that's the same problem or not. It looks as if the program reached the normal completion point, but couldn't close down properly for some reason. So, it started again, and crashed on the restart. Still, lots of lovely debug information logged, so I'll save that for Jason - looks like the replacement wingmate is a very fast host, so it may disappear too soon. |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Richard Haselgrove wrote: It looks as if the program reached the normal completion point, but couldn't close down properly for some reason. So, it started again, and crashed on the restart.
That's what I believe is happening when I see this in the logs:
    ...
    boinc_exit(): requesting safe worker shutdown ->
    Worker preemptively acknowledging a normal exit.->
    called boinc_finish
    FILE_LOCK::unlock(): close failed.: No such file or directory
    Exit Status: 0
    ...
    Restarted at 79.99 percent,...
<edit> 441 sec of GPU time is the normal total crunching time for a 2.7 AR WU on this host; I checked that on other similar WUs. |
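The pattern in that log excerpt (a "called boinc_finish" line followed later by a "Restarted at ... percent" line) can be searched for mechanically when checking many results. The following is a rough sketch only: `restarts_after_finish` is a name invented here, and the two marker strings are taken from the excerpt above, so other application builds may word them differently.

```python
import re

FINISH_MARKER = "called boinc_finish"  # wording copied from the excerpt above
RESTART_RE = re.compile(r"Restarted at\s+([\d.]+)\s+percent")


def restarts_after_finish(stderr_text):
    """Return the restart percentages logged after the first finish marker.

    An empty list means the task either never logged boinc_finish or was
    not restarted afterwards.
    """
    pos = stderr_text.find(FINISH_MARKER)
    if pos == -1:
        return []
    tail = stderr_text[pos + len(FINISH_MARKER):]
    return [float(m.group(1)) for m in RESTART_RE.finditer(tail)]
```

Feeding it the stderr text of a result page and getting a non-empty list back would flag a task that finished once and was then restarted from a checkpoint, as in the case discussed here.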
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Richard Haselgrove wrote: I'm not sure whether that's the same problem or not. It looks as if the program reached the normal completion point, but couldn't close down properly for some reason. So, it started again, and crashed on the restart.
If not the exact same error, that'll be related. That's a driver crash subsequent to a bus timeout during a memory transfer from the device to the host. A possible difference here is that the commit mode build is successfully (Yay :D) writing+flushing+committing the full stderr content not visible before. The error here is consistent with a number of low-level possibilities ranging from extreme PCIe bus saturation hitting a (hardware) timeout, through to excessive GPU overclock &/or overheating.
- Callstack -
I would systematically, under full load:
- check all temps,
- check for artefacts, adjust OCs/voltages as appropriate
- check PCIe Bus root port drivers are not labelled Microsoft 2006 (in device manager, system section), e.g. should say latest Intel if an Intel chipset
- examine possible CPU overcommit situations (reduce CPU load)
- see if there is any PCI latency timer related BIOS setting, and wind it out.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Not an easy task for me, but I will try to do the tests. What I can say now:
- GPU temps are OK (75C max on the GPU, at least as reported by EVGA Precision)
- GPU Clock Offset +30 MHz, RAM +100 MHz
- GPU Clock 1123 V 1162 (stock)
- PCI Driver 6.1.7601.17514 - 21/06/2006 (Windows said it is updated)
- NO CPU WU is crunched on this host
- 100% of the CPU is available to feed the GPU
- Will check the latency ASAP
What I don't understand is why the program restarted at 79.99 percent. That will surely produce a bigger file than normal, or am I wrong? |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
juan BFP wrote: - PCI Driver 6.1.7601.17514 - 21/06/2006 (Windows said it is updated)
Yes, it lies. 2006 is most definitely not updated, but an original MS Win7 driver, and probably a large part of the problem (if not the whole of it). After OS install, you need to force the Intel drivers manually for many Intel chipset parts so they show recent Intel(R) drivers instead, as the "Intel Inf Update utility" only installs the inf files. There is a confusing section in Intel's readme about this, but the basics are:
- get the zip download version instead of the installer
- unzip it to an easy location to find, such as C:\inteldrv
- manually go through the system devices and force each one that 'smells' like it should be, to one from c:\inteldrv..'All' subfolder.
This process replaces a lot of cruddy generic MS drivers from 2006, potentially with ones from 2011-2014 depending on the Intel chipset.
The application probably completed, but due to the congestion some or all of the result, state or finished file operations were blocked too long. Boinc has a way of silently killing things if disk operations etc. are ongoing. Apparently the Boinc client saw that as a premature exit, and restarted the task at the last checkpoint. [..In this case probably further blocking up the pipes, resulting in even higher contention and subsequent crash ]
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
I ran the latency test with the DPC Latency Checker program, and now (CPU usage in the range of 20-60%, normal on this host) the absolute maximum was 273uS, and that happens (to my human eyes) when one task finishes crunching and a new one starts (crunching 3 at a time). Normally the latency is in the range of 100-128uS. It's hard to put this host at full load since it only crunches, does normal browsing (looking at SETI pages, for example) and occasionally I watch a DL TV series, so I started the 3 tasks simultaneously now and the latency stays in the same range. |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
juan BFP wrote: I ran the latency test with the DPC Latency Checker program, and now (CPU usage in the range of 20-60%, normal on this host) the absolute maximum was 273uS, and that happens (to my human eyes) when one task finishes crunching and a new one starts (crunching 3 at a time). Normally the latency is in the range of 100-128uS.
Good, could be just the 2006 drivers then. [Example Only] Here are the ones I had to force update here (all Intel(R) devices):
- G33/G31/P35/P31 Express Chipset PCI express root port
- G33/G31/P35/P31 Express Chipset Processor to I/O controller
- ICH9 Family PCI express root port, 1, 5 & 6 (3 devices)
- SMBus Controller
- LPC Interface Controller
- Lots of Intel(R) USB Universal Host Controllers
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
"- PCI Driver 6.1.7601.17514 - 21/06/2006 (Windows said it is updated)"
That will take more time than I have now (it's 12:00AM here and I need to fire up the barbecue; you know, Sunday, shining sky, etc.), so I will try to do it later, ASAP. You gave me another reason to hate Windows! You can't even trust it when it says it's updated! Maybe it's better to install Win 8.1 directly, since it must have the most up-to-date drivers, or not? Thanks for your tips and the usual help. |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.