Stderr Truncations

Message boards : Number crunching : Stderr Truncations

Keith Myers
Message 1700626 - Posted: 11 Jul 2015, 23:07:25 UTC - in response to Message 1700621.  

Makes sense. Heck, if all I have to do is set the flag to eliminate the invalids and suffer a bigger logfile ... I'll do it. At least until the devs can figure out the proper solution in the BOINC client code.
ID: 1700626

Keith Myers
Message 1700629 - Posted: 11 Jul 2015, 23:58:19 UTC

Of course I jinxed it. Got an invalid Task 1184614368 after I had set the task_debug flag.

6817 Milkyway@Home 7/11/2015 4:15:21 PM [cpu_sched] Starting task de_sum_fast_15_3s_136_sim1Jun1_4_1434554402_8848733_1 using milkyway_separation__modified_fit version 136 (opencl_nvidia_101) in slot 0
6818 Milkyway@Home 7/11/2015 4:15:23 PM Sending scheduler request: To fetch work.
6819 Milkyway@Home 7/11/2015 4:15:23 PM Reporting 1 completed tasks
6820 Milkyway@Home 7/11/2015 4:15:23 PM Requesting new tasks for NVIDIA GPU
6822 Milkyway@Home 7/11/2015 4:15:24 PM Scheduler request completed: got 1 new tasks
6823 Milkyway@Home 7/11/2015 4:15:24 PM [task] result state=NEW for de_80_DR8_Rev_8_5_00004_1434551187_12091972_0 from handle_scheduler_reply
6824 7/11/2015 4:15:24 PM [slot] removed file projects/milkyway.cs.rpi.edu_milkyway/de_80_DR8_Rev_8_5_00004_1434551187_12091972_0_0
6825 7/11/2015 4:15:24 PM [slot] removed file projects/milkyway.cs.rpi.edu_milkyway/de_80_DR8_Rev_8_5_00004_1434551187_12091972_0_0.gz
6826 7/11/2015 4:15:24 PM [slot] removed file projects/milkyway.cs.rpi.edu_milkyway/de_80_DR8_Rev_8_5_00004_1434551187_12091972_0_0.gzt
6827 Milkyway@Home 7/11/2015 4:15:25 PM [task] result state=FILES_DOWNLOADING for de_80_DR8_Rev_8_5_00004_1434551187_12091972_0 from CS::update_results
6828 Milkyway@Home 7/11/2015 4:15:26 PM [task] result state=FILES_DOWNLOADED for de_80_DR8_Rev_8_5_00004_1434551187_12091972_0 from CS::update_results
6838 Milkyway@Home 7/11/2015 4:16:11 PM [task] Process for de_sum_fast_15_3s_136_sim1Jun1_4_1434554402_8848733_1 exited, exit code 0, task state 1
6839 Milkyway@Home 7/11/2015 4:16:11 PM [task] task_state=EXITED for de_sum_fast_15_3s_136_sim1Jun1_4_1434554402_8848733_1 from handle_exited_app
6840 7/11/2015 4:16:11 PM [slot] cleaning out slots/0: handle_exited_app()
6841 7/11/2015 4:16:11 PM [slot] removed file slots/0/astronomy_parameters.txt
6842 7/11/2015 4:16:11 PM [slot] removed file slots/0/boinc_finish_called
6843 7/11/2015 4:16:11 PM [slot] removed file slots/0/init_data.xml
6844 7/11/2015 4:16:11 PM [slot] removed file slots/0/milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe
6845 7/11/2015 4:16:11 PM [slot] removed file slots/0/separation_checkpoint
6846 7/11/2015 4:16:11 PM [slot] removed file slots/0/stars.txt
6847 7/11/2015 4:16:11 PM [slot] failed to remove file slots/0/stderr.txt: unlink() failed
6848 Milkyway@Home 7/11/2015 4:16:11 PM Computation for task de_sum_fast_15_3s_136_sim1Jun1_4_1434554402_8848733_1 finished
6849 Milkyway@Home 7/11/2015 4:16:11 PM [task] result state=FILES_UPLOADING for de_sum_fast_15_3s_136_sim1Jun1_4_1434554402_8848733_1 from CS::app_finished
6850 Milkyway@Home 7/11/2015 4:16:11 PM [task] result state=FILES_UPLOADED for de_sum_fast_15_3s_136_sim1Jun1_4_1434554402_8848733_1 from CS::update_results
6851 7/11/2015 4:16:11 PM [slot] cleaning out slots/0: get_free_slot()
6852 7/11/2015 4:16:11 PM [slot] removed file slots/0/stderr.txt
6853 Milkyway@Home 7/11/2015 4:16:11 PM [slot] assigning slot 0 to de_80_DR8_Rev_8_5_00004_1434551187_11984805_2
6854 7/11/2015 4:16:11 PM [slot] removed file slots/0/init_data.xml
6855 Milkyway@Home 7/11/2015 4:16:11 PM [slot] linked ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_separation_1.02_windows_x86_64__opencl_nvidia.exe to slots/0/milkyway_separation_1.02_windows_x86_64__opencl_nvidia.exe
6856 Milkyway@Home 7/11/2015 4:16:11 PM [slot] linked ../../projects/milkyway.cs.rpi.edu_milkyway/80_rev_8_5.prmtrs to slots/0/astronomy_parameters.txt
6857 Milkyway@Home 7/11/2015 4:16:11 PM [slot] linked ../../projects/milkyway.cs.rpi.edu_milkyway/80_Rev_8_3.stars to slots/0/stars.txt
6858 7/11/2015 4:16:11 PM [slot] removed file slots/0/boinc_temporary_exit
6859 Milkyway@Home 7/11/2015 4:16:11 PM [task] task_state=EXECUTING for de_80_DR8_Rev_8_5_00004_1434551187_11984805_2 from start
6860 Milkyway@Home 7/11/2015 4:16:11 PM Starting task de_80_DR8_Rev_8_5_00004_1434551187_11984805_2
6861 Milkyway@Home 7/11/2015 4:16:11 PM [cpu_sched] Starting task de_80_DR8_Rev_8_5_00004_1434551187_11984805_2 using milkyway version 102 (opencl_nvidia) in slot 0
ID: 1700629

Keith Myers
Message 1700634 - Posted: 12 Jul 2015, 0:44:15 UTC

And another one, Task 1184683992.


9330 7/11/2015 5:21:14 PM [slot] cleaning out slots/1: get_free_slot()
9331 Milkyway@Home 7/11/2015 5:21:14 PM [slot] assigning slot 1 to de_fast_15_3s_136_sim1Jun1_1_1434554402_8887629_0
9332 7/11/2015 5:21:14 PM [slot] removed file slots/1/init_data.xml
9333 Milkyway@Home 7/11/2015 5:21:14 PM [slot] linked ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe to slots/1/milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe
9334 Milkyway@Home 7/11/2015 5:21:14 PM [slot] linked ../../projects/milkyway.cs.rpi.edu_milkyway/parameters-15-3s-sim-fast.txt to slots/1/astronomy_parameters.txt
9335 Milkyway@Home 7/11/2015 5:21:14 PM [slot] linked ../../projects/milkyway.cs.rpi.edu_milkyway/stars-15-sim-1Jun1.txt to slots/1/stars.txt
9336 7/11/2015 5:21:14 PM [slot] removed file slots/1/boinc_temporary_exit
9337 Milkyway@Home 7/11/2015 5:21:14 PM [task] task_state=EXECUTING for de_fast_15_3s_136_sim1Jun1_1_1434554402_8887629_0 from start
9338 Milkyway@Home 7/11/2015 5:21:14 PM Starting task de_fast_15_3s_136_sim1Jun1_1_1434554402_8887629_0
9339 Milkyway@Home 7/11/2015 5:21:14 PM [cpu_sched] Starting task de_fast_15_3s_136_sim1Jun1_1_1434554402_8887629_0 using milkyway_separation__modified_fit version 136 (opencl_nvidia_101) in slot 1
9443 Milkyway@Home 7/11/2015 5:22:05 PM [task] Process for de_fast_15_3s_136_sim1Jun1_1_1434554402_8887629_0 exited, exit code 0, task state 1
9444 Milkyway@Home 7/11/2015 5:22:05 PM [task] task_state=EXITED for de_fast_15_3s_136_sim1Jun1_1_1434554402_8887629_0 from handle_exited_app
9445 7/11/2015 5:22:05 PM [slot] cleaning out slots/1: handle_exited_app()
9446 7/11/2015 5:22:05 PM [slot] removed file slots/1/astronomy_parameters.txt
9447 7/11/2015 5:22:05 PM [slot] removed file slots/1/boinc_finish_called
9448 7/11/2015 5:22:05 PM [slot] removed file slots/1/init_data.xml
9449 7/11/2015 5:22:05 PM [slot] removed file slots/1/milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe
9450 7/11/2015 5:22:05 PM [slot] removed file slots/1/separation_checkpoint
9451 7/11/2015 5:22:05 PM [slot] removed file slots/1/stars.txt
9452 7/11/2015 5:22:05 PM [slot] failed to remove file slots/1/stderr.txt: unlink() failed
9453 Milkyway@Home 7/11/2015 5:22:05 PM Computation for task de_fast_15_3s_136_sim1Jun1_1_1434554402_8887629_0 finished
9454 Milkyway@Home 7/11/2015 5:22:05 PM [task] result state=FILES_UPLOADING for de_fast_15_3s_136_sim1Jun1_1_1434554402_8887629_0 from CS::app_finished
9455 Milkyway@Home 7/11/2015 5:22:05 PM [task] result state=FILES_UPLOADED for de_fast_15_3s_136_sim1Jun1_1_1434554402_8887629_0 from CS::update_results
9456 7/11/2015 5:22:05 PM [slot] cleaning out slots/1: get_free_slot()
9457 7/11/2015 5:22:05 PM [slot] failed to remove file slots/1/stderr.txt: unlink() failed
9458 Milkyway@Home 7/11/2015 5:22:05 PM [slot] failed to clean out dir: unlink() failed
9501 7/11/2015 5:22:57 PM [slot] cleaning out slots/1: get_free_slot()
9502 7/11/2015 5:22:57 PM [slot] removed file slots/1/stderr.txt
9504 7/11/2015 5:22:57 PM [slot] removed file slots/1/init_data.xml
9511 7/11/2015 5:22:57 PM [slot] removed file slots/1/boinc_temporary_exit
ID: 1700634

jason_gee
Message 1700677 - Posted: 12 Jul 2015, 6:59:31 UTC - in response to Message 1700634.  

Hmmm, well there's your problem, the Cat's dead.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1700677 · Report as offensive

Richard Haselgrove
Message 1700686 - Posted: 12 Jul 2015, 9:03:59 UTC

I just caught an orphaned stderr.txt from a double unlink error (Error 32, in my case with v7.6.3).

Complete down to

...
Initial wait: 12 ms
Integration time: 92.403269 s. Average time per iteration = 288.760217 ms
Integral 0 time = 92.714350 s
Running likelihood with 108458 stars
Likelihood time = 2.533832 s
<background_integral> 0.000132372752421 </background_integral>
<stream_integral> 44.152223561035349 202.686531693434380 29.999699899263753 </stream_integral>
<background_likelihood> -3.490802006209325 </background_likelihood>
<stream_only_likelihood> -6.337757282544891 -4.054056095546774 -4.201967829647314 </stream_only_likelihood>
<search_likelihood> -3.041275469646288 </search_likelihood>
09:54:36 (3868): called boinc_finish
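
For reference, Windows error 32 is ERROR_SHARING_VIOLATION: some process still has the file open, which is why the client's unlink attempt fails. A minimal sketch of a delete-with-retry that distinguishes that case; the helper below is hypothetical, not the BOINC client's actual code:

#include <windows.h>

// Hypothetical helper, not BOINC client code: try to delete a file, retrying
// briefly when another process still has it open. Error 32 is
// ERROR_SHARING_VIOLATION, which is what the "unlink() failed" messages on
// stderr.txt usually boil down to on Windows.
bool delete_with_retry(const wchar_t* path, int attempts, DWORD delay_ms) {
    for (int i = 0; i < attempts; ++i) {
        if (DeleteFileW(path)) return true;                // removed OK
        DWORD err = GetLastError();
        if (err == ERROR_FILE_NOT_FOUND) return true;      // already gone
        if (err != ERROR_SHARING_VIOLATION) return false;  // unrelated failure, give up
        Sleep(delay_ms);                                   // file still held open; wait and retry
    }
    return false;                                          // still busy; caller can retry later
}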
ID: 1700686

jason_gee
Message 1700692 - Posted: 12 Jul 2015, 9:59:07 UTC - in response to Message 1700686.  
Last modified: 12 Jul 2015, 10:02:20 UTC

Sweet. Yeah, that places the events after entering boinc_finish(), which fits with the three recommended options to trial, one with source code provided.

As we know, the commode option helps, but I consider it only a partial workaround rather than a thorough treatment of the logic, one that considers multithreading appropriately, as is necessary.

I would like to see the third (uncoded) option, which I labelled 'Best', considered as well, with the caveat that it requires a much better understanding of multithreading, modern operating system fundamentals, and synchronisation than I see throughout the BOINC codebase. Kind of a shame these people don't seem to have contact with their own computer science department, whose work I refer to frequently.

I will be using the 'Best' approach in any custom libraries, as the most robust and flexible option long term, though the midway option I coded, labelled 'Better' (but not 'Best'), should be sufficient for many situations. The deeper option just adds flexibility and practices that can make application development much more civilised down the road. These principles apply to any application running on a multitasking/multithreaded platform, so the arguments that Windows sucks are invalid here (it might, in many ways, but that's not relevant to this discussion).

Options are just that. There's no reason the codebase can't have all of these alongside the existing options, with a simple switch to choose the appropriate one for the application. So it becomes a little baffling to see code like the original left unmaintained when it's the cause of so much more or less random grief, just through a refusal to acknowledge that something doesn't work the way you thought it did.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1700692 · Report as offensive

Richard Haselgrove
Message 1700693 - Posted: 12 Jul 2015, 10:01:29 UTC - in response to Message 1700692.  

> Kind of a shame these people don't seem to have contact with their own computer science department, whose work I refer to frequently.

That could be a very useful suggestion to make to the new BOINC Project Management Committee.
ID: 1700693

jason_gee
Message 1700694 - Posted: 12 Jul 2015, 10:03:02 UTC - in response to Message 1700693.  

>> Kind of a shame these people don't seem to have contact with their own computer science department, whose work I refer to frequently.

> That could be a very useful suggestion to make to the new BOINC Project Management Committee.


Lol, "go read, and keep and open mind"
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1700694 · Report as offensive

Richard Haselgrove
Message 1700698 - Posted: 12 Jul 2015, 10:36:16 UTC

OK, back on topic. On the very weak evidence of an admittedly tiny sample of one, it would appear that the problem isn't one of "incomplete (truncated) write", but instead "incomplete (premature) read".

The API fix will cure that version too, of course, by not telling BOINC there's anything to read until it knows the write has been completed and signed off - so the client won't jump the gun.

So now we have a different problem of asynchronous timing. Which will reach 100% penetration in the BOINC user base first? Will a fixed API reach all the science applications maintained by 'project scientist' users, and will a fixed client reach all 'volunteer cruncher' users?

The answer, of course, is neither. There is a reluctance to embrace the new in both camps.

The BOINC philosophy - back in the days when there was one - was to strive to maintain both backwards and forwards compatibility between project servers and users' clients, as far as is humanly possible. In this context, I'd call the API a server component, because it's managed and deployed at the project level. Ordinary users (developers are different) have no control over the API they're given.

So, I'd suggest that we also need a client-side fix: "Don't read contents until you have exclusive rights". A file lock policy of "write exclusive, read shared" works in most cases, but not here. Placing the fix in the client as well gives back an element of control to that small subset of volunteer users who post in threads like this one, because they notice and are upset by their wasted work. Just while the low, slow migration of the fixed API makes its way to the outer reaches of developer-land.
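
A minimal sketch of what that client-side rule could look like on Windows - the function below is purely illustrative, not the actual client code. It refuses to read slots/N/stderr.txt until it can open the file with sharing disallowed, i.e. until no writer still has it open:

#include <windows.h>
#include <string>

// Hypothetical sketch of "don't read until you have exclusive rights".
// Open the file with no sharing allowed; while any writer still holds it,
// the open fails with ERROR_SHARING_VIOLATION and we wait and retry.
bool read_stderr_when_released(const wchar_t* path, std::string& out,
                               int attempts, DWORD delay_ms) {
    for (int i = 0; i < attempts; ++i) {
        HANDLE h = CreateFileW(path, GENERIC_READ,
                               0,               // no sharing: fail if still open elsewhere
                               NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (h != INVALID_HANDLE_VALUE) {
            char buf[4096];
            DWORD n = 0;
            out.clear();
            while (ReadFile(h, buf, sizeof(buf), &n, NULL) && n > 0)
                out.append(buf, n);
            CloseHandle(h);
            return true;                        // whole file read, with exclusive access
        }
        if (GetLastError() != ERROR_SHARING_VIOLATION) return false;
        Sleep(delay_ms);                        // the writer isn't finished yet; wait
    }
    return false;                               // never got exclusive access; try again later
}

The same effect is available on POSIX systems via advisory locks (flock/fcntl); the sharing-violation retry above simply matches the Windows hosts being reported in this thread.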
ID: 1700698

jason_gee
Message 1700720 - Posted: 12 Jul 2015, 12:10:32 UTC - in response to Message 1700698.  
Last modified: 12 Jul 2015, 12:58:56 UTC

Yeah, I'll have to dip out of the client-side issues at this point, as I'm preparing to start the first sprint of a fairly significant application revision (switching to a more team-oriented development process, a big thing).

I consider a reasonable answer to be that all the components - applications, client, and servers - work together as a single unified system, so they require a cohesive effort. A divide-and-conquer approach there requires the most attention at the interfaces between these subsystems. That includes the state protocols and messages, on top of the API behaviour. [In the case of the servers, it's a virtual representation of your host that goes wacky, due to some incomplete models of how things work and some lax engineering.] Then whatever is internal to a given subsystem can be dealt with individually.

Maintaining backward and forward compatibility where feasible isn't that difficult (version numbers are present), but there are also ways to encourage updates and so reduce the maintenance/helpdesk costs. Actually moving to simplify, streamline and improve quality is a start. You can still build all the non-critical stuff as plugins, rather than break the core client every few weeks.

There are indeed a lot of similar engineering issues I know about with the client, which are also rooted in process and wilful ignorance. But I seriously doubt that the 7 years spent trying to communicate this trivial API patch is really the way to go [for these].

I'll [continue to] demonstrate by doing instead.

[Edit:] Don't count your chickens on one or more of the three suggested API fixes being applied either, or even some variation. They've been shown them before... The usual pattern is to discount them before trying anything. I must have missed that particular "Software Engineering by Wishful Thinking" lecture, I guess... Try any and all available options, and learn what works best.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1700720 · Report as offensive

Richard Haselgrove
Message 1700734 - Posted: 12 Jul 2015, 13:24:27 UTC - in response to Message 1700720.  

> There are indeed a lot of similar engineering issues I know about with the client, which are also rooted in process and wilful ignorance. But I seriously doubt that the 7 years spent trying to communicate this trivial API patch is really the way to go [for these].

The trick with getting BOINC bugs fixed is manifold. First, make absolutely sure that you have understood the whole problem, so the proposed fix doesn't blow up in your face in a case you hadn't considered, or didn't even know about. Then, report it - and if you have the courage of your convictions, keep reporting it until it goes in: don't back off at the first hurdle. And finally, keep monitoring it until and after the fix is deployed.

We've seen all that in "task abandonment", over the last few days. I first narrowed it down to those two calls to mark_results_over() in lines 403 and 426 of handle_request.cpp over two years ago. But I didn't have the courage of my convictions to delve any deeper at that time, and no code-walkers were available to back me up either. Now, we have (collectively) found the answer, proof-read and refined it to the point where we're sure it's safe, and reported it for implementation. My report failed - at least, it wasn't picked up within the timescale we would have liked - but it was accepted at the second attempt. We still await deployment, but I've opened up a second front for that department too, and I have a reasonable expectation that we might be rid of the problem for good next week.
ID: 1700734

jason_gee
Message 1700742 - Posted: 12 Jul 2015, 13:46:04 UTC - in response to Message 1700734.  
Last modified: 12 Jul 2015, 13:49:51 UTC

>> There are indeed a lot of similar engineering issues I know about with the client, which are also rooted in process and wilful ignorance. But I seriously doubt that the 7 years spent trying to communicate this trivial API patch is really the way to go [for these].

> The trick with getting BOINC bugs fixed is manifold. First, make absolutely sure that you have understood the whole problem, so the proposed fix doesn't blow up in your face in a case you hadn't considered, or didn't even know about. Then, report it - and if you have the courage of your convictions, keep reporting it until it goes in: don't back off at the first hurdle. And finally, keep monitoring it until and after the fix is deployed.

> We've seen all that in "task abandonment", over the last few days. I first narrowed it down to those two calls to mark_results_over() in lines 403 and 426 of handle_request.cpp over two years ago. But I didn't have the courage of my convictions to delve any deeper at that time, and no code-walkers were available to back me up either. Now, we have (collectively) found the answer, proof-read and refined it to the point where we're sure it's safe, and reported it for implementation. My report failed - at least, it wasn't picked up within the timescale we would have liked - but it was accepted at the second attempt. We still await deployment, but I've opened up a second front for that department too, and I have a reasonable expectation that we might be rid of the problem for good next week.


Ah hah, right.

That's all well and good, if I gave two hoots about BOINC beyond seeing it as robbing precious development energy from the project I love and support.

BOINC can run their project however they like, of course, but that doesn't mean I have any incentive to burn resources pretending I need to sacrifice my mental health to help the paid members do their jobs, beyond the minimum needed to keep my applications working in support of SETI.

If and when they lean out the management-bloated juggernaut and put some actual software engineers in place, I would consider that a massive step forward. Until that happens, the inefficiency is a huge waste of time.

Naturally, the software engineer in me has been curious about what's broken and why. It's the encrusted barnacles, on the code and on the people, that need chipping off, back to bare metal.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1700742 · Report as offensive

Jeff Buck
Message 1700788 - Posted: 12 Jul 2015, 15:38:01 UTC

I had hoped that my T7400, which I put to work on S@h last night for the first time since November, would generate some of those completely empty Stderrs like it used to, but none so far.

My xw9400, meanwhile, just produced two run-of-the-mill truncated Stderrs, Task 4260496447 and Task 4260549088, similar to what I reported yesterday morning in Message 1700500. Neither one shows any remove or unlink failures in the log.

I also found another AP task, 4260871348, which did have unlink failures (but a complete Stderr) similar to what I also posted yesterday in Message 1700567. This time, though, there were three files that failed: ap_state.dat0, ap_state.dat1, and stderr.txt.

Since the get_free_slot() failed, Slot 16 was brought into play. I had a little monitor running to try to capture the orphaned stderr.txt file when that happened, but it's just a simple timer routine on a 30-second trigger and, unfortunately, Slot 0 got successfully cleaned out for another task 10 seconds after the AP task ended, so the stderr.txt I retrieved was for a newly launched MB task.
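
For anyone who wants to set the same kind of trap, here's a rough sketch of such a watcher. It is not Jeff's actual routine; the slot number, capture folder, and 5-second poll interval are assumptions (a tighter poll than a 30-second trigger gives the orphan less chance to vanish first):

#include <chrono>
#include <filesystem>
#include <iostream>
#include <string>
#include <thread>

namespace fs = std::filesystem;

// Illustrative stand-in for a slot monitor: poll the watched slot every few
// seconds and, whenever a stderr.txt is present, copy it out under a
// timestamped name so an orphaned copy survives the next slot cleanup.
int main() {
    const fs::path slot = "slots/16";             // assumed slot to watch
    const fs::path dest_dir = "captured_stderr";  // hypothetical capture folder
    std::error_code ec;
    fs::create_directories(dest_dir, ec);
    for (;;) {
        if (fs::exists(slot / "stderr.txt", ec)) {
            auto stamp = std::chrono::duration_cast<std::chrono::seconds>(
                std::chrono::system_clock::now().time_since_epoch()).count();
            fs::path dest = dest_dir / ("stderr_" + std::to_string(stamp) + ".txt");
            fs::copy_file(slot / "stderr.txt", dest,
                          fs::copy_options::overwrite_existing, ec);
            if (!ec) std::cout << "captured " << dest << "\n";
        }
        std::this_thread::sleep_for(std::chrono::seconds(5));  // tighter than a 30 s trigger
    }
}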
ID: 1700788

Jeff Buck
Message 1700884 - Posted: 12 Jul 2015, 21:52:38 UTC

Perhaps a little bit off topic, but still seemingly related to the out-of-sync conditions at task termination, are these AP tasks that are having remove/unlink failures. I just caught another one on my xw9400, Task 4262424987, which was unable to clean out ap_state.dat0, ap_state.dat1, and stderr.txt in handle_exited_app(), the same failures as the one in my last post. In this case, however, the get_free_slot() routine did succeed in removing the ap_state.dat0, but still couldn't remove the other two.

This time I did successfully capture the orphaned stderr.txt, but it matches what appears on the task detail page, a complete Stderr. Neither one had any truncation.
ID: 1700884

Jeff Buck
Message 1700952 - Posted: 13 Jul 2015, 3:13:01 UTC

I think Murphy had a hand in this one, Task 4262322967. It's a truncated non-empty Stderr for what appears to be a -9 overflow. At the moment, it's still awaiting the wingman's report, so it's still up in the air as to whether or not it will be marked Invalid.

However, it did have a remove/unlink failure for stderr.txt and I should have been able to capture the orphaned stderr.txt. But...my routine is monitoring Slot 16 and, in this instance, the subsequent task went to Slot 15, since 15 was free due to 16 still being in use following the slot juggle for the AP task in my previous post. Yep, I think that's Murphy's handiwork.

Anyway, since this is the first truncated Stderr that actually had an unlink failure since I turned <task_debug> on, I'll go ahead and include the entire log entry for it. It doesn't look to me as if <task_debug> added anything new to what <slot_debug> was already showing, but perhaps it will be useful to someone else.

12-Jul-2015 16:11:05 [SETI@home] [slot] assigning slot 0 to 01jn15aa.17439.885.438086664200.12.239_1
12-Jul-2015 16:11:05 [---] [slot] removed file slots/0/init_data.xml
12-Jul-2015 16:11:05 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/Lunatics_x41zc_win32_cuda50.exe to slots/0/Lunatics_x41zc_win32_cuda50.exe
12-Jul-2015 16:11:05 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/cudart32_50_35.dll to slots/0/cudart32_50_35.dll
12-Jul-2015 16:11:05 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/cufft32_50_35.dll to slots/0/cufft32_50_35.dll
12-Jul-2015 16:11:05 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/mbcuda.cfg to slots/0/mbcuda.cfg
12-Jul-2015 16:11:05 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/01jn15aa.17439.885.438086664200.12.239 to slots/0/work_unit.sah
12-Jul-2015 16:11:05 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/01jn15aa.17439.885.438086664200.12.239_1_0 to slots/0/result.sah
12-Jul-2015 16:11:05 [---] [slot] removed file slots/0/boinc_temporary_exit
12-Jul-2015 16:11:05 [SETI@home] [task] task_state=EXECUTING for 01jn15aa.17439.885.438086664200.12.239_1 from start
12-Jul-2015 16:11:05 [SETI@home] Starting task 01jn15aa.17439.885.438086664200.12.239_1
12-Jul-2015 16:11:05 [SETI@home] [cpu_sched] Starting task 01jn15aa.17439.885.438086664200.12.239_1 using setiathome_v7 version 700 (cuda50) in slot 0
12-Jul-2015 16:11:07 [SETI@home] Started upload of 01jn15ab.4387.89556.438086664198.12.189_0_0
12-Jul-2015 16:11:09 [SETI@home] Finished upload of 01jn15ab.4387.89556.438086664198.12.189_0_0
12-Jul-2015 16:11:09 [---] [slot] removed file projects/setiathome.berkeley.edu/01jn15ab.4387.89556.438086664198.12.189_0_0
12-Jul-2015 16:11:09 [---] [slot] removed file projects/setiathome.berkeley.edu/01jn15ab.4387.89556.438086664198.12.189_0_0.gz
12-Jul-2015 16:11:09 [---] [slot] removed file projects/setiathome.berkeley.edu/01jn15ab.4387.89556.438086664198.12.189_0_0.gzt
12-Jul-2015 16:11:09 [SETI@home] [task] result state=FILES_UPLOADED for 01jn15ab.4387.89556.438086664198.12.189_0 from CS::update_results
12-Jul-2015 16:11:19 [SETI@home] [task] Process for 01jn15aa.17439.885.438086664200.12.239_1 exited, exit code 0, task state 1
12-Jul-2015 16:11:19 [SETI@home] [task] task_state=EXITED for 01jn15aa.17439.885.438086664200.12.239_1 from handle_exited_app
12-Jul-2015 16:11:19 [---] [slot] cleaning out slots/0: handle_exited_app()
12-Jul-2015 16:11:19 [---] [slot] removed file slots/0/boinc_finish_called
12-Jul-2015 16:11:19 [---] [slot] removed file slots/0/cudart32_50_35.dll
12-Jul-2015 16:11:19 [---] [slot] removed file slots/0/cufft32_50_35.dll
12-Jul-2015 16:11:19 [---] [slot] removed file slots/0/init_data.xml
12-Jul-2015 16:11:19 [---] [slot] removed file slots/0/Lunatics_x41zc_win32_cuda50.exe
12-Jul-2015 16:11:19 [---] [slot] removed file slots/0/mbcuda.cfg
12-Jul-2015 16:11:19 [---] [slot] removed file slots/0/result.sah
12-Jul-2015 16:11:19 [---] [slot] removed file slots/0/state.sah
12-Jul-2015 16:11:19 [---] [slot] failed to remove file slots/0/stderr.txt: unlink() failed
12-Jul-2015 16:11:19 [---] [slot] removed file slots/0/work_unit.sah
12-Jul-2015 16:11:19 [SETI@home] Computation for task 01jn15aa.17439.885.438086664200.12.239_1 finished
12-Jul-2015 16:11:19 [SETI@home] [task] result state=FILES_UPLOADING for 01jn15aa.17439.885.438086664200.12.239_1 from CS::app_finished
12-Jul-2015 16:11:19 [---] [slot] cleaning out slots/0: get_free_slot()
12-Jul-2015 16:11:19 [---] [slot] failed to remove file slots/0/stderr.txt: unlink() failed
12-Jul-2015 16:11:19 [SETI@home] [slot] failed to clean out dir: unlink() failed

12-Jul-2015 16:11:19 [---] [slot] cleaning out slots/15: get_free_slot()
12-Jul-2015 16:11:19 [SETI@home] [slot] assigning slot 15 to 01jn15ab.15470.638.438086664199.12.145_0
12-Jul-2015 16:11:19 [---] [slot] removed file slots/15/init_data.xml
12-Jul-2015 16:11:19 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/Lunatics_x41zc_win32_cuda50.exe to slots/15/Lunatics_x41zc_win32_cuda50.exe
12-Jul-2015 16:11:19 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/cudart32_50_35.dll to slots/15/cudart32_50_35.dll
12-Jul-2015 16:11:19 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/cufft32_50_35.dll to slots/15/cufft32_50_35.dll
12-Jul-2015 16:11:19 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/mbcuda.cfg to slots/15/mbcuda.cfg
12-Jul-2015 16:11:19 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/01jn15ab.15470.638.438086664199.12.145 to slots/15/work_unit.sah
12-Jul-2015 16:11:19 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/01jn15ab.15470.638.438086664199.12.145_0_0 to slots/15/result.sah
12-Jul-2015 16:11:19 [---] [slot] removed file slots/15/boinc_temporary_exit
12-Jul-2015 16:11:19 [SETI@home] [task] task_state=EXECUTING for 01jn15ab.15470.638.438086664199.12.145_0 from start
12-Jul-2015 16:11:19 [SETI@home] Starting task 01jn15ab.15470.638.438086664199.12.145_0
12-Jul-2015 16:11:19 [SETI@home] [cpu_sched] Starting task 01jn15ab.15470.638.438086664199.12.145_0 using setiathome_v7 version 700 (cuda50) in slot 15
12-Jul-2015 16:11:21 [SETI@home] Started upload of 01jn15aa.17439.885.438086664200.12.239_1_0
12-Jul-2015 16:11:24 [SETI@home] Finished upload of 01jn15aa.17439.885.438086664200.12.239_1_0
12-Jul-2015 16:11:24 [---] [slot] removed file projects/setiathome.berkeley.edu/01jn15aa.17439.885.438086664200.12.239_1_0
12-Jul-2015 16:11:24 [---] [slot] removed file projects/setiathome.berkeley.edu/01jn15aa.17439.885.438086664200.12.239_1_0.gz
12-Jul-2015 16:11:24 [---] [slot] removed file projects/setiathome.berkeley.edu/01jn15aa.17439.885.438086664200.12.239_1_0.gzt
12-Jul-2015 16:11:24 [SETI@home] [task] result state=FILES_UPLOADED for 01jn15aa.17439.885.438086664200.12.239_1 from CS::update_results
ID: 1700952

Jeff Buck
Message 1701079 - Posted: 13 Jul 2015, 15:26:42 UTC

Another slight variation on the AP remove/unlink failures this morning, this time on my T7400 (so these aren't limited to one machine or OS). Once again, no Stderr truncation even though handle_exited_app() couldn't remove the ap_state.dat0, ap_state.dat1, and stderr.txt. However, this time the get_free_slot(), with the same timestamp but obviously a fraction of a second later, succeeded in removing both the ap_state.dat0 and stderr.txt, but not the ap_state.dat1. In this scenario, there was no orphaned stderr.txt to capture.

So, what sort of timing differences allow the AP app to have these occasional remove/unlink failures on the stderr.txt file, yet not exhibit any Stderr truncation, while the MB apps experience the truncations but may or may not actually have associated remove/unlink failures? Does that point more toward application behavior rather than BOINC behavior?

BTW, absolutely no Stderr truncations from any of my machines overnight, an unusual occurrence.
ID: 1701079

Richard Haselgrove
Message 1701089 - Posted: 13 Jul 2015, 16:12:17 UTC - in response to Message 1701079.  

> Another slight variation on the AP remove/unlink failures this morning, this time on my T7400 (so these aren't limited to one machine or OS). Once again, no Stderr truncation even though handle_exited_app() couldn't remove the ap_state.dat0, ap_state.dat1, and stderr.txt. However, this time the get_free_slot(), with the same timestamp but obviously a fraction of a second later, succeeded in removing both the ap_state.dat0 and stderr.txt, but not the ap_state.dat1. In this scenario, there was no orphaned stderr.txt to capture.

> So, what sort of timing differences allow the AP app to have these occasional remove/unlink failures on the stderr.txt file, yet not exhibit any Stderr truncation, while the MB apps experience the truncations but may or may not actually have associated remove/unlink failures? Does that point more toward application behavior rather than BOINC behavior?

> BTW, absolutely no Stderr truncations from any of my machines overnight, an unusual occurrence.

I still think that you need to consider the possibility that the 'truncation' is actually an 'incomplete read' - i.e. the whole thing gets written (eventually), but the BOINC client is too impatient and reads it before it's fully ready. So it misses the last part, which hasn't arrived yet.
ID: 1701089

Jeff Buck
Message 1701094 - Posted: 13 Jul 2015, 16:30:18 UTC - in response to Message 1701089.  
Last modified: 13 Jul 2015, 16:30:50 UTC

>> Another slight variation on the AP remove/unlink failures this morning, this time on my T7400 (so these aren't limited to one machine or OS). Once again, no Stderr truncation even though handle_exited_app() couldn't remove the ap_state.dat0, ap_state.dat1, and stderr.txt. However, this time the get_free_slot(), with the same timestamp but obviously a fraction of a second later, succeeded in removing both the ap_state.dat0 and stderr.txt, but not the ap_state.dat1. In this scenario, there was no orphaned stderr.txt to capture.

>> So, what sort of timing differences allow the AP app to have these occasional remove/unlink failures on the stderr.txt file, yet not exhibit any Stderr truncation, while the MB apps experience the truncations but may or may not actually have associated remove/unlink failures? Does that point more toward application behavior rather than BOINC behavior?

>> BTW, absolutely no Stderr truncations from any of my machines overnight, an unusual occurrence.

> I still think that you need to consider the possibility that the 'truncation' is actually an 'incomplete read' - i.e. the whole thing gets written (eventually), but the BOINC client is too impatient and reads it before it's fully ready. So it misses the last part, which hasn't arrived yet.

Sure, but is the BOINC client actually impatient, or is the science app telling BOINC that it's all done before it really is? Which side needs to verify that all the writing is complete? It seems like perhaps what should be a firm handshake between the science app and BOINC is more of a casual fist bump. ;^)
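
For illustration, the application side of a firmer handshake could look something like the sketch below: finish and commit the stderr.txt write before creating boinc_finish_called, the marker file the client looks for (it shows up in the slot logs above as slots/N/boinc_finish_called). This is a simplified, hypothetical sketch, not the real boinc_finish() implementation:

#include <cstdio>
#ifdef _WIN32
#include <io.h>        // _commit, _fileno
#else
#include <unistd.h>    // fsync
#endif

// Illustrative only: make sure everything written to stderr has reached the
// disk before the "finished" marker exists, so the client can't read a
// half-written stderr.txt.
void finish_and_signal(int exit_status) {
    std::fflush(stderr);                 // push any buffered stderr output to the OS
#ifdef _WIN32
    _commit(_fileno(stderr));            // ask the OS to put it on disk
#else
    fsync(fileno(stderr));
#endif
    if (FILE* f = std::fopen("boinc_finish_called", "w")) {
        std::fprintf(f, "%d\n", exit_status);
        std::fflush(f);
#ifdef _WIN32
        _commit(_fileno(f));
#else
        fsync(fileno(f));
#endif
        std::fclose(f);                  // only now is there anything for the client to act on
    }
}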
ID: 1701094

jason_gee
Message 1701097 - Posted: 13 Jul 2015, 16:40:25 UTC - in response to Message 1701094.  

The answer probably can't be that definite, and it depends on which end you're standing at. From wherever you look, it will seem like the other end is broken, when the reality is probably that neither is behaving ideally.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1701097 · Report as offensive

Richard Haselgrove
Message 1701098 - Posted: 13 Jul 2015, 16:42:54 UTC - in response to Message 1701094.  

>>> Another slight variation on the AP remove/unlink failures this morning, this time on my T7400 (so these aren't limited to one machine or OS). Once again, no Stderr truncation even though handle_exited_app() couldn't remove the ap_state.dat0, ap_state.dat1, and stderr.txt. However, this time the get_free_slot(), with the same timestamp but obviously a fraction of a second later, succeeded in removing both the ap_state.dat0 and stderr.txt, but not the ap_state.dat1. In this scenario, there was no orphaned stderr.txt to capture.

>>> So, what sort of timing differences allow the AP app to have these occasional remove/unlink failures on the stderr.txt file, yet not exhibit any Stderr truncation, while the MB apps experience the truncations but may or may not actually have associated remove/unlink failures? Does that point more toward application behavior rather than BOINC behavior?

>>> BTW, absolutely no Stderr truncations from any of my machines overnight, an unusual occurrence.

>> I still think that you need to consider the possibility that the 'truncation' is actually an 'incomplete read' - i.e. the whole thing gets written (eventually), but the BOINC client is too impatient and reads it before it's fully ready. So it misses the last part, which hasn't arrived yet.

> Sure, but is the BOINC client actually impatient, or is the science app telling BOINC that it's all done before it really is? Which side needs to verify that all the writing is complete? It seems like perhaps what should be a firm handshake between the science app and BOINC is more of a casual fist bump. ;^)

Good analogy. If even one of them had a good sense of timing, it would be better. But at the moment, both of them need to be told to chill, slow down a bit. Wait for the gun, or it's a false start.
ID: 1701098