Stderr Truncations

Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1700626 - Posted: 11 Jul 2015, 23:07:25 UTC - in response to Message 1700621.  

Makes sense. Heck if all I have to do is set the flag to eliminate the invalids and suffer a bigger logfile ... I'll do it. At least until the devs can figure out the proper solution in the BOINC client code.
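For reference, the flag in question lives in the client's cc_config.xml under log_flags; a minimal sketch (both <task_debug> and <slot_debug> are standard BOINC client log flags, enabled here together since both show up in the logs later in this thread):

```xml
<cc_config>
  <log_flags>
    <task_debug>1</task_debug>
    <slot_debug>1</slot_debug>
  </log_flags>
</cc_config>
```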
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1700626
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1700629 - Posted: 11 Jul 2015, 23:58:19 UTC

Of course I jinxed it. Got an invalid Task 1184614368 after I had set the task_debug flag.

6817 Milkyway@Home 7/11/2015 4:15:21 PM [cpu_sched] Starting task de_sum_fast_15_3s_136_sim1Jun1_4_1434554402_8848733_1 using milkyway_separation__modified_fit version 136 (opencl_nvidia_101) in slot 0
6818 Milkyway@Home 7/11/2015 4:15:23 PM Sending scheduler request: To fetch work.
6819 Milkyway@Home 7/11/2015 4:15:23 PM Reporting 1 completed tasks
6820 Milkyway@Home 7/11/2015 4:15:23 PM Requesting new tasks for NVIDIA GPU
6822 Milkyway@Home 7/11/2015 4:15:24 PM Scheduler request completed: got 1 new tasks
6823 Milkyway@Home 7/11/2015 4:15:24 PM [task] result state=NEW for de_80_DR8_Rev_8_5_00004_1434551187_12091972_0 from handle_scheduler_reply
6824 7/11/2015 4:15:24 PM [slot] removed file projects/milkyway.cs.rpi.edu_milkyway/de_80_DR8_Rev_8_5_00004_1434551187_12091972_0_0
6825 7/11/2015 4:15:24 PM [slot] removed file projects/milkyway.cs.rpi.edu_milkyway/de_80_DR8_Rev_8_5_00004_1434551187_12091972_0_0.gz
6826 7/11/2015 4:15:24 PM [slot] removed file projects/milkyway.cs.rpi.edu_milkyway/de_80_DR8_Rev_8_5_00004_1434551187_12091972_0_0.gzt
6827 Milkyway@Home 7/11/2015 4:15:25 PM [task] result state=FILES_DOWNLOADING for de_80_DR8_Rev_8_5_00004_1434551187_12091972_0 from CS::update_results
6828 Milkyway@Home 7/11/2015 4:15:26 PM [task] result state=FILES_DOWNLOADED for de_80_DR8_Rev_8_5_00004_1434551187_12091972_0 from CS::update_results
6838 Milkyway@Home 7/11/2015 4:16:11 PM [task] Process for de_sum_fast_15_3s_136_sim1Jun1_4_1434554402_8848733_1 exited, exit code 0, task state 1
6839 Milkyway@Home 7/11/2015 4:16:11 PM [task] task_state=EXITED for de_sum_fast_15_3s_136_sim1Jun1_4_1434554402_8848733_1 from handle_exited_app
6840 7/11/2015 4:16:11 PM [slot] cleaning out slots/0: handle_exited_app()
6841 7/11/2015 4:16:11 PM [slot] removed file slots/0/astronomy_parameters.txt
6842 7/11/2015 4:16:11 PM [slot] removed file slots/0/boinc_finish_called
6843 7/11/2015 4:16:11 PM [slot] removed file slots/0/init_data.xml
6844 7/11/2015 4:16:11 PM [slot] removed file slots/0/milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe
6845 7/11/2015 4:16:11 PM [slot] removed file slots/0/separation_checkpoint
6846 7/11/2015 4:16:11 PM [slot] removed file slots/0/stars.txt
6847 7/11/2015 4:16:11 PM [slot] failed to remove file slots/0/stderr.txt: unlink() failed
6848 Milkyway@Home 7/11/2015 4:16:11 PM Computation for task de_sum_fast_15_3s_136_sim1Jun1_4_1434554402_8848733_1 finished
6849 Milkyway@Home 7/11/2015 4:16:11 PM [task] result state=FILES_UPLOADING for de_sum_fast_15_3s_136_sim1Jun1_4_1434554402_8848733_1 from CS::app_finished
6850 Milkyway@Home 7/11/2015 4:16:11 PM [task] result state=FILES_UPLOADED for de_sum_fast_15_3s_136_sim1Jun1_4_1434554402_8848733_1 from CS::update_results
6851 7/11/2015 4:16:11 PM [slot] cleaning out slots/0: get_free_slot()
6852 7/11/2015 4:16:11 PM [slot] removed file slots/0/stderr.txt
6853 Milkyway@Home 7/11/2015 4:16:11 PM [slot] assigning slot 0 to de_80_DR8_Rev_8_5_00004_1434551187_11984805_2
6854 7/11/2015 4:16:11 PM [slot] removed file slots/0/init_data.xml
6855 Milkyway@Home 7/11/2015 4:16:11 PM [slot] linked ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_separation_1.02_windows_x86_64__opencl_nvidia.exe to slots/0/milkyway_separation_1.02_windows_x86_64__opencl_nvidia.exe
6856 Milkyway@Home 7/11/2015 4:16:11 PM [slot] linked ../../projects/milkyway.cs.rpi.edu_milkyway/80_rev_8_5.prmtrs to slots/0/astronomy_parameters.txt
6857 Milkyway@Home 7/11/2015 4:16:11 PM [slot] linked ../../projects/milkyway.cs.rpi.edu_milkyway/80_Rev_8_3.stars to slots/0/stars.txt
6858 7/11/2015 4:16:11 PM [slot] removed file slots/0/boinc_temporary_exit
6859 Milkyway@Home 7/11/2015 4:16:11 PM [task] task_state=EXECUTING for de_80_DR8_Rev_8_5_00004_1434551187_11984805_2 from start
6860 Milkyway@Home 7/11/2015 4:16:11 PM Starting task de_80_DR8_Rev_8_5_00004_1434551187_11984805_2
6861 Milkyway@Home 7/11/2015 4:16:11 PM [cpu_sched] Starting task de_80_DR8_Rev_8_5_00004_1434551187_11984805_2 using milkyway version 102 (opencl_nvidia) in slot 0
ID: 1700629
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1700634 - Posted: 12 Jul 2015, 0:44:15 UTC

And another one Task 1184683992.


9330 7/11/2015 5:21:14 PM [slot] cleaning out slots/1: get_free_slot()
9331 Milkyway@Home 7/11/2015 5:21:14 PM [slot] assigning slot 1 to de_fast_15_3s_136_sim1Jun1_1_1434554402_8887629_0
9332 7/11/2015 5:21:14 PM [slot] removed file slots/1/init_data.xml
9333 Milkyway@Home 7/11/2015 5:21:14 PM [slot] linked ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe to slots/1/milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe
9334 Milkyway@Home 7/11/2015 5:21:14 PM [slot] linked ../../projects/milkyway.cs.rpi.edu_milkyway/parameters-15-3s-sim-fast.txt to slots/1/astronomy_parameters.txt
9335 Milkyway@Home 7/11/2015 5:21:14 PM [slot] linked ../../projects/milkyway.cs.rpi.edu_milkyway/stars-15-sim-1Jun1.txt to slots/1/stars.txt
9336 7/11/2015 5:21:14 PM [slot] removed file slots/1/boinc_temporary_exit
9337 Milkyway@Home 7/11/2015 5:21:14 PM [task] task_state=EXECUTING for de_fast_15_3s_136_sim1Jun1_1_1434554402_8887629_0 from start
9338 Milkyway@Home 7/11/2015 5:21:14 PM Starting task de_fast_15_3s_136_sim1Jun1_1_1434554402_8887629_0
9339 Milkyway@Home 7/11/2015 5:21:14 PM [cpu_sched] Starting task de_fast_15_3s_136_sim1Jun1_1_1434554402_8887629_0 using milkyway_separation__modified_fit version 136 (opencl_nvidia_101) in slot 1
9443 Milkyway@Home 7/11/2015 5:22:05 PM [task] Process for de_fast_15_3s_136_sim1Jun1_1_1434554402_8887629_0 exited, exit code 0, task state 1
9444 Milkyway@Home 7/11/2015 5:22:05 PM [task] task_state=EXITED for de_fast_15_3s_136_sim1Jun1_1_1434554402_8887629_0 from handle_exited_app
9445 7/11/2015 5:22:05 PM [slot] cleaning out slots/1: handle_exited_app()
9446 7/11/2015 5:22:05 PM [slot] removed file slots/1/astronomy_parameters.txt
9447 7/11/2015 5:22:05 PM [slot] removed file slots/1/boinc_finish_called
9448 7/11/2015 5:22:05 PM [slot] removed file slots/1/init_data.xml
9449 7/11/2015 5:22:05 PM [slot] removed file slots/1/milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe
9450 7/11/2015 5:22:05 PM [slot] removed file slots/1/separation_checkpoint
9451 7/11/2015 5:22:05 PM [slot] removed file slots/1/stars.txt
9452 7/11/2015 5:22:05 PM [slot] failed to remove file slots/1/stderr.txt: unlink() failed
9453 Milkyway@Home 7/11/2015 5:22:05 PM Computation for task de_fast_15_3s_136_sim1Jun1_1_1434554402_8887629_0 finished
9454 Milkyway@Home 7/11/2015 5:22:05 PM [task] result state=FILES_UPLOADING for de_fast_15_3s_136_sim1Jun1_1_1434554402_8887629_0 from CS::app_finished
9455 Milkyway@Home 7/11/2015 5:22:05 PM [task] result state=FILES_UPLOADED for de_fast_15_3s_136_sim1Jun1_1_1434554402_8887629_0 from CS::update_results
9456 7/11/2015 5:22:05 PM [slot] cleaning out slots/1: get_free_slot()
9457 7/11/2015 5:22:05 PM [slot] failed to remove file slots/1/stderr.txt: unlink() failed
9458 Milkyway@Home 7/11/2015 5:22:05 PM [slot] failed to clean out dir: unlink() failed
9501 7/11/2015 5:22:57 PM [slot] cleaning out slots/1: get_free_slot()
9502 7/11/2015 5:22:57 PM [slot] removed file slots/1/stderr.txt
9504 7/11/2015 5:22:57 PM [slot] removed file slots/1/init_data.xml
9511 7/11/2015 5:22:57 PM [slot] removed file slots/1/boinc_temporary_exit
ID: 1700634
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1700677 - Posted: 12 Jul 2015, 6:59:31 UTC - in response to Message 1700634.  

hmmm, well there's your problem, the Cat's dead.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1700677
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1700686 - Posted: 12 Jul 2015, 9:03:59 UTC

I just caught an orphaned stderr.txt from a double unlink error (Error 32, in my case with v7.6.3).

Complete down to

...
Initial wait: 12 ms
Integration time: 92.403269 s. Average time per iteration = 288.760217 ms
Integral 0 time = 92.714350 s
Running likelihood with 108458 stars
Likelihood time = 2.533832 s
<background_integral> 0.000132372752421 </background_integral>
<stream_integral> 44.152223561035349 202.686531693434380 29.999699899263753 </stream_integral>
<background_likelihood> -3.490802006209325 </background_likelihood>
<stream_only_likelihood> -6.337757282544891 -4.054056095546774 -4.201967829647314 </stream_only_likelihood>
<search_likelihood> -3.041275469646288 </search_likelihood>
09:54:36 (3868): called boinc_finish
ID: 1700686
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1700692 - Posted: 12 Jul 2015, 9:59:07 UTC - in response to Message 1700686.  
Last modified: 12 Jul 2015, 10:02:20 UTC

Sweet. Yeah, that places the events after entering boinc_finish(), which fits with the 3 recommended options to trial, 1 of them with source code provided.

As we know, the commode option helps, but I consider it only a partial workaround rather than a thorough treatment of the logic, one that considers multithreading appropriately and necessarily.

I would like to see the third (uncoded) option, which I labelled 'Best', considered as well, with the caveat that it requires a much better understanding of multithreading, modern operating system fundamentals, and synchronisation than I see throughout the Boinc codebase. Kind of a shame these people don't seem to have contact with their own computer science department, whose work I refer to frequently.

I will be using the 'Best' approach in any custom libraries, as the most robust and flexible option long term, though the midway option I coded, labelled 'Better' (but not Best), should be sufficient for many situations. The deeper option just adds flexibility and practices that can make application development much more civilised down the road. These principles apply to any application running on a multitasking/multithreaded platform, so the arguments that Windows sucks are invalid (it might in many ways, but that's not relevant to this discussion).

Options are just that. There's no reason the codebase can't have all of them, plus the existing behaviour, with a simple switch to choose the appropriate one for the application. So it becomes a little baffling when you see code like the original left unmaintained, when it's the cause of so much more or less random grief, just through a refusal to acknowledge that something doesn't work the way you thought it did.
ID: 1700692
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1700693 - Posted: 12 Jul 2015, 10:01:29 UTC - in response to Message 1700692.  

Kind of a shame these people don't seem to have contact with their own computer science department, whose work I refer to frequently.

That could be a very useful suggestion to make to the new BOINC Project Management Committee.
ID: 1700693
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1700694 - Posted: 12 Jul 2015, 10:03:02 UTC - in response to Message 1700693.  

Kind of a shame these people don't seem to have contact with their own computer science department, whose work I refer to frequently.

That could be a very useful suggestion to make to the new BOINC Project Management Committee.


Lol, "go read, and keep an open mind".
ID: 1700694
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1700698 - Posted: 12 Jul 2015, 10:36:16 UTC

OK, back on topic. On the very weak evidence of an admittedly tiny sample of one, it would appear that the problem isn't one of "incomplete (truncated) write", but instead "incomplete (premature) read".

The API fix will cure that version too, of course, by not telling BOINC there's anything to read until it knows the write has been completed and signed off - so the client won't jump the gun.
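That ordering can be sketched roughly as follows (an illustrative Python stand-in, not the actual C++ API code; `finish_app` is a hypothetical name, though `stderr.txt` and the `boinc_finish_called` marker both appear in the logs above): flush and fsync stderr.txt completely before creating the completion marker, so a client that gates its read on the marker can never see a half-written file.

```python
import os

def finish_app(slot_dir, stderr_text):
    """Flush stderr.txt to disk *before* publishing boinc_finish_called,
    so the completion signal never precedes the finished write."""
    stderr_path = os.path.join(slot_dir, "stderr.txt")
    with open(stderr_path, "w") as f:
        f.write(stderr_text)
        f.flush()
        os.fsync(f.fileno())  # data is on disk before we signal completion
    # The marker is created last; a client that waits for this file
    # cannot observe a partially written stderr.txt.
    with open(os.path.join(slot_dir, "boinc_finish_called"), "w") as f:
        f.write("1\n")
```

The essential design point is only the ordering: write, flush, close, then signal.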

So now we have a different problem of asynchronous timing. Which will reach 100% penetration in the BOINC user base first? Will a fixed API reach all the science applications maintained by 'project scientist' users, and will a fixed client reach all 'volunteer cruncher' users?

The answer, of course, is neither. There is a reluctance to embrace the new in both camps.

The BOINC philosophy - back in the days when there was one - was to strive to maintain both backwards and forwards compatibility between project servers and users' clients, as far as is humanly possible. In this context, I'd call the API a server component, because it's managed and deployed at the project level. Ordinary users (developers are different) have no control over the API they're given.

So, I'd suggest that we also need a client-side fix: "Don't read contents until you have exclusive rights". A file lock policy of "write exclusive, read shared" works in most cases, but not here. Placing the fix in the client as well gives back an element of control to that small subset of volunteer users who post in threads like this one, because they notice and are upset by their wasted work. Just while the low, slow, migration of the fixed API makes its way to the outer reaches of developer-land.
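One way to sketch such a client-side guard (illustrative Python, not the actual BOINC C++ client; the size-stability polling used here is an assumed stand-in for a proper exclusive-open or file-lock check):

```python
import os
import time

def read_stderr_when_stable(slot_dir, retries=5, delay=0.1):
    """Delay reading slots/N/stderr.txt until the file looks finished:
    here, until its size holds steady across two consecutive polls."""
    path = os.path.join(slot_dir, "stderr.txt")
    last_size = -1
    for _ in range(retries):
        try:
            size = os.path.getsize(path)
        except OSError:
            size = -1  # file not present yet (or briefly inaccessible)
        if size >= 0 and size == last_size:
            with open(path) as f:
                return f.read()
        last_size = size
        time.sleep(delay)
    # Give up waiting and return whatever is present; the caller can
    # log that the contents may be incomplete.
    with open(path) as f:
        return f.read()
```

A real fix would attempt an exclusive open (deny-share on Windows) and retry on failure, which directly detects "writer still has the file open" rather than inferring it.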
ID: 1700698
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1700720 - Posted: 12 Jul 2015, 12:10:32 UTC - in response to Message 1700698.  
Last modified: 12 Jul 2015, 12:58:56 UTC

Yeah, I'll have to dip out of the client-side issues at this point, as I'm preparing to start a fairly significant application revision first sprint (switching to a more team-oriented development process, a big thing).

I consider the reasonable answer to be that all the components, i.e. applications, client, and servers, work together as a single unified system, so they require a cohesive effort. The divide-and-conquer approach there requires the most attention to the interfaces between these subsystems. That includes the state protocols and messages, on top of the API behaviour. [In the case of the servers, it's a virtual representation of your host that goes wacky, due to some incomplete models of how things work and some lax engineering.] Then whatever is internal to a given subsystem can be dealt with individually.

Maintaining backward and forward compatibility where feasible isn't that difficult (version numbers are present), but there are also ways to encourage updates, to reduce the maintenance/helpdesk costs as well. Actually moving to simplify/streamline and improve quality is a start. You can still make all the non-critical stuff plugins, rather than break the core client every few weeks.

There are indeed a lot of issues I know about with respect to similar engineering problems in the client, which are also process- and wilful-ignorance-based. But I seriously doubt that the 7 years spent trying to communicate this trivial API patch is really the way to go [for these].

I'll [continue to] demonstrate by doing instead.

[Edit:] Don't count your chickens on one or more of the three suggested API fixes being applied either, or even some variation. They've been shown them before... The usual pattern is to discount an idea before trying anything. I must have missed that particular "Software Engineering by Wishful Thinking" lecture, I guess... Try any and all available options, and so learn what works best.
ID: 1700720
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1700734 - Posted: 12 Jul 2015, 13:24:27 UTC - in response to Message 1700720.  

There are indeed a lot of issues I know about with respect to similar engineering problems in the client, which are also process- and wilful-ignorance-based. But I seriously doubt that the 7 years spent trying to communicate this trivial API patch is really the way to go [for these].

The trick with getting BOINC bugs fixed is manifold. First, make absolutely sure that you have understood the whole problem, so the proposed fix doesn't blow up in your face in a case you hadn't considered, or didn't even know about. Then, report it - and if you have the courage of your convictions, keep reporting it until it goes in: don't back off at the first hurdle. And finally, keep monitoring it until and after the fix is deployed.

We've seen all that in "task abandonment", over the last few days. I first narrowed it down to those two calls to mark_results_over() in lines 403 and 426 of handle_request.cpp over two years ago. But I didn't have the courage of my convictions to delve any deeper at that time, and no code-walkers were available to back me up either. Now, we have (collectively) found the answer, proof-read and refined it to the point where we're sure it's safe, and reported it for implementation. My report failed - at least, it wasn't picked up within the timescale we would have liked - but it was accepted at the second attempt. We still await deployment, but I've opened up a second front for that department too, and I have a reasonable expectation that we might be rid of the problem for good next week.
ID: 1700734
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1700742 - Posted: 12 Jul 2015, 13:46:04 UTC - in response to Message 1700734.  
Last modified: 12 Jul 2015, 13:49:51 UTC

There are indeed a lot of issues I know about with respect to similar engineering problems in the client, which are also process- and wilful-ignorance-based. But I seriously doubt that the 7 years spent trying to communicate this trivial API patch is really the way to go [for these].

The trick with getting BOINC bugs fixed is manifold. First, make absolutely sure that you have understood the whole problem, so the proposed fix doesn't blow up in your face in a case you hadn't considered, or didn't even know about. Then, report it - and if you have the courage of your convictions, keep reporting it until it goes in: don't back off at the first hurdle. And finally, keep monitoring it until and after the fix is deployed.

We've seen all that in "task abandonment", over the last few days. I first narrowed it down to those two calls to mark_results_over() in lines 403 and 426 of handle_request.cpp over two years ago. But I didn't have the courage of my convictions to delve any deeper at that time, and no code-walkers were available to back me up either. Now, we have (collectively) found the answer, proof-read and refined it to the point where we're sure it's safe, and reported it for implementation. My report failed - at least, it wasn't picked up within the timescale we would have liked - but it was accepted at the second attempt. We still await deployment, but I've opened up a second front for that department too, and I have a reasonable expectation that we might be rid of the problem for good next week.


ah hah, right.

That's all well and good, if I gave two hoots about Boinc beyond seeing it as robbing precious development energies from the project I love and support.

Boinc can run their project however they like, of course, but that doesn't mean I have any incentive to burn resources pretending I actually need to sacrifice my mental health to help the paid members do their jobs, beyond the minimum possible to keep my applications working in support of Seti.

If and when they lean out the management-bloated juggernaut and put some actual software engineers in place, I would consider that a massive step forward. Until that happens, the inefficiency is a huge waste of time.

Naturally, the software-engineer part of me has been curious about what's broken and why. It's the encrusted barnacles, on the code and the people, that need chipping off, back to bare metal.
ID: 1700742
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1700788 - Posted: 12 Jul 2015, 15:38:01 UTC

I had hoped that my T7400, which I put to work on S@h last night for the first time since November, would generate some of those completely empty Stderrs like it used to, but none so far.

My xw9400, meanwhile, just produced two run-of-the-mill truncated Stderrs, Task 4260496447 and Task 4260549088, similar to what I reported yesterday morning in Message 1700500. Neither one shows any remove or unlink failures in the log.

I also found another AP task, 4260871348, which did have unlink failures (but a complete Stderr) similar to what I also posted yesterday in Message 1700567. This time, though, there were three files that failed: ap_state.dat0, ap_state.dat1, and stderr.txt.

Since the get_free_slot() failed, Slot 16 was brought into play. I had a little monitor running to try to capture the orphaned stderr.txt file when that happened, but it's just a simple timer routine on a 30-second trigger and, unfortunately, Slot 0 got successfully cleaned out for another task 10 seconds after the AP task ended, so the stderr.txt I retrieved was for a newly launched MB task.
ID: 1700788
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1700884 - Posted: 12 Jul 2015, 21:52:38 UTC

Perhaps a little bit off topic, but still seemingly related to the out-of-sync conditions at task termination, are these AP tasks that are having remove/unlink failures. I just caught another one on my xw9400, Task 4262424987, which was unable to clean out ap_state.dat0, ap_state.dat1, and stderr.txt in handle_exited_app(), the same failures as the one in my last post. In this case, however, the get_free_slot() routine did succeed in removing the ap_state.dat0, but still couldn't remove the other two.

This time I did successfully capture the orphaned stderr.txt, but it matches what appears on the task detail page, a complete Stderr. Neither one had any truncation.
ID: 1700884
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1700952 - Posted: 13 Jul 2015, 3:13:01 UTC

I think Murphy had a hand in this one, Task 4262322967. It's a truncated non-empty Stderr for what appears to be a -9 overflow. At the moment, it's still awaiting the wingman's report, so it's still up in the air as to whether or not it will be marked Invalid.

However, it did have a remove/unlink failure for stderr.txt and I should have been able to capture the orphaned stderr.txt. But...my routine is monitoring Slot 16 and, in this instance, the subsequent task went to Slot 15, since 15 was free due to 16 still being in use following the slot juggle for the AP task in my previous post. Yep, I think that's Murphy's handiwork.

Anyway, since this is the first truncated Stderr that actually had an unlink failure since I turned <task_debug> on, I'll go ahead and include the entire log entry for it. It doesn't look to me as if <task_debug> added anything new to what <slot_debug> was already showing, but perhaps it will be useful to someone else.

12-Jul-2015 16:11:05 [SETI@home] [slot] assigning slot 0 to 01jn15aa.17439.885.438086664200.12.239_1
12-Jul-2015 16:11:05 [---] [slot] removed file slots/0/init_data.xml
12-Jul-2015 16:11:05 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/Lunatics_x41zc_win32_cuda50.exe to slots/0/Lunatics_x41zc_win32_cuda50.exe
12-Jul-2015 16:11:05 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/cudart32_50_35.dll to slots/0/cudart32_50_35.dll
12-Jul-2015 16:11:05 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/cufft32_50_35.dll to slots/0/cufft32_50_35.dll
12-Jul-2015 16:11:05 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/mbcuda.cfg to slots/0/mbcuda.cfg
12-Jul-2015 16:11:05 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/01jn15aa.17439.885.438086664200.12.239 to slots/0/work_unit.sah
12-Jul-2015 16:11:05 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/01jn15aa.17439.885.438086664200.12.239_1_0 to slots/0/result.sah
12-Jul-2015 16:11:05 [---] [slot] removed file slots/0/boinc_temporary_exit
12-Jul-2015 16:11:05 [SETI@home] [task] task_state=EXECUTING for 01jn15aa.17439.885.438086664200.12.239_1 from start
12-Jul-2015 16:11:05 [SETI@home] Starting task 01jn15aa.17439.885.438086664200.12.239_1
12-Jul-2015 16:11:05 [SETI@home] [cpu_sched] Starting task 01jn15aa.17439.885.438086664200.12.239_1 using setiathome_v7 version 700 (cuda50) in slot 0
12-Jul-2015 16:11:07 [SETI@home] Started upload of 01jn15ab.4387.89556.438086664198.12.189_0_0
12-Jul-2015 16:11:09 [SETI@home] Finished upload of 01jn15ab.4387.89556.438086664198.12.189_0_0
12-Jul-2015 16:11:09 [---] [slot] removed file projects/setiathome.berkeley.edu/01jn15ab.4387.89556.438086664198.12.189_0_0
12-Jul-2015 16:11:09 [---] [slot] removed file projects/setiathome.berkeley.edu/01jn15ab.4387.89556.438086664198.12.189_0_0.gz
12-Jul-2015 16:11:09 [---] [slot] removed file projects/setiathome.berkeley.edu/01jn15ab.4387.89556.438086664198.12.189_0_0.gzt
12-Jul-2015 16:11:09 [SETI@home] [task] result state=FILES_UPLOADED for 01jn15ab.4387.89556.438086664198.12.189_0 from CS::update_results
12-Jul-2015 16:11:19 [SETI@home] [task] Process for 01jn15aa.17439.885.438086664200.12.239_1 exited, exit code 0, task state 1
12-Jul-2015 16:11:19 [SETI@home] [task] task_state=EXITED for 01jn15aa.17439.885.438086664200.12.239_1 from handle_exited_app
12-Jul-2015 16:11:19 [---] [slot] cleaning out slots/0: handle_exited_app()
12-Jul-2015 16:11:19 [---] [slot] removed file slots/0/boinc_finish_called
12-Jul-2015 16:11:19 [---] [slot] removed file slots/0/cudart32_50_35.dll
12-Jul-2015 16:11:19 [---] [slot] removed file slots/0/cufft32_50_35.dll
12-Jul-2015 16:11:19 [---] [slot] removed file slots/0/init_data.xml
12-Jul-2015 16:11:19 [---] [slot] removed file slots/0/Lunatics_x41zc_win32_cuda50.exe
12-Jul-2015 16:11:19 [---] [slot] removed file slots/0/mbcuda.cfg
12-Jul-2015 16:11:19 [---] [slot] removed file slots/0/result.sah
12-Jul-2015 16:11:19 [---] [slot] removed file slots/0/state.sah
12-Jul-2015 16:11:19 [---] [slot] failed to remove file slots/0/stderr.txt: unlink() failed
12-Jul-2015 16:11:19 [---] [slot] removed file slots/0/work_unit.sah
12-Jul-2015 16:11:19 [SETI@home] Computation for task 01jn15aa.17439.885.438086664200.12.239_1 finished
12-Jul-2015 16:11:19 [SETI@home] [task] result state=FILES_UPLOADING for 01jn15aa.17439.885.438086664200.12.239_1 from CS::app_finished
12-Jul-2015 16:11:19 [---] [slot] cleaning out slots/0: get_free_slot()
12-Jul-2015 16:11:19 [---] [slot] failed to remove file slots/0/stderr.txt: unlink() failed
12-Jul-2015 16:11:19 [SETI@home] [slot] failed to clean out dir: unlink() failed

12-Jul-2015 16:11:19 [---] [slot] cleaning out slots/15: get_free_slot()
12-Jul-2015 16:11:19 [SETI@home] [slot] assigning slot 15 to 01jn15ab.15470.638.438086664199.12.145_0
12-Jul-2015 16:11:19 [---] [slot] removed file slots/15/init_data.xml
12-Jul-2015 16:11:19 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/Lunatics_x41zc_win32_cuda50.exe to slots/15/Lunatics_x41zc_win32_cuda50.exe
12-Jul-2015 16:11:19 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/cudart32_50_35.dll to slots/15/cudart32_50_35.dll
12-Jul-2015 16:11:19 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/cufft32_50_35.dll to slots/15/cufft32_50_35.dll
12-Jul-2015 16:11:19 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/mbcuda.cfg to slots/15/mbcuda.cfg
12-Jul-2015 16:11:19 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/01jn15ab.15470.638.438086664199.12.145 to slots/15/work_unit.sah
12-Jul-2015 16:11:19 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/01jn15ab.15470.638.438086664199.12.145_0_0 to slots/15/result.sah
12-Jul-2015 16:11:19 [---] [slot] removed file slots/15/boinc_temporary_exit
12-Jul-2015 16:11:19 [SETI@home] [task] task_state=EXECUTING for 01jn15ab.15470.638.438086664199.12.145_0 from start
12-Jul-2015 16:11:19 [SETI@home] Starting task 01jn15ab.15470.638.438086664199.12.145_0
12-Jul-2015 16:11:19 [SETI@home] [cpu_sched] Starting task 01jn15ab.15470.638.438086664199.12.145_0 using setiathome_v7 version 700 (cuda50) in slot 15
12-Jul-2015 16:11:21 [SETI@home] Started upload of 01jn15aa.17439.885.438086664200.12.239_1_0
12-Jul-2015 16:11:24 [SETI@home] Finished upload of 01jn15aa.17439.885.438086664200.12.239_1_0
12-Jul-2015 16:11:24 [---] [slot] removed file projects/setiathome.berkeley.edu/01jn15aa.17439.885.438086664200.12.239_1_0
12-Jul-2015 16:11:24 [---] [slot] removed file projects/setiathome.berkeley.edu/01jn15aa.17439.885.438086664200.12.239_1_0.gz
12-Jul-2015 16:11:24 [---] [slot] removed file projects/setiathome.berkeley.edu/01jn15aa.17439.885.438086664200.12.239_1_0.gzt
12-Jul-2015 16:11:24 [SETI@home] [task] result state=FILES_UPLOADED for 01jn15aa.17439.885.438086664200.12.239_1 from CS::update_results
ID: 1700952 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1701079 - Posted: 13 Jul 2015, 15:26:42 UTC

Another slight variation on the AP remove/unlink failures this morning, this time on my T7400 (so these aren't limited to one machine or OS). Once again, no Stderr truncation even though handle_exited_app() couldn't remove the ap_state.dat0, ap_state.dat1, and stderr.txt. However, this time the get_free_slot(), with the same timestamp but obviously a fraction of a second later, succeeded in removing both the ap_state.dat0 and stderr.txt, but not the ap_state.dat1. In this scenario, there was no orphaned stderr.txt to capture.

So, what sort of timing differences allow the AP app to have these occasional remove/unlink failures on the stderr.txt file, yet not exhibit any Stderr truncation, while the MB apps experience the truncations but may or may not actually have associated remove/unlink failures? Does that point more toward application behavior rather than BOINC behavior?

BTW, absolutely no Stderr truncations from any of my machines overnight, an unusual occurrence.
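For what it's worth, here's a minimal Python sketch of the kind of retry-on-unlink that could ride out a handle that's still held for a fraction of a second (the function name and parameters are made up for illustration; this is not the actual BOINC client code):

```python
import os
import time

def delete_with_retry(path, attempts=5, delay=0.1):
    """Hypothetical sketch: retry a delete a few times before giving up.

    On Windows, a delete fails with a sharing violation while another
    process still holds the file open; a short back-off often lets the
    exiting app release its handle first, which would explain why the
    later get_free_slot() pass succeeds where handle_exited_app() failed.
    """
    for _ in range(attempts):
        try:
            os.remove(path)
            return True
        except FileNotFoundError:
            return True          # already gone: nothing left to clean
        except OSError:
            time.sleep(delay)    # file still held open; wait and retry
    return False                 # caller would log "unlink() failed"
```

The point of the sketch is just that a sub-second hold on the handle is enough to produce exactly the one-pass-fails, next-pass-succeeds pattern in the logs above.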
ID: 1701079 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1701089 - Posted: 13 Jul 2015, 16:12:17 UTC - in response to Message 1701079.  

Another slight variation on the AP remove/unlink failures this morning, this time on my T7400 (so these aren't limited to one machine or OS). Once again, no Stderr truncation even though handle_exited_app() couldn't remove the ap_state.dat0, ap_state.dat1, and stderr.txt. However, this time the get_free_slot(), with the same timestamp but obviously a fraction of a second later, succeeded in removing both the ap_state.dat0 and stderr.txt, but not the ap_state.dat1. In this scenario, there was no orphaned stderr.txt to capture.

So, what sort of timing differences allow the AP app to have these occasional remove/unlink failures on the stderr.txt file, yet not exhibit any Stderr truncation, while the MB apps experience the truncations but may or may not actually have associated remove/unlink failures? Does that point more toward application behavior rather than BOINC behavior?

BTW, absolutely no Stderr truncations from any of my machines overnight, an unusual occurrence.

I still think that you need to consider the possibility that the 'truncation' is actually an 'incomplete read' - i.e. the whole thing gets written (eventually), but the BOINC client is too impatient and reads it before it's fully ready. So it misses the last part, which hasn't arrived yet.
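To illustrate what a 'patient' reader would look like, here's a hypothetical Python sketch (again, not what the client actually does): don't read the file the instant the app exits, but wait until its size stops changing first.

```python
import os
import time

def read_when_stable(path, checks=3, interval=0.05):
    """Hypothetical sketch: wait until a file's size stops changing
    before reading it, so a writer that is still flushing its last
    buffer isn't caught mid-write and 'truncated' by the reader.
    """
    last = -1
    stable = 0
    while stable < checks:
        size = os.path.getsize(path)
        if size == last:
            stable += 1          # size unchanged: count toward stability
        else:
            stable = 0           # still growing: reset the counter
            last = size
        time.sleep(interval)
    with open(path, "r") as f:
        return f.read()
```

An impatient reader skips all of that and just reads immediately, which is how the tail end of stderr.txt could go missing without any error being reported on either side.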
ID: 1701089 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1701094 - Posted: 13 Jul 2015, 16:30:18 UTC - in response to Message 1701089.  
Last modified: 13 Jul 2015, 16:30:50 UTC

Another slight variation on the AP remove/unlink failures this morning, this time on my T7400 (so these aren't limited to one machine or OS). Once again, no Stderr truncation even though handle_exited_app() couldn't remove the ap_state.dat0, ap_state.dat1, and stderr.txt. However, this time the get_free_slot(), with the same timestamp but obviously a fraction of a second later, succeeded in removing both the ap_state.dat0 and stderr.txt, but not the ap_state.dat1. In this scenario, there was no orphaned stderr.txt to capture.

So, what sort of timing differences allow the AP app to have these occasional remove/unlink failures on the stderr.txt file, yet not exhibit any Stderr truncation, while the MB apps experience the truncations but may or may not actually have associated remove/unlink failures? Does that point more toward application behavior rather than BOINC behavior?

BTW, absolutely no Stderr truncations from any of my machines overnight, an unusual occurrence.

I still think that you need to consider the possibility that the 'truncation' is actually an 'incomplete read' - i.e. the whole thing gets written (eventually), but the BOINC client is too impatient and reads it before it's fully ready. So it misses the last part, which hasn't arrived yet.

Sure, but is the BOINC client actually impatient, or is the science app telling BOINC that it's all done before it really is? Which side needs to verify that all the writing is complete? It seems like perhaps what should be a firm handshake between the science app and BOINC is more of a casual fist bump. ;^)
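For the sake of argument, the 'firm handshake' from the app side would look something like this hypothetical sketch (names invented for illustration): flush and sync stderr to disk, and only then tell the client the task is done.

```python
import os

def finish_task(stderr_file, report_done):
    """Hypothetical sketch of a firm handshake: the science app makes
    sure every byte of its stderr has actually reached the disk before
    it signals completion, so the client can never read a partial file.
    """
    stderr_file.flush()              # push buffered bytes to the OS
    os.fsync(stderr_file.fileno())   # force the OS to write them to disk
    stderr_file.close()              # release the handle for cleanup
    report_done()                    # only now tell the client we're done
```

If the app signals completion before the flush/sync, the client isn't really being impatient at all; it's being told the race is over before the runner has crossed the line.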
ID: 1701094 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1701097 - Posted: 13 Jul 2015, 16:40:25 UTC - in response to Message 1701094.  

The answer probably can't be that definite, and depends on which end you're standing at. From either side it will look like the other end is broken, when the reality is probably that neither is behaving ideally.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1701097 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14679
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1701098 - Posted: 13 Jul 2015, 16:42:54 UTC - in response to Message 1701094.  

Another slight variation on the AP remove/unlink failures this morning, this time on my T7400 (so these aren't limited to one machine or OS). Once again, no Stderr truncation even though handle_exited_app() couldn't remove the ap_state.dat0, ap_state.dat1, and stderr.txt. However, this time the get_free_slot(), with the same timestamp but obviously a fraction of a second later, succeeded in removing both the ap_state.dat0 and stderr.txt, but not the ap_state.dat1. In this scenario, there was no orphaned stderr.txt to capture.

So, what sort of timing differences allow the AP app to have these occasional remove/unlink failures on the stderr.txt file, yet not exhibit any Stderr truncation, while the MB apps experience the truncations but may or may not actually have associated remove/unlink failures? Does that point more toward application behavior rather than BOINC behavior?

BTW, absolutely no Stderr truncations from any of my machines overnight, an unusual occurrence.

I still think that you need to consider the possibility that the 'truncation' is actually an 'incomplete read' - i.e. the whole thing gets written (eventually), but the BOINC client is too impatient and reads it before it's fully ready. So it misses the last part, which hasn't arrived yet.

Sure, but is the BOINC client actually impatient, or is the science app telling BOINC that it's all done before it really is? Which side needs to verify that all the writing is complete? It seems like perhaps what should be a firm handshake between the science app and BOINC is more of a casual fist bump. ;^)

Good analogy. If even one of them had a good sense of timing, it would be better. But at the moment, both of them need to be told to chill, slow down a bit. Wait for the gun, or it's a false start.
ID: 1701098 · Report as offensive
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.