Message boards : Number crunching : Stderr Truncations
Author | Message |
---|---|
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Rather than continuing to pollute/hijack the Panic Mode thread, I thought it better to continue this discussion in a new thread. My xw9400 burped up 5 truncated Stderr results last night, with <slot_debug> turned on. The log results show that either something else is going on here that's different from the completely empty Stderrs, or the BOINC 7.2.33 <slot_debug> isn't reporting on "remove file" failures. (Is that the new enhancement in 7.6.2?)

Here's a summary of the five tasks. Note that the first one was a very quick -9 overflow with 30 Spikes (according to the wingman's Stderr). The other four seem to be normal full runs. You may also note that all of these were GPU tasks, and 3 of the 4 GPUs on that box were involved.

Task 4257333848

09-Jul-2015 20:10:26 [SETI@home] [slot] assigning slot 0 to 09fe15aa.15517.8419.438086664202.12.160_0
09-Jul-2015 20:10:26 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/Lunatics_x41zc_win32_cuda50.exe to slots/0/Lunatics_x41zc_win32_cuda50.exe
09-Jul-2015 20:10:26 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/cudart32_50_35.dll to slots/0/cudart32_50_35.dll
09-Jul-2015 20:10:26 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/cufft32_50_35.dll to slots/0/cufft32_50_35.dll
09-Jul-2015 20:10:26 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/mbcuda.cfg to slots/0/mbcuda.cfg
09-Jul-2015 20:10:26 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/09fe15aa.15517.8419.438086664202.12.160 to slots/0/work_unit.sah
09-Jul-2015 20:10:26 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/09fe15aa.15517.8419.438086664202.12.160_0_0 to slots/0/result.sah
09-Jul-2015 20:10:26 [SETI@home] Starting task 09fe15aa.15517.8419.438086664202.12.160_0 using setiathome_v7 version 700 (cuda50) in slot 0
09-Jul-2015 20:10:29 [SETI@home] Started upload of 10fe15aa.31398.9474.438086664202.12.249_0_0
09-Jul-2015 20:10:31 [SETI@home] Finished upload of 10fe15aa.31398.9474.438086664202.12.249_0_0
09-Jul-2015 20:10:31 [---] [slot] removed file projects/setiathome.berkeley.edu/10fe15aa.31398.9474.438086664202.12.249_0_0
09-Jul-2015 20:10:33 [---] [slot] cleaning out slots/0: handle_exited_app()
09-Jul-2015 20:10:33 [---] [slot] removed file slots/0/boinc_finish_called
09-Jul-2015 20:10:33 [---] [slot] removed file slots/0/cudart32_50_35.dll
09-Jul-2015 20:10:33 [---] [slot] removed file slots/0/cufft32_50_35.dll
09-Jul-2015 20:10:33 [---] [slot] removed file slots/0/init_data.xml
09-Jul-2015 20:10:33 [---] [slot] removed file slots/0/Lunatics_x41zc_win32_cuda50.exe
09-Jul-2015 20:10:33 [---] [slot] removed file slots/0/mbcuda.cfg
09-Jul-2015 20:10:33 [---] [slot] removed file slots/0/result.sah
09-Jul-2015 20:10:33 [---] [slot] removed file slots/0/state.sah
09-Jul-2015 20:10:33 [---] [slot] removed file slots/0/work_unit.sah
09-Jul-2015 20:10:33 [SETI@home] Computation for task 09fe15aa.15517.8419.438086664202.12.160_0 finished
09-Jul-2015 20:10:33 [---] [slot] cleaning out slots/0: get_free_slot()
09-Jul-2015 20:10:33 [SETI@home] [slot] assigning slot 16 to 09fe15aa.15517.8419.438086664202.12.176_1

/////////////

Task 4257333711

09-Jul-2015 20:55:48 [SETI@home] [slot] assigning slot 6 to 09fe15aa.15517.8419.438086664202.12.104_1
09-Jul-2015 20:55:48 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/Lunatics_x41zc_win32_cuda50.exe to slots/6/Lunatics_x41zc_win32_cuda50.exe
09-Jul-2015 20:55:48 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/cudart32_50_35.dll to slots/6/cudart32_50_35.dll
09-Jul-2015 20:55:48 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/cufft32_50_35.dll to slots/6/cufft32_50_35.dll
09-Jul-2015 20:55:48 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/mbcuda.cfg to slots/6/mbcuda.cfg
09-Jul-2015 20:55:48 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/09fe15aa.15517.8419.438086664202.12.104 to slots/6/work_unit.sah
09-Jul-2015 20:55:48 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/09fe15aa.15517.8419.438086664202.12.104_1_0 to slots/6/result.sah
09-Jul-2015 20:55:48 [SETI@home] Starting task 09fe15aa.15517.8419.438086664202.12.104_1 using setiathome_v7 version 700 (cuda50) in slot 6
...
09-Jul-2015 21:10:02 [---] [slot] cleaning out slots/6: handle_exited_app()
09-Jul-2015 21:10:02 [---] [slot] removed file slots/6/boinc_finish_called
09-Jul-2015 21:10:02 [---] [slot] removed file slots/6/boinc_task_state.xml
09-Jul-2015 21:10:02 [---] [slot] removed file slots/6/cudart32_50_35.dll
09-Jul-2015 21:10:02 [---] [slot] removed file slots/6/cufft32_50_35.dll
09-Jul-2015 21:10:02 [---] [slot] removed file slots/6/init_data.xml
09-Jul-2015 21:10:02 [---] [slot] removed file slots/6/Lunatics_x41zc_win32_cuda50.exe
09-Jul-2015 21:10:02 [---] [slot] removed file slots/6/mbcuda.cfg
09-Jul-2015 21:10:02 [---] [slot] removed file slots/6/result.sah
09-Jul-2015 21:10:02 [---] [slot] removed file slots/6/state.sah
09-Jul-2015 21:10:02 [---] [slot] removed file slots/6/stderr.txt
09-Jul-2015 21:10:02 [---] [slot] removed file slots/6/work_unit.sah
09-Jul-2015 21:10:02 [SETI@home] Computation for task 09fe15aa.15517.8419.438086664202.12.104_1 finished
09-Jul-2015 21:10:02 [---] [slot] cleaning out slots/6: get_free_slot()
09-Jul-2015 21:10:02 [SETI@home] [slot] assigning slot 6 to 09fe15aa.15517.8419.438086664202.12.244_1

/////////////

Task 4258519278

10-Jul-2015 04:04:31 [SETI@home] [slot] assigning slot 12 to 13ja15aa.30952.4157.438086664199.12.106_1
10-Jul-2015 04:04:31 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/Lunatics_x41zc_win32_cuda50.exe to slots/12/Lunatics_x41zc_win32_cuda50.exe
10-Jul-2015 04:04:31 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/cudart32_50_35.dll to slots/12/cudart32_50_35.dll
10-Jul-2015 04:04:31 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/cufft32_50_35.dll to slots/12/cufft32_50_35.dll
10-Jul-2015 04:04:31 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/mbcuda.cfg to slots/12/mbcuda.cfg
10-Jul-2015 04:04:31 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/13ja15aa.30952.4157.438086664199.12.106 to slots/12/work_unit.sah
10-Jul-2015 04:04:31 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/13ja15aa.30952.4157.438086664199.12.106_1_0 to slots/12/result.sah
10-Jul-2015 04:04:31 [SETI@home] Starting task 13ja15aa.30952.4157.438086664199.12.106_1 using setiathome_v7 version 700 (cuda50) in slot 12
...
10-Jul-2015 04:20:11 [---] [slot] cleaning out slots/12: handle_exited_app()
10-Jul-2015 04:20:11 [---] [slot] removed file slots/12/boinc_finish_called
10-Jul-2015 04:20:11 [---] [slot] removed file slots/12/boinc_task_state.xml
10-Jul-2015 04:20:11 [---] [slot] removed file slots/12/cudart32_50_35.dll
10-Jul-2015 04:20:11 [---] [slot] removed file slots/12/cufft32_50_35.dll
10-Jul-2015 04:20:11 [---] [slot] removed file slots/12/init_data.xml
10-Jul-2015 04:20:11 [---] [slot] removed file slots/12/Lunatics_x41zc_win32_cuda50.exe
10-Jul-2015 04:20:11 [---] [slot] removed file slots/12/mbcuda.cfg
10-Jul-2015 04:20:11 [---] [slot] removed file slots/12/result.sah
10-Jul-2015 04:20:11 [---] [slot] removed file slots/12/state.sah
10-Jul-2015 04:20:11 [---] [slot] removed file slots/12/stderr.txt
10-Jul-2015 04:20:11 [---] [slot] removed file slots/12/work_unit.sah
10-Jul-2015 04:20:11 [SETI@home] Computation for task 13ja15aa.30952.4157.438086664199.12.106_1 finished
10-Jul-2015 04:20:11 [---] [slot] cleaning out slots/12: get_free_slot()
10-Jul-2015 04:20:11 [SETI@home] [slot] assigning slot 12 to 13ja15aa.30952.4157.438086664199.12.178_0

/////////////

Task 4258519325

10-Jul-2015 04:14:45 [---] [slot] cleaning out slots/3: get_free_slot()
10-Jul-2015 04:14:45 [SETI@home] [slot] assigning slot 3 to 13ja15aa.30952.4157.438086664199.12.174_0
10-Jul-2015 04:14:45 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/Lunatics_x41zc_win32_cuda50.exe to slots/3/Lunatics_x41zc_win32_cuda50.exe
10-Jul-2015 04:14:45 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/cudart32_50_35.dll to slots/3/cudart32_50_35.dll
10-Jul-2015 04:14:45 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/cufft32_50_35.dll to slots/3/cufft32_50_35.dll
10-Jul-2015 04:14:45 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/mbcuda.cfg to slots/3/mbcuda.cfg
10-Jul-2015 04:14:45 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/13ja15aa.30952.4157.438086664199.12.174 to slots/3/work_unit.sah
10-Jul-2015 04:14:45 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/13ja15aa.30952.4157.438086664199.12.174_0_0 to slots/3/result.sah
10-Jul-2015 04:14:45 [SETI@home] Starting task 13ja15aa.30952.4157.438086664199.12.174_0 using setiathome_v7 version 700 (cuda50) in slot 3
...
10-Jul-2015 04:30:22 [---] [slot] cleaning out slots/3: handle_exited_app()
10-Jul-2015 04:30:22 [---] [slot] removed file slots/3/boinc_finish_called
10-Jul-2015 04:30:22 [---] [slot] removed file slots/3/boinc_task_state.xml
10-Jul-2015 04:30:22 [---] [slot] removed file slots/3/cudart32_50_35.dll
10-Jul-2015 04:30:22 [---] [slot] removed file slots/3/cufft32_50_35.dll
10-Jul-2015 04:30:22 [---] [slot] removed file slots/3/init_data.xml
10-Jul-2015 04:30:22 [---] [slot] removed file slots/3/Lunatics_x41zc_win32_cuda50.exe
10-Jul-2015 04:30:22 [---] [slot] removed file slots/3/mbcuda.cfg
10-Jul-2015 04:30:22 [---] [slot] removed file slots/3/result.sah
10-Jul-2015 04:30:22 [---] [slot] removed file slots/3/state.sah
10-Jul-2015 04:30:22 [---] [slot] removed file slots/3/stderr.txt
10-Jul-2015 04:30:22 [---] [slot] removed file slots/3/work_unit.sah
10-Jul-2015 04:30:22 [SETI@home] Computation for task 13ja15aa.30952.4157.438086664199.12.174_0 finished
10-Jul-2015 04:30:22 [---] [slot] cleaning out slots/3: get_free_slot()
10-Jul-2015 04:30:22 [SETI@home] [slot] assigning slot 3 to 13ja15aa.30952.4157.438086664199.12.216_0

/////////////

Task 4258633619

10-Jul-2015 05:55:45 [SETI@home] [slot] assigning slot 12 to 11ja15aa.25394.21744.438086664203.12.126_0
10-Jul-2015 05:55:45 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/Lunatics_x41zc_win32_cuda50.exe to slots/12/Lunatics_x41zc_win32_cuda50.exe
10-Jul-2015 05:55:45 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/cudart32_50_35.dll to slots/12/cudart32_50_35.dll
10-Jul-2015 05:55:45 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/cufft32_50_35.dll to slots/12/cufft32_50_35.dll
10-Jul-2015 05:55:45 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/mbcuda.cfg to slots/12/mbcuda.cfg
10-Jul-2015 05:55:45 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/11ja15aa.25394.21744.438086664203.12.126 to slots/12/work_unit.sah
10-Jul-2015 05:55:45 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/11ja15aa.25394.21744.438086664203.12.126_0_0 to slots/12/result.sah
10-Jul-2015 05:55:45 [SETI@home] Starting task 11ja15aa.25394.21744.438086664203.12.126_0 using setiathome_v7 version 700 (cuda50) in slot 12
...
10-Jul-2015 06:11:37 [---] [slot] cleaning out slots/12: handle_exited_app()
10-Jul-2015 06:11:37 [---] [slot] removed file slots/12/boinc_finish_called
10-Jul-2015 06:11:37 [---] [slot] removed file slots/12/boinc_task_state.xml
10-Jul-2015 06:11:37 [---] [slot] removed file slots/12/cudart32_50_35.dll
10-Jul-2015 06:11:37 [---] [slot] removed file slots/12/cufft32_50_35.dll
10-Jul-2015 06:11:37 [---] [slot] removed file slots/12/init_data.xml
10-Jul-2015 06:11:37 [---] [slot] removed file slots/12/Lunatics_x41zc_win32_cuda50.exe
10-Jul-2015 06:11:37 [---] [slot] removed file slots/12/mbcuda.cfg
10-Jul-2015 06:11:37 [---] [slot] removed file slots/12/result.sah
10-Jul-2015 06:11:37 [---] [slot] removed file slots/12/state.sah
10-Jul-2015 06:11:37 [---] [slot] removed file slots/12/stderr.txt
10-Jul-2015 06:11:37 [---] [slot] removed file slots/12/work_unit.sah
10-Jul-2015 06:11:37 [SETI@home] Computation for task 11ja15aa.25394.21744.438086664203.12.126_0 finished
10-Jul-2015 06:11:37 [---] [slot] cleaning out slots/12: get_free_slot()
10-Jul-2015 06:11:37 [SETI@home] [slot] assigning slot 12 to 11ja15aa.12146.16836.438086664204.12.150_0 |
Brent Norman Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
I'm curious, are you seeing a computation pause like I am during the task execution? |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
> I'm curious, are you seeing a computation pause like I am during the task execution?

Well, since that machine is usually running 16 tasks concurrently, and mostly runs while I'm sleeping, it's unlikely I'd ever notice any sort of computation pause. :^) |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
BOINC Manager 7.6.2 has a cpu_sched flag option that helps identify the task in each slot. That, along with an enhanced slot_debug, is needed to follow the flow of work occupying each slot. When I did my original trace in message 1674525 I believe I was using BM version 7.2.42, which I believe had a more enhanced slot_debug than your version 7.2.33. If you want to see the "race mechanism" in action you really need to move to the 7.6.2 platform.

Based on the fact that "removed file slots/0/stderr.txt" is missing from Task 4257333848, I believe that task failed because of the race condition where the two slot cleansing conditions failed and condition 3 was used.

Unless you want to see exactly what is going on, it probably isn't necessary to move to BOINC platform level 7.6.2. The condition is well documented now and a boinc_dev application analysis report has been submitted, so the devs probably aren't looking for more data for analysis. I could be mistaken though; Richard Haselgrove is the final touchstone.

Cheers, Keith

Seti@Home classic workunits: 20,676 CPU time: 74,226 hours
A proud member of the OFA (Old Farts Association) |
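For reference, both log flags Keith mentions are switched on in the client's cc_config.xml (then re-read via the Manager's "Read config files" command or a client restart). A minimal sketch of just the relevant section, with everything else omitted:

```xml
<cc_config>
  <log_flags>
    <!-- log slot directory assignment and cleanup, e.g. "removed file slots/0/stderr.txt" -->
    <slot_debug>1</slot_debug>
    <!-- log which task the scheduler starts or preempts, which helps tie a task to its slot -->
    <cpu_sched>1</cpu_sched>
  </log_flags>
</cc_config>
```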
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
> BOINC Manager 7.6.2 has cpu_sched flag option that helps identify the task in each slot. That along with an enhanced slot_debug are needed to follow the flow of work occupying each slot. When I did my original trace in message 1674525 I believe I was using BM version 7.2.42 at the time which I believe had a more enhanced slot_debug than your version 7.2.33. If you want to see the "race mechanism" in action you really need to move to the 7.6.2 platform. Based on the fact that removed file slots/0/stderr.txt is missing from Task 4257333848 I believe that task failed because of the race condition where the two slot cleansing conditions failed and condition 3 was used. Unless you want to see exactly what is going on, it probably isn't necessary to move to BOINC platform level 7.6.2 if you don't want to. The condition is well documented now and a boinc_dev application analysis report has been submitted. So the devs probably aren't looking for more data for analysis. I could be mistaken though, Richard Haselgrove is the final touchstone.

Well, I guess I was wondering if the truncations on S@h have any different characteristics from those on MW. It seemed like all of the ones in your postings for MW were completely empty, whereas here on S@h they are sometimes empty and sometimes partial, like the five I got overnight. I actually got an empty one on my daily driver yesterday, but it ran before I had turned <slot_debug> on.

That's an interesting observation on the 4257333848 task. I had overlooked the fact that "removed file slots/0/stderr.txt" was missing for that one, although it was present for the other four. That one was the overflow task that was done within a couple of seconds. I wonder if it's possible that the Stderr for that one never even had time to get written to the slot.

Yes, I'll definitely be interested to see if Richard has any observations on what <slot_debug> in 7.2.33 might be missing.

EDIT: Ah, I see what happened. I didn't follow the log far enough. Another task ended one second after the 4257333848 task and there were some overlapping log entries. Here are some additional lines that appear to include the "removed file slots/0/stderr.txt" action.
09-Jul-2015 20:10:33 [---] [slot] cleaning out slots/0: get_free_slot()
09-Jul-2015 20:10:33 [SETI@home] [slot] assigning slot 16 to 09fe15aa.15517.8419.438086664202.12.176_1
09-Jul-2015 20:10:33 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/Lunatics_x41zc_win32_cuda50.exe to slots/16/Lunatics_x41zc_win32_cuda50.exe
09-Jul-2015 20:10:33 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/cudart32_50_35.dll to slots/16/cudart32_50_35.dll
09-Jul-2015 20:10:33 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/cufft32_50_35.dll to slots/16/cufft32_50_35.dll
09-Jul-2015 20:10:33 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/mbcuda.cfg to slots/16/mbcuda.cfg
09-Jul-2015 20:10:33 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/09fe15aa.15517.8419.438086664202.12.176 to slots/16/work_unit.sah
09-Jul-2015 20:10:33 [SETI@home] [slot] linked ../../projects/setiathome.berkeley.edu/09fe15aa.15517.8419.438086664202.12.176_1_0 to slots/16/result.sah
09-Jul-2015 20:10:33 [SETI@home] Starting task 09fe15aa.15517.8419.438086664202.12.176_1 using setiathome_v7 version 700 (cuda50) in slot 16
09-Jul-2015 20:10:34 [---] [slot] cleaning out slots/12: handle_exited_app()
09-Jul-2015 20:10:34 [---] [slot] removed file slots/12/AKv8c_r2549_winx86_SSE3xjfs.exe
09-Jul-2015 20:10:34 [---] [slot] removed file slots/12/boinc_finish_called
09-Jul-2015 20:10:34 [---] [slot] removed file slots/12/boinc_task_state.xml
09-Jul-2015 20:10:34 [---] [slot] removed file slots/12/init_data.xml
09-Jul-2015 20:10:34 [---] [slot] removed file slots/12/libfftw3f-3-3-4_x86.dll
09-Jul-2015 20:10:34 [---] [slot] removed file slots/12/mb_cmdline.txt
09-Jul-2015 20:10:34 [---] [slot] removed file slots/12/result.sah
09-Jul-2015 20:10:34 [---] [slot] removed file slots/12/state.sah
09-Jul-2015 20:10:34 [---] [slot] removed file slots/12/stderr.txt
09-Jul-2015 20:10:34 [---] [slot] removed file slots/12/work_unit.sah
09-Jul-2015 20:10:34 [SETI@home] Computation for task 05fe15ab.19655.4165.438086664207.12.62.vlar_0 finished
09-Jul-2015 20:10:34 [---] [slot] cleaning out slots/0: get_free_slot()
09-Jul-2015 20:10:34 [---] [slot] removed file slots/0/stderr.txt
09-Jul-2015 20:10:34 [SETI@home] [slot] assigning slot 0 to 11ja15aa.32763.8247.438086664198.12.158.vlar_1 |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Note the headers to the sections. The first attempt to delete the files happens when the app exits. The missing one popped up under "get_free_slot()", which is the preparation for the next task to run. So there must be an unlogged 'unlink failed', which v7.6.2 would have displayed.

09-Jul-2015 20:10:34 [---] [slot] cleaning out slots/12: handle_exited_app()

I had a long conversation with Jason this afternoon. We have two different theories, and they are possibly both right, accounting for the difference between 'stderr truncated' as we see it here, and 'stderr empty', as Keith sees it at Milkyway.

In short, Jason thinks it's a problem with the science application, which shuts itself down too abruptly when it finishes - thus interrupting the action of writing to the file, so it never completes. That would be SETI/truncation. I think that the BOINC client is too impatient, and reads before the file is available - that would be Milkyway/empty.

We discussed possible ways to distinguish between them. One would be to try and catch the stderr.txt file left behind in an orphaned slot when both attempts to delete it have failed, and BOINC has moved the next task to another slot for safety. If a file like that can be found, is it complete or truncated?

Another test might be to try and influence the relative timings, perhaps by tweaking process priorities. If the Milkyway application could be boosted to 'Above normal' or 'high' priority (above the BOINC client runtime priority), would it stand a better chance of completing the write and making the file available before BOINC tries to read it? There's a program called Process Lasso (free, unless you want the Pro version - not needed for this), which can set priority as requested every time an application starts.

If we can find an orphaned stderr, and it's truncated, that would be positive evidence pointing towards Jason's theory. If Milkyway throws fewer 'validate errors' as the app process priority is increased, that would be somewhat weak statistical evidence in favour of my theory. But we're both taking Friday night/Saturday morning off now, and we may run some tests over the weekend. |
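A minimal sketch of the shutdown ordering Jason's theory points at, for illustration only (this is not the actual SETI@home or BOINC API code): whatever the app has written to stderr should be flushed all the way to stderr.txt on disk before any exit path tears the process down, otherwise a reader arriving immediately afterwards can see a short or empty file.

```cpp
// Illustration only -- a toy stand-in for a science app's exit path.
#include <cstdio>
#ifdef _WIN32
#include <io.h>       // _commit, _fileno
#else
#include <unistd.h>   // fsync, fileno
#endif

static void flush_stderr_to_disk() {
    std::fflush(stderr);              // push C-runtime buffers out to the OS
#ifdef _WIN32
    _commit(_fileno(stderr));         // ask Windows to commit the file to disk
#else
    fsync(fileno(stderr));            // POSIX equivalent
#endif
}

int main() {
    std::fprintf(stderr, "Flopcounter: 1234567\nSpike count: 30\n");
    flush_stderr_to_disk();           // do this before any abrupt exit path, not after
    return 0;                         // an abrupt process kill between the write and the
                                      // flush is exactly the truncation scenario described
}
```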
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Yes, this trace follows along more or less what my original trace from MW did. The slot cleanout of the stderr.txt for slot 0 happens further down the trace, in the slot cleanout of another slot. I misinterpreted the analysis of my original trace.

The situation of the task in slot 0 terminating quickly because of the overflow condition in a very short time (like my 50-second MW tasks) is that the stderr.txt file hadn't made it to disk before the task exit finalized, thus the race condition and the invalid task. It's like the analogy of closing the barn doors after the horse has escaped: the damage is all done by the time the slot cleanout procedure has completed. The slot cleanout mechanism is working correctly as designed; that is not the problem here. The problem is prematurely calling the task exit before all the files have been transacted to and from disk and to and from the project.

It's a different situation with MW tasks: the science result IS the stderr.txt file. In the case of SETI, the science result is a separate upload file and the stderr.txt file is a housekeeping file. I hope that my commentary is somewhat accurate, but I defer always to the devs like Richard.

Keith

Seti@Home classic workunits: 20,676 CPU time: 74,226 hours
A proud member of the OFA (Old Farts Association) |
Brent Norman Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
It could also be that an AV program is locking the (or a) file during delete/create in order to scan it, so a crucial write from BOINC goes missing. When I run out of AP tasks I will compare my AMD, which never shoots blanks, with my i5, which always does. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Yes, most certainly both modes of failure are not necessarily mutually exclusive, and there are possible cascade failure situations that look even stranger still. The old crash after exit, sticky GPU downclocks, and exit with no finished file are related. The code at both ends looks 'reasonable' enough when you walk through it as though it were a nice orderly single-thread process (which doesn't happen ever anymore).

Most likely a bit more thought will allow me to illustrate the process/thread timelines involved under various situations (which lead to proper fixes), but I would suggest that long-term permanent fixes would lie in general replacement of the undisciplined use of raw threads with at least proper sentinels and synchronisation. A pretty big ask of a codebase usually just bandaid-patched ad nauseam, but probably worth the thought and research investment to get it right at a software engineering level (maintaining the intended function & compatibility where at all possible).

[Edit:] Don't need to take my word for it: http://www.drdobbs.com/parallel/use-threads-correctly-isolation-asynch/215900465

"Explicit threads are undisciplined. They need some structure to keep them in line. In this column, we're going to see what that structure is, as we motivate and illustrate best practices for using threads — techniques that will make our concurrent code easier to write correctly and to reason about with confidence."

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
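As a toy illustration of the "sentinels and synchronisation" point (a sketch only, not code from BOINC, boincapi, or Xbranch): the owning thread signals its worker to stop and joins it, so the worker's final write is guaranteed to land before the process exits, instead of racing a teardown it never hears about.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

int main() {
    std::atomic<bool> stop{false};

    std::thread logger([&] {
        while (!stop.load()) {
            std::fprintf(stderr, "heartbeat\n");
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        }
        // This final write cannot be cut off: main() waits for it via join().
        std::fprintf(stderr, "final summary written\n");
        std::fflush(stderr);
    });

    std::this_thread::sleep_for(std::chrono::milliseconds(350));  // pretend the real work happens here

    stop = true;     // sentinel: tell the worker to wind down
    logger.join();   // synchronisation: don't tear the process down underneath it
    return 0;
}
```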
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
> Note the headers to the sections. The first attempt to delete the files happens when the app exits. The missing one popped up under "get_free_slot()", which is the preparation for the next task to run. So there must be an unlogged 'unlink failed', which v7.6.2 would have displayed.

What intrigues me now about my first example, after Keith's reply led me to the additional log entries, is that the initial "handle_exited_app()" doesn't even seem to know that a stderr.txt file existed in the slot, and perhaps it didn't. I'm assuming that that routine simply retrieves a filename list from the OS for the slot folder, then just loops through it to remove/delete the files one by one. If that's the case, stderr.txt hadn't even been written by the app yet. One second later, however (in this example), the "get_free_slot()" cleanout routine does actually find and delete a stderr.txt file. Whether that was a complete or a truncated stderr.txt file is unknown, but the Stderr that BOINC picked up and reported for Task 4257333848 was definitely a truncated one, not an empty one. That sequence is very puzzling to me but seems like it could stem from parts of both theories.

As far as capturing an orphaned stderr.txt, that sounds like it would be quite difficult, but perhaps not impossible, on a machine like my xw9400 that churns through a high volume of tasks. I would think, however, that anytime Slot 16 gets called into use it would indicate that there's another slot with one or more leftover files in it. The trick would be finding that slot and grabbing any file(s) before another task finishes and the slot gets reused. It would probably be easier on a machine like my daily driver, which only has 3 slots in normal use, but naturally generates far fewer truncated Stderr files. If anyone has any ideas on how to hook the activation of Slot 16 (or the equivalent on other machines) to trigger a copy operation for the other slots, that might be a workable approach. |
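One way to do roughly what Jeff is asking for, sketched under assumptions (hypothetical paths, a simple one-second poll rather than a true filesystem hook, and no claim it has been tested against a live BOINC install): watch for slots/16 to come into use and immediately copy any stderr.txt found in the other slots somewhere safe for later inspection.

```cpp
#include <chrono>
#include <cstdio>
#include <filesystem>
#include <thread>

namespace fs = std::filesystem;

int main() {
    const fs::path slots  = "C:/ProgramData/BOINC/slots";   // assumption: default BOINC data directory
    const fs::path rescue = "C:/temp/orphaned_stderr";      // assumption: somewhere writable to stash copies
    fs::create_directories(rescue);

    while (true) {                                           // crude watcher loop; stop it with Ctrl+C
        if (fs::exists(slots / "16")) {                      // the overflow slot came into use
            for (const auto& slot : fs::directory_iterator(slots)) {
                const fs::path candidate = slot.path() / "stderr.txt";
                std::error_code ec;
                if (fs::exists(candidate, ec)) {
                    const fs::path dest = rescue / (slot.path().filename().string() + "_stderr.txt");
                    if (fs::copy_file(candidate, dest, fs::copy_options::overwrite_existing, ec)) {
                        std::printf("copied %s\n", candidate.string().c_str());
                    }
                }
            }
        }
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}
```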
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
> What intrigues me now about my first example, after Keith's reply led me to the additional log entries, is that the initial "handle_exited_app()" doesn't even seem to know that a stderr.txt file existed in the slot, and perhaps it didn't. I'm assuming that that routine simply retrieves a filename list from the OS for the slot folder, then just loops through it to remove/delete the files one by one. If that's the case, stderr.txt hadn't even been written by the app yet. One second later, however (in this example), the "get_free_slot()" cleanout routine does actually find and delete a stderr.txt file. Whether that was a complete or a truncated stderr.txt file is unknown, but the Stderr that BOINC picked up and reported for Task 4257333848 was definitely a truncated one, not an empty one. That sequence is very puzzling to me but seems like it could stem from parts of both theories.

I would be careful about jumping to that conclusion. If you refer to Keith Myers' log at message 1674525 (over two months ago), you get the same impression - stderr.txt is not mentioned in the section in bold. But I think that's a failure in the logging in that version. The log lists the files that were removed - but what about the files that weren't removed? Was that because they didn't exist, or because BOINC tried and failed? The old version doesn't distinguish between the two. Run through the logic again with a v7.6.2 log, please. |
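The distinction Richard is drawing, expressed as a sketch (not the actual BOINC client source): a cleanout routine that prints an explicit failure line when a delete does not succeed, so the log can tell "the file was never there" apart from "the unlink failed".

```cpp
#include <cstdio>
#include <filesystem>

namespace fs = std::filesystem;

// Try to delete everything in a slot directory, logging both outcomes.
void clean_slot(const fs::path& slot_dir) {
    for (const auto& entry : fs::directory_iterator(slot_dir)) {
        std::error_code ec;
        if (fs::remove(entry.path(), ec)) {
            std::printf("[slot] removed file %s\n", entry.path().string().c_str());
        } else {
            // This is the line an older, quieter logger simply never printed.
            std::printf("[slot] failed to remove file %s: %s\n",
                        entry.path().string().c_str(), ec.message().c_str());
        }
    }
}

int main() {
    clean_slot("slots/0");   // hypothetical slot path, for demonstration only
    return 0;
}
```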
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
> What intrigues me now about my first example, after Keith's reply led me to the additional log entries, is that the initial "handle_exited_app()" doesn't even seem to know that a stderr.txt file existed in the slot, and perhaps it didn't. I'm assuming that that routine simply retrieves a filename list from the OS for the slot folder, then just loops through it to remove/delete the files one by one. If that's the case, stderr.txt hadn't even been written by the app yet. One second later, however (in this example), the "get_free_slot()" cleanout routine does actually find and delete a stderr.txt file. Whether that was a complete or a truncated stderr.txt file is unknown, but the Stderr that BOINC picked up and reported for Task 4257333848 was definitely a truncated one, not an empty one. That sequence is very puzzling to me but seems like it could stem from parts of both theories.

Yeah, I see what you mean. You think that v7.6.2 would have displayed a "[slot] failed to remove file slots/0/stderr.txt: unlink() failed" message whereas v7.2.33 doesn't say anything at all. Okay, perhaps I'll try installing v7.6.2 tonight. |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
So far, 7.6.2 appears to be working as well as the old 7.4.42. I grumbled a bit because they moved things around on the menu again and I had to hunt to find the often-used "shut down client" command. Other than that little nitpick, I like it. It has a lot more options to toggle and they seem to be in the expected places.

I need to comment on Richard's suggestion about raising the process priority for MilkyWay tasks with Process Lasso. I already run Process Lasso and I have experimented briefly with raising MW tasks above SETI. That caused some slowdowns on my computer because I already run SETI at ABOVE NORMAL priority and I had to move MW to HIGH priority to get more time than SETI. I don't remember much difference in the production of invalid tasks due to empty stderr.txt's at MW. I could of course repeat the experiment again, since now we know what to look for.

As Richard suggested, it would be great to capture an orphaned stderr.txt file in an unused slot for both SETI and MW. That would help identify which of the proposed fault mechanisms is happening. I think I will try that experiment again once MW comes up on my computers in the project round-robin rotation.

Keith

Seti@Home classic workunits: 20,676 CPU time: 74,226 hours
A proud member of the OFA (Old Farts Association) |
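For anyone who wants to run the priority experiment without Process Lasso, here is a rough Windows-only sketch of the same idea (build as a plain non-UNICODE console program; "milkyway" is an assumed fragment of the science app's executable name, so adjust to match what Task Manager shows): find the running app and raise its priority class above the client's.

```cpp
#include <windows.h>
#include <tlhelp32.h>
#include <cstdio>
#include <cstring>

int main() {
    // Snapshot the running processes so we can look for the science app by name.
    HANDLE snap = CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, 0);
    if (snap == INVALID_HANDLE_VALUE) return 1;

    PROCESSENTRY32 pe;
    pe.dwSize = sizeof(pe);
    for (BOOL ok = Process32First(snap, &pe); ok; ok = Process32Next(snap, &pe)) {
        if (std::strstr(pe.szExeFile, "milkyway") != nullptr) {      // assumed name fragment
            HANDLE proc = OpenProcess(PROCESS_SET_INFORMATION, FALSE, pe.th32ProcessID);
            if (proc != nullptr) {
                // Above the BOINC client, per Richard's suggestion; use HIGH_PRIORITY_CLASS if needed.
                SetPriorityClass(proc, ABOVE_NORMAL_PRIORITY_CLASS);
                std::printf("Raised priority of %s (PID %lu)\n", pe.szExeFile, pe.th32ProcessID);
                CloseHandle(proc);
            }
        }
    }
    CloseHandle(snap);
    return 0;
}
```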
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
> So far, 7.6.2 appears to be working as well as the old 7.4.42. I grumbled a bit because they moved things around on the menu again and I had to hunt to find often used shut down client. Other than that little nitpick, I like it. It has a lot more options to toggle and they seem to be in the expected places.

I started up the xw9400 about an hour ago and immediately installed v7.6.2. The first thing that caught my eye was that the log didn't show the tasks all restarting, even though they clearly did. I guessed that might be the cpu_sched option and turned it on. So far everything seems to be running smoothly. I haven't tried rooting around in the 7.6.2 GUI yet to see what else might be of interest. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
One thing to add to the MW/SETI priority mix is that if you reach a state of overcommit at any point, then that is where the probability of the application/boincapi side of the weaknesses happening (at least) should become higher. Examples of the most extreme situation would involve full utilisation of the CPUs. This pressure occurs most with GPU tasks starting up and shutting down on top of existing CPU tasks; then pile on the BOINC client recalculating/simulating its schedule, other housekeeping/network activity, and possibly other system tasks going on.

In these respects it becomes a statistical exercise, like the ~3% figure that was mentioned somewhere, where you're waiting for the right set of circumstances to coincide, and any single test that makes the frequency significantly & consistently higher or lower over a time period points toward where the weaknesses lie (which is probably everywhere). That would seem to be evidenced by the relationship to the number of slots/resources being managed on a single IO subsystem. Naturally the system and tools are there for grace under pressure (the client, app and system as a whole), but current code tends toward assumed imperative control of situations it has no knowledge of.

[Edit:] Probably another way to put it: 'bugs' probably exist in there, possibly multiple, but if the success rate or expected behaviour holds some ~97% of the time, then you're (literally, probably) not going to find them considering only the 'normal' (i.e. three standard deviations either side of the mean) logic and operation. Those failing ~3% would be off the edges of where that logic copes, so you need to be looking for design operational limits, which is a whole other field in addition to logic bugs: stress related/induced.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
OK, I've finished cleaning out my cache (and drunk enough coffee), so I've just set host 465695 the task of searching for blank MW stderrs. Running two-up on a single GTX 670, but otherwise plain vanilla for the time being.

In case it got missed in the rather quick-fire conversation yesterday, don't worry about the inconclusives. Milkyway, according to my observations from the peanut gallery, uses an obscure BOINC server configuration option called Adaptive Replication - so obscure that I suspect that everyone involved (project staff and volunteers alike) has forgotten that they use it, and what it means. In effect, because this host hasn't crunched MW for about two years, it's not trusted. Every one of my first few tasks will be checked, and checked again - and the check tasks will come from the back of the work generation queue.

I'm not sure whether the validation errors we're looking for will have to wait for the second validation check task to report, but I'm hopeful - because of the wording "Checked, but no consensus yet" - that any validation errors will show up on first report, without delay. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Sounds good. It'll be interesting to see what kind of pointy sticks need to be poked at the host to make it get those, and how long it takes to get a predictable percentage.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Interesting how little CPU time the NVidia OpenCL tasks need - and I'm typing from the machine running the test, with no sign of screen lag either. I'd held back an extra core just in case (running five CPU tasks plus intel_gpu), but I think I can safely release that.

[Edit - verified it is an OpenCL app; Process Explorer shows threads running in nvopencl.dll. You can never be too sure about the plan_class name at MW.] |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
> Interesting how little CPU time the NVidia OpenCL tasks need - and I'm typing from the machine running the test, with no sign of screen lag either. I'd held back an extra core just in case (running five CPU tasks plus intel_gpu), but I think I can safely release that.

Yes, there are several techniques to achieve that in OpenCL that we tested at CA (though the others may not have been completely aware what they were testing). Some of these methods I'll eventually be substituting in place of the old Cuda blocking synchronisation I use in Xbranch, since it allows yet lower CPU usage, better non-blocking operation, and adaptively self scaling feedrates to user or safety setpoints. Took a few years to get enough of a grip on how things 'should be', rather than focussing on the existing dogma. Manifesting visualisation is far more fun and effective than labouring on a tired and/or broken idea for sure.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Jord Send message Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3 |
> I grumbled a bit because they moved things around on the menu again and I had to hunt to find often used shut down client.

Under the File menu. :)

Blame me for the switch-around and remodeling/renaming of the menus. After adding several things and splitting off the add project and add account manager wizards into separate windows, the original menus were too cluttered and the original headings no longer held true for their contents. So I complained about it, after which the developers, plus me and Jacob Klein, sat down and showed each other what we thought the menus should look like, be named, etc. It took several test and trial versions before we arrived at the version that's out now. |