Message boards :
Number crunching :
What causes a Blank stderr?
Message board moderation
Previous · 1 · 2 · 3 · Next
Author | Message |
---|---|
![]() ![]() Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 ![]() |
To me it seems the addition of a "Check folder is actually empty before reuse" would solve the issue. Should do, that'd effectively be a mutex. More than one way to skin a cat of course. Another is certainly to make slots unique as described (or suitable variant), and garbage collect finished with resources at leisure, as a background thread at low priority. That last option (or some variant) would let the client do what it needs to do with new tasks/slots in the most timely fashion, and reduce logic involved in checking/locking. Might not be the best option for some situations, but probably the best for multicore PC type platforms at least. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Richard Haselgrove ![]() Send message Joined: 4 Jul 99 Posts: 14600 Credit: 200,643,578 RAC: 874 ![]() ![]() |
I've been following the development of this problem since noticing a conversation bouncing between the Milkyway and Einstein message boards since, I think, about 24 April. It was several days (and many false turnings) before the originator of both threads turned to look at his slot folders, and found two stray CNS-dev .vdi files. Since CMS tasks run for 24 hours, and there is a maximum quota of one per machine, I'm interpreting that as saying that files may be present for hours, maybe days, after BOINC calls DeleteFile(). I don't think any OS's garbage collection is as lazy as that - especially since I think the false trails included GPU driver updates and other brute-force operations (though I haven't found any explicit reference to re-starting the host). |
![]() ![]() ![]() Send message Joined: 29 Apr 01 Posts: 13161 Credit: 1,160,866,277 RAC: 1,873 ![]() ![]() |
I've set the slot_debug flag in my logs for both machines since I caught up reading this thread this morning. Haven't seen any slot clearing failure yet though it is still early in the day. I usually generate a couple of invalids a day per machine at MilkyWay. I will post any log evidence I can find later this evening. Cheers, Keith Seti@Home classic workunits:20,676 CPU time:74,226 hours ![]() ![]() A proud member of the OFA (Old Farts Association) |
![]() ![]() Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 ![]() |
... I'm interpreting that as saying that files may be present for hours, maybe days, after BOINC calls DeleteFile(). I don't think any OS's garbage collection is as lazy as that... True, however when there are open handles on those files, the deletion may well successfully mark for later deletion. In principle, a corrupt DLL global space (including open handles/locks in the user mode driver) can persist until reboot. IOW, you can walk through as much logic that seems reasonable, but as soon as you have race conditions and non-threadsafe behaviour, all bets are off, and you will see seemingly non-deterministic behaviour. [Edit:] note that at a high level, the 'race condition' concerned typically begins the instant the application writes the finished file. The race is between the finished file being created by the system, and the boinc client picking up the application exit with zero status. The non threadsafe ('insecure') behaviour is primarily the hard process termination, but includes other app and library usage specific issues. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
![]() ![]() ![]() Send message Joined: 29 Apr 01 Posts: 13161 Credit: 1,160,866,277 RAC: 1,873 ![]() ![]() |
OK, I missed an invalid this morning when I was out of the house and it had already been cleared from the log before I got back. However I seem to have generated another one this afternoon and its tracks should be in the logfile. I searched for the string "failed" and failed to find any incidence. I matched the time the invalid MW task reported to its entries in the logfile and see something that doesn't appear anytime else in the current log. I see two slot cleanings reported at the same time, slot 12 and slot 13. Plus I see where the stderr.txt file DID NOT get removed in the block of cleanings as the rest of slot 12. It got removed down in the slot 13 block. It looks like a race condition to me. 7573 5/6/2015 2:12:07 PM [slot] cleaning out slots/12: handle_exited_app() 7574 5/6/2015 2:12:07 PM [slot] removed file slots/12/astronomy_parameters.txt 7575 5/6/2015 2:12:07 PM [slot] removed file slots/12/boinc_finish_called 7576 5/6/2015 2:12:07 PM [slot] removed file slots/12/init_data.xml 7577 5/6/2015 2:12:07 PM [slot] removed file slots/12/milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe 7578 5/6/2015 2:12:07 PM [slot] removed file slots/12/separation_checkpoint 7579 5/6/2015 2:12:07 PM [slot] removed file slots/12/stars.txt 7580 5/6/2015 2:12:07 PM [slot] removed file slots/12/stderr.txt 7581 Milkyway@Home 5/6/2015 2:12:07 PM Computation for task de_modfit_fast_15_3s_136_simmodBPL2_1_1429700402_5765257_1 finished 7582 5/6/2015 2:12:07 PM [slot] cleaning out slots/12: get_free_slot() 7583 Milkyway@Home 5/6/2015 2:12:07 PM [slot] assigning slot 12 to de_modfit_fast_15_3s_136_simmodBPL2_2_1429700402_5782635_0 7584 Milkyway@Home 5/6/2015 2:12:07 PM [slot] linked ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe to slots/12/milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe 7585 Milkyway@Home 5/6/2015 2:12:07 PM [slot] linked ../../projects/milkyway.cs.rpi.edu_milkyway/parameters-15-3s-modfit-simmodBPL2.txt to slots/12/astronomy_parameters.txt 7586 Milkyway@Home 5/6/2015 2:12:07 PM [slot] linked ../../projects/milkyway.cs.rpi.edu_milkyway/stars-15-simmodBPL2.txt to slots/12/stars.txt 7587 Milkyway@Home 5/6/2015 2:12:07 PM Starting task de_modfit_fast_15_3s_136_simmodBPL2_2_1429700402_5782635_0 7588 Milkyway@Home 5/6/2015 2:12:42 PM Sending scheduler request: To fetch work. 7589 Milkyway@Home 5/6/2015 2:12:42 PM Reporting 1 completed tasks 7590 Milkyway@Home 5/6/2015 2:12:42 PM Requesting new tasks for NVIDIA GPU 7591 Milkyway@Home 5/6/2015 2:12:44 PM Scheduler request completed: got 1 new tasks 7592 5/6/2015 2:12:59 PM [slot] cleaning out slots/12: handle_exited_app() 7593 5/6/2015 2:12:59 PM [slot] removed file slots/12/astronomy_parameters.txt 7594 5/6/2015 2:12:59 PM [slot] removed file slots/12/boinc_finish_called 7595 5/6/2015 2:12:59 PM [slot] removed file slots/12/init_data.xml 7596 5/6/2015 2:12:59 PM [slot] removed file slots/12/milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe 7597 5/6/2015 2:12:59 PM [slot] removed file slots/12/separation_checkpoint 7598 5/6/2015 2:12:59 PM [slot] removed file slots/12/stars.txt 7599 Milkyway@Home 5/6/2015 2:12:59 PM Computation for task de_modfit_fast_15_3s_136_simmodBPL2_2_1429700402_5782635_0 finished 7600 5/6/2015 2:12:59 PM [slot] cleaning out slots/12: get_free_slot() 7601 5/6/2015 2:12:59 PM [slot] cleaning out slots/13: get_free_slot() 7602 Milkyway@Home 5/6/2015 2:12:59 PM [slot] assigning slot 13 to de_80_DR8_Rev_8_5_00004_1429700402_7400119_0 7603 Milkyway@Home 5/6/2015 2:12:59 PM [slot] linked ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_separation_1.02_windows_x86_64__opencl_nvidia.exe to slots/13/milkyway_separation_1.02_windows_x86_64__opencl_nvidia.exe 7604 Milkyway@Home 5/6/2015 2:12:59 PM [slot] linked ../../projects/milkyway.cs.rpi.edu_milkyway/80_rev_8_5.prmtrs to slots/13/astronomy_parameters.txt 7605 Milkyway@Home 5/6/2015 2:12:59 PM [slot] linked ../../projects/milkyway.cs.rpi.edu_milkyway/80_Rev_8_3.stars to slots/13/stars.txt 7606 Milkyway@Home 5/6/2015 2:12:59 PM Starting task de_80_DR8_Rev_8_5_00004_1429700402_7400119_0 7607 Milkyway@Home 5/6/2015 2:13:49 PM Sending scheduler request: To fetch work. 7608 Milkyway@Home 5/6/2015 2:13:49 PM Reporting 1 completed tasks 7609 Milkyway@Home 5/6/2015 2:13:49 PM Requesting new tasks for NVIDIA GPU 7610 Milkyway@Home 5/6/2015 2:13:51 PM Scheduler request completed: got 1 new tasks 7611 Milkyway@Home 5/6/2015 2:14:56 PM Sending scheduler request: To fetch work. 7612 Milkyway@Home 5/6/2015 2:14:56 PM Requesting new tasks for NVIDIA GPU 7613 Milkyway@Home 5/6/2015 2:14:58 PM Scheduler request completed: got 0 new tasks 7614 SETI@home 5/6/2015 2:15:28 PM Sending scheduler request: To fetch work. 7615 SETI@home 5/6/2015 2:15:28 PM Reporting 1 completed tasks 7616 SETI@home 5/6/2015 2:15:28 PM Requesting new tasks for CPU and NVIDIA GPU 7617 SETI@home 5/6/2015 2:15:31 PM Scheduler request completed: got 1 new tasks 7618 SETI@home 5/6/2015 2:15:31 PM [slot] linked projects/setiathome.berkeley.edu/arecibo_181.png to projects/setiathome.berkeley.edu/slideshow_setiathome_v7_01 7619 SETI@home 5/6/2015 2:15:31 PM [slot] linked projects/setiathome.berkeley.edu/arecibo_181.png to projects/setiathome.berkeley.edu/slideshow_astropulse_v6_01 7620 SETI@home 5/6/2015 2:15:31 PM [slot] linked projects/setiathome.berkeley.edu/sah_40.png to projects/setiathome.berkeley.edu/stat_icon 7621 SETI@home 5/6/2015 2:15:31 PM [slot] linked projects/setiathome.berkeley.edu/sah_banner_290.png to projects/setiathome.berkeley.edu/slideshow_setiathome_v7_02 7622 SETI@home 5/6/2015 2:15:31 PM [slot] linked projects/setiathome.berkeley.edu/sah_banner_290.png to projects/setiathome.berkeley.edu/slideshow_astropulse_v6_02 7623 SETI@home 5/6/2015 2:15:31 PM [slot] linked projects/setiathome.berkeley.edu/sah_ss_290.png to projects/setiathome.berkeley.edu/slideshow_setiathome_v7_00 7624 SETI@home 5/6/2015 2:15:31 PM [slot] linked projects/setiathome.berkeley.edu/sah_ss_290.png to projects/setiathome.berkeley.edu/slideshow_astropulse_v6_00 7625 5/6/2015 2:15:31 PM [slot] removed file projects/setiathome.berkeley.edu/25se12ab.22094.24198.438086664198.12.110 7626 SETI@home 5/6/2015 2:15:33 PM Started download of 25no12ad.9834.2930.438086664208.12.177 7627 SETI@home 5/6/2015 2:15:34 PM Finished download of 25no12ad.9834.2930.438086664208.12.177 7628 5/6/2015 2:15:52 PM [slot] cleaning out slots/13: handle_exited_app() 7629 5/6/2015 2:15:52 PM [slot] removed file slots/13/astronomy_parameters.txt 7630 5/6/2015 2:15:52 PM [slot] removed file slots/13/boinc_finish_called 7631 5/6/2015 2:15:52 PM [slot] removed file slots/13/boinc_task_state.xml 7632 5/6/2015 2:15:52 PM [slot] removed file slots/13/init_data.xml 7633 5/6/2015 2:15:52 PM [slot] removed file slots/13/milkyway_separation_1.02_windows_x86_64__opencl_nvidia.exe 7634 5/6/2015 2:15:52 PM [slot] removed file slots/13/separation_checkpoint 7635 5/6/2015 2:15:52 PM [slot] removed file slots/13/stars.txt 7636 5/6/2015 2:15:52 PM [slot] removed file slots/13/stderr.txt 7637 Milkyway@Home 5/6/2015 2:15:52 PM Computation for task de_80_DR8_Rev_8_5_00004_1429700402_7400119_0 finished 7638 5/6/2015 2:15:52 PM [slot] cleaning out slots/12: get_free_slot() 7639 5/6/2015 2:15:52 PM [slot] removed file slots/12/stderr.txt 7640 Milkyway@Home 5/6/2015 2:15:52 PM [slot] assigning slot 12 to de_80_DR8_Rev_8_5_00004_1429700402_7401189_0 Does this help the developers out in any fashion? Or am I reading something into nothing? Cheers, Keith Seti@Home classic workunits:20,676 CPU time:74,226 hours ![]() ![]() A proud member of the OFA (Old Farts Association) |
![]() ![]() ![]() ![]() Send message Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 ![]() ![]() |
I just asked why blank results happen, not the entire programming behind it, LOL This may be of interest to you programmers. I have 2 identical 750ti in 2 computers, both running 2 MB tasks at a time. AMD 4200+, no blanks, running clean INTEL 2.9GHz, occasional blanks So it could be the speed of the processor (stop/start speed), or the difference in AMD/INTEL architecture. If you are looking at my errors, that was caused by a computer crash that corrupted some STEI files, Reinstalled and it deleted my berkly folded (grr) ![]() |
![]() ![]() Send message Joined: 11 Sep 99 Posts: 6533 Credit: 196,805,888 RAC: 57 ![]() ![]() |
I just asked why blank results happen, not the entire programming behind it, LOL I would expect the different OS on each machine is more likely a factor in the issue. It could be related to issues I've had deleting files/folder in Windows. Since the release of Vista. Which only got better post Windows 7 SP1. SETI@home classic workunits: 93,865 CPU time: 863,447 hours |
![]() ![]() Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 ![]() |
I would expect the different OS on each machine is more likely a factor in the issue. Highly likely. More specifically, multithreaded C-Runtime libraries (made the default and only option since Visual studio 2005 ). extensively kernel buffered I/O (which is a desktop OS performance optimisation), and proliferation of multi-core. Win7 (SP1) is a fair bit more aggressive about its own garbage collection, hiding some serious application level headaches. It's a bit of a case where some could say 'd@mn M$', but then seeing vaguely similar scaling problems starting to happen with Linux and Android, I'd suggest that most likely bad programming and development inertia are ubiquitous. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
![]() ![]() Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 ![]() |
I just asked why blank results happen, not the entire programming behind it, LOL Haha. that's what happens if you open a can of worms. The summary answer is 'outdated design and programming methods'. Why it's never been properly addressed since I reported something related circa 2007 is another matter, involving pointless arguments about 'My OS is better than your OS' and 'not broken here syndrome'. [A special term I've just made up, I like it] "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Cosmic_Ocean ![]() Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13 ![]() ![]() |
I don't know if this particularly helps in any way or not, but I just wanted to add that I have lots of experience with deleting large amounts of data, whether it be lots of tiny files, or one or two large files. Both of those conditions take more than a few seconds on an average mechanical platter hard drive, and even still take a few seconds on an SSD, though it is noticeably faster on the SSD. It still takes mine a good 2-3 seconds to delete a 5GB file. Likewise, it took almost 20 seconds to delete a folder that had 147,000 files. So since MW uses large files that it then has to delete at the end of the task, it is reasonable to conclude that deleting takes too long before the next task attempts to use that slot. Speaking of slots, though I've never seen any issues myself, I have, on occasion, seen more slots in use than I should have. I only crunch on 3 cores of my CPU, so my slots should only ever be 0,1,2. At one point, I noticed I had a 3 and a 4, and no tasks were moved up to high-priority and I hadn't suspended some to run others, either. The more interesting thing about that is.. slots 3 and 4 stayed, even though my cache had completely emptied-out (because APs were scarce..again). I looked at the modified date on the slot folders and the files inside, and they were from weeks prior. Granted, yes, I'm using an older version of BOINC (6.10.58) and I'm sure a lot of quirks have been fixed since that build, but I feel that my occasional observations here and there seem to support the notion that deleting the files takes longer than expected, and BOINC doesn't appear to actually check that the slot is empty before using it again. ...or something along those lines. It certainly seems like you guys have gone from "this is a noted random thing" to "we're pretty sure it has to do with files not being deleted in a timely fashion," so.. that's pretty good progress by itself. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving-up) |
![]() ![]() Send message Joined: 11 Sep 99 Posts: 6533 Credit: 196,805,888 RAC: 57 ![]() ![]() |
I don't know if this particularly helps in any way or not, but I just wanted to add that I have lots of experience with deleting large amounts of data, whether it be lots of tiny files, or one or two large files. Both of those conditions take more than a few seconds on an average mechanical platter hard drive, and even still take a few seconds on an SSD, though it is noticeably faster on the SSD. It still takes mine a good 2-3 seconds to delete a 5GB file. Likewise, it took almost 20 seconds to delete a folder that had 147,000 files. The quirk of BOINC leaving empty folders behind still exists in 7.4.42. SETI@home classic workunits: 93,865 CPU time: 863,447 hours |
Richard Haselgrove ![]() Send message Joined: 4 Jul 99 Posts: 14600 Credit: 200,643,578 RAC: 874 ![]() ![]() |
I looked at the modified date on the slot folders and the files inside, and they were from weeks prior. The quirk of BOINC leaving empty folders behind still exists in 7.4.42. BOINC does trim the slot list by removing empty and unused folders when the client restarts - I've just watched one of mine drop the folder range from 0..9 to 0..7 as it started up. But I don't know whether it only removes folders after the last one currently in use, or whether it could ever leave 'holes' in the folder list by removing empty ones in the middle. I'd guess non-empty folders wouldn't get deleted, whether or not an active task was marked as using them (though I haven't checked the code). According to the logic I posted yesterday, they should be emptied (files deleted) if BOINC ever finds it needs to re-use the slot, and if that works - and the subsequent clean when the task finishes works too - that should release the slot for purging at the next restart. |
![]() ![]() Send message Joined: 11 Sep 99 Posts: 6533 Credit: 196,805,888 RAC: 57 ![]() ![]() |
I looked at the modified date on the slot folders and the files inside, and they were from weeks prior. I did forget that the current versions do clean up while starting. The next time I see stray empty slot folders I may try moving data from one of the low numbered ones to the highest to see if it will delete middle number folders. I will probably need to check is there are references in client_state.xml that need to be modified so that BOINC doesn't trash a task by thinking it has gone missing form the original folder. SETI@home classic workunits: 93,865 CPU time: 863,447 hours |
Richard Haselgrove ![]() Send message Joined: 4 Jul 99 Posts: 14600 Credit: 200,643,578 RAC: 874 ![]() ![]() |
I will probably need to check is there are references in client_state.xml that need to be modified so that BOINC doesn't trash a task by thinking it has gone missing form the original folder. Yes there are, but I think only in the 'active task list' at the very bottom of client_state.xml I haven't looked to see if there are any if the slot files themselves - don't blame me if it goes wrong. |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 ![]() ![]() |
The easiest way is to suspend a few running tasks, then resume the earlier tasks and let them complete. Then resume all tasks and stop/start BOINC. In my case(s) the empty middle folders with the completed tasks will be removed and the higher slots with pending tasks will be left as there were. BTW, I'm pretty sure a few of us in the old thread reported finding 'empty' slot folders that still contained a stderr.txt file. Everything else had been deleted except the stderr.txt. So, I'm not seeing much progress in this thread. At least Raistmer seems have 'fixed' the related problems with the APs that were crashing after completion occurring back then. I haven't seen any 'crashed after finish' APs in a while. |
![]() ![]() Send message Joined: 11 Sep 99 Posts: 6533 Credit: 196,805,888 RAC: 57 ![]() ![]() |
I will probably need to check is there are references in client_state.xml that need to be modified so that BOINC doesn't trash a task by thinking it has gone missing form the original folder. When I got home from work I checked my HTPC & it happened to have slots 0-5 with the slot 3 folder being empty. Normally there should only be 0-4 for 4 CPU & 1 ATI GPU task. Restarting BOINC caused the 3 folder to be removed. Now I wonder if it will continue with 0,1,2,4,5 or add a new 3 slot folder. Rather than force anything I'm just going to see what happens now. Looking over the data in the slot folders init_data.xml also contains the tasks slot number. So that would need to be changed if moving a task to a new slot I expect. SETI@home classic workunits: 93,865 CPU time: 863,447 hours |
Cosmic_Ocean ![]() Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13 ![]() ![]() |
I looked at the modified date on the slot folders and the files inside, and they were from weeks prior. That makes sense.. I very rarely restart BOINC on purpose, and only reboot my main cruncher every 3-6 weeks--usually because Windows Update demands a reboot. I never did look and see if unused slot folders go away after a reboot, I just noticed them still being there long after all tasks had finished. I have also seen (farther into the past, not anytime recently) cases where I still had the WU file or two for MB or AP several months to a year+ after that WU was crunched and returned and the machine had been rebooted dozens of times. Yet if I leave an old build for one of the science apps in there and app_info doesn't reference it.. there's red-text in the message log and the old EXE gets deleted, so that must mean that those old WUs are still in client_state for whatever reason. *shrug* Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving-up) |
Richard Haselgrove ![]() Send message Joined: 4 Jul 99 Posts: 14600 Credit: 200,643,578 RAC: 874 ![]() ![]() |
I've now observed the exact problem that started me on the current bug search, live in real time on my own machine. http://boincai05.cern.ch/CMS-dev/forum_thread.php?id=34&postid=337 It appears that files over 4 GB in size are invisible to, or skipped by, BOINC's slot cleaning routines. It's a real file, and it persists and can be copied - BOINC knows about its size if it tries to run another task, but doesn't even try to delete it. That probably doesn't help you all with stderr.txt, but it's probably easier for the BOINC devs to deal with than some of the other theories. |
![]() ![]() Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 ![]() |
It's going to be interesting to look at the file enumeration part, the functions used and MSDN references. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Richard Haselgrove ![]() Send message Joined: 4 Jul 99 Posts: 14600 Credit: 200,643,578 RAC: 874 ![]() ![]() |
It's going to be interesting to look at the file enumeration part, the functions used and MSDN references. dir_scan() in sandbox.cpp, around line 300. But David beat me to it: fa3f6be5128071fe7e15563e727b5478c45a63b8 |
©2023 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.