What causes a Blank stderr?

Message boards : Number crunching : What causes a Blank stderr?
Message board moderation

To post messages, you must log in.

Previous · 1 · 2 · 3 · Next

AuthorMessage
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1674380 - Posted: 6 May 2015, 15:15:03 UTC - in response to Message 1674379.  
Last modified: 6 May 2015, 15:20:55 UTC

To me it seems the addition of a "Check folder is actually empty before reuse" would solve the issue.


Should do, that'd effectively be a mutex. More than one way to skin a cat of course. Another is certainly to make slots unique as described (or suitable variant), and garbage collect finished with resources at leisure, as a background thread at low priority. That last option (or some variant) would let the client do what it needs to do with new tasks/slots in the most timely fashion, and reduce logic involved in checking/locking. Might not be the best option for some situations, but probably the best for multicore PC type platforms at least.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1674380 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14431
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1674383 - Posted: 6 May 2015, 15:28:30 UTC - in response to Message 1674380.  

I've been following the development of this problem since noticing a conversation bouncing between the Milkyway and Einstein message boards since, I think, about 24 April. It was several days (and many false turnings) before the originator of both threads turned to look at his slot folders, and found two stray CNS-dev .vdi files. Since CMS tasks run for 24 hours, and there is a maximum quota of one per machine, I'm interpreting that as saying that files may be present for hours, maybe days, after BOINC calls DeleteFile(). I don't think any OS's garbage collection is as lazy as that - especially since I think the false trails included GPU driver updates and other brute-force operations (though I haven't found any explicit reference to re-starting the host).
ID: 1674383 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13078
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1674390 - Posted: 6 May 2015, 15:59:59 UTC

I've set the slot_debug flag in my logs for both machines since I caught up reading this thread this morning. Haven't seen any slot clearing failure yet though it is still early in the day. I usually generate a couple of invalids a day per machine at MilkyWay. I will post any log evidence I can find later this evening.

Cheers, Keith
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1674390 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1674428 - Posted: 6 May 2015, 17:49:35 UTC - in response to Message 1674383.  
Last modified: 6 May 2015, 18:25:00 UTC

... I'm interpreting that as saying that files may be present for hours, maybe days, after BOINC calls DeleteFile(). I don't think any OS's garbage collection is as lazy as that...


True, however when there are open handles on those files, the deletion may well successfully mark for later deletion. In principle, a corrupt DLL global space (including open handles/locks in the user mode driver) can persist until reboot.

IOW, you can walk through as much logic that seems reasonable, but as soon as you have race conditions and non-threadsafe behaviour, all bets are off, and you will see seemingly non-deterministic behaviour.

[Edit:] note that at a high level, the 'race condition' concerned typically begins the instant the application writes the finished file. The race is between the finished file being created by the system, and the boinc client picking up the application exit with zero status. The non threadsafe ('insecure') behaviour is primarily the hard process termination, but includes other app and library usage specific issues.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1674428 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13078
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1674525 - Posted: 6 May 2015, 23:16:57 UTC - in response to Message 1674390.  

OK, I missed an invalid this morning when I was out of the house and it had already been cleared from the log before I got back. However I seem to have generated another one this afternoon and its tracks should be in the logfile. I searched for the string "failed" and failed to find any incidence. I matched the time the invalid MW task reported to its entries in the logfile and see something that doesn't appear anytime else in the current log. I see two slot cleanings reported at the same time, slot 12 and slot 13. Plus I see where the stderr.txt file DID NOT get removed in the block of cleanings as the rest of slot 12. It got removed down in the slot 13 block. It looks like a race condition to me.

7573 5/6/2015 2:12:07 PM [slot] cleaning out slots/12: handle_exited_app()
7574 5/6/2015 2:12:07 PM [slot] removed file slots/12/astronomy_parameters.txt
7575 5/6/2015 2:12:07 PM [slot] removed file slots/12/boinc_finish_called
7576 5/6/2015 2:12:07 PM [slot] removed file slots/12/init_data.xml
7577 5/6/2015 2:12:07 PM [slot] removed file slots/12/milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe
7578 5/6/2015 2:12:07 PM [slot] removed file slots/12/separation_checkpoint
7579 5/6/2015 2:12:07 PM [slot] removed file slots/12/stars.txt
7580 5/6/2015 2:12:07 PM [slot] removed file slots/12/stderr.txt
7581 Milkyway@Home 5/6/2015 2:12:07 PM Computation for task de_modfit_fast_15_3s_136_simmodBPL2_1_1429700402_5765257_1 finished
7582 5/6/2015 2:12:07 PM [slot] cleaning out slots/12: get_free_slot()
7583 Milkyway@Home 5/6/2015 2:12:07 PM [slot] assigning slot 12 to de_modfit_fast_15_3s_136_simmodBPL2_2_1429700402_5782635_0
7584 Milkyway@Home 5/6/2015 2:12:07 PM [slot] linked ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe to slots/12/milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe
7585 Milkyway@Home 5/6/2015 2:12:07 PM [slot] linked ../../projects/milkyway.cs.rpi.edu_milkyway/parameters-15-3s-modfit-simmodBPL2.txt to slots/12/astronomy_parameters.txt
7586 Milkyway@Home 5/6/2015 2:12:07 PM [slot] linked ../../projects/milkyway.cs.rpi.edu_milkyway/stars-15-simmodBPL2.txt to slots/12/stars.txt
7587 Milkyway@Home 5/6/2015 2:12:07 PM Starting task de_modfit_fast_15_3s_136_simmodBPL2_2_1429700402_5782635_0
7588 Milkyway@Home 5/6/2015 2:12:42 PM Sending scheduler request: To fetch work.
7589 Milkyway@Home 5/6/2015 2:12:42 PM Reporting 1 completed tasks
7590 Milkyway@Home 5/6/2015 2:12:42 PM Requesting new tasks for NVIDIA GPU
7591 Milkyway@Home 5/6/2015 2:12:44 PM Scheduler request completed: got 1 new tasks
7592 5/6/2015 2:12:59 PM [slot] cleaning out slots/12: handle_exited_app()
7593 5/6/2015 2:12:59 PM [slot] removed file slots/12/astronomy_parameters.txt
7594 5/6/2015 2:12:59 PM [slot] removed file slots/12/boinc_finish_called
7595 5/6/2015 2:12:59 PM [slot] removed file slots/12/init_data.xml
7596 5/6/2015 2:12:59 PM [slot] removed file slots/12/milkyway_separation__modified_fit_1.36_windows_x86_64__opencl_nvidia_101.exe
7597 5/6/2015 2:12:59 PM [slot] removed file slots/12/separation_checkpoint
7598 5/6/2015 2:12:59 PM [slot] removed file slots/12/stars.txt
7599 Milkyway@Home 5/6/2015 2:12:59 PM Computation for task de_modfit_fast_15_3s_136_simmodBPL2_2_1429700402_5782635_0 finished
7600 5/6/2015 2:12:59 PM [slot] cleaning out slots/12: get_free_slot()

7601 5/6/2015 2:12:59 PM [slot] cleaning out slots/13: get_free_slot()
7602 Milkyway@Home 5/6/2015 2:12:59 PM [slot] assigning slot 13 to de_80_DR8_Rev_8_5_00004_1429700402_7400119_0
7603 Milkyway@Home 5/6/2015 2:12:59 PM [slot] linked ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_separation_1.02_windows_x86_64__opencl_nvidia.exe to slots/13/milkyway_separation_1.02_windows_x86_64__opencl_nvidia.exe
7604 Milkyway@Home 5/6/2015 2:12:59 PM [slot] linked ../../projects/milkyway.cs.rpi.edu_milkyway/80_rev_8_5.prmtrs to slots/13/astronomy_parameters.txt
7605 Milkyway@Home 5/6/2015 2:12:59 PM [slot] linked ../../projects/milkyway.cs.rpi.edu_milkyway/80_Rev_8_3.stars to slots/13/stars.txt
7606 Milkyway@Home 5/6/2015 2:12:59 PM Starting task de_80_DR8_Rev_8_5_00004_1429700402_7400119_0
7607 Milkyway@Home 5/6/2015 2:13:49 PM Sending scheduler request: To fetch work.
7608 Milkyway@Home 5/6/2015 2:13:49 PM Reporting 1 completed tasks
7609 Milkyway@Home 5/6/2015 2:13:49 PM Requesting new tasks for NVIDIA GPU
7610 Milkyway@Home 5/6/2015 2:13:51 PM Scheduler request completed: got 1 new tasks
7611 Milkyway@Home 5/6/2015 2:14:56 PM Sending scheduler request: To fetch work.
7612 Milkyway@Home 5/6/2015 2:14:56 PM Requesting new tasks for NVIDIA GPU
7613 Milkyway@Home 5/6/2015 2:14:58 PM Scheduler request completed: got 0 new tasks
7614 SETI@home 5/6/2015 2:15:28 PM Sending scheduler request: To fetch work.
7615 SETI@home 5/6/2015 2:15:28 PM Reporting 1 completed tasks
7616 SETI@home 5/6/2015 2:15:28 PM Requesting new tasks for CPU and NVIDIA GPU
7617 SETI@home 5/6/2015 2:15:31 PM Scheduler request completed: got 1 new tasks
7618 SETI@home 5/6/2015 2:15:31 PM [slot] linked projects/setiathome.berkeley.edu/arecibo_181.png to projects/setiathome.berkeley.edu/slideshow_setiathome_v7_01
7619 SETI@home 5/6/2015 2:15:31 PM [slot] linked projects/setiathome.berkeley.edu/arecibo_181.png to projects/setiathome.berkeley.edu/slideshow_astropulse_v6_01
7620 SETI@home 5/6/2015 2:15:31 PM [slot] linked projects/setiathome.berkeley.edu/sah_40.png to projects/setiathome.berkeley.edu/stat_icon
7621 SETI@home 5/6/2015 2:15:31 PM [slot] linked projects/setiathome.berkeley.edu/sah_banner_290.png to projects/setiathome.berkeley.edu/slideshow_setiathome_v7_02
7622 SETI@home 5/6/2015 2:15:31 PM [slot] linked projects/setiathome.berkeley.edu/sah_banner_290.png to projects/setiathome.berkeley.edu/slideshow_astropulse_v6_02
7623 SETI@home 5/6/2015 2:15:31 PM [slot] linked projects/setiathome.berkeley.edu/sah_ss_290.png to projects/setiathome.berkeley.edu/slideshow_setiathome_v7_00
7624 SETI@home 5/6/2015 2:15:31 PM [slot] linked projects/setiathome.berkeley.edu/sah_ss_290.png to projects/setiathome.berkeley.edu/slideshow_astropulse_v6_00
7625 5/6/2015 2:15:31 PM [slot] removed file projects/setiathome.berkeley.edu/25se12ab.22094.24198.438086664198.12.110
7626 SETI@home 5/6/2015 2:15:33 PM Started download of 25no12ad.9834.2930.438086664208.12.177
7627 SETI@home 5/6/2015 2:15:34 PM Finished download of 25no12ad.9834.2930.438086664208.12.177
7628 5/6/2015 2:15:52 PM [slot] cleaning out slots/13: handle_exited_app()
7629 5/6/2015 2:15:52 PM [slot] removed file slots/13/astronomy_parameters.txt
7630 5/6/2015 2:15:52 PM [slot] removed file slots/13/boinc_finish_called
7631 5/6/2015 2:15:52 PM [slot] removed file slots/13/boinc_task_state.xml
7632 5/6/2015 2:15:52 PM [slot] removed file slots/13/init_data.xml
7633 5/6/2015 2:15:52 PM [slot] removed file slots/13/milkyway_separation_1.02_windows_x86_64__opencl_nvidia.exe
7634 5/6/2015 2:15:52 PM [slot] removed file slots/13/separation_checkpoint
7635 5/6/2015 2:15:52 PM [slot] removed file slots/13/stars.txt
7636 5/6/2015 2:15:52 PM [slot] removed file slots/13/stderr.txt
7637 Milkyway@Home 5/6/2015 2:15:52 PM Computation for task de_80_DR8_Rev_8_5_00004_1429700402_7400119_0 finished
7638 5/6/2015 2:15:52 PM [slot] cleaning out slots/12: get_free_slot()
7639 5/6/2015 2:15:52 PM [slot] removed file slots/12/stderr.txt
7640 Milkyway@Home 5/6/2015 2:15:52 PM [slot] assigning slot 12 to de_80_DR8_Rev_8_5_00004_1429700402_7401189_0


Does this help the developers out in any fashion? Or am I reading something into nothing?

Cheers, Keith
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1674525 · Report as offensive
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1674550 - Posted: 7 May 2015, 2:40:26 UTC

I just asked why blank results happen, not the entire programming behind it, LOL

This may be of interest to you programmers.

I have 2 identical 750ti in 2 computers, both running 2 MB tasks at a time.

AMD 4200+, no blanks, running clean
INTEL 2.9GHz, occasional blanks

So it could be the speed of the processor (stop/start speed), or the difference in AMD/INTEL architecture.

If you are looking at my errors, that was caused by a computer crash that corrupted some STEI files, Reinstalled and it deleted my berkly folded (grr)
ID: 1674550 · Report as offensive
Profile HAL9000
Volunteer tester
Avatar

Send message
Joined: 11 Sep 99
Posts: 6533
Credit: 196,805,888
RAC: 57
United States
Message 1674552 - Posted: 7 May 2015, 2:52:50 UTC - in response to Message 1674550.  

I just asked why blank results happen, not the entire programming behind it, LOL

This may be of interest to you programmers.

I have 2 identical 750ti in 2 computers, both running 2 MB tasks at a time.

AMD 4200+, no blanks, running clean
INTEL 2.9GHz, occasional blanks

So it could be the speed of the processor (stop/start speed), or the difference in AMD/INTEL architecture.

If you are looking at my errors, that was caused by a computer crash that corrupted some STEI files, Reinstalled and it deleted my berkly folded (grr)

I would expect the different OS on each machine is more likely a factor in the issue.
It could be related to issues I've had deleting files/folder in Windows. Since the release of Vista. Which only got better post Windows 7 SP1.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group today!
ID: 1674552 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1674560 - Posted: 7 May 2015, 3:44:29 UTC - in response to Message 1674552.  

I would expect the different OS on each machine is more likely a factor in the issue.
It could be related to issues I've had deleting files/folder in Windows. Since the release of Vista. Which only got better post Windows 7 SP1.


Highly likely. More specifically, multithreaded C-Runtime libraries (made the default and only option since Visual studio 2005 ). extensively kernel buffered I/O (which is a desktop OS performance optimisation), and proliferation of multi-core.

Win7 (SP1) is a fair bit more aggressive about its own garbage collection, hiding some serious application level headaches. It's a bit of a case where some could say 'd@mn M$', but then seeing vaguely similar scaling problems starting to happen with Linux and Android, I'd suggest that most likely bad programming and development inertia are ubiquitous.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1674560 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1674565 - Posted: 7 May 2015, 4:01:28 UTC - in response to Message 1674550.  
Last modified: 7 May 2015, 4:03:23 UTC

I just asked why blank results happen, not the entire programming behind it, LOL


Haha. that's what happens if you open a can of worms. The summary answer is 'outdated design and programming methods'. Why it's never been properly addressed since I reported something related circa 2007 is another matter, involving pointless arguments about 'My OS is better than your OS' and 'not broken here syndrome'. [A special term I've just made up, I like it]
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1674565 · Report as offensive
Cosmic_Ocean
Avatar

Send message
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1674757 - Posted: 7 May 2015, 17:32:53 UTC

I don't know if this particularly helps in any way or not, but I just wanted to add that I have lots of experience with deleting large amounts of data, whether it be lots of tiny files, or one or two large files. Both of those conditions take more than a few seconds on an average mechanical platter hard drive, and even still take a few seconds on an SSD, though it is noticeably faster on the SSD. It still takes mine a good 2-3 seconds to delete a 5GB file. Likewise, it took almost 20 seconds to delete a folder that had 147,000 files.

So since MW uses large files that it then has to delete at the end of the task, it is reasonable to conclude that deleting takes too long before the next task attempts to use that slot.

Speaking of slots, though I've never seen any issues myself, I have, on occasion, seen more slots in use than I should have. I only crunch on 3 cores of my CPU, so my slots should only ever be 0,1,2. At one point, I noticed I had a 3 and a 4, and no tasks were moved up to high-priority and I hadn't suspended some to run others, either. The more interesting thing about that is.. slots 3 and 4 stayed, even though my cache had completely emptied-out (because APs were scarce..again). I looked at the modified date on the slot folders and the files inside, and they were from weeks prior.

Granted, yes, I'm using an older version of BOINC (6.10.58) and I'm sure a lot of quirks have been fixed since that build, but I feel that my occasional observations here and there seem to support the notion that deleting the files takes longer than expected, and BOINC doesn't appear to actually check that the slot is empty before using it again. ...or something along those lines. It certainly seems like you guys have gone from "this is a noted random thing" to "we're pretty sure it has to do with files not being deleted in a timely fashion," so.. that's pretty good progress by itself.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1674757 · Report as offensive
Profile HAL9000
Volunteer tester
Avatar

Send message
Joined: 11 Sep 99
Posts: 6533
Credit: 196,805,888
RAC: 57
United States
Message 1674798 - Posted: 7 May 2015, 18:50:47 UTC - in response to Message 1674757.  

I don't know if this particularly helps in any way or not, but I just wanted to add that I have lots of experience with deleting large amounts of data, whether it be lots of tiny files, or one or two large files. Both of those conditions take more than a few seconds on an average mechanical platter hard drive, and even still take a few seconds on an SSD, though it is noticeably faster on the SSD. It still takes mine a good 2-3 seconds to delete a 5GB file. Likewise, it took almost 20 seconds to delete a folder that had 147,000 files.

So since MW uses large files that it then has to delete at the end of the task, it is reasonable to conclude that deleting takes too long before the next task attempts to use that slot.

Speaking of slots, though I've never seen any issues myself, I have, on occasion, seen more slots in use than I should have. I only crunch on 3 cores of my CPU, so my slots should only ever be 0,1,2. At one point, I noticed I had a 3 and a 4, and no tasks were moved up to high-priority and I hadn't suspended some to run others, either. The more interesting thing about that is.. slots 3 and 4 stayed, even though my cache had completely emptied-out (because APs were scarce..again). I looked at the modified date on the slot folders and the files inside, and they were from weeks prior.

Granted, yes, I'm using an older version of BOINC (6.10.58) and I'm sure a lot of quirks have been fixed since that build, but I feel that my occasional observations here and there seem to support the notion that deleting the files takes longer than expected, and BOINC doesn't appear to actually check that the slot is empty before using it again. ...or something along those lines. It certainly seems like you guys have gone from "this is a noted random thing" to "we're pretty sure it has to do with files not being deleted in a timely fashion," so.. that's pretty good progress by itself.

The quirk of BOINC leaving empty folders behind still exists in 7.4.42.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group today!
ID: 1674798 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14431
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1674813 - Posted: 7 May 2015, 19:55:40 UTC - in response to Message 1674798.  

I looked at the modified date on the slot folders and the files inside, and they were from weeks prior.

The quirk of BOINC leaving empty folders behind still exists in 7.4.42.

BOINC does trim the slot list by removing empty and unused folders when the client restarts - I've just watched one of mine drop the folder range from 0..9 to 0..7 as it started up. But I don't know whether it only removes folders after the last one currently in use, or whether it could ever leave 'holes' in the folder list by removing empty ones in the middle.

I'd guess non-empty folders wouldn't get deleted, whether or not an active task was marked as using them (though I haven't checked the code). According to the logic I posted yesterday, they should be emptied (files deleted) if BOINC ever finds it needs to re-use the slot, and if that works - and the subsequent clean when the task finishes works too - that should release the slot for purging at the next restart.
ID: 1674813 · Report as offensive
Profile HAL9000
Volunteer tester
Avatar

Send message
Joined: 11 Sep 99
Posts: 6533
Credit: 196,805,888
RAC: 57
United States
Message 1674831 - Posted: 7 May 2015, 20:22:25 UTC - in response to Message 1674813.  

I looked at the modified date on the slot folders and the files inside, and they were from weeks prior.

The quirk of BOINC leaving empty folders behind still exists in 7.4.42.

BOINC does trim the slot list by removing empty and unused folders when the client restarts - I've just watched one of mine drop the folder range from 0..9 to 0..7 as it started up. But I don't know whether it only removes folders after the last one currently in use, or whether it could ever leave 'holes' in the folder list by removing empty ones in the middle.

I'd guess non-empty folders wouldn't get deleted, whether or not an active task was marked as using them (though I haven't checked the code). According to the logic I posted yesterday, they should be emptied (files deleted) if BOINC ever finds it needs to re-use the slot, and if that works - and the subsequent clean when the task finishes works too - that should release the slot for purging at the next restart.

I did forget that the current versions do clean up while starting. The next time I see stray empty slot folders I may try moving data from one of the low numbered ones to the highest to see if it will delete middle number folders. I will probably need to check is there are references in client_state.xml that need to be modified so that BOINC doesn't trash a task by thinking it has gone missing form the original folder.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group today!
ID: 1674831 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14431
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1674852 - Posted: 7 May 2015, 20:48:42 UTC - in response to Message 1674831.  

I will probably need to check is there are references in client_state.xml that need to be modified so that BOINC doesn't trash a task by thinking it has gone missing form the original folder.

Yes there are, but I think only in the 'active task list' at the very bottom of client_state.xml

I haven't looked to see if there are any if the slot files themselves - don't blame me if it goes wrong.
ID: 1674852 · Report as offensive
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1674856 - Posted: 7 May 2015, 21:06:11 UTC - in response to Message 1674852.  
Last modified: 7 May 2015, 21:07:08 UTC

The easiest way is to suspend a few running tasks, then resume the earlier tasks and let them complete. Then resume all tasks and stop/start BOINC. In my case(s) the empty middle folders with the completed tasks will be removed and the higher slots with pending tasks will be left as there were.

BTW, I'm pretty sure a few of us in the old thread reported finding 'empty' slot folders that still contained a stderr.txt file. Everything else had been deleted except the stderr.txt. So, I'm not seeing much progress in this thread. At least Raistmer seems have 'fixed' the related problems with the APs that were crashing after completion occurring back then. I haven't seen any 'crashed after finish' APs in a while.
ID: 1674856 · Report as offensive
Profile HAL9000
Volunteer tester
Avatar

Send message
Joined: 11 Sep 99
Posts: 6533
Credit: 196,805,888
RAC: 57
United States
Message 1674883 - Posted: 7 May 2015, 22:10:07 UTC - in response to Message 1674852.  

I will probably need to check is there are references in client_state.xml that need to be modified so that BOINC doesn't trash a task by thinking it has gone missing form the original folder.

Yes there are, but I think only in the 'active task list' at the very bottom of client_state.xml

I haven't looked to see if there are any if the slot files themselves - don't blame me if it goes wrong.

When I got home from work I checked my HTPC & it happened to have slots 0-5 with the slot 3 folder being empty. Normally there should only be 0-4 for 4 CPU & 1 ATI GPU task. Restarting BOINC caused the 3 folder to be removed. Now I wonder if it will continue with 0,1,2,4,5 or add a new 3 slot folder. Rather than force anything I'm just going to see what happens now.
Looking over the data in the slot folders init_data.xml also contains the tasks slot number. So that would need to be changed if moving a task to a new slot I expect.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group today!
ID: 1674883 · Report as offensive
Cosmic_Ocean
Avatar

Send message
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1674946 - Posted: 7 May 2015, 23:44:25 UTC - in response to Message 1674813.  

I looked at the modified date on the slot folders and the files inside, and they were from weeks prior.

The quirk of BOINC leaving empty folders behind still exists in 7.4.42.

BOINC does trim the slot list by removing empty and unused folders when the client restarts...

That makes sense.. I very rarely restart BOINC on purpose, and only reboot my main cruncher every 3-6 weeks--usually because Windows Update demands a reboot. I never did look and see if unused slot folders go away after a reboot, I just noticed them still being there long after all tasks had finished.

I have also seen (farther into the past, not anytime recently) cases where I still had the WU file or two for MB or AP several months to a year+ after that WU was crunched and returned and the machine had been rebooted dozens of times. Yet if I leave an old build for one of the science apps in there and app_info doesn't reference it.. there's red-text in the message log and the old EXE gets deleted, so that must mean that those old WUs are still in client_state for whatever reason. *shrug*
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1674946 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14431
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1675291 - Posted: 8 May 2015, 19:10:36 UTC

I've now observed the exact problem that started me on the current bug search, live in real time on my own machine.

http://boincai05.cern.ch/CMS-dev/forum_thread.php?id=34&postid=337

It appears that files over 4 GB in size are invisible to, or skipped by, BOINC's slot cleaning routines. It's a real file, and it persists and can be copied - BOINC knows about its size if it tries to run another task, but doesn't even try to delete it.

That probably doesn't help you all with stderr.txt, but it's probably easier for the BOINC devs to deal with than some of the other theories.
ID: 1675291 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1675333 - Posted: 8 May 2015, 20:31:59 UTC - in response to Message 1675291.  

It's going to be interesting to look at the file enumeration part, the functions used and MSDN references.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1675333 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14431
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1675350 - Posted: 8 May 2015, 21:31:47 UTC - in response to Message 1675333.  

It's going to be interesting to look at the file enumeration part, the functions used and MSDN references.

dir_scan() in sandbox.cpp, around line 300.

But David beat me to it:

fa3f6be5128071fe7e15563e727b5478c45a63b8
ID: 1675350 · Report as offensive
Previous · 1 · 2 · 3 · Next

Message boards : Number crunching : What causes a Blank stderr?


 
©2022 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.