What causes a Blank stderr?
Author | Message |
---|---|
Brent Norman Joined: 1 Dec 99 Posts: 2786 Credit: 685,657,289 RAC: 835 |
<core_client_version>7.4.42</core_client_version> I have 3 of them, but with normal run times. Confused. EDIT: cuda50 tasks |
Keith Myers Joined: 29 Apr 01 Posts: 13161 Credit: 1,160,866,277 RAC: 1,873 |
No application developer has ever chimed in with why we see this. Believe me, I've asked. http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=3662 It is especially prevalent on MilkyWay@Home 1.36 tasks. Best guess is that the system is too busy to devote enough resources to finish the result transaction. The tasks finish with a good [0] exit status but just don't have any results in the file. Someone suggested that the slot holding the result gets occupied with new work before the finished file is completely written out. I once got a project developer to say they would look into the problem, but I have never received any response from them or a new fixed application. I only end up with about 3% errors over at MW, and you are not polluting the science database with bad results. They are just invalid, and you just wasted time and energy processing them for naught. At least the problem only rarely shows up here at SETI@HOME. The problem may lie with the underlying BOINC platform code or it could just be a problem with the project application. I would sure like to know the answer to your question. Cheers, Keith Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Tom* Joined: 12 Aug 11 Posts: 127 Credit: 20,769,223 RAC: 9 |
http://setiathome.berkeley.edu/forum_thread.php?id=76946&postid=1655230#1655230 "No application developer has ever chimed in with why we see this. Believe me, I've asked." I think Jason, in the post linked above, has a good handle on the truncated or missing stderr (boincapi) issue. Although he may have given up on fighting city hall, I hope he gets it fixed, as the validate errors in Milkyway are too large to ignore. |
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
http://setiathome.berkeley.edu/forum_thread.php?id=76946&postid=1655230#1655230 The stderr is not actually the science result. So if it is missing or damaged, it doesn't really affect anything for the project. The thread "Strange Invalid MB Overflow tasks with truncated Stderr outputs..." may also have some information, but from what I recall it is a persistent BOINC issue. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
http://setiathome.berkeley.edu/forum_thread.php?id=76946&postid=1655230#1655230 Pretty succinct summary, though 'given up fighting city hall' isn't quite the current status ;) Being an Australian born of half Dutch, half Scottish parentage, throwing in the towel isn't an option when there's a decent challenge on. Recent status has been probing and eliminating possible 'paths of least resistance', to check that they would cost more in sanity and other resources than simply building, offering for wider use, and maintaining replacement resources (such as boincapi, client, generalised application component, server, and utility support resources). For the application side, while Win10/Cuda7 and other changes are happening, the picture has been a little muddy, though it is starting to coalesce into a sensible medium to long term roadmap of sorts (which I'll post in due course). In addition, I've been migrating XBranch to the Gradle build system, such that it will eventually have a consistent cross platform development infrastructure in place. Once able to present a clear picture there, I'll be able to begin marshalling existing resources and look for more support (e.g. if Milkyway would like a hand with that). On the Cuda side I've already received a great deal of great work from Petri33, which needs some integration and generalisation for wider testing and distribution. The usual state of affairs with something as large as the above all combined is that at some point it starts to gain a life of its own, attract more resources & people, and get better and better. In that sense, coalescing the bigger vision of where things should be is making more sense right now than trying to encourage developers to actually try using their own work, and do a little reading. [As opposed to circling the wagons every time someone points out user expectations aren't being met.] "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14649 Credit: 200,643,578 RAC: 874 |
... I hope he gets it fixed as the validate errors in milkyway are too large to ignore. At Milkyway, the science result is reported back via stderr: they don't use a separate upload file. So Keith's point is valid - an illustration of the dangers of over-generalising from experience of just one project. |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
... I hope he gets it fixed as the validate errors in milkyway are too large to ignore. I'd have to agree with that from a developer perspective also. It can be remarkably difficult to debug applications that apparently work normally most of the time but then, when manifesting issues, present no debug log, which after all is what stderr is really meant for. Creating applications is, is of, and is for the science too. The instruments go hand in hand with the research, as much as a chemist might use a mass spectrometer (or some such example). So why dismiss a broken tool as unimportant? Never saw the logic completely. Best tools for the best job. Nothing worse than repairing surface mount electronic circuitry with a good old fencepost soldering iron. Can be done, but just looks sloppy. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Keith Myers Joined: 29 Apr 01 Posts: 13161 Credit: 1,160,866,277 RAC: 1,873 |
http://setiathome.berkeley.edu/forum_thread.php?id=76946&postid=1655230#1655230 Yes, that is probably who I was remembering as having come up with the two-tasks-in-the-same-slot scenario. My comment about no developer getting back to me after saying they would look into it was directed at the MW project scientist and developer Matthew, over in the MW thread I referenced. I would hope that, as Jason writes, the problem gains enough recognition across all projects that the BOINC developers finally develop a plan to fix the underlying weaknesses in the BOINC code. Cheers, Keith Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14649 Credit: 200,643,578 RAC: 874 |
Yes, that is probably who I was remembering came up with the two tasks in same slot scenario. My comment about no developer getting back to me after saying they would look into it was directed at the MW project scientist and developer Matthew over in the MW thread I referenced. I would hope that as Jason writes, maybe the problem gains enough recognition across all projects that the BOINC developers finally develop a plan to fix the underlying weaknesses in the BOINC code. "Two tasks in the same slot" is a problem which is under active - very active - investigation right at the moment. It appears that under some circumstances (yet to be confirmed), the 'slot cleansing' between tasks may fail. Stay tuned. |
Keith Myers Joined: 29 Apr 01 Posts: 13161 Credit: 1,160,866,277 RAC: 1,873 |
It makes sense in that I see the problem mainly over at MW and more to the point with the Modified Separation Fit 1.36 tasks which run for very short times ... under a minute normally on my hardware. That little snippet of code I saw in this thread with a timeout of 15 seconds for cleanup looks mighty suspicious. Cheers, Keith Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"Two tasks in the same slot" is a problem which is under active - very active - investigation right at the moment. It appears that under some circumstances (yet to be confirmed), the 'slot cleansing' between tasks may fail. Stay tuned. Interesting, as I haven't yet made my way to look at the slot management, in particular what they're using as mutexes (if any apart from the boinc lockfile). lack of 'proper' locks (OS and runtime library dependant) would likely see those kindof symptoms happen in weird and difficult to consistently reproduce ways. Pretty similar to the api's abuse of threaded IO runtimes in concept, so wouldn't be shocked if it's just patched up old code fine under old [single threaded] POSIX C libraries, and not much else [by way of discipline when using raw threads and IO]. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14649 Credit: 200,643,578 RAC: 874 |
"Two tasks in the same slot" is a problem which is under active - very active - investigation right at the moment. It appears that under some circumstances (yet to be confirmed), the 'slot cleansing' between tasks may fail. Stay tuned. Typically, the test task which finished and reported since my last post didn't suffer from the problems others have reported. The problem - being discussed at Milkyway, Einstein and World Community Grid - is that another project's VBox test application creates a 5.7 GB virtual machine image file, and BOINC sometimes fails to cleanse it. If an 'ordinary' project tries to run a task in the slot containing the leftovers, the 5.7 GB is counted against the disk usage of the new task and the task errors out with 'disk limit exceeded'. Since this is a case where one project's app is being blamed for another project's errors, the developers are taking it seriously, and have improved the debug logging and produced a new test version of BOINC with a possible fix since I first reported it on Sunday morning. |
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
... I hope he gets it fixed as the validate errors in milkyway are too large to ignore. Ah yes, that would tend to be a much larger issue in that case. Is the stderr intended to be the default method of reporting results for BOINC? Or is Milkyway just an exception to the rule? SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14649 Credit: 200,643,578 RAC: 874 |
... I hope he gets it fixed as the validate errors in milkyway are too large to ignore. I'd call Milkyway an exception to most rules. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14649 Credit: 200,643,578 RAC: 874 |
"Two tasks in the same slot" is a problem which is under active - very active - investigation right at the moment. It appears that under some circumstances (yet to be confirmed), the 'slot cleansing' between tasks may fail. Stay tuned. Well, we haven't caught our 'exceeded disk limit' yet, but the search with <slot_debug> has turned this up: failed to remove file slots/0/stderr.txt: unlink() failed |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Well, we haven't caught our 'exceeded disk limit' yet, but the search with <slot_debug> has turned this up: Heh. Interestingly, on Windows the DeleteFile() API call underlying the unlink() logic is a non-blocking call: https://msdn.microsoft.com/en-us/library/windows/desktop/aa363915%28v=vs.85%29.aspx ...The DeleteFile function fails if an application attempts to delete a file that has other handles open for normal I/O or as a memory-mapped file (FILE_SHARE_DELETE must have been specified when other handles were opened). At least in the present example, it points to open handles, likely via the Windows memory-mapped file implementation used by the BOINC MFILE or MIOFILE (whichever) structures, i.e. the application is still shutting down, or was forced closed with TerminateProcess()... [Yeah, that old chestnut]. With the code present in sandbox.cpp, the failure rate would be proportional to the ratio of the deleting client's process priority to the finishing app's (the app is usually idle to below normal), the total of the file sizes being deleted, system contention, to some limited extent filesystem performance itself, desktop optimisations by Windows version and C runtime used, and maybe some caching policies at various levels (hardware, OS and driver). A best-practices solution would likely involve doing something like this before allowing the slot to be used again: - set a mutex (of any suitable type) indicating the start of a bulk deletion transaction, - delete, - check, -- retry for failures; for slowness, just accept that the IO will get around to it and allow very generous timeouts (if any), - only release the mutex once complete. The above, for large files etc., might take too long (many seconds to minutes) for the BOINC client in its current architecture, because the file deletions seem to be in the main processing loop (thread). 
As the contention will be highest at task completion for all sorts of reasons, what would be better is either some dedicated garbage collection thread that runs independently, allowing the client's other normal processing to continue (in other slots) while making sure nothing will try to use the slot with a deletion in progress, OR keeping transactions much smaller. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
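The lock, delete, check, retry, release sequence suggested in the post above can be sketched roughly as follows. This is only an illustration in Python, not BOINC's code: `clean_slot`, `_lock_for`, the retry count and the delays are all hypothetical names and values, and the `threading.Lock` used here only coordinates threads inside one process (it says nothing about a science app that still holds file handles open).

```python
import os
import shutil
import threading
import time

# Hypothetical registry: one lock per slot directory (not a BOINC structure).
_slot_locks = {}
_registry_guard = threading.Lock()


def _lock_for(slot_dir):
    """Return the single lock object associated with this slot directory."""
    with _registry_guard:
        return _slot_locks.setdefault(os.path.abspath(slot_dir), threading.Lock())


def clean_slot(slot_dir, attempts=5, delay=0.5):
    """Delete everything in slot_dir; return True only if it is verified empty.

    The lock is held for the whole transaction, so another thread that takes
    the same lock before reusing the slot cannot see a half-cleaned directory.
    """
    with _lock_for(slot_dir):
        for attempt in range(attempts):
            for name in os.listdir(slot_dir):
                path = os.path.join(slot_dir, name)
                try:
                    if os.path.isdir(path) and not os.path.islink(path):
                        shutil.rmtree(path)
                    else:
                        os.unlink(path)
                except OSError:
                    # e.g. the file is still open elsewhere; try again later
                    pass
            # Check the actual directory contents instead of trusting the
            # return codes of the delete calls.
            if not os.listdir(slot_dir):
                return True
            time.sleep(delay * (attempt + 1))  # simple linear back-off
        return False
```

Because the function blocks while it retries, it would fit the "dedicated garbage collection thread" idea better than a main loop, for example by running it via `threading.Thread(target=clean_slot, args=(slot_dir,))`.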
Richard Haselgrove Joined: 4 Jul 99 Posts: 14649 Credit: 200,643,578 RAC: 874 |
In theory, according to David's initial response, the current logic specifies: 1) Delete everything in the slot folder on task exit (this failed in the example) 2) Delete everything - i.e. anything remaining - in the slot before reuse 3) Don't reuse the slot if (2) fails. The error we were originally investigating implies that step (3) failed. That may have been because step (2) failed without returning an error - David has already attempted to close that loophole in the private drop v7.5.1 (in the case reported at CMS-dev, the poster has returned to say that the file was deleted at step (2), so the safety at least partially works - though he did a service restart, which will have provided extra time for the locks to clear). I've checked my logs since seeing the unlink() failed report - no occurrences on either machine. But I'll keep an eye open. |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Makes sense, but David's 3 step process there implies the use of mutexes (locks, atomic transactions etc.), which aren't in the code. #1 and #2 can succeed without the file having been physically deleted yet. That's [asynchronous] buffered IO, and it doesn't occur in sequence (unless you make it so [with additional logic]). "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
3) Don't reuse the slot if (2) fails Well, there's probably the 'original' issue. The logic should instead be something like: '3) Don't reuse the slot unless everything's deleted and it's not allocated etc.', because #1 and #2 can succeed on one core/thread and not be seen on another core/thread until later (race condition). #1 & #2 succeeding doesn't mean they're complete. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
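The reworded step 3 ("don't reuse the slot unless everything's deleted and it's not allocated") might look something like the sketch below. It is a single-threaded Python illustration under stated assumptions, not the client's logic: `slot_is_reusable`, `pick_slot` and the `allocated` set are hypothetical, and in a threaded client the check and the allocation would need to happen under one lock to avoid the race described above.

```python
import os


def slot_is_reusable(slot_dir, allocated):
    """True only if the slot is unallocated AND observed to be empty right now.

    The emptiness test looks at the directory itself; it does not rely on an
    earlier delete call having reported success.
    """
    if os.path.abspath(slot_dir) in allocated:
        return False
    try:
        return not os.listdir(slot_dir)
    except FileNotFoundError:
        return True  # directory does not exist, so nothing was left behind


def pick_slot(slots_root, max_slots, allocated):
    """Return the first reusable slot directory and mark it allocated.

    Returns None when every candidate is either in use or still holds
    leftovers from a previous task.
    """
    for n in range(max_slots):
        slot_dir = os.path.abspath(os.path.join(slots_root, str(n)))
        if slot_is_reusable(slot_dir, allocated):
            os.makedirs(slot_dir, exist_ok=True)
            allocated.add(slot_dir)
            return slot_dir
    return None
```

With a leftover stderr.txt still sitting in slot 0, `pick_slot` skips that slot and hands out slot 1 instead of starting a new task on top of the old files.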
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
makes sense, but David's 3 step process there implies the use of mutexes (locks, atomic transactions etc), which aren't in the code. #1 and #2 can succeed but the file not be physically deleted yet. That's [asynchronous] buffered IO, and doesn't occur in sequence (unless you make it so, [with additional logic]) To me it seems the addition of a "Check folder is actually empty before reuse" step would solve the issue. Perhaps adding a layer with the project name to the \slots\ folder might help resolve some issues as well, something like what the \projects\ folder has: \slots\setiathome.berkeley.edu\0 \slots\setiathome.berkeley.edu\1 \slots\milkyway.cs.rpi.edu_milkyway\0 \slots\milkyway.cs.rpi.edu_milkyway\1 However, I imagine that could break all the current science apps if they use a simple "I'm in slot 0" type of communication to the core client and there were two slot 0 folders. Of course, the same number scheme could be retained: \slots\setiathome.berkeley.edu\0 \slots\setiathome.berkeley.edu\1 \slots\milkyway.cs.rpi.edu_milkyway\2 \slots\milkyway.cs.rpi.edu_milkyway\3 SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu |
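The second layout in the post above (a per-project folder, with slot numbers kept unique across projects) can be illustrated with a small Python sketch. `assign_slot_paths` is a hypothetical helper written for this example; nothing here reflects how the BOINC client actually names or allocates slot directories.

```python
from pathlib import PureWindowsPath


def assign_slot_paths(tasks_per_project, slots_root="slots"):
    """Map each project to its slot directories, numbering slots globally.

    tasks_per_project maps a project folder name to how many slots it needs.
    Slot numbers keep counting across projects, so no two folders share a
    number and an "I'm in slot N" message stays unambiguous.
    """
    paths = {}
    next_slot = 0
    for project, count in tasks_per_project.items():
        paths[project] = [
            str(PureWindowsPath(slots_root, project, str(next_slot + i)))
            for i in range(count)
        ]
        next_slot += count
    return paths


if __name__ == "__main__":
    layout = assign_slot_paths(
        {"setiathome.berkeley.edu": 2, "milkyway.cs.rpi.edu_milkyway": 2}
    )
    for project_paths in layout.values():
        for path in project_paths:
            print(path)
```

The trade-off is the one already noted in the post: per-project folders make it obvious whose leftovers are in a slot, while the global numbering avoids breaking any app that identifies its slot by number alone.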
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.