What causes a Blank stderr?

Brent Norman
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1673994 - Posted: 5 May 2015, 0:08:58 UTC
Last modified: 5 May 2015, 0:11:00 UTC

<core_client_version>7.4.42</core_client_version>
<![CDATA[
<stderr_txt>

</stderr_txt>
]]>


I have 3 of them, but with normal run times. Confused.

EDIT: cuda50 tasks

Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1673996 - Posted: 5 May 2015, 0:46:43 UTC - in response to Message 1673994.  

No application developer has ever chimed in with why we see this. Believe me, I've asked.

http://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=3662

It is especially prevalent on MilkyWay@Home 1.36 tasks. Best guess is that the system is too busy to devote enough resources to finish the result transaction. The tasks finish with a good [0] exit status but just don't have any results in the file. Someone suggested that the slot holding the result gets occupied with new work before the finished file is completely written out. I once got a project developer to say they would look into the problem, but I never received any response from them or a new fixed application. I only end up with about 3% errors over at MW, and you are not polluting the science database with bad results. They are just invalid, and you just wasted time and energy processing them for naught. At least the problem only rarely shows up here at SETI@home. The problem may lie with the underlying BOINC platform code or it could just be a problem with the project application. I would sure like to know the answer to your question.

Cheers, Keith
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)

Tom*
Joined: 12 Aug 11
Posts: 127
Credit: 20,769,223
RAC: 9
United States
Message 1674010 - Posted: 5 May 2015, 2:47:51 UTC

http://setiathome.berkeley.edu/forum_thread.php?id=76946&postid=1655230#1655230

No application developer has ever chimed in with why we see this. Believe me, I've asked.


I think Jason in the above post has a good handle on the stderr truncated or missing, boincapi issue.

although he may have given up on fighting city hall, I hope he gets it fixed
as the validate errors in milkyway are too large to ignore.

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1674014 - Posted: 5 May 2015, 2:57:24 UTC - in response to Message 1674010.  

http://setiathome.berkeley.edu/forum_thread.php?id=76946&postid=1655230#1655230

No application developer has ever chimed in with why we see this. Believe me, I've asked.


I think Jason in the above post has a good handle on the stderr truncated or missing, boincapi issue.

although he may have given up on fighting city hall, I hope he gets it fixed
as the validate errors in milkyway are too large to ignore.

The Stderr is not actually the science result. So if it is missing or damaged, it doesn't really affect anything for the project.

The thread "Strange Invalid MB Overflow tasks with truncated Stderr outputs..." may also have some information, but from what I recall it is a persistent BOINC issue.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1674027 - Posted: 5 May 2015, 3:56:58 UTC - in response to Message 1674010.  
Last modified: 5 May 2015, 4:00:37 UTC

http://setiathome.berkeley.edu/forum_thread.php?id=76946&postid=1655230#1655230

No application developer has ever chimed in with why we see this. Believe me, I've asked.


I think Jason in the above post has a good handle on the stderr truncated or missing, boincapi issue.

although he may have given up on fighting city hall, I hope he gets it fixed
as the validate errors in milkyway are too large to ignore.


Pretty succinct summary, though 'given up fighting city hall' isn't quite the current status ;) Being an Australian born of half Dutch, half Scottish parentage, throwing in the towel isn't an option when there's a decent challenge on.

Recent status has been probing and eliminating possible 'paths of least resistance', to check that they would cost more in sanity and other resources than simply building, offering for wider use, and maintaining replacement resources (such as boincapi, client, generalised application component, server, and utility support resources).

For the application side, while Win10/Cuda7 and other changes are happening, the picture has been a little muddy, though it is starting to coalesce into a sensible medium to long term roadmap of sorts (which I'll post in due course). In addition, I've been migrating XBranch to the Gradle build system, such that it will eventually have a consistent cross-platform development infrastructure in place.

Once able to present a clear picture there, I'll be able to begin marshalling existing resources and look for more support (e.g. if Milkyway would like a hand with that).

On the Cuda side I've already received a great deal of great work from Petri33, which needs some integration and generalisation for wider testing and distribution.

The usual state of affairs with something as large as all of the above combined is that at some point it starts to gain a life of its own, attract more resources & people, and get better and better. In that sense, coalescing the bigger vision of where things should be is making more sense right now than trying to encourage developers to actually try using their own work, and do a little reading. [As opposed to circling the wagons every time someone points out user expectations aren't being met.]
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1674101 - Posted: 5 May 2015, 7:48:19 UTC - in response to Message 1674014.  

... I hope he gets it fixed as the validate errors in milkyway are too large to ignore.

The Stderr is not actually the science result. So if it is missing or damaged, it doesn't really affect anything for the project.

At Milkyway, the science result is reported back via stderr: they don't use a separate upload file. So Keith's point is valid - an illustration of the dangers of over-generalising from experience of just one project.

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1674115 - Posted: 5 May 2015, 10:53:45 UTC - in response to Message 1674101.  

... I hope he gets it fixed as the validate errors in milkyway are too large to ignore.

The Stderr is not actually the science result. So if it is missing or damaged, it doesn't really affect anything for the project.

At Milkyway, the science result is reported back via stderr: they don't use a separate upload file. So Keith's point is valid - an illustration of the dangers of over-generalising from experience of just one project.


I'd have to agree with that from a developer perspective also. It can be remarkably difficult to debug applications that apparently work normally most of the time, but then when manifesting issues present no debug log, which after all is what stderr is really meant for.

Creating applications is, is of, and is for the science too. The instruments go hand in hand with the research, as much as a chemist might use a mass spectrometer (or some such example). So why dismiss a broken tool as unimportant? Never saw the logic completely.

Best tools for the best job. Nothing worse than repairing surface mount electronic circuitry with a good old fencepost soldering iron. Can be done, but just looks sloppy.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.

Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1674162 - Posted: 5 May 2015, 15:37:35 UTC - in response to Message 1674010.  
Last modified: 5 May 2015, 15:38:57 UTC

http://setiathome.berkeley.edu/forum_thread.php?id=76946&postid=1655230#1655230

No application developer has ever chimed in with why we see this. Believe me, I've asked.


I think Jason in the above post has a good handle on the stderr truncated or missing, boincapi issue.

although he may have given up on fighting city hall, I hope he gets it fixed
as the validate errors in milkyway are too large to ignore.



Yes, that is probably who I was remembering came up with the two tasks in same slot scenario. My comment about no developer getting back to me after saying they would look into it was directed at the MW project scientist and developer Matthew over in the MW thread I referenced. I would hope that as Jason writes, maybe the problem gains enough recognition across all projects that the BOINC developers finally develop a plan to fix the underlying weaknesses in the BOINC code.

Cheers, Keith
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1674165 - Posted: 5 May 2015, 15:47:46 UTC - in response to Message 1674162.  

Yes, that is probably who I was remembering came up with the two tasks in same slot scenario. My comment about no developer getting back to me after saying they would look into it was directed at the MW project scientist and developer Matthew over in the MW thread I referenced. I would hope that as Jason writes, maybe the problem gains enough recognition across all projects that the BOINC developers finally develop a plan to fix the underlying weaknesses in the BOINC code.

Cheers, Keith

"Two tasks in the same slot" is a problem which is under active - very active - investigation right at the moment. It appears that under some circumstances (yet to be confirmed), the 'slot cleansing' between tasks may fail. Stay tuned.

Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1674169 - Posted: 5 May 2015, 16:00:13 UTC - in response to Message 1674165.  


"Two tasks in the same slot" is a problem which is under active - very active - investigation right at the moment. It appears that under some circumstances (yet to be confirmed), the 'slot cleansing' between tasks may fail. Stay tuned.


It makes sense in that I see the problem mainly over at MW and more to the point with the Modified Separation Fit 1.36 tasks which run for very short times ... under a minute normally on my hardware. That little snippet of code I saw in this thread with a timeout of 15 seconds for cleanup looks mighty suspicious.

Cheers, Keith
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1674170 - Posted: 5 May 2015, 16:02:06 UTC - in response to Message 1674165.  
Last modified: 5 May 2015, 16:08:59 UTC

"Two tasks in the same slot" is a problem which is under active - very active - investigation right at the moment. It appears that under some circumstances (yet to be confirmed), the 'slot cleansing' between tasks may fail. Stay tuned.


Interesting, as I haven't yet made my way to look at the slot management, in particular what they're using as mutexes (if any, apart from the boinc lockfile). Lack of 'proper' locks (OS and runtime library dependent) would likely see those kind of symptoms happen in weird ways that are difficult to reproduce consistently. Pretty similar to the api's abuse of threaded IO runtimes in concept, so I wouldn't be shocked if it's just patched-up old code that works fine under old [single threaded] POSIX C libraries, and not much else [by way of discipline when using raw threads and IO].
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1674188 - Posted: 5 May 2015, 16:26:35 UTC - in response to Message 1674170.  

"Two tasks in the same slot" is a problem which is under active - very active - investigation right at the moment. It appears that under some circumstances (yet to be confirmed), the 'slot cleansing' between tasks may fail. Stay tuned.

Interesting, as I haven't yet made my way to look at the slot management, in particular what they're using as mutexes (if any, apart from the boinc lockfile). Lack of 'proper' locks (OS and runtime library dependent) would likely see those kind of symptoms happen in weird ways that are difficult to reproduce consistently. Pretty similar to the api's abuse of threaded IO runtimes in concept, so I wouldn't be shocked if it's just patched-up old code that works fine under old [single threaded] POSIX C libraries, and not much else [by way of discipline when using raw threads and IO].

Typically, the test task which finished and reported since my last post didn't suffer from the problems others have reported. The problem - being discussed at Milkyway, Einstein and World Community Grid - is that another project's VBox test application creates a 5.7 GB virtual machine image file, and BOINC sometimes fails to cleanse it. If an 'ordinary' project tries to run a task in the slot containing the leftovers, the 5.7 GB is counted against the disk usage of the new task and the task errors out with 'disk limit exceeded'.

Since this is a case where one project's app is being blamed for another project's errors, the developers are taking it seriously, and have improved the debug logging and produced a new test version of BOINC with a possible fix since I first reported it on Sunday morning.
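
To make the failure mode described above concrete, here is a minimal Python sketch. It assumes disk usage is measured by summing the sizes of everything found in the slot directory; the real BOINC client is C++ and may account for usage differently. `slot_disk_usage` and `exceeds_disk_limit` are hypothetical names, and the file sizes are scaled down for illustration.

```python
import os
import tempfile


def slot_disk_usage(slot_dir):
    """Total size in bytes of every file under slot_dir (hypothetical helper)."""
    total = 0
    for root, _dirs, files in os.walk(slot_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def exceeds_disk_limit(slot_dir, limit_bytes):
    """True if the slot's contents, leftovers included, exceed the limit."""
    return slot_disk_usage(slot_dir) > limit_bytes


# A slot that was not cleansed: the previous task's large file is still there.
slot = tempfile.mkdtemp(prefix="slot_")
with open(os.path.join(slot, "leftover_vm_image"), "wb") as f:
    f.write(b"\0" * 4096)  # stands in for the 5.7 GB image
with open(os.path.join(slot, "new_task_output"), "wb") as f:
    f.write(b"\0" * 100)   # the new task's own, small, output

# The new task is charged for both files, so a modest limit is exceeded.
print(exceeds_disk_limit(slot, 1024))
```

Under that assumption the new task fails through no fault of its own, which matches the symptom of one project's app being blamed for another project's leftovers.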

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1674207 - Posted: 5 May 2015, 21:31:23 UTC - in response to Message 1674101.  

... I hope he gets it fixed as the validate errors in milkyway are too large to ignore.

The Stderr is not actually the science result. So if it is missing or damaged, it doesn't really affect anything for the project.

At Milkyway, the science result is reported back via stderr: they don't use a separate upload file. So Keith's point is valid - an illustration of the dangers of over-generalising from experience of just one project.

Ah yes, that would tend to be a much larger issue in that case. Is the stderr intended to be the default method of reporting results for BOINC? Or is Milkyway just an exception to the rule?
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1674221 - Posted: 5 May 2015, 22:05:18 UTC - in response to Message 1674207.  

... I hope he gets it fixed as the validate errors in milkyway are too large to ignore.

The Stderr is not actually the science result. So if it is missing or damaged, it doesn't really affect anything for the project.

At Milkyway, the science result is reported back via stderr: they don't use a separate upload file. So Keith's point is valid - an illustration of the dangers of over-generalising from experience of just one project.

Ah yes, that would tend to be a much larger issue in that case. Is the stderr intended to be the default method of reporting results for BOINC? Or is Milkyway just an exception to the rule?

I'd call Milkyway an exception to most rules.

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1674363 - Posted: 6 May 2015, 12:50:15 UTC - in response to Message 1674188.  

"Two tasks in the same slot" is a problem which is under active - very active - investigation right at the moment. It appears that under some circumstances (yet to be confirmed), the 'slot cleansing' between tasks may fail. Stay tuned.

Interesting, as I haven't yet made my way to look at the slot management, in particular what they're using as mutexes (if any, apart from the boinc lockfile). Lack of 'proper' locks (OS and runtime library dependent) would likely see those kind of symptoms happen in weird ways that are difficult to reproduce consistently. Pretty similar to the api's abuse of threaded IO runtimes in concept, so I wouldn't be shocked if it's just patched-up old code that works fine under old [single threaded] POSIX C libraries, and not much else [by way of discipline when using raw threads and IO].

Typically, the test task which finished and reported since my last post didn't suffer from the problems others have reported. The problem - being discussed at Milkyway, Einstein and World Community Grid - is that another project's VBox test application creates a 5.7 GB virtual machine image file, and BOINC sometimes fails to cleanse it. If an 'ordinary' project tries to run a task in the slot containing the leftovers, the 5.7 GB is counted against the disk usage of the new task and the task errors out with 'disk limit exceeded'.

Since this is a case where one project's app is being blamed for another project's errors, the developers are taking it seriously, and have improved the debug logging and produced a new test version of BOINC with a possible fix since I first reported it on Sunday morning.

Well, we haven't caught our 'exceeded disk limit' yet, but the search with <slot_debug> has turned this up:

failed to remove file slots/0/stderr.txt: unlink() failed

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1674370 - Posted: 6 May 2015, 14:01:26 UTC - in response to Message 1674363.  
Last modified: 6 May 2015, 14:05:12 UTC

Well, we haven't caught our 'exceeded disk limit' yet, but the search with <slot_debug> has turned this up:

failed to remove file slots/0/stderr.txt: unlink() failed


Heh. Interestingly, the underlying Windows API call beneath the unlink() logic (on Windows), DeleteFile(), is a non-blocking call:

https://msdn.microsoft.com/en-us/library/windows/desktop/aa363915%28v=vs.85%29.aspx
...The DeleteFile function fails if an application attempts to delete a file that has other handles open for normal I/O or as a memory-mapped file (FILE_SHARE_DELETE must have been specified when other handles were opened).
The DeleteFile function marks a file for deletion on close. Therefore, the file deletion does not occur until the last handle to the file is closed. Subsequent calls to CreateFile to open the file fail with ERROR_ACCESS_DENIED.


At least in the present example, it points to open handles, likely via the Windows memory-mapped file implementation used by the BOINC MFILE or MIOFILE (whichever) structures, i.e. the application is still shutting down, or was forced closed with TerminateProcess()... [Yeah that old chestnut].

With the code present in sandbox.cpp, the failure rate would be proportional to the deleting-client to finishing-app process priority ratio (the app is usually idle to below normal), the total of the file sizes being deleted, system contention, to some limited extent filesystem performance itself, desktop optimisations by Windows version and C runtime used, and maybe some caching policies at various levels (hardware, OS and driver).

A best-practice solution would likely involve doing something like this before allowing the slot to be used again:
- set a mutex (of any suitable type) indicating the start of a bulk deletion transaction,
- delete,
- check,
  - retry on failures; for slowness, just accept that the IO will get around to it and allow very generous timeouts (if any),
- only release the mutex once complete.
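
The steps above can be sketched roughly as follows. This is a minimal Python illustration only: the BOINC client is C++, and `clear_slot`, `slot_lock`, `retries` and `delay` are hypothetical names, not anything from its source. The point is that the lock is held until the directory has been verified empty, rather than released as soon as the delete calls return.

```python
import os
import shutil
import tempfile
import threading
import time

# Hypothetical lock guarding one slot directory; not a BOINC API.
slot_lock = threading.Lock()


def clear_slot(slot_dir, lock, retries=5, delay=0.1):
    """Delete everything in slot_dir, verify, and retry on failure.

    Returns True only once the directory is confirmed empty. The lock is
    held for the whole transaction so nothing can claim the slot early.
    """
    with lock:
        for _ in range(retries):
            for name in os.listdir(slot_dir):
                path = os.path.join(slot_dir, name)
                try:
                    if os.path.isdir(path) and not os.path.islink(path):
                        shutil.rmtree(path)
                    else:
                        os.unlink(path)
                except OSError:
                    # For example, another process still has the file open.
                    pass
            # Check: trust the directory listing, not the return codes.
            if not os.listdir(slot_dir):
                return True
            time.sleep(delay)
        return False


# Usage: a slot containing a leftover stderr.txt from a previous task.
slot = tempfile.mkdtemp(prefix="slot_")
with open(os.path.join(slot, "stderr.txt"), "w") as f:
    f.write("leftover output\n")
print(clear_slot(slot, slot_lock))
```

A `False` return means the slot should stay out of circulation; how long to keep retrying is a policy choice this sketch does not try to settle.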

The above, for large files etc., might take too long (many seconds to minutes) for the BOINC client in its current architecture, because the file deletions seem to be in the main processing loop (thread). As the contention will be highest at task completion for all sorts of reasons, what would be better is either some dedicated garbage collection thread that runs independently, allowing other normal client processing to continue (in other slots) while making sure nothing will try to use the slot with a deletion in progress, OR keeping transactions much smaller.
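
One possible shape for the dedicated garbage collection thread, again as a hedged Python sketch rather than anything taken from the BOINC source: `SlotCleaner` and its methods are hypothetical. Slots are marked busy when cleanup is scheduled and only become free once the worker has confirmed the directory is empty, so the main loop never blocks on file deletion.

```python
import os
import queue
import shutil
import tempfile
import threading


def _empty_dir(slot_dir):
    """Best-effort delete of a directory's contents; True if now empty."""
    try:
        for name in os.listdir(slot_dir):
            path = os.path.join(slot_dir, name)
            if os.path.isdir(path) and not os.path.islink(path):
                shutil.rmtree(path, ignore_errors=True)
            else:
                try:
                    os.unlink(path)
                except OSError:
                    pass
        return not os.listdir(slot_dir)
    except OSError:
        return False


class SlotCleaner:
    """Hypothetical background cleaner; none of these names are BOINC's."""

    def __init__(self):
        self._pending = queue.Queue()
        self._busy = set()  # slots with a deletion still outstanding
        self._guard = threading.Lock()
        threading.Thread(target=self._run, daemon=True).start()

    def schedule(self, slot_dir):
        """Mark the slot busy and hand it to the cleaner thread."""
        with self._guard:
            self._busy.add(slot_dir)
        self._pending.put(slot_dir)

    def is_free(self, slot_dir):
        """A slot may be reused only when no cleanup is outstanding."""
        with self._guard:
            return slot_dir not in self._busy

    def wait(self):
        """Block until every scheduled cleanup has been attempted."""
        self._pending.join()

    def _run(self):
        while True:
            slot_dir = self._pending.get()
            try:
                # On failure the slot simply stays busy; a fuller version
                # would requeue it with a delay instead of giving up.
                if _empty_dir(slot_dir):
                    with self._guard:
                        self._busy.discard(slot_dir)
            finally:
                self._pending.task_done()
```

The main loop would call `schedule()` at task exit and consult `is_free()` before assigning a slot, leaving other slots unaffected while a large deletion is in flight.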
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1674373 - Posted: 6 May 2015, 14:23:56 UTC - in response to Message 1674370.  

In theory, according to David's initial response, the current logic specifies:

1) Delete everything in the slot folder on task exit (this failed in the example)
2) Delete everything - i.e. anything remaining - in the slot before reuse
3) Don't reuse the slot if (2) fails

The error we were originally investigating implies that step (3) failed. That may have been because step (2) failed without returning an error - David has already attempted to close that loophole in the private drop v7.5.1. (In the case reported at CMS-dev, the poster has returned to say that the file was deleted at step (2), so the safety at least partially works - though he did a service restart, which will have provided extra time for the locks to clear.)

I've checked my logs since seeing the unlink() failed report - no occurrences on either machine. But I'll keep an eye open.

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1674374 - Posted: 6 May 2015, 14:29:51 UTC - in response to Message 1674373.  
Last modified: 6 May 2015, 14:34:24 UTC

Makes sense, but David's 3-step process there implies the use of mutexes (locks, atomic transactions etc.), which aren't in the code. #1 and #2 can succeed but the file not be physically deleted yet. That's [asynchronous] buffered IO, and doesn't occur in sequence (unless you make it so, [with additional logic]).
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1674376 - Posted: 6 May 2015, 14:52:27 UTC - in response to Message 1674373.  
Last modified: 6 May 2015, 14:53:19 UTC

3) Don't reuse the slot if (2) fails


Well, there's probably the 'original' issue. The logic should instead be something like:
'3) Don't reuse the slot unless everything's deleted and it's not allocated etc.'

This is because #1 and #2 can succeed on one core/thread, and not be seen on another core/thread until later (race condition). #1 & #2 succeeding doesn't mean they're complete.
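
A minimal Python sketch of that revised rule, with hypothetical names throughout (`slot_is_reusable` and the `allocated` set are not from the BOINC source): the decision is based on the slot's observed state at the moment of reuse, not on whether earlier delete calls reported success.

```python
import os
import tempfile


def slot_is_reusable(slot_dir, allocated):
    """Reuse a slot only if it is unallocated AND verifiably empty.

    `allocated` is a hypothetical set of slot paths currently assigned to
    running tasks. The emptiness test reads the directory itself rather
    than trusting that earlier delete calls reported success.
    """
    if slot_dir in allocated:
        return False
    try:
        return not os.listdir(slot_dir)
    except OSError:
        # Unreadable or missing directory: treat as not safe to reuse.
        return False


# Usage: an empty slot is reusable, one with leftovers is not.
slot = tempfile.mkdtemp(prefix="slot_")
print(slot_is_reusable(slot, allocated=set()))
with open(os.path.join(slot, "stderr.txt"), "w") as f:
    f.write("leftover\n")
print(slot_is_reusable(slot, allocated=set()))
```

This check on its own does not remove the race; it would still need to run under whatever lock protects slot allocation.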
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1674379 - Posted: 6 May 2015, 15:03:50 UTC - in response to Message 1674374.  

Makes sense, but David's 3-step process there implies the use of mutexes (locks, atomic transactions etc.), which aren't in the code. #1 and #2 can succeed but the file not be physically deleted yet. That's [asynchronous] buffered IO, and doesn't occur in sequence (unless you make it so, [with additional logic]).

To me it seems the addition of a "Check folder is actually empty before reuse" would solve the issue.

Perhaps adding a layer with the project name to the \slots\ folder might help resolve some issues as well. Something like what the \projects\ folder has.
\slots\setiathome.berkeley.edu\0
\slots\setiathome.berkeley.edu\1
\slots\milkyway.cs.rpi.edu_milkyway\0
\slots\milkyway.cs.rpi.edu_milkyway\1
However, I imagine that could break all the current science apps if they use a simple "I'm in slot 0" type of communication to the core client & there were two slot 0 folders.
Of course the same number scheme could be retained.
\slots\setiathome.berkeley.edu\0
\slots\setiathome.berkeley.edu\1
\slots\milkyway.cs.rpi.edu_milkyway\2
\slots\milkyway.cs.rpi.edu_milkyway\3
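
The second layout, with a project layer but globally unique slot numbers, could be allocated along these lines. This is a hypothetical Python sketch: `allocate_slot`, `in_use` and `slots_root` are illustrative names, and nothing here reflects how the BOINC client actually assigns slots.

```python
import os


def allocate_slot(project_dir_name, in_use, slots_root="slots"):
    """Return (number, path) for the lowest slot number not in use.

    Numbers stay unique across all projects, so an app that only knows
    "I'm in slot N" is still unambiguous; the project name is just an
    extra directory level. All names here are hypothetical.
    """
    n = 0
    while n in in_use:
        n += 1
    in_use.add(n)
    return n, os.path.join(slots_root, project_dir_name, str(n))


# Usage: two SETI@home tasks, then a Milkyway task.
in_use = set()
print(allocate_slot("setiathome.berkeley.edu", in_use))
print(allocate_slot("setiathome.berkeley.edu", in_use))
print(allocate_slot("milkyway.cs.rpi.edu_milkyway", in_use))
```

Because the number alone still identifies the slot, leftovers from one project would at least be confined to that project's own subfolder.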
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.