Stderr Truncations

Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1702355 - Posted: 17 Jul 2015, 0:10:29 UTC - in response to Message 1702345.  
Last modified: 17 Jul 2015, 0:11:22 UTC

Part of it (though not the complete story, I'm certain) is that frequent DMA engine transfers over the PCIe bus are typically occurring from device to host, and vice versa. The same goes for OpenCL and DirectCompute.

In some cases these are explicit calls initiated by the application programmer, while in others they are underlying library functionality, or even driver management under the control of the changing WDDM and OS use of the resources, which has a paged virtual memory model (now being unified).

When you terminate and free host buffers underneath an active DMA transfer, you induce arbitrary effects in the driver, ranging from fairly benign ones that are transparently recovered through to a full reset of the driver stack. An old symptom of this was the 'sticky downclock', which used to occur at task termination and made it look like the next task starting up broke something. It always looked like that because the juncture of task completion and new task startup nearly always coincided.

The associated delays involved in recovery (when it can) are pretty lengthy, and block a lot of system operation for the duration.

The effects change with drivers and models of GPU, but you can typically replicate the conditions by terminating an active GPU task via Task manager.

For a long time that's been treated here by trying to use critical sections as they're supposed to be used, while still providing some means to sync cleanly on exit when possible, but afaik there are quirks and imperfections in all the approaches tried.

That's where the recommendation for a registerable onexit callback comes from, though of course that doesn't make the mechanism behave gracefully under external termination. Probably more the domain of MS polishing WDDM and working with nVidia.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1702355 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1702360 - Posted: 17 Jul 2015, 0:39:13 UTC

A day or two ago I got a reboot on a BOINC restart attempt (2 ATi AP tasks were running on the GPU at that time, plus some Flash with a resource leak taking >1G of RAM). There was display picture damage just before the reboot.
But the worst part of it: client_state.xml was zeroed and the full cache was lost (maybe because all the writeback settings for the HDD were enabled).
So, it's definitely not safe to restart BOINC with GPU apps running (even though the critical section is implemented).
ID: 1702360 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1702363 - Posted: 17 Jul 2015, 0:46:58 UTC - in response to Message 1702360.  

So, it's definitely not safe to restart BOINC with GPU apps running (even though the critical section is implemented).


Yeah, it hasn't been enough (I still use exit flag polling). The rough structure is a user mode driver, with a user mode driver helper sitting in the application's area. So if you bring that down, your other user mode drivers can go down too. Still better than the original XP driver model, where it'd BSOD if you looked away, so the slower later XP model hybrid is a bit more solid.
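
A minimal sketch of what exit flag polling means here, in Python rather than the real application code. The names `run_task`, `exit_flag` and `transfer` are made up for illustration, and the "transfer" is only a stand-in for a DMA copy. The point is that the flag is checked between transfers only, so buffers are never freed underneath one that is still in flight:

```python
import threading


def run_task(chunks, exit_flag, transfer):
    """Process chunks one transfer at a time, polling exit_flag between them.

    All names are hypothetical; `transfer` stands in for a device/host copy.
    """
    log = []
    buffer = list(chunks)            # stand-in for a host buffer
    for chunk in buffer:
        if exit_flag.is_set():       # poll only at a safe point, never mid-transfer
            log.append("exit requested")
            break
        transfer(chunk)              # pretend transfer, always runs to completion
        log.append("transferred " + chunk)
    buffer.clear()                   # free only once nothing is in flight
    log.append("freed")
    return log
```

This only helps with a cooperative shutdown, where something sets the flag and the task gets to finish its current transfer. It does nothing if the process is killed from outside.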
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1702363 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1702365 - Posted: 17 Jul 2015, 0:52:34 UTC - in response to Message 1702363.  

Being user mode did not save me from a whole-system reboot with file damage on the NTFS file system, unfortunately.
ID: 1702365 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1702366 - Posted: 17 Jul 2015, 0:57:18 UTC - in response to Message 1702365.  

Being user mode did not save me from a whole-system reboot with file damage on the NTFS file system, unfortunately.


Yeah, that sounds pretty extreme. Makes sense, but not nice.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1702366 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1702850 - Posted: 18 Jul 2015, 16:21:26 UTC - in response to Message 1702289.  

Now, having just looked at that code block again with my non-C++ trained eyes, I'm wondering whether I've still misinterpreted what will happen if that 5-second grace period expires. Does the Sharing Violation actually end up causing that Error Code 32 to be reported, or does the normal code path simply resume, presumably still producing a truncated stderr.txt? I dunno, some expert is gonna hafta 'splain that to me! ;^)

To my (also untrained) eye, it looks like it simply falls out of the bottom, five seconds later. So it does what it was going to do anyway, but slower.

If that's the case, then it will definitely be a significant improvement for S@h, too, even if it is more of a patch than a true fix. And to use Jason's raft analogy ("Fingers crossed this raft never gets used on the open ocean."), you may not want to launch a raft with this sort of patch, but if you're already in the middle of the ocean, it sure will be helpful to patch the pinholes any way you can, until you can get the raft back to shore and build a whole new one!


Followed up on this with clear eyeballs as promised. I have no issues with the method or logic. It'll just keep trying to open the file for up to 5 seconds (rather than simply a fixed delay). The comment doesn't really say why a magic number of 5 seconds ( not 3, 9 or 42 ), and it only tries once per second for access, but should be way better as your results would suggest.
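
For readers who don't have the code block in front of them, the retry logic described can be sketched roughly as below. This is not the BOINC source, just a Python illustration under assumptions: the function name `read_with_retry` and its parameters are hypothetical, and a sharing violation is assumed to surface as `PermissionError`. It tries once per second for up to five attempts, and simply gives up (returns `None`) if the file never becomes readable:

```python
import time


def read_with_retry(path, attempts=5, delay=1.0, opener=open, sleep=time.sleep):
    """Read `path`, retrying while another process still holds the file.

    Hypothetical helper: returns the contents, or None if every attempt failed.
    `opener` and `sleep` are injectable so the loop can be tested without waiting.
    """
    for attempt in range(attempts):
        try:
            with opener(path, "r") as f:
                return f.read()
        except PermissionError:
            # Assumption: a sharing violation shows up as PermissionError.
            if attempt < attempts - 1:
                sleep(delay)         # wait before the next try, not after the last
    return None
```

The 5 and the 1.0 are just the numbers mentioned in the post, exposed as parameters so they are no longer magic.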
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1702850 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1702874 - Posted: 18 Jul 2015, 17:15:27 UTC - in response to Message 1702850.  

Now, having just looked at that code block again with my non-C++ trained eyes, I'm wondering whether I've still misinterpreted what will happen if that 5-second grace period expires. Does the Sharing Violation actually end up causing that Error Code 32 to be reported, or does the normal code path simply resume, presumably still producing a truncated stderr.txt? I dunno, some expert is gonna hafta 'splain that to me! ;^)

To my (also untrained) eye, it looks like it simply falls out of the bottom, five seconds later. So it does what it was going to do anyway, but slower.

If that's the case, then it will definitely be a significant improvement for S@h, too, even if it is more of a patch than a true fix. And to use Jason's raft analogy ("Fingers crossed this raft never gets used on the open ocean."), you may not want to launch a raft with this sort of patch, but if you're already in the middle of the ocean, it sure will be helpful to patch the pinholes any way you can, until you can get the raft back to shore and build a whole new one!


Followed up on this with clear eyeballs as promised. I have no issues with the method or logic. It'll just keep trying to open the file for up to 5 seconds (rather than simply a fixed delay). The comment doesn't really say why a magic number of 5 seconds ( not 3, 9 or 42 ), and it only tries once per second for access, but should be way better as your results would suggest.

Thanks for the follow-up, Jason. So, if even the 5-second grace period isn't sufficient, BOINC might still produce the truncated stderr.txt but won't throw that new Sharing Violation back at the user as an "Error while computing"? Then that should make the truncations and by extension, the "instant" Invalids, much, much rarer, even if not entirely extinct. A major improvement!
ID: 1702874 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1702876 - Posted: 18 Jul 2015, 17:26:02 UTC - in response to Message 1702874.  
Last modified: 18 Jul 2015, 18:07:04 UTC

Now, having just looked at that code block again with my non-C++ trained eyes, I'm wondering whether I've still misinterpreted what will happen if that 5-second grace period expires. Does the Sharing Violation actually end up causing that Error Code 32 to be reported, or does the normal code path simply resume, presumably still producing a truncated stderr.txt? I dunno, some expert is gonna hafta 'splain that to me! ;^)

To my (also untrained) eye, it looks like it simply falls out of the bottom, five seconds later. So it does what it was going to do anyway, but slower.

If that's the case, then it will definitely be a significant improvement for S@h, too, even if it is more of a patch than a true fix. And to use Jason's raft analogy ("Fingers crossed this raft never gets used on the open ocean."), you may not want to launch a raft with this sort of patch, but if you're already in the middle of the ocean, it sure will be helpful to patch the pinholes any way you can, until you can get the raft back to shore and build a whole new one!


Followed up on this with clear eyeballs as promised. I have no issues with the method or logic. It'll just keep trying to open the file for up to 5 seconds (rather than simply a fixed delay). The comment doesn't really say why a magic number of 5 seconds ( not 3, 9 or 42 ), and it only tries once per second for access, but should be way better as your results would suggest.

Thanks for the follow-up, Jason. So, if even the 5-second grace period isn't sufficient, BOINC might still produce the truncated stderr.txt but won't throw that new Sharing Violation back at the user as an "Error while computing"? Then that should make the truncations and by extension, the "instant" Invalids, much, much rarer, even if not entirely extinct. A major improvement!


Yep. I mentioned the mindset shift before, and Rom indicated he's well experienced with it. Breaking it down, which always takes some effort with lofty abstract concepts:

Old Way (Imperative Control)
Do This ... and it does

New Way (Asynchronous)
request-->wait-->OKdone

Trying to mash the two together with a fork:
demand-->no, wait---> crash

pretty tough, lol

[Edit:] So you could argue that the commit mode + patient read is now near optimal, once applications and clients adopt it.
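
The request-->wait-->done pattern can be sketched in a few lines of Python. Everything here (`request_and_wait`, `start_job`, `on_complete`) is a hypothetical name for illustration, not part of any BOINC or driver API. The caller asks for the work, then waits for a completion signal rather than assuming it has already happened:

```python
import threading


def request_and_wait(start_job, timeout):
    """Request work, then wait for a completion signal (hypothetical sketch)."""
    done = threading.Event()
    result = {}

    def on_complete(value):
        result["value"] = value
        done.set()                       # the "OK, done" signal

    start_job(on_complete)               # request
    if done.wait(timeout):               # wait
        return ("done", result["value"])
    return ("still waiting", None)       # not done yet; caller decides what next
```

The "mash the two together" failure is what you get when the caller skips the wait and carries on as if the result were already there.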
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1702876 · Report as offensive
Profile Rom Walton (BOINC)
Volunteer tester
Joined: 28 Apr 00
Posts: 579
Credit: 130,733
RAC: 0
United States
Message 1702886 - Posted: 18 Jul 2015, 18:31:06 UTC - in response to Message 1702850.  

Now, having just looked at that code block again with my non-C++ trained eyes, I'm wondering whether I've still misinterpreted what will happen if that 5-second grace period expires. Does the Sharing Violation actually end up causing that Error Code 32 to be reported, or does the normal code path simply resume, presumably still producing a truncated stderr.txt? I dunno, some expert is gonna hafta 'splain that to me! ;^)

To my (also untrained) eye, it looks like it simply falls out of the bottom, five seconds later. So it does what it was going to do anyway, but slower.

If that's the case, then it will definitely be a significant improvement for S@h, too, even if it is more of a patch than a true fix. And to use Jason's raft analogy ("Fingers crossed this raft never gets used on the open ocean."), you may not want to launch a raft with this sort of patch, but if you're already in the middle of the ocean, it sure will be helpful to patch the pinholes any way you can, until you can get the raft back to shore and build a whole new one!


Followed up on this with clear eyeballs as promised. I have no issues with the method or logic. It'll just keep trying to open the file for up to 5 seconds (rather than simply a fixed delay). The comment doesn't really say why a magic number of 5 seconds ( not 3, 9 or 42 ), and it only tries once per second for access, but should be way better as your results would suggest.


99% of the time it'll probably read the contents of stderr.txt on the first retry when the error condition is triggered. The extra four retries are to deal with the possibility that we have to wait for various forms of anti-malware and content indexers to have had their way with the file as well.

Special purpose software such as virus scanners and anti-malware usually have kernel mode components that make sure they get first dibs on files after they have been released by the apps that created them.

Normal Windows software hooks into the file system notification API and lets Windows tell it when it is its turn to fiddle with the file. BOINC, being a cross-platform async polling app, instead relies on checking to see if it can successfully access the file.
----- Rom
BOINC Development Team, U.C. Berkeley
My Blog
ID: 1702886 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1702887 - Posted: 18 Jul 2015, 19:00:29 UTC - in response to Message 1702886.  

99% of the time it'll probably read the contents of stderr.txt on the first retry when the error condition is triggered. The extra four retries are to deal with the possibility that we have to wait for various forms of anti-malware and content indexers to have had their way with the file as well.

Special purpose software such as virus scanners and anti-malware usually have kernel mode components that make sure they get first dibs on files after they have been released by the apps that created them.

Normal Windows software hooks into the file system notification API and lets Windows tell it when it is its turn to fiddle with the file. BOINC, being a cross-platform async polling app, instead relies on checking to see if it can successfully access the file.

For those couple days that I was running with the Process Monitor logging turned on, it never required more than one re-try to be successful. However, the AV on those boxes (Security Essentials on the xw9400 and Windows Defender on the T7400) didn't seem to be taking any interest in those files. It was certainly informative, however, to see that Windows Search indexing was accounting for a significant amount of activity on the T7400 (Message 1701586). Seeing no purpose in indexing such transient files, I was pretty quick to turn the indexing off for the whole slots tree. So, another efficiency improvement resulting from the testing. :^)
ID: 1702887 · Report as offensive
Profile Rom Walton (BOINC)
Volunteer tester
Joined: 28 Apr 00
Posts: 579
Credit: 130,733
RAC: 0
United States
Message 1702888 - Posted: 18 Jul 2015, 19:02:22 UTC - in response to Message 1702887.  

99% of the time it'll probably read the contents of stderr.txt on the first retry when the error condition is triggered. The extra four retries are to deal with the possibility that we have to wait for various forms of anti-malware and content indexers to have had their way with the file as well.

Special purpose software such as virus scanners and anti-malware usually have kernel mode components that make sure they get first dibs on files after they have been released by the apps that created them.

Normal Windows software hooks into the file system notification API and lets Windows tell it when it is its turn to fiddle with the file. BOINC, being a cross-platform async polling app, instead relies on checking to see if it can successfully access the file.

For those couple days that I was running with the Process Monitor logging turned on, it never required more than one re-try to be successful. However, the AV on those boxes (Security Essentials on the xw9400 and Windows Defender on the T7400) didn't seem to be taking any interest in those files. It was certainly informative, however, to see that Windows Search indexing was accounting for a significant amount of activity on the T7400 (Message 1701586). Seeing no purpose in indexing such transient files, I was pretty quick to turn the indexing off for the whole slots tree. So, another efficiency improvement resulting from the testing. :^)


Security Essentials is MsMpEng.exe.
----- Rom
BOINC Development Team, U.C. Berkeley
My Blog
ID: 1702888 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1702891 - Posted: 18 Jul 2015, 19:21:13 UTC - in response to Message 1702888.  

99% of the time it'll probably read the contents of stderr.txt on the first retry when the error condition is triggered. The extra four retries are to deal with the possibility that we have to wait for various forms of anti-malware and content indexers to have had their way with the file as well.

Special purpose software such as virus scanners and anti-malware usually have kernel mode components that make sure they get first dibs on files after they have been released by the apps that created them.

Normal Windows software hooks into the file system notification API and lets Windows tell it when it is its turn to fiddle with the file. BOINC, being a cross-platform async polling app, instead relies on checking to see if it can successfully access the file.

For those couple days that I was running with the Process Monitor logging turned on, it never required more than one re-try to be successful. However, the AV on those boxes (Security Essentials on the xw9400 and Windows Defender on the T7400) didn't seem to be taking any interest in those files. It was certainly informative, however, to see that Windows Search indexing was accounting for a significant amount of activity on the T7400 (Message 1701586). Seeing no purpose in indexing such transient files, I was pretty quick to turn the indexing off for the whole slots tree. So, another efficiency improvement resulting from the testing. :^)


Security Essentials is MsMpEng.exe.

Ah, okay, thank you. I hadn't caught on to that. Well, at least it seemed to get in and out very quickly.
ID: 1702891 · Report as offensive
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1702933 - Posted: 18 Jul 2015, 22:49:59 UTC - in response to Message 1702888.  

However, the AV on those boxes (Security Essentials on the xw9400 and Windows Defender on the T7400) didn't seem to be taking any interest in those files.

MSE is very resource light.
I know people that run Norton & other AV programmes & their system responsiveness is very poor.
The programmes themselves don't use much CPU resources, but they do bog things down when downloading emails, opening & saving files etc. I suspect the defaults are to scan everything always.
Maybe if they were to use more CPU resources they could do their work faster with less impact, but as they are they really do slow down systems, even high end ones.
Grant
Darwin NT
ID: 1702933 · Report as offensive
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1704754 - Posted: 24 Jul 2015, 23:54:54 UTC - in response to Message 1702933.  
Last modified: 24 Jul 2015, 23:57:07 UTC

I'll be glad when a fix is generally available.
Haven't had an issue for a month or 2, and then today alone I have 2 truncated Stderr outputs turn up,


<core_client_version>7.0.64</core_client_version>
<![CDATA[
<stderr_txt>

</stderr_txt>
]]>

4281155998
4279260695
Grant
Darwin NT
ID: 1704754 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1704758 - Posted: 25 Jul 2015, 0:11:20 UTC - in response to Message 1704754.  
Last modified: 25 Jul 2015, 0:15:02 UTC

I'm not sure I understand your concerns. The 'fix' has been available for over a week.

BOINC 7.6.6 HERE
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1704758 · Report as offensive
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1704760 - Posted: 25 Jul 2015, 0:22:53 UTC - in response to Message 1704758.  
Last modified: 25 Jul 2015, 0:23:35 UTC

I'm not sure I understand your concerns. The 'fix' has been available for over a week.

BOINC 7.6.6 HERE

Development version
(MAY BE UNSTABLE - USE ONLY FOR TESTING)

What are the known issues?
Grant
Darwin NT
ID: 1704760 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1704768 - Posted: 25 Jul 2015, 1:13:31 UTC - in response to Message 1704760.  

Absolutely nothing that I or the other testers have come across. I believe it will go to main next week anyway. You definitely should be running it if you do MilkyWay. In fact there is a recommended sticky note to run this client at MilkyWay.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1704768 · Report as offensive
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1704771 - Posted: 25 Jul 2015, 1:22:35 UTC - in response to Message 1704768.  

Absolutely nothing that I or the other testers have come across. I believe it will go to main next week anyway. You definitely should be running it if you do MilkyWay. In fact there is a recommended sticky note to run this client at MilkyWay.

I'm Seti only.
Even if it doesn't become the recommended version next week, I'll see how adventurous I'm feeling next weekend.
Grant
Darwin NT
ID: 1704771 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1704772 - Posted: 25 Jul 2015, 1:22:39 UTC - in response to Message 1704760.  

I'm not sure I understand your concerns. The 'fix' has been available for over a week.

BOINC 7.6.6 HERE

Development version
(MAY BE UNSTABLE - USE ONLY FOR TESTING)

What are the known issues?

I don't know about any known issues, but according to the Test results summaries chart it appears that they're 71% complete. The chart seems to show that a lack of XP and Mac testers is the primary holdup.
ID: 1704772 · Report as offensive
Profile Jord
Volunteer tester
Joined: 9 Jun 99
Posts: 15184
Credit: 4,362,181
RAC: 3
Netherlands
Message 1704774 - Posted: 25 Jul 2015, 1:51:56 UTC - in response to Message 1704772.  

And someone writing the documentation, manual and wiki. I haven't had much time these past weeks, hope to do better this weekend. :)
ID: 1704774 · Report as offensive
 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.