Stderr Truncations

jason_gee Volunteer developer Volunteer tester Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
Message 1702355 - Posted: 17 Jul 2015, 0:10:29 UTC - in response to Message 1702345. Last modified: 17 Jul 2015, 0:11:22 UTC

Part of it (though I'm certain not the complete story) is that DMA engine transfers over the PCIe bus are frequently in flight, from device to host and vice versa. That applies to OpenCL and DirectCompute as well. In some cases these are explicit calls initiated by the application programmer, while in others they come from underlying library functionality, or even from driver management under control of the changing WDDM and OS use of the resources, which has a paged virtual memory model (now being unified).

When you terminate and free host buffers underneath an active DMA transfer, you induce arbitrary effects in the driver, from fairly benign ones that are transparently recovered through to a full reset of the driver stack. An old symptom of this was the 'sticky downclock', which used to occur at task termination and made it look like the next task starting up had broken something. It always looked that way because task completion and new task startup nearly always coincided. The delays involved in recovery (when it can recover) are pretty lengthy, and block a lot of system operation for the duration. The effects change with drivers and models of GPU, but you can typically replicate the conditions by terminating an active GPU task via Task Manager.

For a long time that's been treated here by trying to use critical sections as they're supposed to be used, while still providing some means to sync cleanly on exit when possible, but afaik there are quirks and imperfections in all the approaches tried. That's where the recommendation for a registerable onexit callback comes from, though of course that doesn't make the mechanism behave gracefully under external termination. It's probably more the domain of MS polishing WDDM and working with nVidia.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1702355 ·

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Message 1702360 - Posted: 17 Jul 2015, 0:39:13 UTC

A day or two ago I got a reboot on a BOINC restart attempt (2 ATi AP tasks were running on the GPU at that time, plus some Flash with a resource leak taking >1G of RAM). The display picture was damaged just before the reboot. But the worst part was that client_state.xml was zeroed and the full cache was lost (maybe because all writeback settings for the HDD were enabled). So, it's definitely not safe to restart BOINC with GPU apps running (though critical section implemented).
ID: 1702360 ·

jason_gee Volunteer developer Volunteer tester Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
Message 1702363 - Posted: 17 Jul 2015, 0:46:58 UTC - in response to Message 1702360.

[quote]So, it's definitely not safe to restart BOINC with GPU apps running (though critical section implemented).[/quote]

Yeah, it hasn't been enough (I still use exit flag polling). The rough structure is a user mode driver, with a user mode driver helper sitting in the application's area. So if you bring that down, your other user mode drivers can go down too. Still better than the original XP driver model, where it'd BSOD if you looked away, so the slower, later XP-model hybrid is a bit more solid.
ID: 1702363 ·
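
The exit flag polling mentioned above can be sketched roughly as follows. This is an illustration only, not code from any of the applications discussed: `g_exit_requested`, `run_chunks` and `process_chunk` are hypothetical names, and the sketch assumes that each chunk's host/device transfers have completed by the time `process_chunk` returns. The flag is checked only at chunk boundaries, so host buffers are never freed underneath a transfer that is still in flight.

```cpp
#include <atomic>
#include <functional>

// Hypothetical flag; something outside the work loop (a signal handler,
// a message from the client) would set it to request a clean exit.
std::atomic<bool> g_exit_requested{false};

// Runs up to total_chunks units of work and returns how many completed.
// Assumption: process_chunk() only returns once its own transfers are done.
int run_chunks(int total_chunks,
               const std::function<void(int)>& process_chunk) {
    int completed = 0;
    for (int i = 0; i < total_chunks; ++i) {
        // Poll at the chunk boundary only, never in the middle of a chunk.
        if (g_exit_requested.load()) {
            break;
        }
        process_chunk(i);
        ++completed;
    }
    // Buffers would be released here, after the last chunk has finished.
    return completed;
}
```

As noted in the thread, this only helps with cooperative shutdown; it does nothing if the process is killed externally, for example from Task Manager.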

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Message 1702365 - Posted: 17 Jul 2015, 0:52:34 UTC - in response to Message 1702363.

Being user mode did not save from whole system reboot with NTFS file system file damage, unfortunately.
ID: 1702365 ·

jason_gee Volunteer developer Volunteer tester Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
Message 1702366 - Posted: 17 Jul 2015, 0:57:18 UTC - in response to Message 1702365.

[quote]Being user mode did not save from whole system reboot with NTFS file system file damage, unfortunately.[/quote]

Yeah, that sounds pretty extreme. Makes sense, but not nice.
ID: 1702366 ·

jason_gee Volunteer developer Volunteer tester Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
Message 1702850 - Posted: 18 Jul 2015, 16:21:26 UTC - in response to Message 1702289.

[quote]
Now, having just looked at that code block again with my non-C++ trained eyes, I'm wondering whether I've still misinterpreted what will happen if that 5-second grace period expires. Does the Sharing Violation actually end up causing that Error Code 32 to be reported, or does the normal code path simply resume, presumably still producing a truncated stderr.txt? I dunno, some expert is gonna hafta 'splain that to me! ;^)

To my (also untrained) eye, it looks like it simply falls out of the bottom, five seconds later. So it does what it was going to do anyway, but slower.

If that's the case, then it will definitely be a significant improvement for S@h, too, even if it is more of a patch than a true fix. And to use Jason's raft analogy ("Fingers crossed this raft never gets used on the open ocean."), you may not want to launch a raft with this sort of patch, but if you're already in the middle of the ocean, it sure will be helpful to patch the pinholes any way you can, until you can get the raft back to shore and build a whole new one!
[/quote]

Followed up on this with clear eyeballs as promised. I have no issues with the method or logic. It'll just keep trying to open the file for up to 5 seconds (rather than simply a fixed delay). The comment doesn't really say why the magic number is 5 seconds (not 3, 9 or 42), and it only tries for access once per second, but it should be way better, as your results would suggest.
ID: 1702850 ·
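
The retry behaviour described above (keep trying to open the file for up to five seconds, once per second, then carry on regardless) might look roughly like this. It is a sketch under stated assumptions, not the BOINC source: `retry_open` and the `try_open` callback are hypothetical names, the callback stands in for whatever call actually opens stderr.txt and fails on a sharing violation, and the first open attempt is assumed to have failed already.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper, not BOINC's actual code.
// Waits `delay`, then calls try_open(); repeats up to max_retries times.
// Returns the 1-based number of the retry that succeeded, or 0 if none did.
int retry_open(const std::function<bool()>& try_open,
               int max_retries = 5,
               std::chrono::milliseconds delay = std::chrono::seconds(1)) {
    for (int retry = 1; retry <= max_retries; ++retry) {
        std::this_thread::sleep_for(delay);  // give the other process time to let go
        if (try_open()) {
            return retry;
        }
    }
    // Grace period expired: the caller simply continues on its normal path.
    return 0;
}
```

A return value of 0 corresponds to "falling out of the bottom": nothing is thrown, and the caller proceeds with whatever it can read.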

Jeff Buck Volunteer tester Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
Message 1702874 - Posted: 18 Jul 2015, 17:15:27 UTC - in response to Message 1702850.

[quote]Followed up on this with clear eyeballs as promised. I have no issues with the method or logic. It'll just keep trying to open the file for up to 5 seconds (rather than simply a fixed delay). The comment doesn't really say why the magic number is 5 seconds (not 3, 9 or 42), and it only tries for access once per second, but it should be way better, as your results would suggest.[/quote]

Thanks for the follow-up, Jason. So, if even the 5-second grace period isn't sufficient, BOINC might still produce the truncated stderr.txt but won't throw that new Sharing Violation back at the user as an "Error while computing"? Then that should make the truncations and, by extension, the "instant" Invalids much, much rarer, even if not entirely extinct. A major improvement!
ID: 1702874 ·

jason_gee Volunteer developer Volunteer tester Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
Message 1702876 - Posted: 18 Jul 2015, 17:26:02 UTC - in response to Message 1702874. Last modified: 18 Jul 2015, 18:07:04 UTC

[quote]Thanks for the follow-up, Jason. So, if even the 5-second grace period isn't sufficient, BOINC might still produce the truncated stderr.txt but won't throw that new Sharing Violation back at the user as an "Error while computing"? Then that should make the truncations and, by extension, the "instant" Invalids much, much rarer, even if not entirely extinct. A major improvement![/quote]

Yep. I mentioned the mindset shift before, and Rom indicated he's well experienced with it. Breaking it down, which always takes some effort with lofty abstract concepts:

Old Way (Imperative Control): Do This ... and it does
New Way (Asynchronous): request --> wait --> OK, done
Trying to mash the two together with a fork: demand --> no, wait --> crash

Pretty tough, lol.

[Edit:] So you could argue that the commit mode + patient read is now near optimal, once applications and clients adopt it.
ID: 1702876 ·

Rom Walton (BOINC) Volunteer tester Joined: 28 Apr 00 Posts: 579 Credit: 130,733 RAC: 0
Message 1702886 - Posted: 18 Jul 2015, 18:31:06 UTC - in response to Message 1702850.

[quote]Followed up on this with clear eyeballs as promised. I have no issues with the method or logic. It'll just keep trying to open the file for up to 5 seconds (rather than simply a fixed delay). The comment doesn't really say why the magic number is 5 seconds (not 3, 9 or 42), and it only tries for access once per second, but it should be way better, as your results would suggest.[/quote]

99% of the time it'll probably get the contents of stderr.txt on the first retry when the error condition is triggered. The extra four retries are to deal with the possibility that we have to wait for various forms of anti-malware and content indexers to have had their way with the file as well.

Special purpose software such as virus scanners and anti-malware usually has kernel mode components that make sure it gets first dibs on files after they have been released by the apps that created them. Normal Windows software hooks into the file system notification API and lets Windows tell it when it is its turn to fiddle with the file. BOINC, being a cross-platform async polling app, instead relies on checking to see if it can successfully access the file.

-----
Rom
BOINC Development Team, U.C. Berkeley
My Blog
ID: 1702886 ·
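
To illustrate the polling approach Rom describes, here is a minimal, portable probe that simply tries to open and read the file and reports whether that worked. `try_read_all` is a hypothetical name and this is not BOINC's implementation; whether an open attempt actually fails while another process holds the file depends on the platform and on the share mode that process used, so treat the failure branch as an assumption.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical polling probe, not BOINC's code.
// Returns true and fills `out` only if the file can be opened for reading
// right now; returns false if it is missing or otherwise inaccessible.
bool try_read_all(const std::string& path, std::string& out) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        return false;  // cannot access the file at the moment; caller may poll again
    }
    std::ostringstream buffer;
    buffer << in.rdbuf();  // copy the whole file into memory
    out = buffer.str();
    return true;
}
```

The design trade-off is the one described in the post: a notification API tells the program when the file is free, whereas a probe like this has to be called repeatedly until it succeeds.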

Jeff Buck Volunteer tester Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
Message 1702887 - Posted: 18 Jul 2015, 19:00:29 UTC - in response to Message 1702886.

[quote]
99% of the time it'll probably get the contents of stderr.txt on the first retry when the error condition is triggered. The extra four retries are to deal with the possibility that we have to wait for various forms of anti-malware and content indexers to have had their way with the file as well.

Special purpose software such as virus scanners and anti-malware usually has kernel mode components that make sure it gets first dibs on files after they have been released by the apps that created them. Normal Windows software hooks into the file system notification API and lets Windows tell it when it is its turn to fiddle with the file. BOINC, being a cross-platform async polling app, instead relies on checking to see if it can successfully access the file.
[/quote]

For the couple of days that I was running with Process Monitor logging turned on, it never required more than one retry to be successful. However, the AV on those boxes (Security Essentials on the xw9400 and Windows Defender on the T7400) didn't seem to be taking any interest in those files. It was certainly informative, however, to see that Windows Search indexing was accounting for a significant amount of activity on the T7400 (Message 1701586). Seeing no purpose in indexing such transient files, I was pretty quick to turn the indexing off for the whole slots tree. So, another efficiency improvement resulting from the testing. :^)
ID: 1702887 ·

Rom Walton (BOINC) Volunteer tester Joined: 28 Apr 00 Posts: 579 Credit: 130,733 RAC: 0
Message 1702888 - Posted: 18 Jul 2015, 19:02:22 UTC - in response to Message 1702887.

[quote]For the couple of days that I was running with Process Monitor logging turned on, it never required more than one retry to be successful. However, the AV on those boxes (Security Essentials on the xw9400 and Windows Defender on the T7400) didn't seem to be taking any interest in those files. It was certainly informative, however, to see that Windows Search indexing was accounting for a significant amount of activity on the T7400 (Message 1701586). Seeing no purpose in indexing such transient files, I was pretty quick to turn the indexing off for the whole slots tree. So, another efficiency improvement resulting from the testing. :^)[/quote]

Security Essentials is MsMpEng.exe.
ID: 1702888 ·

Jeff Buck Volunteer tester Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
Message 1702891 - Posted: 18 Jul 2015, 19:21:13 UTC - in response to Message 1702888.

[quote]Security Essentials is MsMpEng.exe.[/quote]

Ah, okay, thank you. I hadn't caught on to that. Well, at least it seemed to get in and out very quickly.
ID: 1702891 ·

Grant (SSSF) Volunteer tester Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304
Message 1702933 - Posted: 18 Jul 2015, 22:49:59 UTC - in response to Message 1702888.

[quote]However, the AV on those boxes (Security Essentials on the xw9400 and Windows Defender on the T7400) didn't seem to be taking any interest in those files.[/quote]

MSE is very resource light. I know people who run Norton and other AV programmes, and their system responsiveness is very poor. The programmes themselves don't use much CPU, but they do bog things down when downloading emails, opening and saving files, etc. I suspect the defaults are to scan everything, always. Maybe if they were to use more CPU resources they could do their work faster with less impact, but as they are they really do slow down systems, even high-end ones.

Grant
Darwin NT
ID: 1702933 ·

Grant (SSSF) Volunteer tester Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304
Message 1704754 - Posted: 24 Jul 2015, 23:54:54 UTC - in response to Message 1702933. Last modified: 24 Jul 2015, 23:57:07 UTC

I'll be glad when a fix is generally available. I hadn't had an issue for a month or two, and then today alone I've had 2 truncated Stderr outputs turn up:

<core_client_version>7.0.64</core_client_version>
<![CDATA[
<stderr_txt>
</stderr_txt>
]]>

4281155998
4279260695
ID: 1704754 ·
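
For anyone wanting to scan saved task reports for this symptom, a check for an empty stderr section could be sketched as below. This is an assumption-laden illustration, not part of BOINC or the project's validator: `stderr_is_blank` is a hypothetical name, it does plain substring matching rather than real XML parsing, and it only catches the fully empty case shown above, not output that was cut off part way through.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: returns true when the text between <stderr_txt> and
// </stderr_txt> is missing entirely or contains only whitespace.
bool stderr_is_blank(const std::string& report) {
    const std::string open_tag = "<stderr_txt>";
    const std::string close_tag = "</stderr_txt>";

    const std::string::size_type start = report.find(open_tag);
    if (start == std::string::npos) {
        return true;  // no stderr section at all
    }

    const std::string::size_type body_start = start + open_tag.size();
    const std::string::size_type end = report.find(close_tag, body_start);
    const std::string body = (end == std::string::npos)
        ? report.substr(body_start)                      // unterminated section
        : report.substr(body_start, end - body_start);   // text between the tags

    return std::all_of(body.begin(), body.end(),
                       [](unsigned char c) { return std::isspace(c) != 0; });
}
```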

Keith Myers Volunteer tester Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873
Message 1704758 - Posted: 25 Jul 2015, 0:11:20 UTC - in response to Message 1704754. Last modified: 25 Jul 2015, 0:15:02 UTC

I'm not sure I understand your concerns. The 'fix' has been available for over a week. BOINC 7.6.6 HERE

Seti@Home classic workunits:20,676 CPU time:74,226 hours
A proud member of the OFA (Old Farts Association)
ID: 1704758 ·

Grant (SSSF) Volunteer tester Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304
Message 1704760 - Posted: 25 Jul 2015, 0:22:53 UTC - in response to Message 1704758. Last modified: 25 Jul 2015, 0:23:35 UTC

[quote]I'm not sure I understand your concerns. The 'fix' has been available for over a week. BOINC 7.6.6 HERE[/quote]

Development version (MAY BE UNSTABLE - USE ONLY FOR TESTING)

What are the known issues?
ID: 1704760 ·

Keith Myers Volunteer tester Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873
Message 1704768 - Posted: 25 Jul 2015, 1:13:31 UTC - in response to Message 1704760.

Absolutely nothing that I or the other testers have come across. I believe it will go to main next week anyway. You definitely should be running it if you do MilkyWay. In fact, there is a sticky note at MilkyWay recommending this client.
ID: 1704768 ·

Grant (SSSF) Volunteer tester Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304
Message 1704771 - Posted: 25 Jul 2015, 1:22:35 UTC - in response to Message 1704768.

[quote]Absolutely nothing that I or the other testers have come across. I believe it will go to main next week anyway. You definitely should be running it if you do MilkyWay. In fact, there is a sticky note at MilkyWay recommending this client.[/quote]

I'm Seti only. Even if it doesn't become the recommended version next week, I'll see how adventurous I'm feeling next weekend.
ID: 1704771 ·

Jeff Buck Volunteer tester Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
Message 1704772 - Posted: 25 Jul 2015, 1:22:39 UTC - in response to Message 1704760.

[quote]Development version (MAY BE UNSTABLE - USE ONLY FOR TESTING)

What are the known issues?[/quote]

I don't know about any known issues, but according to the Test results summaries chart it appears that they're 71% complete. The chart seems to show that a lack of XP and Mac testers is the primary holdup.
ID: 1704772 ·

Jord Volunteer tester Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3
Message 1704774 - Posted: 25 Jul 2015, 1:51:56 UTC - in response to Message 1704772.

And someone writing the documentation, manual and wiki. I haven't had much time these past weeks; I hope to do better this weekend. :)
ID: 1704774 ·

©2024 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.