Stderr Truncations
Author | Message |
---|---|
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Part of it (though not the complete story, I'm certain) is that DMA engine transfers over the PCIe bus are frequently in flight, from device to host and vice versa. The same applies to OpenCL and DirectCompute. In some cases these are explicit calls initiated by the application programmer, while in others they come from underlying library functionality, or even from driver management under the control of WDDM and the OS's changing use of the resources, which has a paged virtual memory model (now being unified). When you terminate and free host buffers underneath an active DMA transfer, you induce arbitrary effects in the driver, from fairly benign ones that are transparently recovered, through to a full reset of the driver stack. An old symptom of this was the 'sticky downclock', which used to occur at task termination and made it look like the next task starting up broke something. It always looked like that because task completion and new task startup nearly always coincided. The delays involved in recovery (when it can recover) are pretty lengthy, and block a lot of system operation for the duration. The effects change with drivers and models of GPU, but you can typically replicate the conditions by terminating an active GPU task via Task Manager. For a long time that's been treated here by trying to use critical sections as they're supposed to be used, while still providing some means to sync cleanly on exit when possible, but afaik there are quirks and imperfections in all the approaches tried. That's where the recommendation for a registerable onexit callback comes from, though of course that doesn't make the mechanism behave gracefully under external termination. Probably more the domain of MS polishing WDDM and working with nVidia. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
A day or 2 ago I got a reboot on a BOINC restart attempt (2 ATi AP tasks running on the GPU at that time, plus some Flash with a resource leak taking >1G of RAM). The display picture was damaged just before the reboot. But the worst part of it: client_state.xml was zeroed and the full cache was lost (maybe because all writeback settings for the HDD were enabled). So, it's definitely not safe to restart BOINC with GPU apps running (though critical section is implemented). |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"So, it's definitely not safe to restart BOINC with GPU apps running (though critical section is implemented)." Yeah, it hasn't been enough (I still use exit flag polling). The rough structure is a user mode driver, with a user mode driver helper sitting in the application's area. So if you bring that down, your other user mode drivers can go down too. Still better than the original XP driver model, where it'd BSOD if you looked away, so the slower, later XP-model hybrid is a bit more solid. |
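For anyone wondering what "exit flag polling" looks like in practice, here is a minimal sketch, assuming a worker loop that handles one chunk at a time. The names `process_chunk` and `run_until_exit` are hypothetical and the "transfer" is only a stand-in for real GPU work; none of this is taken from any actual SETI@home application. The point is simply that the flag is checked only between chunks, so the host buffer is never released while a transfer could still be in flight.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one unit of GPU work plus its host transfer.
// A real application would launch a kernel and wait for the copy to finish.
void process_chunk(std::vector<float>& host_buffer, int chunk)
{
    const std::size_t slot = static_cast<std::size_t>(chunk) % host_buffer.size();
    host_buffer[slot] += 1.0f;
}

// Poll the exit flag only between chunks, so the host buffer is never freed
// while a transfer is outstanding. Returns the number of chunks completed.
int run_until_exit(const std::atomic<bool>& exit_requested, int total_chunks)
{
    assert(total_chunks >= 0);
    std::vector<float> host_buffer(16, 0.0f);
    int done = 0;
    while (done < total_chunks) {
        if (exit_requested.load()) {
            break;  // safe point: nothing is in flight here
        }
        process_chunk(host_buffer, done);
        ++done;
    }
    return done;  // host_buffer is released only after the loop has stopped
}
```

As noted above, this only helps when the application gets the chance to see the flag. It does nothing if the process is killed from outside, for example via Task Manager.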
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Being user mode did not save me from a whole-system reboot with NTFS file damage, unfortunately. |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"Being user mode did not save me from a whole-system reboot with NTFS file damage, unfortunately." Yeah, that sounds pretty extreme. Makes sense, but not nice. |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"Now, having just looked at that code block again with my non-C++ trained eyes, I'm wondering whether I've still misinterpreted what will happen if that 5-second grace period expires. Does the Sharing Violation actually end up causing that Error Code 32 to be reported, or does the normal code path simply resume, presumably still producing a truncated stderr.txt? I dunno, some expert is gonna hafta 'splain that to me! ;^)" Followed up on this with clear eyeballs as promised. I have no issues with the method or logic. It'll just keep trying to open the file for up to 5 seconds (rather than simply waiting a fixed delay). The comment doesn't really say why the magic number is 5 seconds (not 3, 9 or 42), and it only tries for access once per second, but it should be way better, as your results would suggest. |
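To make the "keep trying for up to 5 seconds" idea concrete, here is a rough sketch of a bounded retry loop. It is not BOINC's actual code: `read_with_retries` and `read_whole_file` are hypothetical names, and the attempt count and pause are parameters rather than the client's real constants. If every attempt fails, the function just reports failure and leaves it to the caller to decide what happens next.

```cpp
#include <cassert>
#include <chrono>
#include <fstream>
#include <functional>
#include <sstream>
#include <string>
#include <thread>

// Hypothetical helper, NOT BOINC's actual code: call `try_read` up to
// `max_attempts` times, pausing between failed attempts.
// Returns true and fills `out` as soon as one attempt succeeds.
bool read_with_retries(const std::function<bool(std::string&)>& try_read,
                       int max_attempts,
                       std::chrono::milliseconds pause,
                       std::string& out)
{
    assert(max_attempts > 0);
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        if (try_read(out)) {
            return true;  // success, stop retrying
        }
        if (attempt < max_attempts) {
            std::this_thread::sleep_for(pause);  // wait before the next try
        }
    }
    return false;  // grace period expired; the caller decides what to do
}

// One possible `try_read`: a plain stream open. If another process holds the
// file in a way that blocks reading, the open would typically just fail.
bool read_whole_file(const std::string& path, std::string& out)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in) {
        return false;
    }
    std::ostringstream buf;
    buf << in.rdbuf();
    out = buf.str();
    return true;
}
```

A call along the lines of `read_with_retries([](std::string& s) { return read_whole_file("stderr.txt", s); }, 5, std::chrono::seconds(1), text)` would give five attempts, one second apart, which is roughly the behaviour described in the post above.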
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Thanks for the follow-up, Jason. So, if even the 5-second grace period isn't sufficient, BOINC might still produce the truncated stderr.txt but won't throw that new Sharing Violation back at the user as an "Error while computing"? Then that should make the truncations and, by extension, the "instant" Invalids much, much rarer, even if not entirely extinct. A major improvement! |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Yep. I mentioned the mindset shift before, and Rom indicated he's well experienced with it. Breaking it down, which always takes some effort with lofty abstract concepts: Old Way (Imperative Control): Do This ... and it does. New Way (Asynchronous): request --> wait --> OK, done. Trying to mash the two together with a fork: demand --> no, wait --> crash. Pretty tough, lol. [Edit:] So you could argue that the commit mode + patient read is now near optimal, once applications and clients adopt it. |
Rom Walton (BOINC) Joined: 28 Apr 00 Posts: 579 Credit: 130,733 RAC: 0 |
99% of the time it'll probably get the contents of stderr.txt on the first retry when the error condition is triggered. The extra four retries are to deal with the possibility that we have to wait for various forms of anti-malware and content indexers to have had their way with the file as well. Special-purpose software such as virus scanners and anti-malware usually has kernel mode components that make sure it gets first dibs on files after they have been released by the apps that created them. Normal Windows software hooks into the file system notification API and lets Windows tell it when it is its turn to fiddle with the file. BOINC, being a cross-platform async polling app, instead relies on checking to see if it can successfully access the file. ----- Rom BOINC Development Team, U.C. Berkeley |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
"99% of the time it'll probably get the contents of stderr.txt on the first retry when the error condition is triggered. The extra four retries are to deal with the possibility that we have to wait for various forms of anti-malware and content indexers to have had their way with the file as well." For those couple of days that I was running with Process Monitor logging turned on, it never required more than one retry to be successful. However, the AV on those boxes (Security Essentials on the xw9400 and Windows Defender on the T7400) didn't seem to be taking any interest in those files. It was certainly informative, however, to see that Windows Search indexing was accounting for a significant amount of activity on the T7400 (Message 1701586). Seeing no purpose in indexing such transient files, I was pretty quick to turn the indexing off for the whole slots tree. So, another efficiency improvement resulting from the testing. :^) |
Rom Walton (BOINC) Joined: 28 Apr 00 Posts: 579 Credit: 130,733 RAC: 0 |
Security Essentials is MsMpEng.exe. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Ah, okay, thank you. I hadn't caught on to that. Well, at least it seemed to get in and out very quickly. |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13866 Credit: 208,696,464 RAC: 304 |
"However, the AV on those boxes (Security Essentials on the xw9400 and Windows Defender on the T7400) didn't seem to be taking any interest in those files." MSE is very resource light. I know people that run Norton & other AV programmes & their system responsiveness is very poor. The programmes themselves don't use much CPU resources, but they do bog things down when downloading emails, opening & saving files etc. I suspect the defaults are to scan everything always. Maybe if they were to use more CPU resources they could do their work faster with less impact, but as they are they really do slow down systems, even high end ones. Grant Darwin NT |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13866 Credit: 208,696,464 RAC: 304 |
I'll be glad when a fix is generally available. Haven't had an issue for a month or 2, and then today alone I had 2 truncated Stderr outputs turn up: `<core_client_version>7.0.64</core_client_version> <![CDATA[ <stderr_txt> </stderr_txt> ]]>` 4281155998 4279260695 |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
I'm not sure I understand your concerns. The 'fix' has been available for over a week. It's in BOINC 7.6.6. Seti@Home classic workunits: 20,676 CPU time: 74,226 hours A proud member of the OFA (Old Farts Association) |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13866 Credit: 208,696,464 RAC: 304 |
"I'm not sure I understand your concerns. The 'fix' has been available for over a week." "Development version (MAY BE UNSTABLE - USE ONLY FOR TESTING)" What are the known issues? |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Absolutely nothing that I or the other testers have come across. I believe it will go to main next week anyway. You definitely should be running it if you do MilkyWay. In fact there is a sticky note at MilkyWay recommending this client. |
Grant (SSSF) Joined: 19 Aug 99 Posts: 13866 Credit: 208,696,464 RAC: 304 |
"Absolutely nothing that I or the other testers have come across. I believe it will go to main next week anyway. You definitely should be running it if you do MilkyWay. In fact there is a sticky note at MilkyWay recommending this client." I'm Seti only. Even if it doesn't become the recommended version next week, I'll see how adventurous I'm feeling next weekend. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
I don't know about any known issues, but according to the Test results summaries chart it appears that they're 71% complete. The chart seems to show that a lack of XP and Mac testers is the primary holdup. |
Jord Joined: 9 Jun 99 Posts: 15184 Credit: 4,362,181 RAC: 3 |
And someone writing the documentation, manual and wiki. I haven't had much time these past weeks; I hope to do better this weekend. :) |