Message boards : Number crunching : Stderr Truncations
Jeff Buck · Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0

... Notice that I said "modification" rather than "fix", because now that I believe I've actually caught on to what that code change is doing (a light bulb that flashed on about 2 A.M., of course), I'm wondering if this might just be exchanging one type of rare task failure ("instant" Invalid) for another (Error while computing - Error code 32), at least from a S@h perspective. (For MW, it probably really is a fix.)

Heh, my first reaction, when I thought I understood why the truncations really were fixed, was quite satisfying; the second reaction, after crawling back into bed, was to wonder whether the cure might simply introduce a different disease. Not exactly a sleep-inducing thought.

Now, having just looked at that code block again with my non-C++-trained eyes, I'm wondering whether I've still misinterpreted what will happen if that 5-second grace period expires. Does the Sharing Violation actually end up causing that Error Code 32 to be reported, or does the normal code path simply resume, presumably still producing a truncated stderr.txt? I dunno, some expert is gonna hafta 'splain that to me! ;^)
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0

> ... Notice that I said "modification" rather than "fix", because now that I believe I've actually caught on to what that code change is doing (a light bulb that flashed on about 2 A.M., of course), I'm wondering if this might just be exchanging one type of rare task failure ("instant" Invalid) for another (Error while computing - Error code 32), at least from a S@h perspective. (For MW, it probably really is a fix.)

BOINC committee next meeting, urgent matters: "Oh no! The Windows people have discovered Process Monitor!"

A lot becomes clearer if you go look at the scheduler authentication code. It makes this part look solid as... well, something really really solid.

[Edit:] Sometimes not being trained in something can be an advantage too. It's easier to point out that it doesn't work, lol.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14680 · Credit: 200,643,578 · RAC: 874

> Now, having just looked at that code block again with my non-C++-trained eyes, I'm wondering whether I've still misinterpreted what will happen if that 5-second grace period expires. Does the Sharing Violation actually end up causing that Error Code 32 to be reported, or does the normal code path simply resume, presumably still producing a truncated stderr.txt? I dunno, some expert is gonna hafta 'splain that to me! ;^)

To my (also untrained) eye, it looks like it simply falls out of the bottom, five seconds later. So it does what it was going to do anyway, but slower.
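On that reading, the change amounts to a retry-with-grace-period around the file access: keep retrying while the failure is a sharing violation (Win32 error 32), and after roughly five seconds stop retrying and carry on down the normal path. A minimal sketch of that pattern, assuming this interpretation is correct; the function below is hypothetical and is not the actual BOINC commit:

```cpp
// Hypothetical sketch only -- not the BOINC source. Retry an open that
// fails with ERROR_SHARING_VIOLATION (Win32 error 32) for up to five
// seconds, then give up quietly and let the caller proceed as it would
// have anyway, just later.
#include <windows.h>

HANDLE open_with_grace_period(const wchar_t* path) {
    const DWORD start = GetTickCount();
    for (;;) {
        HANDLE h = CreateFileW(path, GENERIC_READ,
                               FILE_SHARE_READ | FILE_SHARE_WRITE, nullptr,
                               OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
        if (h != INVALID_HANDLE_VALUE) return h;              // opened cleanly
        if (GetLastError() != ERROR_SHARING_VIOLATION) break; // unrelated failure
        if (GetTickCount() - start >= 5000) break;            // grace period expired
        Sleep(100);                                           // brief back-off, retry
    }
    // Falling out of the bottom: no error code 32 is reported from here;
    // whatever the caller was going to do with the locked or missing file
    // still happens, only five seconds later.
    return INVALID_HANDLE_VALUE;
}
```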
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0

> Now, having just looked at that code block again with my non-C++-trained eyes, I'm wondering whether I've still misinterpreted what will happen if that 5-second grace period expires. Does the Sharing Violation actually end up causing that Error Code 32 to be reported, or does the normal code path simply resume, presumably still producing a truncated stderr.txt? I dunno, some expert is gonna hafta 'splain that to me! ;^)

Will grab a look with as fresh eyeballs as possible, once things settle down. Even a simple 5-second delay that does nothing else can be an eternity to let things stabilise. The programming-to-the-lowest-common-denominator approach described to me by Rom doesn't really add a lot of confidence. But at the same time I'm just glad it's being looked at, which is a great start IMO.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
Rom Walton (BOINC) · Joined: 28 Apr 00 · Posts: 579 · Credit: 130,733 · RAC: 0

> ... Notice that I said "modification" rather than "fix", because now that I believe I've actually caught on to what that code change is doing (a light bulb that flashed on about 2 A.M., of course), I'm wondering if this might just be exchanging one type of rare task failure ("instant" Invalid) for another (Error while computing - Error code 32), at least from a S@h perspective. (For MW, it probably really is a fix.)

Funny, we have been through 10 or so code reviews and security audits in the last ten years by various companies: IBM (in-house, we have to go through a full audit every time they want a new branded client), Intel (through a third party), a bank or two, an oil company, and at least one hospital.

Your gripes appear to be more about aesthetics than how solid/stable something is. Don't confuse the two.

----- Rom · BOINC Development Team, U.C. Berkeley · My Blog
Jeff Buck · Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0

> Now, having just looked at that code block again with my non-C++-trained eyes, I'm wondering whether I've still misinterpreted what will happen if that 5-second grace period expires. Does the Sharing Violation actually end up causing that Error Code 32 to be reported, or does the normal code path simply resume, presumably still producing a truncated stderr.txt? I dunno, some expert is gonna hafta 'splain that to me! ;^)

If that's the case, then it will definitely be a significant improvement for S@h, too, even if it is more of a patch than a true fix. And to use Jason's raft analogy ("Fingers crossed this raft never gets used on the open ocean."), you may not want to launch a raft with this sort of patch, but if you're already in the middle of the ocean, it sure will be helpful to patch the pinholes any way you can, until you can get the raft back to shore and build a whole new one!
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0

> Funny, we have been through 10 or so code reviews and security audits in the last ten years by various companies: IBM (in-house, we have to go through a full audit every time they want a new branded client), Intel (through a third party), a bank or two, an oil company, and at least one hospital.

haha, true :) At that point it would start to become a pointless contest about who's had more security experience than whom, whether aesthetics are really important at all, what the costs are, and what level of exposure is acceptable; none of which are my decisions. The only thing I know for certain is that when my employer runs security audits, they come with some pretty stringent aesthetically based rules.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14680 · Credit: 200,643,578 · RAC: 874

> BOINC committee next meeting, urgent matters: "Oh no! The Windows people have discovered Process Monitor!"

I hope I'm not doing that. I don't know what the brief for a 'code review' would be, but a 'security audit' is presumably focused on not causing damage - not interfering with the working of the host machine, not leaking data to third parties, that sort of thing. All vitally important to the reputation of scientific researchers who submit papers on the basis of research data passed through the BOINC infrastructure. And, from that, the reputation of BOINC itself.

But as we've seen in recent days, those checks haven't prevented deficiencies in functionality. We've seen wastage (tasks abandoned for unknown reasons, while hosts continue to burn electricity processing them), and at Milkyway I've seen science just thrown away, when these 'validate errors' accumulate to take a workunit beyond the maximum error count.

So maybe there's a case for a different kind of audit, for conformity to design schema?
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0

> So maybe there's a case for a different kind of audit, for conformity to design schema?

Don't know about the companies Rom listed, but here a security audit typically involves assessing exposure to risks and minimising the visible footprint. There are quite valid points Rom appears to be making about what the code looks like being less important from an external-threat perspective; however, the aesthetics and engineering principles come more into play when you consider risks/vulnerabilities of a more internal nature, which can include having to change the code, and the risks/costs of those changes having unintended consequences. Obviously the need for change hasn't come up a lot in certain areas, so I imagine those audits would regard the maintainability of that particular code as low priority. Different companies/projects, different needs and risks.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
Juha · Joined: 7 Mar 04 · Posts: 388 · Credit: 1,857,738 · RAC: 0

> I mean, the kernel ought to know when it has closed all files and flushed buffers, right?

Guess I need to explain. What I expected is that when the kernel gets a request to terminate a process, whether the request comes through ExitProcess or TerminateProcess or some other way, the kernel will flush buffers, close files and other handles, and release any memory the process is using, and whatever else is part of the cleanup. And when the only thing that's left of the process is the process object, then the kernel would update the exit code.

Now, since you like so much to mock everything BOINC devs do, why don't you tell us how you would do all this?
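That sequencing is what the usual wait-then-query pattern on Windows relies on: the process handle is only signalled once the kernel has finished tearing the process down, and only then does GetExitCodeProcess() stop returning STILL_ACTIVE. A minimal sketch of that observation path, with a hypothetical helper rather than the BOINC client's actual code:

```cpp
// Illustrative sketch, not BOINC client code: how a parent typically
// learns that a child process is really gone and what it exited with.
#include <windows.h>
#include <cstdio>

void wait_and_report(HANDLE child) {
    // Returns once the process object is signalled, which happens only
    // after the kernel has finished tearing the process down.
    WaitForSingleObject(child, INFINITE);

    DWORD code = 0;
    if (GetExitCodeProcess(child, &code) && code != STILL_ACTIVE) {
        std::printf("child exited with code %lu\n", code);
    }
    CloseHandle(child);
}
```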
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0

> Now, since you like so much to mock everything BOINC devs do, why don't you tell us how you would do all this?

Actually those suggestions, to actually commit the files involved, were apparently taken on board, and a patch was applied in another way, and as I said I'm happy it's being looked at. That I would provide callback points for more flexibility with changing technology, some plugin-ness, has been mentioned.

That my descriptions come across as "mocking everything they do" is unfortunate, but the result of a long road of frustration. It appears that annoying and rocking a few boats gets people talking, looking, debating, analysing, fixing etc. Not something that I used to be equipped for, and it doesn't necessarily sit right with me either, but perhaps not liking or agreeing with anything I say was a part of it all. After all, AFAIK in Berkeley's history there's quite a bit of that sort of thing.

[Edit:] As for the important technical points you make and the questions you raise, the core issues relate to the fact that multithreaded C runtimes became standard circa 2005 in the case of Windows, so it requires a mindset shift from sequential/procedural to parallel and out-of-order operation. That's proven over time to be a lot tougher than most I know expected, and for me too. Non-deterministic behaviour is a pretty big red flag for this kind of thing too.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
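To make the non-determinism point concrete, here is a toy illustration (hypothetical code, not from BOINC or any science app): with a multithreaded C runtime both writes below are individually safe, but the order in which they land in the file differs from run to run, which is exactly the sequential/procedural assumption that stops holding.

```cpp
// Toy example of runtime non-determinism: each fprintf() is internally
// locked by the multithreaded CRT, so nothing is corrupted, but the
// order of the two lines in the output file is not predictable.
#include <cstdio>
#include <thread>

int main() {
    FILE* f = std::fopen("stderr.txt", "a");
    std::thread a([f] { std::fprintf(f, "message from thread A\n"); });
    std::thread b([f] { std::fprintf(f, "message from thread B\n"); });
    a.join();
    b.join();
    std::fclose(f);
    return 0;
}
```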
Juha · Joined: 7 Mar 04 · Posts: 388 · Credit: 1,857,738 · RAC: 0

> Now, since you like so much to mock everything BOINC devs do, why don't you tell us how you would do all this?

I don't see the commit mode change actually fixing the problem. The writes have been delayed somewhere by something for some reason. Now all that has changed is that the app is forced to wait until the writes hit the disk. The fact that the copy that's sitting in the filesystem cache is up to date when the client goes to read it is just a fortunate side-effect of the change. And I don't think the commit mode change makes a difference for wrapped apps, for that matter.

So how does the client, or the app, tell when it's safe to read the files without simply trying and trying again? I can't see callbacks or anything like that helping. I'm just trying to get from "OK, this works, sort of" to how it's really done right (if it can be).

> That my descriptions come across as "mocking everything they do" is unfortunate, but the result of a long road of frustration.

My issue with your criticism of BOINC is that for the past few weeks it's been excessive and exaggerated and quite a few other adjectives. It's easy to read your criticism as if you are saying that BOINC is the worst piece of software ever written and the devs are the worst devs to ever live. BOINC just isn't that bad.
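For reference, a rough sketch of what the commit mode being discussed does on the Microsoft CRT (an illustrative snippet, not the actual patch): with the "c" flag in the fopen() mode string, fflush() pushes the data all the way to disk rather than just into the filesystem cache, which is the "forced to wait until the writes hit the disk" behaviour described above.

```cpp
// Illustrative only -- not the actual change to the app or the client.
// Microsoft CRT "commit" mode: with the "c" flag, fflush() commits the
// buffered data to disk instead of leaving it in the filesystem cache.
#include <cstdio>
#include <io.h>      // _commit() (Microsoft CRT)

int main() {
    FILE* f = std::fopen("stderr.txt", "ac");    // "c" = commit flag (MSVC extension)
    std::fprintf(f, "goes all the way to disk at the next fflush\n");
    std::fflush(f);                              // with "c", this also commits to disk

    // Explicit alternative for a stream opened without "c": flush the
    // CRT buffer first (done above), then force the OS to write through.
    _commit(_fileno(f));                         // wraps FlushFileBuffers()

    std::fclose(f);
    return 0;
}
```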
Rom Walton (BOINC) · Joined: 28 Apr 00 · Posts: 579 · Credit: 130,733 · RAC: 0

> [Edit:] As for the important technical points you make and the questions you raise, the core issues relate to the fact that multithreaded C runtimes became standard circa 2005 in the case of Windows, so it requires a mindset shift from sequential/procedural to parallel and out-of-order operation. That's proven over time to be a lot tougher than most I know expected, and for me too. Non-deterministic behaviour is a pretty big red flag for this kind of thing too.

Back even further than that. 1992 (Windows NT 3.1 October Beta) is when I had to hunker down and learn the basics of processes, threads, and thread sync mechanisms. Prior experience to that was just Windows 3.1 (16-bit cooperative multitasking). IIRC, the Microsoft CRT hadn't even been developed yet. It would be a year or two later, when vendors didn't jump on the NT bandwagon fast enough, complaining about difficulties in porting their software to Windows NT.

Anyways, the difficulties are the primary reason why BOINC is not already multi-threaded. At this point it would be more trouble than it is worth. BOINC itself doesn't use much CPU time and, for the most part, isn't time-sensitive in that it doesn't require millisecond response times. So going multi-threaded just adds complexity and debugging headaches. More so for platforms other than Windows.

----- Rom · BOINC Development Team, U.C. Berkeley · My Blog
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0

> Now, since you like so much to mock everything BOINC devs do, why don't you tell us how you would do all this?

No, commit mode is indeed only a workaround. How the BOINC devs decide to do it better is completely up to them.

Thanks for the criticism, and I'll certainly take it on board. Whether I choose to conform or not from here on will certainly be influenced by the sensible discussions I've now had with Rom, which have been far more constructive than the past 7 years of wishful thinking, and of watching without knowing how to get the (very real) problems recognised. It took the Milkyway validation failure to do that. Was it over the top? Should I change? Probably. At the same time I am now much better equipped to say that the BOINC devs are not the worst I've met, and BOINC is slightly better in a small way.

I don't need them to like me, nor do things my way, nor even read anything I post/submit, but if putting some noses out of joint did any more to stir any thought at all this time around, then I can live with that. You're as entitled to dislike my behaviour as I am to be sick to the stomach of saying the things I felt I needed to. That you don't agree with them, or with whether they needed saying, is good too, along with pointing out we aren't the same.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
jason_gee · Joined: 24 Nov 06 · Posts: 7489 · Credit: 91,093,184 · RAC: 0

> [Edit:] As for the important technical points you make and the questions you raise, the core issues relate to the fact that multithreaded C runtimes became standard circa 2005 in the case of Windows, so it requires a mindset shift from sequential/procedural to parallel and out-of-order operation. That's proven over time to be a lot tougher than most I know expected, and for me too. Non-deterministic behaviour is a pretty big red flag for this kind of thing too.

Yes, tough road. There'll be a few more hurdles with that legacy, but I am very grateful for the discussion and explanations. Thanks for taking the technical approach, and I'm sorry I felt I've had to kick up a royal stink of late. Despite the criticisms, it isn't something that came naturally, and I hope I find a better way, even though I'm not convinced returning to a totally conformist attitude is going to be the answer either.

Thanks again,
Jason

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions
Rom Walton (BOINC) · Joined: 28 Apr 00 · Posts: 579 · Credit: 130,733 · RAC: 0

> I don't see the commit mode change actually fixing the problem. The writes have been delayed somewhere by something for some reason.

I have a hypothesis on this, but I don't have a way to prove or disprove it yet.

Suppose that when an app calls cuInit() to initialize the CUDA/OpenCL library, it passes the current stderr/stdout handles to the CUDA kernel code so that fatal compiler errors can be trapped/written to a file for the calling app. During this process they duplicate and internalize the handle, thereby increasing its ref count. Normally the CUDA library assumes it can clean things up during the DllMain unload event, but because boinc_exit() calls TerminateProcess() the event is never fired. The kernel decrements the ref count of the handle after TerminateProcess() is called and the process is cleaned up, but doesn't close it down because its ref count is still greater than 1. It isn't until the CUDA kernel driver has attempted to do something that it discovers that a handle it holds is no longer valid and cleans things up on its end, thereby releasing the write lock on stderr.txt.

The CUDA library doesn't really provide a clean-up routine you are supposed to call after you are done, so there isn't a way to test this. We would need to talk to somebody at Nvidia to find out what underlying assumptions the CUDA library is making with regard to cleaning up on application shutdown to know what is really going on.

----- Rom · BOINC Development Team, U.C. Berkeley · My Blog
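The ref-counting part of that hypothesis is easy to picture in isolation. A contrived, same-process sketch (nothing here is CUDA, driver or BOINC code) showing that a file object only releases its sharing lock when the last handle referring to it is closed:

```cpp
// Hypothetical illustration: a duplicated handle keeps the underlying
// file object -- and its sharing restrictions -- alive even after the
// original handle has been closed.
#include <windows.h>

int main() {
    // Open with no sharing allowed, the way a stderr.txt writer might.
    HANDLE original = CreateFileW(L"stderr.txt", GENERIC_WRITE,
                                  0 /* no sharing */, nullptr,
                                  CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);

    // A library could stash its own duplicate of the handle internally,
    // bumping the file object's reference count.
    HANDLE duplicate = nullptr;
    DuplicateHandle(GetCurrentProcess(), original,
                    GetCurrentProcess(), &duplicate,
                    0, FALSE, DUPLICATE_SAME_ACCESS);

    CloseHandle(original);
    // At this point the file is still locked: reopening it for writing
    // fails with ERROR_SHARING_VIOLATION (error 32) until the duplicate
    // is closed as well.
    CloseHandle(duplicate);
    return 0;
}
```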
Jeff Buck · Joined: 11 Feb 00 · Posts: 1441 · Credit: 148,764,870 · RAC: 0

If I could just interject one thing (without understanding much other than CUDA in that post), it would be that truncated Stderr is not unique to the NVIDIA GPUs. It happens on ATI cards and CPUs as well.

EDIT: See ancient Message 1469381 in "Strange Invalid MB Overflow tasks with truncated Stderr outputs..." for an example and discussion. Also in other messages in that thread.
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14680 · Credit: 200,643,578 · RAC: 874

And also that in the Milkyway case, it's the OpenCL component of the NVidia driver/runtime suite which is active.
Rom Walton (BOINC) · Joined: 28 Apr 00 · Posts: 579 · Credit: 130,733 · RAC: 0

> If I could just interject one thing (without understanding much other than CUDA in that post), it would be that truncated Stderr is not unique to the NVIDIA GPUs. It happens on ATI cards and CPUs as well.

Okay, that blows that theory out of the water.

----- Rom · BOINC Development Team, U.C. Berkeley · My Blog
Rom Walton (BOINC) · Joined: 28 Apr 00 · Posts: 579 · Credit: 130,733 · RAC: 0

> And also that in the Milkyway case, it's the OpenCL component of the NVidia driver/runtime suite which is active.

True, but I suspect that the OpenCL compiler just converts OpenCL code into CUDA instructions.

----- Rom · BOINC Development Team, U.C. Berkeley · My Blog