Stderr Truncations

Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1701975 - Posted: 16 Jul 2015, 0:26:05 UTC - in response to Message 1701958.  

One of the constraints on a BOINC app is that no GUI-based manifestation of a crash should exist.
TerminateProcess does not allow an exception to be re-thrown (so the OS can't intercept it and show anything). exit() can re-throw an exception.
So it may be a matter of to what extent the parent process (the BOINC client) can intercept exceptions and not yield them to OS->GUI.
ID: 1701975
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1701976 - Posted: 16 Jul 2015, 0:37:01 UTC - in response to Message 1701975.  

Lol, well I guess they have their reasons. I'll stick to synthetic-aesthetic methods, and save the organic fertiliser for the garden.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1701976
Profile Rom Walton (BOINC)
Volunteer tester
Joined: 28 Apr 00
Posts: 579
Credit: 130,733
RAC: 0
United States
Message 1702016 - Posted: 16 Jul 2015, 3:16:59 UTC - in response to Message 1701958.  

Only problem there, is they decided (for unknown reasons) to use an asynchronous TerminateProcess() call and put a 1 second sleep and hard crash, instead of waiting on a synchronisation primitive.


Normal methods for exiting a process involve a risk of the process not shutting down at all.

Normally ExitProcess() (which exit() and _exit() end up calling) only kills the calling thread (or at least it did the last time I tested this, in 2006/2007).

Even the ExitProcess docs state:

If one of the terminated threads in the process holds a lock and the DLL detach code in one of the loaded DLLs attempts to acquire the same lock, then calling ExitProcess results in a deadlock. In contrast, if a process terminates by calling TerminateProcess, the DLLs that the process is attached to are not notified of the process termination. Therefore, if you do not know the state of all threads in your process, it is better to call TerminateProcess than ExitProcess. Note that returning from the main function of an application results in a call to ExitProcess.


Since an application can start and use any number of threads, boinc_exit() cannot assume to know the state of any thread outside the thread that called boinc_exit().

Hence, TerminateProcess().

The one second sleep is an attempt to give the Windows thread scheduler a chance to act on all the newly terminated threads. The majority case scenario is that all threads are halted and cleaned up during TerminateProcess(). Using WaitForSingleObject() on the process handle is problematic in that its state is unknown after a TerminateProcess call. Sleep probably does a WaitForSingleObject() against the thread handle though. In any case the overall objective is just to force the thread scheduler to clean things up.

DebugBreak() is for when all else fails: it tries to capture whatever is left in a debugger that is already attached to the running process.

No idea why the entire codebase seems to be allergic to synchronisation, and likes to use magic numbers (fixed time intervals) on a non-realtime OS.


While the threading model(s) on Windows are rather rich, pthread is not. So we code to the lowest common denominator. BOINC at its core is a single threaded application.
----- Rom
BOINC Development Team, U.C. Berkeley
My Blog
ID: 1702016
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1702020 - Posted: 16 Jul 2015, 3:27:48 UTC - in response to Message 1702016.  
Last modified: 16 Jul 2015, 3:46:29 UTC

All reasonable, and thanks for the explanation Rom.

May I suggest that these kind of unusual practices be documented, so that the havoc they create for GPUs and other out-of-process dependencies can be more readily avoided using the mechanisms available?

[Edit:] Now that I know the reasoning, and that not much of it applies for specific applications at all, optional exit handling strategies & thread management become a lot easier to work in.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1702020
Profile Rom Walton (BOINC)
Volunteer tester
Joined: 28 Apr 00
Posts: 579
Credit: 130,733
RAC: 0
United States
Message 1702054 - Posted: 16 Jul 2015, 5:41:28 UTC - in response to Message 1702020.  
Last modified: 16 Jul 2015, 5:42:01 UTC

[Edit:] Now that I know the reasoning, and that not much of it applies for specific applications at all, optional exit handling strategies & thread management become a lot easier to work in.


Honestly, the majority of scientific applications are single threaded applications anyway. Most scientists are not computer scientists, so multi-threaded applications are a rarity.

Things used to be pretty simple and straightforward. Windows Vista changed all that with the 5-second shutdown requirement. Now we have to be able to have BOINC and all of its child processes shut down within 5 seconds, when the bulk of the child applications are themselves single threaded. It causes quite a few contortions.

The biggest stumbling block I found related to threading was the fact that the pthread style threading libraries didn't have an equivalent of EnumThreads(). Yet the threading library used by Linux, Mac OS X, BSD, and Android is pthread.

You can make the Windows threading API conform to pthread much more easily than the reverse.
----- Rom
BOINC Development Team, U.C. Berkeley
My Blog
ID: 1702054
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1702067 - Posted: 16 Jul 2015, 6:42:18 UTC - in response to Message 1702054.  
Last modified: 16 Jul 2015, 6:44:16 UTC

The recommendations I'd probably be looking at making, after a couple of unrelated development sprints, relate more to adding a small, user-registerable API callback for the exit event (with suitable conditions on maximum execution time).

As the critical section approach to guarding the GPU code from termination has some response time problems with respect to:
- asynchronous streams/queues in minimal CPU usage scenarios (well optimised GPU applications tend to be in critical sections a majority of the time),
- large launches (until adaptive scaling is more commonly used), and
- freeing host memory underneath GPU driver operations,

informing the application in some low-cost fashion that it needs to shut down cleanly & promptly, as opposed to launching more complex operations, seems like it would help the situation a lot. Of course I appreciate that the above situations, at least, didn't commonly exist ~2006-2007, and the flexibility wasn't necessary.
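
To make the idea concrete, here is a rough sketch of what such a registerable exit callback could look like. None of this is BOINC API: ExitNotifier, register_callback and notify_exit are hypothetical names, and the time limit is only cooperative (each callback is handed a deadline and trusted to return before it; nothing here can interrupt a callback that overruns).

```cpp
#include <chrono>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical helper for illustration; not part of the BOINC API.
class ExitNotifier {
public:
    using Clock = std::chrono::steady_clock;
    // Each callback receives the deadline by which it is expected to return.
    using Callback = std::function<void(Clock::time_point deadline)>;

    void register_callback(Callback cb) {
        callbacks_.push_back(std::move(cb));
    }

    // Invokes the callbacks in registration order. Remaining callbacks are
    // skipped once the budget is exhausted. Returns true only if every
    // callback ran and the last one returned before the deadline.
    bool notify_exit(std::chrono::milliseconds budget) const {
        const Clock::time_point deadline = Clock::now() + budget;
        for (const Callback& cb : callbacks_) {
            if (Clock::now() >= deadline) {
                return false;  // out of time: caller falls back to a hard exit
            }
            cb(deadline);
        }
        return Clock::now() < deadline;
    }

private:
    std::vector<Callback> callbacks_;
};
```

An application that doesn't need the notification simply never registers anything. A GPU app could register a callback that does nothing more than tell its worker to stop queuing new launches.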
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1702067
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1702099 - Posted: 16 Jul 2015, 8:59:44 UTC - in response to Message 1702067.  

AFAIK the scientific app is currently informed about a pending exit in perhaps the lowest-overhead way: a flag is set. Perhaps, in a real multithreaded app, any thread could read it and act appropriately. In modern GPU apps the worker thread reads it and performs the corresponding cleanup actions. A callback function would act outside of the worker thread's context, so as a result it would have to inform the worker thread to stop queuing work (or just terminate the worker thread, with similar nasty consequences as the whole-process kill currently has).
It would be interesting to see how a coherent exit procedure could be done w/o polling inside the worker thread at all (reading a flag is a form of polling too).
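
For reference, the flag-polling pattern described above looks roughly like this. It is a minimal sketch with hypothetical names (run_worker, quit_requested), not the actual BOINC API or any existing app's code: the worker checks an atomic flag between units of work and stops queuing new work once it is set.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>

// Hypothetical names for illustration; not the BOINC API.
// Runs up to `total_chunks` units of work, polling `quit_requested`
// before each unit. Returns the number of units actually completed.
std::size_t run_worker(const std::atomic<bool>& quit_requested,
                       std::size_t total_chunks,
                       const std::function<void(std::size_t)>& do_chunk) {
    std::size_t done = 0;
    while (done < total_chunks) {
        if (quit_requested.load(std::memory_order_acquire)) {
            break;  // stop queuing new work; the caller performs cleanup
        }
        do_chunk(done);
        ++done;
    }
    return done;
}
```

The response time is bounded by how long one unit of work takes, which is the weak spot with large launches: the flag is only seen when the worker gets around to looking at it.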
ID: 1702099
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1702102 - Posted: 16 Jul 2015, 9:12:04 UTC

Rom has also checked in a very simple fix to the API, to ensure that (Windows only) stderr.txt is always opened in 'commit' mode.

431ec9e48dc8a5b4ef2ad92faaa3f643f8fb0a5e
ID: 1702102
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1702103 - Posted: 16 Jul 2015, 9:14:34 UTC - in response to Message 1702099.  

Yes, if it can be done the beauty of callbacks (provided you obey practical restrictions), as with OpenGL, DirectX and many many libraries, is then you need never poll a flag or anything, and if you don't need them you don't use them.

So you have near instant response with no hardware interrupt mechanism or polling, and reduced code complexity (through not having to manage checking flags at the right times etc, which are volatile on multicore). That timing and resulting complexity over many generations is the biggest problem pushing CUDA applications to the next level here. Most modern APIs are of course non-blocking, so as not to waste cycles on spins/sleeps/locks (locks being the main allergy the BOINC codebase appears to have, as I mentioned before).

With respect to programming to the 'lowest common denominator', well, any CUDA or OpenCL equipped platform is inherently multithreaded (even if only at a minimum by separate hardware devices, and it's not only that), and adding low (development) cost flexibility jibes with the modern Agile practices the world has been rapidly moving to for some years (over rigid interfaces).
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1702103
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1702104 - Posted: 16 Jul 2015, 9:15:22 UTC - in response to Message 1702102.  

Rom has also checked in a very simple fix to the API, to ensure that (Windows only) stderr.txt is always opened in 'commit' mode.

431ec9e48dc8a5b4ef2ad92faaa3f643f8fb0a5e



Eyyy! :-D, one down. n! to go :D
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1702104
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1702105 - Posted: 16 Jul 2015, 9:29:58 UTC - in response to Message 1702102.  

Rom has also checked in a very simple fix to the API, to ensure that (Windows only) stderr.txt is always opened in 'commit' mode.

431ec9e48dc8a5b4ef2ad92faaa3f643f8fb0a5e

Well, AFAIK linking against commode.obj should result in the same behavior, no?
All OpenCL builds have been linked against that obj for more than a year already. So if any stderr truncations are detected for them, this change will not help either.
ID: 1702105
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1702106 - Posted: 16 Jul 2015, 9:32:37 UTC - in response to Message 1702103.  
Last modified: 16 Jul 2015, 9:36:14 UTC

Yes, if it can be done the beauty of callbacks (provided you obey practical restrictions), as with OpenGL, DirectX and many many libraries, is then you need never poll a flag or anything, and if you don't need them you don't use them.

Well, my post was exactly that "if". IMHO that "if" is quite an existential one ;)
[in fact, libs have no responsibility for process termination]
ID: 1702106
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1702107 - Posted: 16 Jul 2015, 9:36:36 UTC - in response to Message 1702105.  

Rom has also checked in a very simple fix to the API, to ensure that (Windows only) stderr.txt is always opened in 'commit' mode.

431ec9e48dc8a5b4ef2ad92faaa3f643f8fb0a5e

Well, AFAIK linking against commode.obj should result in the same behavior, no?
All OpenCL builds have been linked against that obj for more than a year already. So if any stderr truncations are detected for them, this change will not help either.


Yeah there are some extra levels because we are 'special windows people' (ignoring that our stuff builds and runs on Linux/Mac fine too), but at the same time I would see any progress toward more tasty multithreadingness as good, just because it lets me look at other things than obsessing over stupid truncated files.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1702107
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1702108 - Posted: 16 Jul 2015, 9:39:12 UTC - in response to Message 1702106.  

Yes, if it can be done the beauty of callbacks (provided you obey practical restrictions), as with OpenGL, DirectX and many many libraries, is then you need never poll a flag or anything, and if you don't need them you don't use them.

Well, my post was exactly that "if". IMHO that "if" is quite an existential one ;)
[in fact, libs have no responsibility for process termination]


Well naturally as an API developer, you would want to make your library as flexible, extensible, scalable, and portable as possible (leaving out some important ones, like simple to use). Weird quirks and no callbacks are less useful than fewer quirks, options to disable the quirks, and built-in plugin-ness.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1702108
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1702109 - Posted: 16 Jul 2015, 9:46:22 UTC - in response to Message 1702105.  

Rom has also checked in a very simple fix to the API, to ensure that (Windows only) stderr.txt is always opened in 'commit' mode.

431ec9e48dc8a5b4ef2ad92faaa3f643f8fb0a5e

Well, AFAIK linking against commode.obj should result in the same behavior, no?
All OpenCL builds have been linked against that obj for more than a year already. So if any stderr truncations are detected for them, this change will not help either.

That's what a lot of the discussion in this thread has been teasing out.

I don't think there's been any evidence displayed yet that the writing of stderr.txt (to local disk) has ever been truncated 'in the field', although MSDN does warn "The state of global data maintained by dynamic-link libraries (DLLs) may [my emphasis] be compromised".

Instead, the evidence that we saw (until we looked more deeply) was that incomplete stderrs were written to the project database, several steps later. I suspect that COMMODE (and the new API) mainly helps by reducing the time interval between calling boinc_finish and flushing stderr.txt, giving them a better chance of happening in the right order. The new v7.6.6 client also addresses that same part of the timing problem, and apparently successfully: so far, I've completed 1,640 Milkyway tasks since installing it, with zero new errors (previously I had an approximately 2% error rate).
ID: 1702109
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1702112 - Posted: 16 Jul 2015, 9:52:39 UTC - in response to Message 1702109.  
Last modified: 16 Jul 2015, 9:53:29 UTC

That's what a lot of the discussion in this thread has been teasing out.

I don't think there's been any evidence displayed yet that the writing of stderr.txt (to local disk) has ever been truncated 'in the field', although MSDN does warn "The state of global data maintained by dynamic-link libraries (DLLs) may [my emphasis] be compromised".

Instead, the evidence that we saw (until we looked more deeply) was that incomplete stderrs were written to the project database, several steps later. I suspect that COMMODE (and the new API) mainly helps by reducing the time interval between calling boinc_finish and flushing stderr.txt, giving them a better chance of happening in the right order. The new v7.6.6 client also addresses that same part of the timing problem, and apparently successfully: so far, I've completed 1,640 Milkyway tasks since installing it, with zero new errors (previously I had an approximately 2% error rate).


Yes. This is why I never laboured the point of commode.obj, and only consider it a workaround. First of all, with non-blocking multithreaded behaviour there should be no expectations of a time-sensitive nature whatsoever (which is hard to grasp), and secondly there are more layers that it doesn't control. For the second part, it appears from the supplied process monitor logs that the underlying Windows API calls are entirely synchronous, however no-one says that won't change in a couple of weeks with Win10 :D

build to last!
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1702112
Profile Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1702232 - Posted: 16 Jul 2015, 17:27:42 UTC

Another night with NO truncations. The PM logs show 6 SHARING VIOLATIONS successfully trapped on my xw9400 (3 before the stderr.txt was closed, 3 after) and 3 on my T7400 (1 before, 2 after). I also saw one that I don't think was trapped for an Astropulse task. I'll go back and try to dig up the details on that later on.

Anyway, now that I think I understand what the modification is doing, I don't expect to see any more truncated Stderr reports. Notice that I said "modification" rather than "fix", because now that I believe I've actually caught on to what that code change is doing (a light bulb that flashed on about 2 A.M., of course), I'm wondering if this might just be exchanging one type of rare task failure ("instant" Invalid) for another (Error while computing - Error code 32), at least from a S@h perspective. (For MW, it probably really is a fix.)

In my posts yesterday, I expressed some doubt about how the code change could be trapping Sharing Violations for truncations that hadn't actually exhibited any Sharing Violations (or Error Code 32) previously. Then in that 2 A.M. flash I realized that I had only partially understood what the code was doing. (My only previous experience with C++ came from playing around with Turbo C++ for DOS in about 1993, which was a very limited and frustrating experiment.) I had thought the mod was just trapping Sharing Violations that were already occurring, like Richard was documenting over at MW. Then it dawned on me (long before dawn) that it was actually creating Sharing Violations by trying to open the stderr.txt file in WRITE mode, instead of just for reading, as BOINC currently does.

All well and good for eliminating the truncations. But the truncations by themselves don't seem to cause any problems here at S@h (at least none that I remember reading about in the forums). It's only when those truncations occur on -9 overflows with less than 30 Spikes and an Autocorr count of 0 that the "instant" Invalids are created.

Richard had estimated an approximately 2% truncation/invalid rate at MW. I just checked my numbers on S@h for the first 2 weeks of July and find about a 0.8% rate. Now, although I haven't actually had any "instant" Invalids this month, the normal average is about 5, which I'd estimate represents only about 5-10% of my truncations. So, the overall Invalid rate is actually quite tiny, though obviously still noticeable for high-volume crunchers.

Now, what happens with the new modification? The truncations disappear, along with the "instant" Invalids. However, Sharing Violations have now been introduced where they didn't exist before. Certainly, the 5-second window included in the code to counteract that condition should almost always be far, far in excess of the time routinely needed.
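
As I understand it, the behaviour amounts to a bounded retry loop along these lines. This is only a sketch of my reading of it, not the client's actual code: open_with_retry and try_open are made-up names, and the defaults (up to 5 retries, 1 second apart) are assumptions taken from the 5-second window and 1-second delay seen in the PM logs.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper; the real client code is not shown in this thread.
// `try_open` should return true once the file could be opened for writing
// (i.e. no sharing violation). Returns true on success, false if the
// initial attempt and every retry failed.
bool open_with_retry(const std::function<bool()>& try_open,
                     int max_retries = 5,
                     std::chrono::milliseconds delay = std::chrono::seconds(1)) {
    for (int attempt = 0; ; ++attempt) {
        if (try_open()) {
            return true;
        }
        if (attempt == max_retries) {
            return false;  // window exhausted: the task would be marked as an error
        }
        std::this_thread::sleep_for(delay);
    }
}
```

The false branch is the case I'm worried about: a task that computed fine, but whose stderr.txt stayed locked for longer than the window.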

But much like a kid who throws a snowball at his 3rd-grade teacher every morning with the expectation that he'll always be able to duck around the corner before she turns to see who threw it, and then finds one day that the sidewalk is too icy to get the traction he needs for his escape, there will undoubtedly be times when even that 5-second window fails and BOINC will mark otherwise perfectly successful tasks with a scarlet "E".

With something like, what, 1.5M tasks being crunched every day, with probably a fairly high percentage of them on Windows boxes (competing with virus scans, daily backups, streaming video, etc., etc.), I'd think that even a tiny percentage of them failing that 5-second grace period for a newly introduced error condition won't take long to show up in a new thread or two on this board, replacing the complaints about "instant" Invalids for completed tasks, with those for unwarranted "Error while computing" black marks.

Anyway, I guess I'll go back to repeating that I believe the best true "fix" for the "instant" Invalid condition here on S@h would have been to make that 1-line code change to the validator that Joe Segur proposed a year and a half ago.
ID: 1702232
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1702235 - Posted: 16 Jul 2015, 17:34:57 UTC - in response to Message 1702105.  

Rom has also checked in a very simple fix to the API, to ensure that (Windows only) stderr.txt is always opened in 'commit' mode.

431ec9e48dc8a5b4ef2ad92faaa3f643f8fb0a5e

Well, AFAIK linking against commode.obj should result in the same behavior, no?
All OpenCL builds have been linked against that obj for more than a year already. So if any stderr truncations are detected for them, this change will not help either.

Yes, for Windows builds made with Microsoft compilers linking with commode.obj does the same thing when flushing stderr.txt. But most other toolchains do not have the commode.obj file, so the Lunatics rev 2549 AKv8c CPU builds I made with MinGW/GCC use exactly the method Rom has now added to the BOINC repo code.
                                                                   Joe
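
For anyone following along, here is a minimal sketch of opening a log file in 'commit' mode without commode.obj. It assumes the mechanism is the Microsoft C runtime's "c" flag in the fopen mode string, which makes fflush write the buffer through to disk; the actual diff is not shown in this thread, and append_log_line is a hypothetical helper, not BOINC code.

```cpp
#include <cstdio>
#include <string>

// "c" (commit flag) is a Microsoft C runtime extension to the fopen mode
// string. Other runtimes may reject or ignore it, so it is only used when
// targeting Windows.
#ifdef _WIN32
static const char* const kAppendMode = "ac";
#else
static const char* const kAppendMode = "a";
#endif

// Hypothetical helper: appends one line to a log file and flushes it.
// Returns false if the file could not be opened, written, or closed.
bool append_log_line(const std::string& path, const std::string& line) {
    std::FILE* f = std::fopen(path.c_str(), kAppendMode);
    if (!f) {
        return false;
    }
    bool ok = std::fputs(line.c_str(), f) >= 0 && std::fputc('\n', f) != EOF;
    ok = (std::fflush(f) == 0) && ok;  // with the commit flag, this also commits to disk
    ok = (std::fclose(f) == 0) && ok;
    return ok;
}
```

Linking with commode.obj changes the default for every stream in the program, while the mode flag only affects the one file it is used on.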
ID: 1702235
Profile Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1702264 - Posted: 16 Jul 2015, 18:39:06 UTC - in response to Message 1702232.  

I also saw one that I don't think was trapped for an Astropulse task. I'll go back and try to dig up the details on that later on.

Just to tie up this loose end, I went back and looked again at the PM log and found that the sharing violation for this AP task (a 100% blanked task that ran for 7 seconds) was, indeed, a result of the new code because the AP app hadn't yet actually closed stderr.txt. The first 1-second delay resulted in a successful read. This resulted in no truncation and the BOINC Event Log shows nothing odd.
ID: 1702264
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1702270 - Posted: 16 Jul 2015, 18:52:01 UTC - in response to Message 1702232.  
Last modified: 16 Jul 2015, 19:21:58 UTC

... Notice that I said "modification" rather than "fix", because now that I believe I've actually caught on to what that code change is doing (a light bulb that flashed on about 2 A.M., of course), I'm wondering if this might just be exchanging one type of rare task failure ("instant" Invalid) for another (Error while computing - Error code 32), at least from a S@h perspective. (For MW, it probably really is a fix.)


Lol, when you reach that point, it's actually a pretty unique feeling, isn't it? Kind of relief that something's done, mixed with disappointment at the particular choice of chewing gum, bits of string, and duct tape to plug the holes.

Fingers crossed this raft never gets used on the open ocean.

[E.T calls up via Arecibo just to ask "What the heck is that thing you're driving? Impressive! Don't build spaceships though!"]
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1702270
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.