Message boards :
Number crunching :
Stderr Truncations
Message board moderation
Previous · 1 . . . 3 · 4 · 5 · 6 · 7 · 8 · 9 . . . 11 · Next
Author | Message |
---|---|
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
One of the things that intrigues me is what happens right after 11:26:10.2482866, when there seems to be a break in writing of the stderr.txt. At that point, the app first writes the boinc_finish_called file. Only after that file is created does the app write init_data.xml, followed by deletion of the boinc_lockfile. Once all that is done, the writing of the stderr.txt resumes, at 11:26:10.2586581. However, the writing does seem to finish and the file gets closed before BOINC opens and reads it, with a length of 2,002 bytes. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
I'd have to check, but I think the end of line chars are translated in there somewhere along the way, from CRLF (Windows) to LF (*Linux)/ That would account for at least some number fewer characters (one per line), and some of the 2 byte writes. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
I'd have to check, but I think the end of line chars are translated in there somewhere along the way, from CRLF (Windows) to LF (*Linux)/ That would account for at least some number fewer characters (one per line), and some of the 2 byte writes. Yeah, I'm sure those are end of line markers, however they're actually written. I just counted 20 of them, which appear to be the typical number of additional lines that would appear in a complete Stderr following the truncation at "Thread call stack limit is: 1k". And a couple of them appear as back-to-back writes, which should correspond to the 2 blank lines in there. EDIT: Ah, and I see that 16 of them come before that apparent break in the writing of the stderr.txt, which would take it through the "called boinc_finish" and before the "Exit Status: 0". That makes sense. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
For the sakes of comparing the timing against this documentation at: https://msdn.microsoft.com/en-us/library/windows/desktop/aa365683(v=vs.85).aspx I wonder would the commode builds in comparison show the stderr Writes as synchronous (as I'd expect) ? That would just give some illustrations of the user level, driver and kernel underlying sequences involved. (Side point at this stage I realise) I did get a chuckle out of this part: If a file or device is opened for synchronous I/O (that is, FILE_FLAG_OVERLAPPED is not specified), subsequent calls to functions such as WriteFile can block execution of the calling thread until one of the following events occurs: "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
My eyeballs seemed to have recovered enough for me to take a look at the other truncation that happened on the xw9400 overnight, Task 4264184752. The only significant difference on this one is the Stderr truncation point, which occurs 3+ lines lower than "normal". I say "3+, because the 4th additional line is incomplete. Thread call stack limit is: 1k cudaAcc_free() called... cudaAcc_free() running... cudaAcc_free() PulseFind freed... cudaAcc_free() Gaussfit </stderr_txt> However, looking at the PM log, it shows exactly the same sequence as the previous one I detailed, with 20 lines being written to stderr.txt in that termination phase. So, those additional lines were likely the first ones written at that time, not earlier. There's no break in the PM log in the middle of the 4th line. The stderr.txt file is 6 bytes larger this time, since "750 Ti" is written twice instead of "660". That's it! Yet somehow the reported Stderr seems to have picked up an additional 111(?) characters. This "alternate" truncation is not common, but did happen on one of the tasks a couple days ago, I believe, so it does repeat occasionally. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Yep, looks like a fight between the app trying to kill it, and the client trying to read it too early. Well at least as you say the alternate truncation is not common. Let's hope their patch gets the majority one at least (whichever that be) "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
I wonder would the commode builds in comparison show the stderr Writes as synchronous (as I'd expect) ? That would just give some illustrations of the user level, driver and kernel underlying sequences involved. Hmmm, I think I've still got the commode build on another one of my boxes that's been in mothballs, since November, host 6979886. I suppose if it would help, I could resurrect it to run a few tasks just for the routine PM log entries, not necessarily to try and catch any truncations. (When it suddenly became clear back in November that there was far more crunching capacity in the community than there was available S@h work to keep everybody busy, I idled 3 boxes and significantly reduced the hours of 2 of the others. I figured I'd let the other fanatics compete to suck up their electricity until the WU/crunch capacity shifted back the other way, perhaps once Green Bank starts producing. I already have a toaster, and a toaster oven, too!) |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Yep, looks like a fight between the app trying to kill it, and the client trying to read it too early. Well at least as you say the alternate truncation is not common. Let's hope their patch gets the majority one at least (whichever that be) So, you really feel like the BOINC client isn't reading the whole file, even though the PM log entries appear to show the file as being complete (based on the file length) when BOINC opens and reads it? |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
lol, yeah, wound back quite a bit here too. Probably not a real urgency with comparing the commode build, because I pretty much know how that one works around the issues. I think a lot of the urgency has gone out now that boinc devs are finally looking at it. At least with next app release you'll be able to torture it until it breaks. It revives my urge to ruggedise some app and client things myself, but that'll have to wait until after the next Cuda app. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Yep, looks like a fight between the app trying to kill it, and the client trying to read it too early. Well at least as you say the alternate truncation is not common. Let's hope their patch gets the majority one at least (whichever that be) mmm, nearly, but not in all cases possible either. Being a memory buffer operation with multiple buffers involved, perhaps think of it like a mini flash drive, that it needs to first extend the file by some length (placing the marker further on, adding extra blocks if needed etc), then start filling in the space from the previous end to the new end. So at the user level (Writefile Call) it can return success before the write is complete, and the data sits in buffers until something extends the file, then copies data probably 32-128 bits at a time into staging buffers and so on. There is every possibility the Boinc client reads up to a truncation point, then 0's all the way up to the full length. It starts to get a bit too internal at that point more than to say it's just not a safe way of writing/synchronising files... which is pretty much what the MS documentation breaks down to. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
lol, yeah, wound back quite a bit here too. Probably not a real urgency with comparing the commode build, because I pretty much know how that one works around the issues. I think a lot of the urgency has gone out now that boinc devs are finally looking at it. At least with next app release you'll be able to torture it until it breaks. It revives my urge to ruggedise some app and client things myself, but that'll have to wait until after the next Cuda app. Okay, I think I'll put the T7400 back to sleep, for now. (It ran out of tasks about an hour before I got up this morning, anyway.) I may still keep trying to trap a truncation with a remove/unlink failure on the xw9400, just to see if I can succeed in capturing the orphaned stderr.txt. (I've got the timer running on a 1 second trigger now and can probably drop it to half a second without any performance hit.) |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Yep, looks like a fight between the app trying to kill it, and the client trying to read it too early. Well at least as you say the alternate truncation is not common. Let's hope their patch gets the majority one at least (whichever that be) It looks like the initial disk allocation for writing the stderr.txt (AllocationSize: 4,096) is sufficient for the 2,000+ bytes actually being written, so even though they're only going out 1 or 2 at a time, no extensions would be needed. Don't know about staging buffers, though. What strikes me, however, is that at the time the BOINC client opens the file to read, it appears to have the correct file length. So, why would that file length info be written to disk before the actual data? Puzzling to me, anyway. ;^) |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14680 Credit: 200,643,578 RAC: 874 |
Yep, looks like a fight between the app trying to kill it, and the client trying to read it too early. Well at least as you say the alternate truncation is not common. Let's hope their patch gets the majority one at least (whichever that be) I didn't go down to as low a level as you in byte counts etc., but I certainly saw that at Milkyway - the stderr file was written completely to local disk, but only part of it was reported to the project. It's a rare event - something like 2% of tasks during my test run. SETI was down at the moment I finally caught one, but you might like to compare your results with Milkyway message 63799. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
So, why would that file length info be written to disk before the actual data? Puzzling to me, anyway. ;^) That's probably an easier part to explain. You can't actually put data there until you write the new length (shift the EOF) making it longer. So the allocation was already there, then the EOF moved, then the data filled in. Not saying it 'the' or 'the only' possible point of failure that could result in a longer file with zeroed (NUL Character) ending, but is just one means Boinc might read a full file but get a trailing bunch of zeroes to happily ignore. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14680 Credit: 200,643,578 RAC: 874 |
Yep, looks like a fight between the app trying to kill it, and the client trying to read it too early. Well at least as you say the alternate truncation is not common. Let's hope their patch gets the majority one at least (whichever that be) Isn't 'AllocationSize: 4,096' simply the quantum size of the disk filing system - one cluster? The OS can't allocate less than that. Here's the file property screenshot from the post I've just linked: File size (i.e. contents) 3,337 bytes: Size on disk 4,096 bytes. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
It looks like the initial disk allocation for writing the stderr.txt (AllocationSize: 4,096) is sufficient for the 2,000+ bytes actually being written, so even though they're only going out 1 or 2 at a time, no extensions would be needed. Yeah 4kb allocation units are pretty standard, It's what's in those allocation units after the EOF is moved but before the data being written enters that interests me. I'm guessing zeroes. It'd be more obvious but probably just as invisible on the webpage if it were hexadecimal guard diigits DEADDEAD like you might use for RAM buffer guards. Easier to just make sure the thing's committed to disk before you read it, without killing things. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Yep, looks like a fight between the app trying to kill it, and the client trying to read it too early. Well at least as you say the alternate truncation is not common. Let's hope their patch gets the majority one at least (whichever that be) The thing there is that you're getting a remove/unlink failure (in this case with that "Error 32"), which none of my recent truncations have experienced. BOINC "appears" to not only successfully delete the file, but seems to read it okay first. The Process Monitor log sequence at least make it look like the app finishes writing the file before BOINC starts to read it, although the interval is only 0.0059528 sec: 11:26:10.3062540 PM Lunatics_x41zc_win32_cuda50.exe 2540 QueryStandardInformationFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\2\stderr.txt SUCCESS AllocationSize: 4,096, EndOfFile: 2,000, NumberOfLinks: 1, DeletePending: False, Directory: False 11:26:10.3064260 PM Lunatics_x41zc_win32_cuda50.exe 2540 WriteFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\2\stderr.txt SUCCESS Offset: 2,000, Length: 2 11:26:10.3108823 PM Lunatics_x41zc_win32_cuda50.exe 2540 CloseFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\2 SUCCESS 11:26:10.3110268 PM Lunatics_x41zc_win32_cuda50.exe 2540 CloseFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\2\stderr.txt SUCCESS ... 11:26:10.3125120 PM boinc.exe 3612 CreateFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\2 SUCCESS Desired Access: Read Data/List Directory, Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Attributes: n/a, ShareMode: Read, Write, AllocationSize: n/a, OpenResult: Opened 11:26:10.3125905 PM boinc.exe 3612 QueryDirectory C:\Documents and Settings\All Users\Application Data\BOINC\slots\2\stderr.txt SUCCESS Filter: stderr.txt, 1: stderr.txt 11:26:10.3126958 PM boinc.exe 3612 CloseFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\2 SUCCESS 11:26:10.3128127 PM boinc.exe 3612 CreateFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\2 SUCCESS Desired Access: Read Data/List Directory, Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Attributes: n/a, ShareMode: Read, Write, AllocationSize: n/a, OpenResult: Opened 11:26:10.3128816 PM boinc.exe 3612 QueryDirectory C:\Documents and Settings\All Users\Application Data\BOINC\slots\2\stderr.txt SUCCESS Filter: stderr.txt, 1: stderr.txt 11:26:10.3129925 PM boinc.exe 3612 CloseFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\2 SUCCESS 11:26:10.3131086 PM boinc.exe 3612 CreateFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\2\stderr.txt SUCCESS Desired Access: Generic Read, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: n/a, OpenResult: Opened 11:26:10.3169796 PM boinc.exe 3612 ReadFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\2\stderr.txt SUCCESS Offset: 0, Length: 2,002 11:26:10.3171062 PM boinc.exe 3612 CloseFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\2\stderr.txt SUCCESS Unfortunately, I don't think there's anything there that would indicate the status of any Write Lock, although boinc.exe opens it with "ShareMode: Read, Write". |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
So, why would that file length info be written to disk before the actual data? Puzzling to me, anyway. ;^) My first reaction was that, in this case, the size would actually have been reduced, going from 4,096 down to 2,002, but then I realized that the 4,096 allocation may just represent the buffer allocation. Process Monitor says that it's tracking "File System" activity, but I don't think I'd have any way of knowing if that represents only physical activity or if it's tracking buffer stuff, too. |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
It looks like the initial disk allocation for writing the stderr.txt (AllocationSize: 4,096) is sufficient for the 2,000+ bytes actually being written, so even though they're only going out 1 or 2 at a time, no extensions would be needed. Don't know about staging buffers, though. What strikes me, however, is that at the time the BOINC client opens the file to read, it appears to have the correct file length. So, why would that file length info be written to disk before the actual data? Puzzling to me, anyway. ;^) Yeah, I think that's the default cluster size on all the newer Win File Systems, certainly with NTFS. I remember (oh, my) back on my Win 3.1 system (with FAT), I used to have multiple logical drives, I used to set up different cluster sizes, 2K, 4K, and 8K, and then try to match file sizes as closely as possible to the cluster sizes, so as not to waste "valuable" disk space. I may have even done that for a while, too, with FAT32 on Win 95 and 98. Nowadays, disk drives are so cheap, nobody cares about wasted space! ;^) |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Isn't 'AllocationSize: 4,096' simply the quantum size of the disk filing system - one cluster? The OS can't allocate less than that. Here's the file property screenshot from the post I've just linked Well, now, I just went back and looked at the PM log for the early moments of that job and, in fact, the allocation doesn't start out at 4,096. It actually starts out at 8, then goes up by just 8 bytes at a time until it gets to 728, after which it just throws its hands up and says, gimme 4096! Doesn't seem particularly efficient to me. ;^) |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.