Message boards :
Number crunching :
Stderr Truncations
Message board moderation
Previous · 1 . . . 4 · 5 · 6 · 7 · 8 · 9 · 10 . . . 11 · Next
Author | Message |
---|---|
![]() ![]() ![]() ![]() Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 ![]() |
Well, I was going to put the T7400 back into mothballs, but perhaps I'll go ahead and try that BOINC v7.6.6 "fix" tonight on both boxes. The T7400 is 64-bit Win 8.1 while the xw9400 is 32-bit Win XP. That'll test both versions. |
Richard Haselgrove ![]() Send message Joined: 4 Jul 99 Posts: 14686 Credit: 200,643,578 RAC: 874 ![]() ![]() |
The thing there is that you're getting a remove/unlink failure (in this case with that "Error 32"), which none of my recent truncations have experienced. BOINC "appears" to not only successfully delete the file, but seems to read it okay first. The Process Monitor log sequence at least make it look like the app finishes writing the file before BOINC starts to read it, although the interval is only 0.0059528 sec: Well, you see the one I caught, but you didn't see the 50, 100, or however many it was that slipped away without anything odd happening. I saw maybe a 2% 'Validation error' overall during the test run: that's a big majority for doing things in the right order, as you logs have shown. It's only the (very) rare special case which gets noticed - but it's the special cases that started this thread (and its predecessors) in the first place. |
Richard Haselgrove ![]() Send message Joined: 4 Jul 99 Posts: 14686 Credit: 200,643,578 RAC: 874 ![]() ![]() |
Well, I was going to put the T7400 back into mothballs, but perhaps I'll go ahead and try that BOINC v7.6.6 "fix" tonight on both boxes. The T7400 is 64-bit Win 8.1 while the xw9400 is 32-bit Win XP. That'll test both versions. Have fun testing while I get some shut-eye. I'll join you in the morning with another MW run. |
![]() ![]() ![]() ![]() Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 ![]() |
The thing there is that you're getting a remove/unlink failure (in this case with that "Error 32"), which none of my recent truncations have experienced. BOINC "appears" to not only successfully delete the file, but seems to read it okay first. The Process Monitor log sequence at least make it look like the app finishes writing the file before BOINC starts to read it, although the interval is only 0.0059528 sec: Yeah, it's the ones without any oddness showing in the logs that I've been primarily experiencing since turning on the debug flags. I'm still trying to trap an orphaned stderr.txt if I do happen to experience a remove/unlink failure, but I suspect it'll be just like you found, complete. Here on S@h, it's not the truncated Stderr by itself that really causes any problems, it's the combination with the unfortunate behavior of the validator on the occasions when the truncation happens on a -9 overflow that has less than 30 Spikes and no Autocorr count. That seemed to be an easy fix, based on the one-line code change that Joe Segur proposed a year and a half ago but, for reasons unknown, it never got implemented. If it had, I don't think we'd even be discussing it here, much less putting in all this additional time looking at the truncation side of the equation. EDIT: Nighty-night, Richard. It's about time for my dinner. |
![]() ![]() ![]() Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 ![]() ![]() |
I've started running the new 7.6.6 client and am looking for invalids at MW. Nothing yet, just one valid and some inconclusives. I posted the tracefile for the valid task at MW. Task 1188200894. The post at MW is Message 63803. Seti@Home classic workunits:20,676 CPU time:74,226 hours ![]() ![]() A proud member of the OFA (Old Farts Association) |
![]() ![]() ![]() ![]() Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 ![]() |
Sounds good! I just got v7.6.6 running on my xw9400 about a half hour ago, but was just about out of GPU tasks, so I kept an eye on it until about 30 or so finally downloaded. Probably in the next hour I'll start up my T7400 and switch it over to v7.6.6, too. It was completely out of tasks by about 5:30 this morning, so it'll be starting completely fresh. I'll run both boxes all night. I don't monitor for truncations as the tasks complete, since only a handful each month actually fail validation, but I do have a routine that runs twice a day to pick out the Stderr truncations and give me a list. This morning I had six. Who knows what tomorrow will bring. |
![]() ![]() ![]() ![]() Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 ![]() |
Just a quick note (I think) on something that caught my eye in the Process Monitor log shortly after I installed v7.6.6 on the T7400. One of the first tasks to download for a GPU quickly finished with what turned out to be a -9 overflow with a Spike count of 30, Task 4265643291. Nothing unusual there, and nothing unusual in the BOINC event log, no remove/unlink failures. But just because it was the first task to finish, I decided to take a look at the PM log. What I found was a SHARING VIOLATION (at 8:01:04.0965121) on the stderr.txt file, where BOINC really did try to read it before the science app closed it. This was not a scenario that I had seen in the truncated Stderr tasks that I looked at earlier today, but I assume is a scenario that would have led to that Error 32 found in those MW tasks. In any event, BOINC seems to have gracefully backed off, the science app closed the file (at 8:01:04.1352078), and BOINC tried again (at 8:01:05.1051927), successfully this time. I'm not sure that this situation actually applies to the truncated Stderr issue that I've been experiencing, but I thought it looked like it might demonstrate that the Sharing Violation backoff and retry worked. Here's the whole PM log for the relevant period (including a whole bunch of Windows Search indexing events recorded at the time): 8:01:04.0849225 PM setiathome_7.00_windows_intelx86__cuda42.exe 1324 QueryStandardInformationFile C:\ProgramData\BOINC\slots\10\stderr.txt SUCCESS AllocationSize: 4,096, EndOfFile: 2,158, NumberOfLinks: 1, DeletePending: False, Directory: False 8:01:04.0849355 PM setiathome_7.00_windows_intelx86__cuda42.exe 1324 WriteFile C:\ProgramData\BOINC\slots\10\stderr.txt SUCCESS Offset: 2,158, Length: 2 8:01:04.0856647 PM SearchProtocolHost.exe 2684 CreateFile C:\ProgramData\BOINC\slots\10\boinc_finish_called:ms-properties NAME NOT FOUND Desired Access: Generic Read, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Open Reparse Point, Disallow Exclusive, Attributes: n/a, ShareMode: Read, Delete, AllocationSize: n/a 8:01:04.0858163 PM SearchProtocolHost.exe 2684 CreateFile C:\ProgramData\BOINC\slots\10\boinc_finish_called:ms-properties NAME NOT FOUND Desired Access: Generic Read, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Open Reparse Point, Disallow Exclusive, Attributes: n/a, ShareMode: Read, Delete, AllocationSize: n/a 8:01:04.0876634 PM SearchProtocolHost.exe 2684 CloseFile C:\ProgramData\BOINC\slots\10\boinc_finish_called SUCCESS 8:01:04.0915538 PM SearchProtocolHost.exe 2684 CreateFile C:\ProgramData\BOINC\slots\10\init_data.xml SUCCESS Desired Access: Generic Read, Disposition: Open, Options: Open Reparse Point, Open Requiring Oplock, Attributes: N, ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened 8:01:04.0917386 PM SearchProtocolHost.exe 2684 FileSystemControl C:\ProgramData\BOINC\slots\10\init_data.xml OPLOCK HANDLE CLOSED Control: FSCTL_REQUEST_OPLOCK 8:01:04.0933485 PM SearchProtocolHost.exe 2684 QueryAttributeTagFile C:\ProgramData\BOINC\slots\10\init_data.xml SUCCESS Attributes: ANCI, ReparseTag: 0x0 8:01:04.0933774 PM SearchProtocolHost.exe 2684 QueryBasicInformationFile C:\ProgramData\BOINC\slots\10\init_data.xml SUCCESS CreationTime: 7/14/2015 8:00:53 PM, LastAccessTime: 7/14/2015 8:00:53 PM, LastWriteTime: 7/14/2015 8:01:04 PM, ChangeTime: 7/14/2015 8:01:04 PM, FileAttributes: ANCI 8:01:04.0934001 PM SearchProtocolHost.exe 2684 QueryStandardInformationFile C:\ProgramData\BOINC\slots\10\init_data.xml SUCCESS AllocationSize: 8,192, EndOfFile: 4,739, NumberOfLinks: 1, DeletePending: False, Directory: False 8:01:04.0934671 PM SearchProtocolHost.exe 2684 QueryFileInternalInformationFile C:\ProgramData\BOINC\slots\10\init_data.xml SUCCESS IndexNumber: 0x1a000000009357 8:01:04.0935064 PM SearchProtocolHost.exe 2684 QueryRemoteProtocolInformation C:\ProgramData\BOINC\slots\10\init_data.xml INVALID PARAMETER 8:01:04.0935345 PM SearchProtocolHost.exe 2684 QuerySecurityFile C:\ProgramData\BOINC\slots\10\init_data.xml SUCCESS Information: Owner, Group, DACL 8:01:04.0955193 PM boinc.exe 3932 CreateFile C:\ProgramData\BOINC\slots\10\boinc_finish_called SUCCESS Desired Access: Generic Read, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: n/a, OpenResult: Opened 8:01:04.0956564 PM boinc.exe 3932 ReadFile C:\ProgramData\BOINC\slots\10\boinc_finish_called END OF FILE Offset: 0, Length: 4,096, Priority: Normal 8:01:04.0956872 PM boinc.exe 3932 ReadFile C:\ProgramData\BOINC\slots\10\boinc_finish_called END OF FILE Offset: 0, Length: 4,096, Priority: Normal 8:01:04.0957060 PM boinc.exe 3932 ReadFile C:\ProgramData\BOINC\slots\10\boinc_finish_called END OF FILE Offset: 0, Length: 4,096, Priority: Normal 8:01:04.0957368 PM boinc.exe 3932 CloseFile C:\ProgramData\BOINC\slots\10\boinc_finish_called SUCCESS 8:01:04.0962169 PM SearchProtocolHost.exe 2684 CreateFile C:\ProgramData\BOINC\slots\10\init_data.xml:ms-properties NAME NOT FOUND Desired Access: Generic Read, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Open Reparse Point, Disallow Exclusive, Attributes: n/a, ShareMode: Read, Delete, AllocationSize: n/a 8:01:04.0962346 PM boinc.exe 3932 CreateFile C:\ProgramData\BOINC\slots\10 SUCCESS Desired Access: Read Data/List Directory, Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened 8:01:04.0962773 PM boinc.exe 3932 QueryDirectory C:\ProgramData\BOINC\slots\10\stderr.txt SUCCESS Filter: stderr.txt, 1: stderr.txt 8:01:04.0963362 PM boinc.exe 3932 CloseFile C:\ProgramData\BOINC\slots\10 SUCCESS 8:01:04.0964910 PM SearchProtocolHost.exe 2684 CreateFile C:\ProgramData\BOINC\slots\10\init_data.xml:ms-properties NAME NOT FOUND Desired Access: Generic Read, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Open Reparse Point, Disallow Exclusive, Attributes: n/a, ShareMode: Read, Delete, AllocationSize: n/a 8:01:04.0965121 PM boinc.exe 3932 CreateFile C:\ProgramData\BOINC\slots\10\stderr.txt SHARING VIOLATION Desired Access: Generic Write, Read Attributes, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: None, AllocationSize: n/a 8:01:04.0980812 PM SearchProtocolHost.exe 2684 CloseFile C:\ProgramData\BOINC\slots\10\init_data.xml SUCCESS 8:01:04.1349295 PM setiathome_7.00_windows_intelx86__cuda42.exe 1324 CloseFile C:\ProgramData\BOINC\slots\10 SUCCESS 8:01:04.1352078 PM setiathome_7.00_windows_intelx86__cuda42.exe 1324 CloseFile C:\ProgramData\BOINC\slots\10\stderr.txt SUCCESS 8:01:04.1361768 PM MsMpEng.exe 1844 CreateFileMapping C:\ProgramData\BOINC\slots\10\stderr.txt FILE LOCKED WITH WRITERS SyncType: SyncTypeCreateSection, PageProtection: 8:01:04.1362203 PM MsMpEng.exe 1844 QueryStandardInformationFile C:\ProgramData\BOINC\slots\10\stderr.txt SUCCESS AllocationSize: 4,096, EndOfFile: 2,160, NumberOfLinks: 1, DeletePending: False, Directory: False 8:01:04.1419897 PM SearchProtocolHost.exe 2684 CreateFile C:\ProgramData\BOINC\slots\10\stderr.txt SUCCESS Desired Access: Generic Read, Disposition: Open, Options: Open Reparse Point, Open Requiring Oplock, Attributes: N, ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened 8:01:04.1420909 PM SearchProtocolHost.exe 2684 FileSystemControl C:\ProgramData\BOINC\slots\10\stderr.txt OPLOCK HANDLE CLOSED Control: FSCTL_REQUEST_OPLOCK 8:01:04.1421452 PM SearchProtocolHost.exe 2684 QueryAttributeTagFile C:\ProgramData\BOINC\slots\10\stderr.txt SUCCESS Attributes: ANCI, ReparseTag: 0x0 8:01:04.1421672 PM SearchProtocolHost.exe 2684 QueryBasicInformationFile C:\ProgramData\BOINC\slots\10\stderr.txt SUCCESS CreationTime: 7/14/2015 8:00:53 PM, LastAccessTime: 7/14/2015 8:00:53 PM, LastWriteTime: 7/14/2015 8:01:04 PM, ChangeTime: 7/14/2015 8:01:04 PM, FileAttributes: ANCI 8:01:04.1421872 PM SearchProtocolHost.exe 2684 QueryStandardInformationFile C:\ProgramData\BOINC\slots\10\stderr.txt SUCCESS AllocationSize: 4,096, EndOfFile: 2,160, NumberOfLinks: 1, DeletePending: False, Directory: False 8:01:04.1422060 PM SearchProtocolHost.exe 2684 QueryFileInternalInformationFile C:\ProgramData\BOINC\slots\10\stderr.txt SUCCESS IndexNumber: 0x1bd0000000001c5 8:01:04.1422480 PM SearchProtocolHost.exe 2684 QueryRemoteProtocolInformation C:\ProgramData\BOINC\slots\10\stderr.txt INVALID PARAMETER 8:01:04.1422830 PM SearchProtocolHost.exe 2684 QuerySecurityFile C:\ProgramData\BOINC\slots\10\stderr.txt SUCCESS Information: Owner, Group, DACL 8:01:04.1431311 PM SearchProtocolHost.exe 2684 CreateFile C:\ProgramData\BOINC\slots\10\stderr.txt:ms-properties NAME NOT FOUND Desired Access: Generic Read, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Open Reparse Point, Disallow Exclusive, Attributes: n/a, ShareMode: Read, Delete, AllocationSize: n/a 8:01:04.1432940 PM SearchProtocolHost.exe 2684 CreateFile C:\ProgramData\BOINC\slots\10\stderr.txt:ms-properties NAME NOT FOUND Desired Access: Generic Read, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Open Reparse Point, Disallow Exclusive, Attributes: n/a, ShareMode: Read, Delete, AllocationSize: n/a 8:01:04.1445293 PM SearchProtocolHost.exe 2684 CloseFile C:\ProgramData\BOINC\slots\10\stderr.txt SUCCESS 8:01:05.0552838 PM MsMpEng.exe 1844 CreateFile C:\ProgramData\BOINC\slots\10\state.sah SUCCESS Desired Access: Read Attributes, Disposition: Open, Options: Non-Directory File, Open For Backup, Open Reparse Point, Open Requiring Oplock, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened 8:01:05.0553781 PM MsMpEng.exe 1844 FileSystemControl C:\ProgramData\BOINC\slots\10\state.sah OPLOCK HANDLE CLOSED Control: FSCTL_REQUEST_OPLOCK 8:01:05.0554224 PM MsMpEng.exe 1844 FileSystemControl C:\ProgramData\BOINC\slots\10\state.sah SUCCESS Control: 0x902eb (Device:0x9 Function:186 Method: 3) 8:01:05.0554520 PM MsMpEng.exe 1844 CloseFile C:\ProgramData\BOINC\slots\10\state.sah SUCCESS 8:01:05.1051927 PM boinc.exe 3932 CreateFile C:\ProgramData\BOINC\slots\10\stderr.txt SUCCESS Desired Access: Generic Write, Read Attributes, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: None, AllocationSize: n/a, OpenResult: Opened 8:01:05.1053132 PM boinc.exe 3932 CloseFile C:\ProgramData\BOINC\slots\10\stderr.txt SUCCESS 8:01:05.1054802 PM boinc.exe 3932 CreateFile C:\ProgramData\BOINC\slots\10 SUCCESS Desired Access: Read Data/List Directory, Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened 8:01:05.1055156 PM boinc.exe 3932 QueryDirectory C:\ProgramData\BOINC\slots\10\stderr.txt SUCCESS Filter: stderr.txt, 1: stderr.txt 8:01:05.1055753 PM boinc.exe 3932 CloseFile C:\ProgramData\BOINC\slots\10 SUCCESS 8:01:05.1057158 PM boinc.exe 3932 CreateFile C:\ProgramData\BOINC\slots\10\stderr.txt SUCCESS Desired Access: Generic Read, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: n/a, OpenResult: Opened 8:01:05.1057890 PM boinc.exe 3932 ReadFile C:\ProgramData\BOINC\slots\10\stderr.txt SUCCESS Offset: 0, Length: 2,160, Priority: Normal 8:01:05.1058918 PM boinc.exe 3932 CloseFile C:\ProgramData\BOINC\slots\10\stderr.txt SUCCESS |
![]() ![]() ![]() Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 ![]() ![]() |
Jeff, that sure looks like the smoking gun of the proposed failure mechanism. And it looks like the new 7.6.6 client gracefully handled the faux pax. My systems are down for the night waiting for the new solar day to start up again. Will give MW a good workout tomorrow looking for invalids. Seti@Home classic workunits:20,676 CPU time:74,226 hours ![]() ![]() A proud member of the OFA (Old Farts Association) |
![]() ![]() Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 ![]() |
Yep, that'd imply the Boinc client side of the interface is being a little more patient. I wouldn't call it super rugged, or fantastic design, but is definitely better than not waiting at all. Whether or not any portion of your truncations came from that side is an open question. Presumably if the occurances reduce or disappear then some or all were. If you see no change (except with a commit mode build) then they'd better duct tape the API side eventually as well, which is currently just as ungraceful. Anyway, the problem with duct tape and similar bandaid fixes, is they tend to hurt when they eventually need to be ripped off. Quick with a distraction is my recommendation. You mentioned the 4k allocation after allocating small buffers etc, as seeming inefficient. Well true, but a small text file typically doesn't warrant using the more efficient mechanisms, since the latency of the IO is about 1000+ times slower than those memory buffers passing from user space, through the user mode driver helper, to the kernel, kernel mode driver, then across several pieces of hardware, and finally onto disk. What makes it pretty slick since most drives & later OSes went to 4kb allocation units, is that size matches the RAM memory page size, so the granularity for large transactions (or lots of small ones write combined) requires less translation than in the past, and there are hardware elemts in the CPU, motherboard and other hardware capable of working with those sizes efficiently. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Richard Haselgrove ![]() Send message Joined: 4 Jul 99 Posts: 14686 Credit: 200,643,578 RAC: 874 ![]() ![]() |
Just started running v7.6.6 - seems clean so far, no showstoppers. The good thing is, for once we got the timing right. BOINC devs have been wanting to release the v7.6 branch as 'recommended' for a while now, and with the governance changes I doubt there'll be another major release for a while. We just scraped it in under the wire (and it was a convenient excuse to apply this month's bumper crop of 'Patch Tuesday' updates from M$). |
![]() ![]() ![]() ![]() Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 ![]() |
First stage check this morning shows my truncated Stderr list is empty, so no new ones on any machines overnight. That doesn't necessarily prove anything, since that's always happened occasionally, anyway, but it might be encouraging. Breakfast is next on the agenda, and then I'll start going through the BOINC and PM logs to see if anything looks interesting. |
Tom* Send message Joined: 12 Aug 11 Posts: 127 Credit: 20,769,223 RAC: 9 ![]() |
Ran MW with 7.6.6 for the last 18 hours, no invalids, so far so good. Thanks everyone for pushing this issue. Is the workaround only relevant to stderr or other files returned to the server? |
![]() ![]() Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 ![]() |
Is the workaround only relevant to stderr or other files returned to the server?In theory it's relevant to any files, though as we know in Boinc practice the best of theory can go to waste amongst the spaghetti code. I'll remain in disgusted silence at the obscene use of duct tape to mend the hull breaches. Fingers crossed no-one drowns. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
![]() ![]() ![]() ![]() Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 ![]() |
I've started going through the PM logs for the xw9400 and already have a couple examples of slightly different situations where Sharing Violations occurred and the backoff/retry seems to have worked fine. The first is a situation that appears to be comparable to the S@h truncations I experience, where the science app actually does successfully close the stderr.txt file (at 1:12:15.7354433 AM, in the log below). However, BOINC then gets a SHARING VIOLATION when it tries to open it (at 1:12:15.7411881 AM). Just over a second later, the BOINC retry succeeds, with the file Read (at 1:12:16.7409243 AM) apparently going off without a hitch. As part of the slot cleanup, the stderr.exe gets successfully deleted (at 1:12:16.7489738 AM), as well. 1:12:15.7309708 AM Lunatics_x41zc_win32_cuda50.exe 508 QueryStandardInformationFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt SUCCESS AllocationSize: 4,096, EndOfFile: 2,006, NumberOfLinks: 1, DeletePending: False, Directory: False 1:12:15.7309923 AM Lunatics_x41zc_win32_cuda50.exe 508 WriteFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt SUCCESS Offset: 2,006, Length: 2 1:12:15.7351981 AM Lunatics_x41zc_win32_cuda50.exe 508 CloseFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\4 SUCCESS 1:12:15.7354433 AM Lunatics_x41zc_win32_cuda50.exe 508 CloseFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt SUCCESS 1:12:15.7364817 AM MsMpEng.exe 1260 CreateFileMapping C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt SUCCESS SyncType: SyncTypeCreateSection, PageProtection: PAGE_READONLY 1:12:15.7364928 AM MsMpEng.exe 1260 QueryStandardInformationFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt SUCCESS AllocationSize: 4,096, EndOfFile: 2,008, NumberOfLinks: 1, DeletePending: False, Directory: False 1:12:15.7365094 AM MsMpEng.exe 1260 CreateFileMapping C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt SUCCESS SyncType: SyncTypeOther 1:12:15.7404256 AM boinc.exe 1456 CreateFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\boinc_finish_called SUCCESS Desired Access: Generic Read, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: n/a, OpenResult: Opened 1:12:15.7406701 AM boinc.exe 1456 ReadFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\boinc_finish_called END OF FILE Offset: 0, Length: 4,096 1:12:15.7407123 AM boinc.exe 1456 ReadFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\boinc_finish_called END OF FILE Offset: 0, Length: 4,096 1:12:15.7407220 AM boinc.exe 1456 ReadFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\boinc_finish_called END OF FILE Offset: 0, Length: 4,096 1:12:15.7407386 AM boinc.exe 1456 CloseFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\boinc_finish_called SUCCESS 1:12:15.7409397 AM boinc.exe 1456 CreateFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\4 SUCCESS Desired Access: Read Data/List Directory, Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Attributes: n/a, ShareMode: Read, Write, AllocationSize: n/a, OpenResult: Opened 1:12:15.7410155 AM boinc.exe 1456 QueryDirectory C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt SUCCESS Filter: stderr.txt, 1: stderr.txt 1:12:15.7410549 AM boinc.exe 1456 CloseFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\4 SUCCESS 1:12:15.7411881 AM boinc.exe 1456 CreateFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt SHARING VIOLATION Desired Access: Generic Write, Read Attributes, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: None, AllocationSize: n/a 1:12:16.7403601 AM boinc.exe 1456 CreateFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt SUCCESS Desired Access: Generic Write, Read Attributes, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: None, AllocationSize: n/a, OpenResult: Opened 1:12:16.7404283 AM boinc.exe 1456 CloseFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt SUCCESS 1:12:16.7405817 AM boinc.exe 1456 CreateFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\4 SUCCESS Desired Access: Read Data/List Directory, Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Attributes: n/a, ShareMode: Read, Write, AllocationSize: n/a, OpenResult: Opened 1:12:16.7406554 AM boinc.exe 1456 QueryDirectory C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt SUCCESS Filter: stderr.txt, 1: stderr.txt 1:12:16.7407030 AM boinc.exe 1456 CloseFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\4 SUCCESS 1:12:16.7408535 AM boinc.exe 1456 CreateFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt SUCCESS Desired Access: Generic Read, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: n/a, OpenResult: Opened 1:12:16.7409243 AM boinc.exe 1456 ReadFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt SUCCESS Offset: 0, Length: 2,008 1:12:16.7409517 AM boinc.exe 1456 CloseFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt SUCCESS ... 1:12:16.7489360 AM boinc.exe 1456 CreateFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt SUCCESS Desired Access: Read Attributes, Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened 1:12:16.7489630 AM boinc.exe 1456 QueryAttributeTagFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt SUCCESS Attributes: A, ReparseTag: 0x0 1:12:16.7489738 AM boinc.exe 1456 SetDispositionInformationFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt SUCCESS Delete: True 1:12:16.7489881 AM boinc.exe 1456 CloseFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt SUCCESS Checking the BOINC log for slot 4 at that time, I see no anomalies. The task appears to be 4265616744. 15-Jul-2015 01:12:15 [SETI@home] [task] Process for 03jn15aa.22800.1058.438086664197.12.86_0 exited, exit code 0, task state 1 15-Jul-2015 01:12:15 [SETI@home] [task] task_state=EXITED for 03jn15aa.22800.1058.438086664197.12.86_0 from handle_exited_app 15-Jul-2015 01:12:16 [---] [slot] cleaning out slots/4: handle_exited_app() 15-Jul-2015 01:12:16 [---] [slot] removed file slots/4/boinc_finish_called 15-Jul-2015 01:12:16 [---] [slot] removed file slots/4/boinc_task_state.xml 15-Jul-2015 01:12:16 [---] [slot] removed file slots/4/cudart32_50_35.dll 15-Jul-2015 01:12:16 [---] [slot] removed file slots/4/cufft32_50_35.dll 15-Jul-2015 01:12:16 [---] [slot] removed file slots/4/init_data.xml 15-Jul-2015 01:12:16 [---] [slot] removed file slots/4/Lunatics_x41zc_win32_cuda50.exe 15-Jul-2015 01:12:16 [---] [slot] removed file slots/4/mbcuda.cfg 15-Jul-2015 01:12:16 [---] [slot] removed file slots/4/result.sah 15-Jul-2015 01:12:16 [---] [slot] removed file slots/4/state.sah 15-Jul-2015 01:12:16 [---] [slot] removed file slots/4/stderr.txt 15-Jul-2015 01:12:16 [---] [slot] removed file slots/4/work_unit.sah 15-Jul-2015 01:12:16 [SETI@home] Computation for task 03jn15aa.22800.1058.438086664197.12.86_0 finished 15-Jul-2015 01:12:16 [SETI@home] [task] result state=FILES_UPLOADING for 03jn15aa.22800.1058.438086664197.12.86_0 from CS::app_finished 15-Jul-2015 01:12:16 [---] [slot] cleaning out slots/4: get_free_slot() 15-Jul-2015 01:12:16 [SETI@home] [slot] assigning slot 4 to 03jn15aa.22800.1058.438086664197.12.142_1 The second Sharing Violation appears to be similar to the one I posted from the T7400 last night. In this instance, BOINC tries to open the stderr.txt (at 1:38:19.7588006 AM) before the science app has actually closed it, which doesn't happen until 1:38:19.8029969 AM. The one second backoff by BOINC results in a successful read at 1:38:20.7585977 AM, followed by successful deletion of the stderr.txt. 1:38:18.0578099 AM Lunatics_x41zc_win32_cuda50.exe 3012 QueryStandardInformationFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt SUCCESS AllocationSize: 4,096, EndOfFile: 1,999, NumberOfLinks: 1, DeletePending: False, Directory: False 1:38:18.0578247 AM Lunatics_x41zc_win32_cuda50.exe 3012 WriteFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt SUCCESS Offset: 1,999, Length: 2 1:38:19.7418218 AM boinc.exe 1456 CreateFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\boinc_finish_called SUCCESS Desired Access: Generic Read, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: n/a, OpenResult: Opened 1:38:19.7418783 AM boinc.exe 1456 ReadFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\boinc_finish_called END OF FILE Offset: 0, Length: 4,096 1:38:19.7418901 AM boinc.exe 1456 ReadFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\boinc_finish_called END OF FILE Offset: 0, Length: 4,096 1:38:19.7418987 AM boinc.exe 1456 ReadFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\boinc_finish_called END OF FILE Offset: 0, Length: 4,096 1:38:19.7419096 AM boinc.exe 1456 CloseFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\boinc_finish_called SUCCESS 1:38:19.7585712 AM boinc.exe 1456 CreateFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\8 SUCCESS Desired Access: Read Data/List Directory, Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Attributes: n/a, ShareMode: Read, Write, AllocationSize: n/a, OpenResult: Opened 1:38:19.7585970 AM boinc.exe 1456 QueryDirectory C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt SUCCESS Filter: stderr.txt, 1: stderr.txt 1:38:19.7586654 AM boinc.exe 1456 CloseFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\8 SUCCESS 1:38:19.7588006 AM boinc.exe 1456 CreateFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt SHARING VIOLATION Desired Access: Generic Write, Read Attributes, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: None, AllocationSize: n/a 1:38:19.8029031 AM Lunatics_x41zc_win32_cuda50.exe 3012 CloseFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\8 SUCCESS 1:38:19.8029969 AM Lunatics_x41zc_win32_cuda50.exe 3012 CloseFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt SUCCESS 1:38:19.8037380 AM MsMpEng.exe 1260 CreateFileMapping C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt SUCCESS SyncType: SyncTypeCreateSection, PageProtection: PAGE_READONLY 1:38:19.8037520 AM MsMpEng.exe 1260 QueryStandardInformationFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt SUCCESS AllocationSize: 4,096, EndOfFile: 2,001, NumberOfLinks: 1, DeletePending: False, Directory: False 1:38:19.8037685 AM MsMpEng.exe 1260 CreateFileMapping C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt SUCCESS SyncType: SyncTypeOther 1:38:20.7573285 AM boinc.exe 1456 CreateFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt SUCCESS Desired Access: Generic Write, Read Attributes, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: None, AllocationSize: n/a, OpenResult: Opened 1:38:20.7573974 AM boinc.exe 1456 CloseFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt SUCCESS 1:38:20.7577468 AM boinc.exe 1456 CreateFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\8 SUCCESS Desired Access: Read Data/List Directory, Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Attributes: n/a, ShareMode: Read, Write, AllocationSize: n/a, OpenResult: Opened 1:38:20.7577837 AM boinc.exe 1456 QueryDirectory C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt SUCCESS Filter: stderr.txt, 1: stderr.txt 1:38:20.7578234 AM boinc.exe 1456 CloseFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\8 SUCCESS 1:38:20.7583389 AM boinc.exe 1456 CreateFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt SUCCESS Desired Access: Generic Read, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: n/a, OpenResult: Opened 1:38:20.7585977 AM boinc.exe 1456 ReadFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt SUCCESS Offset: 0, Length: 2,001 1:38:20.7587059 AM boinc.exe 1456 CloseFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt SUCCESS ... 1:38:20.7910412 AM boinc.exe 1456 CreateFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt SUCCESS Desired Access: Read Attributes, Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened 1:38:20.7910685 AM boinc.exe 1456 QueryAttributeTagFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt SUCCESS Attributes: A, ReparseTag: 0x0 1:38:20.7910847 AM boinc.exe 1456 SetDispositionInformationFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt SUCCESS Delete: True 1:38:20.7911022 AM boinc.exe 1456 CloseFile C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt SUCCESS Again, checking the BOINC log for Slot 8 at that time shows no anomalies. The task appears to be 4265625793. 15-Jul-2015 01:38:19 [SETI@home] [task] task_state=EXITED for 03jn15aa.22800.1876.438086664197.12.212_1 from handle_exited_app 15-Jul-2015 01:38:21 [---] [slot] cleaning out slots/8: handle_exited_app() 15-Jul-2015 01:38:21 [---] [slot] removed file slots/8/boinc_finish_called 15-Jul-2015 01:38:21 [---] [slot] removed file slots/8/boinc_task_state.xml 15-Jul-2015 01:38:21 [---] [slot] removed file slots/8/cudart32_50_35.dll 15-Jul-2015 01:38:21 [---] [slot] removed file slots/8/cufft32_50_35.dll 15-Jul-2015 01:38:21 [---] [slot] removed file slots/8/init_data.xml 15-Jul-2015 01:38:21 [---] [slot] removed file slots/8/Lunatics_x41zc_win32_cuda50.exe 15-Jul-2015 01:38:21 [---] [slot] removed file slots/8/mbcuda.cfg 15-Jul-2015 01:38:21 [---] [slot] removed file slots/8/result.sah 15-Jul-2015 01:38:21 [---] [slot] removed file slots/8/state.sah 15-Jul-2015 01:38:21 [---] [slot] removed file slots/8/stderr.txt 15-Jul-2015 01:38:21 [---] [slot] removed file slots/8/work_unit.sah 15-Jul-2015 01:38:21 [SETI@home] Computation for task 03jn15aa.22800.1876.438086664197.12.212_1 finished 15-Jul-2015 01:38:21 [SETI@home] [task] result state=FILES_UPLOADING for 03jn15aa.22800.1876.438086664197.12.212_1 from CS::app_finished 15-Jul-2015 01:38:21 [---] [slot] cleaning out slots/8: get_free_slot() 15-Jul-2015 01:38:21 [SETI@home] [slot] assigning slot 8 to 03jn15aa.22800.1876.438086664197.12.210_1 In short, the fix seems like it might be working, even in the S@h Stderr truncation situation where a Sharing Violation hadn't seemed to pop up before. I'll try to get back to looking further through these logs a bit later on. EDIT: Heh, it looks like those were the only two SHARING VIOLATIONS on the xw9400 all night. The T7400 had 3, the one I posted earlier, along with one additional "read before close" and one "read after close". I won't try to parse out those details unless someone asks. |
Richard Haselgrove ![]() Send message Joined: 4 Jul 99 Posts: 14686 Credit: 200,643,578 RAC: 874 ![]() ![]() |
The same here - no new validate errors at Milkyway since I restarted this morning, with over 500 tasks completed so far. |
![]() ![]() ![]() Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 ![]() ![]() |
I'll chime in with over 200 valid tasks at MW since the changeover to 7.6.6. Last invalid was on 7.6.2 right before the changeover. Don't want to jinx it.... but maybe the bug is finally squashed. !YAY! Seti@Home classic workunits:20,676 CPU time:74,226 hours ![]() ![]() A proud member of the OFA (Old Farts Association) |
Juha Send message Joined: 7 Mar 04 Posts: 388 Credit: 1,857,738 RAC: 0 ![]() |
I wonder would the commode builds in comparison show the stderr Writes as synchronous (as I'd expect) ? That would just give some illustrations of the user level, driver and kernel underlying sequences involved. Are you sure it isn't synchronous already? I mean, it does say 'Synchronous IO Non-Alert'. 22:13:08,4138160 Lunatics_x41zc_win32_cuda50.exe 2496 CreateFile C:\b\stderr.txt SUCCESS Desired Access: Generic Write, Read Attributes, Disposition: OpenIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: 0, OpenResult: Created Oh and if you compare the one byte writes to stderr.txt with writes to result.sah it looks like stderr is unbuffered (in C stdio sense). 22:13:15,4669828 Lunatics_x41zc_win32_cuda50.exe 2496 CreateFile C:\b\result.sah SUCCESS Desired Access: Generic Write, Read Attributes, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: 0, OpenResult: Overwritten 22:13:15,4694588 Lunatics_x41zc_win32_cuda50.exe 2496 WriteFile C:\b\result.sah SUCCESS Offset: 0, Length: 4Â 096 22:13:15,4698946 Lunatics_x41zc_win32_cuda50.exe 2496 WriteFile C:\b\result.sah SUCCESS Offset: 4Â 096, Length: 3Â 592 22:13:15,4699876 Lunatics_x41zc_win32_cuda50.exe 2496 CloseFile C:\b\result.sah SUCCESS As it should be. From lib/diagnostics.cpp:diagnostics_init(): if (flags & BOINC_DIAG_REDIRECTSTDERR) { file_size(stderr_log, stderr_file_size); stderr_file = freopen(stderr_log, "a", stderr); if (!stderr_file) { return ERR_FOPEN; } setbuf(stderr_file, 0); } |
Juha Send message Joined: 7 Mar 04 Posts: 388 Credit: 1,857,738 RAC: 0 ![]() |
As for synchronizing writes by app and reads by client. How about the client waits until the app has exited before it goes reading files? And the client could tell this by asking kernel. I mean, the kernel ought to know when it has closed all files and flushed buffers, right? And the code could look something like this (from app_control.cpp): // See if any processes have exited // bool ACTIVE_TASK_SET::check_app_exited() { ACTIVE_TASK* atp; bool found = false; #ifdef _WIN32 unsigned long exit_code; unsigned int i; for (i=0; i<active_tasks.size(); i++) { atp = active_tasks[i]; if (!atp->process_exists()) continue; if (GetExitCodeProcess(atp->process_handle, &exit_code)) { if (exit_code != STILL_ACTIVE) { found = true; atp->handle_exited_app(exit_code); } ... handle_exited_app() does the reading and stuff. So what do you do when you can't trust the kernel? (The code has minor bug, it can't tell the difference between exit_code=259 and STILL_ACTIVE=259) |
![]() ![]() Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 ![]() |
Yes, Importantly those open/create and WriteFile calls are indeed synchronous, HOWEVER, the application (boinc api part) never calls those Windows Api functions at all. Those are being invoked from inside microsoft C-Runtime, which by defaults has yet another layer of buffers behind the multithreaded library. Linking to commode.obj disables that extra layer. Threadsafe practices are threadsafe practices, the exisitng implementation wasn't/isn't threadsafe, and the added client wait is marginally better (but not awesome :) ) "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
![]() ![]() Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 ![]() |
I mean, the kernel ought to know when it has closed all files and flushed buffers, right? Only problem there, is they decided (for unknown reasons) to use an asynchronous TerminateProcess() call and put a 1 second sleep and hard crash, instead of waiting on a synchronisation primitive. No idea why the entire codebase seems to be allergic to synchronisation, and likes to use magic numbers (fixed time intervals) on a non-realtime OS. So the exit code can indeed appear before the app [especially at low or idle priority] has really finished exiting. No idea why they chose the ugliest possible implementation. Probably one day I'll come to admire it as I might a really horrible (unsettling) piece of artwork, but I suspect it'll take me quite a while to be that OK with it. Perhaps that's what they're going for. Some kindof 'organic' approach, lol. Break all the rules, and yet somehow make it almost work. Kindof like a wonky ashtray made by a 5 year-old can be 'special' too. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
©2025 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.