Stderr Truncations

Message boards : Number crunching : Stderr Truncations
Message board moderation

To post messages, you must log in.

Previous · 1 . . . 4 · 5 · 6 · 7 · 8 · 9 · 10 . . . 11 · Next

AuthorMessage
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1701533 - Posted: 14 Jul 2015, 23:26:54 UTC

Well, I was going to put the T7400 back into mothballs, but perhaps I'll go ahead and try that BOINC v7.6.6 "fix" tonight on both boxes. The T7400 is 64-bit Win 8.1 while the xw9400 is 32-bit Win XP. That'll test both versions.
ID: 1701533 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1701534 - Posted: 14 Jul 2015, 23:31:32 UTC - in response to Message 1701516.  

The thing there is that you're getting a remove/unlink failure (in this case with that "Error 32"), which none of my recent truncations have experienced. BOINC "appears" to not only successfully delete the file, but seems to read it okay first. The Process Monitor log sequence at least make it look like the app finishes writing the file before BOINC starts to read it, although the interval is only 0.0059528 sec:

Well, you see the one I caught, but you didn't see the 50, 100, or however many it was that slipped away without anything odd happening. I saw maybe a 2% 'Validation error' overall during the test run: that's a big majority for doing things in the right order, as you logs have shown. It's only the (very) rare special case which gets noticed - but it's the special cases that started this thread (and its predecessors) in the first place.
ID: 1701534 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1701536 - Posted: 14 Jul 2015, 23:33:40 UTC - in response to Message 1701533.  

Well, I was going to put the T7400 back into mothballs, but perhaps I'll go ahead and try that BOINC v7.6.6 "fix" tonight on both boxes. The T7400 is 64-bit Win 8.1 while the xw9400 is 32-bit Win XP. That'll test both versions.

Have fun testing while I get some shut-eye. I'll join you in the morning with another MW run.
ID: 1701536 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1701542 - Posted: 14 Jul 2015, 23:49:31 UTC - in response to Message 1701534.  
Last modified: 14 Jul 2015, 23:50:47 UTC

The thing there is that you're getting a remove/unlink failure (in this case with that "Error 32"), which none of my recent truncations have experienced. BOINC "appears" to not only successfully delete the file, but seems to read it okay first. The Process Monitor log sequence at least make it look like the app finishes writing the file before BOINC starts to read it, although the interval is only 0.0059528 sec:

Well, you see the one I caught, but you didn't see the 50, 100, or however many it was that slipped away without anything odd happening. I saw maybe a 2% 'Validation error' overall during the test run: that's a big majority for doing things in the right order, as you logs have shown. It's only the (very) rare special case which gets noticed - but it's the special cases that started this thread (and its predecessors) in the first place.

Yeah, it's the ones without any oddness showing in the logs that I've been primarily experiencing since turning on the debug flags. I'm still trying to trap an orphaned stderr.txt if I do happen to experience a remove/unlink failure, but I suspect it'll be just like you found, complete.

Here on S@h, it's not the truncated Stderr by itself that really causes any problems, it's the combination with the unfortunate behavior of the validator on the occasions when the truncation happens on a -9 overflow that has less than 30 Spikes and no Autocorr count. That seemed to be an easy fix, based on the one-line code change that Joe Segur proposed a year and a half ago but, for reasons unknown, it never got implemented. If it had, I don't think we'd even be discussing it here, much less putting in all this additional time looking at the truncation side of the equation.

EDIT: Nighty-night, Richard. It's about time for my dinner.
ID: 1701542 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1701554 - Posted: 15 Jul 2015, 1:48:42 UTC

I've started running the new 7.6.6 client and am looking for invalids at MW. Nothing yet, just one valid and some inconclusives. I posted the tracefile for the valid task at MW. Task 1188200894.
The post at MW is Message 63803.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1701554 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1701559 - Posted: 15 Jul 2015, 2:09:55 UTC - in response to Message 1701554.  

Sounds good! I just got v7.6.6 running on my xw9400 about a half hour ago, but was just about out of GPU tasks, so I kept an eye on it until about 30 or so finally downloaded. Probably in the next hour I'll start up my T7400 and switch it over to v7.6.6, too. It was completely out of tasks by about 5:30 this morning, so it'll be starting completely fresh. I'll run both boxes all night.

I don't monitor for truncations as the tasks complete, since only a handful each month actually fail validation, but I do have a routine that runs twice a day to pick out the Stderr truncations and give me a list. This morning I had six. Who knows what tomorrow will bring.
ID: 1701559 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1701586 - Posted: 15 Jul 2015, 4:20:44 UTC

Just a quick note (I think) on something that caught my eye in the Process Monitor log shortly after I installed v7.6.6 on the T7400. One of the first tasks to download for a GPU quickly finished with what turned out to be a -9 overflow with a Spike count of 30, Task 4265643291. Nothing unusual there, and nothing unusual in the BOINC event log, no remove/unlink failures. But just because it was the first task to finish, I decided to take a look at the PM log. What I found was a SHARING VIOLATION (at 8:01:04.0965121) on the stderr.txt file, where BOINC really did try to read it before the science app closed it. This was not a scenario that I had seen in the truncated Stderr tasks that I looked at earlier today, but I assume is a scenario that would have led to that Error 32 found in those MW tasks. In any event, BOINC seems to have gracefully backed off, the science app closed the file (at 8:01:04.1352078), and BOINC tried again (at 8:01:05.1051927), successfully this time.

I'm not sure that this situation actually applies to the truncated Stderr issue that I've been experiencing, but I thought it looked like it might demonstrate that the Sharing Violation backoff and retry worked. Here's the whole PM log for the relevant period (including a whole bunch of Windows Search indexing events recorded at the time):

8:01:04.0849225 PM	setiathome_7.00_windows_intelx86__cuda42.exe	1324	QueryStandardInformationFile	C:\ProgramData\BOINC\slots\10\stderr.txt	SUCCESS	AllocationSize: 4,096, EndOfFile: 2,158, NumberOfLinks: 1, DeletePending: False, Directory: False
8:01:04.0849355 PM	setiathome_7.00_windows_intelx86__cuda42.exe	1324	WriteFile	C:\ProgramData\BOINC\slots\10\stderr.txt	SUCCESS	Offset: 2,158, Length: 2
8:01:04.0856647 PM	SearchProtocolHost.exe	2684	CreateFile	C:\ProgramData\BOINC\slots\10\boinc_finish_called:ms-properties	NAME NOT FOUND	Desired Access: Generic Read, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Open Reparse Point, Disallow Exclusive, Attributes: n/a, ShareMode: Read, Delete, AllocationSize: n/a
8:01:04.0858163 PM	SearchProtocolHost.exe	2684	CreateFile	C:\ProgramData\BOINC\slots\10\boinc_finish_called:ms-properties	NAME NOT FOUND	Desired Access: Generic Read, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Open Reparse Point, Disallow Exclusive, Attributes: n/a, ShareMode: Read, Delete, AllocationSize: n/a
8:01:04.0876634 PM	SearchProtocolHost.exe	2684	CloseFile	C:\ProgramData\BOINC\slots\10\boinc_finish_called	SUCCESS	
8:01:04.0915538 PM	SearchProtocolHost.exe	2684	CreateFile	C:\ProgramData\BOINC\slots\10\init_data.xml	SUCCESS	Desired Access: Generic Read, Disposition: Open, Options: Open Reparse Point, Open Requiring Oplock, Attributes: N, ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened
8:01:04.0917386 PM	SearchProtocolHost.exe	2684	FileSystemControl	C:\ProgramData\BOINC\slots\10\init_data.xml	OPLOCK HANDLE CLOSED	Control: FSCTL_REQUEST_OPLOCK
8:01:04.0933485 PM	SearchProtocolHost.exe	2684	QueryAttributeTagFile	C:\ProgramData\BOINC\slots\10\init_data.xml	SUCCESS	Attributes: ANCI, ReparseTag: 0x0
8:01:04.0933774 PM	SearchProtocolHost.exe	2684	QueryBasicInformationFile	C:\ProgramData\BOINC\slots\10\init_data.xml	SUCCESS	CreationTime: 7/14/2015 8:00:53 PM, LastAccessTime: 7/14/2015 8:00:53 PM, LastWriteTime: 7/14/2015 8:01:04 PM, ChangeTime: 7/14/2015 8:01:04 PM, FileAttributes: ANCI
8:01:04.0934001 PM	SearchProtocolHost.exe	2684	QueryStandardInformationFile	C:\ProgramData\BOINC\slots\10\init_data.xml	SUCCESS	AllocationSize: 8,192, EndOfFile: 4,739, NumberOfLinks: 1, DeletePending: False, Directory: False
8:01:04.0934671 PM	SearchProtocolHost.exe	2684	QueryFileInternalInformationFile	C:\ProgramData\BOINC\slots\10\init_data.xml	SUCCESS	IndexNumber: 0x1a000000009357
8:01:04.0935064 PM	SearchProtocolHost.exe	2684	QueryRemoteProtocolInformation	C:\ProgramData\BOINC\slots\10\init_data.xml	INVALID PARAMETER	
8:01:04.0935345 PM	SearchProtocolHost.exe	2684	QuerySecurityFile	C:\ProgramData\BOINC\slots\10\init_data.xml	SUCCESS	Information: Owner, Group, DACL
8:01:04.0955193 PM	boinc.exe	3932	CreateFile	C:\ProgramData\BOINC\slots\10\boinc_finish_called	SUCCESS	Desired Access: Generic Read, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: n/a, OpenResult: Opened
8:01:04.0956564 PM	boinc.exe	3932	ReadFile	C:\ProgramData\BOINC\slots\10\boinc_finish_called	END OF FILE	Offset: 0, Length: 4,096, Priority: Normal
8:01:04.0956872 PM	boinc.exe	3932	ReadFile	C:\ProgramData\BOINC\slots\10\boinc_finish_called	END OF FILE	Offset: 0, Length: 4,096, Priority: Normal
8:01:04.0957060 PM	boinc.exe	3932	ReadFile	C:\ProgramData\BOINC\slots\10\boinc_finish_called	END OF FILE	Offset: 0, Length: 4,096, Priority: Normal
8:01:04.0957368 PM	boinc.exe	3932	CloseFile	C:\ProgramData\BOINC\slots\10\boinc_finish_called	SUCCESS	
8:01:04.0962169 PM	SearchProtocolHost.exe	2684	CreateFile	C:\ProgramData\BOINC\slots\10\init_data.xml:ms-properties	NAME NOT FOUND	Desired Access: Generic Read, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Open Reparse Point, Disallow Exclusive, Attributes: n/a, ShareMode: Read, Delete, AllocationSize: n/a
8:01:04.0962346 PM	boinc.exe	3932	CreateFile	C:\ProgramData\BOINC\slots\10	SUCCESS	Desired Access: Read Data/List Directory, Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened
8:01:04.0962773 PM	boinc.exe	3932	QueryDirectory	C:\ProgramData\BOINC\slots\10\stderr.txt	SUCCESS	Filter: stderr.txt, 1: stderr.txt
8:01:04.0963362 PM	boinc.exe	3932	CloseFile	C:\ProgramData\BOINC\slots\10	SUCCESS	
8:01:04.0964910 PM	SearchProtocolHost.exe	2684	CreateFile	C:\ProgramData\BOINC\slots\10\init_data.xml:ms-properties	NAME NOT FOUND	Desired Access: Generic Read, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Open Reparse Point, Disallow Exclusive, Attributes: n/a, ShareMode: Read, Delete, AllocationSize: n/a
8:01:04.0965121 PM	boinc.exe	3932	CreateFile	C:\ProgramData\BOINC\slots\10\stderr.txt	SHARING VIOLATION	Desired Access: Generic Write, Read Attributes, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: None, AllocationSize: n/a
8:01:04.0980812 PM	SearchProtocolHost.exe	2684	CloseFile	C:\ProgramData\BOINC\slots\10\init_data.xml	SUCCESS	
8:01:04.1349295 PM	setiathome_7.00_windows_intelx86__cuda42.exe	1324	CloseFile	C:\ProgramData\BOINC\slots\10	SUCCESS	
8:01:04.1352078 PM	setiathome_7.00_windows_intelx86__cuda42.exe	1324	CloseFile	C:\ProgramData\BOINC\slots\10\stderr.txt	SUCCESS	
8:01:04.1361768 PM	MsMpEng.exe	1844	CreateFileMapping	C:\ProgramData\BOINC\slots\10\stderr.txt	FILE LOCKED WITH WRITERS	SyncType: SyncTypeCreateSection, PageProtection: 
8:01:04.1362203 PM	MsMpEng.exe	1844	QueryStandardInformationFile	C:\ProgramData\BOINC\slots\10\stderr.txt	SUCCESS	AllocationSize: 4,096, EndOfFile: 2,160, NumberOfLinks: 1, DeletePending: False, Directory: False
8:01:04.1419897 PM	SearchProtocolHost.exe	2684	CreateFile	C:\ProgramData\BOINC\slots\10\stderr.txt	SUCCESS	Desired Access: Generic Read, Disposition: Open, Options: Open Reparse Point, Open Requiring Oplock, Attributes: N, ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened
8:01:04.1420909 PM	SearchProtocolHost.exe	2684	FileSystemControl	C:\ProgramData\BOINC\slots\10\stderr.txt	OPLOCK HANDLE CLOSED	Control: FSCTL_REQUEST_OPLOCK
8:01:04.1421452 PM	SearchProtocolHost.exe	2684	QueryAttributeTagFile	C:\ProgramData\BOINC\slots\10\stderr.txt	SUCCESS	Attributes: ANCI, ReparseTag: 0x0
8:01:04.1421672 PM	SearchProtocolHost.exe	2684	QueryBasicInformationFile	C:\ProgramData\BOINC\slots\10\stderr.txt	SUCCESS	CreationTime: 7/14/2015 8:00:53 PM, LastAccessTime: 7/14/2015 8:00:53 PM, LastWriteTime: 7/14/2015 8:01:04 PM, ChangeTime: 7/14/2015 8:01:04 PM, FileAttributes: ANCI
8:01:04.1421872 PM	SearchProtocolHost.exe	2684	QueryStandardInformationFile	C:\ProgramData\BOINC\slots\10\stderr.txt	SUCCESS	AllocationSize: 4,096, EndOfFile: 2,160, NumberOfLinks: 1, DeletePending: False, Directory: False
8:01:04.1422060 PM	SearchProtocolHost.exe	2684	QueryFileInternalInformationFile	C:\ProgramData\BOINC\slots\10\stderr.txt	SUCCESS	IndexNumber: 0x1bd0000000001c5
8:01:04.1422480 PM	SearchProtocolHost.exe	2684	QueryRemoteProtocolInformation	C:\ProgramData\BOINC\slots\10\stderr.txt	INVALID PARAMETER	
8:01:04.1422830 PM	SearchProtocolHost.exe	2684	QuerySecurityFile	C:\ProgramData\BOINC\slots\10\stderr.txt	SUCCESS	Information: Owner, Group, DACL
8:01:04.1431311 PM	SearchProtocolHost.exe	2684	CreateFile	C:\ProgramData\BOINC\slots\10\stderr.txt:ms-properties	NAME NOT FOUND	Desired Access: Generic Read, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Open Reparse Point, Disallow Exclusive, Attributes: n/a, ShareMode: Read, Delete, AllocationSize: n/a
8:01:04.1432940 PM	SearchProtocolHost.exe	2684	CreateFile	C:\ProgramData\BOINC\slots\10\stderr.txt:ms-properties	NAME NOT FOUND	Desired Access: Generic Read, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Open Reparse Point, Disallow Exclusive, Attributes: n/a, ShareMode: Read, Delete, AllocationSize: n/a
8:01:04.1445293 PM	SearchProtocolHost.exe	2684	CloseFile	C:\ProgramData\BOINC\slots\10\stderr.txt	SUCCESS	
8:01:05.0552838 PM	MsMpEng.exe	1844	CreateFile	C:\ProgramData\BOINC\slots\10\state.sah	SUCCESS	Desired Access: Read Attributes, Disposition: Open, Options: Non-Directory File, Open For Backup, Open Reparse Point, Open Requiring Oplock, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened
8:01:05.0553781 PM	MsMpEng.exe	1844	FileSystemControl	C:\ProgramData\BOINC\slots\10\state.sah	OPLOCK HANDLE CLOSED	Control: FSCTL_REQUEST_OPLOCK
8:01:05.0554224 PM	MsMpEng.exe	1844	FileSystemControl	C:\ProgramData\BOINC\slots\10\state.sah	SUCCESS	Control: 0x902eb (Device:0x9 Function:186 Method: 3)
8:01:05.0554520 PM	MsMpEng.exe	1844	CloseFile	C:\ProgramData\BOINC\slots\10\state.sah	SUCCESS	
8:01:05.1051927 PM	boinc.exe	3932	CreateFile	C:\ProgramData\BOINC\slots\10\stderr.txt	SUCCESS	Desired Access: Generic Write, Read Attributes, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: None, AllocationSize: n/a, OpenResult: Opened
8:01:05.1053132 PM	boinc.exe	3932	CloseFile	C:\ProgramData\BOINC\slots\10\stderr.txt	SUCCESS	
8:01:05.1054802 PM	boinc.exe	3932	CreateFile	C:\ProgramData\BOINC\slots\10	SUCCESS	Desired Access: Read Data/List Directory, Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened
8:01:05.1055156 PM	boinc.exe	3932	QueryDirectory	C:\ProgramData\BOINC\slots\10\stderr.txt	SUCCESS	Filter: stderr.txt, 1: stderr.txt
8:01:05.1055753 PM	boinc.exe	3932	CloseFile	C:\ProgramData\BOINC\slots\10	SUCCESS	
8:01:05.1057158 PM	boinc.exe	3932	CreateFile	C:\ProgramData\BOINC\slots\10\stderr.txt	SUCCESS	Desired Access: Generic Read, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: n/a, OpenResult: Opened
8:01:05.1057890 PM	boinc.exe	3932	ReadFile	C:\ProgramData\BOINC\slots\10\stderr.txt	SUCCESS	Offset: 0, Length: 2,160, Priority: Normal
8:01:05.1058918 PM	boinc.exe	3932	CloseFile	C:\ProgramData\BOINC\slots\10\stderr.txt	SUCCESS	
ID: 1701586 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1701599 - Posted: 15 Jul 2015, 5:02:39 UTC - in response to Message 1701586.  

Jeff, that sure looks like the smoking gun of the proposed failure mechanism. And it looks like the new 7.6.6 client gracefully handled the faux pax.
My systems are down for the night waiting for the new solar day to start up again. Will give MW a good workout tomorrow looking for invalids.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1701599 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1701650 - Posted: 15 Jul 2015, 8:34:38 UTC - in response to Message 1701586.  
Last modified: 15 Jul 2015, 8:37:54 UTC

Yep, that'd imply the Boinc client side of the interface is being a little more patient. I wouldn't call it super rugged, or fantastic design, but is definitely better than not waiting at all. Whether or not any portion of your truncations came from that side is an open question. Presumably if the occurances reduce or disappear then some or all were. If you see no change (except with a commit mode build) then they'd better duct tape the API side eventually as well, which is currently just as ungraceful.

Anyway, the problem with duct tape and similar bandaid fixes, is they tend to hurt when they eventually need to be ripped off. Quick with a distraction is my recommendation.

You mentioned the 4k allocation after allocating small buffers etc, as seeming inefficient. Well true, but a small text file typically doesn't warrant using the more efficient mechanisms, since the latency of the IO is about 1000+ times slower than those memory buffers passing from user space, through the user mode driver helper, to the kernel, kernel mode driver, then across several pieces of hardware, and finally onto disk.

What makes it pretty slick since most drives & later OSes went to 4kb allocation units, is that size matches the RAM memory page size, so the granularity for large transactions (or lots of small ones write combined) requires less translation than in the past, and there are hardware elemts in the CPU, motherboard and other hardware capable of working with those sizes efficiently.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1701650 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1701678 - Posted: 15 Jul 2015, 10:15:09 UTC - in response to Message 1701650.  

Just started running v7.6.6 - seems clean so far, no showstoppers.

The good thing is, for once we got the timing right. BOINC devs have been wanting to release the v7.6 branch as 'recommended' for a while now, and with the governance changes I doubt there'll be another major release for a while. We just scraped it in under the wire (and it was a convenient excuse to apply this month's bumper crop of 'Patch Tuesday' updates from M$).
ID: 1701678 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1701757 - Posted: 15 Jul 2015, 15:06:33 UTC

First stage check this morning shows my truncated Stderr list is empty, so no new ones on any machines overnight. That doesn't necessarily prove anything, since that's always happened occasionally, anyway, but it might be encouraging.

Breakfast is next on the agenda, and then I'll start going through the BOINC and PM logs to see if anything looks interesting.
ID: 1701757 · Report as offensive
Tom*

Send message
Joined: 12 Aug 11
Posts: 127
Credit: 20,769,223
RAC: 9
United States
Message 1701761 - Posted: 15 Jul 2015, 15:19:28 UTC
Last modified: 15 Jul 2015, 15:22:35 UTC

Ran MW with 7.6.6 for the last 18 hours, no invalids, so far so good.

Thanks everyone for pushing this issue.

Is the workaround only relevant to stderr or other files returned to the server?
ID: 1701761 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1701772 - Posted: 15 Jul 2015, 15:34:52 UTC - in response to Message 1701761.  

Is the workaround only relevant to stderr or other files returned to the server?
In theory it's relevant to any files, though as we know in Boinc practice the best of theory can go to waste amongst the spaghetti code. I'll remain in disgusted silence at the obscene use of duct tape to mend the hull breaches. Fingers crossed no-one drowns.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1701772 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1701822 - Posted: 15 Jul 2015, 17:20:01 UTC
Last modified: 15 Jul 2015, 18:10:02 UTC

I've started going through the PM logs for the xw9400 and already have a couple examples of slightly different situations where Sharing Violations occurred and the backoff/retry seems to have worked fine. The first is a situation that appears to be comparable to the S@h truncations I experience, where the science app actually does successfully close the stderr.txt file (at 1:12:15.7354433 AM, in the log below). However, BOINC then gets a SHARING VIOLATION when it tries to open it (at 1:12:15.7411881 AM). Just over a second later, the BOINC retry succeeds, with the file Read (at 1:12:16.7409243 AM) apparently going off without a hitch. As part of the slot cleanup, the stderr.exe gets successfully deleted (at 1:12:16.7489738 AM), as well.

1:12:15.7309708 AM	Lunatics_x41zc_win32_cuda50.exe	508	QueryStandardInformationFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt	SUCCESS	AllocationSize: 4,096, EndOfFile: 2,006, NumberOfLinks: 1, DeletePending: False, Directory: False
1:12:15.7309923 AM	Lunatics_x41zc_win32_cuda50.exe	508	WriteFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt	SUCCESS	Offset: 2,006, Length: 2
1:12:15.7351981 AM	Lunatics_x41zc_win32_cuda50.exe	508	CloseFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\4	SUCCESS	
1:12:15.7354433 AM	Lunatics_x41zc_win32_cuda50.exe	508	CloseFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt	SUCCESS	
1:12:15.7364817 AM	MsMpEng.exe	1260	CreateFileMapping	C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt	SUCCESS	SyncType: SyncTypeCreateSection, PageProtection: PAGE_READONLY
1:12:15.7364928 AM	MsMpEng.exe	1260	QueryStandardInformationFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt	SUCCESS	AllocationSize: 4,096, EndOfFile: 2,008, NumberOfLinks: 1, DeletePending: False, Directory: False
1:12:15.7365094 AM	MsMpEng.exe	1260	CreateFileMapping	C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt	SUCCESS	SyncType: SyncTypeOther
1:12:15.7404256 AM	boinc.exe	1456	CreateFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\boinc_finish_called	SUCCESS	Desired Access: Generic Read, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: n/a, OpenResult: Opened
1:12:15.7406701 AM	boinc.exe	1456	ReadFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\boinc_finish_called	END OF FILE	Offset: 0, Length: 4,096
1:12:15.7407123 AM	boinc.exe	1456	ReadFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\boinc_finish_called	END OF FILE	Offset: 0, Length: 4,096
1:12:15.7407220 AM	boinc.exe	1456	ReadFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\boinc_finish_called	END OF FILE	Offset: 0, Length: 4,096
1:12:15.7407386 AM	boinc.exe	1456	CloseFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\boinc_finish_called	SUCCESS	
1:12:15.7409397 AM	boinc.exe	1456	CreateFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\4	SUCCESS	Desired Access: Read Data/List Directory, Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Attributes: n/a, ShareMode: Read, Write, AllocationSize: n/a, OpenResult: Opened
1:12:15.7410155 AM	boinc.exe	1456	QueryDirectory	C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt	SUCCESS	Filter: stderr.txt, 1: stderr.txt
1:12:15.7410549 AM	boinc.exe	1456	CloseFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\4	SUCCESS	
1:12:15.7411881 AM	boinc.exe	1456	CreateFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt	SHARING VIOLATION	Desired Access: Generic Write, Read Attributes, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: None, AllocationSize: n/a
1:12:16.7403601 AM	boinc.exe	1456	CreateFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt	SUCCESS	Desired Access: Generic Write, Read Attributes, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: None, AllocationSize: n/a, OpenResult: Opened
1:12:16.7404283 AM	boinc.exe	1456	CloseFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt	SUCCESS	
1:12:16.7405817 AM	boinc.exe	1456	CreateFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\4	SUCCESS	Desired Access: Read Data/List Directory, Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Attributes: n/a, ShareMode: Read, Write, AllocationSize: n/a, OpenResult: Opened
1:12:16.7406554 AM	boinc.exe	1456	QueryDirectory	C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt	SUCCESS	Filter: stderr.txt, 1: stderr.txt
1:12:16.7407030 AM	boinc.exe	1456	CloseFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\4	SUCCESS	
1:12:16.7408535 AM	boinc.exe	1456	CreateFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt	SUCCESS	Desired Access: Generic Read, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: n/a, OpenResult: Opened
1:12:16.7409243 AM	boinc.exe	1456	ReadFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt	SUCCESS	Offset: 0, Length: 2,008
1:12:16.7409517 AM	boinc.exe	1456	CloseFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt	SUCCESS	
...
1:12:16.7489360 AM	boinc.exe	1456	CreateFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt	SUCCESS	Desired Access: Read Attributes, Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened
1:12:16.7489630 AM	boinc.exe	1456	QueryAttributeTagFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt	SUCCESS	Attributes: A, ReparseTag: 0x0
1:12:16.7489738 AM	boinc.exe	1456	SetDispositionInformationFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt	SUCCESS	Delete: True
1:12:16.7489881 AM	boinc.exe	1456	CloseFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\4\stderr.txt	SUCCESS	

Checking the BOINC log for slot 4 at that time, I see no anomalies. The task appears to be 4265616744.
15-Jul-2015 01:12:15 [SETI@home] [task] Process for 03jn15aa.22800.1058.438086664197.12.86_0 exited, exit code 0, task state 1
15-Jul-2015 01:12:15 [SETI@home] [task] task_state=EXITED for 03jn15aa.22800.1058.438086664197.12.86_0 from handle_exited_app
15-Jul-2015 01:12:16 [---] [slot] cleaning out slots/4: handle_exited_app()
15-Jul-2015 01:12:16 [---] [slot] removed file slots/4/boinc_finish_called
15-Jul-2015 01:12:16 [---] [slot] removed file slots/4/boinc_task_state.xml
15-Jul-2015 01:12:16 [---] [slot] removed file slots/4/cudart32_50_35.dll
15-Jul-2015 01:12:16 [---] [slot] removed file slots/4/cufft32_50_35.dll
15-Jul-2015 01:12:16 [---] [slot] removed file slots/4/init_data.xml
15-Jul-2015 01:12:16 [---] [slot] removed file slots/4/Lunatics_x41zc_win32_cuda50.exe
15-Jul-2015 01:12:16 [---] [slot] removed file slots/4/mbcuda.cfg
15-Jul-2015 01:12:16 [---] [slot] removed file slots/4/result.sah
15-Jul-2015 01:12:16 [---] [slot] removed file slots/4/state.sah
15-Jul-2015 01:12:16 [---] [slot] removed file slots/4/stderr.txt
15-Jul-2015 01:12:16 [---] [slot] removed file slots/4/work_unit.sah
15-Jul-2015 01:12:16 [SETI@home] Computation for task 03jn15aa.22800.1058.438086664197.12.86_0 finished
15-Jul-2015 01:12:16 [SETI@home] [task] result state=FILES_UPLOADING for 03jn15aa.22800.1058.438086664197.12.86_0 from CS::app_finished
15-Jul-2015 01:12:16 [---] [slot] cleaning out slots/4: get_free_slot()
15-Jul-2015 01:12:16 [SETI@home] [slot] assigning slot 4 to 03jn15aa.22800.1058.438086664197.12.142_1

The second Sharing Violation appears to be similar to the one I posted from the T7400 last night. In this instance, BOINC tries to open the stderr.txt (at 1:38:19.7588006 AM) before the science app has actually closed it, which doesn't happen until 1:38:19.8029969 AM. The one second backoff by BOINC results in a successful read at 1:38:20.7585977 AM, followed by successful deletion of the stderr.txt.

1:38:18.0578099 AM	Lunatics_x41zc_win32_cuda50.exe	3012	QueryStandardInformationFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt	SUCCESS	AllocationSize: 4,096, EndOfFile: 1,999, NumberOfLinks: 1, DeletePending: False, Directory: False
1:38:18.0578247 AM	Lunatics_x41zc_win32_cuda50.exe	3012	WriteFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt	SUCCESS	Offset: 1,999, Length: 2
1:38:19.7418218 AM	boinc.exe	1456	CreateFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\boinc_finish_called	SUCCESS	Desired Access: Generic Read, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: n/a, OpenResult: Opened
1:38:19.7418783 AM	boinc.exe	1456	ReadFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\boinc_finish_called	END OF FILE	Offset: 0, Length: 4,096
1:38:19.7418901 AM	boinc.exe	1456	ReadFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\boinc_finish_called	END OF FILE	Offset: 0, Length: 4,096
1:38:19.7418987 AM	boinc.exe	1456	ReadFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\boinc_finish_called	END OF FILE	Offset: 0, Length: 4,096
1:38:19.7419096 AM	boinc.exe	1456	CloseFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\boinc_finish_called	SUCCESS	
1:38:19.7585712 AM	boinc.exe	1456	CreateFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8	SUCCESS	Desired Access: Read Data/List Directory, Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Attributes: n/a, ShareMode: Read, Write, AllocationSize: n/a, OpenResult: Opened
1:38:19.7585970 AM	boinc.exe	1456	QueryDirectory	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt	SUCCESS	Filter: stderr.txt, 1: stderr.txt
1:38:19.7586654 AM	boinc.exe	1456	CloseFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8	SUCCESS	
1:38:19.7588006 AM	boinc.exe	1456	CreateFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt	SHARING VIOLATION	Desired Access: Generic Write, Read Attributes, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: None, AllocationSize: n/a
1:38:19.8029031 AM	Lunatics_x41zc_win32_cuda50.exe	3012	CloseFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8	SUCCESS	
1:38:19.8029969 AM	Lunatics_x41zc_win32_cuda50.exe	3012	CloseFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt	SUCCESS	
1:38:19.8037380 AM	MsMpEng.exe	1260	CreateFileMapping	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt	SUCCESS	SyncType: SyncTypeCreateSection, PageProtection: PAGE_READONLY
1:38:19.8037520 AM	MsMpEng.exe	1260	QueryStandardInformationFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt	SUCCESS	AllocationSize: 4,096, EndOfFile: 2,001, NumberOfLinks: 1, DeletePending: False, Directory: False
1:38:19.8037685 AM	MsMpEng.exe	1260	CreateFileMapping	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt	SUCCESS	SyncType: SyncTypeOther
1:38:20.7573285 AM	boinc.exe	1456	CreateFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt	SUCCESS	Desired Access: Generic Write, Read Attributes, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: None, AllocationSize: n/a, OpenResult: Opened
1:38:20.7573974 AM	boinc.exe	1456	CloseFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt	SUCCESS	
1:38:20.7577468 AM	boinc.exe	1456	CreateFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8	SUCCESS	Desired Access: Read Data/List Directory, Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Attributes: n/a, ShareMode: Read, Write, AllocationSize: n/a, OpenResult: Opened
1:38:20.7577837 AM	boinc.exe	1456	QueryDirectory	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt	SUCCESS	Filter: stderr.txt, 1: stderr.txt
1:38:20.7578234 AM	boinc.exe	1456	CloseFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8	SUCCESS	
1:38:20.7583389 AM	boinc.exe	1456	CreateFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt	SUCCESS	Desired Access: Generic Read, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: n/a, OpenResult: Opened
1:38:20.7585977 AM	boinc.exe	1456	ReadFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt	SUCCESS	Offset: 0, Length: 2,001
1:38:20.7587059 AM	boinc.exe	1456	CloseFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt	SUCCESS	
...
1:38:20.7910412 AM	boinc.exe	1456	CreateFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt	SUCCESS	Desired Access: Read Attributes, Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened
1:38:20.7910685 AM	boinc.exe	1456	QueryAttributeTagFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt	SUCCESS	Attributes: A, ReparseTag: 0x0
1:38:20.7910847 AM	boinc.exe	1456	SetDispositionInformationFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt	SUCCESS	Delete: True
1:38:20.7911022 AM	boinc.exe	1456	CloseFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\8\stderr.txt	SUCCESS	

Again, checking the BOINC log for Slot 8 at that time shows no anomalies. The task appears to be 4265625793.
15-Jul-2015 01:38:19 [SETI@home] [task] task_state=EXITED for 03jn15aa.22800.1876.438086664197.12.212_1 from handle_exited_app
15-Jul-2015 01:38:21 [---] [slot] cleaning out slots/8: handle_exited_app()
15-Jul-2015 01:38:21 [---] [slot] removed file slots/8/boinc_finish_called
15-Jul-2015 01:38:21 [---] [slot] removed file slots/8/boinc_task_state.xml
15-Jul-2015 01:38:21 [---] [slot] removed file slots/8/cudart32_50_35.dll
15-Jul-2015 01:38:21 [---] [slot] removed file slots/8/cufft32_50_35.dll
15-Jul-2015 01:38:21 [---] [slot] removed file slots/8/init_data.xml
15-Jul-2015 01:38:21 [---] [slot] removed file slots/8/Lunatics_x41zc_win32_cuda50.exe
15-Jul-2015 01:38:21 [---] [slot] removed file slots/8/mbcuda.cfg
15-Jul-2015 01:38:21 [---] [slot] removed file slots/8/result.sah
15-Jul-2015 01:38:21 [---] [slot] removed file slots/8/state.sah
15-Jul-2015 01:38:21 [---] [slot] removed file slots/8/stderr.txt
15-Jul-2015 01:38:21 [---] [slot] removed file slots/8/work_unit.sah
15-Jul-2015 01:38:21 [SETI@home] Computation for task 03jn15aa.22800.1876.438086664197.12.212_1 finished
15-Jul-2015 01:38:21 [SETI@home] [task] result state=FILES_UPLOADING for 03jn15aa.22800.1876.438086664197.12.212_1 from CS::app_finished
15-Jul-2015 01:38:21 [---] [slot] cleaning out slots/8: get_free_slot()
15-Jul-2015 01:38:21 [SETI@home] [slot] assigning slot 8 to 03jn15aa.22800.1876.438086664197.12.210_1

In short, the fix seems like it might be working, even in the S@h Stderr truncation situation where a Sharing Violation hadn't seemed to pop up before.

I'll try to get back to looking further through these logs a bit later on.

EDIT: Heh, it looks like those were the only two SHARING VIOLATIONS on the xw9400 all night. The T7400 had 3, the one I posted earlier, along with one additional "read before close" and one "read after close". I won't try to parse out those details unless someone asks.
ID: 1701822 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1701855 - Posted: 15 Jul 2015, 18:19:09 UTC

The same here - no new validate errors at Milkyway since I restarted this morning, with over 500 tasks completed so far.
ID: 1701855 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1701861 - Posted: 15 Jul 2015, 18:30:48 UTC - in response to Message 1701855.  

I'll chime in with over 200 valid tasks at MW since the changeover to 7.6.6. Last invalid was on 7.6.2 right before the changeover. Don't want to jinx it.... but maybe the bug is finally squashed. !YAY!
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1701861 · Report as offensive
Juha
Volunteer tester

Send message
Joined: 7 Mar 04
Posts: 388
Credit: 1,857,738
RAC: 0
Finland
Message 1701890 - Posted: 15 Jul 2015, 19:54:02 UTC - in response to Message 1701474.  

I wonder would the commode builds in comparison show the stderr Writes as synchronous (as I'd expect) ? That would just give some illustrations of the user level, driver and kernel underlying sequences involved.


Are you sure it isn't synchronous already? I mean, it does say 'Synchronous IO Non-Alert'.

22:13:08,4138160	Lunatics_x41zc_win32_cuda50.exe	2496	CreateFile	C:\b\stderr.txt	SUCCESS	Desired Access: Generic Write, Read Attributes, Disposition: OpenIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: 0, OpenResult: Created


Oh and if you compare the one byte writes to stderr.txt with writes to result.sah it looks like stderr is unbuffered (in C stdio sense).

22:13:15,4669828	Lunatics_x41zc_win32_cuda50.exe	2496	CreateFile	C:\b\result.sah	SUCCESS	Desired Access: Generic Write, Read Attributes, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, Write, AllocationSize: 0, OpenResult: Overwritten
22:13:15,4694588	Lunatics_x41zc_win32_cuda50.exe	2496	WriteFile	C:\b\result.sah	SUCCESS	Offset: 0, Length: 4 096
22:13:15,4698946	Lunatics_x41zc_win32_cuda50.exe	2496	WriteFile	C:\b\result.sah	SUCCESS	Offset: 4 096, Length: 3 592
22:13:15,4699876	Lunatics_x41zc_win32_cuda50.exe	2496	CloseFile	C:\b\result.sah	SUCCESS


As it should be. From lib/diagnostics.cpp:diagnostics_init():

if (flags & BOINC_DIAG_REDIRECTSTDERR) {
    file_size(stderr_log, stderr_file_size);
    stderr_file = freopen(stderr_log, "a", stderr);
    if (!stderr_file) {
        return ERR_FOPEN;
    }
    setbuf(stderr_file, 0);
}
ID: 1701890 · Report as offensive
Juha
Volunteer tester

Send message
Joined: 7 Mar 04
Posts: 388
Credit: 1,857,738
RAC: 0
Finland
Message 1701902 - Posted: 15 Jul 2015, 20:34:38 UTC

As for synchronizing writes by app and reads by client.

How about the client waits until the app has exited before it goes reading files? And the client could tell this by asking kernel. I mean, the kernel ought to know when it has closed all files and flushed buffers, right?

And the code could look something like this (from app_control.cpp):

// See if any processes have exited
//
bool ACTIVE_TASK_SET::check_app_exited() {
    ACTIVE_TASK* atp;
    bool found = false;

#ifdef _WIN32
    unsigned long exit_code;
    unsigned int i;

    for (i=0; i<active_tasks.size(); i++) {
        atp = active_tasks[i];
        if (!atp->process_exists()) continue;
        if (GetExitCodeProcess(atp->process_handle, &exit_code)) {
            if (exit_code != STILL_ACTIVE) {
                found = true;
                atp->handle_exited_app(exit_code);
            }
...

handle_exited_app() does the reading and stuff.


So what do you do when you can't trust the kernel?


(The code has minor bug, it can't tell the difference between exit_code=259 and STILL_ACTIVE=259)
ID: 1701902 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1701951 - Posted: 15 Jul 2015, 23:07:13 UTC - in response to Message 1701902.  
Last modified: 15 Jul 2015, 23:48:42 UTC

Yes,
Importantly those open/create and WriteFile calls are indeed synchronous, HOWEVER, the application (boinc api part) never calls those Windows Api functions at all. Those are being invoked from inside microsoft C-Runtime, which by defaults has yet another layer of buffers behind the multithreaded library. Linking to commode.obj disables that extra layer. Threadsafe practices are threadsafe practices, the exisitng implementation wasn't/isn't threadsafe, and the added client wait is marginally better (but not awesome :) )
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1701951 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1701958 - Posted: 15 Jul 2015, 23:17:55 UTC - in response to Message 1701902.  
Last modified: 16 Jul 2015, 0:09:34 UTC

I mean, the kernel ought to know when it has closed all files and flushed buffers, right?


Only problem there, is they decided (for unknown reasons) to use an asynchronous TerminateProcess() call and put a 1 second sleep and hard crash, instead of waiting on a synchronisation primitive. No idea why the entire codebase seems to be allergic to synchronisation, and likes to use magic numbers (fixed time intervals) on a non-realtime OS.

So the exit code can indeed appear before the app [especially at low or idle priority] has really finished exiting. No idea why they chose the ugliest possible implementation.

Probably one day I'll come to admire it as I might a really horrible (unsettling) piece of artwork, but I suspect it'll take me quite a while to be that OK with it. Perhaps that's what they're going for. Some kindof 'organic' approach, lol. Break all the rules, and yet somehow make it almost work. Kindof like a wonky ashtray made by a 5 year-old can be 'special' too.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1701958 · Report as offensive
Previous · 1 . . . 4 · 5 · 6 · 7 · 8 · 9 · 10 . . . 11 · Next

Message boards : Number crunching : Stderr Truncations


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.