Postponed: Waiting to acquire lock

Jeff Buck (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1911645 - Posted: 7 Jan 2018, 23:24:12 UTC
Last modified: 7 Jan 2018, 23:27:27 UTC

Here's another bit of hard evidence to add to the pile. Here on my daily driver, running Win7, I just experimented with two BOINC restarts, running 1 GPU and 5 CPU tasks. The first run had boinc_lockfiles in the GPU and one of the CPU slots. The second run had all slots clear of lockfiles. Before each restart I turned on Process Monitor, just long enough to capture all events until all tasks were in a "Running" state. I then tried to determine what might be going on with the lockfiles and/or with the slots themselves. Here are the events I extracted from the first trial.

The first thing to note is that, even before the science apps start trying to create lockfiles, the BOINC client polls all the existing slots to see what files are already present. You can see that, in this case, slots 2 and 5 already had lockfiles present. Secondly, when the science apps then try to allocate lockfiles in each slot, all CreateFile attempts end with SUCCESS, but if you look to the end of each of those lines, you'll find "OpenResult: Created" for the 4 slots that were free of lockfiles, while the two with pre-existing lockfiles both got "OpenResult: Opened". The apps didn't care if the lockfiles needed to be created first, only that they could open them as non-shared objects. A third thing to note, though I have no idea what it means, is that a subsequent polling of the lockfiles by Explorer only looked at the 4 slots that had newly created lockfiles.
2:21:21.0503030 PM	boinc.exe	7564	QueryDirectory	C:\ProgramData\BOINC\slots\4	SUCCESS	0: .., 1: boinc_task_state.xml, 2: init_data.xml, 3: libfftw3f-3-3-4_x86.dll, 4: MB8_win_x86_SSE3_VS2008_r3330.exe, 5: mb_cmdline.txt, 6: result.sah, 7: state.sah, 8: stderr.txt, 9: work_unit.sah
2:21:21.0504632 PM	boinc.exe	7564	QueryDirectory	C:\ProgramData\BOINC\slots\1	SUCCESS	0: .., 1: boinc_task_state.xml, 2: init_data.xml, 3: libfftw3f-3-3-4_x86.dll, 4: MB8_win_x86_SSE3_VS2008_r3330.exe, 5: mb_cmdline.txt, 6: result.sah, 7: state.sah, 8: stderr.txt, 9: work_unit.sah
2:21:21.0506212 PM	boinc.exe	7564	QueryDirectory	C:\ProgramData\BOINC\slots\0	SUCCESS	0: .., 1: boinc_task_state.xml, 2: init_data.xml, 3: libfftw3f-3-3-4_x86.dll, 4: MB8_win_x86_SSE3_VS2008_r3330.exe, 5: mb_cmdline.txt, 6: result.sah, 7: state.sah, 8: stderr.txt, 9: work_unit.sah
2:21:21.0507793 PM	boinc.exe	7564	QueryDirectory	C:\ProgramData\BOINC\slots\2	SUCCESS	0: .., 1: boinc_lockfile, 2: boinc_task_state.xml, 3: init_data.xml, 4: libfftw3f-3-3-4_x86.dll, 5: MB8_win_x86_SSE3_VS2008_r3330.exe, 6: mb_cmdline.txt, 7: result.sah, 8: state.sah, 9: stderr.txt, 10: work_unit.sah
2:21:21.0509374 PM	boinc.exe	7564	QueryDirectory	C:\ProgramData\BOINC\slots\3	SUCCESS	0: .., 1: boinc_task_state.xml, 2: init_data.xml, 3: libfftw3f-3-3-4_x86.dll, 4: MB8_win_x86_SSE3_VS2008_r3330.exe, 5: mb_cmdline.txt, 6: result.sah, 7: state.sah, 8: stderr.txt, 9: work_unit.sah
2:21:21.0510946 PM	boinc.exe	7564	QueryDirectory	C:\ProgramData\BOINC\slots\5	SUCCESS	0: .., 1: boinc_lockfile, 2: boinc_task_state.xml, 3: cudart32_50_35.dll, 4: cufft32_50_35.dll, 5: init_data.xml, 6: Lunatics_x41zi_win32_cuda50.exe, 7: mbcuda.cfg, 8: result.sah, 9: state.sah, 10: stderr.txt, 11: work_unit.sah

2:21:21.4824767 PM	MB8_win_x86_SSE3_VS2008_r3330.exe	3052	CreateFile	C:\ProgramData\BOINC\slots\4\boinc_lockfile	SUCCESS	Desired Access: Generic Write, Read Attributes, Disposition: OpenIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: None, AllocationSize: 0, OpenResult: Created
2:21:21.4836711 PM	MB8_win_x86_SSE3_VS2008_r3330.exe	2108	CreateFile	C:\ProgramData\BOINC\slots\0\boinc_lockfile	SUCCESS	Desired Access: Generic Write, Read Attributes, Disposition: OpenIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: None, AllocationSize: 0, OpenResult: Created
2:21:21.4908853 PM	MB8_win_x86_SSE3_VS2008_r3330.exe	2788	CreateFile	C:\ProgramData\BOINC\slots\1\boinc_lockfile	SUCCESS	Desired Access: Generic Write, Read Attributes, Disposition: OpenIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: None, AllocationSize: 0, OpenResult: Created
2:21:21.5190641 PM	MB8_win_x86_SSE3_VS2008_r3330.exe	6396	CreateFile	C:\ProgramData\BOINC\slots\3\boinc_lockfile	SUCCESS	Desired Access: Generic Write, Read Attributes, Disposition: OpenIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: None, AllocationSize: 0, OpenResult: Created
2:21:21.5321052 PM	MB8_win_x86_SSE3_VS2008_r3330.exe	5884	CreateFile	C:\ProgramData\BOINC\slots\2\boinc_lockfile	SUCCESS	Desired Access: Generic Write, Read Attributes, Disposition: OpenIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: None, AllocationSize: 0, OpenResult: Opened
2:21:21.6638521 PM	Lunatics_x41zi_win32_cuda50.exe	7312	CreateFile	C:\ProgramData\BOINC\slots\5\boinc_lockfile	SUCCESS	Desired Access: Generic Write, Read Attributes, Disposition: OpenIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: None, AllocationSize: 0, OpenResult: Opened
2:21:21.7998308 PM	explorer.exe	2572	QueryDirectory	C:\ProgramData\BOINC\slots\0\boinc_lockfile	SUCCESS	Filter: boinc_lockfile, 1: boinc_lockfile
2:21:21.8063409 PM	explorer.exe	2572	QueryDirectory	C:\ProgramData\BOINC\slots\1\boinc_lockfile	SUCCESS	Filter: boinc_lockfile, 1: boinc_lockfile
2:21:21.8075796 PM	explorer.exe	2572	QueryDirectory	C:\ProgramData\BOINC\slots\3\boinc_lockfile	SUCCESS	Filter: boinc_lockfile, 1: boinc_lockfile
2:21:21.8089144 PM	explorer.exe	2572	QueryDirectory	C:\ProgramData\BOINC\slots\4\boinc_lockfile	SUCCESS	Filter: boinc_lockfile, 1: boinc_lockfile
2:21:21.8339625 PM	explorer.exe	2572	QueryDirectory	C:\ProgramData\BOINC\slots\5	SUCCESS	0: ., 1: .., 2: boinc_lockfile, 3: boinc_task_state.xml, 4: cudart32_50_35.dll, 5: cufft32_50_35.dll, 6: init_data.xml, 7: Lunatics_x41zi_win32_cuda50.exe, 8: mbcuda.cfg, 9: result.sah, 10: state.sah, 11: stderr.txt, 12: work_unit.sah
2:21:22.1674186 PM	explorer.exe	2572	QueryDirectory	C:\ProgramData\BOINC\slots\0	SUCCESS	0: ., 1: .., 2: boinc_lockfile, 3: boinc_task_state.xml, 4: init_data.xml, 5: libfftw3f-3-3-4_x86.dll, 6: MB8_win_x86_SSE3_VS2008_r3330.exe, 7: mb_cmdline.txt, 8: result.sah, 9: state.sah, 10: stderr.txt, 11: work_unit.sah
2:21:22.1677302 PM	explorer.exe	2572	QueryDirectory	C:\ProgramData\BOINC\slots\1	SUCCESS	0: ., 1: .., 2: boinc_lockfile, 3: boinc_task_state.xml, 4: init_data.xml, 5: libfftw3f-3-3-4_x86.dll, 6: MB8_win_x86_SSE3_VS2008_r3330.exe, 7: mb_cmdline.txt, 8: result.sah, 9: state.sah, 10: stderr.txt, 11: work_unit.sah
2:21:22.1680209 PM	explorer.exe	2572	QueryDirectory	C:\ProgramData\BOINC\slots\2	SUCCESS	0: ., 1: .., 2: boinc_lockfile, 3: boinc_task_state.xml, 4: init_data.xml, 5: libfftw3f-3-3-4_x86.dll, 6: MB8_win_x86_SSE3_VS2008_r3330.exe, 7: mb_cmdline.txt, 8: result.sah, 9: state.sah, 10: stderr.txt, 11: work_unit.sah
2:21:22.1683021 PM	explorer.exe	2572	QueryDirectory	C:\ProgramData\BOINC\slots\3	SUCCESS	0: ., 1: .., 2: boinc_lockfile, 3: boinc_task_state.xml, 4: init_data.xml, 5: libfftw3f-3-3-4_x86.dll, 6: MB8_win_x86_SSE3_VS2008_r3330.exe, 7: mb_cmdline.txt, 8: result.sah, 9: state.sah, 10: stderr.txt, 11: work_unit.sah
2:21:22.1685834 PM	explorer.exe	2572	QueryDirectory	C:\ProgramData\BOINC\slots\4	SUCCESS	0: ., 1: .., 2: boinc_lockfile, 3: boinc_task_state.xml, 4: init_data.xml, 5: libfftw3f-3-3-4_x86.dll, 6: MB8_win_x86_SSE3_VS2008_r3330.exe, 7: mb_cmdline.txt, 8: result.sah, 9: state.sah, 10: stderr.txt, 11: work_unit.sah
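
To tabulate the OpenResult values without reading to the end of every line, a short script along these lines could help. It is only a sketch: it assumes the events were exported as tab-separated text in the column order shown above (time, process, PID, operation, path, result, detail), and the function name `lockfile_open_results` is made up for illustration.

```python
import re

def lockfile_open_results(lines):
    """Map slot number -> OpenResult for CreateFile events on boinc_lockfile."""
    results = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Only CreateFile events carry an OpenResult; skip everything else.
        if len(fields) < 7 or fields[3] != "CreateFile":
            continue
        slot = re.search(r"\\slots\\(\d+)\\boinc_lockfile$", fields[4])
        outcome = re.search(r"OpenResult: (\w+)", fields[6])
        if slot and outcome:
            results[int(slot.group(1))] = outcome.group(1)
    return results

# Two events taken from the first trial above (detail column shortened).
sample = [
    "2:21:21.4824767 PM\tMB8_win_x86_SSE3_VS2008_r3330.exe\t3052\tCreateFile\t"
    "C:\\ProgramData\\BOINC\\slots\\4\\boinc_lockfile\tSUCCESS\t"
    "Disposition: OpenIf, ShareMode: None, OpenResult: Created",
    "2:21:21.5321052 PM\tMB8_win_x86_SSE3_VS2008_r3330.exe\t5884\tCreateFile\t"
    "C:\\ProgramData\\BOINC\\slots\\2\\boinc_lockfile\tSUCCESS\t"
    "Disposition: OpenIf, ShareMode: None, OpenResult: Opened",
]
for slot, outcome in sorted(lockfile_open_results(sample).items()):
    print(f"slot {slot}: {outcome}")
```

Fed the full capture, this would give one line per slot, which makes the Created/Opened split easy to spot.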

The second run, with no lockfiles in any of the slots, doesn't appear to hold any surprises. All 6 slots had new lockfiles created and got "OpenResult: Created".
2:42:22.7528413 PM	boinc.exe	3236	QueryDirectory	C:\ProgramData\BOINC\slots\0	SUCCESS	0: .., 1: boinc_task_state.xml, 2: init_data.xml, 3: libfftw3f-3-3-4_x86.dll, 4: MB8_win_x86_SSE3_VS2008_r3330.exe, 5: mb_cmdline.txt, 6: result.sah, 7: state.sah, 8: stderr.txt, 9: work_unit.sah
2:42:22.7530051 PM	boinc.exe	3236	QueryDirectory	C:\ProgramData\BOINC\slots\2	SUCCESS	0: .., 1: boinc_task_state.xml, 2: init_data.xml, 3: libfftw3f-3-3-4_x86.dll, 4: MB8_win_x86_SSE3_VS2008_r3330.exe, 5: mb_cmdline.txt, 6: result.sah, 7: state.sah, 8: stderr.txt, 9: work_unit.sah
2:42:22.7531944 PM	boinc.exe	3236	QueryDirectory	C:\ProgramData\BOINC\slots\3	SUCCESS	0: .., 1: boinc_task_state.xml, 2: init_data.xml, 3: libfftw3f-3-3-4_x86.dll, 4: MB8_win_x86_SSE3_VS2008_r3330.exe, 5: mb_cmdline.txt, 6: result.sah, 7: state.sah, 8: stderr.txt, 9: work_unit.sah
2:42:22.7533824 PM	boinc.exe	3236	QueryDirectory	C:\ProgramData\BOINC\slots\5	SUCCESS	0: .., 1: boinc_task_state.xml, 2: cudart32_50_35.dll, 3: cufft32_50_35.dll, 4: init_data.xml, 5: Lunatics_x41zi_win32_cuda50.exe, 6: mbcuda.cfg, 7: result.sah, 8: state.sah, 9: stderr.txt, 10: work_unit.sah
2:42:22.7535422 PM	boinc.exe	3236	QueryDirectory	C:\ProgramData\BOINC\slots\4	SUCCESS	0: .., 1: boinc_task_state.xml, 2: init_data.xml, 3: libfftw3f-3-3-4_x86.dll, 4: MB8_win_x86_SSE3_VS2008_r3330.exe, 5: mb_cmdline.txt, 6: result.sah, 7: state.sah, 8: stderr.txt, 9: work_unit.sah
2:42:22.7537047 PM	boinc.exe	3236	QueryDirectory	C:\ProgramData\BOINC\slots\1	SUCCESS	0: .., 1: boinc_task_state.xml, 2: init_data.xml, 3: libfftw3f-3-3-4_x86.dll, 4: MB8_win_x86_SSE3_VS2008_r3330.exe, 5: mb_cmdline.txt, 6: result.sah, 7: state.sah, 8: stderr.txt, 9: work_unit.sah

2:42:23.1579723 PM	MB8_win_x86_SSE3_VS2008_r3330.exe	9068	CreateFile	C:\ProgramData\BOINC\slots\0\boinc_lockfile	SUCCESS	Desired Access: Generic Write, Read Attributes, Disposition: OpenIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: None, AllocationSize: 0, OpenResult: Created
2:42:23.1801118 PM	MB8_win_x86_SSE3_VS2008_r3330.exe	8440	CreateFile	C:\ProgramData\BOINC\slots\2\boinc_lockfile	SUCCESS	Desired Access: Generic Write, Read Attributes, Disposition: OpenIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: None, AllocationSize: 0, OpenResult: Created
2:42:23.2202281 PM	MB8_win_x86_SSE3_VS2008_r3330.exe	9280	CreateFile	C:\ProgramData\BOINC\slots\3\boinc_lockfile	SUCCESS	Desired Access: Generic Write, Read Attributes, Disposition: OpenIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: None, AllocationSize: 0, OpenResult: Created
2:42:23.2249198 PM	MB8_win_x86_SSE3_VS2008_r3330.exe	5504	CreateFile	C:\ProgramData\BOINC\slots\4\boinc_lockfile	SUCCESS	Desired Access: Generic Write, Read Attributes, Disposition: OpenIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: None, AllocationSize: 0, OpenResult: Created
2:42:23.2515987 PM	MB8_win_x86_SSE3_VS2008_r3330.exe	9232	CreateFile	C:\ProgramData\BOINC\slots\1\boinc_lockfile	SUCCESS	Desired Access: Generic Write, Read Attributes, Disposition: OpenIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: None, AllocationSize: 0, OpenResult: Created
2:42:23.3393588 PM	Lunatics_x41zi_win32_cuda50.exe	10144	CreateFile	C:\ProgramData\BOINC\slots\5\boinc_lockfile	SUCCESS	Desired Access: Generic Write, Read Attributes, Disposition: OpenIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: None, AllocationSize: 0, OpenResult: Created
2:42:23.5946561 PM	explorer.exe	2572	QueryDirectory	C:\ProgramData\BOINC\slots\0\boinc_lockfile	SUCCESS	Filter: boinc_lockfile, 1: boinc_lockfile
2:42:23.5969254 PM	explorer.exe	2572	QueryDirectory	C:\ProgramData\BOINC\slots\1\boinc_lockfile	SUCCESS	Filter: boinc_lockfile, 1: boinc_lockfile
2:42:23.5981805 PM	explorer.exe	2572	QueryDirectory	C:\ProgramData\BOINC\slots\2\boinc_lockfile	SUCCESS	Filter: boinc_lockfile, 1: boinc_lockfile
2:42:23.6025992 PM	explorer.exe	2572	QueryDirectory	C:\ProgramData\BOINC\slots\3\boinc_lockfile	SUCCESS	Filter: boinc_lockfile, 1: boinc_lockfile
2:42:23.6031871 PM	explorer.exe	2572	QueryDirectory	C:\ProgramData\BOINC\slots\4\boinc_lockfile	SUCCESS	Filter: boinc_lockfile, 1: boinc_lockfile
2:42:23.6054408 PM	explorer.exe	2572	QueryDirectory	C:\ProgramData\BOINC\slots\5\boinc_lockfile	SUCCESS	Filter: boinc_lockfile, 1: boinc_lockfile
2:42:23.6062870 PM	explorer.exe	2572	QueryDirectory	C:\ProgramData\BOINC\slots\0\boinc_lockfile	SUCCESS	Filter: boinc_lockfile, 1: boinc_lockfile
2:42:23.7187158 PM	explorer.exe	2572	QueryDirectory	C:\ProgramData\BOINC\slots\0	SUCCESS	0: ., 1: .., 2: boinc_lockfile, 3: boinc_task_state.xml, 4: init_data.xml, 5: libfftw3f-3-3-4_x86.dll, 6: MB8_win_x86_SSE3_VS2008_r3330.exe, 7: mb_cmdline.txt, 8: result.sah, 9: state.sah, 10: stderr.txt, 11: work_unit.sah
2:42:23.7863756 PM	explorer.exe	2572	QueryDirectory	C:\ProgramData\BOINC\slots\0	SUCCESS	0: ., 1: .., 2: boinc_lockfile, 3: boinc_task_state.xml, 4: init_data.xml, 5: libfftw3f-3-3-4_x86.dll, 6: MB8_win_x86_SSE3_VS2008_r3330.exe, 7: mb_cmdline.txt, 8: result.sah, 9: state.sah, 10: stderr.txt, 11: work_unit.sah
2:42:23.7864861 PM	explorer.exe	2572	QueryDirectory	C:\ProgramData\BOINC\slots\1	SUCCESS	0: ., 1: .., 2: boinc_lockfile, 3: boinc_task_state.xml, 4: init_data.xml, 5: libfftw3f-3-3-4_x86.dll, 6: MB8_win_x86_SSE3_VS2008_r3330.exe, 7: mb_cmdline.txt, 8: result.sah, 9: state.sah, 10: stderr.txt, 11: work_unit.sah
2:42:23.7866737 PM	explorer.exe	2572	QueryDirectory	C:\ProgramData\BOINC\slots\2	SUCCESS	0: ., 1: .., 2: boinc_lockfile, 3: boinc_task_state.xml, 4: init_data.xml, 5: libfftw3f-3-3-4_x86.dll, 6: MB8_win_x86_SSE3_VS2008_r3330.exe, 7: mb_cmdline.txt, 8: result.sah, 9: state.sah, 10: stderr.txt, 11: work_unit.sah
2:42:23.7867948 PM	explorer.exe	2572	QueryDirectory	C:\ProgramData\BOINC\slots\3	SUCCESS	0: ., 1: .., 2: boinc_lockfile, 3: boinc_task_state.xml, 4: init_data.xml, 5: libfftw3f-3-3-4_x86.dll, 6: MB8_win_x86_SSE3_VS2008_r3330.exe, 7: mb_cmdline.txt, 8: result.sah, 9: state.sah, 10: stderr.txt, 11: work_unit.sah
2:42:23.7869701 PM	explorer.exe	2572	QueryDirectory	C:\ProgramData\BOINC\slots\4	SUCCESS	0: ., 1: .., 2: boinc_lockfile, 3: boinc_task_state.xml, 4: init_data.xml, 5: libfftw3f-3-3-4_x86.dll, 6: MB8_win_x86_SSE3_VS2008_r3330.exe, 7: mb_cmdline.txt, 8: result.sah, 9: state.sah, 10: stderr.txt, 11: work_unit.sah
2:42:23.7870933 PM	explorer.exe	2572	QueryDirectory	C:\ProgramData\BOINC\slots\5	SUCCESS	0: ., 1: .., 2: boinc_lockfile, 3: boinc_task_state.xml, 4: cudart32_50_35.dll, 5: cufft32_50_35.dll, 6: init_data.xml, 7: Lunatics_x41zi_win32_cuda50.exe, 8: mbcuda.cfg, 9: result.sah, 10: state.sah, 11: stderr.txt, 12: work_unit.sah
2:42:26.2022608 PM	explorer.exe	2572	QueryDirectory	C:\ProgramData\BOINC\slots\0	SUCCESS	0: ., 1: .., 2: boinc_lockfile, 3: boinc_task_state.xml, 4: init_data.xml, 5: libfftw3f-3-3-4_x86.dll, 6: MB8_win_x86_SSE3_VS2008_r3330.exe, 7: mb_cmdline.txt, 8: result.sah, 9: state.sah, 10: stderr.txt, 11: work_unit.sah
2:42:26.2024415 PM	explorer.exe	2572	QueryDirectory	C:\ProgramData\BOINC\slots\1	SUCCESS	0: ., 1: .., 2: boinc_lockfile, 3: boinc_task_state.xml, 4: init_data.xml, 5: libfftw3f-3-3-4_x86.dll, 6: MB8_win_x86_SSE3_VS2008_r3330.exe, 7: mb_cmdline.txt, 8: result.sah, 9: state.sah, 10: stderr.txt, 11: work_unit.sah

Of course, as I reported before, none of my restarted tasks in Windows resulted in "Task postponed" messages.

What this test shows me is that it's not just the presence or absence of a lockfile that these apps care about so much as it is the ability to take ownership of that lockfile as a non-shared resource. I suspect that the apps running into a problem may be taking a slightly different approach. Unfortunately, I don't know that there's a Linux equivalent to Process Monitor to get such a detailed view of exactly what's going on at the application level.
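
The distinction between a lockfile merely existing and a lockfile being held can be illustrated with a small sketch. This is not BOINC's code: it uses Python's `fcntl.flock` on a POSIX system as a stand-in for the non-shared open seen in the Windows trace, and the helper name `try_lock` is made up for illustration.

```python
import fcntl
import os
import tempfile

def try_lock(path):
    """Open the lockfile (creating it if absent) and try to lock it exclusively.

    Returns the open file descriptor on success, or None if another
    holder already has the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd

with tempfile.TemporaryDirectory() as slot:
    lockfile = os.path.join(slot, "boinc_lockfile")

    # A stale lockfile left behind by an earlier run: present but not held.
    open(lockfile, "w").close()
    first = try_lock(lockfile)
    print("stale file present, lock acquired:", first is not None)

    # While the first holder keeps its descriptor open, a second attempt fails.
    second = try_lock(lockfile)
    print("lock already held, second attempt acquired:", second is not None)

    # Closing the descriptor releases the lock, even though the file remains.
    os.close(first)
    third = try_lock(lockfile)
    print("after release, lock acquired:", third is not None)
    os.close(third)
```

In this model a leftover file is harmless; only a live holder blocks the next attempt, which matches what the two trials showed.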

EDIT: BTW, on this machine I'm running BOINC 7.6.33.
ID: 1911645
RueiKe (Special Project $250 donor)
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1911646 - Posted: 7 Jan 2018, 23:24:14 UTC - in response to Message 1911527.  

Then stop rescheduling, so that that can be taken out of the equation to see if BOINC and the apps operate normally without changing the client_state file on restarts.

But RueiKe has the same issue, and I'm almost sure he doesn't use the same rescheduler program I use, but it's better to ask him.

@RueiKe Could you tell us if you do rescheduling and, if so, what program or script you use?

<edit> I PMed him and asked him to help us with the answer and post it here. Let's wait.


I don't do any rescheduling.
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1911646
juan BFP (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1911647 - Posted: 7 Jan 2018, 23:35:42 UTC - in response to Message 1911646.  

I don't do any rescheduling.

Thanks for the answer.
Did you follow the thread? Apparently only the two of us have the issue.
I believe we found a way to work around it.
Just delete the lock file in the "postponed" WU's slot.
Try it the next time you get the issue and let us know if that works for you too.
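
For anyone who wants to try this without hunting through the slot directories by hand, here is a rough sketch. It assumes BOINC is fully stopped before anything is deleted, it looks at every slot rather than only the postponed one, and the slots path has to be supplied by the caller since it differs between installs. The function names are made up for illustration.

```python
from pathlib import Path

def find_lockfiles(slots_dir):
    """Return the boinc_lockfile paths found directly under each slot directory."""
    return sorted(Path(slots_dir).glob("*/boinc_lockfile"))

def remove_lockfiles(slots_dir, dry_run=True):
    """List leftover lockfiles; delete them only when dry_run is False."""
    found = find_lockfiles(slots_dir)
    for lockfile in found:
        print(("would remove " if dry_run else "removing ") + str(lockfile))
        if not dry_run:
            lockfile.unlink()
    return found
```

By default `remove_lockfiles(...)` only lists what it would delete; pass `dry_run=False` to actually remove the files.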

So that answers the question: the rescheduler is not the source of the problem. And I believe that rules out the client file as a source of the problem too.

My guess is that the rescheduler just made it worse because it stops and starts BOINC more often, so the "timing error" has more chances to happen.
That could explain why I see the issue more often.

Now it's up to you guys.
ID: 1911647
RueiKe (Special Project $250 donor)
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1911735 - Posted: 8 Jan 2018, 10:15:39 UTC

Not sure if this observation is relevant, but on my Linux system I have always had an issue where if I started boincmgr too soon after exiting it, it would not connect to the project and I would have to terminate and try again. To avoid this issue I would always monitor MB processes in system monitor and wait for all to finish before starting boincmgr again. It usually takes a long time (~1 min) of some processes being idle before they stop running. I did this just now and found 8 processes still listed after 3min:


Now it has been over 10min and 3 of those 8 are still listed in the system monitor. I just restarted boincmgr and after more than 30min, those 3 processes still show up as active:


18836   1696  0 Jan06 pts/18   00:00:17 ../../projects/setiathome.berkeley.edu/MBv8_8.05r3345_avx_linux64
18837   1696  0 Jan06 pts/18   00:00:17 ../../projects/setiathome.berkeley.edu/MBv8_8.05r3345_avx_linux64
18838   1696  0 Jan06 pts/18   00:00:17 ../../projects/setiathome.berkeley.edu/MBv8_8.05r3345_avx_linux64
59633  59606 96 17:45 pts/18   00:21:42 ../../projects/setiathome.berkeley.edu/MBv8_8.05r3345_avx_linux64
59635  59606 98 17:45 pts/18   00:22:07 ../../projects/setiathome.berkeley.edu/MBv8_8.05r3345_avx_linux64
59637  59606 95 17:45 pts/18   00:21:36 ../../projects/setiathome.berkeley.edu/MBv8_8.05r3345_avx_linux64
59638  59606 96 17:45 pts/18   00:21:47 ../../projects/setiathome.berkeley.edu/MBv8_8.05r3345_avx_linux64
59640  59606 97 17:45 pts/18   00:22:00 ../../projects/setiathome.berkeley.edu/MBv8_8.05r3345_avx_linux64
59642  59606 96 17:45 pts/18   00:21:50 ../../projects/setiathome.berkeley.edu/MBv8_8.05r3345_avx_linux64
59645  59606 96 17:45 pts/18   00:21:48 ../../projects/setiathome.berkeley.edu/MBv8_8.05r3345_avx_linux64
59646  59606 96 17:45 pts/18   00:21:50 ../../projects/setiathome.berkeley.edu/MBv8_8.05r3345_avx_linux64
59648  59606 97 17:45 pts/18   00:21:54 ../../projects/setiathome.berkeley.edu/MBv8_8.05r3345_avx_linux64
59650  59606 96 17:45 pts/18   00:21:48 ../../projects/setiathome.berkeley.edu/MBv8_8.05r3345_avx_linux64
59652  59606 96 17:45 pts/18   00:21:43 ../../projects/setiathome.berkeley.edu/MBv8_8.05r3345_avx_linux64
59654  59606 96 17:45 pts/18   00:21:40 ../../projects/setiathome.berkeley.edu/MBv8_8.05r3345_avx_linux64
59656  59606 95 17:45 pts/18   00:21:33 ../../projects/setiathome.berkeley.edu/MBv8_8.05r3345_avx_linux64
59659  59606 96 17:45 pts/18   00:21:40 ../../projects/setiathome.berkeley.edu/MBv8_8.05r3345_avx_linux64
59661  59606 97 17:45 pts/18   00:21:58 ../../projects/setiathome.berkeley.edu/MBv8_8.05r3345_avx_linux64
59663  59606 96 17:45 pts/18   00:21:38 ../../projects/setiathome.berkeley.edu/MBv8_8.05r3345_avx_linux64
59665  59606 97 17:45 pts/18   00:22:02 ../../projects/setiathome.berkeley.edu/MBv8_8.05r3345_avx_linux64
59667  59606 96 17:45 pts/18   00:21:44 ../../projects/setiathome.berkeley.edu/MBv8_8.05r3345_avx_linux64
59669  59606 96 17:45 pts/18   00:21:41 ../../projects/setiathome.berkeley.edu/MBv8_8.05r3345_avx_linux64
59670  59606 97 17:45 pts/18   00:21:58 ../../projects/setiathome.berkeley.edu/MBv8_8.05r3345_avx_linux64
59671  59606 95 17:45 pts/18   00:21:25 ../../projects/setiathome.berkeley.edu/MBv8_8.05r3345_avx_linux64
59673  59606 95 17:45 pts/18   00:21:37 ../../projects/setiathome.berkeley.edu/MBv8_8.05r3345_avx_linux64
59681  59606 97 17:45 pts/18   00:22:01 ../../projects/setiathome.berkeley.edu/MBv8_8.05r3345_avx_linux64
59685  59606 97 17:45 pts/18   00:22:00 ../../projects/setiathome.berkeley.edu/MBv8_8.05r3345_avx_linux64
59687  59606 96 17:45 pts/18   00:21:45 ../../projects/setiathome.berkeley.edu/MBv8_8.05r3345_avx_linux64
59825  59606 97 17:54 pts/18   00:13:40 ../../projects/setiathome.berkeley.edu/MBv8_8.05r3345_avx_linux64
59954  59606 94 18:03 pts/18   00:04:39 ../../projects/setiathome.berkeley.edu/MBv8_8.05r3345_avx_linux64
59973  59606 95 18:04 pts/18   00:03:29 ../../projects/setiathome.berkeley.edu/MBv8_8.05r3345_avx_linux64
59977  59606 97 18:04 pts/18   00:03:27 ../../projects/setiathome.berkeley.edu/MBv8_8.05r3345_avx_linux64
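
The manual "watch System Monitor until the MB processes are gone" step could be scripted roughly as follows. This is a sketch that assumes a Linux `/proc` filesystem; the function names, the timeout and the polling interval are all made up for illustration.

```python
import os
import time

def pids_matching(pattern, proc_root="/proc"):
    """Return PIDs whose command line contains `pattern` (Linux /proc layout)."""
    pids = []
    for entry in os.listdir(proc_root):
        # Skip non-process entries and this script's own process.
        if not entry.isdigit() or int(entry) == os.getpid():
            continue
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
        except OSError:
            continue  # process exited between listdir() and open()
        if pattern in cmdline:
            pids.append(int(entry))
    return sorted(pids)

def wait_until_gone(pattern, timeout=120.0, interval=1.0, lister=pids_matching):
    """Poll until no matching process remains; return the PIDs still alive at timeout."""
    deadline = time.monotonic() + timeout
    while True:
        remaining = lister(pattern)
        if not remaining or time.monotonic() >= deadline:
            return remaining
        time.sleep(interval)
```

Calling `wait_until_gone("MBv8_8.05r3345_avx_linux64")` after exiting would return an empty list once every task has stopped, or the list of leftover PIDs if some are still around when the timeout expires.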

GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1911735
Richard Haselgrove (Project Donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1911736 - Posted: 8 Jan 2018, 10:24:56 UTC - in response to Message 1911735.  

Thanks - that's very clear about which app to focus on, too.

Just to be absolutely clear, you are aware that BOINC Manager (boincmgr) doesn't need to be running for the BOINC Client (boinc) to do its work? There is an option "Stop running tasks when exiting the BOINC Manager": if that option is unchecked, it will behave - deliberately - as you are describing.

The option is contained in the Exit Confirmation dialog: if that doesn't appear, enable it from the Options --> Other options... menu in BOINC Manager.
ID: 1911736
RueiKe (Special Project $250 donor)
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1911737 - Posted: 8 Jan 2018, 10:30:51 UTC - in response to Message 1911736.  

Thanks - that's very clear about which app to focus on, too.

Just to be absolutely clear, you are aware that BOINC Manager (boincmgr) doesn't need to be running for the BOINC Client (boinc) to do its work? There is an option "Stop running tasks when exiting the BOINC Manager": if that option is unchecked, it will behave - deliberately - as you are describing.

The option is contained in the Exit Confirmation dialog: if that doesn't appear, enable it from the Options --> Other options... menu in BOINC Manager.


Yes, I am aware of that option; I checked it and indicated it should remember, so it should stop all tasks each time I quit. Plus, all but 3 MB processes did exit.
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1911737
Richard Haselgrove (Project Donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1911738 - Posted: 8 Jan 2018, 10:35:06 UTC - in response to Message 1911737.  

Thanks again. Just wanted to be certain. So, MBv8_8.05r3345_avx_linux64 needs to go under the microscope.
ID: 1911738
RueiKe (Special Project $250 donor)
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1911743 - Posted: 8 Jan 2018, 10:52:13 UTC - in response to Message 1911738.  

Thanks again. Just wanted to be certain. So, MBv8_8.05r3345_avx_linux64 needs to go under the microscope.


One more item to point out. Even though those 3 tasks were still active, I did not observe the "Waiting to acquire lock" error. Actually, I have only observed that error the one time I posted here. I was only raising these observations as being potentially relevant.
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1911743
juan BFP (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1911758 - Posted: 8 Jan 2018, 13:30:56 UTC - in response to Message 1911735.  
Last modified: 8 Jan 2018, 14:19:09 UTC

Not sure if this observation is relevant, but on my Linux system I have always had an issue where if I started boincmgr too soon after exiting it, it would not connect to the project and I would have to terminate and try again. To avoid this issue I would always monitor MB processes in system monitor and wait for all to finish before starting boincmgr again.

The same behaviour happens on my Linux box. Sometimes after I stop BOINC (yes, my "Stop running tasks ..." option is checked), when I try to restart, it comes up with a completely empty screen (like when we start with no projects attached; no projects or WUs are displayed). To fix that I need to exit BOINC, wait a few seconds and restart. Most of the time the second try restarts it normally; when it doesn't, I repeat the cycle. The next time it happens I will look at the process monitor and check whether something was left behind, as posted. I only start BOINC through the BOINC Manager.
ID: 1911758
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1911770 - Posted: 8 Jan 2018, 17:10:28 UTC

If you are using System Monitor to see those zombie tasks still running, then you can also check at the top of the System Monitor list for the two processes, boinc and boincmgr. Unless you check the option to exit the client as well when exiting or shutting down the Manager, the Client can still be running and supporting all your processes.

It was explained earlier in the thread, and code was posted, that after the Manager and the Client have been shut down, if there are any zombie processes (in other words, tasks left running), there is a 60 second timer plus 5 seconds before the zombie tasks are "bopped on the head" with a "kill" command.
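
The general "ask nicely, wait, then kill" pattern looks roughly like this. To be clear, this is not the BOINC code that was posted earlier: it is a generic sketch using Python's `subprocess`, with a terminate request standing in for however the client actually asks a task to quit, and the default of 65 seconds simply mirrors the 60 + 5 mentioned above. The function name is made up for illustration.

```python
import subprocess

def stop_with_grace(proc, grace_seconds=65.0):
    """Ask a child to terminate; force-kill it if it outlives the grace period.

    Returns "exited" if the process stopped on its own, "killed" otherwise.
    """
    proc.terminate()  # polite request (SIGTERM on POSIX)
    try:
        proc.wait(timeout=grace_seconds)
        return "exited"
    except subprocess.TimeoutExpired:
        proc.kill()  # forced stop (SIGKILL on POSIX)
        proc.wait()  # reap the process so it does not linger as a zombie
        return "killed"
```

Note that a pattern like this only works while the parent is still alive to run the timer.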

I have seen this in action many times. The blank Manager after a fast restart is caused by these zombie tasks.

The question now is: why were 3 zombie tasks still running 3 minutes after the supposed client and manager shutdown? I would make doubly sure to check whether the client is still running when tasks never disappear from the System Monitor. If the client is still running, then I wouldn't expect it to issue the kill command.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1911770
Richard Haselgrove (Project Donor)
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1911776 - Posted: 8 Jan 2018, 17:48:09 UTC - in response to Message 1911770.  

The blank Manager after a fast restart is caused by these zombie tasks.
I'm not sure about that - or maybe I'm just thinking of a slightly different way of describing it.

When you close down the Manager (with the 'stop tasks' box ticked), the Manager will tell the client to initiate closedown, and the client will tell the tasks to close down. When the tasks have all finished, the client will close, and all will be clean.

When you re-start the Manager, it will try to start a new Client. But only one client is allowed to run at the same time (without setting special switches). So, if the tasks have gone zombie, the old client will still be running, and the new client won't start - it'll exit again immediately. That's why you don't see client data in the new Manager.
ID: 1911776
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1911780 - Posted: 8 Jan 2018, 17:53:58 UTC

Back when I was testing the older versions of BOINC in Linux I found it was common for tasks to take up to a minute to quit on some systems after the Manager was exited. I found another method that quit the tasks within seconds. When you wish to stop all running tasks go to the File Menu and select Shut down connected client... Answer OK to the first dialogue, then answer Cancel to the second. That should stop all tasks quickly, then exit the manager. I don't know why it takes so long on some systems, but that method will speed up the process.
ID: 1911780
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1911781 - Posted: 8 Jan 2018, 18:08:00 UTC - in response to Message 1911776.  

Thanks, Richard, for explaining it better: it is not the zombie tasks directly preventing the restart, but the old client still running.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1911781
Keith Myers (Special Project $250 donor)
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1911782 - Posted: 8 Jan 2018, 18:10:36 UTC - in response to Message 1911780.  

I have seen that too, on the newer BOINC versions. I would bet that somewhere in the code there is a "kill" exit when you do the Cancel in the second step.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1911782
juan BFP (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1911784 - Posted: 8 Jan 2018, 18:16:01 UTC - in response to Message 1911780.  

I don't know why it takes so long on some systems

A long shot: could that be the origin of the issue I have? If the task takes too long to close and the client has ended before that, could that leave the lock file behind? Maybe it is the way Linux kills the task that does it. I know I ask too many questions. LOL

All is working fine for now. I did a few reschedules; I know that's not needed with only bls05 WUs available, but I did it to make the test more realistic. I kept the 6 CPU WUs running plus the AVX2 builds. My caches are full. Let's wait for tomorrow's outage to see if something changes.
ID: 1911784
juan BFP (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1911787 - Posted: 8 Jan 2018, 18:20:03 UTC - in response to Message 1911776.  
Last modified: 8 Jan 2018, 18:25:02 UTC

When the tasks have all finished, the client will close, and all will be clean.

It's expected to work this way, but that is not what is really happening. For some reason, sometimes on my host the client closes but the tasks remain in memory.

I will try to post an example when I see that. I will look in the System Monitor.
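
For anyone who wants to check for leftover tasks without watching System Monitor, here is a minimal Python sketch. It assumes Linux with a readable /proc, and it assumes that each task runs with its working directory inside a BOINC slot directory. The function name `pids_running_under` is hypothetical, and the slots path is whatever your own installation uses.

```python
import os


def pids_running_under(directory, proc_root="/proc"):
    """Return sorted PIDs whose current working directory is inside `directory`."""
    target = os.path.realpath(directory)
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process entry
        try:
            cwd = os.readlink(os.path.join(proc_root, entry, "cwd"))
        except OSError:
            continue  # process already exited, or it belongs to another user
        # commonpath equals target only when cwd is target or lies below it
        if os.path.commonpath([target, cwd]) == target:
            found.append(int(entry))
    return sorted(found)
```

Running `pids_running_under("/path/to/BOINC/slots")` after the client has exited should return an empty list; any PID it does return is a candidate zombie task. Processes owned by other users are skipped unless the script runs with enough privileges to read their /proc entries.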
ID: 1911787 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1911794 - Posted: 8 Jan 2018, 18:50:31 UTC

Is it possible that the BOINC version plays a part in these zombie task situations? I just tried a normal BOINC Manager Exit on each of my 3 Linux systems. All of mine are set to automatically shut down the client and running tasks. On all 3, System Monitor showed that the longest shutdown delay for the last of the running tasks was no more than 4 seconds. All three of my Linux boxes are running BOINC 7.2.42.

One other observation. I notice from Ruelke's screenshot that his tasks are running with "Normal" priority, while on all 3 of my boxes the tasks were set to "Very Low" priority. (boinc and boincmgr show as Normal priority.) Could that be a factor?
ID: 1911794 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1911800 - Posted: 8 Jan 2018, 19:20:58 UTC - in response to Message 1911794.  
Last modified: 8 Jan 2018, 19:22:32 UTC

It could be. All the systems that have reported the problem are running 7.8.3. I don't regularly read the BOINC GitHub Manager feeds, but I do drop in occasionally to see what changes are brewing and to look over past commits in earlier versions. I don't see anything in the areas that theoretically could be a vector. I am not a code writer, so someone more expert than me would have to comment.

I have only had CPU zombie tasks hang around until killed, and they were always running in "low" priority because I run a script to make that so. I also use the script to elevate any GPU task to high priority. I have never seen a GPU task take more than a couple of seconds to drop off the System Monitor.

So low-priority processes might be the clue here. Their priority might be so low that the system takes too long, while polling, to get around to looking for the kill command, and so misses them. And once the 65 seconds have timed out, the process won't be revisited. At least that is my suspicion; I would have to crawl through the API code that was posted to see whether the kill process is re-entered.
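
To make the suspicion above concrete, here is a minimal Python sketch of the general "ask politely, wait, then force-kill" pattern. It is not BOINC's actual shutdown code; `stop_task` and `grace_seconds` are hypothetical names, and the default grace period is an arbitrary illustration rather than the 65 seconds mentioned above.

```python
import subprocess
import sys


def stop_task(proc, grace_seconds=5.0):
    """Ask a child process to exit, then force-kill it if it overstays.

    Returns "exited" if the process stopped within the grace period,
    or "killed" if it had to be force-killed.
    """
    if proc.poll() is not None:
        return "exited"              # already gone, nothing to do
    proc.terminate()                 # polite request (SIGTERM on Linux)
    try:
        proc.wait(timeout=grace_seconds)
        return "exited"
    except subprocess.TimeoutExpired:
        proc.kill()                  # forced stop (SIGKILL on Linux)
        proc.wait()                  # reap the child so nothing lingers
        return "killed"
```

The step that matters for this thread is the `except` branch. If a caller gives up after the timeout without that final kill, a slow child is simply left running after its parent has gone, which would look like the leftover tasks described here.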
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1911800 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1911804 - Posted: 8 Jan 2018, 19:27:22 UTC - in response to Message 1911800.  
Last modified: 8 Jan 2018, 19:33:19 UTC

Actually, I was noting that my own tasks, all running with "Very Low" priority, were the ones that took no more than 4 seconds to kill, whereas the Zombie examples that Ruelke posted were ones running in "Normal" priority.

EDIT: It may also be worth noting that my machines all have much slower CPUs than yours and Ruelke's.
ID: 1911804 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1911809 - Posted: 8 Jan 2018, 19:43:03 UTC
Last modified: 8 Jan 2018, 19:55:50 UTC

FYI, all my tasks run with Very Low priority.
But why do I have all these processes running on my host if I only have 4 GPU + 6 CPU tasks actually running?

Since yesterday I don't even run SSE4.1 anymore!!! Something is not clearing the old processes from memory.

https://1drv.ms/i/s!Asjkc9Jyluh3zxCec5AdKTaWh7Ll
ID: 1911809 · Report as offensive


 