Postponed: Waiting to acquire lock

TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1911371 - Posted: 7 Jan 2018, 1:17:56 UTC - in response to Message 1911353.  

I hate to break it to you guys, but the 3711 app uses the exact same AKv8 folder as the 3712 apps. They both use the AKv8 version 3710 source. The only difference is a couple of flags, and the 3711 app was compiled in an older version of Ubuntu; otherwise, it uses the exact same code.
BTW, if the problem only happens when the CPU cache is very low, then you are going to have to run the cache down to test it. Running it otherwise means nothing.
If you want to test a different version of BOINC, you can try 7.4.44, as it's much different from 7.8.3 and will work from the Home folder in newer versions of Ubuntu. Just replace the five 7.8.3 files with the five 7.4.44 files from here: http://www.arkayn.us/forum/index.php?topic=197.msg4515#msg4515
Of course, most other people aren't having any trouble with 7.8.3 either.
ID: 1911371
RueiKe
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1911372 - Posted: 7 Jan 2018, 1:18:17 UTC - in response to Message 1911340.  

I woke today to the same problem described by the OP. Yesterday, I was not getting any CPU tasks, so I updated the hosts file and it seemed that all was fine. Got a full cache of CPU work. This morning I found only 3 CPU tasks, all "Waiting to acquire lock". I stopped boincmgr and deleted the lockfile. Tried again and got errors waiting for the slots lockfile, so I deleted those and tried again. Now the 3 CPU tasks are running and, while I am typing, CPU WUs have started to download. Everything seems fine now, but I'm not sure of the original cause. It is happening on this host, Eos.
Definitely good to get another report. Did you have a BOINC shutdown and restart immediately preceding the first appearance of the lockfile messages?

It appears that you're running "AVXxjf Linux64 Build 3345", while Juan is running a newer app, "AVX2jf Linux64 Build 3712". So, if it is an app-specific issue, it may not be confined to a single build.


Jeff, thanks for asking this question, as it made me think of exactly what I did last night before calling it a day. As I mentioned, yesterday my system was not getting any CPU tasks, so I modified my hosts file and solved that problem. In doing that, I found it looked like GPU tasks were taking longer with CPU tasks running. I made some changes to app_config, increasing and decreasing the CPU allocation to GPU tasks. So I did go through several starts and stops. I ended up returning the system to its original configuration before going to sleep. I did check the system to make sure all was OK. So perhaps my activity from yesterday evening did trigger the problem. I will continue to observe and report any additional recurrence here.
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1911372
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1911375 - Posted: 7 Jan 2018, 1:21:38 UTC - in response to Message 1911371.  
Last modified: 7 Jan 2018, 1:34:54 UTC

BTW, if the problem only happens when the CPU cache is very low, then you are going to have to run the cache down to test it. Running it otherwise means nothing.

OK, I will do that tomorrow in the morning, with less alcohol and more time.
I'll completely dry the cache, do some reschedules, and wait to see if the issue appears with the SSE4.1 builds.

<edit> To clarify: I'm not saying the problem only appears with the cache very low; I'm saying I detected the problem when my cache was almost dry. Maybe because the host had crunched all the other WUs and left only the problematic WUs on the screen, it was easy to see. Who knows?
ID: 1911375
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1911378 - Posted: 7 Jan 2018, 1:33:09 UTC - in response to Message 1911372.  

Jeff, Thanks for asking this question as it made me think of exaclty what I did last night before calling it a day. As I mentioned, yesterday my system was not getting any CPU tasks, so I modified my hosts file and solved that problem. In doing that, I found it looked like GPU tasks were taking longer with CPU tasks running. I made some changes to app_config, increasing and deacressing CPU allocation to GPU tasks. So I did go through several starts and stops. I ended up returning the system back to ints original configuration before going to sleep. I did check the system to make sure all was ok. So perhaps my activity from yesterday evening did trigger the problem. I will continue to observe and report any additional reoccurence here.
You should be able to pinpoint when the lockfile issue first showed up by taking a look at your BOINC Event Log (stored in "stdoutdae.txt" and "stdoutdae.old"). You should start seeing "task postponed 600.000000 sec: Waiting to acquire lock" at some point shortly after a restart.
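For anyone who'd rather script that search, a minimal Python sketch along these lines could pull out the restart/lock-message pair (the log path is an assumption for a stock Linux package install; adjust to wherever your BOINC data directory lives):

    # Hedged sketch: find the client restart that precedes the first
    # postponed-lock message in the Event Log. Path is an assumption.
    LOG = "/var/lib/boinc-client/stdoutdae.txt"  # adjust for your setup
    START = "Starting BOINC client version"
    POSTPONED = "Waiting to acquire"

    last_start = None
    with open(LOG, errors="replace") as f:
        for line in f:
            if START in line:
                last_start = line.strip()
            elif POSTPONED in line:
                print("last restart: ", last_start)
                print("first lock msg:", line.strip())
                break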
ID: 1911378
Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1911379 - Posted: 7 Jan 2018, 1:34:24 UTC - in response to Message 1911353.  

Okay, that does seem to help narrow it down, although the extra beer throws another variable into the equation. ;^)

@Keith, you could probably still try the experiment on your Linux machine, as RueiKe seems to have experienced the issue with his Ryzen, but with an AVX rather than AVX2 build.

Okay, I guess I don't exactly understand how to cause the problem or I don't understand the steps necessary to recreate Juan's problem.

I thought that the lockfile was being left behind in the CPU task slots after exiting the Manager and Client. I chose a CPU task in slot 3 and then exited BOINC. I planned to delete the lockfile, then restart BOINC and see if I triggered the "unable to acquire the lockfile" message. But I didn't find any lockfile present in any of my 11 slots. It looks like the lockfile is being removed on my machine just as it's supposed to be, if I'm reading correctly what the code file posted earlier says it should do.

Or am I supposed to delete the lockfile while BOINC is running?
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1911379
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1911389 - Posted: 7 Jan 2018, 1:57:10 UTC - in response to Message 1911360.  

Thanks Jeff, not understanding that msg makes me feel more stupid than usual.

Will leave it running, and if anybody wishes to test anything else on my host, just ask.
It shouldn't make you feel that way at all. I expect most people get confused by that one's posts.

So far, all we've really tested is on the restart side of the equation, showing that at least a couple of AVX apps running in BOINC 7.8.3 seem to recognize the presence of a lockfile and go into a holding pattern. That may actually be correct behavior, based on the section of code that Richard pointed to. If other apps ignore an existing lockfile, it seems like that could be problem, unless there's some other criteria that enters into the decision as to whether or not an occupied slot can be reused. But testing other combinations of BOINC and app versions would certainly be needed to help narrow down the focus.

The other side of the issue, and one which I so far can't think how to test, is how those lockfiles get left behind in the first place after a BOINC shutdown. It certainly seems as if something's out of sync there, but at this point it's hard to tell what. Factors might include the nearly empty cache, as you've noted, and/or an exceedingly fast machine, and/or an OS that terminates apps too quickly, and/or etc., etc., etc........

How to replicate the exact conditions necessary to identify the shutdown side of the equation is going to be a challenge. BTW, even though so far it only appears to be the CPU tasks that run into the problem on startup, it's entirely possible, even likely, that if lockfiles are left behind in the CPU task slots, they're left there for the GPU ones as well. It's just that the GPU apps handle the pre-existing lockfile condition differently on startup. The only way I can think of to actually verify that assumption would be to capture all slot contents immediately after every shutdown, and then review them for the presence of lockfiles. Not a simple task!
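As a rough first pass at that, a small script run right after each shutdown could at least record which slots still hold a lockfile. A minimal Python sketch, assuming a stock Linux install (the slots path and the lockfile name "boinc_lockfile" are assumptions; adjust for your setup):

    # Hedged sketch: audit slot directories for leftover lockfiles
    # immediately after a BOINC shutdown, appending results to a log.
    import glob, os, time

    SLOTS = "/var/lib/boinc-client/slots"  # adjust for your setup

    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    leftovers = [d for d in sorted(glob.glob(os.path.join(SLOTS, "*")))
                 if os.path.isfile(os.path.join(d, "boinc_lockfile"))]
    with open("lockfile_audit.log", "a") as out:
        out.write("%s leftover lockfiles in: %s\n" % (stamp, leftovers or "none"))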
ID: 1911389
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1911391 - Posted: 7 Jan 2018, 2:05:53 UTC - in response to Message 1911379.  

Okay, I guess I don't exactly understand how to cause the problem or I don't understand the steps necessary to recreate Juan's problem.

I thought that the lockfile was being left behind in the CPU task slots after exiting the Manager and Client. I chose a CPU task in slot 3 and then exited BOINC. I planned to delete the lockfile, then restart BOINC and see if I triggered the "unable to acquire the lockfile" message. But I didn't find any lockfile present in any of my 11 slots. It looks like the lockfile is being removed on my machine just as it's supposed to be, if I'm reading correctly what the code file posted earlier says it should do.

Or am I supposed to delete the lockfile while BOINC is running?
Although the problem occurs because lockfiles are not getting deleted at shutdown, it seems like it will be very difficult to replicate the conditions that cause that to happen. So this little experiment doesn't address that side of the issue. It just looks at how an app handles a restart with a pre-existing lockfile. To do that, you just need to shut down BOINC, find one of the slots that a CPU task was running in, and add a lockfile to that folder. I suggested simply doing that by opening that slot's lockfile (in gedit) before BOINC shutdown, then saving it back to the same slot after the shutdown. (It's an empty file, so you can basically use any technique you wish to create, copy or save such a file.) Once you restart BOINC, keep an eye on that slot and on your Event Log and just see if any lockfile-related messages start showing up within the first minute or two, or if the task in your chosen slot resumes progressing normally.
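If you'd rather script that step than use gedit, the whole experiment reduces to something like this Python sketch, run only while BOINC is fully stopped (the slot path and lockfile name are assumptions for a stock Linux install; touch on the command line works just as well):

    # Hedged sketch: with BOINC stopped, drop an empty lockfile into the
    # slot your chosen CPU task was using, then restart BOINC and watch
    # the Event Log for lockfile-related messages.
    import pathlib

    slot = pathlib.Path("/var/lib/boinc-client/slots/3")  # your chosen slot
    (slot / "boinc_lockfile").touch(exist_ok=True)  # an empty file is enough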
ID: 1911391
RueiKe
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1911392 - Posted: 7 Jan 2018, 2:09:49 UTC - in response to Message 1911378.  

Jeff, thanks for asking this question, as it made me think of exactly what I did last night before calling it a day. As I mentioned, yesterday my system was not getting any CPU tasks, so I modified my hosts file and solved that problem. In doing that, I found it looked like GPU tasks were taking longer with CPU tasks running. I made some changes to app_config, increasing and decreasing the CPU allocation to GPU tasks. So I did go through several starts and stops. I ended up returning the system to its original configuration before going to sleep. I did check the system to make sure all was OK. So perhaps my activity from yesterday evening did trigger the problem. I will continue to observe and report any additional recurrence here.
You should be able to pinpoint when the lockfile issue first showed up by taking a look at your BOINC Event Log (stored in "stdoutdae.txt" and "stdoutdae.old"). You should start seeing "task postponed 600.000000 sec: Waiting to acquire lock" at some point shortly after a restart.


I looked at the timestamp of /etc/hosts and found I made the modification that fixed the download issue at 17:04 6-Jan. Here is the log of postponed messages:

06-Jan-2018 17:44:00 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
06-Jan-2018 17:44:01 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
06-Jan-2018 17:44:02 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
06-Jan-2018 18:48:27 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
06-Jan-2018 18:49:02 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
06-Jan-2018 18:49:37 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
06-Jan-2018 19:02:03 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
06-Jan-2018 19:02:25 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
06-Jan-2018 19:02:28 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
06-Jan-2018 19:22:19 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
06-Jan-2018 19:22:33 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
06-Jan-2018 19:22:54 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
06-Jan-2018 19:55:37 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
06-Jan-2018 19:56:13 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
06-Jan-2018 19:56:48 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
06-Jan-2018 20:09:40 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
06-Jan-2018 20:09:44 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
06-Jan-2018 20:09:46 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
06-Jan-2018 20:28:39 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
06-Jan-2018 20:29:08 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
06-Jan-2018 20:29:14 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
*** message repeats all night ***
07-Jan-2018 06:08:21 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
07-Jan-2018 06:08:53 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
07-Jan-2018 06:09:28 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
07-Jan-2018 06:19:03 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
07-Jan-2018 06:19:31 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
07-Jan-2018 06:20:06 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
07-Jan-2018 06:29:39 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
07-Jan-2018 06:30:11 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
07-Jan-2018 06:30:42 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
07-Jan-2018 06:40:29 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
07-Jan-2018 06:41:05 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
07-Jan-2018 06:41:18 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
07-Jan-2018 06:51:10 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
07-Jan-2018 06:51:45 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
07-Jan-2018 06:52:20 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
07-Jan-2018 07:01:53 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
07-Jan-2018 07:02:25 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
07-Jan-2018 07:03:01 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
07-Jan-2018 07:12:46 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
07-Jan-2018 07:13:09 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
07-Jan-2018 07:13:45 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
07-Jan-2018 07:23:57 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
07-Jan-2018 07:23:58 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
07-Jan-2018 07:24:22 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
07-Jan-2018 07:35:11 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
07-Jan-2018 07:35:12 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
07-Jan-2018 07:35:13 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.

GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1911392
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1911394 - Posted: 7 Jan 2018, 2:12:10 UTC - in response to Message 1911389.  
Last modified: 7 Jan 2018, 2:15:22 UTC

The other side of the issue, and one which I so far can't think how to test, is how those lockfiles get left behind in the first place after a BOINC shutdown. It certainly seems as if something's out of sync there, but at this point it's hard to tell what. Factors might include the nearly empty cache, as you've noted, and/or an exceedingly fast machine, and/or an OS that terminates apps too quickly, and/or etc., etc., etc........

That was exactly what I was thinking: why are the slots left locked when we shut down BOINC?

I have something for you to think about. I have a host with 4 relatively fast GPUs, and RueiKe has a 6 GPU host too; maybe that is the path to follow. I don't see any other host running Linux with this configuration. Those who have 4 GPUs actually have some slower models in the mix, or most of them have 3. Petri has one, but AFAIK he runs a completely different app (CUDA91 or something like that).

Maybe when we shut down BOINC our hosts need more time than the others to do the housecleaning, or the opposite: they shut down so fast that no time is left for the housecleaning.

Just a long shot.
ID: 1911394
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1911395 - Posted: 7 Jan 2018, 2:17:34 UTC - in response to Message 1911392.  

I looked at the timestamp of /etc/hosts and found I made the modification that fixed the download issue at 17:04 6-Jan. Here is the log of postponed messages:

06-Jan-2018 17:44:00 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
.....
.....
07-Jan-2018 07:35:13 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
Take a look in the Event Log again and see when BOINC last restarted prior to that 17:44:00 timestamp. There should be a line that has "Starting BOINC client version" as part of the text. My guess would be that the timestamp on that line would be much closer to 17:44 than the 17:04 timestamp on your hosts file.
ID: 1911395
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1911402 - Posted: 7 Jan 2018, 2:32:17 UTC - in response to Message 1911394.  

I have something for you to think about. I have a host with 4 relatively fast GPUs, and RueiKe has a 6 GPU host too; maybe that is the path to follow. I don't see any other host running Linux with this configuration. Those who have 4 GPUs actually have some slower models in the mix, or most of them have 3. Petri has one, but AFAIK he runs a completely different app (CUDA91 or something like that).

Maybe when we shut down BOINC our hosts need more time than the others to do the housecleaning, or the opposite: they shut down so fast that no time is left for the housecleaning.

Just a long shot.
Yeah, I have a 4 GPU host running Linux (8289033), but it's got older, slower AMD Opteron processors and the GPUs are 750Tis and 960s. My other two Linux hosts only have 3 GPUs, a mix of 980s and 960s, with one having the same Opterons and the other, older Xeons, so I'm currently running older CPU apps, as well. No lockfile problems seen, so far.

It seems likely that the speed of some component could be a factor that enters into the equation, but pinpointing that factor could be quite difficult. Your host certainly reboots very quickly, and Richard seemed to think that you were using an SSD boot drive. If so, I suppose that rapid I/O could be a consideration, with the OS completing its chores at BOINC shutdown somewhat faster than BOINC itself does, or the science apps, for that matter. Are your OS and BOINC on different types of drives?
ID: 1911402
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1911404 - Posted: 7 Jan 2018, 2:39:01 UTC - in response to Message 1911402.  

No, all are on one SSD drive, a Kingston SUV400S37240G, not a particularly new or fast SSD. My MB is an AsRock Formula OC, running with no OC.
ID: 1911404
RueiKe
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1911406 - Posted: 7 Jan 2018, 3:04:19 UTC - in response to Message 1911395.  

I looked at the timestamp of /etc/hosts and found I made the modification that fixed the download issue at 17:04 6-Jan. Here is the log of postponed messages:

06-Jan-2018 17:44:00 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
.....
.....
07-Jan-2018 07:35:13 [SETI@home] task postponed 600.000000 sec: Waiting to acquire slot directory lock.  Another instance may be running.
Take a look in the Event Log again and see when BOINC last restarted prior to that 17:44:00 timestamp. There should be a line that has "Starting BOINC client version" as part of the text. My guess would be that the timestamp on that line would be much closer to 17:44 than the 17:04 timestamp on your hosts file.


Yes, my last start of boincmgr was at 17:43:17. The previous entry was suspension of computation at 17:42:09. I typically make sure all relevant processes stop before starting boincmgr again. I suspect this was when I changed app_config back to the original settings.
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1911406
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1911411 - Posted: 7 Jan 2018, 3:20:23 UTC - in response to Message 1911406.  

Take a look in the Event Log again and see when BOINC last restarted prior to that 17:44:00 timestamp. There should be a line that has "Starting BOINC client version" as part of the text. My guess would be that the timestamp on that line would be much closer to 17:44 than the 17:04 timestamp on your hosts file.

Yes, my last start of boincmgr was at 17:43:17. The previous entry was suspension of computation at 17:42:09. I typically make sure all relevant processes stop before starting boincmgr again. I suspect this was when I changed app_config back to the original settings.
Thanks for checking. That seems to confirm that the trigger for your episode was the same as what Juan has been experiencing.
ID: 1911411
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1911465 - Posted: 7 Jan 2018, 12:02:32 UTC
Last modified: 7 Jan 2018, 12:03:04 UTC

Good morning all.
As suggested by TBar, I'm drying the cache now.
Going to make a few reschedules.
Back in 3-4 hrs to see what's happening.

@Keith - Were you able to test whether the issue appears on your host?
ID: 1911465
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1911468 - Posted: 7 Jan 2018, 12:27:55 UTC

Good morning. Two other thoughts occurred to me overnight. They are separate and unrelated, but either or both might prompt ideas for further testing.

1) Permissions.
We know that BOINC creates the slot folders, and also does a cleanup when started (to remove any empty, unneeded folders). Thus the {account, process} which is running BOINC will have ownership of those folders. We've also seen that the SETI application creates, tests, and ultimately deletes the lockfiles. Could a permissions problem get in the way of any of that? (A quick ownership check is sketched at the end of this post.)

2) Timings.
The problems you've been discussing all seem to happen on machines with seriously high processing power, and seem to be correlated with restarting BOINC. BOINC on those machines will be trying to start a large number of tasks all at once after a restart. Could we have reached a scalability problem, otherwise known as a race condition? Database software developers are well aware of the problems of making sure that multiple transactions don't trip over each other, and the BOINC server software makes full use of the database solutions available. But I'm not sure whether the BOINC client is as good: some developers have asserted that it isn't. Remember that BOINC was first developed at a time when a simple dual-CPU workstation was seen as adventurous and prohibitively expensive: we sometimes forget how much computing has moved on since then. Has the software moved on in step?
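On the permissions point, a quick hedged Python sketch to eyeball ownership and modes of the slot folders and any lockfiles inside them (the paths and lockfile name are assumptions for a stock Linux install):

    # Hedged sketch: print owner and mode for each slot directory and
    # any lockfile inside it, to spot ownership/permission mismatches.
    import glob, os, pwd, stat

    SLOTS = "/var/lib/boinc-client/slots"  # adjust for your setup

    for d in sorted(glob.glob(os.path.join(SLOTS, "*"))):
        for p in (d, os.path.join(d, "boinc_lockfile")):
            if os.path.exists(p):
                st = os.stat(p)
                owner = pwd.getpwuid(st.st_uid).pw_name
                print(p, owner, oct(stat.S_IMODE(st.st_mode)))

On the timings point, the locking pattern under discussion looks roughly like the sketch below: a non-blocking exclusive lock on a per-slot file, where failure to acquire is what surfaces as "Waiting to acquire lock". This is an illustration of the general technique, not BOINC's actual code, and the lockfile name is an assumption:

    # Hedged illustration of non-blocking slot locking (not BOINC source):
    # if the previous task process still holds the lock, a restarted task
    # cannot acquire it and would have to postpone and retry.
    import fcntl, os, sys

    def acquire_slot_lock(slot_dir):
        path = os.path.join(slot_dir, "boinc_lockfile")  # assumed name
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o664)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)  # fail fast
        except OSError:
            os.close(fd)
            return None  # busy: a real app would postpone (~600 s) and retry
        return fd  # keep the fd open for the task's lifetime

    if __name__ == "__main__":
        fd = acquire_slot_lock(sys.argv[1] if len(sys.argv) > 1 else ".")
        print("lock acquired" if fd is not None else "lock busy; would postpone")

Worth noting: flock-style locks are released by the OS when the holding process exits, so a leftover lockfile is not itself locked once the old task is gone. Whether a stale file still blocks a slot would then depend on whether the restarted app actually attempts the lock or merely tests for the file's existence.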
ID: 1911468
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1911471 - Posted: 7 Jan 2018, 12:45:30 UTC - in response to Message 1911468.  

After I read Richard's last msg I made a small test.

I tried to see what happens if I start BOINC with more or fewer CPU WUs, and I was able to reproduce the error, even on the SSE4.1 build.
In all cases I maintained the 4 GPU WUs.

Number of CPU WUs:
2 - works OK, no postponed error
4 - idem
6 - the error appears

So I believe timings (as posted by Richard) are playing a role here.
ID: 1911471
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1911473 - Posted: 7 Jan 2018, 12:54:51 UTC

Interesting. That's the easy part - now what do we do to cure it? Nice sunny day here, so I think I'll go out for a walk and think about it.
ID: 1911473
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1911485 - Posted: 7 Jan 2018, 14:18:23 UTC - in response to Message 1911473.  
Last modified: 7 Jan 2018, 14:22:31 UTC

It's sunny here too & hot as always, no rain. I was out too, buying some beer & shrimp for today's Sunday lunch.

Will try with the stock app, see if I can replicate the error, and post the test later.
Could anyone post the link to DL the CPU Linux 64 stock file?

The only way I can imagine to bypass this problem from my side as a user, besides babysitting and deleting the lock file each time a postponed WU appears, is a script that deletes all the CPU lock files before BOINC itself starts (a sketch of that idea follows below).
What would the impact of this be on WUs that are currently crunching? I imagine it would not be good.
Or make something that starts BOINC with fewer WUs crunching and after some time adds more WUs, but that would require babysitting.
In both cases, managing the rescheduler on/off will add additional pain.
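A minimal sketch of that pre-start cleanup idea, assuming a stock Linux install with the systemd service (the paths and unit name are assumptions; only safe while no BOINC processes are running):

    # Hedged sketch: remove stale slot lockfiles, then start the client.
    import glob, os, subprocess

    SLOTS = "/var/lib/boinc-client/slots"  # adjust for your setup

    for lock in glob.glob(os.path.join(SLOTS, "*", "boinc_lockfile")):
        os.remove(lock)  # stale leftover from the previous session

    subprocess.run(["systemctl", "start", "boinc-client"])  # or start BOINC your usual way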

But that would only bypass the startup problem, not fix it.
The real solution is to find out why the lock file is left locked when BOINC stops normally.
And as this is a very rare issue that apparently doesn't affect the vast majority of Windows hosts and appears on only very few Linux hosts, I don't believe it will be fixed soon.

Really fixing it is something only the devs of the app itself can do.
ID: 1911485
Brent Norman
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1911498 - Posted: 7 Jan 2018, 15:30:49 UTC - in response to Message 1911485.  
Last modified: 7 Jan 2018, 15:31:46 UTC

With the rescheduler, are you using the script I made? i.e.
- Stop BOINC
- Pause 10s
- Reschedule
- Pause 2s
- Start BOINC

And on that topic, why reschedule now? There is very little to move, and no advantage, with the 99.5% BLC diet we have now.
ID: 1911498