Recent error: Cannot acquire lockfile.

Message boards : Number crunching : Recent error: Cannot acquire lockfile.

Bill G (Project Donor)
Joined: 1 Jun 01
Posts: 653
Credit: 111,386,374
RAC: 100,708
United States
Message 1898263 - Posted: 30 Oct 2017, 14:15:57 UTC

This error started yesterday and seems to be continuing. It is on my TR and only on the CPU. I am wondering if anyone can help me solve this issue.
Computer is: https://setiathome.berkeley.edu/results.php?hostid=8366659

SETI@home classic workunits 4,019
SETI@home classic CPU time 34,348 hours

Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 2431
Credit: 184,187,188
RAC: 358,626
United States
Message 1898282 - Posted: 30 Oct 2017, 16:26:06 UTC

Do you have an AV program scanning the BOINC ProgramData directory? Have you stopped BOINC entirely and restarted it? I assume that was the first thing you tried. Did you recently change the NUMA memory configuration in the BIOS?
Seti@Home classic workunits:20,676 CPU time:74,226 hours

Bill G (Project Donor)
Joined: 1 Jun 01
Posts: 653
Credit: 111,386,374
RAC: 100,708
United States
Message 1898287 - Posted: 30 Oct 2017, 17:04:50 UTC - in response to Message 1898282.  

I have Controlled Folder Access set to "OFF", so that is the only thing I could think of in Windows Defender, which is my AV. I have done the restarts. The only thing I have done very recently is activate BoincTasks so that it shows on my laptop. I guess I could shut that off, as I am so used to using TeamViewer that it does not add much for me. In the past I just ran BoincTasks on each computer and looked at the features when looking at the individual computer.
I am running off the M.2 SSD and have not seen anything related to that since the debacle of the mirror switch to it on the 22nd.

SETI@home classic workunits 4,019
SETI@home classic CPU time 34,348 hours

Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 2431
Credit: 184,187,188
RAC: 358,626
United States
Message 1898288 - Posted: 30 Oct 2017, 17:17:28 UTC - in response to Message 1898287.  

I don't think BoincTasks has anything to do with the issue. I run BT on my daily driver Win 7 machine to look at my cruncher farm remotely, one of which is a Win 10 cruncher. I too just use Windows Defender on that machine. By setting folder access off, do you mean you EXCLUDED the main BOINC folders in the configuration? I have excluded \Program Files\BOINC and \ProgramData\BOINC in Defender's exclusion settings. I have never had a problem with Defender other than the annoying report that some items were skipped during a scan.

You could try and eliminate BT for testing and see if anything changes.

The only time I have ever seen that lockfile error is when tasks are erroring out and BOINC immediately starts a new one, which promptly errors out, and a cascade starts. If that happened, the BOINC shutdown and restart should have fixed it, with the tasks that had previously been locked showing as Postponed in the task list. They would have slowly cleared as work finished and each postponed task could restart and finish.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

Bill G (Project Donor)
Joined: 1 Jun 01
Posts: 653
Credit: 111,386,374
RAC: 100,708
United States
Message 1898298 - Posted: 30 Oct 2017, 17:55:37 UTC - in response to Message 1898288.  

Controlled Folder Access is an option in Defender to not allow 3rd party programs to make changes to your folders. As you say, BoincTasks should not cause any problems. I have not had another error since early this morning so I will just wait and see for the moment.

SETI@home classic workunits 4,019
SETI@home classic CPU time 34,348 hours

Brent Norman
Volunteer tester
Joined: 1 Dec 99
Posts: 1823
Credit: 106,380,728
RAC: 452,063
Canada
Message 1898310 - Posted: 30 Oct 2017, 18:57:04 UTC - in response to Message 1898298.  
Last modified: 30 Oct 2017, 18:58:02 UTC

Controlled Folder Access is an option in Defender to not allow 3rd party programs to make changes to your folders
You don't want that activated.

You want to EXCLUDE the folder from being scanned/monitored by AV.

Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 2431
Credit: 184,187,188
RAC: 358,626
United States
Message 1898320 - Posted: 30 Oct 2017, 19:26:02 UTC - in response to Message 1898310.  

I agree with Brent. You are using the program incorrectly. The folder access you refer to is just to prevent third-party programs from altering or deleting the contents. It does not prevent Defender from scanning the files within and thus locking access to them. Scanning does not alter or delete the files. Go to Exclusions and add the folders I posted earlier.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

Bill G (Project Donor)
Joined: 1 Jun 01
Posts: 653
Credit: 111,386,374
RAC: 100,708
United States
Message 1898333 - Posted: 30 Oct 2017, 21:06:27 UTC - in response to Message 1898320.  

I agree with Brent. You are using the program incorrectly. The folder access you refer to is just to prevent third-party programs from altering or deleting the contents. It does not prevent Defender from scanning the files within and thus locking access to them. Scanning does not alter or delete the files. Go to Exclusions and add the folders I posted earlier.

I do not agree that I did not understand what the program was used for, and I was not using the program at all. (I included my setting for this to show that I had eliminated it as well. You are correct that I do not know exactly when the program functions.)
As for Defender scanning the BOINC folder, the last scan took place on the 26th, so it had nothing to do with these errors. I did include the BOINC folder in the exclusion list.
WAIT, I totally forgot that I had set the computer to back up to my server... but I just checked, and that has been going on since the 27th, so it does not seem to apply, as the errors have only occurred on two days.

SETI@home classic workunits 4,019
SETI@home classic CPU time 34,348 hours

Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 2431
Credit: 184,187,188
RAC: 358,626
United States
Message 1898339 - Posted: 30 Oct 2017, 21:30:41 UTC - in response to Message 1898333.  
Last modified: 30 Oct 2017, 21:31:52 UTC

OK, I didn't phrase that correctly. The program option was being used incorrectly or not understood fully. As long as Defender is active, and it is by default, it is constantly monitoring all HDDs, network activity, and app launches, looking for viruses, Trojans, and suspicious activity, unless you opt out of the service. Defender is a suite of functions, not just an AV scanner.

Not sure whether your reply meant that you did in fact have the BOINC folders excluded or just recently added them to the exclusion list.

Not sure what other suggestions to try. I am still very fuzzy on the whole NUMA-Not NUMA memory control of TR and Epyc and how having two different memory controllers active at the same time might impact locking of a file. I would think that issue wouldn't occur since that must have been engineered for servers a very long time ago since I believe NUMA has been around a long time before TR and Epyc came on the scene.

It could have just been some hiccup in BOINC that took a specific set of circumstances to trigger. If you cannot trigger the issue on command repeatedly, finding an intermittent problem is a bear.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

Bill G (Project Donor)
Joined: 1 Jun 01
Posts: 653
Credit: 111,386,374
RAC: 100,708
United States
Message 1898341 - Posted: 30 Oct 2017, 21:37:25 UTC - in response to Message 1898339.  

Sorry to have misinterpreted what you were saying. I have added BOINC to the exclusions and we can only hope the problem will not manifest itself again. I know how difficult a transient problem can be. It was just strange that so many errors occurred in that space of time and then seem to have stopped. I hope they have stopped....only time will tell.
Thanks for the help.

SETI@home classic workunits 4,019
SETI@home classic CPU time 34,348 hours

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1274
Credit: 133,886,779
RAC: 243,917
United States
Message 1898342 - Posted: 30 Oct 2017, 21:39:55 UTC

If you take a look in any of the slots directories within the BOINC data directory, you should see a "boinc_lockfile" for each running task. BOINC checks to make sure that such a file doesn't already exist when a new task starts (or an old task restarts) so that it doesn't try to run two tasks in the same slot. I think the only time I've run into a lockfile problem is when I have a system crash and a slot doesn't get cleaned up properly following the reboot. Usually, completely shutting down BOINC, including the client, and then restarting it again, clears it up.
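To see which slots currently hold a lockfile, something like the following rough Python sketch could be used. It only assumes what is described above: a "slots" directory inside the BOINC data directory, with a "boinc_lockfile" in each slot that has a running task. The function name and the C:\ProgramData\BOINC path are illustrative assumptions, not anything BOINC itself provides.

```python
from pathlib import Path


def slots_with_lockfile(data_dir):
    """Return the names of slot directories that contain a boinc_lockfile, sorted by name."""
    slots_root = Path(data_dir) / "slots"
    if not slots_root.is_dir():
        # No slots directory at this location, so there is nothing to report.
        return []
    return sorted(
        slot.name
        for slot in slots_root.iterdir()
        if slot.is_dir() and (slot / "boinc_lockfile").exists()
    )


if __name__ == "__main__":
    # Assumed default Windows data directory; change it to match your install.
    for name in slots_with_lockfile(r"C:\ProgramData\BOINC"):
        print("slot", name, "has a lockfile")
```

If a slot still shows a lockfile after BOINC and the client have been shut down completely, that would fit the leftover-after-a-crash case described above.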

Your errors look odd in that the tasks ran for a while, then got the lockfile error, then ran successfully for a while again, then got the error, and so on. It almost appears that BOINC was, in fact, somehow running two tasks in the same slot and alternating between them. I don't know how that would happen, BUT, looking at one of your successful tasks around the same time, 6129189997, it appears that you did indeed have a BOINC crash. The Stderr shows:
16:47:12 (12308): BOINC client no longer exists - exiting
16:47:12 (12308): timer handler: client dead, exiting

That task then restarted at 1.14 percent and concluded successfully.

My best guess would be that when the client restarted, BOINC was somehow confused about what slots to assign to each restarted task and just got wrapped around the axle for a while. Since it doesn't seem to be recurring, I'd say that a subsequent BOINC restart has probably gotten it straightened out and you're unlikely to have problems going forward. Unless, that is, the BOINC client crashes again. If you can identify what caused that in the first place, it might be helpful in the future.

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1274
Credit: 133,886,779
RAC: 243,917
United States
Message 1898344 - Posted: 30 Oct 2017, 22:02:27 UTC

Another possibility is simply that your system resources were overcommitted. I was just looking through an old Process Monitor log for one of my machines and found that BOINC apparently polls the lockfiles every 5 minutes, just to make sure they're still there. Perhaps if it takes too long to get a response from the system, it kicks out the messages shown in your errors. How many of those 32 HT cores are running tasks at the same time?

Bill G (Project Donor)
Joined: 1 Jun 01
Posts: 653
Credit: 111,386,374
RAC: 100,708
United States
Message 1898348 - Posted: 30 Oct 2017, 22:18:55 UTC - in response to Message 1898344.  

Another possibility is simply that your system resources were overcommitted. I was just looking through an old Process Monitor log for one of my machines and found that BOINC apparently polls the lockfiles every 5 minutes, just to make sure they're still there. Perhaps if it takes too long to get a response from the system, it kicks out the messages shown in your errors. How many of those 32 HT cores are running tasks at the same time?

2 for the 2 GPUs, and then the remaining 30 each running its own WU.
Thanks for that analysis and your insight. To be honest, I cannot remember if anything happened on the computer prior to this. I am trying to just get the system stable right now. I have tried the autotuning of the setup, but I do not think I started BOINC, as the system started to lock up prior to my doing anything.

SETI@home classic workunits 4,019
SETI@home classic CPU time 34,348 hours

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1274
Credit: 133,886,779
RAC: 243,917
United States
Message 1898353 - Posted: 30 Oct 2017, 22:42:35 UTC - in response to Message 1898348.  

Yeah, 30 concurrent CPU tasks might be a bit much. You might be able to glean some information from the BOINC event log (stored in the BOINC data directory as stdoutdae.txt) for that time period around 16:47:12 yesterday, when the client seems to have gone AWOL.

I've also noticed that, although the tasks that ended with errors displayed one or more lockfile related messages, it was ultimately the "finish file present too long" which did them in. These sorts of errors also often happen due to an overburdened system. The "finish file" is a rather odd method used for the science app to notify BOINC that processing is completed for a task. Unfortunately, BOINC only allows about 10 seconds to read the file. If anything causes a delay of more than 10 seconds, BOINC simply trashes the task. Of course, usually just a fraction of a second is all that's needed, but if another process is hogging system resources right about that time, the "finish file present too long" error is the result.
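For digging through the event log, a small keyword filter along these lines might save some scrolling. This is only a sketch: the function name, the keyword list, and the C:\ProgramData\BOINC path are assumptions for illustration, and it makes no assumption about the log's line format beyond it being plain text.

```python
from pathlib import Path


def find_log_lines(lines, keywords):
    """Return (line_number, text) pairs for lines containing any keyword, ignoring case."""
    lowered = [keyword.lower() for keyword in keywords]
    hits = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if any(keyword in text.lower() for keyword in lowered):
            hits.append((number, text))
    return hits


if __name__ == "__main__":
    # Assumed default Windows location of the event log; adjust as needed.
    log_path = Path(r"C:\ProgramData\BOINC") / "stdoutdae.txt"
    if log_path.exists():
        with open(log_path, errors="replace") as log:
            for number, text in find_log_lines(log, ["lockfile", "finish file"]):
                print(number, text)
```

The line numbers make it easy to jump to the surrounding entries in a text editor and see what else was happening at that moment.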

With 32 HT cores, Windows 10, the latest NVIDIA driver, and the latest and greatest (?) version of BOINC, it looks like you're right out there on the bleeding edge, so it could well be this won't be the last time you run into some of these sorts of errors. Good luck to ya! ;^)

Wiggo "Socialist"
Joined: 24 Jan 00
Posts: 12601
Credit: 169,284,445
RAC: 86,532
Australia
Message 1898356 - Posted: 30 Oct 2017, 22:53:21 UTC

I would recommend cutting the number of CPU tasks down to 28, but you may get away with 29. ;-)

Cheers.

Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 2431
Credit: 184,187,188
RAC: 358,626
United States
Message 1898364 - Posted: 30 Oct 2017, 23:47:01 UTC - in response to Message 1898356.  

I don't like pushing a system to its maximum capabilities. I like a little bit of buffer to maintain normal desktop housekeeping without background processes getting an overly long wait to be serviced. I too think 30 MB CPU tasks running with the last two CPU cores basically fully engaged supporting the OpenCL SoG app might be stressing the system a bit too far into the danger zone.
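If the buffer is set through BOINC's "Use at most N% of the CPUs" computing preference, the percentage for a given thread count is simple arithmetic. A rough sketch follows; the function name is made up for illustration, and how BOINC rounds the result or accounts for CPU time reserved by GPU tasks is not covered here, so treat the numbers as a starting point only.

```python
def cpu_percent_for(threads_wanted, threads_total):
    """Percentage of threads_total that corresponds to threads_wanted."""
    if not 0 < threads_wanted <= threads_total:
        raise ValueError("threads_wanted must be between 1 and threads_total")
    return 100.0 * threads_wanted / threads_total


# The thread counts discussed in this thread, on a 32-thread machine.
for wanted in (30, 29, 28):
    print(wanted, "of 32 threads ->", cpu_percent_for(wanted, 32), "%")
```

For example, 28 of 32 threads works out to 87.5%.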

You'll have to continue tuning and observing. You are living on the bleeding edge with very new hardware in the market and not much history to compare against.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

EdwardPF
Volunteer tester
Joined: 26 Jul 99
Posts: 341
Credit: 129,110,832
RAC: 82,778
United States
Message 1898367 - Posted: 31 Oct 2017, 0:42:33 UTC

Could checkpointing every 60 sec for 32 tasks be a problem? Once the disk gets behind, the backlog can cascade (like what happened with Malaria@home with a small number of tasks).

Ed F

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1274
Credit: 133,886,779
RAC: 243,917
United States
Message 1898371 - Posted: 31 Oct 2017, 1:25:20 UTC - in response to Message 1898367.  
Last modified: 31 Oct 2017, 1:44:50 UTC

Could checkpointing every 60 sec for 32 tasks be a problem? Once the disk gets behind, the backlog can cascade (like what happened with Malaria@home with a small number of tasks).

Ed F

I imagine that could be a consideration. The writing of the state.sah and boinc_task_state.xml files should be almost instantaneous. Again, looking at one of my old Process Monitor logs, it only took 0.0016416 and 0.0010486 seconds, respectively, for one particular checkpoint for a single task. On the other hand, the additional overhead added by the AV software to give its blessing to the updated files could be significant. In that example, Microsoft Essentials added 0.9972535 and 0.9985869 seconds, respectively. Those are run time numbers, and don't necessarily reflect actual CPU time, but some overhead is there, certainly.
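To put those figures in perspective, here is a back-of-the-envelope sketch in Python. It assumes the 30 concurrent CPU tasks mentioned earlier in the thread, one checkpoint per task per 60-second interval, and, as a deliberately pessimistic simplification, that nothing overlaps. The per-file times are the ones from my single-task log on a different machine, so this is an illustration, not a measurement of the TR system.

```python
def checkpoint_seconds_per_round(tasks, per_file_seconds):
    """Total elapsed seconds if every task checkpoints once and no work overlaps."""
    return tasks * sum(per_file_seconds)


TASKS = 30  # concurrent CPU tasks reported earlier in the thread

# Write times for the two checkpoint files, from the Process Monitor log.
write_times = [0.0016416, 0.0010486]
# Additional time the AV software added for each of those files.
av_times = [0.9972535, 0.9985869]

write_only = checkpoint_seconds_per_round(TASKS, write_times)
with_av = checkpoint_seconds_per_round(
    TASKS, [w + a for w, a in zip(write_times, av_times)]
)

print(f"writes only:  {write_only:.3f} s per checkpoint round")
print(f"with AV scan: {with_av:.3f} s per checkpoint round")
```

Under those assumptions the writes alone add up to well under a tenth of a second per round, while the AV overhead adds up to nearly the whole 60-second interval. In practice the scans would overlap across cores, so the real impact should be smaller, but it shows why the AV overhead is the part worth looking at.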

EDIT: Corrected the time for the update of the boinc_task_state file.

Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 2431
Credit: 184,187,188
RAC: 358,626
United States
Message 1898372 - Posted: 31 Oct 2017, 1:27:23 UTC - in response to Message 1898367.  

That's a very valid and astute observation. We had Ruelke drop his checkpoints to 120 seconds on his TR system because he was seeing so much constant HDD activity.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6466
Credit: 175,830,032
RAC: 55,293
United States
Message 1898781 - Posted: 2 Nov 2017, 23:13:43 UTC

On my 16c/32t system I see very little disk activity running 32 CPU tasks.
I also run an additional 100 instances of BOINC for goofygrid@home with 4 apps and several of those instances also have WuProp.

So all together I have running:
BOINC instances: 101
CPU tasks: 32
NCI tasks: 431
The disk is an old 2.5" notebook HDD I tossed in to get the system running, and I have "Request tasks to checkpoint at most every" set to 60 seconds.
If running a lot of tasks at once caused enough disk activity to be a problem, I would think I would run into that issue often on this system.

I have observed that the disk write activity of AP tasks is about 3-4 times that of MB tasks.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group today!


 
©2017 University of California
 