Recent error: Cannot acquire lockfile.

Message boards : Number crunching : Recent error: Cannot acquire lockfile.
Profile Bill G Special Project $75 donor
Joined: 1 Jun 01
Posts: 1282
Credit: 187,688,550
RAC: 182
United States
Message 1898783 - Posted: 3 Nov 2017, 0:08:31 UTC - in response to Message 1898372.  

That's a very valid and astute observation. We had Ruelke drop his checkpoints to 120 seconds on his TR system because he was seeing so much constant HDD activity.

I dropped mine to 120 and that only brought the activity down to 90%, so I settled on 300, which on the graph shows between 0 and 1%. Of course, access is a bit faster now with the M.2 SSD.

Sorry for the delay in responding, but I have had to see a few doctors.

SETI@home classic workunits 4,019
SETI@home classic CPU time 34,348 hours
ID: 1898783 · Report as offensive
Profile Bill G Special Project $75 donor
Joined: 1 Jun 01
Posts: 1282
Credit: 187,688,550
RAC: 182
United States
Message 1898786 - Posted: 3 Nov 2017, 0:12:55 UTC - in response to Message 1898342.  

If you take a look in any of the slots directories within the BOINC data directory, you should see a "boinc_lockfile" for each running task. BOINC checks to make sure that such a file doesn't already exist when a new task starts (or an old task restarts) so that it doesn't try to run two tasks in the same slot. I think the only time I've run into a lockfile problem is when I have a system crash and a slot doesn't get cleaned up properly following the reboot. Usually, completely shutting down BOINC, including the client, and then restarting it again, clears it up.
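
As a rough way to check this by hand, here is a minimal Python sketch that lists each slot directory and says whether it contains a "boinc_lockfile". The file name and the "slots" subdirectory come from the description above; the function name and the idea of passing the data directory as an argument are my own assumptions, and the location of the BOINC data directory varies by installation. A lockfile in the slot of a running task is normal, so this only shows where to look, not whether anything is wrong.

```python
from pathlib import Path

LOCKFILE_NAME = "boinc_lockfile"  # file name as given in the post above


def slots_with_lockfile(data_dir):
    """Map each slot directory name under data_dir/slots to True/False.

    True means a boinc_lockfile is present in that slot directory.
    data_dir is whatever your BOINC data directory is (assumption: you
    supply it; this sketch does not try to locate it for you).
    """
    slots_dir = Path(data_dir) / "slots"
    if not slots_dir.is_dir():
        return {}
    report = {}
    for slot in sorted(slots_dir.iterdir()):
        if slot.is_dir():
            report[slot.name] = (slot / LOCKFILE_NAME).exists()
    return report


if __name__ == "__main__":
    import sys

    if len(sys.argv) == 2:
        for name, locked in slots_with_lockfile(sys.argv[1]).items():
            print(f"slot {name}: {'lockfile present' if locked else 'no lockfile'}")
    else:
        print("usage: pass the BOINC data directory as the only argument")
```

If a slot still shows a lockfile while BOINC and its client are fully shut down, that would be consistent with the leftover-after-a-crash case described above.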

Your errors look odd in that the tasks ran for a while, then got the lockfile error, then ran successfully for a while again, then got the error, etc., etc. It almost appears that BOINC was, in fact, somehow running two tasks in the same slot and somehow alternating between them. I don't know how that would happen, BUT, looking at one of your successful tasks around the same time, 6129189997, it appears that you did, indeed, have a BOINC crash. The Stderr shows:
16:47:12 (12308): BOINC client no longer exists - exiting
16:47:12 (12308): timer handler: client dead, exiting

That task then restarted at 1.14 percent and concluded successfully.
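
If you want to look for the same signature in other tasks, here is a small Python sketch that scans Stderr text for the "BOINC client no longer exists" line quoted above and pulls out the timestamp and process ID. The line format is taken only from the two lines shown; the function name and the regular expression are my own, and other application versions may word the message differently.

```python
import re

# Pattern modelled on the quoted line:
# "16:47:12 (12308): BOINC client no longer exists - exiting"
CLIENT_GONE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}) \((?P<pid>\d+)\): BOINC client no longer exists"
)


def find_client_exits(stderr_text):
    """Return a list of (time, pid) tuples, one per matching Stderr line."""
    events = []
    for line in stderr_text.splitlines():
        match = CLIENT_GONE.match(line.strip())
        if match:
            events.append((match.group("time"), int(match.group("pid"))))
    return events


if __name__ == "__main__":
    sample = (
        "16:47:12 (12308): BOINC client no longer exists - exiting\n"
        "16:47:12 (12308): timer handler: client dead, exiting\n"
    )
    print(find_client_exits(sample))
```

Only the first of the two quoted lines matches the pattern, so each client exit is counted once.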

My best guess would be that when the client restarted, BOINC was somehow confused about what slots to assign to each restarted task and just got wrapped around the axle for a while. Since it doesn't seem to be recurring, I'd say that a subsequent BOINC restart has probably gotten it straightened out and you're unlikely to have problems going forward. Unless, that is, the BOINC client crashes again. If you can identify what caused that in the first place, it might be helpful in the future.

Thanks, Jeff, for that very clear explanation of how that works. I am so glad that I can keep learning about how this all works.

ID: 1898786 · Report as offensive
Profile Bill G Special Project $75 donor
Joined: 1 Jun 01
Posts: 1282
Credit: 187,688,550
RAC: 182
United States
Message 1898788 - Posted: 3 Nov 2017, 0:17:09 UTC - in response to Message 1898781.  

On my 16c/32t system I see very little disk activity running 32 CPU tasks.
I also run an additional 100 instances of BOINC for goofygrid@home with 4 apps and several of those instances also have WuProp.

So all together I have running:
BOINC instances: 101
CPU tasks: 32
NCI tasks: 431
The disk is an old 2.5" notebook HDD I tossed in to get the system running, and I have "Request tasks to checkpoint at most every" set to 60 seconds.
If running a lot of tasks at once caused enough disk activity to be a problem I would think I would run into that issue often on the system.
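
For a sense of scale, here is a back-of-the-envelope Python sketch of the most checkpoint writes per hour the setting would allow for a given number of tasks. It assumes every task checkpoints as often as the "at most every" interval permits, which is a ceiling rather than a prediction; it says nothing about how large each write is, and the function name is just for illustration.

```python
def max_checkpoints_per_hour(task_count, interval_seconds):
    """Upper bound on checkpoint writes per hour across all tasks.

    Assumes each task checkpoints once per interval, the most the
    "at most every N seconds" setting would allow.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    per_task = 3600 // interval_seconds  # whole checkpoints per task per hour
    return task_count * per_task


if __name__ == "__main__":
    # 32 CPU tasks, as in the post above, at the intervals mentioned in this thread
    for interval in (60, 120, 300):
        print(interval, max_checkpoints_per_hour(32, interval))
```

With 32 tasks, the ceiling is 1920 writes per hour at 60 seconds, 960 at 120 seconds, and 384 at 300 seconds.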

I have observed that the disk write activity of AP tasks is about 3-4 times that of MB tasks.

HAL, that is not what I observed on my system. I do not know what the difference would actually be but, as they say, YRMV.

No errors recently, fingers crossed.

ID: 1898788 · Report as offensive