Postponed: Waiting to acquire lock

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 Message 1910805 - Posted: 5 Jan 2018, 14:20:42 UTC Last modified: 5 Jan 2018, 14:33:18 UTC
My cache is almost empty due to the server problem, but the last 4 WUs refuse to crunch: they start and then stop after a few seconds. Nothing else is running on the CPU/GPU of the host, so all the cores are available to crunch. This is what the stderr.txt of one of them shows.
WU: 22ap08ac.1843.7248.16.43.27.vlar.1
Not using mb_cmdline.txt-file, using commandline options.
Build features: SETI8 Non-graphics FFTW FFTOUT JSPF AVX2 64bit
System: Linux x86_64 Kernel: 4.10.0-42-generic
CPU : Intel(R) Core(TM) i7-6850K CPU @ 3.60GHz
12 core(s), Speed : 3579.565 MHz
L1 : 64 KB, Cache : 15360 KB
Features : FPU TSC PAE APIC MTRR MMX SSE SSE2 HT PNI SSSE3 SSE4_1 SSE4_2 AVX AVX2
ar=0.011995 NumCfft=146001 NumGauss=0 NumPulse=50177167232 NumTriplet=67971049376
In v_BaseLineSmooth: NumDataPoints=1048576, BoxCarLength=8192, NumPointsInChunk=32768
Linux optimized setiathome_v8 application
Version info: AVX2jf (Intel, Core 2-optimized v8-nographics) V5.13 by Alex Kan
AVX2jf Linux64 Build 3712 , Ported by : Raistmer, JDWhale, Urs Echternacht
Work Unit Info:
...............
Credit multiplier is : 2.85
WU true angle range is : 0.011995
05:49:22 (1270): Can't acquire lockfile (-154) - waiting 35s
05:49:57 (1270): Can't acquire lockfile (-154) - exiting
05:51:55 (2850): Can't acquire lockfile (-154) - waiting 35s
05:52:30 (2850): Can't acquire lockfile (-154) - exiting
. . .
ID: 1910805 ·
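The retry pattern in that stderr output can be pulled apart with a few lines of Python. This is only a sketch for reading the log lines quoted above: the regular expression and the `lock_attempts` helper are illustrative names, not part of BOINC or the SETI@home apps. It groups the messages by process ID and measures the time between the first and last message of each launch.

```python
import re
from datetime import datetime

# Matches lines such as:
#   05:49:22 (1270): Can't acquire lockfile (-154) - waiting 35s
LOCK_RE = re.compile(
    r"(\d{2}:\d{2}:\d{2}) \((\d+)\): Can't acquire lockfile "
    r"\((-?\d+)\) - (waiting \d+s|exiting)"
)

# The four lines quoted in the post above.
STDERR = """05:49:22 (1270): Can't acquire lockfile (-154) - waiting 35s
05:49:57 (1270): Can't acquire lockfile (-154) - exiting
05:51:55 (2850): Can't acquire lockfile (-154) - waiting 35s
05:52:30 (2850): Can't acquire lockfile (-154) - exiting"""


def lock_attempts(text):
    """Group lockfile messages by process ID (hypothetical helper)."""
    attempts = {}
    for match in LOCK_RE.finditer(text):
        stamp, pid, code, action = match.groups()
        when = datetime.strptime(stamp, "%H:%M:%S")
        attempts.setdefault(int(pid), []).append((when, int(code), action))
    return attempts


attempts = lock_attempts(STDERR)
# Seconds between the first and last message of each launch.
waits = {
    pid: (events[-1][0] - events[0][0]).total_seconds()
    for pid, events in attempts.items()
}
print(waits)  # {1270: 35.0, 2850: 35.0}
```

Each launch is a separate process (a different PID) that waits the announced 35 seconds once and then exits, which matches the "start and then stop" behaviour described in the post.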

TBar Volunteer tester Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 1910824 - Posted: 5 Jan 2018, 16:00:17 UTC - in response to Message 1910805. I've seen something similar with the CPU App 3711 running on my Mac with a couple BLC14 tasks. For some reason those two tasks refused to run using the CPU App. I just changed the client_state.xml so they would run using the CUDA App and they ran without a problem. The only thing I can think of is there is something about the latest code AKv8 3710 causing this. Fortunately it's only happened twice so far. ID: 1910824 ·

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 Message 1910826 - Posted: 5 Jan 2018, 16:15:26 UTC Last modified: 5 Jan 2018, 16:17:38 UTC
OK, maybe it is something in the code; I will keep an eye on that. BTW my WUs were common Arecibo VLARs, not BLC, and each WU stopped running at a different % progress. I can't reschedule them to the GPU to try anymore: after my cache ran empty I shut down the host, and when I turned it on again the 4 tasks went directly to compute error.
-185 (0xFFFFFF47) ERR_RESULT_START
Stderr output
<core_client_version>7.8.3</core_client_version>
<![CDATA[
<message>
couldn't start app: Can't get shared memory segment name: can't get shared mem segment name</message>
]]>
Weird. I don't believe anything is wrong with the host, since it crunches 1000's of WUs without any error. And now I have left it crunching E@H WUs with no errors either. The system monitor says I'm not using even 20% of the available memory, and 0% of swap is in use. I just run BOINC and do some browsing on the host.
ID: 1910826 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1910828 - Posted: 5 Jan 2018, 16:21:24 UTC You shouldn't be using a shared memory segment - we switched to using memory mapped files 6, 7, 8 years ago (I forget exactly when). That error message *could* (very, very, provisionally) be caused by a badly-formed app_info.xml file. Talk to your developer/supplier, and if they don't understand what you're talking about, tell them to talk to me - but give me time to dig out the archaeology first. ID: 1910828 ·

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 Message 1910846 - Posted: 5 Jan 2018, 18:06:44 UTC - in response to Message 1910828.
@Richard Please forgive my ignorance, but I really didn't understand anything in your message. My host crunches 1000's of WUs each day and more than 100 of them are CPU tasks, so why do only these 4 show this problem? I'm a lone old man from the DOS era and don't know anybody here who has any idea how BOINC and all the rest of the stuff works. So there is no way I could ask my developer/supplier, since I don't have one. Anyway, thanks for your time/help.
This is my app_info.xml; I don't see anything weird in the CPU part of it. Maybe your better-trained eyes can find something.
<app_info>
  <app>
    <name>setiathome_v8</name>
  </app>
  <file_info>
    <name>setiathome_x41p_zi3v_x86_64-pc-linux-gnu_cuda90</name>
    <executable/>
  </file_info>
  <file_info>
    <name>libcudart.so.9.0</name>
  </file_info>
  <file_info>
    <name>libcufft.so.9.0</name>
  </file_info>
  <app_version>
    <app_name>setiathome_v8</app_name>
    <platform>x86_64-pc-linux-gnu</platform>
    <version_num>801</version_num>
    <plan_class>cuda90</plan_class>
    <coproc>
      <type>NVIDIA</type>
      <count>1</count>
    </coproc>
    <avg_ncpus>1</avg_ncpus>
    <max_ncpus>1</max_ncpus>
    <file_ref>
      <file_name>setiathome_x41p_zi3v_x86_64-pc-linux-gnu_cuda90</file_name>
      <main_program/>
    </file_ref>
    <file_ref>
      <file_name>libcudart.so.9.0</file_name>
    </file_ref>
    <file_ref>
      <file_name>libcufft.so.9.0</file_name>
    </file_ref>
  </app_version>
  <app>
    <name>astropulse_v7</name>
  </app>
  <file_info>
    <name>astropulse_7.08_x86_64-pc-linux-gnu__opencl_nvidia_100</name>
    <executable/>
  </file_info>
  <file_info>
    <name>AstroPulse_Kernels_r2751.cl</name>
  </file_info>
  <file_info>
    <name>ap_cmdline_7.08_x86_64-pc-linux-gnu__opencl_nvidia_100.txt</name>
  </file_info>
  <app_version>
    <app_name>astropulse_v7</app_name>
    <platform>x86_64-pc-linux-gnu</platform>
    <version_num>708</version_num>
    <plan_class>opencl_nvidia_100</plan_class>
    <coproc>
      <type>NVIDIA</type>
      <count>1</count>
    </coproc>
    <avg_ncpus>1</avg_ncpus>
    <max_ncpus>1</max_ncpus>
    <file_ref>
      <file_name>astropulse_7.08_x86_64-pc-linux-gnu__opencl_nvidia_100</file_name>
      <main_program/>
    </file_ref>
    <file_ref>
      <file_name>AstroPulse_Kernels_r2751.cl</file_name>
    </file_ref>
    <file_ref>
      <file_name>ap_cmdline_7.08_x86_64-pc-linux-gnu__opencl_nvidia_100.txt</file_name>
      <open_name>ap_cmdline.txt</open_name>
    </file_ref>
  </app_version>
  <app>
    <name>setiathome_v8</name>
  </app>
  <file_info>
    <name>MBv8_8.22r3712_avx2_x86_64-pc-linux-gnu</name>
    <executable/>
  </file_info>
  <app_version>
    <app_name>setiathome_v8</app_name>
    <platform>x86_64-pc-linux-gnu</platform>
    <version_num>800</version_num>
    <file_ref>
      <file_name>MBv8_8.22r3712_avx2_x86_64-pc-linux-gnu</file_name>
      <main_program/>
    </file_ref>
  </app_version>
  <app>
    <name>astropulse_v7</name>
  </app>
  <file_info>
    <name>ap_7.05r2728_sse3_linux64</name>
    <executable/>
  </file_info>
  <app_version>
    <app_name>astropulse_v7</app_name>
    <version_num>704</version_num>
    <platform>x86_64-pc-linux-gnu</platform>
    <plan_class></plan_class>
    <file_ref>
      <file_name>ap_7.05r2728_sse3_linux64</file_name>
      <main_program/>
    </file_ref>
  </app_version>
</app_info>
ID: 1910846 ·

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1910855 - Posted: 5 Jan 2018, 18:24:49 UTC BOINC is having troubles accessing the CUDA library. Reboot the computer. Problem solved. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1910855 ·

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 Message 1910857 - Posted: 5 Jan 2018, 18:25:55 UTC - in response to Message 1910855. Last modified: 5 Jan 2018, 18:33:26 UTC "BOINC is having troubles accessing the CUDA library. Reboot the computer. Problem solved." CUDA? Those were CPU WUs. Yes, a reboot solves the problem... by crashing all 4. LOL ID: 1910857 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 Message 1910877 - Posted: 5 Jan 2018, 19:52:35 UTC - in response to Message 1910846.
"@Richard I'm a lone old man from the DOS era ..."
I'm in much the same state myself, except the first language I received formal training in was Algol 60 - mainframes, long before the DOS era! But I try to keep the little grey cells active in my retirement by reading what I can.
OK, to business. The various optimised applications that our developer friends write are designed to work in the BOINC environment - in fact, that's the only point of them: they are useless anywhere else. So it's important that the applications communicate with BOINC - telling BOINC how far they've got, listening for instructions about when to pause or shut down, that sort of thing.
The rather telegraphic technical phrases I quoted in my last message - "shared memory segment" and "memory mapped files" - refer to two alternative mechanisms for handling that chatter between the application and BOINC. We actually switched from the first to the second 10 years ago (I now find) - how time flies.
But 10 years ago BOINC was still being developed 'properly' - with care and attention to detail, ensuring both forward and backward compatibility. Both communication methods could co-exist. But the applications need to know which technique to use in any given situation. That's done by passing a piece of information known as the "API version" in - in this case - the app_info.xml file. You don't have that, so the applications will assume that they're working in an environment that's over 10 years old. I know nothing about Linux, but maybe they've abandoned shared memory in the intervening 10 years, too. That *could* account for your error message (no promises).
But this is worth trying. Stick an extra line
<api_version>6.1.0</api_version>
into the bit of app_info which describes the application that's having problems - look at the documentation for Anonymous Platform to see how it fits into the format. In due course, the sections for the other applications could probably use the same thing, but only change one at a time. Fortunately, the API version rarely affects this process, and the next change didn't come until v7.5.0 (to support Bitcoin mining), so the numbers don't matter - it just has to be "at least 6.0", according to line 166 of app_start.cpp.
Then stop BOINC and restart it. The new line will only make a difference for new tasks as they start running for the first time: anything which is already trying to run and showing 'postponed' has probably been lost already, so don't worry about them: watch the next one as it starts.
If the patch makes a difference, and the tasks run properly, please ask the developer to read what I've written above and modify their supplied app_info.xml files accordingly.
ID: 1910877 ·
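For anyone who would rather script the change than edit the file by hand, the following is a minimal Python sketch of the edit Richard describes, applied to a trimmed copy of the CPU section of Juan's app_info.xml. The helper name `add_api_version` and the choice to place the new element directly after `<version_num>` are assumptions made for illustration. The value 6.1.0 is the one suggested in the post, and the Anonymous Platform documentation remains the authority on the exact format.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the CPU multibeam section from the app_info.xml above.
APP_INFO = """<app_info>
  <app>
    <name>setiathome_v8</name>
  </app>
  <file_info>
    <name>MBv8_8.22r3712_avx2_x86_64-pc-linux-gnu</name>
    <executable/>
  </file_info>
  <app_version>
    <app_name>setiathome_v8</app_name>
    <platform>x86_64-pc-linux-gnu</platform>
    <version_num>800</version_num>
    <file_ref>
      <file_name>MBv8_8.22r3712_avx2_x86_64-pc-linux-gnu</file_name>
      <main_program/>
    </file_ref>
  </app_version>
</app_info>"""


def add_api_version(xml_text, app_name, version_num, api_version="6.1.0"):
    """Add <api_version> to one matching <app_version> (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    for app_version in root.findall("app_version"):
        if app_version.findtext("app_name") != app_name:
            continue
        if app_version.findtext("version_num") != str(version_num):
            continue
        if app_version.find("api_version") is not None:
            continue  # already patched, leave it alone
        anchor = app_version.find("version_num")
        new_elem = ET.Element("api_version")
        new_elem.text = api_version
        new_elem.tail = anchor.tail  # keep the existing indentation
        position = list(app_version).index(anchor) + 1
        app_version.insert(position, new_elem)
    return ET.tostring(root, encoding="unicode")


patched = add_api_version(APP_INFO, "setiathome_v8", 800)
print(patched)
```

Matching on both `app_name` and `version_num` keeps the change confined to one application section, in line with the advice to change only one at a time. As the post says, BOINC has to be stopped and restarted afterwards, and only tasks that start fresh will pick up the new setting.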

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 Message 1910886 - Posted: 5 Jan 2018, 20:26:46 UTC Added the line, let's see if that solves the problem. BTW the first language I learned was Fortran; that was 35 or 40 years ago, or maybe more... I'm too old to learn. Thanks for your help & detailed explanation. ID: 1910886 ·

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 Message 1910888 - Posted: 5 Jan 2018, 20:53:42 UTC - in response to Message 1910857. "BOINC is having troubles accessing the CUDA library. Reboot the computer. Problem solved." "CUDA? Those were CPU WUs. Yes, a reboot solves the problem... by crashing all 4. LOL" Sorry, the only time I ever experienced that error was when BOINC for some reason couldn't find the CUDA libraries. I rebooted the machine and have never seen it again. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1910888 ·

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1910890 - Posted: 5 Jan 2018, 20:57:19 UTC Looks like Richard supplied you with great information about the history of the apps and a solution for your CPU tasks in app_info. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1910890 ·

Jeff Buck Volunteer tester Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 Message 1910891 - Posted: 5 Jan 2018, 20:59:34 UTC Last modified: 5 Jan 2018, 21:07:03 UTC
I'm wondering if there might be a simpler explanation for the issue, namely a BOINC client or system crash and restart that left a lockfile in a slot that prevented a new task from using that slot. I've seen that happen on my own machines on rare occasions and, if I recall correctly, a simple clean shutdown and restart of BOINC would clear it up.
Going all the way back to 2014, that message was also a symptom of a Zombie task issue, where the BOINC client died but left AP tasks running. Here's a snip from a message I posted then:
Running on device number: 0
DATA_CHUNK_UNROLL at default:2
DATA_CHUNK_UNROLL at default:2
16:32:27 (1464): Can't acquire lockfile (32) - waiting 35s
16:33:02 (1464): Can't acquire lockfile (32) - exiting
16:33:02 (1464): Error: The process cannot access the file because it is being used by another process. (0x20)
The main difference here is the final line, which Juan's task doesn't show.
Another possibility is simply an overcommitment of resources. Bill G had a thread a couple months ago, "Recent error: Cannot acquire lockfile", where he was getting the same message on a recurring basis, but tasks were intermittently suspending and then resuming. A BOINC client crash also seemed to immediately precede the onset of his problem. When I raised the possible overcommitment issue...
"I was just looking through an old Process Monitor log for one of my machines and found that BOINC apparently polls the lockfiles every 5 minutes, just to make sure they're still there. Perhaps if it takes too long to get a response from the system, it kicks out the messages shown in your errors. How many of those 32 HT cores are running tasks at the same time?"
...he responded:
"2 for the 2 GPUs and then the remaining 30 each running its own WU"
...so that certainly looked like a contributing factor in his case.
So, Juan, how many concurrent tasks are you running on that machine and do you know if you had a BOINC or system crash shortly before the problem first showed up?
EDIT: Okay, reviewing your original post, I see that those were the only tasks you had running at the time, so overcommitment doesn't look like the source of your issue.
ID: 1910891 ·
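The leftover-lockfile theory can be checked with a short script. The sketch below rests on two assumptions that are not stated in this thread: that each task runs in a numbered directory under `slots/` in the BOINC data directory, and that the lockfile is named `boinc_lockfile`. Adjust both if your installation differs. The demo builds a throwaway directory tree rather than touching a real BOINC installation.

```python
import tempfile
from pathlib import Path

# Assumption: name of the per-slot lockfile; verify on your own host.
LOCKFILE_NAME = "boinc_lockfile"


def find_slot_lockfiles(boinc_data_dir):
    """Return lockfiles found in slots/<n>/ under the given data directory."""
    slots = Path(boinc_data_dir) / "slots"
    return sorted(p for p in slots.glob("*/" + LOCKFILE_NAME) if p.is_file())


# Demo on a temporary tree: slot 0 is clean, slot 1 holds a leftover lockfile.
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "slots" / "0").mkdir(parents=True)
    (Path(demo_dir) / "slots" / "1").mkdir(parents=True)
    (Path(demo_dir) / "slots" / "1" / LOCKFILE_NAME).touch()
    found = [p.parent.name for p in find_slot_lockfiles(demo_dir)]

print(found)  # ['1']
```

A lockfile in a slot is expected while a task is running there, so a result from this check would only point to a leftover if it is run while the BOINC client and all science applications are fully stopped.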

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 Message 1910898 - Posted: 5 Jan 2018, 21:12:01 UTC - in response to Message 1910890. "Looks like Richard supplied you with great information about the history of the apps and a solution for your CPU tasks in app_info." If it works! But at least it's a plausible - and causal - suggestion. ID: 1910898 ·

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 Message 1910899 - Posted: 5 Jan 2018, 21:30:47 UTC Last modified: 5 Jan 2018, 21:32:25 UTC This could be a very long shot, but I remember something: I have seen this error only 2 times, and in both cases it was when the host reached the end of its cache of WUs. My host runs 4 GPU + 4 CPU WUs (I use <project_max_concurrent>8</project_max_concurrent>), so you would not expect overcommitment, but there is one point to add. Since E@H is my backup project and, as Richard explained, the line is read only at the beginning of the crunch, maybe something hidden in the code crashes when the S@H GPU WUs end and the host starts up to 8 CPU WUs while it starts to download new E@H data. Or when it starts the E@H GPU WUs while it has not fully stopped the additional 4 S@H CPU WUs. I'm talking about internal timing or something like that. Just an idea. Anyway, I added the suggested line and will see if it happens again in the next outage. BTW I use app_config in E@H too, so it keeps a maximum of 8 active WUs at any time (up to 4 GPU WUs in E@H + 4 S@H CPU WUs too). ID: 1910899 ·

Keith Myers Volunteer tester Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873	Message 1910903 - Posted: 5 Jan 2018, 21:35:32 UTC Remember that every Einstein GPU task finishes up its calculation by switching over to a CPU core at 89% completion. If the E@H task was finishing up on a cpu core just as Seti@home cpu tasks occupied 4 cpu cores, that might have been the causation. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) ID: 1910903 ·

juan BFP Volunteer tester Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 Message 1910909 - Posted: 5 Jan 2018, 21:41:37 UTC - in response to Message 1910903. Last modified: 5 Jan 2018, 21:47:03 UTC My CPU has 12 threads (6 cores). So in theory I have 4 extra CPU threads available even when it runs 4 E@H + 4 S@H. So I would not expect any overcommitment even when that happens. Something I never understood, and maybe this is the time to ask: when they say each S@H CPU WU uses 1 core, does it use one physical core or one thread? By analogy, when they say each E@H GPU WU uses one core, does it use one physical core or one thread? <edit> I know I ask a lot of questions, so sorry. But it's always nice to learn new things each day. ID: 1910909 ·

Zalster Volunteer tester Send message Joined: 27 May 99 Posts: 5517 Credit: 528,817,460 RAC: 242	Message 1910914 - Posted: 5 Jan 2018, 21:50:30 UTC - in response to Message 1910909. I think what we really mean is 1 thread... Laziness on our part when we say core. The question is whether a physical thread vs virtual core crunches faster. On some projects, physical thread is faster. Not so much here. ID: 1910914 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304 Message 1910918 - Posted: 5 Jan 2018, 21:53:02 UTC - in response to Message 1910909. Last modified: 5 Jan 2018, 22:04:32 UTC
"My CPU has 12 threads (6 cores). So in theory I have 4 extra CPU threads available even when it runs 4 E@H + 4 S@H. So I would not expect any overcommitment even when that happens. Something I never understood, and maybe this is the time to ask: when they say each S@H CPU WU uses 1 core, does it use one physical core or one thread? By analogy, when they say each E@H GPU WU uses one core, does it use one physical core or one thread?"
Thread (Threads in this instance not to be confused with Threads in the Task Manager Process\Threads\Handles case). Hyperthreading is really a virtual Core, so saying each WU requires a Core makes sense. So 6 cores, 12 threads means it's possible to run 12 WU instances (shared between CPU & GPU, as even the GPU WUs require some CPU support - in the case of SoG it's 1 Core (thread) for each GPU WU being processed). My i7 is 4 cores\8 threads. I run all threads with no overcommitment issues as I have reserved 1 CPU core (thread) for each GPU WU being processed. When I run out of GPU work, the released CPU core (thread) then starts on a CPU WU.
EDIT- The biggest hint for an overloaded system is when there is a significant discrepancy between the CPU time & Run time for a given CPU WU. On my i7 system the difference is generally 3min or less. I've seen some systems where the difference is over an hour, e.g. my C2D when processing Arecibo WUs. The GPU requires much more CPU support on those WUs (it's running CUDA50). It's got 2 GPUs running 2 WUs each. With only 2 Cores available, the need for CPU support significantly reduces the available CPU time for crunching CPU WUs. However, when processing GBT work, the CPU support required is significantly reduced and so CPU time & run times are much closer.
Grant Darwin NT ID: 1910918 ·
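Grant's rule of thumb, comparing run time with CPU time for a CPU task, is easy to turn into a quick check. The sketch below is illustrative only: the function name, the sample figures, and the default 180-second threshold (taken from the "3min or less" he reports for his own i7) are assumptions, and a sensible threshold will differ from host to host.

```python
def looks_overloaded(run_time_s, cpu_time_s, max_gap_s=180.0):
    """Flag a CPU task whose wall-clock time far exceeds its CPU time.

    max_gap_s defaults to 3 minutes, the gap quoted for one healthy host;
    treat it as a starting point, not a rule.
    """
    return (run_time_s - cpu_time_s) > max_gap_s


# Hypothetical (run time, CPU time) pairs in seconds, for illustration only.
tasks = {
    "healthy": (5400.0, 5300.0),  # gap of 100 s
    "starved": (9000.0, 5300.0),  # gap of 3700 s, over an hour
}

flags = {name: looks_overloaded(run, cpu) for name, (run, cpu) in tasks.items()}
print(flags)  # {'healthy': False, 'starved': True}
```

The run time and CPU time for finished tasks are both shown on the task pages of the project website, so the comparison can be made by eye just as well. The script only makes the threshold explicit.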

Mike Volunteer tester Send message Joined: 17 Feb 01 Posts: 34258 Credit: 79,922,639 RAC: 80 Message 1910920 - Posted: 5 Jan 2018, 21:54:18 UTC - in response to Message 1910909. "My CPU has 12 threads (6 cores). So in theory I have 4 extra CPU threads available even when it runs 4 E@H + 4 S@H. So I would not expect any overcommitment even when that happens. Something I never understood, and maybe this is the time to ask: when they say each S@H CPU WU uses 1 core, does it use one physical core or one thread? By analogy, when they say each E@H GPU WU uses one core, does it use one physical core or one thread? <edit> I know I ask a lot of questions, so sorry. But it's always nice to learn new things each day." Crunching always uses a physical core, so never utilize more than the physical cores available. It will slow down significantly. Maybe it's possible to feed the GPUs with available threads on modern CPUs. With each crime and every kindness we birth our future. ID: 1910920 ·

Grant (SSSF) Volunteer tester Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304	Message 1910921 - Posted: 5 Jan 2018, 21:54:54 UTC - in response to Message 1910914. The question is whether a physical thread vs virtual core crunches faster. I think you mean to say Physical Core v Virtual Core there. They are all physical threads, even if it's in a virtual core. Grant Darwin NT ID: 1910921 ·
