Postponed: Waiting to acquire lock

juan BFP
Volunteer tester
Message 1910805 - Posted: 5 Jan 2018, 14:20:42 UTC
Last modified: 5 Jan 2018, 14:33:18 UTC

My cache is almost empty due to the server problem, but the last 4 WUs refuse to crunch: they start, then stop after a few seconds.

Nothing else is running on the CPU/GPU of the host, so all the cores are available to crunch.

This is what the stderr.txt of one of them shows.

WU: 22ap08ac.1843.7248.16.43.27.vlar.1

Not using mb_cmdline.txt-file, using commandline options.

Build features: SETI8 Non-graphics FFTW FFTOUT JSPF AVX2 64bit 
 System: Linux  x86_64  Kernel: 4.10.0-42-generic
 CPU   : Intel(R) Core(TM) i7-6850K CPU @ 3.60GHz
 12 core(s), Speed :  3579.565 MHz
 L1 : 64 KB, Cache : 15360 KB
 Features : FPU TSC PAE APIC MTRR MMX SSE  SSE2 HT PNI SSSE3 SSE4_1 SSE4_2 AVX  AVX2  

ar=0.011995  NumCfft=146001  NumGauss=0  NumPulse=50177167232  NumTriplet=67971049376
In v_BaseLineSmooth: NumDataPoints=1048576, BoxCarLength=8192, NumPointsInChunk=32768
Linux optimized setiathome_v8 application
Version info: AVX2jf (Intel, Core 2-optimized v8-nographics) V5.13 by Alex Kan
AVX2jf Linux64 Build 3712 , Ported by : Raistmer, JDWhale, Urs Echternacht

Work Unit Info:
...............
Credit multiplier is :  2.85
WU true angle range is :  0.011995
05:49:22 (1270): Can't acquire lockfile (-154) - waiting 35s
05:49:57 (1270): Can't acquire lockfile (-154) - exiting
05:51:55 (2850): Can't acquire lockfile (-154) - waiting 35s
05:52:30 (2850): Can't acquire lockfile (-154) - exiting
.
.
.


TBar
Volunteer tester
Message 1910824 - Posted: 5 Jan 2018, 16:00:17 UTC - in response to Message 1910805.  

I've seen something similar with the CPU App 3711 running on my Mac with a couple of BLC14 tasks. For some reason those two tasks refused to run using the CPU App. I just changed the client_state.xml so they would run using the CUDA App, and they ran without a problem. The only thing I can think of is that something in the latest AKv8 3710 code is causing this. Fortunately it's only happened twice so far.
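
For anyone curious what that kind of edit looks like, here is a hypothetical fragment of a client_state.xml <result> entry being reassigned from the CPU app version to the CUDA one. The element names follow the usual client_state.xml layout but should be treated as an assumption; stop BOINC first and keep a backup before editing:

    <result>
        <name>22ap08ac.1843.7248.16.43.27.vlar.1_0</name>
        ...
        <platform>x86_64-pc-linux-gnu</platform>
        <version_num>801</version_num>        <!-- was 800 (the CPU app version) -->
        <plan_class>cuda90</plan_class>       <!-- was empty; points the task at the CUDA app -->
    </result>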
juan BFP
Volunteer tester
Message 1910826 - Posted: 5 Jan 2018, 16:15:26 UTC
Last modified: 5 Jan 2018, 16:17:38 UTC

OK, maybe it is something in the code; I will keep an eye on that. BTW, my WUs were common Arecibo VLARs, not BLC, and each WU stopped running at a different % progress.

I can't reschedule them to the GPU to try anymore: after my cache ran empty I shut down the host, and when I turned it on again the 4 tasks went directly to compute error.

-185 (0xFFFFFF47) ERR_RESULT_START


Stderr output
<core_client_version>7.8.3</core_client_version>
<![CDATA[
<message>
couldn't start app: Can't get shared memory segment name: can't get shared mem segment name</message>
]]>


Weird.
I don't believe there is something wrong with the host, since it crunches 1000's of WUs without any error.
Right now it's crunching E@H WUs with no errors too.
The system monitor says I'm not even using 20% of the available memory, and 0% of swap is in use.
I just run BOINC and do some browsing on the host.
Richard Haselgrove
Volunteer tester
Message 1910828 - Posted: 5 Jan 2018, 16:21:24 UTC

You shouldn't be using a shared memory segment - we switched to using memory mapped files 6, 7, 8 years ago (I forget exactly when).

That error message could (very, very, provisionally) be caused by a badly-formed app_info.xml file. Talk to your developer/supplier, and if they don't understand what you're talking about, tell them to talk to me - but give me time to dig out the archaeology first.
juan BFP
Volunteer tester
Message 1910846 - Posted: 5 Jan 2018, 18:06:44 UTC - in response to Message 1910828.  

@Richard

Please forgive my ignorance, but I really didn't understand anything in your message.

My host crunches 1000's of WUs each day, and more than 100 of them are on the CPU, so why do only these 4 show this problem?

I'm an old man on my own from the DOS era, and I don't know anybody here who even has an idea of how BOINC and all the rest of this stuff works. So there is no way I could ask my developer/supplier, since I don't have one.

Anyway, thanks for your time/help.

This is my app_info.xml; I don't see anything weird in the CPU part of it. Maybe your better-trained eyes can find something.

<app_info>
  <app>
     <name>setiathome_v8</name>
  </app>
    <file_info>
      <name>setiathome_x41p_zi3v_x86_64-pc-linux-gnu_cuda90</name>
      <executable/>
    </file_info>
    <file_info>
      <name>libcudart.so.9.0</name>
    </file_info>
    <file_info>
      <name>libcufft.so.9.0</name>
    </file_info>
    <app_version>
      <app_name>setiathome_v8</app_name>
      <platform>x86_64-pc-linux-gnu</platform>
      <version_num>801</version_num>
      <plan_class>cuda90</plan_class>
      <coproc>
        <type>NVIDIA</type>
        <count>1</count>
      </coproc>
      <avg_ncpus>1</avg_ncpus>
      <max_ncpus>1</max_ncpus>
      <file_ref>
         <file_name>setiathome_x41p_zi3v_x86_64-pc-linux-gnu_cuda90</file_name>
          <main_program/>
      </file_ref>
      <file_ref>
         <file_name>libcudart.so.9.0</file_name>
      </file_ref>
      <file_ref>
         <file_name>libcufft.so.9.0</file_name>
      </file_ref>
    </app_version>
  <app>
     <name>astropulse_v7</name>
  </app>
     <file_info>
       <name>astropulse_7.08_x86_64-pc-linux-gnu__opencl_nvidia_100</name>
        <executable/>
     </file_info>
     <file_info>
       <name>AstroPulse_Kernels_r2751.cl</name>
     </file_info>
     <file_info>
       <name>ap_cmdline_7.08_x86_64-pc-linux-gnu__opencl_nvidia_100.txt</name>
     </file_info>
    <app_version>
      <app_name>astropulse_v7</app_name>
      <platform>x86_64-pc-linux-gnu</platform>
      <version_num>708</version_num>
      <plan_class>opencl_nvidia_100</plan_class>
      <coproc>
        <type>NVIDIA</type>
        <count>1</count>
      </coproc>
      <avg_ncpus>1</avg_ncpus>
      <max_ncpus>1</max_ncpus>
      <file_ref>
         <file_name>astropulse_7.08_x86_64-pc-linux-gnu__opencl_nvidia_100</file_name>
          <main_program/>
      </file_ref>
      <file_ref>
         <file_name>AstroPulse_Kernels_r2751.cl</file_name>
      </file_ref>
      <file_ref>
         <file_name>ap_cmdline_7.08_x86_64-pc-linux-gnu__opencl_nvidia_100.txt</file_name>
         <open_name>ap_cmdline.txt</open_name>
      </file_ref>
    </app_version>
   <app>
      <name>setiathome_v8</name>
   </app>
      <file_info>
         <name>MBv8_8.22r3712_avx2_x86_64-pc-linux-gnu</name>
         <executable/>
      </file_info>
     <app_version>
     <app_name>setiathome_v8</app_name>
     <platform>x86_64-pc-linux-gnu</platform>
     <version_num>800</version_num>   
      <file_ref>
        <file_name>MBv8_8.22r3712_avx2_x86_64-pc-linux-gnu</file_name>
        <main_program/>
      </file_ref>
    </app_version>
   <app>
      <name>astropulse_v7</name>
   </app>
     <file_info>
       <name>ap_7.05r2728_sse3_linux64</name>
        <executable/>
     </file_info>
    <app_version>
       <app_name>astropulse_v7</app_name>
       <version_num>704</version_num>
       <platform>x86_64-pc-linux-gnu</platform>
       <plan_class></plan_class>
       <file_ref>
         <file_name>ap_7.05r2728_sse3_linux64</file_name>
          <main_program/>
       </file_ref>
    </app_version>
</app_info>

Keith Myers
Volunteer tester
Message 1910855 - Posted: 5 Jan 2018, 18:24:49 UTC

BOINC is having trouble accessing the CUDA library. Reboot the computer. Problem solved.
juan BFP
Volunteer tester
Message 1910857 - Posted: 5 Jan 2018, 18:25:55 UTC - in response to Message 1910855.  
Last modified: 5 Jan 2018, 18:33:26 UTC

BOINC is having trouble accessing the CUDA library. Reboot the computer. Problem solved.

CUDA? They were CPU WUs.

Yes, a reboot solves the problem... by crashing all 4. LOL
Richard Haselgrove
Volunteer tester
Message 1910877 - Posted: 5 Jan 2018, 19:52:35 UTC - in response to Message 1910846.  

@Richard

I'm an old man on my own from the DOS era ...
I'm in much the same state myself, except the first language I received formal training in was Algol 60 - mainframes, long before the DOS era!

But I try to keep the little grey cells active in my retirement by reading what I can.

OK, to business. The various optimised applications that our developer friends write are designed to work in the BOINC environment - in fact, that's the only point of them: they are useless anywhere else. So it's important that the applications communicate with BOINC - telling BOINC how far they've got, listening for instructions about when to pause or shut down, that sort of thing. The rather telegraphic technical phrases I quoted in my last message - "shared memory segment" and "memory mapped files" - refer to two alternative mechanisms for handling that chatter between the application and BOINC. We actually switched from the first to the second 10 years ago (I now find) - how time flies.

But 10 years ago BOINC was still being developed 'properly' - with care and attention to detail, ensuring both forward and backward compatibility. Both communication methods could co-exist. But the applications need to know which technique to use in any given situation.

That's done by passing a piece of information known as the "API version" in - in this case - the app_info.xml file.

You don't have that, so the applications will assume that they're working in an environment that's over 10 years old. I know nothing about Linux, but maybe they've abandoned shared memory in the intervening 10 years, too. That could account for your error message (no promises). But this is worth trying. Stick an extra line

    <api_version>6.1.0</api_version>
into the bit of app_info which describes the application that's having problems - look at the documentation for Anonymous Platform to see how it fits into the format.

In due course, the sections for the other applications could probably use the same treatment, but only change one at a time. Fortunately, the API version rarely affects this process, and the next change didn't come until v7.5.0 (to support Bitcoin mining), so the exact number doesn't matter - it just has to be "at least 6.0", according to line 166 of app_start.cpp.

Then stop BOINC and restart it. The new line will only make a difference for new tasks as they start running for the first time: anything which is already trying to run and showing 'postponed' has probably been lost already, so don't worry about them: watch the next one as it starts. If the patch makes a difference, and the tasks run properly, please ask the developer to read what I've written above and modify their supplied app_info.xml files accordingly.
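
For illustration, the new line would slot into the CPU section of the app_info.xml quoted earlier something like this (a sketch following the suggestion above; the exact position inside the <app_version> block is assumed not to matter):

    <app_version>
      <app_name>setiathome_v8</app_name>
      <platform>x86_64-pc-linux-gnu</platform>
      <version_num>800</version_num>
      <api_version>6.1.0</api_version>      <!-- the new line -->
      <file_ref>
        <file_name>MBv8_8.22r3712_avx2_x86_64-pc-linux-gnu</file_name>
        <main_program/>
      </file_ref>
    </app_version>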
juan BFP
Volunteer tester
Message 1910886 - Posted: 5 Jan 2018, 20:26:46 UTC

Added the line, let's see if that solves the problem.

BTW, the first language I learned was Fortran; that was 35 or 40 years ago, or maybe more... I'm too old to learn.

Thanks for your help & detailed explanation.
Keith Myers
Volunteer tester
Message 1910888 - Posted: 5 Jan 2018, 20:53:42 UTC - in response to Message 1910857.  

BOINC is having trouble accessing the CUDA library. Reboot the computer. Problem solved.

CUDA? They were CPU WUs.

Yes, a reboot solves the problem... by crashing all 4. LOL

Sorry, the only time I ever experienced that error was when BOINC for some reason couldn't find the CUDA libraries. Rebooted the machine and have never seen it again.
Keith Myers
Volunteer tester
Message 1910890 - Posted: 5 Jan 2018, 20:57:19 UTC

Looks like Richard supplied you with great information about the history of the apps and a solution for your CPU tasks in app_info.
Jeff Buck
Volunteer tester
Message 1910891 - Posted: 5 Jan 2018, 20:59:34 UTC
Last modified: 5 Jan 2018, 21:07:03 UTC

I'm wondering if there might be a simpler explanation for the issue, namely a BOINC client or system crash and restart that left a lockfile in a slot that prevented a new task from using that slot. I've seen that happen on my own machines on rare occasions and, if I recall correctly, a simple clean shutdown and restart of BOINC would clear it up.

Going all the way back to 2014, that message was also a symptom of a Zombie task issue, where the BOINC client died but left AP tasks running. Here's a snip from a message I posted then:

Running on device number: 0
DATA_CHUNK_UNROLL at default:2
DATA_CHUNK_UNROLL at default:2
16:32:27 (1464): Can't acquire lockfile (32) - waiting 35s
16:33:02 (1464): Can't acquire lockfile (32) - exiting
16:33:02 (1464): Error: The process cannot access the file because it is being used by another process. (0x20)

The main difference here is the final line, which Juan's task doesn't show.

Another possibility is simply an overcommitment of resources. Bill G had a thread a couple of months ago, Recent error: Cannot acquire lockfile, where he was getting the same message on a recurring basis, but tasks were intermittently suspending and then resuming. A BOINC client crash also seemed to immediately precede the onset of his problem. When I raised the possible overcommitment issue...
I was just looking through an old Process Monitor log for one of my machines and found that BOINC apparently polls the lockfiles every 5 minutes, just to make sure they're still there. Perhaps if it takes too long to get a response from the system, it kicks out the messages shown in your errors. How many of those 32 HT cores are running tasks at the same time?
...he responded:
2 for the 2 GPUs and then the remaining 30 each running its own WU
...so that certainly looked like a contributing factor in his case.

So, Juan, how many concurrent tasks are you running on that machine and do you know if you had a BOINC or system crash shortly before the problem first showed up?

EDIT: Okay, reviewing your original post, I see that those were the only tasks you had running at the time, so overcommitment doesn't look like the source of your issue.
Richard Haselgrove
Volunteer tester
Message 1910898 - Posted: 5 Jan 2018, 21:12:01 UTC - in response to Message 1910890.  

Looks like Richard supplied you with great information about the history of the apps and a solution for your CPU tasks in app_info.
If it works! But at least it's a plausible - and causal - suggestion.
juan BFP
Volunteer tester
Message 1910899 - Posted: 5 Jan 2018, 21:30:47 UTC
Last modified: 5 Jan 2018, 21:32:25 UTC

It could be a very long shot, but I remember something: I have only seen this error 2 times, and in both cases it was when the host reached the end of its cache of WUs.

My host runs 4 GPU + 4 CPU WUs (I use <project_max_concurrent>8</project_max_concurrent>), so you would not expect an overcommitment, but there is one point to add.

Since E@H is my backup project and, as Richard explained, the line is read only at the beginning of the crunch, maybe something hidden in the code crashes when the S@H GPU WUs end and the host starts up to 8 CPU WUs while it begins downloading new E@H data. Or when it starts the E@H GPU WUs while it has not fully stopped the additional 4 S@H CPU WUs. I'm talking about internal timing or something like that. Just an idea.

Anyway, I added the suggested line and will see if that happens again in the next outage.

BTW, I use an app_config in E@H too, so it keeps a maximum of 8 active WUs at any time (up to 4 GPU WUs in E@H + 4 S@H CPU WUs).
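
For reference, a minimal sketch of that kind of cap in app_config.xml (project_max_concurrent is a standard app_config.xml option; the value simply mirrors the setup described above):

    <app_config>
        <!-- never run more than 8 tasks from this project at once -->
        <project_max_concurrent>8</project_max_concurrent>
    </app_config>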
Keith Myers
Volunteer tester
Message 1910903 - Posted: 5 Jan 2018, 21:35:32 UTC

Remember that every Einstein GPU task finishes up its calculation by switching over to a CPU core at 89% completion. If an E@H task was finishing up on a CPU core just as SETI@home CPU tasks occupied 4 CPU cores, that might have been the cause.
juan BFP
Volunteer tester
Message 1910909 - Posted: 5 Jan 2018, 21:41:37 UTC - in response to Message 1910903.  
Last modified: 5 Jan 2018, 21:47:03 UTC

My CPU has 12 threads (6 cores), so in theory I have 4 extra CPU threads available even when it runs 4 E@H + 4 S@H.
So I don't expect any overcommitment even when that happens.

Something I never understood, and maybe this is the time to ask:
When they say each S@H CPU WU uses 1 core, does it use one physical core or one thread?
By analogy, when they say each E@H GPU WU uses one core, does it use one physical core or one thread?

<edit> I know I ask a lot of questions, sorry. But it's always nice to learn something new each day.
Zalster
Volunteer tester
Message 1910914 - Posted: 5 Jan 2018, 21:50:30 UTC - in response to Message 1910909.  

I think what we really mean is 1 thread... Laziness on our part when we say core.

The question is whether a physical thread vs virtual core crunches faster. On some projects, physical thread is faster. Not so much here.
Grant (SSSF)
Volunteer tester
Message 1910918 - Posted: 5 Jan 2018, 21:53:02 UTC - in response to Message 1910909.  
Last modified: 5 Jan 2018, 22:04:32 UTC

My CPU has 12 threads (6 cores), so in theory I have 4 extra CPU threads available even when it runs 4 E@H + 4 S@H.
So I don't expect any overcommitment even when that happens.

Something I never understood, and maybe this is the time to ask:
When they say each S@H CPU WU uses 1 core, does it use one physical core or one thread?
By analogy, when they say each E@H GPU WU uses one core, does it use one physical core or one thread?

A thread (threads in this instance not to be confused with threads in the Task Manager Processes\Threads\Handles sense). A hyperthread is really a virtual core, so saying each WU requires a core makes sense.

So 6 cores, 12 threads means it's possible to run 12 WU instances (shared between CPU & GPU, as even the GPU WUs require some CPU support - in the case of SoG it's 1 core (thread) for each GPU WU being processed). My i7 is 4 cores\8 threads. I run all threads with no overcommitment issues, as I have reserved 1 CPU core (thread) for each GPU WU being processed - see the sketch below. When I run out of GPU work, the released CPU core (thread) then starts on a CPU WU.


EDIT- The biggest hint of an overloaded system is a significant discrepancy between the CPU time & run time for a given CPU WU.
On my i7 system the difference is generally 3 min or less.
I've seen some systems where the difference is over an hour, e.g. my C2D when processing Arecibo WUs. The GPU requires much more CPU support on those WUs (it's running CUDA50).
It's got 2 GPUs running 2 WUs each. With only 2 cores available, the need for CPU support significantly reduces the available CPU time for crunching CPU WUs.
However, when processing GBT work the CPU support required is significantly reduced, and so CPU times & run times are much closer.
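
A minimal sketch of that one-CPU-thread-per-GPU-task reservation in app_config.xml (gpu_versions with gpu_usage/cpu_usage is standard app_config.xml syntax; the app name follows the app_info.xml quoted earlier):

    <app_config>
        <app>
            <name>setiathome_v8</name>
            <gpu_versions>
                <gpu_usage>1</gpu_usage>    <!-- one GPU per task -->
                <cpu_usage>1</cpu_usage>    <!-- reserve a full CPU thread per GPU task -->
            </gpu_versions>
        </app>
    </app_config>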
Mike
Volunteer tester
Message 1910920 - Posted: 5 Jan 2018, 21:54:18 UTC - in response to Message 1910909.  

My CPU has 12 threads (6 cores), so in theory I have 4 extra CPU threads available even when it runs 4 E@H + 4 S@H.
So I don't expect any overcommitment even when that happens.

Something I never understood, and maybe this is the time to ask:
When they say each S@H CPU WU uses 1 core, does it use one physical core or one thread?
By analogy, when they say each E@H GPU WU uses one core, does it use one physical core or one thread?

<edit> I know I ask a lot of questions, sorry. But it's always nice to learn something new each day.


Crunching always uses a physical core, so never utilize more than the physical cores available; it will slow down significantly otherwise.
Maybe it's possible to feed the GPUs with the spare threads on modern CPUs.


Grant (SSSF)
Volunteer tester
Message 1910921 - Posted: 5 Jan 2018, 21:54:54 UTC - in response to Message 1910914.  

The question is whether a physical thread vs virtual core crunches faster.

I think you mean to say physical core vs virtual core there. They are all physical threads, even if they're in a virtual core.