Postponed: Waiting to acquire lock

Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1911215 - Posted: 6 Jan 2018, 19:41:54 UTC - in response to Message 1911211.  

The rescheduler program could just make that more common because it shuts down and restarts the BOINC process too.
Yes, I'd say that's accurate. The rescheduler tells the BOINC Manager to shut down and, from there on, it's the Manager (and OS) that control the client shutdown, just as it would any other time you Exit the Manager from the menu.
ID: 1911215 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1911216 - Posted: 6 Jan 2018, 19:44:27 UTC - in response to Message 1911212.  
Last modified: 6 Jan 2018, 19:49:14 UTC

If you say so. My recommendation stands. There are reasons I don't use scripts to edit my state file, on any platform.

Oh, and I don't use a third-party app to control BOINC tasks either; I use the BOINC Manager.
ID: 1911216 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1911221 - Posted: 6 Jan 2018, 19:49:03 UTC - in response to Message 1911215.  

What we need to find is why, when the exit happens, the GPU-related slots close the lock file but the CPU-related slots don't. And that only happens when BOINC is running the last WU of the cache... Weird... I'll do like you and go for a new beer.
ID: 1911221 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1911225 - Posted: 6 Jan 2018, 19:58:34 UTC - in response to Message 1911216.  

It's irrelevant if the client_state file gets edited while BOINC is running. BOINC only reads that file at startup, then maintains and updates it in memory. The only thing that will happen is that BOINC periodically overwrites the file on disc, thus negating any changes that might have been made on disc while BOINC is running.
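As a rough illustration of that write-back behaviour (this is a sketch, not BOINC's actual code, and the file name and interval are only illustrative): the state is parsed once at startup, held in memory, and periodically written back wholesale, so anything edited on disc in the meantime is lost at the next flush.

import time
import xml.etree.ElementTree as ET

STATE_FILE = "client_state.xml"   # illustrative path

def load_state():
    # Read once, at startup; this is the only time the on-disc file is consulted.
    return ET.parse(STATE_FILE)

def flush_state(state):
    # Periodic write: the on-disc file is replaced wholesale with the in-memory copy.
    state.write(STATE_FILE, encoding="utf-8", xml_declaration=True)

def client_main_loop(flush_interval=60):
    state = load_state()          # edits made on disc after this point are ignored...
    while True:
        time.sleep(flush_interval)
        flush_state(state)        # ...and overwritten here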
ID: 1911225 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14653
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1911227 - Posted: 6 Jan 2018, 20:00:35 UTC - in response to Message 1911225.  

It's irrelevant if the client_state file gets edited while BOINC is running. BOINC only reads that file at startup, then maintains and updates it in memory. The only thing that will happen is that BOINC periodically overwrites the file on disc, thus negating any changes that might have been made on disc while BOINC is running.
Correct.
ID: 1911227 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1911229 - Posted: 6 Jan 2018, 20:08:35 UTC - in response to Message 1911221.  
Last modified: 6 Jan 2018, 20:20:06 UTC

What we need to find is why, when the exit happens, the GPU-related slots close the lock file but the CPU-related slots don't. And that only happens when BOINC is running the last WU of the cache... Weird...
I wonder if it's possible that the BOINC client actually finishes shutting down before the individual science apps have completely terminated. On the other hand, perhaps the client is forcing app termination prematurely. Off the top of my head, I'm not sure if it's the science app or the client that maintains the lockfile.

EDIT: Ah, here's a snippet from one of my old Process Monitor logs. This isn't from a client shutdown, but does seem to show that it's the science app that deletes the lockfile.
5:45:17.7139951 PM	Lunatics_x41zc_win32_cuda50.exe	4020	CloseFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\3\boinc_lockfile	SUCCESS	
5:45:17.7142202 PM	Lunatics_x41zc_win32_cuda50.exe	4020	CreateFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\3	SUCCESS	Desired Access: Read Data/List Directory, Synchronize, Disposition: Open, Options: Directory, Synchronous IO Non-Alert, Attributes: n/a, ShareMode: Read, Write, AllocationSize: n/a, OpenResult: Opened
5:45:17.7142423 PM	Lunatics_x41zc_win32_cuda50.exe	4020	QueryDirectory	C:\Documents and Settings\All Users\Application Data\BOINC\slots\3\boinc_lockfile	SUCCESS	Filter: boinc_lockfile, 1: boinc_lockfile
5:45:17.7142763 PM	Lunatics_x41zc_win32_cuda50.exe	4020	CloseFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\3	SUCCESS	
5:45:17.7144252 PM	Lunatics_x41zc_win32_cuda50.exe	4020	CreateFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\3\boinc_lockfile	SUCCESS	Desired Access: Read Attributes, Delete, Disposition: Open, Options: Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened
5:45:17.7144544 PM	Lunatics_x41zc_win32_cuda50.exe	4020	QueryAttributeTagFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\3\boinc_lockfile	SUCCESS	Attributes: A, ReparseTag: 0x0
5:45:17.7144668 PM	Lunatics_x41zc_win32_cuda50.exe	4020	SetDispositionInformationFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\3\boinc_lockfile	SUCCESS	Delete: True
5:45:17.7144813 PM	Lunatics_x41zc_win32_cuda50.exe	4020	CloseFile	C:\Documents and Settings\All Users\Application Data\BOINC\slots\3\boinc_lockfile	SUCCESS	
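For reference, the usual slot-lockfile pattern looks roughly like this on Linux (a sketch only, not the actual BOINC or science-app source; the function names are made up): the app creates the file, takes an exclusive advisory lock, and deletes it on clean exit. If the process dies between locking and cleanup, a stale boinc_lockfile stays behind in the slot.

import fcntl
import os

def acquire_slot_lock(slot_dir):
    path = os.path.join(slot_dir, "boinc_lockfile")
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        # Non-blocking exclusive lock: fail immediately if another process holds it.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        raise RuntimeError("waiting to acquire lock: slot already in use")
    return fd, path

def release_slot_lock(fd, path):
    fcntl.lockf(fd, fcntl.LOCK_UN)
    os.close(fd)
    os.remove(path)   # the clean-exit delete, mirroring the SetDispositionInformationFile entry above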
ID: 1911229 · Report as offensive
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1911230 - Posted: 6 Jan 2018, 20:10:32 UTC

Is this happening ONLY with tasks that you have rescheduled?
ID: 1911230 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1911232 - Posted: 6 Jan 2018, 20:15:28 UTC - in response to Message 1911227.  
Last modified: 6 Jan 2018, 20:15:49 UTC

You people act as though Juan never restarts BOINC. The last log file contains 23 restarts, and all it takes is one to preserve edits made while it was running. 23 is a lot...
ID: 1911232 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14653
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1911235 - Posted: 6 Jan 2018, 20:21:23 UTC - in response to Message 1911232.  

That would only make a difference if it followed the exact sequence:

Edit file
Stop BOINC
Save file
Start BOINC

If the file is saved at any other moment - either before BOINC stops, or after it restarts - nothing will be preserved from the editing. I agree with Jeff: this line of thought is a red herring. So I'll butt out again.
ID: 1911235 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1911239 - Posted: 6 Jan 2018, 20:27:38 UTC - in response to Message 1911232.  

You people act as though Juan never restarts BOINC. The last log file contains 23 restarts, and all it takes is one to preserve edits made while it was running. 23 is a lot...
Sure, and every time the client shuts down it will overwrite any client_state.xml file on disc with the contents held in memory, thus wiping out any disc edits. Now, if you can find a reference to a client_state_prev file somewhere in the log following a restart, then there might be a very slim chance of a manual edit sneaking in.
ID: 1911239 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1911240 - Posted: 6 Jan 2018, 20:30:07 UTC - in response to Message 1911235.  

Run it dry, and nuke it from orbit. It's the only way to be sure. Something is obviously borked; best to start with a new file and empty slots.
ID: 1911240 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1911252 - Posted: 6 Jan 2018, 20:46:19 UTC - in response to Message 1911230.  
Last modified: 6 Jan 2018, 20:52:16 UTC

Is this happening ONLY with tasks that you have rescheduled?

Can't say yes or no for sure. I believe no, because those are CPU tasks and I always reschedule CPU to GPU.
What I can say is that it only happens when the crunching of that set of WUs starts. And only on the CPU.
ID: 1911252 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1911255 - Posted: 6 Jan 2018, 20:50:35 UTC - in response to Message 1911232.  

You people act as though Juan never restarts BOINC. The last log file contains 23 restarts, and all it takes is one to preserve edits made while it was running. 23 is a lot...

Normally I don't do that. Those stops and restarts are mainly from rescheduling (we have had a lot of trouble getting new WUs in the last few days) and some tests I'm doing these days to try to understand the problem.
But today I made just one adjustment, changing the 5 CPU WUs to 6, with no rescheduling or other tests; you can see that in my latest log file.
ID: 1911255 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1911258 - Posted: 6 Jan 2018, 20:57:20 UTC
Last modified: 6 Jan 2018, 20:59:36 UTC

Before taking extreme measures like killing the client file (I know how to preserve the host ID),
I was thinking of doing this:

Stop the AVX2 builds and put the stock Linux app on to crunch, and see if anything changes.
Or whatever other build is more commonly used.

Something is not working right in the locking/unlocking of the file, or maybe my host takes too long to exit and something messes with the lock/unlock process.

Open to suggestions.
ID: 1911258 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14653
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1911260 - Posted: 6 Jan 2018, 21:02:16 UTC - in response to Message 1911258.  

You deleted all the slot folders and with them all the lockfiles, right?

Let it run like that until empty. See what effect that deletion has made.

While it runs, LOOK but DON'T TOUCH. Do you have CPU tasks? Do you have GPU tasks? Are both types running? Are any tasks postponed? Gather evidence.
ID: 1911260 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1911268 - Posted: 6 Jan 2018, 21:22:06 UTC - in response to Message 1911260.  
Last modified: 6 Jan 2018, 21:23:02 UTC

You deleted all the slot folders and with them all the lockfiles, right?
Let it run like that until empty. See what effect that deletion has made.

I did that before, when I first saw the problem. It kills all active WUs (including the postponed ones).
When I do that, everything returns to normal after the postponed WUs are killed.
BTW, the GPU WUs continue to crunch normally while the postponed error is still happening on the CPU WUs.

While it runs, LOOK but DON'T TOUCH. Do you have CPU tasks? Do you have GPU tasks? Are both types running? Are any tasks postponed? Gather evidence.

I don't clearly understand what you're asking for, but let me try to explain, since I've done a lot of tries.

Both types of WU are running.
It runs for hours normally.
For some reason my host stops receiving new WUs (server crash, like yesterday).
The work continues.
Normally the GPU WU cache empties first, as expected.
Then, when the CPU WU cache is near the end, the last WU is the one that apparently starts the problem.
The host can start receiving GPU WUs again and they resume crunching as normal;
only the CPU WU crunching stops working.


Set to NNT to run the cache dry and try cleaning the host file. But that will take a few hours, so most of you will probably be sleeping when that happens.
ID: 1911268 · Report as offensive
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1911272 - Posted: 6 Jan 2018, 21:25:08 UTC - in response to Message 1911252.  

Is this happening ONLY with tasks that you have rescheduled?
Can't say yes or no for sure. I believe no, because those are CPU tasks and I always reschedule CPU to GPU.
What I can say is that it only happens when the crunching of that set of WUs starts. And only on the CPU.
The reason I asked is that there is no cmdline or api_version string in the rescheduling script ... so it is not specifically looking for those when modifying the client state.

You have added/removed lines of the app_info from what has been 'normal', so if cpu2gpu is looking to remove/add a fixed number of lines, it could very well be reformatting it incorrectly. You would have to move a single file and do a comparison of the output to know for sure whether that is the case.
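If you want to do that comparison, something along these lines would work (the paths are only illustrative): copy client_state.xml aside before running the rescheduling script, then diff the snapshot against the file the script leaves behind.

import difflib

BEFORE = "client_state.before.xml"   # snapshot made before running the script
AFTER = "client_state.xml"           # file as the script left it

with open(BEFORE) as f:
    before = f.readlines()
with open(AFTER) as f:
    after = f.readlines()

for line in difflib.unified_diff(before, after, fromfile=BEFORE, tofile=AFTER):
    print(line, end="")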
ID: 1911272 · Report as offensive
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13746
Credit: 208,696,464
RAC: 304
Australia
Message 1911277 - Posted: 6 Jan 2018, 21:35:17 UTC - in response to Message 1911268.  

For some reason my host stops receiving new WUs (server crash, like yesterday).

There has been an issue with the Scheduler for the last 12 months where it will randomly stop allocating work to certain systems, even though they've just reported work. And then it will start allocating work again, when it feels like it. You may or may not run out of work in the meantime.

The Scheduler response is usually "Project has no tasks available", and very occasionally it'll say there is no work available for your selected application, but there is work available for others.
Generally Tbar's triple update gets things going again.
In the BOINC Manager, click on Update. Once the Scheduler request is in progress, click on Update again. When that Scheduler request has completed, click on update again. On the next automatic update, work should start flowing again.
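The same triple update can be scripted with boinccmd, assuming boinccmd is installed and allowed to talk to the client; the pauses below are a rough stand-in for waiting on each scheduler request, and the project URL should be whatever your host is actually attached with.

import subprocess
import time

PROJECT_URL = "http://setiathome.berkeley.edu/"   # adjust if your attached URL differs

for attempt in range(3):
    # Ask the client to contact the project's scheduler now.
    subprocess.run(["boinccmd", "--project", PROJECT_URL, "update"], check=True)
    time.sleep(15)   # rough pause between requests; watch the Event Log to fine-tune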
If you're having problems downloading work once it's allocated, then it's time to edit your Hosts file again...
Grant
Darwin NT
ID: 1911277 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1911278 - Posted: 6 Jan 2018, 21:35:28 UTC - in response to Message 1911272.  

You have added/removed lines of the app_info from what has been 'normal'

I use app_config to pass the commands; I don't mess with app_info.
The only change to the app_info file was to enable the AVX2 builds, and I took extreme care not to touch anything else, since I know that if I do, it's a time bomb.

This is my file; it's extremely clean:

<app_info>
  <app>
     <name>setiathome_v8</name>
  </app>
    <file_info>
      <name>setiathome_x41p_zi3v_x86_64-pc-linux-gnu_cuda90</name>
      <executable/>
    </file_info>
    <file_info>
      <name>libcudart.so.9.0</name>
    </file_info>
    <file_info>
      <name>libcufft.so.9.0</name>
    </file_info>
    <app_version>
      <app_name>setiathome_v8</app_name>
      <platform>x86_64-pc-linux-gnu</platform>
      <version_num>801</version_num>
      <plan_class>cuda90</plan_class>
      <coproc>
        <type>NVIDIA</type>
        <count>1</count>
      </coproc>
      <avg_ncpus>1</avg_ncpus>
      <max_ncpus>1</max_ncpus>
      <file_ref>
         <file_name>setiathome_x41p_zi3v_x86_64-pc-linux-gnu_cuda90</file_name>
          <main_program/>
      </file_ref>
      <file_ref>
         <file_name>libcudart.so.9.0</file_name>
      </file_ref>
      <file_ref>
         <file_name>libcufft.so.9.0</file_name>
      </file_ref>
    </app_version>
  <app>
     <name>astropulse_v7</name>
  </app>
     <file_info>
       <name>astropulse_7.08_x86_64-pc-linux-gnu__opencl_nvidia_100</name>
        <executable/>
     </file_info>
     <file_info>
       <name>AstroPulse_Kernels_r2751.cl</name>
     </file_info>
     <file_info>
       <name>ap_cmdline_7.08_x86_64-pc-linux-gnu__opencl_nvidia_100.txt</name>
     </file_info>
    <app_version>
      <app_name>astropulse_v7</app_name>
      <platform>x86_64-pc-linux-gnu</platform>
      <version_num>708</version_num>
      <plan_class>opencl_nvidia_100</plan_class>
      <coproc>
        <type>NVIDIA</type>
        <count>1</count>
      </coproc>
      <avg_ncpus>1</avg_ncpus>
      <max_ncpus>1</max_ncpus>
      <file_ref>
         <file_name>astropulse_7.08_x86_64-pc-linux-gnu__opencl_nvidia_100</file_name>
          <main_program/>
      </file_ref>
      <file_ref>
         <file_name>AstroPulse_Kernels_r2751.cl</file_name>
      </file_ref>
      <file_ref>
         <file_name>ap_cmdline_7.08_x86_64-pc-linux-gnu__opencl_nvidia_100.txt</file_name>
         <open_name>ap_cmdline.txt</open_name>
      </file_ref>
    </app_version>
   <app>
      <name>setiathome_v8</name>
   </app>
      <file_info>
         <name>MBv8_8.22r3712_avx2_x86_64-pc-linux-gnu</name>
         <executable/>
      </file_info>
     <app_version>
     <app_name>setiathome_v8</app_name>
     <platform>x86_64-pc-linux-gnu</platform>
     <version_num>800</version_num>   
     <api_version>6.1.0</api_version>
      <file_ref>
        <file_name>MBv8_8.22r3712_avx2_x86_64-pc-linux-gnu</file_name>
        <main_program/>
      </file_ref>
    </app_version>
   <app>
      <name>astropulse_v7</name>
   </app>
     <file_info>
       <name>ap_7.05r2728_sse3_linux64</name>
        <executable/>
     </file_info>
    <app_version>
       <app_name>astropulse_v7</app_name>
       <version_num>704</version_num>
       <platform>x86_64-pc-linux-gnu</platform>
       <plan_class></plan_class>
       <file_ref>
         <file_name>ap_7.05r2728_sse3_linux64</file_name>
          <main_program/>
       </file_ref>
    </app_version>
</app_info>
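One quick sanity check on a file like that (just an illustration, nothing BOINC requires) is to parse it and list every declared app_version, so a missing tag or a mismatched plan class stands out:

import xml.etree.ElementTree as ET

tree = ET.parse("app_info.xml")   # run from the project directory
for av in tree.getroot().iter("app_version"):
    name = av.findtext("app_name")
    ver = av.findtext("version_num")
    plan = av.findtext("plan_class") or "(none)"
    platform = av.findtext("platform")
    print(f"{name}  v{ver}  plan_class={plan}  platform={platform}")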

ID: 1911278 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1911292 - Posted: 6 Jan 2018, 22:22:35 UTC

Juan, here's a little experiment you could try. While BOINC is running, use gedit to open a boinc_lockfile from one of the slots where a CPU app is running. It's an empty file, but it still should open okay. Then shut down BOINC. Once BOINC is completely shut down, simply hit Save in gedit. That should recreate the file in that same slot folder. Restart BOINC and see if the task that was running in that slot gets postponed. You could also try doing the same thing with one of the GPU tasks.

I tried that test with CPU tasks on both my daily driver (Win 7) and one of my Linux boxes. Neither of those apps cared that there was already a lockfile present in the slot. They both restarted fine. So, if your CPU app has a problem with a pre-existing lockfile, then there might be an app-specific issue. On the other hand, if yours restart smoothly even with the lockfile present, then it would seem as if there's some other factor involved besides just the lockfile. At least that would be a bit more info worth knowing, I think.
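If it helps, the same experiment can be scripted. The slot path below is just an example based on the default Linux data directory; pick the slot you actually saw the CPU task running in. Run it only after BOINC has fully shut down, then restart BOINC and watch the task's status.

import os

SLOT = "/var/lib/boinc-client/slots/3"            # example path; use your own slot
lockfile = os.path.join(SLOT, "boinc_lockfile")

with open(lockfile, "w"):
    pass                                          # creates an empty file, same effect as saving in gedit

print("Created", lockfile, "- restart BOINC and check whether the task gets postponed.")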
ID: 1911292 · Report as offensive