Postponed: Waiting to acquire lock

Message boards : Number crunching : Postponed: Waiting to acquire lock
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1911140 - Posted: 6 Jan 2018, 16:40:51 UTC - in response to Message 1910998.  

I uploaded the file to my OneDrive if you wish to see it.

https://1drv.ms/t/s!Asjkc9Jyluh3zjxkeFIIYLB6RCjF

At around 16:15 the only difference I see is because I rescheduled some work while SETI was down. And around 17:28 it looks like BOINC shut down and restarted (maybe another reschedule). I don't remember why, but I do remember making some reschedules yesterday due to the lack of new work. I rescheduled only from the CPU to the GPU, sending some of the Arecibo VLARs to the GPU.

BTW, I totally forgot what day it is, and I haven't started drinking anything yet. LOL
Well, after looking at that, and hearing about the reschedule immediately beforehand, I'm going to say your problem was most likely caused by rescheduling. Hopefully you won't find it necessary to experience that again. I'm still not sure about my own problem, which only happened with two tasks over a two-month period. I ran one of the tasks in the benchmark app and it worked fine with the 3711 CPU App. It's been a few days and I haven't seen the problem again; possibly it was just a brief cosmic ray storm.

As for the API line, if you look at the Windows AVX CPU App, it doesn't have an API line in its app_info, http://mikesworld.eu/download.html. The Linux AVX App doesn't have an API line either, http://lunatics.kwsn.info/index.php?action=downloads;sa=view;down=467. So that is probably not your problem. Also, the AVX2 App was built with API 7.5.0, so even your API line is wrong. I'm not sure whether you should change the line to 7.5.0 or just remove it... your call.
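For readers wondering which line is being discussed: the "API line" is the <api_version> element inside an <app_version> block of app_info.xml. A hedged illustration with placeholder values (your app_name and version_num will differ):

```xml
<!-- Illustrative fragment only; app_name/version_num are placeholders. -->
<app_version>
    <app_name>setiathome_v8</app_name>
    <version_num>800</version_num>
    <!-- The line under discussion; these builds report API 7.5.0 -->
    <api_version>7.5.0</api_version>
</app_version>
```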
ID: 1911140 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1911145 - Posted: 6 Jan 2018, 16:55:58 UTC - in response to Message 1911140.  

So that is probably not your problem. Also, the AVX2 App was built with API 7.5.0, so even your API line is wrong. I'm not sure whether you should change the line to 7.5.0 or just remove it... your call.
I suggested adding the API version line because Juan posted an error message which is specific to a function for which the API version is used. Let's wait and see if it's had any effect before jumping to conclusions.

The test relating to shared memory is precisely "At least 6.0". Any higher value will do. There's no point in worrying about any numeric change.

The test which kicks in at 7.5 is different. Above that point, the BOINC Client no longer passes the device number to be used (for GPU apps) on the command line. We had to get Raistmer to re-organise his code so that different tasks ran on device 1, device 2, and so on. Without fixing it, all tasks ran on device zero, whatever device BOINC thought (and displayed) that they were running on. I don't know whether the other developers were paying attention or not.
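To make the device-assignment point concrete: BOINC writes an init_data.xml file into each slot directory, and an app can recover its assigned GPU from there once the client stops passing it on the command line. A minimal sketch; the <gpu_device_num> field name is an assumption based on BOINC's APP_INIT_DATA structure, so verify it against your client version.

```python
# Hypothetical sketch: read the assigned GPU device number from a slot's
# init_data.xml, falling back to device 0 (the failure mode described above,
# where every task lands on device zero). The <gpu_device_num> element name
# is an assumption taken from BOINC's APP_INIT_DATA; verify for your client.
import xml.etree.ElementTree as ET

def gpu_device_from_init_data(xml_text, default=0):
    """Return the GPU device number from init_data.xml text, or `default`."""
    node = ET.fromstring(xml_text).find("gpu_device_num")
    return int(node.text) if node is not None and node.text else default

# Example: a client running a task on the second GPU writes a different
# number into that task's slot.
sample = """<app_init_data>
  <gpu_type>NVIDIA</gpu_type>
  <gpu_device_num>1</gpu_device_num>
</app_init_data>"""
print(gpu_device_from_init_data(sample))  # -> 1
```

Without a lookup like this (or the old command-line argument), a multi-GPU host runs everything on device 0 no matter what the manager displays.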
ID: 1911145 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1911148 - Posted: 6 Jan 2018, 17:04:55 UTC - in response to Message 1911145.  

There are an awful lot of people running the Windows and Linux AVX Apps without any problems or api lines.
ID: 1911148 · Report as offensive
juan BFP Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1911150 - Posted: 6 Jan 2018, 17:07:13 UTC
Last modified: 6 Jan 2018, 17:20:45 UTC

@TBar

Yes, after all I really believe it is some kind of incompatibility between the rescheduler and the way Linux or the build works. Please note, I'm not saying something is broken or has a bug; the mix just doesn't work well, in my case at least. But now that I know about it, I can try to avoid the extreme situation that produces the error. Maybe something similar could be happening with your WUs. I know it's a detective task, finding a needle in a haystack, and for that I thank everyone who helped me try to trace and fix the problem.

Anyway, I will leave the API line in place as Richard posted and see whether the problem happens again in the next outage.

Since you don't read our team forum, I believe I never had a real opportunity to thank you directly for the nice work you've done that makes it possible for us Linux newbies to run these CUDA90 builds on our hosts.
It was amazing; I had never run a Linux box before, and I was able to change my host from Windows to Linux with almost no pain.
And the result is clear: my host is now the #2 top cruncher by RAC, even without needing to change my 1070s for 1080 Ti GPUs.
It went from 120K/day up to 200K/day just because of the software update, something really incredible.
So a big thanks for that!
ID: 1911150 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1911152 - Posted: 6 Jan 2018, 17:11:03 UTC - in response to Message 1911148.  

There are an awful lot of people running the Windows and Linux AVX Apps without any problems or api lines.
Great - they can stop reading this specific thread at this point.
ID: 1911152 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1911170 - Posted: 6 Jan 2018, 18:01:59 UTC - in response to Message 1911150.  

@TBar

Yes, after all I really believe it is some kind of incompatibility between the rescheduler and the way Linux or the build works. Please note, I'm not saying something is broken or has a bug; the mix just doesn't work well, in my case at least. But now that I know about it, I can try to avoid the extreme situation that produces the error. Maybe something similar could be happening with your WUs. I know it's a detective task, finding a needle in a haystack, and for that I thank everyone who helped me try to trace and fix the problem.

Anyway, I will leave the API line in place as Richard posted and see whether the problem happens again in the next outage.

Since you don't read our team forum, I believe I never had a real opportunity to thank you directly for the nice work you've done that makes it possible for us Linux newbies to run these CUDA90 builds on our hosts.
It was amazing; I had never run a Linux box before, and I was able to change my host from Windows to Linux with almost no pain.
And the result is clear: my host is now the #2 top cruncher by RAC, even without needing to change my 1070s for 1080 Ti GPUs.
It went from 120K/day up to 200K/day just because of the software update, something really incredible.
So a big thanks for that!
There isn't a rescheduler app for OSX; if I find it necessary to reschedule, I just use the text editor with Find & Replace instead, even on Linux. Right now my OSX app_info still has the API line in it from the last time Richard brought it up; it didn't seem to help any this last time. I'm also running the 3711 CPU App on a Linux machine, and so far it hasn't had any trouble without an API line.

Yes, it does make things much easier when you place the BOINC folder in your Linux home folder, especially if you have one home partition with a few different system partitions. It's kind of a pain to have to have a different BOINC for each system folder. Hopefully the next CUDA App will solve the few remaining problems and I can post a zi3xs version; zi3xs is a little faster than zi3v.
ID: 1911170 · Report as offensive
juan BFP Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1911176 - Posted: 6 Jan 2018, 18:16:45 UTC - in response to Message 1911170.  
Last modified: 6 Jan 2018, 18:17:18 UTC

Hopefully the next CUDA App will solve the few remaining problems and I can post a zi3xs version, zi3xs is a little faster than zi3v.

Can't wait to use it, and if you need a host to test on, feel free to ask.
ID: 1911176 · Report as offensive
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1911182 - Posted: 6 Jan 2018, 18:26:50 UTC - in response to Message 1911176.  

Hopefully the next CUDA App will solve the few remaining problems and I can post a zi3xs version, zi3xs is a little faster than zi3v.

Can't wait to use it, and if you need a host to test on, feel free to ask.

Count me in as a beta tester too.
Seti@Home classic workunits: 20,676 · CPU time: 74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1911182 · Report as offensive
juan BFP Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1911185 - Posted: 6 Jan 2018, 18:30:35 UTC
Last modified: 6 Jan 2018, 18:33:04 UTC

OK, that's new. No need to wait long... the error repeats.

Again the cache is almost empty (project backoff... again).

In my cache, the last 8 CPU WUs all show the "acquire lock" msg.

Now I'm sure: Einstein is out (set to NNT yesterday), I did not reschedule anything, my host was not recycled, and BOINC was running without interference. And the API line is in place (I will post the file in the next msg along with the BOINC log file).

The only change I made on the host was to raise the number of CPU WUs from 5 to 6, as suggested by Mike earlier.

This is the error file from one of them:

not using mb_cmdline.txt-file, using commandline options.

Build features: SETI8 Non-graphics FFTW FFTOUT JSPF AVX2 64bit 
 System: Linux  x86_64  Kernel: 4.10.0-42-generic
 CPU   : Intel(R) Core(TM) i7-6850K CPU @ 3.60GHz
 12 core(s), Speed :  3499.804 MHz
 L1 : 64 KB, Cache : 15360 KB
 Features : FPU TSC PAE APIC MTRR MMX SSE  SSE2 HT PNI SSSE3 SSE4_1 SSE4_2 AVX  AVX2  

ar=0.398298  NumCfft=203281  NumGauss=1186466310  NumPulse=226424236977  NumTriplet=452813964973
In v_BaseLineSmooth: NumDataPoints=1048576, BoxCarLength=8192, NumPointsInChunk=32768
Linux optimized setiathome_v8 application
Version info: AVX2jf (Intel, Core 2-optimized v8-nographics) V5.13 by Alex Kan
AVX2jf Linux64 Build 3712 , Ported by : Raistmer, JDWhale, Urs Echternacht

Work Unit Info:
...............
Credit multiplier is :  2.85
WU true angle range is :  0.398298
11:13:57 (1870): Can't acquire lockfile (-154) - waiting 35s
11:14:32 (1870): Can't acquire lockfile (-154) - exiting
11:32:03 (9219): Can't acquire lockfile (-154) - waiting 35s
11:32:38 (9219): Can't acquire lockfile (-154) - exiting
11:42:51 (13594): Can't acquire lockfile (-154) - waiting 35s
11:43:26 (13594): Can't acquire lockfile (-154) - exiting
11:53:31 (17730): Can't acquire lockfile (-154) - waiting 35s
11:54:06 (17730): Can't acquire lockfile (-154) - exiting
12:04:10 (22018): Can't acquire lockfile (-154) - waiting 35s
12:04:45 (22018): Can't acquire lockfile (-154) - exiting
12:14:59 (26203): Can't acquire lockfile (-154) - waiting 35s
12:15:34 (26203): Can't acquire lockfile (-154) - exiting
12:25:45 (30454): Can't acquire lockfile (-154) - waiting 35s
12:26:20 (30454): Can't acquire lockfile (-154) - exiting
12:36:32 (2439): Can't acquire lockfile (-154) - waiting 35s
12:37:07 (2439): Can't acquire lockfile (-154) - exiting
12:47:28 (6871): Can't acquire lockfile (-154) - waiting 35s
12:48:03 (6871): Can't acquire lockfile (-154) - exiting
12:58:08 (10898): Can't acquire lockfile (-154) - waiting 35s
12:58:43 (10898): Can't acquire lockfile (-154) - exiting
13:09:19 (15219): Can't acquire lockfile (-154) - waiting 35s
13:09:54 (15219): Can't acquire lockfile (-154) - exiting
13:19:58 (19226): Can't acquire lockfile (-154) - waiting 35s
13:20:33 (19226): Can't acquire lockfile (-154) - exiting
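
For anyone curious what that stderr loop corresponds to: each restarted process tries to take an exclusive advisory lock on the slot's lockfile, waits once, then gives up. A rough sketch of that behaviour; the 35 s delay and -154 code are copied from the log, but the real BOINC API code differs in detail.

```python
# Illustrative sketch of the acquire-wait-exit pattern in the log above.
# Uses flock-style advisory locking; the -154 code and 35 s retry mirror
# the stderr output, but this is NOT BOINC's actual implementation.
import fcntl
import time

ERR_LOCKFILE = -154  # the code shown in the stderr messages

def acquire_lock(path, retry_delay=35.0):
    """Try twice to lock `path`; return (0, handle) or (ERR_LOCKFILE, None)."""
    for attempt in range(2):
        handle = open(path, "a")
        try:
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return 0, handle  # lock is held as long as `handle` stays open
        except BlockingIOError:
            handle.close()
            if attempt == 0:
                print(f"Can't acquire lockfile ({ERR_LOCKFILE}) - waiting {int(retry_delay)}s")
                time.sleep(retry_delay)
    print(f"Can't acquire lockfile ({ERR_LOCKFILE}) - exiting")
    return ERR_LOCKFILE, None
```

Note that flock-style locks are released by the kernel when the owning process exits, so a leftover zero-byte lockfile by itself isn't normally enough to block a new task; a lock that persists suggests some process (or a child that inherited the descriptor) is still holding the file open.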

ID: 1911185 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1911187 - Posted: 6 Jan 2018, 18:37:34 UTC - in response to Message 1911185.  

OK, 'Can't acquire lockfile' is a different error from the shared memory error I was trying to shepherd you through, so my advice is irrelevant here - I'll keep out of the way.

I'd only suggest that you look in the slot folder where that task was trying to run (I assume that's where you got the stderr from), and see if a boinc_lockfile is present. If so, does the timestamp correspond to the time that particular task first tried to run?
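To make that check concrete, something along these lines lists any boinc_lockfile left in the slot directories together with its timestamp. The data-directory path is an assumption (a common default for Linux packages); point it at wherever your BOINC folder actually lives.

```python
# List any leftover boinc_lockfile in the slot directories, with timestamps.
# /var/lib/boinc-client is only a common Linux default (an assumption);
# pass your own BOINC data directory if it lives elsewhere.
from datetime import datetime
from pathlib import Path

def list_lockfiles(data_dir="/var/lib/boinc-client"):
    """Print and return (path, mtime) for each slots/*/boinc_lockfile."""
    slots = Path(data_dir) / "slots"
    found = []
    if slots.is_dir():
        for lock in sorted(slots.glob("*/boinc_lockfile")):
            stamp = datetime.fromtimestamp(lock.stat().st_mtime)
            found.append((lock, stamp))
            print(f"{lock}  {stamp:%d-%b-%Y %H:%M:%S}  {lock.stat().st_size} bytes")
    return found

list_lockfiles()
```

Compare each timestamp (the date as well as the time) against when the postponed task first tried to run.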
ID: 1911187 · Report as offensive
juan BFP Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1911189 - Posted: 6 Jan 2018, 18:41:01 UTC

This is the stdoutdae.txt file:

https://1drv.ms/t/s!Asjkc9Jyluh3zw8yYwUiRa4eKQVE

It is definitely related to the WU cache running almost dry.

The host just DL'd a few GPU WUs and continues to crunch normally.

Any suggestions? The postponed WUs are still here.

I will leave it crunching the GPU WUs for now.
ID: 1911189 · Report as offensive
juan BFP Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1911190 - Posted: 6 Jan 2018, 18:44:04 UTC - in response to Message 1911187.  

OK, 'Can't acquire lockfile' is a different error from the shared memory error I was trying to shepherd you through, so my advice is irrelevant here - I'll keep out of the way.

I'd only suggest that you look in the slot folder where that task was trying to run (I assume that's where you got the stderr from), and see if a boinc_lockfile is present. If so, does the timestamp correspond to the time that particular task first tried to run?

Yes, the file is there and its timestamp is 10:58, but it has 0 (zero) bytes.

blc05_2bit_guppi_57976_07262_HIP74926_0026.15439.818.22.45.99.vlar

ID: 1911190 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1911194 - Posted: 6 Jan 2018, 18:53:33 UTC - in response to Message 1911190.  

Zero size is fine - that's normal. Check the date as well as the time - it might be left over from the last time this happened.
ID: 1911194 · Report as offensive
juan BFP Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1911195 - Posted: 6 Jan 2018, 18:54:00 UTC

I did something extreme: stopped BOINC, went to the slot directories, deleted them all, and restarted BOINC.

All the WUs ended as computation errors, as expected, but at least the host has returned to doing its work.

I feel doomed... pause for a beer.
ID: 1911195 · Report as offensive
juan BFP Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1911196 - Posted: 6 Jan 2018, 18:57:51 UTC - in response to Message 1911194.  
Last modified: 6 Jan 2018, 18:59:18 UTC

Zero size is fine - that's normal. Check the date as well as the time - it might be left over from the last time this happened.


Yes, it was right around the time I reloaded the config file, changing the number of WUs from 5 to 6.

And the DL errors didn't help; something else is happening on the server side. A lot of DL retries.
ID: 1911196 · Report as offensive
Jeff Buck Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1911205 - Posted: 6 Jan 2018, 19:25:05 UTC - in response to Message 1911190.  

Yes, the file is there and its timestamp is 10:58, but it has 0 (zero) bytes.
That timestamp precedes the BOINC restart shown at the beginning of the log file you posted,
06-Jan-2018 11:02:33 [---] Starting BOINC client version 7.8.3 for x86_64-pc-linux-gnu

I think that's a pretty good indication that the lockfiles didn't get removed when BOINC shut down. BOINC also started a fresh log file with that restart, so to see the end of the previous run you'll need to look at the stdoutdae.old file. If the last line doesn't show "exiting", it's probably also a good indication that the BOINC client didn't finish shutting down cleanly, although it could also simply be that the log file got closed before all the messages were written.

Clearly it's not a rescheduler problem, but does seem to be a BOINC shutdown issue. It seems like something is happening too quickly during the shutdown and is preventing the lockfiles from getting deleted. If it happens again, try shutting down BOINC and then delete any lockfiles you find in the slot folders. Just the lockfiles, nothing else. Oh, and have another beer. That will definitely help. :^)
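The "just the lockfiles, nothing else" step can be scripted. A sketch, assuming the common Linux data-directory default; shut the client down first (e.g. with `boinccmd --quit`) and confirm it has exited before running anything like this.

```python
# Delete ONLY the boinc_lockfile files from the slot folders. Run this only
# after the BOINC client has fully shut down. The default data-directory
# path is an assumption; pass your own BOINC data directory instead.
from pathlib import Path

def remove_stale_lockfiles(data_dir="/var/lib/boinc-client"):
    """Remove each slots/*/boinc_lockfile; return the paths removed."""
    slots = Path(data_dir) / "slots"
    removed = []
    if slots.is_dir():
        for lock in slots.glob("*/boinc_lockfile"):
            lock.unlink()          # just the lockfile, nothing else
            removed.append(lock)
            print(f"removed {lock}")
    return removed
```

Everything else in each slot folder (checkpoints, init_data.xml, the WU data) is left untouched, so the tasks can resume instead of erroring out.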
ID: 1911205 · Report as offensive
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1911207 - Posted: 6 Jan 2018, 19:27:52 UTC - in response to Message 1911196.  
Last modified: 6 Jan 2018, 19:35:24 UTC

couldn't start app: Can't write init file: fopen() failed</message>
That looks very much like the same error I saw once when the client_state.xml file was edited while BOINC was running. It's possible the client_state.xml file was damaged the other day. It appears you just filled your cache again, which makes my suggestion much more difficult. I'd recommend running the cache dry and removing the state file so a new one can be built. If you save your Host ID and the <rpc_seqno></rpc_seqno> number, you can keep the old host number if BOINC tries to make a new Host ID. That might solve the problem.

I'd say the problem is that you edited the client_state.xml while BOINC was running.
How long did you run the CPU App without having any trouble?
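Before removing the state file, the two values mentioned above can be pulled out for safe keeping. A sketch, assuming client_state.xml carries <hostid> and <rpc_seqno> elements under each <project> block; the element names match the BOINC client's state-file format as discussed here, but verify against your own file, and the sample values below are placeholders.

```python
# Hedged sketch: extract <hostid> and <rpc_seqno> for each project from a
# client_state.xml so they can be restored if BOINC tries to create a new
# host ID. Element names are assumed from the BOINC state-file format;
# the sample values are placeholders, not real IDs.
import xml.etree.ElementTree as ET

def host_identity(xml_text):
    """Return {master_url: (hostid, rpc_seqno)} for every project found."""
    out = {}
    for proj in ET.fromstring(xml_text).iter("project"):
        url = proj.findtext("master_url", "?")
        out[url] = (proj.findtext("hostid"), proj.findtext("rpc_seqno"))
    return out

sample = """<client_state>
  <project>
    <master_url>http://setiathome.berkeley.edu/</master_url>
    <hostid>8400000</hostid>
    <rpc_seqno>123</rpc_seqno>
  </project>
</client_state>"""
print(host_identity(sample))
```

Run it against a copy of the real file (never the live one while the client is running, for the reason given above).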
ID: 1911207 · Report as offensive
juan BFP Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1911211 - Posted: 6 Jan 2018, 19:32:20 UTC - in response to Message 1911205.  
Last modified: 6 Jan 2018, 19:33:49 UTC

Clearly it's not a rescheduler problem, but does seem to be a BOINC shutdown issue. It seems like something is happening too quickly during the shutdown and is preventing the lockfiles from getting deleted. If it happens again, try shutting down BOINC and then delete any lockfiles you find in the slot folders. Just the lockfiles, nothing else. Oh, and have another beer. That will definitely help. :^)

OK, I will wait for the error to happen again and test just by erasing the lock file.
Yes, I believe we are closing in on the problem.
Something leaves the lockfiles locked, so new work can't start.
But it must release the GPU slot-related files, since those continue to work.
The rescheduler program could just make that more common, because it shuts down and restarts the BOINC process too.
I don't know BOINC's unlock process when it shuts down, but could it be related to my use of the AVX2 builds?
ID: 1911211 · Report as offensive
Jeff Buck Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1911212 - Posted: 6 Jan 2018, 19:37:20 UTC - in response to Message 1911207.  
Last modified: 6 Jan 2018, 19:38:55 UTC

That looks very much like the same error I saw once when the client_state.xml file was edited while BOINC was running. It's possible the client_state.xml file was damaged the other day. It appears you just filled your cache again, which makes my suggestion much more difficult. I'd recommend running the cache dry and removing the state file so a new one can be built. If you save your Host ID and the <rpc_seqno></rpc_seqno> number, you can keep the old host number if BOINC tries to make a new Host ID. That might solve the problem.

I'd say the problem is that you edited the client_state.xml while BOINC was running.
How long did you run the CPU App without having any trouble?
No, that's just because he deleted all the files from all the slot folders, even the ones that weren't having the lockfile problem. The ones that were successfully running simply couldn't find the files they needed on restart.
ID: 1911212 · Report as offensive
juan BFP Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1911213 - Posted: 6 Jan 2018, 19:39:53 UTC - in response to Message 1911207.  

couldn't start app: Can't write init file: fopen() failed</message>
That looks very much like the same error I saw once when the client_state.xml file was edited while BOINC was running. It's possible the client_state.xml file was damaged the other day. It appears you just filled your cache again, which makes my suggestion much more difficult. I'd recommend running the cache dry and removing the state file so a new one can be built. If you save your Host ID and the <rpc_seqno></rpc_seqno> number, you can keep the old host number if BOINC tries to make a new Host ID. That might solve the problem.

I'd say the problem is that you edited the client_state.xml while BOINC was running.
How long did you run the CPU App without having any trouble?

For a few weeks at least, since the beginning of December.
The error only appears when the CPU WU cache is close to zero, apparently only when the last WU is crunched in the slot, or something close to that. It's not very common, since I always try to keep my cache close to the 1000 WU limit.
I will save this for a second try.
ID: 1911213 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.