Message boards :
Number crunching :
Postponed: Waiting to acquire lock
Author | Message |
---|---|
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
"I UL the file to my onedrive if you wish to see it." Well, after looking at that, and hearing about the reschedule immediately prior, I'm going to say your problem was most likely caused by rescheduling. Hopefully you won't find it necessary to experience that again. I'm still not sure about my problem, which only happened with two tasks over a 2-month period. I ran one of the tasks in the benchmark app and it worked fine with the 3711 CPU App. It's been a few days and I haven't seen the problem again; possibly it was just a brief cosmic ray storm a few days ago. As for the API line, if you look at the Windows AVX CPU App, it doesn't have an API line in its app_info, http://mikesworld.eu/download.html. The Linux AVX App doesn't have an API line either, http://lunatics.kwsn.info/index.php?action=downloads;sa=view;down=467. So that is probably not your problem. Also, the AVX2 App was built with API 7.5.0, so even your API line is wrong. I'm not sure if you should change the line to 7.5.0 or just remove it... your call. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
"So that is probably not your problem. Also, the AVX2 App was built with API 7.5.0, so even your API line is wrong. I'm not sure if you should change the line to 7.5.0 or just remove it... your call." I suggested adding the API version line because Juan posted an error message which is specific to a function for which the API version is used. Let's wait and see if it's had any effect before jumping to conclusions. The test relating to shared memory is precisely "At least 6.0". Any higher value will do. There's no point in worrying about any numeric change. The test which kicks in at 7.5 is different. Above that point, the BOINC Client no longer passes the device number to be used (for GPU apps) on the command line. We had to get Raistmer to re-organise his code so that different tasks ran on device 1, device 2, and so on. Without fixing it, all tasks ran on device zero, whatever device BOINC thought (and displayed) that they were running on. I don't know whether the other developers were paying attention or not. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
There are an awful lot of people running the Windows and Linux AVX Apps without any problems or api lines. |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
@TBar Yes, after all I really believe it is some kind of incompatibility between the rescheduler and the way Linux or the build works. Please, I'm not saying something is broken or has a bug; the mix just doesn't work well, in my case at least. But now I know about that, so I can try to avoid the extreme situation that produces the error. Maybe something similar could be happening with your WUs. I know it's detective work to find the needle in the haystack, and for that I thank all who helped me try to trace and fix the problem. Anyway, I will leave the API line as Richard posted and see if the problem happens again in the next outage. Since you don't read our team forum, I believe I never had a real opportunity to say thanks to you directly for the nice work you have done, which made it possible for us Linux newbies to run these CUDA90 builds on our hosts. That was amazing: I had never run a Linux box before and I was able to change my host from Windows to Linux with almost no pain. And the result is clear: my host is now the #2 top cruncher by RAC, even without needing to change my 1070s for 1080 Ti GPUs. It went from 120K/day to up to 200K/day just because of the software update, something really incredible. So a big thanks for that! |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
"There are an awful lot of people running the Windows and Linux AVX Apps without any problems or api lines." Great - they can stop reading this specific thread at this point. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
"@TBar" There isn't a rescheduler App for OSX; if I find it necessary to reschedule, I just use the Text Editor with Find & Replace instead, even on Linux. Right now, my OSX app_info still has the API line in it from the last time Richard brought it up; it didn't seem to help any this last time. I'm also running the 3711 CPU App on a Linux machine, and so far it hasn't had any trouble without an API line. Yes, it does make things much easier when you place the BOINC folder in your Linux Home folder, especially if you have one Home partition with a few different System partitions. Kinda a pain to have to have a different BOINC for each system folder. Hopefully the next CUDA App will solve the few remaining problems and I can post a zi3xs version; zi3xs is a little faster than zi3v. |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
"Hopefully the next CUDA App will solve the few remaining problems and I can post a zi3xs version; zi3xs is a little faster than zi3v." Can't wait to use it, and if you need a host to test on, feel free to ask. |
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
"Hopefully the next CUDA App will solve the few remaining problems and I can post a zi3xs version; zi3xs is a little faster than zi3v." Count me in as a beta tester too. Seti@Home classic workunits: 20,676 CPU time: 74,226 hours. A proud member of the OFA (Old Farts Association) |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
OK, that is new. No need to wait a long time... the error repeats. Again the cache is almost empty (project backoff... again). My cache is down to the last 8 CPU WUs, and all of them show the acquire lock message. Now I'm sure of this: Einstein is out (set to NNT yesterday); I did not reschedule anything; my host was not rebooted; BOINC was running without interference; and the API line is in place (I will post the file in the next message with the BOINC log file). The only change I made on the host was to raise the number of CPU WUs from 5 to 6, as suggested by Mike earlier. This is the error file of one of them: not using mb_cmdline.txt-file, using commandline options. Build features: SETI8 Non-graphics FFTW FFTOUT JSPF AVX2 64bit System: Linux x86_64 Kernel: 4.10.0-42-generic CPU : Intel(R) Core(TM) i7-6850K CPU @ 3.60GHz 12 core(s), Speed : 3499.804 MHz L1 : 64 KB, Cache : 15360 KB Features : FPU TSC PAE APIC MTRR MMX SSE SSE2 HT PNI SSSE3 SSE4_1 SSE4_2 AVX AVX2 ar=0.398298 NumCfft=203281 NumGauss=1186466310 NumPulse=226424236977 NumTriplet=452813964973 In v_BaseLineSmooth: NumDataPoints=1048576, BoxCarLength=8192, NumPointsInChunk=32768 Linux optimized setiathome_v8 application Version info: AVX2jf (Intel, Core 2-optimized v8-nographics) V5.13 by Alex Kan AVX2jf Linux64 Build 3712 , Ported by : Raistmer, JDWhale, Urs Echternacht Work Unit Info: ............... Credit multiplier is : 2.85 WU true angle range is : 0.398298 11:13:57 (1870): Can't acquire lockfile (-154) - waiting 35s 11:14:32 (1870): Can't acquire lockfile (-154) - exiting 11:32:03 (9219): Can't acquire lockfile (-154) - waiting 35s 11:32:38 (9219): Can't acquire lockfile (-154) - exiting 11:42:51 (13594): Can't acquire lockfile (-154) - waiting 35s 11:43:26 (13594): Can't acquire lockfile (-154) - exiting 11:53:31 (17730): Can't acquire lockfile (-154) - waiting 35s 11:54:06 (17730): Can't acquire lockfile (-154) - exiting 12:04:10 (22018): Can't acquire lockfile (-154) - waiting 35s 12:04:45 (22018): Can't acquire lockfile (-154) - exiting 12:14:59 (26203): Can't acquire lockfile (-154) - waiting 35s 12:15:34 (26203): Can't acquire lockfile (-154) - exiting 12:25:45 (30454): Can't acquire lockfile (-154) - waiting 35s 12:26:20 (30454): Can't acquire lockfile (-154) - exiting 12:36:32 (2439): Can't acquire lockfile (-154) - waiting 35s 12:37:07 (2439): Can't acquire lockfile (-154) - exiting 12:47:28 (6871): Can't acquire lockfile (-154) - waiting 35s 12:48:03 (6871): Can't acquire lockfile (-154) - exiting 12:58:08 (10898): Can't acquire lockfile (-154) - waiting 35s 12:58:43 (10898): Can't acquire lockfile (-154) - exiting 13:09:19 (15219): Can't acquire lockfile (-154) - waiting 35s 13:09:54 (15219): Can't acquire lockfile (-154) - exiting 13:19:58 (19226): Can't acquire lockfile (-154) - waiting 35s 13:20:33 (19226): Can't acquire lockfile (-154) - exiting |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
OK, 'Can't acquire lockfile' is a different error from the shared memory error I was trying to shepherd you through, so my advice is irrelevant here - I'll keep out of the way. I'd only suggest that you look in the slot folder where that task was trying to run (I assume that's where you got the stderr from), and see if a boinc_lockfile is present. If so, does the timestamp correspond to the time that particular task first tried to run? |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
This is the stdoutdae.txt file: https://1drv.ms/t/s!Asjkc9Jyluh3zw8yYwUiRa4eKQVE It is definitely related to the WU cache being almost dry. I just downloaded a few GPU WUs and they continue to crunch normally. Any suggestions? The postponed WUs are still here. I will leave it crunching the GPU WUs for now. |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
"OK, 'Can't acquire lockfile' is a different error from the shared memory error I was trying to shepherd you through, so my advice is irrelevant here - I'll keep out of the way." Yes, the file is there and its timestamp is 10:58, but it has 0 (zero) bytes. blc05_2bit_guppi_57976_07262_HIP74926_0026.15439.818.22.45.99.vlar |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Zero size is fine - that's normal. Check the date as well as the time - it might be left over from the last time this happened. |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
I did something extreme: stopped BOINC, went to the slot directories, deleted them all, and restarted BOINC. All the WUs ended as computation errors, as expected. But at least the host has returned to doing its work. I feel doomed... pause for a beer. |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
"Zero size is fine - that's normal. Check the date as well as the time - it might be left over from the last time this happened." Yes, it was right around the time I reloaded the config file, changing the number of WUs from 5 to 6. And the DL errors did not help; something else is happening on the server side. A lot of DL retries. |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
"Yes, the file is there and its timestamp is 10:58, but it has 0 (zero) bytes." That timestamp precedes the BOINC restart shown at the beginning of the log file you posted: "06-Jan-2018 11:02:33 [---] Starting BOINC client version 7.8.3 for x86_64-pc-linux-gnu". I think that's a pretty good indication that the lockfiles didn't get removed when BOINC shut down. BOINC also started a fresh log file with that restart, so to see the end of the previous run you'll need to look at the stdoutdae.old file. If the last line doesn't show "exiting", it's probably also a good indication that the BOINC client didn't finish shutting down cleanly, although it could also simply be that the log file got closed before all the messages were written. Clearly it's not a rescheduler problem, but does seem to be a BOINC shutdown issue. It seems like something is happening too quickly during the shutdown and is preventing the lockfiles from getting deleted. If it happens again, try shutting down BOINC and then delete any lockfiles you find in the slot folders. Just the lockfiles, nothing else. Oh, and have another beer. That will definitely help. :^) |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
"couldn't start app: Can't write init file: fopen() failed</message>" That looks very much like the same error I saw once when the client_state.xml file was edited with BOINC running. It's possible the client_state.xml file was damaged the other day. It appears you just filled your cache again, which makes my suggestion much more difficult. I'd recommend running the cache dry and removing the state file so a new one can be built. If you save your Host ID and the <rpc_seqno></rpc_seqno> number, you can keep the old Host number if BOINC tries to make a new Host ID. That might solve the problem. I'd say the problem is you edited the client_state.xml while BOINC was running. How long did you run the CPU App without having any trouble? |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
"Clearly it's not a rescheduler problem, but does seem to be a BOINC shutdown issue. It seems like something is happening too quickly during the shutdown and is preventing the lockfiles from getting deleted. If it happens again, try shutting down BOINC and then delete any lockfiles you find in the slot folders. Just the lockfiles, nothing else. Oh, and have another beer. That will definitely help. :^)" OK, I will wait for the error to happen again and test by erasing just the lockfiles. Yes, I believe we are closing in on the problem. Something leaves the lockfiles locked, so the new work can't start. But it does release the files for the GPU slots, since those continue to work. The rescheduler program could just make that more common, because it shuts down and restarts the BOINC process too. I don't know how BOINC releases the locks when it shuts down, but could it be related to my using the AVX2 builds? |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
"That looks very much like the same error I saw once when the client_state.xml file was edited with BOINC running. It's possible the client_state.xml file was damaged the other day. It appears you just filled your cache again, which makes my suggestion much more difficult. I'd recommend running the cache dry and removing the state file so a new one can be built. If you save your Host ID and the <rpc_seqno></rpc_seqno> number, you can keep the old Host number if BOINC tries to make a new Host ID. That might solve the problem." No, that's just because he deleted all the files from all the slot folders, even the ones that weren't having the lockfile problem. The ones that were successfully running simply couldn't find the files they needed on restart. |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
"couldn't start app: Can't write init file: fopen() failed</message>" "That looks very much like the same error I saw once when the client_state.xml file was edited with BOINC running. It's possible the client_state.xml file was damaged the other day. It appears you just filled your cache again, which makes my suggestion much more difficult. I'd recommend running the cache dry and removing the state file so a new one can be built. If you save your Host ID and the <rpc_seqno></rpc_seqno> number, you can keep the old Host number if BOINC tries to make a new Host ID. That might solve the problem." For a few weeks at least, since the beginning of December. The error only appears when the CPU WU cache is close to zero, apparently only when the last WU is crunched in the slot, or something close to that. It is not very common, since I always try to keep my cache close to the 1000 WU limit. I will save this for a second try. |
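For anyone who wants to script the check discussed in this thread (compare each slot's boinc_lockfile timestamp with the time the BOINC client last started, then remove only the leftover lockfiles), here is a minimal Python sketch. It is not part of BOINC. The function names are hypothetical, the layout of one boinc_lockfile per numbered folder under a slots directory is assumed from the posts above, and the log-line format is taken from the "Starting BOINC client" line quoted in the thread. It does not check whether BOINC is still running, so stop the client first, as advised above.

```python
from datetime import datetime
from pathlib import Path
from typing import List

LOCKFILE_NAME = "boinc_lockfile"  # file name as mentioned in the thread


def parse_client_start(log_line: str) -> datetime:
    """Read the timestamp from a 'Starting BOINC client' log line.

    Assumes the line begins like '06-Jan-2018 11:02:33 [---] ...' and that
    month abbreviations are in English (depends on the system locale).
    """
    stamp = " ".join(log_line.split()[:2])
    return datetime.strptime(stamp, "%d-%b-%Y %H:%M:%S")


def find_stale_lockfiles(slots_dir: Path, client_start: datetime) -> List[Path]:
    """Return lockfiles last modified before the client started.

    Assumed layout (hypothetical helper): <slots_dir>/<n>/boinc_lockfile.
    """
    stale = []
    for lock in sorted(slots_dir.glob(f"*/{LOCKFILE_NAME}")):
        modified = datetime.fromtimestamp(lock.stat().st_mtime)
        if modified < client_start:
            stale.append(lock)
    return stale


def remove_lockfiles(lockfiles: List[Path]) -> None:
    """Delete only the listed lockfiles; nothing else in a slot is touched."""
    for lock in lockfiles:
        lock.unlink()
```

A cautious way to use it is to call find_stale_lockfiles first and print the result, and only pass that list to remove_lockfiles once the paths look right. Deleting just the lockfiles, rather than whole slot folders, avoids the computation errors reported earlier in the thread.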
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.