Problem with AP?
Author | Message |
---|---|
Martin Joined: 7 Aug 13 Posts: 3 Credit: 1,604,771 RAC: 2 |
On checking the running status, I found that an AP WU appeared to be "stuck" at about 31% completed: the elapsed time kept increasing, but the % done did not. I tried to see if anything could unstick it and, finally, rebooted my computer. This seems to have restarted the application. The % completed remained the same, but the elapsed time dropped from some 35 hours to 15, and the estimated time to completion went up from 20 hours to 35. Has anyone else noticed this problem? Martin |
spitfire_mk_2 Joined: 14 Apr 00 Posts: 563 Credit: 27,306,885 RAC: 0 |
On checking running status, I found that an AP WU appeared to have been "stuck" with some 31% completed, but with time elapsed increasing but not % done. I tried to see if anything could unstick it and, finally, rebooted my computer. It happens. Is it a CPU WU or a GPU WU? |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
Has anyone else noticed this problem? It came up a number of times last year. There were a couple of threads about it many months ago. I know I used to experience it myself for a while, primarily on my daily driver. I was always able to clear it by simply suspending processing for several seconds (using the Activity menu in BOINC Manager) and then simply resuming the processing. No reboot necessary. What I eventually found seemed to be at the root of my problem was the fact that I had the "Use at most nnn% CPU time" option set to less than 100%, which I was using to control temperatures. It seemed that every time an AP task got hung up, it was precisely at the point where the task was taking a checkpoint, and I began to feel that perhaps if a checkpoint was being written at the exact same time that BOINC was momentarily suspending the task to meet the maximum CPU time criteria, it was unable to resume processing for some reason. The problem was not happening on machines where I had the CPU time usage set to 100%. What I ended up trying was to also set the CPU Time usage back to 100% on my daily driver and switch to TThrottle to control my temperatures. I never had a hung AP task again after I made that change. (Knock on wood!) |
Martin Joined: 7 Aug 13 Posts: 3 Credit: 1,604,771 RAC: 2 |
Many thanks, Jeff. I have just set the CPU usage to 100%, and will see what happens. Martin |
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
I had a similar problem two days ago. Actually twice. The first time, I ended up aborting the task (without hesitation: 5 min at 0% progress). Then I got the same idea: suspend and resume after a minute. That did not seem to work, but I left the process to continue whatever it was doing... After 12 minutes it said 0.9% done, and some 40 seconds later 1.8%, etc. A momentary hiccup? During the night I observed another similar one. I did nothing, and after about 10 minutes or so it went on processing as normal. To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
I'm guessing that it was a GPU task you're referring to. While the progress indicator on AP CPU tasks seems to progress smoothly, the progress for AP GPU tasks jumps in increments of 0.9% (or 0.901%) at a time. If you have a general idea of how many minutes the run time usually is for your AP GPU tasks, you can just divide that by 110 to estimate how frequently the progress indicator will increment. (I think, though, that the amount of blanking can make that interval vary significantly.) |
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
Yes, a GPU task. When running 4 at a time per GPU, they normally take a little under 40 seconds per 0.9%. A normal non-blanked task runs about one hour and 6 minutes. The strange thing was that a task seemed to do nothing for the first several minutes and then started updating at the normal 40-second interval. It could be that blanking was needed at the beginning of the WU. To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Yes a GPU task. They go normally when running 4 at a time per GPU like little under 40 seconds per 0.9%. A normal non blanked task runs about one hour and 6 minutes. When blanking is needed, each of the 111 intervals needs the same amount, so that wasn't the cause. There are initialization activities, of course, but none ought to take anything like the 10 or 12 minutes noted in your previous post. It's a puzzle. Joe |
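The interval estimate discussed in the posts above can be worked through with figures quoted in this thread: a non-blanked run time of about one hour and 6 minutes, and 111 progress intervals. This is only a rough sketch; the function name is illustrative, and blanking can make the real interval vary.

```python
def progress_interval_seconds(run_time_seconds, intervals=111):
    """Estimate the seconds between progress-bar steps of an AP GPU task.

    The default of 111 intervals is the figure quoted in the thread.
    """
    if intervals <= 0:
        raise ValueError("intervals must be positive")
    return run_time_seconds / intervals


# Run time quoted in the thread: one hour and 6 minutes.
run_time = 66 * 60
interval = progress_interval_seconds(run_time)
percent_per_step = 100 / 111

print(f"{interval:.1f} s per step")       # 35.7 s per step
print(f"{percent_per_step:.3f}% per step")  # 0.901% per step
```

The result, roughly 36 seconds per step, agrees with the "little under 40 seconds per 0.9%" reported above, and 100/111 gives the 0.901% step size.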
ExchangeMan Joined: 9 Jan 00 Posts: 115 Credit: 157,719,104 RAC: 0 |
I had a similar problem two days ago. I've seen this behavior several times myself. The task would indicate no progress for 5 or 6 minutes, then jump to 0.9%. After that it would continue normally and finish normally. I used to abort tasks that took more than 5 minutes to start indicating progress. I know better now and just let them go, although I did get some tasks that halted progress at about 50%. Sometimes suspending and resuming the tasks can get them going normally. I'm not sure why this happens, but perhaps the data in the .wu file is corrupted to a degree. Oh well, I'm sure this has happened quite often while I was at work or sleeping and went unnoticed. |
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
For a GPU task (especially an ATI GPU AP task) this can happen because of a fully loaded CPU. There are 3 possible solutions: 1) keep a core free, maybe even 2. 2) use the -cpu_lock option with a recent enough build (one where this option sets process affinity correctly). 3) use third-party software like ProcessLasso to limit process affinity to only 1 CPU (different CPUs should be used if several GPU tasks are running simultaneously). SETI apps news We're not gonna fight them. We're gonna transcend them. |
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
OK, I have 6 free cores for 8 GPU AP tasks. The GPUs are NV GTX780. The machine is not near CPU starvation. If the load was 12 or more it would be. Note that the GPU AP is using from 10% to 25% of a CPU for blanking and stuff. The CPU usage can be as low as 1-2% when there is no blanking.
top - 21:34:15 up 1 day, 9:22, 2 users, load average: 6.82, 6.76, 6.72
Tasks: 301 total, 14 running, 287 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.5%us, 0.3%sy, 60.0%ni, 38.9%id, 0.3%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8172532k total, 3812664k used, 4359868k free, 277944k buffers
Swap: 10256380k total, 0k used, 10256380k free, 1992836k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10056 boinc 39 19 43824 39m 4 R 99.8 0.5 66:42.58 ../../projects/setiathome.berkeley.edu/setiathome_7.01_x86_64-pc-linux-gnu
10107 boinc 39 19 44596 40m 4 R 99.8 0.5 37:39.74 ../../projects/setiathome.berkeley.edu/setiathome_7.01_x86_64-pc-linux-gnu
10157 boinc 39 19 41712 37m 4 R 99.8 0.5 20:53.31 ../../projects/setiathome.berkeley.edu/setiathome_7.01_x86_64-pc-linux-gnu
10073 boinc 39 19 43780 39m 4 R 99.7 0.5 57:02.76 ../../projects/setiathome.berkeley.edu/setiathome_7.01_x86_64-pc-linux-gnu
10089 boinc 39 19 44504 40m 4 R 99.7 0.5 45:40.06 ../../projects/setiathome.berkeley.edu/setiathome_7.01_x86_64-pc-linux-gnu
10103 boinc 39 19 43160 38m 4 R 97.6 0.5 42:04.72 ../../projects/setiathome.berkeley.edu/setiathome_7.01_x86_64-pc-linux-gnu
10043 boinc 30 10 36.3g 133m 110m R 25.4 1.7 16:11.24 ../../projects/setiathome.berkeley.edu/ap_6.07r1952_avx_clGPU_x86_64-pc-linux-gnu -unroll 24 -ffa_block 4096 -ffa_block_fetch 4096 -sbs 512 -tune 1 32 32 1 --device 1
10144 boinc 30 10 36.3g 133m 110m R 20.8 1.7 5:06.94 ../../projects/setiathome.berkeley.edu/ap_6.07r1952_avx_clGPU_x86_64-pc-linux-gnu -unroll 24 -ffa_block 4096 -ffa_block_fetch 4096 -sbs 512 -tune 1 32 32 1 --device 1
10204 boinc 30 10 36.3g 133m 110m R 17.9 1.7 1:37.52 ../../projects/setiathome.berkeley.edu/ap_6.07r1952_avx_clGPU_x86_64-pc-linux-gnu -unroll 24 -ffa_block 4096 -ffa_block_fetch 4096 -sbs 512 -tune 1 32 32 1 --device 1
10110 boinc 30 10 36.3g 133m 110m R 15.8 1.7 5:55.96 ../../projects/setiathome.berkeley.edu/ap_6.07r1952_avx_clGPU_x86_64-pc-linux-gnu -unroll 24 -ffa_block 4096 -ffa_block_fetch 4096 -sbs 512 -tune 1 32 32 1 --device 0
10162 boinc 30 10 36.3g 133m 110m R 15.4 1.7 2:55.17 ../../projects/setiathome.berkeley.edu/ap_6.07r1952_avx_clGPU_x86_64-pc-linux-gnu -unroll 24 -ffa_block 4096 -ffa_block_fetch 4096 -sbs 512 -tune 1 32 32 1 --device 1
10064 boinc 30 10 36.3g 133m 110m R 13.6 1.7 8:18.23 ../../projects/setiathome.berkeley.edu/ap_6.07r1952_avx_clGPU_x86_64-pc-linux-gnu -unroll 24 -ffa_block 4096 -ffa_block_fetch 4096 -sbs 512 -tune 1 32 32 1 --device 0
10186 boinc 30 10 36.3g 133m 110m S 10.6 1.7 1:20.49 ../../projects/setiathome.berkeley.edu/ap_6.07r1952_avx_clGPU_x86_64-pc-linux-gnu -unroll 24 -ffa_block 4096 -ffa_block_fetch 4096 -sbs 512 -tune 1 32 32 1 --device 0
10194 boinc 30 10 36.3g 133m 110m S 10.6 1.7 1:14.56 ../../projects/setiathome.berkeley.edu/ap_6.07r1952_avx_clGPU_x86_64-pc-linux-gnu -unroll 24 -ffa_block 4096 -ffa_block_fetch 4096 -sbs 512 -tune 1 32 32 1 --device 0
To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
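The starvation argument in the post above compares the load average with a threshold of 12. A small sketch of that comparison follows; it assumes the threshold of 12 is the host's logical CPU count (inferred from the post, not stated in it), and the function name is illustrative.

```python
def cpu_starved(load_average, logical_cpus):
    """Return True when the load average reaches the number of logical CPUs."""
    if logical_cpus <= 0:
        raise ValueError("logical_cpus must be positive")
    return load_average >= logical_cpus


# 1-minute load average from the top output above; 12 is the threshold
# named in the post, assumed here to be the logical CPU count.
print(cpu_starved(6.82, 12))   # False
print(cpu_starved(12.5, 12))   # True
```

On a live Linux host the two inputs could instead be read with `os.getloadavg()[0]` and `os.cpu_count()`.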
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
OK, my advice applies only to Windows hosts. I'm not sure the Linux port supports the -cpu_lock switch at all. What could help on Linux (including whether freeing a CPU is needed at all on Linux) still needs to be found out. SETI apps news We're not gonna fight them. We're gonna transcend them. |
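For anyone who wants to experiment on Linux, the idea of giving each GPU task its own CPU can be tried with the kernel's affinity call, which Python exposes as `os.sched_setaffinity`. Whether this helps with the stalls described in this thread is untested. The helper names and the CPU numbers 6 and 7 are illustrative assumptions; the two PIDs are taken from the top output posted above.

```python
import os


def plan_affinity(pids, cpus):
    """Pair each PID with one CPU so that no two tasks share a CPU."""
    cpus = sorted(cpus)
    if len(pids) > len(cpus):
        raise ValueError("not enough CPUs to give each task its own")
    return dict(zip(pids, cpus))


def apply_affinity(plan):
    """Pin each process to its planned CPU (Linux only).

    Changing another user's process may require elevated privileges.
    """
    for pid, cpu in plan.items():
        os.sched_setaffinity(pid, {cpu})


# Two GPU-task PIDs from the top output, with hypothetical CPU numbers.
plan = plan_affinity([10043, 10144], cpus=[6, 7])
print(plan)  # {10043: 6, 10144: 7}
```

The planning step is kept separate from `apply_affinity` so the mapping can be inspected before any process is actually changed.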
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Hi. I don't know if this is related to this thread, but today, after a few days of everything working fine, I got this error: http://setiathome.berkeley.edu/result.php?resultid=3370196609 I have never seen one like this before on this host and, as always, I have no idea why. |
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
Hi 194 (0xc2) EXIT_ABORTED_BY_CLIENT <message> finish file present too long </message> There is a limited amount of space allocated for the return data file and this limit was reached. However, I am not sure if it is related to the 4 times the task was restarted. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[ |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Thanks, HAL9000. The question is why this is happening on this host, which crunches a lot of other WUs without any errors, when nothing was changed on it. And is there something I could do to avoid it happening again? |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Hi I think the measure of 'too long' is time, rather than size - there's a different message for 'too large': #define ERR_FILE_TOO_BIG -131 // an output file was bigger than max_nbytes |
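Read this way, the check is about how long a file has been present rather than how big an output file is. A minimal sketch of a time-based check of that kind is given below, purely as an illustration: the function name and the 10-second threshold are placeholders, and neither is taken from the BOINC client source or from this thread.

```python
import os
import time


def finish_file_present_too_long(path, timeout_seconds=10.0, now=None):
    """Return True if the file at `path` has existed longer than the timeout.

    The timeout default is a placeholder, not the real client's value.
    """
    if now is None:
        now = time.time()
    try:
        written_at = os.path.getmtime(path)
    except FileNotFoundError:
        # No file yet, so there is nothing to time out on.
        return False
    return (now - written_at) > timeout_seconds
```

A size-based limit such as the `ERR_FILE_TOO_BIG` case quoted above would instead compare the file size (for example from `os.path.getsize`) against a byte limit.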
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
Thanks, Richard, but I still don't understand why this error is happening; this host crunches a lot of other WUs with no errors. BTW, whether it is time or size makes little difference to us users: we can't control either of them, or can we? |
Mike Joined: 17 Feb 01 Posts: 34253 Credit: 79,922,639 RAC: 80 |
Thanks, Richard, but I still don't understand why this error is happening; this host crunches a lot of other WUs with no errors. BTW, whether it is time or size makes little difference to us users: we can't control either of them, or can we? Of course you can. You reduced FFA_block/fetch again. 5500 seconds for 7% blanking is too long for a 780. With each crime and every kindness we birth our future. |
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
But it's running 3 WUs at a time, so it takes about 1 1/4 hours to crunch an AP WU if nothing stops the crunching. The time for this WU is really weird; normally it should take about 3500 seconds. But what caused that? And why not on the other WUs crunched on the host, which were crunched with the same reduced settings too? I reduced FFA_block/fetch to avoid video lag. Remember, I use slow i5 CPUs to feed the fast GPUs; with the bigger numbers the video lag was present, and the new reduced numbers are working fine on all my 780s. This is the only WU with an error in days. |
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
Hi That does make more sense. Note to self: Stop trying to think before coffee. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[ |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.