Problem with AP?

Message boards : Number crunching : Problem with AP?

Martin
Joined: 7 Aug 13
Posts: 3
Credit: 1,604,771
RAC: 2
United Kingdom
Message 1471687 - Posted: 1 Feb 2014, 19:35:02 UTC

On checking the running status, I found that an AP WU appeared to be "stuck" at some 31% completed, with the elapsed time increasing but the % done not moving. I tried to see if anything could unstick it and, finally, rebooted my computer.

This seems to have restarted the application. The % completed remained the same, but the elapsed time dropped from some 35 hours down to 15, and the estimated time to completion went up from 20 hours to 35.

Has anyone else noticed this problem?

Martin
ID: 1471687
spitfire_mk_2
Joined: 14 Apr 00
Posts: 563
Credit: 27,306,885
RAC: 0
United States
Message 1471691 - Posted: 1 Feb 2014, 19:54:16 UTC - in response to Message 1471687.  

It happens.

Is it a CPU WU or a GPU WU?
ID: 1471691
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1471695 - Posted: 1 Feb 2014, 20:29:50 UTC - in response to Message 1471687.  

It came up a number of times last year. There were a couple of threads about it many months ago. I know I used to experience it myself for a while, primarily on my daily driver. I was always able to clear it by simply suspending processing for several seconds (using the Activity menu in BOINC Manager) and then simply resuming the processing. No reboot necessary.

What I eventually found seemed to be at the root of my problem was the fact that I had the "Use at most nnn% CPU time" option set to less than 100%, which I was using to control temperatures. It seemed that every time an AP task got hung up, it was precisely at the point where the task was taking a checkpoint, and I began to feel that perhaps if a checkpoint was being written at the exact same time that BOINC was momentarily suspending the task to meet the maximum CPU time criteria, it was unable to resume processing for some reason. The problem was not happening on machines where I had the CPU time usage set to 100%.

What I ended up trying was to also set the CPU Time usage back to 100% on my daily driver and switch to TThrottle to control my temperatures. I never had a hung AP task again after I made that change. (Knock on wood!)
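
A rough way to picture the theory above, as a sketch only: assume the throttle alternates a run phase and a pause phase within a fixed period, and that a checkpoint write starts at a random moment in the run phase. The 1-second period, the checkpoint duration, and the function name are illustrative assumptions, not measured BOINC values. The chance that the write is still in progress when the pause begins would then be:

```python
def checkpoint_pause_overlap(cpu_pct, checkpoint_secs, period_secs=1.0):
    """Chance that a checkpoint write is interrupted by a throttle pause.

    Toy model (all values hypothetical): each period consists of a run
    phase of period_secs * cpu_pct / 100 followed by a pause, and the
    write starts at a uniformly random moment within the run phase.
    """
    if not 0 < cpu_pct <= 100:
        raise ValueError("cpu_pct must be in (0, 100]")
    if cpu_pct == 100:
        return 0.0  # the task is never paused, so nothing can collide
    run_secs = period_secs * cpu_pct / 100.0
    # The write overlaps the pause if it starts in the last
    # checkpoint_secs of the run phase.
    return min(1.0, checkpoint_secs / run_secs)


for pct in (100, 80, 50):
    print(pct, checkpoint_pause_overlap(pct, checkpoint_secs=0.05))
```

The model only says that such collisions are possible below 100% and impossible at 100%, which is consistent with the observation above. It says nothing about why a task would then fail to resume.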
ID: 1471695
Martin
Joined: 7 Aug 13
Posts: 3
Credit: 1,604,771
RAC: 2
United Kingdom
Message 1471700 - Posted: 1 Feb 2014, 20:51:30 UTC - in response to Message 1471695.  

Many thanks, Jeff.

I have just set the CPU usage to 100%, and will see what happens.

Martin
ID: 1471700
petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1471709 - Posted: 1 Feb 2014, 21:17:46 UTC - in response to Message 1471695.  

I had a similar problem two days ago.

Actually twice.

I ended up first aborting the task (without hesitation: 5 min at 0% progress). Then I got the same idea: suspend and resume after a minute. That did not seem to work, but I left the process to continue whatever it was doing... After 12 minutes it said 0.9% done, and some 40 seconds later 1.8%, etc.

A momentary hiccup?

During the night I observed another similar one. I did nothing, and after about 10 minutes or so it went on processing as normal.
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1471709
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1471714 - Posted: 1 Feb 2014, 21:38:24 UTC - in response to Message 1471709.  

I'm guessing that it was a GPU task you're referring to. While the progress indicator on AP CPU tasks seems to advance smoothly, the progress for AP GPU tasks jumps in increments of 0.9% (or 0.901%) at a time. If you have a general idea of how many minutes the run time usually is for your AP GPU tasks, you can just divide that by about 111 (100% / 0.901%) to estimate how frequently the progress indicator will increment. (I think, though, that the amount of blanking can make that interval vary significantly.)
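
The estimate above can be sketched in a few lines. This assumes the progress bar moves in equal steps of about 0.901%, so roughly 111 steps per task (100 / 0.901); the function name and the 66-minute run time are only example values.

```python
def progress_step_seconds(run_time_minutes, steps=111):
    """Average seconds between progress updates for an AP GPU task.

    steps=111 assumes equal increments of about 0.901% each.
    """
    return run_time_minutes * 60.0 / steps


# Example: a task that usually takes 66 minutes in total
print(round(progress_step_seconds(66), 1))  # -> 35.7
```

If the indicator sits still for many multiples of this interval, the task is more likely hung than merely between updates.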
ID: 1471714
petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1471969 - Posted: 2 Feb 2014, 16:06:58 UTC - in response to Message 1471714.  

Yes, a GPU task. When running 4 at a time per GPU, they normally take a little under 40 seconds per 0.9%. A normal non-blanked task runs about one hour and 6 minutes.

The strange thing was that a task seemed to do nothing in the first several minutes and then started going through the normal 40 sec interval update. It could be that blanking was needed at the beginning of the WU.
ID: 1471969
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1471989 - Posted: 2 Feb 2014, 17:28:22 UTC - in response to Message 1471969.  

When blanking is needed, each of the 111 intervals needs the same amount so that wasn't the cause. There are initialization activities, of course, but none ought to take anything like the 10 or 12 minutes noted in your previous post. It's a puzzle.
Joe
ID: 1471989
ExchangeMan
Volunteer tester
Joined: 9 Jan 00
Posts: 115
Credit: 157,719,104
RAC: 0
United States
Message 1471994 - Posted: 2 Feb 2014, 17:43:15 UTC - in response to Message 1471709.  

I've seen this behavior several times myself. The task would indicate no progress for 5 or 6 minutes, then jump to 0.9%. After that it would continue normally and finish normally. I used to abort tasks that took more than 5 minutes to start indicating progress. I know better now and just let them go, although I did get some tasks that halted progress at about 50%. Sometimes suspending and resuming the task can get it going normally.

I'm not sure why this happens, but perhaps the data in the .wu file is corrupted to a degree. Oh well, I'm sure this has happened quite often while I was at work or sleeping and went unnoticed.
ID: 1471994
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1472039 - Posted: 2 Feb 2014, 19:30:12 UTC

For a GPU task (especially an ATI GPU AP task), this can happen because of a fully loaded CPU.
There are 3 possible solutions:
1) Keep a core free, maybe even 2.
2) Use the -cpu_lock option with a recent enough build (one where this option sets process affinity correctly).
3) Use third-party software like ProcessLasso to limit process affinity to only 1 CPU (different CPUs should be used if several GPU tasks are running simultaneously).
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1472039
petri33
Volunteer tester
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1472045 - Posted: 2 Feb 2014, 19:49:56 UTC - in response to Message 1472039.  

OK,

I have 6 free cores for 8 GPU AP tasks. The GPUs are NV GTX780. The machine is not near CPU starvation. If the load was 12 or more it would be. Note that the GPU AP is using from 10% to 25% of a CPU for blanking and stuff. The CPU usage can be as low as 1-2% when there is no blanking.

top - 21:34:15 up 1 day,  9:22,  2 users,  load average: 6.82, 6.76, 6.72
Tasks: 301 total,  14 running, 287 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.5%us,  0.3%sy, 60.0%ni, 38.9%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8172532k total,  3812664k used,  4359868k free,   277944k buffers
Swap: 10256380k total,        0k used, 10256380k free,  1992836k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                                                       
10056 boinc     39  19 43824  39m    4 R 99.8  0.5  66:42.58 ../../projects/setiathome.berkeley.edu/setiathome_7.01_x86_64-pc-linux-gnu                                                                                                    
10107 boinc     39  19 44596  40m    4 R 99.8  0.5  37:39.74 ../../projects/setiathome.berkeley.edu/setiathome_7.01_x86_64-pc-linux-gnu                                                                                                    
10157 boinc     39  19 41712  37m    4 R 99.8  0.5  20:53.31 ../../projects/setiathome.berkeley.edu/setiathome_7.01_x86_64-pc-linux-gnu                                                                                                    
10073 boinc     39  19 43780  39m    4 R 99.7  0.5  57:02.76 ../../projects/setiathome.berkeley.edu/setiathome_7.01_x86_64-pc-linux-gnu                                                                                                    
10089 boinc     39  19 44504  40m    4 R 99.7  0.5  45:40.06 ../../projects/setiathome.berkeley.edu/setiathome_7.01_x86_64-pc-linux-gnu                                                                                                    
10103 boinc     39  19 43160  38m    4 R 97.6  0.5  42:04.72 ../../projects/setiathome.berkeley.edu/setiathome_7.01_x86_64-pc-linux-gnu                                                                                                    
10043 boinc     30  10 36.3g 133m 110m R 25.4  1.7  16:11.24 ../../projects/setiathome.berkeley.edu/ap_6.07r1952_avx_clGPU_x86_64-pc-linux-gnu -unroll 24 -ffa_block 4096 -ffa_block_fetch 4096 -sbs 512 -tune 1 32 32 1 --device 1        
10144 boinc     30  10 36.3g 133m 110m R 20.8  1.7   5:06.94 ../../projects/setiathome.berkeley.edu/ap_6.07r1952_avx_clGPU_x86_64-pc-linux-gnu -unroll 24 -ffa_block 4096 -ffa_block_fetch 4096 -sbs 512 -tune 1 32 32 1 --device 1        
10204 boinc     30  10 36.3g 133m 110m R 17.9  1.7   1:37.52 ../../projects/setiathome.berkeley.edu/ap_6.07r1952_avx_clGPU_x86_64-pc-linux-gnu -unroll 24 -ffa_block 4096 -ffa_block_fetch 4096 -sbs 512 -tune 1 32 32 1 --device 1        
10110 boinc     30  10 36.3g 133m 110m R 15.8  1.7   5:55.96 ../../projects/setiathome.berkeley.edu/ap_6.07r1952_avx_clGPU_x86_64-pc-linux-gnu -unroll 24 -ffa_block 4096 -ffa_block_fetch 4096 -sbs 512 -tune 1 32 32 1 --device 0        
10162 boinc     30  10 36.3g 133m 110m R 15.4  1.7   2:55.17 ../../projects/setiathome.berkeley.edu/ap_6.07r1952_avx_clGPU_x86_64-pc-linux-gnu -unroll 24 -ffa_block 4096 -ffa_block_fetch 4096 -sbs 512 -tune 1 32 32 1 --device 1        
10064 boinc     30  10 36.3g 133m 110m R 13.6  1.7   8:18.23 ../../projects/setiathome.berkeley.edu/ap_6.07r1952_avx_clGPU_x86_64-pc-linux-gnu -unroll 24 -ffa_block 4096 -ffa_block_fetch 4096 -sbs 512 -tune 1 32 32 1 --device 0        
10186 boinc     30  10 36.3g 133m 110m S 10.6  1.7   1:20.49 ../../projects/setiathome.berkeley.edu/ap_6.07r1952_avx_clGPU_x86_64-pc-linux-gnu -unroll 24 -ffa_block 4096 -ffa_block_fetch 4096 -sbs 512 -tune 1 32 32 1 --device 0        
10194 boinc     30  10 36.3g 133m 110m S 10.6  1.7   1:14.56 ../../projects/setiathome.berkeley.edu/ap_6.07r1952_avx_clGPU_x86_64-pc-linux-gnu -unroll 24 -ffa_block 4096 -ffa_block_fetch 4096 -sbs 512 -tune 1 32 32 1 --device 0 
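
The starvation argument above amounts to comparing the load average with the number of logical CPUs. A minimal sketch of that comparison follows; the function names are made up for illustration, and the figure of 12 logical CPUs is an assumption inferred from the "12 or more" remark, not something shown in the top output.

```python
import os


def is_cpu_saturated(load_avg, n_cpus):
    """True if the load average is at or above the logical CPU count."""
    return load_avg >= n_cpus


def cpu_headroom(load_avg, n_cpus):
    """Logical CPUs not accounted for by the load average."""
    return n_cpus - load_avg


def live_headroom():
    """Same check against the running system (Unix only)."""
    n_cpus = os.cpu_count() or 1  # cpu_count() may return None
    return cpu_headroom(os.getloadavg()[0], n_cpus)


# 1-minute load average from the top output above; 12 CPUs is assumed.
print(is_cpu_saturated(6.82, 12))         # -> False
print(round(cpu_headroom(6.82, 12), 2))   # -> 5.18
```

Load average is a coarse measure, so headroom by this test does not rule out a single GPU feeder thread being briefly delayed.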

ID: 1472045
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1472050 - Posted: 2 Feb 2014, 20:04:22 UTC - in response to Message 1472045.  

My advice applies only to Windows hosts.
I'm not sure the Linux port supports the -cpu_lock switch at all. What could help on Linux (including whether freeing a CPU is needed at all on Linux) still needs to be found out.
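
For reference, Linux does have a generic way to restrict a process to one CPU: the sched_setaffinity call, which Python exposes as os.sched_setaffinity. The sketch below is not how -cpu_lock is implemented, the helper name pin_to_cpu is made up, and whether pinning helps AP GPU tasks on Linux at all is untested here.

```python
import os


def pin_to_cpu(pid, cpu):
    """Restrict process pid to one logical CPU (Linux only).

    pid 0 means the calling process. Returns the resulting affinity set.
    """
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    # Pin this script itself to the first CPU it is allowed to use.
    allowed = sorted(os.sched_getaffinity(0))
    print(pin_to_cpu(0, allowed[0]))
```

With several GPU tasks running, each one would get a different CPU number, in line with the advice given for Windows.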
ID: 1472050
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1473277 - Posted: 6 Feb 2014, 13:46:23 UTC
Last modified: 6 Feb 2014, 13:46:45 UTC

Hi

I don't know if this is related to this thread, but today,
after a few days of everything working fine, I got this error:

http://setiathome.berkeley.edu/result.php?resultid=3370196609

I have never seen one like this before on this host and, as always, I have no idea why.
ID: 1473277
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1473279 - Posted: 6 Feb 2014, 14:01:07 UTC - in response to Message 1473277.  

194 (0xc2) EXIT_ABORTED_BY_CLIENT
<message>
finish file present too long
</message>

There is a limited amount of space allocated for the return data file and this limit was reached. However, I am not sure if it is related to the 4 times the task was restarted.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
ID: 1473279
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1473282 - Posted: 6 Feb 2014, 14:19:02 UTC - in response to Message 1473279.  

Thanks, HAL9000. The question is why this is happening on this host, which crunches a lot of other WUs without any error, when nothing was changed on it. And is there something I could do to keep it from happening again?
ID: 1473282
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1473286 - Posted: 6 Feb 2014, 14:21:45 UTC - in response to Message 1473279.  

I think the measure of 'too long' is time, rather than size - there's a different message for 'too large':

#define ERR_FILE_TOO_BIG    -131
    // an output file was bigger than max_nbytes
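
To make the time-versus-size distinction concrete, here is a sketch that is not the BOINC client's actual code. The function name, every parameter name other than max_nbytes, and all the numbers are made up for illustration.

```python
def classify_result(finish_file_age_secs, output_bytes,
                    timeout_secs, max_nbytes):
    """Tell a size-limit failure apart from a time-limit failure.

    Hypothetical helper: the two checks are independent, so a result
    can trip the time limit while its output file is small.
    """
    if output_bytes > max_nbytes:
        return "ERR_FILE_TOO_BIG"              # size-based check
    if finish_file_age_secs > timeout_secs:
        return "finish file present too long"  # time-based check
    return "ok"


# Small output, but the finish file has been around too long:
print(classify_result(finish_file_age_secs=30, output_bytes=1_000,
                      timeout_secs=10, max_nbytes=1_000_000))
```

On this reading, the error in question would point at something delaying the end of the task rather than at the amount of data it produced.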
ID: 1473286
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1473290 - Posted: 6 Feb 2014, 14:33:10 UTC

Thanks, Richard, but I still don't understand why this error is happening; this host crunches a lot of other WUs with no errors. BTW, whether it is time or size makes little difference to us users: we can't control either of them, or can we?
ID: 1473290
Mike
Volunteer tester
Joined: 17 Feb 01
Posts: 34253
Credit: 79,922,639
RAC: 80
Germany
Message 1473291 - Posted: 6 Feb 2014, 14:34:42 UTC - in response to Message 1473290.  
Last modified: 6 Feb 2014, 14:36:13 UTC

Of course you can.
You reduced FFA_block/fetch again.

5500 seconds for 7% blanking is too long for a 780.


With each crime and every kindness we birth our future.
ID: 1473291
juan BFP
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1473293 - Posted: 6 Feb 2014, 14:43:40 UTC - in response to Message 1473291.  
Last modified: 6 Feb 2014, 14:58:18 UTC

But it's running 3 WUs at a time, so it takes about 1 1/4 hours to crunch an AP WU if nothing stops the crunching. The time for this WU is really weird; normally it should take about 3500. But what caused that? And why not on the other WUs crunched on this host, which were crunched with the same reduced settings too?

I reduced FFA_block/fetch to avoid video lag. Remember, I use slow i5 CPUs to feed the fast GPUs; with the bigger numbers the video lag was present, and the new reduced numbers are working fine on all my 780s. This is the only WU with an error in days.
ID: 1473293
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1473295 - Posted: 6 Feb 2014, 14:51:40 UTC - in response to Message 1473286.  

That does make more sense.
Note to self: Stop trying to think before coffee.
ID: 1473295

 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.