WU's timing out........

Profile AllenIN
Volunteer tester
Joined: 5 Dec 00
Posts: 292
Credit: 58,297,005
RAC: 311
United States
Message 1771772 - Posted: 15 Mar 2016, 22:52:08 UTC

I've been having a lot of WUs that run out of time before they ever get run. I don't remember this being a problem in the past.

Is anyone else having this problem?

If not, is there anything I can do to fix it? I've tried suspending new work until the older units are through running, but this is very time consuming, and sometimes I don't get back in time to resume them, so nothing gets done in the meantime.

Thanks for any help you can offer.

Allen
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1771784 - Posted: 16 Mar 2016, 0:44:30 UTC
Last modified: 16 Mar 2016, 0:49:48 UTC

With the limit of 100 tasks per CPU/GPU & a deadline of about 6 weeks, how many hours a day do you let your machines run tasks?

The only things I can think of are:
1) Adjust your queue size.
2) Stop fiddling with things & let BOINC handle running tasks instead of trying to micromanage what it does. BOINC will run tasks that are near the deadline sooner if it thinks there is a chance of missing them.

EDIT: A single core CPU machine that takes around 10 hours to complete a task would be able to complete a queue of 100 tasks before the deadline if it ran 24/7.
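As a rough back-of-envelope check of that claim (a minimal sketch in Python; the 10-hour task time and 42-day deadline are illustrative assumptions rather than project guarantees):

# Back-of-envelope: can a single-core host clear a 100-task queue
# inside a ~6-week deadline if it crunches 24/7?
tasks_in_queue = 100       # server-side limit per CPU discussed above
hours_per_task = 10.0      # assumed average crunch time on one core
deadline_days = 42         # "about 6 weeks"

total_hours = tasks_in_queue * hours_per_task   # 1000 hours of work
days_needed = total_hours / 24.0                # ~41.7 days running 24/7

print(f"Queued work: {total_hours:.0f} h, i.e. ~{days_needed:.1f} days at 24/7")
print("Fits inside the deadline" if days_needed <= deadline_days else "Will miss the deadline")

It only just fits, which is why any downtime, or a second project competing for the same core, can tip such a queue past its deadlines.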
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1771790 - Posted: 16 Mar 2016, 1:33:04 UTC
Last modified: 16 Mar 2016, 1:38:41 UTC

If you are still actively crunching Seti Beta as stock, there have been some hints that there *might* be some problems with multiple-project resource share management, still to be further investigated [possibly further complicated by Einstein's fixed credit]. If so, selecting among your hosts to dedicate to given projects (setting No New Tasks on the others) *might* let things recover.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
rob smith (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22186
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1771835 - Posted: 16 Mar 2016, 6:07:04 UTC

Which of your computers is having this problem?
One of them ( http://setiathome.berkeley.edu/results.php?hostid=6335328 ) has a single GPU and over two hundred tasks, which means either it has a lot of "ghosts", or you've been playing with re-scheduling, either of which can give this problem.
First thing to try on that computer is to "detach", wait a couple of minutes, then "re-attach" to the project. This will help to kill off ghosts, and leave it with the "real" task list.
Secondly, as others have said, don't micro-manage; BOINC does very well when left to do its own thing, but tends to make a mess when we fiddle.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1771850 - Posted: 16 Mar 2016, 8:46:04 UTC - in response to Message 1771790.  

If you are still actively crunching Seti Beta as stock, there have been some hints that there *might* be some problems with multiple-project resource share management, still to be further investigated [possibly further complicated by Einstein's fixed credit]. If so, selecting among your hosts to dedicate to given projects (setting No New Tasks on the others) *might* let things recover.

No, *actual* credit (fixed or otherwise) has no effect on resource share - only the simplified REC is used.

But Einstein has much shorter deadlines than here, and that can play badly with very large cache settings. Although, since he doesn't seem to have actually crunched for Einstein for over 2 months (last contact 10 January), that's unlikely to be the cause currently.
Profile AllenIN
Volunteer tester
Joined: 5 Dec 00
Posts: 292
Credit: 58,297,005
RAC: 311
United States
Message 1771851 - Posted: 16 Mar 2016, 9:07:41 UTC - in response to Message 1771784.  

With the limit of 100 tasks per CPU/GPU & a deadline of about 6 weeks, how many hours a day do you let your machines run tasks?

The only things I can think of are:
1) Adjust your queue size.
2) Stop fiddling with things & let BOINC handle running tasks instead of trying to micromanage what it does. BOINC will run tasks that are near the deadline sooner if it thinks there is a chance of missing them.

EDIT: A single core CPU machine that takes around 10 hours to complete a task would be able to complete a queue of 100 tasks before the deadline if it ran 24/7.


I run 24/7 on all of my machines, and I didn't start "playing around" until Boinc was unable to finish all of the tasks it had at its command. As I said, I never had this problem before the change of versions.
Thanks for the input though.

Allen
Profile AllenIN
Volunteer tester
Joined: 5 Dec 00
Posts: 292
Credit: 58,297,005
RAC: 311
United States
Message 1771852 - Posted: 16 Mar 2016, 9:12:02 UTC - in response to Message 1771835.  

Which of your computers is having this problem?
One of them ( http://setiathome.berkeley.edu/results.php?hostid=6335328 ) has a single GPU and over two hundred tasks, which means either it has a lot of "ghosts", or you've been playing with re-scheduling, either of which can give this problem.
First thing to try on that computer is to "detach", wait a couple of minutes, then "re-attach" to the project. This will help to kill off ghosts, and leave it with the "real" task list.
Secondly, as others have said, don't micro-manage; BOINC does very well when left to do its own thing, but tends to make a mess when we fiddle.


As I said in the previous note, I didn't try to micro-manage until Boinc was unable to monitor things correctly and run the older WUs first. I think it pays too much attention to which units have been completed by other people rather than just doing them by their due dates. I didn't just jump in and start making changes; it has been failing to get the job done for some time now. As for the single-CPU machine, even the tablet has more than one core and the real systems have at least two.

Thanks,

Allen
Profile AllenIN
Volunteer tester
Joined: 5 Dec 00
Posts: 292
Credit: 58,297,005
RAC: 311
United States
Message 1771853 - Posted: 16 Mar 2016, 9:16:41 UTC - in response to Message 1771850.  

If you are still actively crunching Seti Beta as stock, there have been some hints that there *might* be some problems with multiple-project resource share management, still to be further investigated [possibly further complicated by Einstein's fixed credit]. If so, selecting among your hosts to dedicate to given projects (setting No New Tasks on the others) *might* let things recover.

No, *actual* credit (fixed or otherwise) has no effect on resource share - only the simplified REC is used.

But Einstein has much shorter deadlines than here, and that can play badly with very large cache settings. Although, since he doesn't seem to have actually crunched for Einstein for over 2 months (last contact 10 January), that's unlikely to be the cause currently.


Thanks Richard. As you stated, no Einstein for quite a while. I can obviously change the size of my cache, but I thought Boinc was supposed to figure out how many WUs you need based on how long it takes you to run WUs, so I've just been leaving it up to Boinc to decide.

Thanks again, I just can't figure out what is wrong. The only real change that I have made since v7 is that I upgraded all of my machines to the latest Boinc version. Maybe that's the problem.

Allen
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1771855 - Posted: 16 Mar 2016, 9:33:43 UTC - in response to Message 1771853.  

The only real change that I have made since v7 is that I upgraded all of my machines to the latest Boinc version. Maybe that's the problem.

"The latest"? Could we have an actual number on that, please? (Yes, I know I can look it up - v7.6.22 - but you might have meant you're testing v7.6.29). In general, it's better to use absolute data than relative terms which might get overtaken by events.

Speaking of which, could you post the actual numbers you're using for cache size settings - both of them, please. I personally didn't have any problem with BOINC v7.6.22 maintaining the cache sizes I requested, but that tends to be an absolute maximum of 1 day, more often 0.4 or 0.5 days.
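For anyone following along, those two settings are the preferences usually labelled "Store at least N days of work" and "Store up to an additional N days of work"; roughly speaking, the v7 client treats their sum as its target buffer. A minimal sketch with made-up values:

# Rough illustration of how the two BOINC cache preferences combine.
# Values are hypothetical; this simplifies the client's actual work-fetch logic.
store_at_least_days = 0.5       # "Store at least ... days of work"
store_additional_days = 0.5     # "Store up to an additional ... days of work"

target_buffer_days = store_at_least_days + store_additional_days
print(f"Target buffer: ~{target_buffer_days} days of estimated work")
# A buffer that is a large fraction of the shortest deadline in the mix
# leaves little room for bad runtime estimates or host downtime.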
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1771857 - Posted: 16 Mar 2016, 9:37:48 UTC - in response to Message 1771853.  
Last modified: 16 Mar 2016, 9:39:33 UTC

I am seeing substantial 'cleanup' of work_fetch over the last year, so quite likely behaviour changed in there along the way.

[Edit:] interested for the version and setting numbers too.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1771857 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1771858 - Posted: 16 Mar 2016, 10:03:21 UTC - in response to Message 1771857.  
Last modified: 16 Mar 2016, 10:26:17 UTC

I am seeing substantial 'cleanup' of work_fetch over the last year, so quite likely behaviour changed in there along the way.

[Edit:] interested for the version and setting numbers too.

I don't see much 'substance' in https://github.com/BOINC/boinc/commits/master/client/work_fetch.cpp - mostly initial connection fetches and a backup project tweak. Nothing like the reversal of function between the two setting values, which took place with the initial changeover from v6 to v7 in 2012, and caught people out for a while afterwards.

Edit - it's incredibly difficult to diagnose work fetch problems retrospectively, but v7.6.22 does make it incredibly easy (Ctrl+Shift+F) to set additional Event log flags to help. I always set <cpu_sched> and <sched_op_debug>, which don't cause too much bloat in the log, but do show

16/03/2016 10:04:00 | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
16/03/2016 10:04:00 | SETI@home | [sched_op] NVIDIA GPU work request: 1900.67 seconds; 0.00 devices
16/03/2016 10:04:00 | SETI@home | [sched_op] Intel GPU work request: 0.00 seconds; 0.00 devices
16/03/2016 10:04:03 | SETI@home | Scheduler request completed: got 4 new tasks
16/03/2016 10:04:03 | SETI@home | [sched_op] estimated total CPU task duration: 0 seconds
16/03/2016 10:04:03 | SETI@home | [sched_op] estimated total NVIDIA GPU task duration: 3165 seconds
16/03/2016 10:04:03 | SETI@home | [sched_op] estimated total Intel GPU task duration: 0 seconds

which shows things behaving as expected.
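As a very rough sketch of where a per-resource request figure like the 1900 seconds above comes from (a simplification; the real client tracks per-instance shortfalls, resource shares and backoffs, and the function below is purely illustrative):

# Simplified view of a per-resource work request, in the spirit of the
# [sched_op] lines above. Numbers are made up for illustration.
def work_request_seconds(target_buffer_days, estimated_queued_seconds, n_instances=1):
    """Request roughly enough work to top the queue back up to the target buffer."""
    target_seconds = target_buffer_days * 24 * 3600 * n_instances
    return max(0.0, target_seconds - estimated_queued_seconds)

# A GPU whose queued tasks are estimated just under a 0.25-day buffer target:
req = work_request_seconds(target_buffer_days=0.25, estimated_queued_seconds=19700.0)
print(f"NVIDIA GPU work request: {req:.2f} seconds")   # 1900.00 in this made-up case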
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1771861 - Posted: 16 Mar 2016, 10:23:48 UTC - in response to Message 1771858.  

I am seeing substantial 'cleanup' of work_fetch over the last year, so quite likely behaviour changed in there along the way.

[Edit:] interested for the version and setting numbers too.

I don't see much 'substance' in https://github.com/BOINC/boinc/commits/master/client/work_fetch.cpp - mostly initial connection fetches and a backup project tweak. Nothing like the reversal of function between the two setting values, which took place with the initial changeover from v6 to v7 in 2012, and caught people out for a while afterwards.


Saw those tweaks and don't think those are it either. One thing that has caught my eye in the logs, on that file and others, is the number of shotgun safe function replacements. Some from Rom, and some probably coming from Christian's work. These can conceivably affect RPC scheduler request reliability (through thread-safety), and subtle behaviour changes can add up.

Paraphrasing my own response when Christian and David queried the importance/priority of the whole Coverity vulnerability scan cleanup, having ruggedised much code (Boinc/Seti and other) myself before: "When shotgunning in safe functions, you can expose weaknesses in other mechanisms".

Short version to me amounts to "more reliable work fetches + large caches, probably pushes the client estimate mechanisms [again], against deadlines"
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1771861 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1771863 - Posted: 16 Mar 2016, 10:31:43 UTC - in response to Message 1771861.  

That shotgun code cleanup did indeed completely destroy my client_state.xml file, but those changes came long after v7.6.22, so they won't be in play here. I only got caught because I made a self-build to investigate a separate bug (which doesn't show up in the online client simulator) - they've been fixed for v7.6.29.
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1771867 - Posted: 16 Mar 2016, 10:48:29 UTC - in response to Message 1771863.  
Last modified: 16 Mar 2016, 10:57:55 UTC

Still working back that far :). [Edit: I'm actually seeing large swathes of the changes prior to 7.6.18.]

There's also the server side to consider. Wacky or unconverged APRs could trigger over-issue on those hosts if they are fetching.

Could see if the OP's APRs seem a bit high compared to equivalent hardware, and whether they are dropping or static.

[On Albert, at steady state with equal work tasks, we saw credit do a high downward (loose) convergence and a low upward (loose) convergence, making a funnel shape. The low values are caused mostly by high-APR stock single-instance results in the quorum, while the high credit 'booms' are generated by low-APR, low-efficiency, pile-on-as-many-GPU-instances-as-possible hosts, as Claggy was able to reproduce IIRC.]
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1771867 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1771869 - Posted: 16 Mar 2016, 11:06:56 UTC - in response to Message 1771867.  

Allen is using Anonymous Platform on all hosts (except 5494870, which doesn't have a GPU and hence probably doesn't come into this discussion).

With Anon. Plat., APR doesn't get considered (directly) client-side in estimating cache contents, and hence deciding when, and how much, work to fetch. Any FLOPs value used in estimation would have to be set manually by the user in app_info (unlike the stock case, where APR is transmitted without translation as <flops> and incorporated into <app_version>).
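A minimal sketch of the estimation arithmetic being described here (the element names <rsc_fpops_est> and <flops> are the ones discussed in this thread; the surrounding code and all numbers are illustrative, not BOINC source):

# Runtime estimates come from task size divided by a speed figure:
#   estimated seconds ~= rsc_fpops_est / flops
# Under Anonymous Platform the flops value is whatever the user put in
# app_info.xml; for stock apps the server derives it from APR.
rsc_fpops_est = 3.0e13         # task size in FLOPs (hypothetical)
user_flops_app_info = 2.0e10   # hand-set <flops> in app_info.xml (anon platform)
apr_based_flops = 1.5e10       # <flops> the server would send for a stock app

print(f"Anon platform estimate: {rsc_fpops_est / user_flops_app_info / 60:.1f} min")
print(f"Stock estimate:         {rsc_fpops_est / apr_based_flops / 60:.1f} min")
# An optimistic hand-set <flops> shortens every estimate, so the client thinks
# its cache holds less work than it really does and fetches more than it can finish.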

In case you missed it, I edited in some comments about event logging to a previous answer - that helps to monitor client behaviour, and verify that the client's interpretation of server responses is sane - or not, as the case may be.
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1771870 - Posted: 16 Mar 2016, 11:07:29 UTC - in response to Message 1771863.  

Noticed something on one of the hosts with timed-out CPU tasks.
http://setiathome.berkeley.edu/show_host_detail.php?hostid=7795393

That's important because I see an nVidia GT 730 with a high OpenCL AP app APR.
That's going to be sucking up a CPU core when AP is available. Do the stock and installer-issued fractions allow for that?
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1771870 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1771871 - Posted: 16 Mar 2016, 11:12:18 UTC - in response to Message 1771869.  

In case you missed it, I edited in some comments about event logging to a previous answer - that helps to monitor client behaviour, and verify that the client's interpretation of server responses is sane - or not, as the case may be.


Cheers,
Edit - it's incredibly difficult to diagnose work fetch problems retrospectively, but v7.6.22 does make it incredibly easy (Ctrl+Shift+F) to set additional Event log flags to help. I always set <cpu_sched> and <sched_op_debug>, which don't cause too much bloat in the log, but do show...

Will come in handy.


With Anon. Plat., APR doesn't get considered (directly) client-side...
nah, in the scheduler reply.

Looking at some of the other hosts to confirm/refute if it looks like CPU pressure from the OpenCL app, or not.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1771871 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1771872 - Posted: 16 Mar 2016, 11:18:59 UTC - in response to Message 1771870.  

That's going to be sucking up a CPU core when AP is available. Do the stock and installer-issued fractions allow for that?

That's a fair question. I did ask that question before deployment - the technical term in play here is <avg_ncpus> - and got this reply from the developer:

All but ncpus should be fixed.
Stock goes with CPUlock and CPUlock mode doesn't require full CPU (though app could benefit from it). Entire host tuning/balancing on user anyway for now.

(my emphasis)

That machine is a quad-core: I'd be extremely surprised if there'd been enough AP work between January and March to block MB tasks completely for seven weeks, despite BOINC's EDF protection.
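A quick worked check of that reasoning (a head-count only, not BOINC's actual scheduler; the one-full-core reservation is an assumption for the worst case):

# Does one OpenCL AP task tying up a full CPU core starve CPU MB work
# on a quad-core host?
ncpus = 4                        # the quad-core host in question
gpu_task_cpu_reservation = 1.0   # assume AP's support thread occupies one full core

cores_left_for_cpu_tasks = ncpus - gpu_task_cpu_reservation
print(f"Cores still free for CPU MB tasks: {cores_left_for_cpu_tasks:.0f}")
# Even in this worst case three cores remain for CPU work, so AP contention
# alone is unlikely to explain seven weeks of missed CPU deadlines.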
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1771873 - Posted: 16 Mar 2016, 11:22:49 UTC - in response to Message 1771871.  

With Anon. Plat., APR doesn't get considered (directly) client-side...
nah, in the scheduler reply.

Only indirectly, by modifying the task size (<rsc_fpops_est>) so the time estimate multiplies out correctly on the CPU benchmark assumption.

That, of course, could cause major problems if rebranding between CPU and GPU was added to the mix.
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1771874 - Posted: 16 Mar 2016, 11:33:17 UTC - in response to Message 1771873.  
Last modified: 16 Mar 2016, 11:33:39 UTC

I'm 'seeing' two patterns on the OP's hosts (whether coincidental or indicative I'm not sure)

In the case of the NV hosts, the timed out ones appear to be CPU tasks, while on the ATI host(s) it's the ATI GPU tasks, many dumped today.

We know about the NV OpenCL app CPU behaviour, possibly a thing on those that freeing cores could help. But is there anything known for ATI GPU, such as the app priority or other settings at low defaults making them susceptible to be pushed aside by CPU apps ?
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1771874 · Report as offensive