Message boards :
Number crunching :
WUs timing out...
Author | Message |
---|---|
AllenIN Send message Joined: 5 Dec 00 Posts: 292 Credit: 58,297,005 RAC: 311 |
I've been having a lot of WUs that run out of time before they are run. I don't remember this being a problem in the past. Is anyone else having this problem? If not, is there anything I can do to fix it? I've tried suspending work until the older ones are through running, but this is very time-consuming, and sometimes I don't get back in time to resume them, so nothing gets done during that time. Thanks for any help you can offer. Allen |
HAL9000 Send message Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
With the limits of 100 tasks per CPU/GPU and a deadline of about 6 weeks, how many hours a day do you let your machines run tasks? The only things I can think of are: 1) Adjust your queue size. 2) Stop fiddling with things and let BOINC handle running tasks instead of trying to micromanage what it does. BOINC will run tasks that are near the deadline sooner if it thinks there is a chance of missing them. EDIT: A single-core CPU machine that takes around 10 hours to complete a task would be able to complete a queue of 100 tasks before the deadline if it ran 24/7. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url] |
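The back-of-envelope arithmetic in that EDIT can be sketched as follows (a sketch using the round numbers from the post, not actual BOINC scheduler logic):

```python
# Deadline feasibility check using the round numbers quoted above.
hours_per_task = 10       # single-core CPU time per task
queue_limit = 100         # server-side limit of 100 tasks per CPU
deadline_days = 6 * 7     # "about 6 weeks"

queue_hours = hours_per_task * queue_limit   # 1000 hours of queued work
deadline_hours = deadline_days * 24          # 1008 hours available at 24/7

# A full queue just barely fits inside the deadline when running 24/7.
print(queue_hours, deadline_hours, queue_hours <= deadline_hours)
```

With anything less than round-the-clock running, or a slower host, a full 100-task queue no longer fits, which is why queue size is the first thing to adjust.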
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
If you are still actively crunching Seti Beta as stock there, there have been some hints that there *might* be some problems with multiple-project resource share management, to be further investigated [possibly further complicated by Einstein's fixed credit]. If so, selecting among your hosts to dedicate to given projects (setting "no new tasks" on the others) *might* let things recover. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions. |
rob smith Send message Joined: 7 Mar 03 Posts: 22186 Credit: 416,307,556 RAC: 380 |
Which of your computers is having this problem? One of them ( http://setiathome.berkeley.edu/results.php?hostid=6335328 ) has a single GPU and over two hundred tasks, which means either it has a lot of "ghosts" or you've been playing with re-scheduling; either of those can cause this problem. First thing to try on that computer: "detach", wait a couple of minutes, then "re-attach" to the project. This will help to kill off ghosts and rebuild the "real" task list. Secondly, as others have said, don't micro-manage; BOINC does very well when left to do its own thing, but tends to make a mess when we fiddle. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
If you are still actively crunching Seti Beta as stock there, there have been some hints there *Might* be some problems with multiple project resource share management, to be further investigated. [possibly further complicated by Einstein fixed credit]. If so, selecting among your hosts to dedicate to given projects (setting no new tasks on the others) *might* let things recover. No, *actual* credit (fixed or otherwise) has no effect on resource share - only the simplified REC is used. But Einstein has much shorter deadlines than here, and that can play badly with very large cache settings. Although, since he doesn't seem to have actually crunched for Einstein for over 2 months (last contact 10 January), that's unlikely to be the cause currently. |
AllenIN Send message Joined: 5 Dec 00 Posts: 292 Credit: 58,297,005 RAC: 311 |
With the limits of 100 tasks per CPU/GPU & a deadline of about 6 weeks how many hours a day do you let your machines run tasks? I run 24/7 on all of my machines, and I didn't start "playing around" until BOINC was unable to finish all of the tasks it had at its command. As I said, I never had this problem before the change of versions. Thanks for the input though. Allen |
AllenIN Send message Joined: 5 Dec 00 Posts: 292 Credit: 58,297,005 RAC: 311 |
Which of your computers is having this problem? As I said in the previous note, I didn't try to micro-manage until BOINC was unable to monitor things correctly and run the older WUs first. I think it pays too much attention to which units are completed by other people rather than just doing them by their due dates. I didn't just jump in and start making changes; it has been failing to get the job done for some time now. As for the single-CPU machine, even the tablet has more than one core, and the real systems have at least two. Thanks, Allen |
AllenIN Send message Joined: 5 Dec 00 Posts: 292 Credit: 58,297,005 RAC: 311 |
If you are still actively crunching Seti Beta as stock there, there have been some hints there *Might* be some problems with multiple project resource share management, to be further investigated. [possibly further complicated by Einstein fixed credit]. If so, selecting among your hosts to dedicate to given projects (setting no new tasks on the others) *might* let things recover. Thanks Richard. As you stated, no Einstein for quite a while. I can obviously change the size of my cache, but I thought BOINC was supposed to figure out how many WUs you need based on how long it takes you to run them, so I've just been leaving it up to BOINC to decide. Thanks again; I just can't figure out what is wrong. The only real change that I have made since v7 is that I upgraded all of my machines to the latest BOINC version. Maybe that's the problem. Allen |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
The only real change that I have made since v7, is that I upgraded all of my machines to the latest Boinc version. Maybe that's the problem. "The latest"? Could we have an actual number on that, please? (Yes, I know I can look it up - v7.6.22 - but you might have meant you're testing v7.6.29). In general, it's better to use absolute data than relative terms which might get overtaken by events. Speaking of which, could you post the actual numbers you're using for cache size settings - both of them, please. I personally didn't have any problem with BOINC v7.6.22 maintaining the cache sizes I requested, but that tends to be an absolute maximum of 1 day, more often 0.4 or 0.5 days. |
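For reference, the two cache settings being asked about here correspond to these elements in BOINC's global_prefs_override.xml (the values shown are illustrative, not recommendations):

```xml
<!-- global_prefs_override.xml fragment: the two work-cache settings.
     Values are illustrative only. -->
<global_preferences>
    <!-- "Store at least N days of work" -->
    <work_buf_min_days>0.5</work_buf_min_days>
    <!-- "Store up to an additional N days of work" -->
    <work_buf_additional_days>0.25</work_buf_additional_days>
</global_preferences>
```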
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
I am seeing substantial 'cleanup' of work_fetch over the last year, so quite likely behaviour changed in there along the way. [Edit:] Interested in the version and setting numbers too. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
I am seeing substantial 'cleanup' of work_fetch over the last year, so quite likely behaviour changed in there along the way. I don't see much 'substance' in https://github.com/BOINC/boinc/commits/master/client/work_fetch.cpp - mostly initial-connection fetches and a backup-project tweak. Nothing like the reversal of function between the two setting values, which took place with the initial changeover from v6 to v7 in 2012, and caught people out for a while afterwards. Edit - it's incredibly difficult to diagnose work-fetch problems retrospectively, but v7.6.22 does make it incredibly easy (Ctrl+Shift+F) to set additional Event Log flags to help. I always set <cpu_sched> and <sched_op_debug>, which don't cause too much bloat in the log, but do show 16/03/2016 10:04:00 | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices which shows things behaving as expected. |
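The two Event Log flags mentioned above live in cc_config.xml's log_flags section:

```xml
<!-- cc_config.xml fragment enabling the two log flags discussed above. -->
<cc_config>
    <log_flags>
        <cpu_sched>1</cpu_sched>
        <sched_op_debug>1</sched_op_debug>
    </log_flags>
</cc_config>
```

BOINC re-reads this file on "Options > Read config files", so the flags can be toggled without restarting the client.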
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
I am seeing substantial 'cleanup' of work_fetch over the last year, so quite likely behaviour changed in there along the way. Saw those tweaks and don't think those are it either. One thing that has caught my eye in the logs, on that file and others, is the number of shotgunned safe-function replacements - some from Rom, and some probably coming from Christian's work. These can conceivably affect RPC scheduler-request reliability (through thread-safety), and subtle behaviour changes can add up. Paraphrasing my own response when Christian and David queried the importance/priority of the whole Coverity vulnerability-scan cleanup, having ruggedised much code (BOINC/Seti and other) myself before: "When shotgunning in safe functions, you can expose weaknesses in other mechanisms". The short version, to me, amounts to: "more reliable work fetches + large caches probably pushes the client estimate mechanisms [again] against deadlines". |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
That shotgun code cleanup did indeed completely destroy my client_state.xml file, but those changes came long after v7.6.22, so they won't be in play here. I only got caught because I made a self-build to investigate a separate bug (which doesn't show up in the online client simulator) - they've been fixed for v7.6.29. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Still working back that far :). [Edit: I'm actually seeing large swathes of the changes prior to 7.6.18] There's also the server side to consider. Wacky or unconverged APRs could trigger over-issue on those hosts if they are fetching. Could see if the OP's APRs seem a bit high compared to equivalent hardware, and whether they are dropping or static. [On Albert, at steady-state equal-work tasks, we saw credit do a high downward (loose) convergence and a low upward (loose) convergence, making a funnel shape. The low values are caused mostly by high-APR stock single-instance results in the quorum, while the high-credit 'booms' are generated by low-APR, low-efficiency, pile-on-as-many-GPU-instances-as-possible hosts, as Claggy was able to reproduce IIRC.] |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Allen is using Anonymous Platform on all hosts (except 5494870, which doesn't have a GPU and hence probably doesn't come into this discussion). With Anon. Plat., APR doesn't get considered (directly) client-side in estimating cache contents, and hence deciding when, and how much, work to fetch. Any FLOPs value used in estimation would have to be set manually by the user in app_info (unlike the stock case, where APR is transmitted without translation as <flops> and incorporated into <app_version>). In case you missed it, I edited in some comments about event logging to a previous answer - that helps to monitor client behaviour, and verify that the client's interpretation of server responses is sane - or not, as the case may be. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Noticed something on one of the hosts with timed-out CPU tasks: http://setiathome.berkeley.edu/show_host_detail.php?hostid=7795393 That's important because I see an nVidia GT 730 with a high OpenCL AP app APR. That's going to be sucking up a CPU core when AP is available. Do the stock and installer-issued fractions allow for that? |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
In case you missed it, I edited in some comments about event logging to a previous answer - that helps to monitor client behaviour, and verify that the client's interpretation of server responses is sane - or not, as the case may be. Cheers. Edit - it's incredibly difficult to diagnose work-fetch problems retrospectively, but v7.6.22 does make it incredibly easy (Ctrl+Shift+F) to set additional Event Log flags to help. I always set <cpu_sched> and <sched_op_debug>, which don't cause too much bloat in the log, but do show... Will come in handy. With Anon. Plat., APR doesn't get considered (directly) client-side... nah, it's in the scheduler reply. Looking at some of the other hosts to confirm/refute whether it looks like CPU pressure from the OpenCL app, or not. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
That's going to be sucking up a CPU Core when AP is available. do the stock and installer issued fractions allow for that ? That's a fair question. I did ask that question before deployment - the technical term in play here is <avg_ncpus> - and got this reply from the developer: All but ncpus should be fixed. (my emphasis) That machine is a quad-core: I'd be extremely surprised if there'd been enough AP work between January and March to block MB tasks completely for seven weeks, despite BOINC's EDF protection. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
With Anon. Plat., APR doesn't get considered (directly) client-side...nah, in the scheduler reply. Only indirectly, by modifying the task size (<rsc_fpops_est>) so the time estimate multiplies out correctly on the CPU benchmark assumption. That, of course, could cause major problems if rebranding between CPU and GPU was added to the mix. |
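The estimate arithmetic Richard describes can be sketched as follows (the numbers are illustrative; under anonymous platform, the flops value would be whatever the user set in app_info, falling back to the CPU benchmark):

```python
# Simplified client runtime estimate: task size divided by device speed.
rsc_fpops_est = 3.0e13   # task size in floating-point operations (illustrative)
flops = 3.5e9            # device speed estimate, ops/second (illustrative)

estimated_seconds = rsc_fpops_est / flops
print(round(estimated_seconds))   # 8571 seconds, roughly 2.4 hours
```

This is why rebranding a task between CPU and GPU is dangerous: the server scaled `rsc_fpops_est` for one device's speed, so dividing by the other device's flops yields a badly wrong estimate, and the cache fills accordingly.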
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
I'm 'seeing' two patterns on the OP's hosts (whether coincidental or indicative, I'm not sure). In the case of the NV hosts, the timed-out ones appear to be CPU tasks, while on the ATI host(s) it's the ATI GPU tasks, many dumped today. We know about the NV OpenCL app's CPU behaviour; possibly a thing on those hosts where freeing cores could help. But is there anything known for the ATI GPU, such as the app priority or other settings at low defaults making them susceptible to being pushed aside by CPU apps? |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.