Message boards : Number crunching : About Deadlines or Database reduction proposals
Grant (SSSF) Joined: 19 Aug 99 Posts: 13835 Credit: 208,696,464 RAC: 304
"Does this mean ... it is NOT the slow return of tasks that is causing our problem now?"
It never was. It is exacerbating the problem, but it is not the cause. This post from earlier in this very thread points that out.
Grant Darwin NT
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799
"Does this mean ... it is NOT the slow return of tasks that is causing our problem now?"
"It never was."
Agree. This is a common mistake: slow or fast hosts are not the cause of the problem. The problem is the size of the DB, because it no longer fits in the server's RAM. That is why we suggest a few ways to squeeze the DB size. Reducing the deadline of the WUs, reducing the WU limits, etc. are some of the suggestions to achieve that.
rob smith Joined: 7 Mar 03 Posts: 22440 Credit: 416,307,556 RAC: 380
Given the sheer number of tasks out in the field, and so sitting in the "day-file", it is going to take some time to get things back under control :-(
But if you think back to November, when the allowance was pushed up, and how rapidly the queue sizes responded, that would suggest that dropping allowances should be the first action. And it would work, provided everyone lived within the lower limits and nobody further inflated their number of GPUs to keep their caches at their current inflated levels.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
Richard Haselgrove Joined: 4 Jul 99 Posts: 14673 Credit: 200,643,578 RAC: 874
"Given the sheer number of tasks out in the field, and so sitting in the "day-file", it is going to take some time to get things back under control :-("
I think that horse has bolted....
In my opinion, there's a difference between:
'caching' - having enough tasks to work through routine outages, plus the server restart delay, plus the initial rush of reporting and refills by 'normal' users, and
'bunkering' - deliberately hoarding work, above and beyond current needs, for competition purposes.
IMHO, bunkering is asking project managers (not just SETI) to supply extra work beyond scientific research needs - that's the tail wagging the dog. We volunteer to help the projects, not the other way round.
But I'm OK with caching - once we're through this bumpy patch - provided it's kept to the minimum level necessary to meet the objective stated above.
W-K 666 Joined: 18 May 99 Posts: 19310 Credit: 40,757,560 RAC: 67
@ W-K 666
Still doesn't answer why there are 4 million workunits in the "waiting for validation" row, when there used to be a very small number - 108 in the example given, on 15th Nov 2019.
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
It was demonstrated a while back that the best cache size is one equal to the host's daily production. If you lower the cache on a high-production machine, the number of pending tasks goes up significantly. This is easy to see by comparing two hosts with similar production but different cache sizes. This one has a smaller cache and a large number of pendings, https://setiathome.berkeley.edu/results.php?hostid=8837101, and this one a larger cache with a smaller number of pendings, https://setiathome.berkeley.edu/results.php?hostid=8873167. This is not a theory; long-term observation has verified it. There isn't any significant change in the impact on the database between a large and a small cache until the cache size exceeds the daily production. That's why I keep recommending changing the cache size to one day - it's the best size for the database.
The problems didn't start until the cache size was increased on all those machines with a low daily production; before that there wasn't a problem with some machines having a cache size closer to their daily production. The way I remember it, the server "upgrade" happened on Dec 20th; looking at the server page from Dec 21st it's clear the problem was well underway before then, https://web.archive.org/web/20191221230917/https://setiathome.berkeley.edu/show_server_status.php. As far as I know, the only things that changed before Dec 20th were sending more tasks to all hosts, even those who didn't need them, and the extra AMD/ATI quorum; there wasn't a problem until then. Whatever the problem is, it isn't caused by the hosts that have less than a day's worth of cache.
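A minimal sketch of the "cache equals daily production" rule of thumb described above. The production rate and cache setting below are made-up example figures, not measurements from the thread:

```python
# Rough illustration of the "cache = daily production" rule of thumb.
# All values are hypothetical examples, not figures from the thread.

tasks_per_day = 2000        # host's observed daily production (example value)
cache_setting_days = 1.0    # "store at least N days of work" setting

# Tasks the client would try to hold for this setting, ignoring project-side caps:
tasks_held = cache_setting_days * tasks_per_day
print(f"Cache target at {cache_setting_days} day(s): {tasks_held:.0f} tasks")

# The point being made: once the cache exceeds one day's production, the extra
# tasks only sit in the database longer before being reported, so ~1 day is the
# sweet spot from the database's perspective.
```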
Richard Haselgrove Joined: 4 Jul 99 Posts: 14673 Credit: 200,643,578 RAC: 874
Which is a long-winded way of saying that the best cache size is the same as your wingmate's: then neither of you has to wait for the other. If the community as a whole agrees to settle on 1 day, that's fine by me.
Richard Haselgrove Joined: 4 Jul 99 Posts: 14673 Credit: 200,643,578 RAC: 874
"The problems didn't start until the cache size was increased ..."
Which is a different statement. Cache size (by time) is under the control of the user: the number of tasks in progress is capped by the project. If the database size spiked (and it did) when the project cap was lifted, then many users had set a cache time longer than the project cap allowed - which was pointless, and simply resulted in more frequent requests for (at most) 1 task.
My personal preference is for a cache of 0.75 days + 0.05 days extra, for slower GPUs like my 1050 Tis. That results in a cache of about 80 per card, topped up by one request an hour.
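A quick back-of-envelope check of those figures. The 0.75 + 0.05 day setting and the ~80 tasks per card come from the post above; the implied per-task runtime is derived arithmetic, not a measured value:

```python
# Back-calculating the per-task runtime implied by the figures quoted above.

cache_days = 0.75 + 0.05    # "store at least" + "store additional" settings
tasks_per_card = 80         # approximate cache per 1050 Ti quoted in the post

minutes_per_task = cache_days * 24 * 60 / tasks_per_card
print(f"Implied average runtime: {minutes_per_task:.1f} min/task")   # ~14.4 min

# At that rate the card clears roughly 60 / 14.4 ≈ 4 tasks an hour, which is
# consistent with the cache being "topped up by one request an hour".
```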
W-K 666 Joined: 18 May 99 Posts: 19310 Credit: 40,757,560 RAC: 67
"The problems didn't start until the cache size was increased ..."
"Which is a different statement. Cache size (by time) is under the control of the user: the number of tasks in progress is capped by the project. If the database size spiked (and it did) when the project cap was lifted, then many users had set a cache time longer than the project cap allowed - which was pointless, and simply resulted in more frequent requests for (at most) 1 task."
I think that many users will set a cache size to give them as many tasks as possible for the CPU. If the average CPU task takes 1 hour, that now requires a cache size of 6 days - which, because the computer has 6 cores, actually only lasts 1 day. Therefore, if you limit the cache to one day, for the CPU it will be 1 day / number of cores.
Intel's latest Xeon packages two 24-core CPU chips into one package, which would be 48 cores; the news item doesn't mention whether there can be 2 threads/core. So a rethink about the CPU cache being for the CPU, and not the number of cores, needs to be done before a time-limit cache is imposed.
My cache for my 2060 is 0.6 days + 0.02 days, and most of the time, but not always, I get the max 150 tasks. It used to be 0.4 days + under the 100-task limit.
Keith Myers Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873
Thanks for the redirect to that developer report. I had only heard anecdotally that the ROCm drivers were fubared for BOINC projects and not the why.
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours
A proud member of the OFA (Old Farts Association)
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
The database spiked because all those people who, because of the cap, had less than a ten-day cache were suddenly able to download more. From my experience, people have the cache set at 10 days and never change it, even when experimenting with things known to trash caches. It seems the furthest thing from their minds is to lower the cache. When running the CUDA Special App, most people have to jump through hoops to get even a half-day's worth of cache; most have much less than a one-day cache.
W-K 666 Joined: 18 May 99 Posts: 19310 Credit: 40,757,560 RAC: 67
The other thing that needs to be taken into account is that the true cache size is usually significantly smaller than the one set, because of the noise in the data forcing early finishes (-9s). My 0.62-day cache, as indicated by the average turnaround, is actually 0.41 days.
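A minimal sketch of that shrinkage. The noise fraction below is an assumption chosen to reproduce the 0.62 to 0.41 day figures quoted above, not a measured value:

```python
# Why the effective cache is smaller than the setting when some tasks
# overflow early (-9 results). The noise fraction is an assumed example.

set_cache_days = 0.62
noise_fraction = 0.34       # assumed share of tasks that end almost instantly
normal_runtime = 1.0        # relative runtime of a clean task
noise_runtime = 0.0         # a -9 overflow ends in seconds, effectively zero

avg_runtime = (1 - noise_fraction) * normal_runtime + noise_fraction * noise_runtime
effective_cache_days = set_cache_days * avg_runtime
print(f"Effective cache: {effective_cache_days:.2f} days")   # ~0.41 days
```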
Richard Haselgrove Joined: 4 Jul 99 Posts: 14673 Credit: 200,643,578 RAC: 874
"I think that many users will set a cache size to give them as many tasks as possible for the CPU. If the average CPU task takes 1 hour, that now requires a cache size of 6 days - which, because the computer has 6 cores, actually only lasts 1 day."
FALSE - check the work requests in seconds (<sched_op_debug>). If you set a 1-day cache, and you have six cores, BOINC will ask for 6 days of work - one day (cache size) of work for each core.
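A simplified sketch of that per-core multiplication, with example values. The real client also subtracts work already on hand and applies resource shares; this only shows the scaling being described:

```python
# Per-core work request, as reported in seconds by <sched_op_debug>.
# Example values only; real requests are reduced by work already queued.

cache_days = 1.0
cpu_cores = 6

seconds_per_day = 24 * 3600
request_seconds = cache_days * seconds_per_day * cpu_cores
print(f"CPU work request: {request_seconds:.0f} seconds "
      f"({request_seconds / seconds_per_day:.0f} device-days)")   # 518400 s = 6 device-days
```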
Richard Haselgrove Joined: 4 Jul 99 Posts: 14673 Credit: 200,643,578 RAC: 874
"The other thing that needs to be taken into account is that the true cache size is usually significantly smaller than the one set, because of the noise in the data forcing early finishes (-9s)."
That's fair comment, and has been particularly true while we work through these trashy old tapes. But as soon as you blow through a noise bomb or two, BOINC will ask for more work to bring your cache back up to spec. The only problem is a run of noise bombs during an outage ...
Richard Haselgrove Joined: 4 Jul 99 Posts: 14673 Credit: 200,643,578 RAC: 874
"The database spiked because all those people who, because of the cap, had less than a ten-day cache were suddenly able to download more. From my experience, people have the cache set at 10 days and never change it, even when experimenting with things known to trash caches. It seems the furthest thing from their minds is to lower the cache."
Then we have to do something to bring it back to the front of their attention.
Kevin Olley Joined: 3 Aug 99 Posts: 906 Credit: 261,085,289 RAC: 572
"Given the sheer number of tasks out in the field, and so sitting in the "day-file", it is going to take some time to get things back under control :-("
Einstein would like that :-) On my main machine my Einstein RAC is higher than my SETI RAC, and it only processes Einstein when there are no SETI WUs available.
Have a look at the oldest 1000 or so validation-pending tasks on this machine and add up all the workunits that are sitting on machines that are no longer processing work.
HAVE we got a serious problem with a number of people using SETI as a stress test and not bothering to abort unprocessed tasks?
HOW many WUs are out there that get sent to two of these types of machines, and then have to wait a couple of months before being sent to machines that will HOPEFULLY process them properly?
Looking at the Einstein WUs on my other machine (it does 50% Einstein on GPUs), there are very few _2, no _3, and one _4 task out of 300 WUs. Einstein does not seem to have a problem with a 2-week deadline, even though their fast WUs take SIX times longer on average to process.
Shortening the deadline and adjusting the cache size, IF it were based on the capabilities and previous performance of the machine, might help. JUST reducing the cache size is only going to upset those who will be affected by that change :-(
Kevin
Ian&Steve C. Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640
Even with a nearly 10,000-task cache, it only lasts about 12-13 hrs on my fastest host. With the current workarounds available I cannot get more cache than this. The slowest of my main 3 still can't hold more than a day. I'd love to be able to actually hold a day. A 1-day cache for my fastest system is ~17,000-20,000 WUs, depending on how fast each task is; they are quite variable.
Just reduce the per-device limits back to 100 to stop the "hands-off" slow systems, and those that fill their cache then don't return anything until the deadline has passed, from caching too much (where 100 per device is already more than a day), and let those of us who are able work around the issue ourselves, as we have been doing for a long time.
There's no "one" solution to this problem:
1. Reduce server-side limits back to 100 to reduce the number of tasks out in the field (since those "spoofing" are a relatively small slice of the pie). I counted about 35 hosts in the top 100 that are spoofing. Even if you assume that ALL of those are spoofing to the max (which they are not), 64 GPUs, that gives them collectively 336,000 tasks in progress (150 * 64 * 35 hosts). Comparing that to the >6,000,000 results currently in the field, that's ~5% (arithmetic sketched below). But in reality it's less, since a lot of those people are not max-spoofing. Those using spoofing are not a big problem, and are only doing so to continue supporting the project during the downtimes. If SETI were as smooth and stable as some other projects and never went down for weekly maintenance or other instability, I would have no problem with even a very small cache, as long as it's always working.
2. Reduce the deadlines so that the impact of results being abandoned is lessened (once a result hits the deadline, it goes to someone else). The current deadlines can EASILY be cut in half without ostracizing 99.99% of hosts.
3. Invest in better hardware for some of the project servers, even if it's only a minor platform change to hold more memory, allowing a larger limit on DB size in RAM.
4. (Future) Work towards developing a more variable limit for tasks in progress based on host performance.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours
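A worst-case check of the spoofing arithmetic in point 1 above. All inputs are the figures quoted in the post; the result is an upper bound, since not all of those hosts spoof to the maximum:

```python
# Upper-bound estimate of tasks held by spoofing hosts, using the post's figures.

spoofed_hosts = 35          # hosts in the top 100 counted as spoofing
gpus_per_host = 64          # maximum GPU count assumed in the post
limit_per_gpu = 150         # current per-device server-side limit

spoofed_tasks = spoofed_hosts * gpus_per_host * limit_per_gpu
results_in_field = 6_000_000

print(f"Spoofed tasks (upper bound): {spoofed_tasks:,}")                          # 336,000
print(f"Share of results in the field: {spoofed_tasks / results_in_field:.1%}")   # ~5.6%
```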
W-K 666 Joined: 18 May 99 Posts: 19310 Credit: 40,757,560 RAC: 67
"I think that many users will set a cache size to give them as many tasks as possible for the CPU. ..."
"FALSE - check the work requests in seconds (<sched_op_debug>). If you set a 1-day cache, and you have six cores, BOINC will ask for 6 days of work - one day (cache size) of work for each core."
Not what I was told here: https://setiathome.berkeley.edu/forum_thread.php?id=84983&postid=2032189#2032189
Ian&Steve C. Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640
Richard is talking about what is requested, not what you get. If your host is slow enough that six days' worth of tasks falls below the project-side limits, you will be sent six days' worth.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours
Richard Haselgrove Joined: 4 Jul 99 Posts: 14673 Credit: 200,643,578 RAC: 874
"I think that many users will set a cache size to give them as many tasks as possible for the CPU. ..."
"FALSE - check the work requests in seconds (<sched_op_debug>). If you set a 1-day cache, and you have six cores, BOINC will ask for 6 days of work - one day (cache size) of work for each core."
"Not what I was told here: https://setiathome.berkeley.edu/forum_thread.php?id=84983&postid=2032189#2032189"
That was the "in-progress cap" by number of tasks - that is indeed for the whole CPU. By 'cache', I mean the user setting - 'store at least -- days of work'. That one is per core. The system is, indeed, illogical.
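A small sketch of how the two limits interact as clarified above: the "store at least N days" setting scales per core, while the in-progress cap counts tasks per host. The cap value and task runtime used here are example figures, not project settings:

```python
# Which limit actually binds: the per-core time cache or the per-host task cap.
# Example figures only.

cache_days = 1.0
cpu_cores = 6
avg_task_hours = 1.0        # example CPU task runtime
per_host_cpu_cap = 150      # example project-side in-progress cap

tasks_wanted = cache_days * 24 / avg_task_hours * cpu_cores   # per-core cache -> 144 tasks
tasks_allowed = min(tasks_wanted, per_host_cpu_cap)

print(f"Cache asks for {tasks_wanted:.0f} tasks; the host ends up holding {tasks_allowed:.0f}")
# Whichever figure is smaller is what the host actually holds.
```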