About Deadlines or Database reduction proposals

Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 2034580 - Posted: 29 Feb 2020, 10:55:33 UTC - in response to Message 2034577.  

Does this mean ... it is NOT the slow return of tasks that is causing our problem now?
It never was.
It is exacerbating the problem, but it is not the cause.

This post from earlier in this very thread points that out.
Grant
Darwin NT
juan BFP - Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2034586 - Posted: 29 Feb 2020, 11:45:58 UTC - in response to Message 2034580.  
Last modified: 29 Feb 2020, 12:11:17 UTC

Does this mean ... it is NOT the slow return of tasks that is causing our problem now?
It never was.
It is exacerbating the problem, but it is not the cause.

This post from earlier in this very thread points that out.

Agreed. This is a common mistake: slow or fast hosts are not the cause of the problem.

The problem is the size of the DB, because it no longer fits in the server's RAM.

This is why we suggest a few ways to squeeze the DB size. Reducing the deadline of the WUs, reducing the WU limits, etc. are some
of the suggestions to achieve that.
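To make that concrete, here is a rough back-of-envelope sketch of why the row count is what drives the in-RAM footprint. The per-row size and index overhead are purely illustrative assumptions, not figures from the actual SETI@home servers; the row counts echo the millions of results mentioned elsewhere in this thread.

```python
# Back-of-envelope sketch: the in-memory footprint of the result tables is
# driven by row count. The per-row size and index overhead below are
# illustrative assumptions, NOT measurements of the real SETI@home database.

def db_working_set_gb(rows, avg_row_bytes=1024, index_overhead=0.5):
    """Rough working-set estimate for a table plus its indexes, in GB."""
    return rows * avg_row_bytes * (1 + index_overhead) / 1e9

# Hypothetical scenarios: today's row count vs. what shorter deadlines and
# lower per-host limits might leave in the database.
for label, rows in [("current (~10 million rows)", 10_000_000),
                    ("after trimming (~6 million rows)", 6_000_000)]:
    print(f"{label}: ~{db_working_set_gb(rows):.1f} GB of RAM needed")
```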
rob smith - Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22440
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2034587 - Posted: 29 Feb 2020, 12:36:08 UTC

Given the sheer number of tasks out in the field, and so sitting in the "day-file", it is going to take some time to get things back under control :-(
But if you think back to November, when the allowance was pushed up, and how rapidly the queue sizes responded, that would suggest that dropping allowances would be the first action - and it would work, provided everyone lived within the lower limits and nobody further inflated their number of GPUs to keep their caches at their current inflated levels.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14673
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2034588 - Posted: 29 Feb 2020, 12:53:15 UTC - in response to Message 2034587.  

Given the sheer number of tasks out in the field, and so sitting in the "day-file", it is going to take some time to get things back under control :-(
But if you think back to November, when the allowance was pushed up, and how rapidly the queue sizes responded, that would suggest that dropping allowances would be the first action - and it would work, provided everyone lived within the lower limits and nobody further inflated their number of GPUs to keep their caches at their current inflated levels.
I think that horse has bolted....

In my opinion, there's a difference between:

'caching' - having enough tasks to work through routine outages, plus the server restart delay, plus the initial rush of reporting and refills by 'normal' users, and

'bunkering' - deliberately hoarding work, above and beyond current needs, for competition purposes.

IMHO, bunkering is asking project managers (not just SETI) to supply extra work beyond scientific research needs - that's the tail wagging the dog. We volunteer to help the projects, not the other way round. But I'm OK with caching - once we're through this bumpy patch - provided it's kept to the minimum level necessary to meet the objective stated above.
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19310
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2034591 - Posted: 29 Feb 2020, 13:26:02 UTC - in response to Message 2034578.  

@ W-K 666

But couldn't the "Results returned and awaiting Validation" increase be down to the fact that they cannot move on to "Assimilation" because there is no room, as that number is now 4 million instead of close to zero?
No. The various queues from "ready to send" to "DB purging" are "logical queues", and a task is "moved" from one queue to another by changing a status value within the row of the DB.

The limit we are bumping up against is the number of rows within the DB, not the values within the rows.

That still doesn't answer why there are 4 million workunits in the "waiting for Validation" row when there used to be a very small number - 108 in the example given, on 15th Nov 2019.
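For readers following the "logical queue" point quoted above, here is a minimal sketch of the idea. The table and column names are invented for illustration; they are not the real BOINC/SETI@home schema.

```python
# Minimal sketch of the "logical queue" idea quoted above: a task is never
# physically moved between queues, its row just gets a new status value.
# Table and column names here are invented for illustration, not the real
# BOINC/SETI@home schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE result (id INTEGER PRIMARY KEY, server_state TEXT)")
conn.execute("INSERT INTO result VALUES (1, 'ready_to_send')")

# "Moving" the task from one logical queue to the next is just an UPDATE;
# the number of rows in the table does not change.
conn.execute("UPDATE result SET server_state = 'awaiting_validation' WHERE id = 1")

# Presumably only the final "DB purging" stage would actually delete the row
# and shrink the table.
print(conn.execute("SELECT COUNT(*) FROM result").fetchone()[0])  # still 1 row
```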
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2034596 - Posted: 29 Feb 2020, 15:04:07 UTC - in response to Message 2034588.  

It was demonstrated a while back that the best cache size is one equal to the Host's daily production. If you lower the cache on a high-production machine, the Pending Tasks number goes up significantly. This is easy to see by comparing two Hosts with similar production but different cache sizes. This one has a smaller cache and a large number of Pendings, https://setiathome.berkeley.edu/results.php?hostid=8837101, and this one a larger cache with a smaller number of Pendings, https://setiathome.berkeley.edu/results.php?hostid=8873167. This is Not a Theory; long-term observation has verified it.

There isn't any significant change in the impact on the Database between a large or small cache until the cache size exceeds the daily production. That's why I keep recommending changing the cache size to One Day - it's the best size for the database. The problems didn't start until the cache size was increased on all those machines with a Low daily production; before that there wasn't a problem with some machines having a cache size closer to their daily production.

The way I remember it, the Server "upgrade" happened on Dec 20th, and looking at the Server page from Dec 21st it's clear the problem was well underway before then, https://web.archive.org/web/20191221230917/https://setiathome.berkeley.edu/show_server_status.php. As far as I know, the only things that were changed before Dec 20th were sending more tasks to All Hosts, even those who didn't need them, and the extra AMD/ATI quorum; there wasn't a problem until then. Whatever the problem is, it isn't caused by the Hosts that have less than a Day's worth of cache.
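As a rough illustration of what "a cache equal to daily production" means in task counts, here is a small sketch. The runtimes and GPU counts are assumptions for illustration only, not figures for any particular host in this thread.

```python
# Rough sketch of the "cache equal to daily production" rule of thumb.
# The runtimes and concurrency figures are assumptions for illustration only.

def daily_production(seconds_per_task, concurrent_tasks):
    """Tasks a host can complete per day at a given average runtime."""
    return 86400 / seconds_per_task * concurrent_tasks

# Hypothetical slow host: one GPU, ~20 minutes per task.
slow_host = daily_production(seconds_per_task=1200, concurrent_tasks=1)
# Hypothetical fast host: four GPUs, ~1 minute per task each.
fast_host = daily_production(seconds_per_task=60, concurrent_tasks=4)

print(f"one-day cache: ~{slow_host:.0f} tasks (slow host), "
      f"~{fast_host:.0f} tasks (fast host)")
```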
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14673
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2034602 - Posted: 29 Feb 2020, 15:55:33 UTC - in response to Message 2034596.  

Which is a long-winded way of saying that the best cache size is the same as your wingmate's: then neither of you has to wait for the other.

If the community as a whole agrees to settle on 1 day, that's fine by me.
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14673
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2034603 - Posted: 29 Feb 2020, 16:04:48 UTC - in response to Message 2034596.  

The problems didn't start until the cache size was increased ...
Which is a different statement. Cache size (by time) is under the control of the user: the number of tasks in progress is capped by the project. If the database size spiked (and it did) when the project cap was lifted, then many users had set a cache time longer than the project cap allowed - which was pointless, and simply resulted in more frequent requests for (at most) 1 task.

My personal preference is for a cache of 0.75 days + 0.05 days extra, for slower GPUs like my 1050 TIs. That results in a cache of about 80 per card, topped up by one request an hour.
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19310
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2034610 - Posted: 29 Feb 2020, 16:36:23 UTC - in response to Message 2034603.  

The problems didn't start until the cache size was increased ...
Which is a different statement. Cache size (by time) is under the control of the user: the number of tasks in progress is capped by the project. If the database size spiked (and it did) when the project cap was lifted, then many users had set a cache time longer than the project cap allowed - which was pointless, and simply resulted in more frequent requests for (at most) 1 task.

My personal preference is for a cache of 0.75 days + 0.05 days extra, for slower GPUs like my 1050 TIs. That results in a cache of about 80 per card, topped up by one request an hour.

I think that many users will set a cache size to give them as many tasks as possible for the CPU. If the average CPU task takes 1 hour, that now requires a cache size of 6 days - which, because the computer has 6 cores, actually only lasts 1 day.

Therefore, if you limit the cache to one day, for the CPU it will effectively be 1 day divided by the number of cores.
Intel's latest Xeon packages two 24-core CPU chips into one package, which would be 48 cores; the news item doesn't mention whether there can be 2 threads per core.

So a rethink about the CPU cache being for the whole CPU, and not per core, needs to be done before a time-limited cache is imposed.

My cache for my 2060 is 0.6 days + 0.02 days, and most of the time, but not always, I get the max of 150 tasks. It used to be 0.4 days, back when we were under the 100 task limit.
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2034611 - Posted: 29 Feb 2020, 16:38:56 UTC - in response to Message 2034570.  

Thanks for the redirect to that developer report. I had only heard anecdotally that the ROCm drivers were fubared for BOINC projects, but not why.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2034614 - Posted: 29 Feb 2020, 16:47:14 UTC - in response to Message 2034603.  

The Database spiked because all those people that had less than a Ten Day cache, because of the Cap, were suddenly able to download more. From my experience, people have the cache set at 10 days and never change it, even when experimenting with things known to trash caches. It seems the furthest thing from their minds is to lower the cache. When running the CUDA Special App most people have to jump through hoops to even get a Half-Day's worth of cache; most have much less than a one-day cache.
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19310
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2034617 - Posted: 29 Feb 2020, 16:54:23 UTC

The other thing that needs to be taken into account is that the true cache size is usually significantly smaller than the one set, because of the noise in the data forcing early finishes (-9s).
My 0.62 day cache, as indicated by the average turnaround, is actually 0.41 days.
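A small sketch of that shrinkage. The share of noise bombs and their runtime below are illustrative assumptions, picked only so the result lands near the 0.41 days quoted, not measured values.

```python
# Sketch of why a cache sized in days shrinks in practice when many tasks
# overflow early (-9). The noise-bomb share and runtimes are illustrative
# assumptions, chosen only so the result lands near the 0.41 days quoted.

set_cache_days = 0.62
normal_runtime = 1.0      # relative runtime the estimate is based on
noisy_fraction = 0.35     # assumed share of noise bombs in the mix
noisy_runtime = 0.05      # they finish almost immediately

avg_actual = (1 - noisy_fraction) * normal_runtime + noisy_fraction * noisy_runtime
effective_cache_days = set_cache_days * avg_actual / normal_runtime
print(f"effective cache: ~{effective_cache_days:.2f} days")  # ~0.41 with these numbers
```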
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14673
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2034620 - Posted: 29 Feb 2020, 17:02:26 UTC - in response to Message 2034610.  

I think that many users will set a cache size to give them as many tasks as possible for the CPU. If the average CPU task takes 1 hour, that now requires a cache size of 6 days - which, because the computer has 6 cores, actually only lasts 1 day.
FALSE - check the work requests in seconds (<sched_op_debug>). If you set 1 day cache, and you have six cores, BOINC will ask for 6 days of work - one day (cache size) of work for each core.
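A rough illustration of that request arithmetic is below. This is a simplification: the real BOINC client also subtracts the work it already has on hand and handles GPU resources separately, so treat it as a sketch of the per-core scaling only.

```python
# Rough illustration of the per-core request arithmetic. This is a
# simplification: the real BOINC client also subtracts the work it already
# has on hand and handles GPU resources separately.

SECONDS_PER_DAY = 86400

def cpu_work_request_seconds(cache_days, ncpus, seconds_on_hand=0.0):
    """Approximate CPU work request: cache_days of work for each core."""
    return max(0.0, cache_days * SECONDS_PER_DAY * ncpus - seconds_on_hand)

req = cpu_work_request_seconds(cache_days=1.0, ncpus=6)
print(f"request: {req:.0f} s, i.e. ~{req / SECONDS_PER_DAY:.0f} days of single-core work")
```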
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14673
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2034623 - Posted: 29 Feb 2020, 17:05:49 UTC - in response to Message 2034617.  

The other thing that needs to be taken into account is that the true cache size is usually significantly smaller than the one set, because of the noise in the data forcing early finishes (-9s).
My 0.62 day cache, as indicated by the average turnaround, is actually 0.41 days.
That's fair comment, and has been particularly true while we work through these trashy old tapes. But as soon as you blow through a noise bomb or two, BOINC will ask for more work to bring your cache back up to spec. Only problem is a run of noise bombs during an outage ...
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14673
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2034624 - Posted: 29 Feb 2020, 17:08:12 UTC - in response to Message 2034614.  

The Database spiked because all those people that had less than a Ten Day cache, because of the Cap, were suddenly able to download more. From my experience, people have the cache set at 10 days and never change it, even when experimenting with things known to trash caches. It seems the furthest thing from their minds is to lower the cache.
Then we have to do something to bring it back to the front of their attention.
Kevin Olley

Joined: 3 Aug 99
Posts: 906
Credit: 261,085,289
RAC: 572
United Kingdom
Message 2034626 - Posted: 29 Feb 2020, 17:13:23 UTC - in response to Message 2034587.  

Given the sheer number of tasks out in the field, and so sitting in the "day-file", it is going to take some time to get things back under control :-(
But if you think back to November, when the allowance was pushed up, and how rapidly the queue sizes responded, that would suggest that dropping allowances would be the first action - and it would work, provided everyone lived within the lower limits and nobody further inflated their number of GPUs to keep their caches at their current inflated levels.


Einstein would like that:-)

On my main machine my Einstein RAC is higher than my SETI RAC and it only processes Einstein when there are No SETI WU's available.

Have a look at the oldest 1000 or so validation pending tasks on this machine and add up all the workunits that are sitting on machines that are no longer processing work.

HAVE we got a serious problem with a number of people using SETI as a stress test and not bothering to abort unprocessed tasks?

HOW many WU's are out there that get sent to 2 of these types of machines, having to wait a couple of months to be sent to machines that will HOPEFULLY process them properly?

Looking at the Einstein WUs on my other machine (it does 50% Einstein on its GPUs): they have a 2-week deadline, and there are very few _2 tasks, no _3 tasks, and 1 x _4 out of 300 WUs.

Einstein does not seem to have a problem with a 2-week deadline even though their fast WU's take SIX times longer on average to process.

Shortening the deadline, and adjusting the cache size IF it were based on the capabilities and previous performance of the machine, might help.

JUST reducing the cache size is only going to upset those who will be affected by that change:-(
Kevin


Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2034629 - Posted: 29 Feb 2020, 17:32:32 UTC - in response to Message 2034614.  

Even at nearly a 10,000 task cache, it only lasts about 12-13 hrs on my fastest host. With the current workarounds available I cannot get more cache than this. The slowest of my main 3 still can't hold more than a day. I'd love to be able to actually hold a day.

A 1 day cache for my fastest system is ~17,000-20,000 WUs, depending on how fast each task is; they are quite variable.

Just reduce the per-device limits back to 100 to stop the "hands-off" slow systems, and those that fill their cache then don't return anything until the deadline has passed, from caching too much (where 100 per device is already more than a day), and let those of us who are able work around the issue ourselves, as we have been doing for a long time.

There's no "one" solution to this problem.

1. Reduce server-side limits back to 100 to reduce the work out in the field (since those "spoofing" are a relatively small slice of the pie).

I counted about 35 hosts in the top 100 hosts that are spoofing. Even if you assume that ALL of those are spoofing to the max of 64 GPUs (which they are not), that gives them collectively 336,000 tasks in progress (150 * 64 * 35 hosts). Comparing that to the >6,000,000 results currently in the field, that's ~5% (the arithmetic is sketched after this list). But in reality it's less, since a lot of those people are not max spoofing. Those using spoofing are not a big problem, and are only doing so to continue supporting the project during the downtimes. If SETI was as smooth and stable as some other projects and never went down for weekly maintenance or other instability, I would have no problem with even a very small cache, as long as it's always working.

2. Reduce the deadlines so that the impact of results being abandoned is lessened (once a result hits the deadline, it goes to someone else). We can EASILY cut the current deadlines in half without ostracizing 99.99% of hosts.
3. Invest in better hardware for some of the project servers, even if it's only a minor platform change to hold more memory and allow a larger limit on DB size in RAM.
4. (future) Work towards developing a more variable limit for tasks in progress based on host performance.
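For readers who want to check the figures in point 1, here is the arithmetic as a minimal Python sketch. The per-host numbers are the upper-bound assumptions stated in that post, not measured values.

```python
# Checking the arithmetic in point 1 above. The per-host figures are the
# upper-bound assumptions stated in that post, not measured values.

spoofed_hosts = 35          # hosts in the top 100 that appear to spoof
max_gpus_per_host = 64      # worst-case assumption used in the post
limit_per_gpu = 150         # current per-device in-progress limit
results_in_field = 6_000_000

spoofed_tasks = spoofed_hosts * max_gpus_per_host * limit_per_gpu
share = spoofed_tasks / results_in_field
print(f"{spoofed_tasks:,} tasks in progress, ~{share * 100:.1f}% of the field")
```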
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19310
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2034633 - Posted: 29 Feb 2020, 17:44:44 UTC - in response to Message 2034620.  

I think that many users will set a cache size to give them as many tasks as possible for the CPU. If the average CPU task takes 1 hour, that now requires a cache size of 6 days - which, because the computer has 6 cores, actually only lasts 1 day.
FALSE - check the work requests in seconds (<sched_op_debug>). If you set 1 day cache, and you have six cores, BOINC will ask for 6 days of work - one day (cache size) of work for each core.

Not what I was told here https://setiathome.berkeley.edu/forum_thread.php?id=84983&postid=2032189#2032189
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2034636 - Posted: 29 Feb 2020, 17:51:13 UTC - in response to Message 2034633.  

Richard is talking about what is requested, not what you get.

If your host is slow enough that the number of tasks that 6 days' worth comprises falls below the project-side limits, you will be sent 6 days' worth.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14673
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2034638 - Posted: 29 Feb 2020, 17:51:22 UTC - in response to Message 2034633.  

I think that many users will set a cache size to give them as many tasks as possible for the CPU. If the average CPU task takes 1 hour, that now requires a cache size of 6 days - which, because the computer has 6 cores, actually only lasts 1 day.
FALSE - check the work requests in seconds (<sched_op_debug>). If you set 1 day cache, and you have six cores, BOINC will ask for 6 days of work - one day (cache size) of work for each core.
Not what I was told here https://setiathome.berkeley.edu/forum_thread.php?id=84983&postid=2032189#2032189
That was the "in-progress cap" by number of tasks - that is indeed for the whole CPU.

By 'cache', I mean the user setting - 'store at least -- days of work'. That one is per core. The system is, indeed, illogical.