About Deadlines or Database reduction proposals

W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19317
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2034639 - Posted: 29 Feb 2020, 18:00:53 UTC - in response to Message 2034638.  

I think that many users will set a cache size to give them as many tasks as possible for the CPU. If the average CPU task takes 1 hour, that now requires a cache size of 6 days - which, because the computer has 6 cores, actually only lasts 1 day.
FALSE - check the work requests in seconds (<sched_op_debug>). If you set a 1-day cache and you have six cores, BOINC will ask for 6 days of work - one day (the cache size) of work for each core.
Not what I was told here https://setiathome.berkeley.edu/forum_thread.php?id=84983&postid=2032189#2032189
That was the "in-progress cap" by number of tasks - that is indeed for the whole CPU.

By 'cache', I mean the user setting - 'store at least -- days of work'. That one is per core. The system is, indeed, illogical.

Why do I keep getting told s/ware guys and gals have to be logical?

Maybe time to get the thumb screws out and interrogate the youngest, who has turned up with a few days off because his trip to India is cancelled.
ID: 2034639
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14674
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2034640 - Posted: 29 Feb 2020, 18:08:30 UTC - in response to Message 2034639.  

Why do I keep getting told s/ware guys and gals have to be logical?

Maybe time to get the thumb screws out and interrogate the youngest, who has turned up with a few days off because his trip to India is cancelled.
They do - just that nobody's told Berkeley yet. Perhaps you could send him there instead?
ID: 2034640
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22455
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2034642 - Posted: 29 Feb 2020, 18:22:39 UTC

I think you are confusing task requests with the maximum allowed number of tasks.
A host will request as many seconds as it needs to fill the cache, but it will only get tasks up to 150 for the CPU and 150 per GPU.
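To put that two-step exchange in rough numbers (an illustrative sketch only - the 150 figure is the cap discussed in this thread, and the function and variable names are mine, not the actual scheduler code):

# Illustrative sketch of the request/cap interaction described above -
# not the real BOINC/SETI scheduler code, just the arithmetic.
SERVER_TASK_CAP = 150  # per CPU and per GPU, as quoted in this thread

def tasks_granted(requested_seconds, est_task_runtime_s, tasks_already_in_progress):
    """Turn a work request in seconds into a task count, clipped to the in-progress cap."""
    wanted = round(requested_seconds / est_task_runtime_s)
    headroom = max(0, SERVER_TASK_CAP - tasks_already_in_progress)
    return min(wanted, headroom)

print(tasks_granted(2 * 86400, 3600, 100))  # asks for 2 days at ~1 h/task -> 48 tasks
print(tasks_granted(6 * 86400, 3600, 100))  # asks for 6 days -> clipped to 50 by the cap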
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2034642
TBar
Volunteer tester

Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 2034644 - Posted: 29 Feb 2020, 18:43:22 UTC - in response to Message 2034629.  
Last modified: 29 Feb 2020, 18:50:49 UTC

4. (future) work towards developing a more variable limit for tasks in progress based on host performance
SETI has been able to estimate task limits based on performance for quite some time. That's how they can calculate the numbers in the Manager's Remaining column, and know how many tasks to send when Richard sets his cache to 0.75 days. They can do it NOW. My suggestion was they remove the Numerical Cap and set the Cache limit to One Day instead of Ten days. Yes, this would mean the CUDA users will get Many More tasks, as they should, and the Hosts completing 20 tasks a day will get 20 tasks for their Cache, as they should. There was a Time when there wasn't a numerical Cap, and it wouldn't be difficult to return to that time.
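As a purely illustrative sketch of that suggestion (the formula and names below are my reading of it, not anything the SETI servers actually run), a one-day cap derived from host performance would just be "however many tasks the host finishes in a day":

# Hypothetical per-host cap based on throughput instead of a fixed 150 -
# an illustration of the suggestion above, not existing server code.
import math

def one_day_cap(avg_task_runtime_s, concurrent_tasks):
    """Enough tasks to keep the host busy for one day, from its measured runtime."""
    return max(1, math.ceil(86400 / avg_task_runtime_s * concurrent_tasks))

print(one_day_cap(4320, 1))  # a host finishing ~20 CPU tasks a day gets ~20
print(one_day_cap(60, 4))    # a fast multi-GPU host running 4 at a time gets 5760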
ID: 2034644
Ian&Steve C.

Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 2034648 - Posted: 29 Feb 2020, 19:25:18 UTC - in response to Message 2034644.  

Yes, they could figure out the number of tasks a host needs, but that doesn't necessarily mean the software infrastructure currently exists to implement that kind of method right now. I don't think it's a matter of simply going into the server code and changing "150 tasks" to "1 day" and that's it. Obviously some tweaking and additional logic would need to be added. Not that it's very hard, just saying that it will take some level of work, and discussion of their preferred limits.

SETI only has a 150 per device cap. SETI has nothing to do with any 10-day cap. That is a cap in the BOINC client software on what the client can ASK for; SETI then imposes its own 150-task limit in its response to that request.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 2034648
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22455
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2034659 - Posted: 29 Feb 2020, 20:08:11 UTC

SETI only has a 150 per device cap. SETI has nothing to do with any 10-day cap. That is a cap in the BOINC client software on what the client can ASK for; SETI then imposes its own 150-task limit in its response to that request.


Yes, each project can set its own limits, or use the cache size, or ignore both, as the project team feels best fits their project model or servers.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2034659
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2034666 - Posted: 29 Feb 2020, 20:40:58 UTC - in response to Message 2034587.  

nobody further inflates their number of GPUs to keep their caches at their current inflated levels.

May I ask: why are you so obsessed with the maybe 35 hosts that run with the spoofed client?
These 35 (or whatever number they are) are all at the top of the project's daily production hosts.
All their users are top crunchers, who know what they are doing and do all they can to protect the DB.
Almost all of them have an APR of around a day or two.
They crunch their WUs very fast, so the deadline has no meaning for them.
They actually help the project clear the WUs as fast as possible, not the reverse.
So they actually contribute to squeezing the DB size.
IIRC, before the project admins raised the WU limits, they worked and we had no problem with the WU supply.
Please enlighten me, I can't understand.
ID: 2034666
Ville Saari

Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2034669 - Posted: 29 Feb 2020, 20:59:56 UTC - in response to Message 2034610.  

Therefore if you limit the cache to one day, for the CPU it will be 1 day/number of cores.
Wrong! The client asks for CPU work for the configured time multiplied by the number of simultaneously running tasks (which may be the number of cores if you use each core, or even twice the number of cores if you run CPU tasks in each thread of an SMT processor).

If you configure a six-day cache for a six-core CPU, you get six days of work for each core, if your CPU is slow enough not to hit the 150-task cap with that amount.

I'm running 8 CPU apps with a one-day configured cache, but my CPU crunches through the 150-task cap in approximately 12 hours, which means each core is half a day short of the configured cache due to the server-side cap. So if the client wanted to ask for that missing half a day of work for each core, it wouldn't ask for just half a day but 8 times that, i.e. 4 days.

Let's see what my latest scheduler request actually asked...

<cpu_req_secs>347441.568090</cpu_req_secs>

That's about 4.02 days!
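For anyone who wants to check the arithmetic, here is that request reconstructed from the numbers above (my reconstruction, not BOINC source code):

# Reconstruction of the request above: 8 CPU tasks running, a 1-day configured
# cache, and roughly half a day of work actually on hand per core.
SECONDS_PER_DAY = 86400
running_cpu_tasks = 8
configured_cache_days = 1.0
buffered_days_per_core = 0.5   # what the 150-task cap actually leaves per core

cpu_req_secs = running_cpu_tasks * (configured_cache_days - buffered_days_per_core) * SECONDS_PER_DAY
print(cpu_req_secs)                    # 345600.0 - close to the 347441 logged above
print(cpu_req_secs / SECONDS_PER_DAY)  # 4.0 days, matching the ~4.02 figure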
ID: 2034669
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19317
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2034684 - Posted: 29 Feb 2020, 22:01:39 UTC - in response to Message 2034669.  

Therefore if you limit the cache to one day, for the CPU it will be 1 day/number of cores.
Wrong! The client asks for CPU work for the configured time multiplied by the number of simultaneously running tasks (which may be the number of cores if you use each core, or even twice the number of cores if you run CPU tasks in each thread of an SMT processor).

If you configure a six-day cache for a six-core CPU, you get six days of work for each core, if your CPU is slow enough not to hit the 150-task cap with that amount.

I'm running 8 CPU apps with a one-day configured cache, but my CPU crunches through the 150-task cap in approximately 12 hours, which means each core is half a day short of the configured cache due to the server-side cap. So if the client wanted to ask for that missing half a day of work for each core, it wouldn't ask for just half a day but 8 times that, i.e. 4 days.

Let's see what my latest scheduler request actually asked...

<cpu_req_secs>347441.568090</cpu_req_secs>

That's about 4.02 days!

Not to be too rude about it, but if you had read beyond my post you would have found that I now have the full story.
BOINC lets you ask for a timed cache per device, BUT SETI enforces its own limit, which is a number of tasks per device - and that device is the CPU, not cores or threads.
ID: 2034684
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 2034686 - Posted: 29 Feb 2020, 22:21:24 UTC - in response to Message 2034591.  
Last modified: 29 Feb 2020, 22:23:25 UTC

Still doesn't answer why there are 4 million workunits in the waiting for Validation row when there used to be a very small number, 108 in the example given, on 15th Nov 2019.
How many times does this have to be said before people take notice? Seriously.

All the issues we are seeing (Assimilation, Deletion & Purge backlogs as they occur) can be explained by what Eric has already told us - the database can no longer be cached in the RAM of the database server, which is a result of the blowout in the Results returned and awaiting validation due to the need to protect the Science database from corrupt data.
If it can't be cached in RAM, I/O performance falls off a cliff &
any process that makes use of the database will be affected.
Grant
Darwin NT
ID: 2034686
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19317
Credit: 40,757,560
RAC: 67
United Kingdom
Message 2034688 - Posted: 29 Feb 2020, 22:31:30 UTC - in response to Message 2034686.  
Last modified: 29 Feb 2020, 22:32:54 UTC

Still doesn't answer why there are 4 million workunits in the waiting for Validation row when there used to be a very small number, 108 in the example given, on 15th Nov 2019.
How many times does this have to be said before people take notice? Seriously.

All the issues we are seeing (Assimilation, Deletion & Purge backlogs as they occur) can be explained by what Eric has already told us - the database can no longer be cached in the RAM of the database server, which is a result of the blowout in the Results returned and awaiting validation due to the need to protect the Science database from corrupt data.
If it can't be cached in RAM, I/O performance falls off a cliff &
any process that makes use of the database will be affected.

That is still a weak answer, in part because unless the project stops sending out tasks, nothing can be done about the tasks out in the field, the returns and their validation. BUT the assimilation and purging could be done during the Tuesday outage and relieve some of the pressure. The reason for the Tuesday outage is to sort out the databases, after all.
ID: 2034688
Keith Myers Special Project $250 donor
Volunteer tester

Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 2034692 - Posted: 29 Feb 2020, 22:45:28 UTC - in response to Message 2034688.  

I don't know if that is still the case. It used to be for database compaction and clean-up, and then backup. I think they are just backing up now and skipping the database massage. That used to take more than a day back when the db was much smaller. Now that the db is much larger and the disk I/O hasn't really changed, I wonder if the long outages are just covering the backup of the db.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 2034692
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 2034693 - Posted: 29 Feb 2020, 22:48:05 UTC - in response to Message 2034614.  

The Database spiked because all those people that had less than a Ten Day cache, because of the Cap, suddenly were able to download more. From my experience people have the cache set at 10 days and never change it, even when experimenting with things known to trash caches. Seems the furthest thing from their minds is to lower the cache. When running the CUDA Special App most people have to jump through hoops to even get a Half-Days worth of cache, most have much less than a one day cache.
The spike in the database was because of 4 things:
1. an increase in the server-side limits.
2. the loading of files, many of which turned out to be 95% noise bombs.
3. the increase in the number of times a WU had to be processed in order to reach Quorum, due to the RX 5000 driver issues.
4. hosts that get work and don't return it, combined with long deadlines.

4. Hosts not returning work.
With the long deadlines, this will also have exacerbated the problem, but like the server-side limit increase it wasn't a cause, just a contributing factor. However, the long deadlines are contributing to the delay in reducing the backlog.
But as to how much they are delaying it / how much shorter deadlines would reduce the size of the database, that is difficult to say.


1. Increase in server-side limits.
Yes, this did contribute to the problem, but it wasn't the cause - its effect on the database was not significant. Work in progress went from around 5 million to around 7 million (around a 40% increase).
That increase of 2 million is a drop in the bucket when compared to


2. The loading of files that produced little more than noise bombs.
3. The increase in the minimum Quorum for short-running WUs.
This is what blew the database size out. Instead of 4 million Results returned and awaiting validation, we are now up to 14 million - a 3.5-fold increase. And it is that increase that has caused all the other queues and processes to become impacted.


I've said it before & I'll say it again because people don't seem to be getting it - any process that accesses the database will be negatively impacted by the blowout in the database size.



It was 2 & 3 - the noise bombs and the need to increase the Quorum for short-running WUs, both occurring at the same time - that caused the size of the database to bloat beyond the server's ability to deal with it. Increased server-side limits would have brought forward the point at which the database could no longer be cached in RAM, but not by much compared to the other factors.
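Putting the two effects side by side with the approximate figures quoted above (a back-of-envelope comparison only, not official server statistics):

# Back-of-envelope comparison using the approximate numbers quoted in this post.
in_progress_before, in_progress_after = 5_000_000, 7_000_000
awaiting_validation_before, awaiting_validation_after = 4_000_000, 14_000_000

def growth(before, after):
    return after - before, 100 * (after - before) / before

print("In progress:         +{:,} rows ({:.0f}% increase)".format(*growth(in_progress_before, in_progress_after)))
print("Awaiting validation: +{:,} rows ({:.0f}% increase)".format(*growth(awaiting_validation_before, awaiting_validation_after)))
# In progress:         +2,000,000 rows (40% increase)
# Awaiting validation: +10,000,000 rows (250% increase)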
Grant
Darwin NT
ID: 2034693
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 2034697 - Posted: 29 Feb 2020, 22:58:17 UTC - in response to Message 2034688.  

That is still a weak answer, in part because unless the project stops sending out tasks, nothing can be done about the tasks out in the field, the returns and their validation.
It's not a weak answer, it is just a statement of fact.

BUT the assimilation and purging could be done during the Tuesday outage and relieve some of the pressure. The reason for the Tuesday outage is to sort out the databases, after all.
But the Results returned and awaiting validation number remains unchanged, and after the outage everyone reports what work they were able to do, blowing that number out even further for a while, and then as the work is Validated it becomes ready for Assimilation. But Assimilation requires database access, and the database is larger than the server can handle, so I/O is severely limited and the Assimilation backlog will grow again...
Back to where we were.


And the fact is that -
Until such time as there is no need to have multiple resends on short-running WUs to protect the Science database from corrupt data, this issue will continue.
Or
Until such time as we have a new database server with more RAM, the issue will continue to exist.


Richard's suggestion of implementing the reduced deadline for resends (if it works) would probably help.
Grant
Darwin NT
ID: 2034697
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 2034704 - Posted: 29 Feb 2020, 23:23:50 UTC - in response to Message 2034587.  

Given the sheer number of tasks out in the field, and so sitting in the "day-file", it is going to take some time to get things back under control :-(
But if you think back to November, when the allowance was pushed up, how rapidly the queue sizes responded would suggest that dropping allowances would be the first action - and it would work provided everyone lived within the lower limits and nobody further inflated their number of GPUs to keep their caches at their current inflated levels.

I will repost what I posted previously in this thread concerning spoofing.
Summary - spoofing has no effect on the size of the database, just the distribution of WUs amongst the different statuses.


... That was my feeling too, till TBar posted a comparison between 2 systems with almost identical hardware & applications. One spoofed, the other not.
Keep in mind - the faster you return work, the higher your Pendings. The longer it takes you to return work, the greater the chance your wingman has already returned theirs & so it will go (pretty much) straight to Validated.

Whether you spoof or not, the end result is that the load on the database is pretty much the same. The Task list All numbers - which is the load on the database (In Progress + Pendings + Inconclusives + Valids etc) - for both systems were within a few hundred of each other. The spoofed system had a much higher In Progress number, with a much lower Validation Pending number. The un-spoofed system had a much lower In Progress number, with much higher Validation Pending numbers. But overall, spoofed v unspoofed, for a given system & application the All numbers were pretty much the same; it's just a difference in status (In Progress v Validation Pending).

Basically, the better a system performs, the greater the load on the database. But the same amount of work (WUs per hour) done by many, many more slower systems would create the same load - with the added load of keeping track of all those extra systems, and all those extra Scheduler requests, of course.


Spoofing has no effect at all on the database size.
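To illustrate the point with invented numbers (these are not TBar's actual figures, just the shape of the comparison):

# Invented illustration of the spoofed-vs-unspoofed comparison described above:
# the split between statuses differs, but the total row count is about the same.
spoofed   = {"in_progress": 3000, "validation_pending": 600,  "inconclusive": 150, "valid": 250}
unspoofed = {"in_progress": 150,  "validation_pending": 3400, "inconclusive": 150, "valid": 300}

print(sum(spoofed.values()))    # 4000
print(sum(unspoofed.values()))  # 4000 - same load on the database either way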
Grant
Darwin NT
ID: 2034704
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2034706 - Posted: 29 Feb 2020, 23:28:15 UTC - in response to Message 2034666.  

nobody further inflates their number of GPUs to keep their caches at their current inflated levels.

May I ask: why are you so obsessed with the maybe 35 hosts that run with the spoofed client?
These 35 (or whatever number they are) are all at the top of the project's daily production hosts.
All their users are top crunchers, who know what they are doing and do all they can to protect the DB.
Almost all of them have an APR of around a day or two.
They crunch their WUs very fast, so the deadline has no meaning for them.
They actually help the project clear the WUs as fast as possible, not the reverse.
So they actually contribute to squeezing the DB size.
IIRC, before the project admins raised the WU limits, they worked and we had no problem with the WU supply.
Please enlighten me, I can't understand.


Still waiting for an answer, if there is one.
ID: 2034706
Darrell Wilcox Project Donor
Volunteer tester

Joined: 11 Nov 99
Posts: 303
Credit: 180,954,940
RAC: 118
Vietnam
Message 2034741 - Posted: 1 Mar 2020, 1:41:33 UTC - in response to Message 2034693.  

@ Grant (SSSF)
But as to how much they are delaying it / how much shorter deadlines would reduce the size of the database, that is difficult to say.
Based on Eric Korpela's sample taken 2/2/2019, tasks passed their deadlines at about 40,000/day. Tasks older than 30 days
represented about 20% of all tasks. Not having better data, we can tentatively "assume" this is somewhat still true.
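A back-of-envelope reading of those figures (illustrative only - the deadline lengths below are assumptions, and the estimate only holds if the expiry rate stays roughly constant):

# Rough steady-state estimate from the ~40,000/day figure above.
# The deadline lengths are assumptions for illustration, not project settings.
expiring_per_day = 40_000
deadline_now_days = 56        # assumed: roughly the longest current deadlines
deadline_proposed_days = 28   # assumed: the halved deadline discussed in this thread

rows_cleared_earlier = expiring_per_day * (deadline_now_days - deadline_proposed_days)
print(f"~{rows_cleared_earlier:,} fewer never-to-be-returned results in the field at any one time")
# ~1,120,000 - if, and only if, the expiry rate stays at about 40,000/day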
ID: 2034741
Darrell Wilcox Project Donor
Volunteer tester

Joined: 11 Nov 99
Posts: 303
Credit: 180,954,940
RAC: 118
Vietnam
Message 2034747 - Posted: 1 Mar 2020, 2:06:09 UTC - in response to Message 2034629.  

@ Ian&Steve C.
2. reduce the deadlines so that the impact of results being abandoned is lessened (once it hits the deadline,
goes to someone else). can EASILY cut the current deadlines in half without ostracizing 99.99% of hosts
This is probably hyperbole on your part. The 50% mark was 4 days in Eric's sample of 2/2/2019. Even a 1-day reduction could possibly impact up to 1%. Reducing the server-side limit is better. And let the "spoofers" do their thing. For many (most?) crunchers, it is the science, not the RAC, that is important.
ID: 2034747
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 2034748 - Posted: 1 Mar 2020, 2:10:57 UTC - in response to Message 2034741.  

@ Grant (SSSF)
But as to how much they are delaying it / how much shorter deadlines would reduce the size of the database, that is difficult to say.
Based on Eric Korpela's sample taken 2/2/2019, tasks passed their deadlines at about 40,000/day. Tasks older than 30 days
represented about 20% of all tasks. Not having better data, we can tentatively "assume" this is somewhat still true.
So it might be enough to bring the database down to a size sufficient to fit in RAM & resume normal I/O performance. Of course it will take a good month (or more) for the impact to become noticeable - for the new deadlines to start having an effect & the existing longer-deadline WUs to finally clear from the system. That, and if the function Richard suggested for reduced deadlines on resends actually works and is implemented with a 7-day deadline, it will help things along considerably.
At least until the next shorty storm anyway, in which case it should then reduce the length of time the database remains bloated.
Grant
Darwin NT
ID: 2034748
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 2034751 - Posted: 1 Mar 2020, 2:21:09 UTC - in response to Message 2034747.  

@ Ian&Steve C.
2. reduce the deadlines so that the impact of results being abandoned is lessened (once it hits the deadline,
goes to someone else). can EASILY cut the current deadlines in half without ostracizing 99.99% of hosts
This is probably hyperbole on your part.
I have posted about this too many times in the past to redo it all over again - but the fact is that reducing the deadlines to 28 days will have little to no effect on slow or infrequently-on hosts contributing. If they can process 1 WU a month, they can still contribute with a 28-day deadline.


Reducing the server-side limit is better. And let the "spoofers" do their thing.
And punish those that don't spoof.
As i posted earlier in this thread, the amount of work In progress has contributed bugger all to the present problems (it's the amount of work in progress with increased Quorum requirements that has). But if people want to go that way, reducing the server-side limits is not better. Removing the Server-side limits (a fixed number of WUs per resource per host) and employing a machine cache limit (number of days) would be better,
Grant
Darwin NT
ID: 2034751