About Deadlines or Database reduction proposals
W-K 666 Send message Joined: 18 May 99 Posts: 19317 Credit: 40,757,560 RAC: 67 |
That was the "in-progress cap" by number of tasks - that is indeed for the the whole CPU.Not what I was told here https://setiathome.berkeley.edu/forum_thread.php?id=84983&postid=2032189#2032189I think that many users will set a cache size to give them as many as possible for the CPU. Which if the average CPU task takes 1 hour will require now, a cache size of 6 days. Which because the computer has 6 cores actually only lasts 1 day.FALSE - check the work requests in seconds (<sched_op_debug>). If you set 1 day cache, and you have six cores, BOINC will ask for 6 days of work - one day (cache size) of work for each core. Why do I keep getting told s/ware guy's and gal's have to be logical. Maybe time to get the thumb screws out and interrogate the youngest, who has turned up with a few days off because his trip to India is cancelled. |
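A rough sketch of the work-fetch arithmetic being discussed (illustrative Python only; the function and variable names are hypothetical, not actual BOINC client code):

SECONDS_PER_DAY = 86400

def cpu_request_seconds(cache_days, running_cpu_tasks, queued_seconds_per_task=0.0):
    # The client tries to keep 'cache_days' of work in front of every
    # simultaneously running CPU task, so the request scales with the task count.
    shortfall = max(cache_days * SECONDS_PER_DAY - queued_seconds_per_task, 0.0)
    return shortfall * running_cpu_tasks

# A 1-day cache on a host running 6 CPU tasks with empty queues asks for
# 6 days' worth of seconds in total:
print(cpu_request_seconds(1.0, 6) / SECONDS_PER_DAY)  # -> 6.0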
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14674 Credit: 200,643,578 RAC: 874 |
> Why do I keep getting told s/ware guys and gals have to be logical?

They do - it's just that nobody's told Berkeley yet. Perhaps you could send him there instead?
rob smith Send message Joined: 7 Mar 03 Posts: 22455 Credit: 416,307,556 RAC: 380 |
I think you are confusing task requests with the maximum allowed number of tasks. A host will request as many seconds as it needs to fill the cache, but it will only get tasks up to 150 (CPU) and 150 (per GPU).

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
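A hedged sketch of how a fixed per-device cap clamps such a request (hypothetical names and logic, not the actual SETI scheduler code):

MAX_TASKS_PER_CPU = 150  # per host, regardless of core count
MAX_TASKS_PER_GPU = 150  # per reported GPU

def tasks_to_send(requested_seconds, est_seconds_per_task, tasks_in_progress, cap):
    # The host asks for seconds of work, but the reply is limited by the cap.
    wanted = int(requested_seconds // est_seconds_per_task)
    allowed = max(cap - tasks_in_progress, 0)
    return min(wanted, allowed)

# A host asking for 6 days of hour-long CPU tasks while already holding 120
# only receives 30 more, whatever its cache setting says:
print(tasks_to_send(6 * 86400, 3600, tasks_in_progress=120, cap=MAX_TASKS_PER_CPU))  # -> 30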
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
> 4. (future) work towards developing a more variable limit for tasks in progress based on host performance

SETI has been able to estimate task limits based on performance for quite some time. That's how they can calculate the numbers in the Manager's Remaining column, and know how many tasks to send when Richard sets his cache to 0.75 days. They can do it NOW. My suggestion was they remove the Numerical Cap and set the Cache limit to One Day instead of Ten Days. Yes, this would mean the CUDA users will get Many More tasks, as they should, and the Hosts completing 20 tasks a day will get 20 tasks for their Cache, as they should. There was a Time when there wasn't a numerical Cap, and it wouldn't be difficult to return to that time.
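A minimal sketch of the kind of performance-based limit being suggested here, assuming the server already has a reliable tasks-per-day figure for the host (hypothetical code, not anything the project actually runs):

def performance_based_cap(tasks_completed_per_day, cache_days=1.0, floor=10):
    # Allow roughly one day's worth of the host's own measured throughput,
    # with a small floor so slow or new hosts still get something.
    return max(int(tasks_completed_per_day * cache_days), floor)

print(performance_based_cap(20))    # a host finishing 20 tasks/day could hold ~20
print(performance_based_cap(3000))  # a fast CUDA host could hold ~3000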
Ian&Steve C. Send message Joined: 28 Sep 99 Posts: 4267 Credit: 1,282,604,591 RAC: 6,640 |
Yes, they could figure out the number of tasks a host needs. But that doesn't necessarily mean that the software infrastructure currently exists to implement that kind of method right now. I don't think it's a matter of simply going into the server code and changing "150 tasks" to "1 day" and that's it. Obviously some tweaking and additional logic would need to be added. Not that it's very hard, just saying that it will take some level of work, and discussion of their preferred limits.

SETI only has a 150 per device cap. SETI has nothing to do with any 10-day cap. That is a cap in the BOINC client software for what the client can ASK for; SETI then imposes its own 150 limit in its response to that request.

Seti@Home classic workunits: 29,492
CPU time: 134,419 hours
rob smith Send message Joined: 7 Mar 03 Posts: 22455 Credit: 416,307,556 RAC: 380 |
> SETI only has a 150 per device cap. SETI has nothing to do with any 10-day cap. That is a cap in the BOINC client software for what the client can ASK for; SETI then imposes its own 150 limit in its response to that request.

Yes, each project can set its own limits, or can use the cache size, or can ignore them both, just as the project team feels fits their project model, or servers.

Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
> nobody further inflates their number of GPUs to keep their caches at their current inflated levels.

May I ask: why are you so obsessed with the maybe 35 hosts that run the spoofed client? These 35 (or whatever the number is) are all at the top of the project's daily production list. All their users are top crunchers who know what they are doing and do all they can to protect the DB. Almost all of them have an APR of around a day or two. They crunch their WUs very fast, so the deadline has no meaning for them. They actually help the project clear WUs as fast as possible, not the inverse, so they actually help squeeze the DB size down. IIRC, before the project admins raised the WU limits these clients were in use and we had no problem with the WU supply. Please enlighten me, I can't understand.
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
> Therefore if you limit the cache to one day, for the CPU it will be 1 day/number of cores.

Wrong! The client asks for CPU work for the configured time multiplied by the number of simultaneously running tasks (which may be the number of cores if you use every core, or even twice the number of cores if you run CPU tasks in each thread of an SMT processor). If you configure a six day cache for a six core CPU, you get six days of work for each core, provided your CPU is slow enough not to hit the 150 task cap with that amount.

I'm running 8 CPU apps with a one day configured cache, but my CPU crunches through the 150 task cap in approximately 12 hours, which means each core is half a day short of the configured cache due to the server side cap. So if the client wanted to ask for that missing half a day of work for each core, it wouldn't ask for just half a day but 8 times that, i.e. 4 days. Let's see what my latest scheduler request actually asked for...

<cpu_req_secs>347441.568090</cpu_req_secs>

That's about 4.02 days!
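Checking the figure quoted above (simple arithmetic, using only the numbers given in the post):

SECONDS_PER_DAY = 86400
cpu_req_secs = 347441.568090
print(cpu_req_secs / SECONDS_PER_DAY)  # ~4.02 total days requested

# 8 simultaneous CPU tasks, each half a day short of the configured cache,
# gives the same order of magnitude:
print(8 * 0.5 * SECONDS_PER_DAY)       # 345600 seconds, i.e. 4.0 core-days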
W-K 666 Send message Joined: 18 May 99 Posts: 19317 Credit: 40,757,560 RAC: 67 |
> Therefore if you limit the cache to one day, for the CPU it will be 1 day/number of cores.
> Wrong! The client asks for CPU work for the configured time multiplied by the number of simultaneously running tasks (which may be the number of cores if you use every core, or even twice the number of cores if you run CPU tasks in each thread of an SMT processor).

Not to be too rude about it, but if you had read beyond my post you would have found that I have now got the full story. BOINC lets you ask for a timed cache per device, BUT SETI enforces its own limits, which are a number of tasks per device - and that device is the CPU, not cores or threads.
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13835 Credit: 208,696,464 RAC: 304 |
> Still doesn't answer why there are 4 million workunits in the waiting for Validation row when there used to be a very small number, 108 in the example given, on 15th Nov 2019.

How many times does this have to be said before people take notice? Seriously?
All the issues we are seeing (Assimilation, Deletion & Purge backlogs as they occur) can be explained by what Eric has already told us - the database can no longer be cached in the RAM of the database server, which is a result of the blowout in Results returned and awaiting validation due to the need to protect the Science database from corrupt data.
If it can't be cached in RAM, I/O performance falls off a cliff & any process that makes use of the database will be affected.

Grant
Darwin NT
W-K 666 Send message Joined: 18 May 99 Posts: 19317 Credit: 40,757,560 RAC: 67 |
> Still doesn't answer why there are 4 million workunits in the waiting for Validation row when there used to be a very small number, 108 in the example given, on 15th Nov 2019.
> How many times does this have to be said before people take notice? Seriously?

That is still a weak answer, in part because unless the project stops sending out tasks, nothing can be done about the tasks out in the field, the returns and their validation. BUT the assimilation and purging could be done during the Tuesday outage and relieve some of the pressure. The reason for the Tuesday outage is to sort out the databases, after all.
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
I don't know if that is still the case anymore. It used to be for database compaction and clean up, and then backup. I think they are just backing up now and skipping the database massage. That used to take more than a day back when the db was much smaller. Now that the db is much larger and the disk I/O hasn't really changed, I wonder if the long outages are just covering the backup of the db.

Seti@Home classic workunits: 20,676
CPU time: 74,226 hours
A proud member of the OFA (Old Farts Association)
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13835 Credit: 208,696,464 RAC: 304 |
> The Database spiked because all those people that had less than a Ten Day cache, because of the Cap, suddenly were able to download more. From my experience people have the cache set at 10 days and never change it, even when experimenting with things known to trash caches. Seems the furthest thing from their minds is to lower the cache. When running the CUDA Special App most people have to jump through hoops to even get a Half-Day's worth of cache; most have much less than a one day cache.

The spike in the database was because of 4 things:
1. an increase in the serverside limits.
2. the loading of files, many of which turned out to be 95% noise bombs.
3. the increase in the number of times a WU had to be processed in order to reach Quorum due to the RX 5000 driver issues.
4. hosts that get work and don't return it, combined with long deadlines.

4. Hosts not returning work, combined with the long deadlines, will also have exacerbated the problem, but like the server side limit increase it wasn't a cause, just a contributing factor. However the long deadlines are contributing to the delay in reducing the backlog. But as to how much they are delaying it / how much shorter deadlines would reduce the size of the database, is difficult to say.

1. Increase in serverside limits. Yes, this did contribute to the problem, but it wasn't the cause - its effect on the database was not significant. Work in progress went from around 5 million to around 7 million (around a 40% increase). That increase of 2 million is a drop in the bucket when compared to

2. The loading of files that produced little more than noise bombs, and
3. The increase in the minimum Quorum for short running WUs.

This is what blew the database size out. Instead of 4 million Results returned and awaiting validation we are now up to 14 million - a 3.5-fold increase. And it is that increase that has caused all the other queues and processes to become impacted. I've said it before & I'll say it again because people don't seem to be getting it - any process that accesses the database will be negatively impacted by the blowout in the database size.

It was 2 & 3 - the noise bombs, along with the need to increase the Quorum for short running WUs, both occurring at the same time - that caused the size of the database to bloat beyond the server's ability to deal with it. The increased serverside limits would have lowered the point at which the database could no longer be cached in the database server's RAM, but not by much compared to the other factors.

Grant
Darwin NT
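A quick back-of-envelope check of the figures in the post above (values in millions of results, as stated; nothing here is new data):

in_progress_before, in_progress_after = 5, 7
pending_before, pending_after = 4, 14

print((in_progress_after - in_progress_before) / in_progress_before)  # 0.4 -> ~40% rise
print(pending_after / pending_before)                                 # 3.5 -> 3.5-fold rise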
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13835 Credit: 208,696,464 RAC: 304 |
> That is still a weak answer, in part because unless the project stops sending out tasks, nothing can be done about the tasks out in the field, the returns and their validation.

It's not a weak answer, it is just a statement of fact.

> BUT the assimilation and purging could be done during the Tuesday outage and relieve some of the pressure. The reason for the Tuesday outage is to sort out the databases, after all.

But the Results returned and awaiting validation number remains unchanged, and after the outage everyone reports what work they were able to do, blowing that number out even further for a while, and then as the work is Validated it is ready for Assimilation. But Assimilation requires database access, and the database is larger than the server can handle, so I/O is severely limited and the Assimilation backlog will grow again... Back to where we were.

And the fact is that until such time as there is no need to have multiple resends on short running WUs to protect the Science database from corrupt data, this issue will continue. Or until such time as we have a new database server with more RAM, the issue will continue to exist.

Richard's suggestion of implementing the reduced deadline for resends (if it works) would probably help.

Grant
Darwin NT
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13835 Credit: 208,696,464 RAC: 304 |
> Given the sheer number of tasks out in the field and so sitting in the "day-file" it is going to take some time to get things back under control :-(

I will repost what I posted previously in this thread concerning spoofing. Summary - spoofing has no effect on the size of the database, just the distribution of WUs amongst the different statuses.

"... That was my feeling too, till TBar posted a comparison between 2 systems with almost identical hardware & applications. One spoofed, the other not.
Keep in mind - the faster you return work, the higher your Pendings. The longer it takes you to return work, the greater the chance your wingman has already returned theirs & so it will go (pretty much) straight to Validated. Whether you spoof or not, the end result is the load on the database is pretty much the same.
The Task list All numbers - which is the load on the database (In Progress + Pendings + Inconclusives + Valids etc) - for both systems were within a few hundred of each other. The spoofed system had a much higher In Progress number, with a much lower Validation Pending number. The un-spoofed system had a much lower In Progress number, with much higher Validation Pending numbers. But overall, spoofed v unspoofed, for a given system & application the All numbers were pretty much the same; it's just a difference in status (In Progress v Validation Pending).
Basically the better a system performs, the greater the load on the database. But so would the same amount of work being done (WUs per hour), if it were being done by many, many more slower systems - with the added load of keeping track of all those extra systems, and all those extra Scheduler requests, of course.
Spoofing has no effect at all on the database size."

Grant
Darwin NT
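A toy illustration of the point being made: spoofing shifts tasks between statuses, but the total number of rows the database has to carry for a given host stays roughly the same (the figures below are invented purely for illustration):

spoofed   = {"in_progress": 3000, "validation_pending": 1500, "inconclusive": 200, "valid": 800}
unspoofed = {"in_progress":  150, "validation_pending": 4350, "inconclusive": 200, "valid": 800}

# The "All" count - the database load - is identical in both cases:
print(sum(spoofed.values()), sum(unspoofed.values()))  # 5500 5500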
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
> nobody further inflates their number of GPUs to keep their caches at their current inflated levels.

Still waiting for an answer, if there are any.
Darrell Wilcox Send message Joined: 11 Nov 99 Posts: 303 Credit: 180,954,940 RAC: 118 |
@ Grant (SSSF)
> But as to how much they are delaying it / how much shorter deadlines would reduce the size of the database, is difficult to say.

Based on Eric Korpela's sample taken 2/2/2019, tasks passed their deadlines at about 40,000/day. Tasks older than 30 days represented about 20% of all tasks. Not having better data, we can tentatively "assume" this is somewhat still true.
Darrell Wilcox Send message Joined: 11 Nov 99 Posts: 303 Credit: 180,954,940 RAC: 118 |
@ Ian&Steve C. 2. reduce the deadlines so that the impact of results being abandoned is lessened (once it hits the deadline,This is probably hyperbole on your part. The 50% mark was 4 days in Eric's sample 2/2/2019. Even a 1 day reduction could possibly impact up to 1%. Reducing the serverside limit is better. And let the "spoofers" do their thing. For many (most?) crunchers, it is the science, not the RAC, that is important. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13835 Credit: 208,696,464 RAC: 304 |
> @ Grant (SSSF)

So it might be enough to bring the database down in size sufficiently to fit in RAM & resume normal I/O performance. Of course it will take a good month (or more) for the impact to become noticeable - for the new deadlines to start having an effect & the existing longer deadline WUs to finally start to clear from the system.
That, and if the function Richard suggested for reduced deadlines on resends actually works and is implemented with a 7 day deadline, it will help things along considerably. At least until the next shorty storm anyway, in which case it should then reduce the length of time the database remains bloated.

Grant
Darwin NT
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13835 Credit: 208,696,464 RAC: 304 |
@ Ian&Steve C.I have posted about this too many times in the past to redo it all over again- but the fact is that reducing the deadlines to 28 days will have little to no effect on slow/infrequently on hosts contributing. If they can process 1 WU a month, they can still contribute with a 28 day deadline. Reducing the serverside limit is better. And let the "spoofers"And punish those that don't spoof. As i posted earlier in this thread, the amount of work In progress has contributed bugger all to the present problems (it's the amount of work in progress with increased Quorum requirements that has). But if people want to go that way, reducing the server-side limits is not better. Removing the Server-side limits (a fixed number of WUs per resource per host) and employing a machine cache limit (number of days) would be better, Grant Darwin NT |