Task Deadline Discussion

Message boards : Number crunching : Task Deadline Discussion

Profile Brent Norman Special Project $250 donor
Volunteer tester

Joined: 1 Dec 99
Posts: 2084
Credit: 174,768,040
RAC: 492,341
Canada
Message 1905334 - Posted: 7 Dec 2017, 15:50:36 UTC

For broken hosts, they could be sent a "ValidationTestFile.wu" to complete and validate (against a known expected result) in order to receive more tasks. The user might get the hint with the strange file name being sent to them ...
ID: 1905334
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 5927
Credit: 80,621,497
RAC: 30,987
Russia
Message 1905337 - Posted: 7 Dec 2017, 16:27:58 UTC - in response to Message 1905332.  


I don't see how shorter deadlines would increase the rate of re-sends from broken hosts. Maybe from very slow hosts, but from broken hosts they would need to be re-sent at some point anyway. There just should be a better mechanism for not sending many more tasks to these broken hosts.

Tom


Roughly speaking: N resends per 3 or more weeks (whatever the current deadline is) versus N per 1 week (or whatever the suggested deadline is).
That's for never-returning hosts.
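The ratio can be sketched numerically. A minimal illustration, assuming a hypothetical per-host task count of 100 (not a project figure):

```python
# Toy model of resend load from never-returning hosts: such a host
# "consumes" its tasks once per deadline period, so the resend volume
# it generates scales inversely with the deadline length.
# tasks_per_cycle = 100 is a made-up per-host count for illustration.

def resends_per_year(tasks_per_cycle: int, deadline_days: float) -> float:
    """Tasks a never-returning host forces back into resend per year."""
    return tasks_per_cycle * (365.0 / deadline_days)

three_week = resends_per_year(100, 21)  # ~1738 resends/year
one_week = resends_per_year(100, 7)     # ~5214 resends/year
# one_week / three_week == 3.0: cutting 3 weeks to 1 triples the churn
```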
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1905337
Profile tullio Project Donor
Volunteer moderator
Volunteer tester

Joined: 9 Apr 04
Posts: 6592
Credit: 1,921,689
RAC: 1,642
Italy
Message 1905340 - Posted: 7 Dec 2017, 16:41:58 UTC

I have 91 tasks, 35 valid and 36 pending, plus 16 in progress. But those 16 no longer exist, because they died in a system crash. They will have to wait until their deadline expires to be resent. I have always kept a small cache precisely so as not to create too many zombies in a system crash. PCs do crash at times.
Tullio
ID: 1905340
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 4211
Credit: 244,980,138
RAC: 587,606
United States
Message 1905342 - Posted: 7 Dec 2017, 16:56:18 UTC - in response to Message 1905340.  

You could also be kind and use the ghost recovery protocol to send them on to someone else sooner or crunch them yourself if they are still active.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1905342
Profile tullio Project Donor
Volunteer moderator
Volunteer tester

Joined: 9 Apr 04
Posts: 6592
Credit: 1,921,689
RAC: 1,642
Italy
Message 1905345 - Posted: 7 Dec 2017, 17:15:45 UTC

I am a physicist not a ghostbuster.
Tullio
ID: 1905345
Profile Jeff Buck Special Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,690
RAC: 156
United States
Message 1905359 - Posted: 7 Dec 2017, 18:20:56 UTC - in response to Message 1905308.  

This debate started when the main db server was having issues with space, so perhaps we should look at the number of pending tasks vs. time to deadline (or time in progress). That would quantify the scale of the database issue. I would expect it to be some sort of Gaussian curve, but does it have a "long thin" tail or a "long fat" one?
The issues with Master/Replica database space and/or other resources seem like they've been long-standing ones. Those database problems seem to have been quiet for awhile, with the science database being the one that seems to be causing the current headaches. However, taking steps now that might head off future recurrence of those problems is a prudent thing to do, rather than waiting for something to break before jumping into fire-fighting mode.

I would think it would be possible to get some sense of the pending tasks vs. time to deadline from the numbers in my initial post by looking at how quickly the majority of hosts report tasks versus the small number that approach or exceed the deadlines. If I understand the terminology, I would say it would look like a "long thin" tail (but perhaps I'm misunderstanding that).

Raistmer's point about shorter deadlines simply resulting in some hosts receiving replacement tasks more quickly is certainly valid, but it looks to me like such hosts represent a small percentage of those who ultimately successfully report at the upper end of the deadline range. For those who are already timing out on large numbers of tasks despite the current extended deadlines, it would make no difference to have shorter deadlines. And, as BetelgeuseFive pointed out, fixing the quota mechanism to better throttle such hosts, is really where some serious work needs to be done, also.

In the short term, I'm in favor of a more K.I.S.S. approach to the deadline issue. While I'm definitely in favor of shortening the deadlines, for reasons I've mentioned earlier, I think complex solutions would probably require more programming effort at the server level than could possibly be devoted with the current staff situation. However, simply applying a flat percentage reduction seems like it should be little more than a minor change to the existing deadline calculation formula, which is based on a WU's Angle Range. For now, at least, I would say that a 20% across-the-board reduction in deadlines could have a marked benefit to the project, with minimal impact to that tiny percentage of hosts who currently exceed the 80% of deadline threshold, at least based on the sample I analyzed. (BTW, I did run the same analysis on my tasks for September, 2017, and got similar results, with only 33 tasks out of 98,016, from 23 hosts, reported past the 80% threshold.) Perhaps even a 25% or 30% reduction could be justified, but I didn't break out the percentages at those thresholds and so can't provide exact numbers for those impacted.
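Mechanically, the flat-percentage idea really is small. A sketch, using a stand-in for the Angle Range formula (the real server-side calculation isn't reproduced here):

```python
# Sketch of an across-the-board deadline cut. The server derives the
# real deadline from the WU's Angle Range (AR); base_deadline_days is
# just a placeholder for whatever that existing formula yields.

def reduced_deadline(base_deadline_days: float, cut: float = 0.20) -> float:
    """Apply a flat percentage reduction to an already-computed deadline."""
    return base_deadline_days * (1.0 - cut)

# the thread's 48.72-day average would become ~38.98 days at a 20% cut
shortened = reduced_deadline(48.72)
```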

Long term, of course, any of the other possible solutions mentioned here might be viable, but I suspect that, even if the programming time was available, adding too much complexity might be counterproductive. Keep the ideas coming, though.
ID: 1905359
rob smith Special Project $250 donor
Volunteer tester

Joined: 7 Mar 03
Posts: 15892
Credit: 296,800,560
RAC: 328,465
United Kingdom
Message 1905399 - Posted: 7 Dec 2017, 20:42:18 UTC

Right let's have a look.
I have ~5400 pending
#1000 - sent today
#2000 - sent yesterday
#3000 - sent 2017 12 03
#4000 - sent 2017 11 20 (17 days ago)
#4500 - sent 2017 11 08 (29 days)
#5000 - sent 2017 10 25 (43 days)
last - sent 2017 10 14 (54 days)

The scaling is quite crude, I've just looked at the xxxxth task's sent date, but the "problem" is nothing like the size some would have us believe - my oldest pending isn't even 60 days old, and less than 16% are still outstanding after 30 days.
So is it really worthwhile losing 16% of the processing capability of the project just to get tasks turned around in a shorter time?
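The tally above can be checked directly from the quoted sent dates (the sample ranks and dates are exactly the ones in the post; the post date is taken as 7 Dec 2017):

```python
from datetime import date

# Ages of the sampled pending tasks, reconstructed from the post.
posted = date(2017, 12, 7)
sent = {4000: date(2017, 11, 20),
        4500: date(2017, 11, 8),
        5000: date(2017, 10, 25),
        5400: date(2017, 10, 14)}  # "last" of ~5400 pending
ages = {rank: (posted - d).days for rank, d in sent.items()}
# {4000: 17, 4500: 29, 5000: 43, 5400: 54} - matching the post

# roughly, tasks ranked above ~4500 are past the 30-day mark:
fraction_over_30d = (5400 - 4500) / 5400  # ~0.167, about 1 task in 6
```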
As Raistmer has said, if you reduce the deadline you automatically increase the number of resends, and that adds more to the database load than having a task sitting there waiting for its partner to come back and validation to take place. Every resend results in a record for the "failed" host and a record for the "new" host being created and stored, and such records are not dumped (apart from in the ready-to-send queue there is only ever one copy of the workunit). The failed host's record keeps any data that host eventually returns, but that data doesn't necessarily pass through the validation process; it's stored "just in case".
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1905399
juan BFP Special Project $75 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 6822
Credit: 376,776,366
RAC: 193,475
Panama
Message 1905406 - Posted: 7 Dec 2017, 20:59:28 UTC
Last modified: 7 Dec 2017, 21:03:22 UTC

Why not look from the other side?

Instead of making the deadline shorter, why not just increase the limit for the fastest hosts? The ones with a very fast return time. They are the ones who actually run empty during the outages.

Nothing complex, something simple to code. Like: if a host returns its work within 1 day, increase its limit from 100 to 200 WUs per GPU.

That could satisfy most of the hungry hosts, and it leaves some room for the future, when even faster GPUs become available.

It wouldn't mess with the rest of the SETI community and would not impact the size of the DB.

My 0.02 Cents.
ID: 1905406
Profile Jeff Buck Special Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,690
RAC: 156
United States
Message 1905409 - Posted: 7 Dec 2017, 21:29:43 UTC - in response to Message 1905399.  
Last modified: 7 Dec 2017, 21:44:59 UTC

...less than 16% are still outstanding after 30 days.
So is it really worthwhile losing 16% of the processing capability of the project just to get tasks turned around in a shorter time?
You are assuming that the entire 16% would still validate before their deadlines. In fact, based on the numbers in my analysis, the vast majority of those are still going to time out and have to be resent anyway. In particular, you should take a look at those above #5000 in your list and see how close the wingmen are to their deadlines and then, ideally, track them (or at least a sampling of them) until they either validate or time out. In my sample, if I were to look just at tasks still pending after 39 days (80% of the 48.72 days average until deadline), there were only 23 that still validated while 878 timed out. (Due to inconsistencies in the way a "timed out" status seems to be recorded, those "Not started" errors can be misleading, so some or many of those 95 tasks could also be added to the 878.) Anyway, that amounts to only about 2.6% of the tasks outstanding beyond the 80% mark. Also, of those 23, I'm pretty sure that there are at least a few hosts that drag their heels just because of shared processing with other projects. The S@h tasks don't get run until they absolutely, positively have to be. With shorter deadlines, I think that's still likely to be the case, they'd just get run sooner.

As Raistmer has said, if you reduce the deadline you automatically increase the number of resends...
It seems to me that there's only a very tiny grain of truth to that. Again, using my data, out of 901 tasks still hanging around past the 80% mark, only 23 were eventually validated. The other 878 had to be resent no matter what. Those do not increase the number of resends. They just get resent more quickly, hopefully to reliable hosts who process them quickly (as most do), thereby clearing the task and WU data from the database that much sooner, and shrinking the resource requirements of the database. At most, only those 23 might get resent sooner and, to my earlier point, probably some of those would still beat a shorter deadline. BTW, if you look back at the list, there were 3 of those 23 that exceeded 100% of the deadline, meaning new tasks had already been sent out anyway, before the original hosts reported.

EDIT: And, of course, that 2.6% figure doesn't represent 2.6% "of the processing capability of the project", as you attributed to your 16% figure, but only 2.6% of those tasks that passed the 80% threshold or, by my calculations, about 0.03% of the total WUs my hosts processed in October.
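The percentages in this post follow directly from the sample counts; a quick check (the counts are the ones quoted above):

```python
# Of the 901 tasks in the sample still pending past the 80% mark, only
# 23 ever validated; the rest timed out and needed a resend regardless
# of any deadline change.
pending_past_80 = 901
validated_late = 23
timed_out = pending_past_80 - validated_late            # 878
late_validate_share = validated_late / pending_past_80  # ~2.6%
threshold_days = 0.80 * 48.72                           # ~39 days to the 80% mark
```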
ID: 1905409
BetelgeuseFive Project Donor
Volunteer tester

Joined: 6 Jul 99
Posts: 129
Credit: 13,253,608
RAC: 7,404
Netherlands
Message 1905551 - Posted: 8 Dec 2017, 11:06:07 UTC - in response to Message 1905406.  

Why not look from the other side?

Instead of making the deadline shorter, why not just increase the limit for the fastest hosts? The ones with a very fast return time. They are the ones who actually run empty during the outages.

Nothing complex, something simple to code. Like: if a host returns its work within 1 day, increase its limit from 100 to 200 WUs per GPU.

That could satisfy most of the hungry hosts, and it leaves some room for the future, when even faster GPUs become available.

It wouldn't mess with the rest of the SETI community and would not impact the size of the DB.

My 0.02 Cents.


That would help to feed the fast hosts, but it would not help to reduce the number of timeouts (and tasks in the database tables).
Maybe it could be combined with reducing the number of tasks sent to hosts with a (very) low RAC.
This would prevent sending tasks to hosts like this one: https://setiathome.berkeley.edu/show_host_detail.php?hostid=8363703
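A throttle like that could be very small server-side. A hypothetical sketch (the floor, ceiling, and RAC-per-task rate here are invented for illustration; they are not BOINC parameters):

```python
# Hypothetical RAC-based cap on tasks in progress: new or broken hosts
# get only a small trickle; the cap grows with demonstrated throughput.

def task_cap(rac: float, floor: int = 10, ceiling: int = 100) -> int:
    """Max tasks in progress for a host, scaled by its RAC."""
    if rac <= 0:
        return floor  # enough for a brand-new host to build up some RAC
    # one extra in-progress task per 50 RAC, clamped to [floor, ceiling]
    return max(floor, min(ceiling, floor + int(rac // 50)))
```

A zero-RAC "drive-by" host would then receive 10 tasks instead of a hundred or more.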

Tom
ID: 1905551
rob smith Special Project $250 donor
Volunteer tester

Joined: 7 Mar 03
Posts: 15892
Credit: 296,800,560
RAC: 328,465
United Kingdom
Message 1905556 - Posted: 8 Dec 2017, 11:18:50 UTC

A bad example to choose - that cruncher is a classic "drive-by". It has only contacted the server on the day of its creation, and has only tasks from that date, so there is virtually no way of stopping such events. What would be more interesting would be an example of a computer that contacts regularly, grabs loads of tasks, and returns very few valid results within the tasks' deadlines, because those are the real menaces.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1905556
BetelgeuseFive Project Donor
Volunteer tester

Joined: 6 Jul 99
Posts: 129
Credit: 13,253,608
RAC: 7,404
Netherlands
Message 1905565 - Posted: 8 Dec 2017, 12:29:03 UTC - in response to Message 1905556.  

A bad example to choose - that cruncher is a classic "drive-by". It has only contacted the server on the day of its creation, and has only tasks from that date, so there is virtually no way of stopping such events. What would be more interesting would be an example of a computer that contacts regularly, grabs loads of tasks, and returns very few valid results within the tasks' deadlines, because those are the real menaces.


The point is that only a very limited number of tasks should be sent to hosts with a (very) low RAC. It doesn't matter whether or not it is the first time the host requests work. Obviously some tasks need to be sent (so computers can build up RAC), but it makes no sense to send so many.

Tom
ID: 1905565
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 5927
Credit: 80,621,497
RAC: 30,987
Russia
Message 1905568 - Posted: 8 Dec 2017, 13:04:33 UTC - in response to Message 1905345.  

I am a physicist not a ghostbuster.
Tullio


One doesn't exclude the other ;) :D
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1905568
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 5927
Credit: 80,621,497
RAC: 30,987
Russia
Message 1905571 - Posted: 8 Dec 2017, 13:08:55 UTC - in response to Message 1905359.  

For now, at least, I would say that a 20% across-the-board reduction in deadlines could have a marked benefit to the project, with minimal impact to that tiny percentage of hosts who currently exceed the 80% of deadline threshold, at least based on the sample I analyzed.


Could you formulate "benefit" precisely? And if the benefit is shrinkage of the BOINC DB, why is that required?
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1905571
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 5927
Credit: 80,621,497
RAC: 30,987
Russia
Message 1905573 - Posted: 8 Dec 2017, 13:14:52 UTC - in response to Message 1905406.  

Why not look from the other side?

Instead of making the deadline shorter, why not just increase the limit for the fastest hosts? The ones with a very fast return time. They are the ones who actually run empty during the outages.

Nothing complex, something simple to code. Like: if a host returns its work within 1 day, increase its limit from 100 to 200 WUs per GPU.

That could satisfy most of the hungry hosts, and it leaves some room for the future, when even faster GPUs become available.

It wouldn't mess with the rest of the SETI community and would not impact the size of the DB.

My 0.02 Cents.

That's the right approach IMO, especially because nothing really prevents doing this manually on each host by introducing "virtual devices" to BOINC (either by rescheduling CPU<->GPU, by running multiple BOINC instances, or even by creating additional app_info.xml-based "accelerators" (not tested, but seems possible)).
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1905573
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 5927
Credit: 80,621,497
RAC: 30,987
Russia
Message 1905575 - Posted: 8 Dec 2017, 13:20:03 UTC - in response to Message 1905409.  


As Raistmer has said, if you reduce the deadline you automatically increase the number of resends...
It seems to me that there's only a very tiny grain of truth to that. Again, using my data, out of 901 tasks still hanging around past the 80% mark, only 23 were eventually validated. The other 878 had to be resent no matter what. Those do not increase the number of resends.

Actually they do. Both never-returning hosts, as I said earlier, _and_ those that regularly miss the deadline.
If you trash 1 task per 3 weeks, that's 3 times fewer than trashing 1 task each week: over a year you will get 3 times more resends from such hosts under the shorter deadline, plus an increase in the number of such hosts, simply because more processing power is required to finish in time.

"The other 878 had to be resent no matter what" - you consider this a one-time event, but actually it should be considered a recurring event!
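The "recurring event" framing can be made concrete with a toy day-by-day count (the deadline lengths are the thread's examples; the immediate-refill behaviour is an assumption of the model):

```python
# A host that misses every deadline triggers a resend once per deadline
# period, indefinitely - the event recurs for as long as the host stays
# subscribed and keeps fetching work.

def resends_over(horizon_days: int, deadline_days: int) -> int:
    resends, day = 0, deadline_days
    while day <= horizon_days:
        resends += 1           # task times out; a replacement copy is issued
        day += deadline_days   # the host immediately picks up a fresh task
    return resends

yearly_3wk = resends_over(365, 21)  # 17 resends/year per task slot
yearly_1wk = resends_over(365, 7)   # 52 resends/year per task slot
```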
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1905575
juan BFP Special Project $75 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 6822
Credit: 376,776,366
RAC: 193,475
Panama
Message 1905619 - Posted: 8 Dec 2017, 16:09:41 UTC - in response to Message 1905551.  
Last modified: 8 Dec 2017, 16:13:35 UTC

This would prevent sending tasks to hosts like this one: https://setiathome.berkeley.edu/show_host_detail.php?hostid=8363703

Why send more than 10 WUs to a new host? There must be a limit for that. The number of WUs available to a host should be related to the number of crunched WUs it returns. Something like this is already implemented in the code for when you reach the host's daily task limit: for each returned WU you could receive 2 more, and for each bad/crashed WU you lose some. I'm not sure exactly how it works.

If your host returns 1000 WUs/day then your cache could be up to 2000 WUs; if you return 10, then 20, etc.

Nothing complicated or insane.
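The earn-as-you-return idea fits in a few lines. A toy sketch (the starting grant, growth step, and penalty are invented numbers; the real BOINC daily-quota code differs in its details):

```python
# Toy earn-as-you-return quota: a host's allowance starts small, grows
# with each validated result, and shrinks faster on bad/crashed ones,
# so a host that returns nothing can never hoard thousands of tasks.

class HostQuota:
    def __init__(self, start: int = 10, ceiling: int = 2000):
        self.allowed = start      # WUs the host may hold in its cache
        self.ceiling = ceiling

    def on_valid(self) -> None:
        self.allowed = min(self.ceiling, self.allowed + 2)  # "receive 2 more"

    def on_error(self) -> None:
        self.allowed = max(1, self.allowed - 5)  # errors cost more than wins earn

q = HostQuota()
for _ in range(5):
    q.on_valid()
# after five clean returns the allowance has grown from 10 to 20
```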
ID: 1905619
Profile Jeff Buck Special Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,690
RAC: 156
United States
Message 1905629 - Posted: 8 Dec 2017, 17:02:45 UTC - in response to Message 1905556.  

A bad example to choose - that cruncher is in a classical "drive by". It has only contacted the server on the day of its creation, and has only tasks from that date, so there is virtually no way of stopping such events. What would be more interesting would be to see an example of a computer that contacts regularly, grabs loads of tasks, and returns very few valid results within the tasks deadlines because those are the "real menaces".
Perhaps a couple from my earlier posts would be better.

HostID: 8261239 has an average turnaround of 38.07 days, yet has 95 tasks on board at the moment (down from 107 when I posted 2 days ago). His timeouts have climbed to 32. Appears to be a host which only makes sporadic contact, yet still manages to download a quantity of tasks far in excess of its ability to process them in a timely fashion.

HostID: 6122802 has 6,148 tasks on board, with 361 recently timed out. It probably hasn't actually successfully processed a task in a long time, yet still is allowed to download more than a hundred new tasks every day.

Addressing hosts such as these requires looking at different issues than just task deadlines, but shortening task deadlines would likely at least reduce the number of essentially dead tasks they would be sitting on at any given time.
ID: 1905629
Profile Jeff Buck Special Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,690
RAC: 156
United States
Message 1905639 - Posted: 8 Dec 2017, 17:52:32 UTC - in response to Message 1905571.  

For now, at least, I would say that a 20% across-the-board reduction in deadlines could have a marked benefit to the project, with minimal impact to that tiny percentage of hosts who currently exceed the 80% of deadline threshold, at least based on the sample I analyzed.

Could you precisely formulate "benefit"? And if benefit is shrinkage in BOINC DB why it's required.
Yes, shrinkage in the size of the BOINC DBs (Master and Replica), and their associated overhead would be the primary tangible benefit, as I see it. Quantifying how much that reduction would be based on any specific reduction in deadlines isn't possible without knowing how much of those DBs is occupied by task and WU data. Obviously, those DBs contain account, host, forum, and other data, and I have no way of knowing how that all breaks down.

However, considering the fact that we have gone through stretches over the last several years where the breakdown of those DBs has caused crunchers to run out of work, reducing their overall productivity, any improvements that would reduce the DB load can't help but be a good thing. Just because those DBs have been quiet and reliable for awhile, doesn't mean the same sorts of problems won't surface again. And the ideal time to identify possible improvements is precisely when things are running smoothly, not once the havoc commences.

It's also worth noting, I think, that DB load is often cited (though not clearly documented) as the reason for such limitations as the restriction to 100 CPU tasks per cruncher and 100 tasks per GPU, or the inability of the servers to perform automatic "lost" task resends. Some modification to both of those policies (and perhaps others) might be possible with a reduced DB load.
ID: 1905639
Profile Jeff Buck Special Project $250 donor
Volunteer tester

Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,690
RAC: 156
United States
Message 1905645 - Posted: 8 Dec 2017, 18:19:16 UTC - in response to Message 1905573.  

That's the right approach IMO, especially because nothing really prevents doing this manually on each host by introducing "virtual devices" to BOINC (either by rescheduling CPU<->GPU, by running multiple BOINC instances, or even by creating additional app_info.xml-based "accelerators" (not tested, but seems possible)).
Or take Petri's approach by fooling BOINC with his "[16] NVIDIA GeForce GTX 1080 Tu " GPUs.

But these simply address the temporary Tuesday outage drought that high-volume crunchers face. In fact, all of these approaches actually inflate the size of the DB, either temporarily (as with my own, and others', stockpiling through rescheduling on Monday, which returns to normal by the end of the outage) or more permanently (as with Petri's virtual GPUs). It's really a very separate issue from the deadlines.
ID: 1905645


 
©2018 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.