Task Deadline Discussion

Message boards : Number crunching : Task Deadline Discussion

Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1905034 - Posted: 6 Dec 2017, 5:44:59 UTC

Several days ago, we had a bit of a discussion in the Panic Mode thread concerning the current task deadline structure. Rather than continuing to hijack the Panic Mode thread, I wanted to continue this discussion in a dedicated thread, if anyone is interested.

To recap, I had noticed that, following the shutdown of the Scheduling server, the RTS buffer was steadily climbing, apparently due to tasks timing out once their original deadlines had passed, thereby causing new tasks to be generated which would then have to be sent to other hosts. Ultimately, more than 29,000 such tasks populated the RTS buffer in the first 24 hours of the scheduler's outage. Each of those tasks represented a corresponding number of workunits comprised of at least two tasks, all of which consumed storage space in the Master and Replica databases, along with other indeterminate overhead.

Those single-day numbers seemed like a good starting point for a discussion about the current state of task deadlines and whether the deadlines as currently constituted are excessive in today's processing environment. Many have suggested from time to time that they are, but there's never been any hard data to support any particular point of view. As Richard Haselgrove wrote in Message 1904460:
What the project needs most of all is time to think, and data to base their decisions on...
It's that actual data that's difficult to inject into a discussion at the user level, since we don't have access to the kind of detail that's likely available to the admins from the Master database.

However, I have a fairly extensive archive of my own tasks and WUs, and I figured I'd take a crack at seeing what sort of meaningful numbers I could tease out of it, limited though they may be. What I've done is to run an analysis of all the Multibeam tasks in my archive for the entire month of October, 2017. For each task, I first calculated the allowed turnaround time (deadline date/time minus sent date/time). Then, for each wingman with a "Completed and validated" task for the corresponding WU, I computed the actual turnaround time, as an absolute value and as a percentage of allowed turnaround. The results shown below break down those numbers into several percentile ranges, primarily focused on both ends of the scale, those who are quickest to return tasks and those who are slowest.

Total Number of WUs:                    83606
Total Number of Wingmen's Valid Tasks:  86727
Avg. Allowed Turnaround:  48.85 days
Avg. Actual Turnaround:   0.98 days
Tasks reported w/i 0-5% of Allowed Turnaround:   78739
Tasks reported w/i 5-10% of Allowed Turnaround:  5161
Tasks reported w/i 10-25% of Allowed Turnaround: 2314
Tasks reported w/i 25-50% of Allowed Turnaround: 431
Tasks reported w/i 50-80% of Allowed Turnaround: 59
Tasks reported w/i 80-90% of Allowed Turnaround: 6
Tasks reported w/i 90-95% of Allowed Turnaround: 3
Tasks reported w/i 95-98% of Allowed Turnaround: 9
Tasks reported w/i 98-99% of Allowed Turnaround: 2
Tasks reported after 99% of Allowed Turnaround:  3
"Timed out - no response" Tasks: 878
"Not started by deadline" Tasks: 95

The averages for Allowed and Actual Turnaround should be reflective of the fact that some tasks (normal AR and VLARs) have deadlines approaching 8 weeks or so, while high AR tasks often have deadlines of around 3 weeks. I made no attempt to identify maximum or minimum deadlines.
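For anyone who wants to repeat this on their own task archive, the calculation described above (allowed turnaround = deadline minus sent; actual turnaround as a percentage of that) might be sketched roughly as follows. The function names, the sample dates, and the handling of values that land exactly on a range boundary are assumptions made for illustration, not the script actually used for the table.

```python
from datetime import datetime

# Upper edges (percent of allowed turnaround) of the ranges in the table above.
BUCKET_EDGES = [5, 10, 25, 50, 80, 90, 95, 98, 99]

def turnaround_pct(sent, deadline, reported):
    """Actual turnaround as a percentage of the allowed turnaround."""
    allowed = (deadline - sent).total_seconds()
    actual = (reported - sent).total_seconds()
    return 100.0 * actual / allowed

def bucket_label(pct):
    """Label of the range pct falls into (boundary handling is an assumption)."""
    lower = 0
    for upper in BUCKET_EDGES:
        if pct < upper:
            return f"{lower}-{upper}%"
        lower = upper
    return "after 99%"

# Hypothetical task: sent 1 Oct, 8-week deadline, reported one day later.
sent = datetime(2017, 10, 1)
deadline = datetime(2017, 11, 26)
reported = datetime(2017, 10, 2)
pct = turnaround_pct(sent, deadline, reported)
print(f"{pct:.2f}% of allowed turnaround, range {bucket_label(pct)}")
```

A task reported after its deadline simply yields a value above 100%, which is how the three "after 99%" entries can exceed the allowed time.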

To me, the key ranges are those below 10% and those above 80%. What those numbers show is that over 96% of tasks are returned within 10% of the allowable turnaround time, while less than 0.03% of tasks exceed 80% of the time allotted.
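The two headline percentages can be re-derived from the range counts in the table. The snippet below is just a quick arithmetic check; the counts are copied from the table, the dictionary keys are labels chosen for this sketch, and note that the counts are of tasks rather than of distinct hosts.

```python
# Range counts copied from the table above (valid wingman tasks only).
counts = {
    "0-5%": 78739, "5-10%": 5161, "10-25%": 2314, "25-50%": 431,
    "50-80%": 59, "80-90%": 6, "90-95%": 3, "95-98%": 9,
    "98-99%": 2, ">99%": 3,
}

total = sum(counts.values())
within_10 = counts["0-5%"] + counts["5-10%"]
over_80 = sum(counts[k] for k in ("80-90%", "90-95%", "95-98%", "98-99%", ">99%"))

pct_within_10 = 100.0 * within_10 / total
pct_over_80 = 100.0 * over_80 / total

print(f"total valid tasks:        {total}")
print(f"returned within 10%:      {pct_within_10:.2f}%")
print(f"returned after 80% mark:  {pct_over_80:.4f}%")
```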

Let the discussions begin. ;^)

Oh, and for anyone interested, here's the breakdown on those 23 tasks where the 80% threshold was exceeded:
HostID: 8024072 used 102.54% of allowable time
HostID: 7476862 used 86.08% of allowable time
HostID: 7019479 used 95.12% of allowable time
HostID: 7019479 used 95.12% of allowable time
HostID: 7019479 used 95.12% of allowable time
HostID: 7019479 used 95.12% of allowable time
HostID: 7019479 used 94.42% of allowable time
HostID: 7019479 used 94.42% of allowable time
HostID: 7019479 used 95.12% of allowable time
HostID: 7019479 used 95.12% of allowable time
HostID: 7019479 used 95.12% of allowable time
HostID: 7019479 used 95.12% of allowable time
HostID: 7474275 used 98.49% of allowable time
HostID: 8347683 used 82.26% of allowable time
HostID: 8028551 used 98.47% of allowable time
HostID: 8061474 used 90.27% of allowable time
HostID: 8360135 used 80.68% of allowable time
HostID: 8261239 used 80.42% of allowable time
HostID: 8301075 used 100.06% of allowable time
HostID: 5983829 used 104.23% of allowable time
HostID: 8324796 used 97.25% of allowable time
HostID: 8109586 used 88.48% of allowable time
HostID: 8119021 used 83.84% of allowable time

I took a quick look at that first one, 8024072. As expected, it was one of those where the host timed out, a new task was sent to another host (mine), and then several hours later the original host finally reported the task, followed about 5 hours later by my host's report. Fortunately, we all got credit for the WU. I think it might be informative to take a look at some or all of those other laggardly hosts to see if there's anything that stands out about them, whether it's that they seldom contact the server, carry a heavy load of other projects, or the like. I simply don't have any more time this evening to dig any deeper. Bedtime approacheth. ;^)
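As a starting point for looking at those laggardly hosts, the list above can be tallied per host. This is only a small sketch; the data is copied from the list, and the variable names are made up for the example.

```python
from collections import Counter

# (HostID, percent of allowable time used), copied from the list above.
late_tasks = [
    (8024072, 102.54), (7476862, 86.08),
    (7019479, 95.12), (7019479, 95.12), (7019479, 95.12), (7019479, 95.12),
    (7019479, 94.42), (7019479, 94.42),
    (7019479, 95.12), (7019479, 95.12), (7019479, 95.12), (7019479, 95.12),
    (7474275, 98.49), (8347683, 82.26), (8028551, 98.47), (8061474, 90.27),
    (8360135, 80.68), (8261239, 80.42), (8301075, 100.06), (5983829, 104.23),
    (8324796, 97.25), (8109586, 88.48), (8119021, 83.84),
]

# How many of the late tasks belong to each host.
per_host = Counter(host for host, _ in late_tasks)

# Hosts that reported only after the deadline had already passed.
missed_deadline = [host for host, pct in late_tasks if pct > 100.0]

print(f"{len(late_tasks)} tasks from {len(per_host)} distinct hosts")
print("most frequent:", per_host.most_common(1))
print("reported after deadline:", missed_deadline)
```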
ID: 1905034
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1905043 - Posted: 6 Dec 2017, 6:18:35 UTC

This is a good first start at accumulating real data. Thanks for the analysis. Anyone else going to jump in with a similar analysis of their own tasks?
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1905043
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22158
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1905050 - Posted: 6 Dec 2017, 7:42:06 UTC

Looking at Jeff's times it would appear that there are far more folks using a large proportion of the available time than I expected. My "gut feeling" was that the majority of tasks would be returned inside 50% of the deadline, but seeing many in the >80% bracket suggests that the deadlines are very close to being correct.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1905050
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1905058 - Posted: 6 Dec 2017, 8:46:45 UTC - in response to Message 1905050.  

Looking at Jeff's times it would appear that there are far more folks using a large proportion of the available time than I expected. My "gut feeling" was that the majority of tasks would be returned inside 50% of the deadline, but seeing many in the >80% bracket suggests that the deadlines are very close to being correct.

The way I read it is that the vast majority are returned within only 5% of the allowed deadline.

Tasks reported w/i 0-5% of Allowed Turnaround: 78739

7,988 were reported later than that; that's only 9.2% of the total number reported.
So 90.8% were returned within 5% of the allowed turnaround time.
So for a 3 week deadline, they were being returned in just over 1 day (1.05), and for an 8 week deadline it's within 3 days (2.8).

So giving a deadline of 3 weeks for applications that return work within 3 weeks, and 3 weeks on top of the application's average turnaround time for those that take longer, would put a big dent in all those WUs that are out there for months at a time, but still give a safety net for system & Seti server issues.
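One possible reading of that proposal, written out as a rule, is sketched below. The function names are hypothetical, and treating "applications that return work within 3 weeks" as "average turnaround of 21 days or less" is an assumption, not something spelled out in the post.

```python
THREE_WEEKS_DAYS = 21.0

def proposed_deadline_days(avg_turnaround_days):
    """Deadline under the proposal, for an application's average turnaround.

    Assumption: 'returns work within 3 weeks' means the average turnaround
    is at most 21 days; slower applications get their average plus 3 weeks.
    """
    if avg_turnaround_days <= THREE_WEEKS_DAYS:
        return THREE_WEEKS_DAYS
    return avg_turnaround_days + THREE_WEEKS_DAYS

def five_percent_mark_days(deadline_days):
    """The point in days at which 5% of a deadline has elapsed."""
    return 0.05 * deadline_days

print(proposed_deadline_days(0.98))   # a fast application
print(proposed_deadline_days(30.0))   # a hypothetical slow application
print(five_percent_mark_days(21), five_percent_mark_days(56))
```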
Grant
Darwin NT
ID: 1905058
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22158
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1905060 - Posted: 6 Dec 2017, 8:55:18 UTC

Now that I'm on a full-size screen instead of squinting at a tiny phone one, I see I missed half of Jeff's post - you're right, Grant.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1905060
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1905096 - Posted: 6 Dec 2017, 14:41:02 UTC

Other sources of useful data available on our end may be on BOINCstats:
A complete list of Host stats and not just the first 10,000
The Host CPU breakdown to see what kinds of hardware the ~100,000 active hosts are running.

Another consideration to keep in mind regarding the tasks that timed out and were added to the ready-to-send tasks is that some of them were ghosts that were timing out.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
ID: 1905096
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1905139 - Posted: 6 Dec 2017, 18:50:16 UTC

I took some time this morning to focus on those 23 tasks (out of 86727) which exceeded 80% of the allowable deadline. There are only 14 different hosts responsible for those 23 tasks, so those are the hosts (in this sample, anyway) which are being accommodated by the current deadline structure. I wanted to see if those hosts had any particular characteristics which might account for their slow turnaround times and which might warrant special consideration for such hosts. Surprisingly, despite the fact that previous discussion of this topic has often raised the issue of slow Android hosts, not one of these 14 falls into that category.

Here's the full list, with some highlighting applied in those cases where I think that there's a special issue in play.

HostID: 8024072
CPU type AuthenticAMD AMD FX(tm)-8350 Eight-Core Processor
Number of processors 8
Coprocessors AMD AMD Radeon R7 200 Series
Operating System Microsoft Windows 10 Professional x64 Edition
Average turnaround time 0.88 days
Last contact 6 Dec 2017
OTHER ACTIVE PROJECTS: None

HostID: 7476862
CPU type GenuineIntel Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz
Number of processors 12
Coprocessors NVIDIA GeForce GTX 970
Operating System Microsoft Windows 10 Professional x64 Edition
Average turnaround time 4.57 days
Last contact 23 Oct 2017
OTHER ACTIVE PROJECTS: None

HostID: 7019479
CPU type GenuineIntel Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz
Number of processors 4
Coprocessors ---
Operating System Microsoft Windows 8 Professional x64 Edition
Average turnaround time 0.45 days
Last contact 1 Nov 2017
OTHER ACTIVE PROJECTS: None

HostID: 7474275
CPU type GenuineIntel Intel(R) Core(TM) i3-2125 CPU @ 3.30GHz
Number of processors 4
Coprocessors ---
Operating System Microsoft Windows 7 Professional x64 Edition
Average turnaround time 5 days
Last contact 30 Oct 2017
OTHER ACTIVE PROJECTS: MilkyWay@home (RAC=52)

HostID: 8347683
CPU type GenuineIntel Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz [x86 Family 6 Model 58 Stepping 9]
Number of processors 8
Coprocessors NVIDIA GeForce GTX 680MX
Operating System Darwin 14.5.0
Average turnaround time 2.65 days
Last contact 29 Nov 2017
OTHER ACTIVE PROJECTS: None

HostID: 8028551
CPU type AuthenticAMD AMD FX-8320E Eight-Core Processor
Number of processors 8
Coprocessors NVIDIA GeForce GTX 750 Ti
Operating System Microsoft Windows 7 Professional x64 Edition
Average turnaround time 2.89 days
Last contact 6 Dec 2017
OTHER ACTIVE PROJECTS: None

HostID: 8061474
CPU type AuthenticAMD AMD FX(tm)-8120 Eight-Core Processor
Number of processors 8
Coprocessors [2] NVIDIA GeForce GTX 660
Operating System Microsoft Windows 10 Core x64 Edition
Average turnaround time 0.58 days
Last contact 6 Dec 2017
OTHER ACTIVE PROJECTS: Unknown

HostID: 8360135
CPU type GenuineIntel Intel(R) Core(TM) i5-3470S CPU @ 2.90GHz
Number of processors 4
Coprocessors NVIDIA GeForce GTX 660M
Operating System Microsoft Windows 7 Professional x64 Edition
Average turnaround time 14.74 days
Last contact 30 Nov 2017
OTHER ACTIVE PROJECTS: MilkyWay@home (RAC=6564)

HostID: 8261239
CPU type GenuineIntel Intel(R) Core(TM) i7-3610QM CPU @ 2.30GHz
Number of processors 8
Coprocessors AMD AMD Radeon HD 6570/6670/7570/7670 series (Turks)
Operating System Microsoft Windows 10 Professional x64 Edition
Average turnaround time 38.07 days
Last contact 2 Dec 2017
Tasks 107 (of which 16 have timed out in last 24 hours)
OTHER ACTIVE PROJECTS: None

HostID: 8301075
CPU type GenuineIntel Intel(R) Core(TM)2 Duo CPU T7800 @ 2.60GHz
Number of processors 2
Coprocessors NVIDIA Quadro FX 1600M
Operating System Microsoft Windows 10 Professional x86 Edition
Average turnaround time 4.32 days
Last contact 6 Dec 2017
OTHER ACTIVE PROJECTS: Unknown

HostID: 5983829
CPU type GenuineIntel Intel(R) Pentium(R) Dual CPU E2200 @ 2.20GHz
Number of processors 2
Coprocessors ---
Operating System Microsoft Windows 7 Professional x86 Edition
Average turnaround time 27.25 days
Last contact 29 Nov 2017
OTHER ACTIVE PROJECTS: Unknown

HostID: 8324796
CPU type AuthenticAMD AMD A10-9620P RADEON R5, 10 COMPUTE CORES 4C+6G
Number of processors 4
Coprocessors AMD AMD Radeon R5 Graphics
Operating System Microsoft Windows 10 Core x64 Edition
Average turnaround time 47.42 days
Last contact 4 Dec 2017
Tasks 76 (75 "In progress", w/ "Sent" dates as far back as 11 Oct)
OTHER ACTIVE PROJECTS: Collatz Conjecture (RAC=19,842), Moo! Wrapper (RAC=16,005), MilkyWay@home (RAC=6564), PrimeGrid (RAC=68)

HostID: 8109586
CPU type GenuineIntel Intel(R) Xeon(R) CPU E5420 @ 2.50GHz
Number of processors 8
Coprocessors ---
Operating System Microsoft Windows 10 Enterprise x64 Edition
Average turnaround time 6.73 days
Last contact 6 Dec 2017
Tasks 105 (92 "In progress", w/ "Sent" dates as far back as 1 Nov)
OTHER ACTIVE PROJECTS: Rosetta@home (RAC=2,326), World Community Grid (RAC=2)

HostID: 8119021
CPU type GenuineIntel Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz [Family 6 Model 60 Stepping 3]
Number of processors 8
Coprocessors NVIDIA Quadro K3100M; INTEL Intel(R) HD Graphics 4600
Average turnaround time 7.71 days
Last contact 6 Nov 2017
OTHER ACTIVE PROJECTS: None

Some of those hosts certainly seem to fall into the low-volume category, with only sporadic contact with the server. By the same token, a few of those still seem to be able to download a quantity of tasks well in excess of their ability to process them in a timely fashion. There also appear to be only a few hosts whose turnaround times seem likely to be impacted by their service to other projects. However, I suspect that, in those cases, what often happens is that S@h tasks get downloaded, then languish in the queue until BOINC suddenly realizes that their deadline is approaching and then runs them at high priority so that they get reported "just in time". Shorter deadlines would likely just speed up that process, I would think.

Anyway, hopefully this additional info will provide more food for thought.
ID: 1905139
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1905140 - Posted: 6 Dec 2017, 19:03:23 UTC

Another interesting tidbit of data. My surmise that Android devices are the slowpokes gets thrown out the window. The more likely cause looks to be sporadic connection to the project or limited operating time. I see several laptops in there, and I had a suspicion that that kind of host would also be a prime suspect. I agree that hosts also attached to other projects would benefit from shorter deadlines and wouldn't be constantly running into high-priority run mode, or else that they should not be allowed to download more tasks than they can finish before deadline when resources are shared with other projects. The resource share algorithm likely needs some attention too, along with the task deadline discussion.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1905140
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1905208 - Posted: 6 Dec 2017, 22:29:55 UTC

One more batch of manually-extracted data here. I wasn't about to sift through the entire group of 878 "Timed out - no response" tasks from my initial report (it's available, though, if somebody wants to tackle it), but I figured at least a sampling might tell us something about the impact of the current deadlines on actual time outs. So, here are some details that I thought looked relevant from the first 20 in the list:

HostID: 7140180, Last contact: 4 Aug 2017
HostID: 8315614, Last contact: 11 Nov 2017
HostID: 6078057, Last contact: 6 Dec 2017, Tasks: 5, Average turnaround time 17.38 days
HostID: 6760356, Last contact: 6 Dec 2017, Tasks: 44 (0 in progress, 19 timed out)
HostID: 8271040, Last contact: 11 Sep 2017
HostID: 8150652, Last contact: 9 Sep 2017
HostID: 8209392, Last contact: 26 Sep 2017
HostID: 5603279, Last contact: 16 Nov 2017, Tasks: 21 (19 in progress, 2 timed out)
HostID: 6084476, Last contact: 23 Oct 2017, Tasks: 167 (155 in progress, 6 timed out), Average turnaround time: 12.54 days
HostID: 8060232, Last contact: 6 Dec 2017, Tasks: 185 (incl. 7 sent 20 Nov or earlier, all CPU tasks)
HostID: 8340532, Last contact: 18 Nov 2017, Tasks: 27 (6 in progress, all sent 16-18 Nov)
HostID: 8223282, Last contact: 11 Sep 2017
HostID: 8056647, Last contact: 11 Sep 2017
HostID: 6122802, Last contact: 6 Dec 2017, Tasks: 6095 (5767 in progress, 328 timed out), Current RAC: 0.05
HostID: 8340613, Last contact: 11 Sep 2017
HostID: 7995392, Last contact: 25 Nov 2017, Tasks: 32, Average turnaround time: 43.33 days
HostID: 7433929, Last contact: 23 Sep 2017
HostID: 8324822, Last contact: 1 Oct 2017
HostID: 8340749, Last contact: 20 Oct 2017
HostID: 7770549, Last contact: 6 Dec 2017, Tasks: 2810 (2718 in progress, 9 timed out) (appears to be mostly ghosts from 1-9 Nov; shown as Anonymous, but previously identifiable as Userid: 3337286)

One thing that was surprising to me was the absence of any true "drive-bys" in the sample. Several of these appear to be long-time hosts that have apparently been shut down, while some of the newer hosts appear to have spent at least a few weeks with the project before drifting away. Regarding the "last contact" dates from September and October, I think that's consistent with the fact that my analysis was run on the tasks that I processed in October, so these first 20 time-outs would be lifted from tasks that I processed at the beginning of the month.
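To put rough numbers on the "last contact" pattern in that sample, the dates can be grouped by month. The snippet below is a minimal sketch using the 20 dates copied from the list above, in the same order; the month lookup table and variable names are just conveniences for the example.

```python
from collections import Counter

# Month abbreviations that appear in the sample, mapped to month numbers.
MONTHS = {"Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}

# "Last contact" dates copied from the 20 hosts listed above, in order.
last_contact = [
    "4 Aug 2017", "11 Nov 2017", "6 Dec 2017", "6 Dec 2017", "11 Sep 2017",
    "9 Sep 2017", "26 Sep 2017", "16 Nov 2017", "23 Oct 2017", "6 Dec 2017",
    "18 Nov 2017", "11 Sep 2017", "11 Sep 2017", "6 Dec 2017", "11 Sep 2017",
    "25 Nov 2017", "23 Sep 2017", "1 Oct 2017", "20 Oct 2017", "6 Dec 2017",
]

# Number of hosts whose last contact fell in each month of 2017.
by_month = Counter(MONTHS[date.split()[1]] for date in last_contact)

# Hosts not heard from since before 1 Nov 2017.
gone_before_nov = sum(n for month, n in by_month.items() if month < 11)

for month in sorted(by_month):
    print(f"month {month:2d}: {by_month[month]} hosts")
print(f"no contact since before November: {gone_before_nov} of {len(last_contact)}")
```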
ID: 1905208
Profile Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1905241 - Posted: 7 Dec 2017, 0:52:26 UTC

HostID: 6122802, Last contact: 6 Dec 2017, Tasks: 6095 (5767 in progress, 328 timed out), Current RAC: 0.05

This one is using the stock Win10 driver and his AV suite is the likely cause of no work being done and the cache building.

Another simple solution to end a lot of timeouts by newcomers would be for stock SETI settings to actually supply a full free CPU core for each GPU found and used. Last winter here I did a "Newbie Install" of the latest and not so greatest BOINC version and started SETI up as stock, only to find that the C2D 6300 with GTX550Ti was so overcommitted on resources that it became impossible to use. Manually freeing up a core did wonders (even running CUDA tasks), but very few new people would know to do that, so they just wind up dumping the program and leaving their tasks to time out.

Well that's my 2c's worth. ;-)

Cheers.
ID: 1905241
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1905273 - Posted: 7 Dec 2017, 4:08:13 UTC - in response to Message 1905241.  

Yeah, that host looks like it used to be productive, but it's running stock, so probably just a "set and forget" user, and when something changes there are no obvious warning signals if they're not watching their results. I've suggested before that the project should at least have some sort of automated email notification system to alert users when their hosts go off the rails. It doesn't have to tell them how to fix it, but a brief "official" email letting them know that their machine is returning more Invalids and/or Errors than Valid results could simply direct them to the forums. Leaving it up to other users to once in a blue moon send off a PM (which I suspect many "set and forget" users never see) just doesn't cut it, and Anonymous users are completely unreachable that way. A "once a week" query against the database could probably identify all the current problem children quite readily.
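The selection rule for such a weekly check could be as simple as the sketch below: flag any host that returned more Invalids plus Errors than Valid results over the week. The record layout, the `min_results` cut-off, and the host IDs are all hypothetical; the real project database schema is not known here.

```python
from dataclasses import dataclass

@dataclass
class HostWeek:
    """One host's result counts for the past week (hypothetical record layout)."""
    host_id: int
    valid: int
    invalid: int
    errors: int

def problem_hosts(stats, min_results=10):
    """IDs of hosts returning more Invalids/Errors than Valid results.

    min_results is an assumed cut-off so that hosts with only a handful
    of results in the week do not trigger a notification.
    """
    flagged = []
    for host in stats:
        bad = host.invalid + host.errors
        if host.valid + bad >= min_results and bad > host.valid:
            flagged.append(host.host_id)
    return flagged

# Made-up example data, not real hosts.
week = [
    HostWeek(host_id=1, valid=50, invalid=1, errors=0),
    HostWeek(host_id=2, valid=3, invalid=20, errors=10),
    HostWeek(host_id=3, valid=0, invalid=2, errors=1),
]
to_notify = problem_hosts(week)
print(to_notify)
```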

The processing environment has evolved in many ways and, I suspect, a higher percentage of newbie "set and forget" installs don't necessarily work as smoothly as they did when the system was originally designed. That's another topic worthy of discussion certainly, particularly inasmuch as there always seems to be such a push to sign up new users to process the increasing volume of data coming from multiple sources. Clearly, we can't even keep up with a fraction of what Breakthrough Listen is generating from GBT, and now Parkes data might be flowing soon. They should step back for just a bit and look at what issues newbies might be facing in the current environment.
ID: 1905273
Gene Project Donor
Joined: 26 Apr 99
Posts: 150
Credit: 48,393,279
RAC: 118
United States
Message 1905297 - Posted: 7 Dec 2017, 6:49:16 UTC

I fully support a shorter deadline. Consulting my skimpy Seti notes, my last "classic" seti workunit was completed in November 2005 after a CPU elapsed time of 45 hours. And even as recently as 2008 I had an AstroPulse task that ran for 387 hours. In that (long ago) environment an 8-week deadline made sense. In the present day I see no reason for deadlines more than 30 days. Yes, some hosts are low-performance hosts and may only be intermittently connected but in that case only a few tasks need be downloaded so that any risk of time-outs is minimal and the impact on the Seti database is also minimal.
It is not solely a Seti@home solution - the Boinc system has a great deal to do with the management of a host's resources and, from my experience, does not do a good job. I have two "other" boinc projects with 1% resource allocation and if BoincMgr were left alone those projects would flood the work unit cache, over-commit the system resources, and result in Seti tasks failing to meet deadlines. R.H., Ageless, and others, are working toward a "new and improved" Boinc structure and one can hope that issues, such as slow machines receiving more work than they can possibly do, will be addressed.
https://setiathome.berkeley.edu/forum_thread.php?id=81729&postid=1879241
Meanwhile, I vote for a 30-day Seti deadline.
ID: 1905297
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1905305 - Posted: 7 Dec 2017, 9:05:05 UTC

Until we go to real-time processing, shorter deadlines == lost processing power from partially involved hosts and nothing more.
Deadlines should be set as long as the current server infrastructure allows.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1905305
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22158
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1905308 - Posted: 7 Dec 2017, 10:07:49 UTC

I think Raistmer has hit the nail on the head.
This debate started when the main db server was having issues with space, so perhaps we should look at the number of pending tasks vs. time to deadline (or time in progress). This will quantify the scale of the database issue. I would expect it to be some sort of Gaussian curve, but does it have a "long thin" tail, or a "long fat" tail?
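If someone had a dump of how long each pending task has been in progress, the thin-versus-fat tail question could be looked at with something like the sketch below. The sample values are invented purely to show the mechanics, and the weekly bin width and 28-day cut-off are arbitrary choices.

```python
from collections import Counter

def pending_histogram(days_pending, bin_width=7):
    """Count pending tasks per bin, keyed by the bin's starting day."""
    counts = Counter(int(d // bin_width) * bin_width for d in days_pending)
    return dict(sorted(counts.items()))

def tail_fraction(days_pending, cutoff_days):
    """Fraction of pending tasks that have been out for cutoff_days or more."""
    if not days_pending:
        return 0.0
    return sum(1 for d in days_pending if d >= cutoff_days) / len(days_pending)

# Invented sample: days in progress for ten pending tasks.
sample = [0.2, 0.5, 1.0, 1.5, 3.0, 6.9, 8.0, 20.0, 45.0, 50.0]
hist = pending_histogram(sample)
tail = tail_fraction(sample, cutoff_days=28)

print(hist)
print(f"fraction pending 28+ days: {tail:.2f}")
```

A tail fraction that stays stubbornly high as the cut-off grows would point to a "long fat" tail; one that drops off quickly would point to a "long thin" one.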
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1905308
BetelgeuseFive Project Donor
Volunteer tester
Joined: 6 Jul 99
Posts: 158
Credit: 17,117,787
RAC: 19
Netherlands
Message 1905309 - Posted: 7 Dec 2017, 10:08:33 UTC - in response to Message 1905305.  

Until we go to real-time processing, shorter deadlines == lost processing power from partially involved hosts and nothing more.
Deadlines should be set as long as the current server infrastructure allows.


If deadlines are reduced this may allow for more work to be sent to 'reliable' hosts so they don't run dry during the weekly outage. The 100 task limit is a pain for modern high-end GPUs and if that limit could be increased it may actually improve processing power. I think the gain from the high-end GPUs would be much higher than the loss from the partially involved slow hosts. Maybe there should be some kind of algorithm to determine how reliable a host is (and how fast it is in returning work) so there could be a more flexible limit instead of the fixed 100 task limit. This algorithm does not have to work on the fly, but there could also be some kind of (weekly) script that determines the task limit for a certain host (based on number of tasks processed, errors, invalids, average return time, ...).
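As a rough illustration of what such a periodic script might compute, the sketch below scales the fixed limit of 100 by a reliability factor (share of valid results) and a speed factor (based on average return time). The formula, the clamping bounds, and every parameter name are assumptions invented for this example; nothing here reflects how the SETI@home scheduler actually works.

```python
def task_limit(valid, errors, invalids, avg_turnaround_days,
               base_limit=100, min_limit=10, max_limit=400):
    """Hypothetical per-host task limit from reliability and return speed."""
    total = valid + errors + invalids
    if total == 0:
        # No history yet: fall back to the current fixed limit.
        return base_limit

    # Share of returned results that validated (0.0 to 1.0).
    reliability = valid / total

    # Faster than 1 day scales the limit up, slower scales it down,
    # clamped so one factor cannot swing the limit by more than 4x.
    speed = 1.0 / max(avg_turnaround_days, 0.01)
    speed = max(0.25, min(4.0, speed))

    limit = round(base_limit * reliability * speed)
    return max(min_limit, min(max_limit, limit))

print(task_limit(1000, 0, 0, 0.25))   # fast, reliable host
print(task_limit(10, 90, 0, 1.0))     # mostly errors
print(task_limit(0, 0, 0, 1.0))       # new host, no history
```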

Just my 2 cents ...

Tom
ID: 1905309
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1905313 - Posted: 7 Dec 2017, 11:33:39 UTC

One way to look at it (with I think better results, but much more complex) would be to have 2 server caches - one for <1.5d turnaround, one for >1.5d. And I guess rob from the other cache when either is empty.

Without having such a wide range of systems being served from the same cache, the subsets of returned tasks should decrease in size significantly. But that would be a scheduling nightmare with our 'picky' schedulers to begin with.

I think those slower computers would see an increase in credits, and a decrease for the fast ones - love that system, hey? I range wildly from .018 - 14.6 days average turnaround on my computers, but 14 days is for 15 AP tasks on my AMD. So for me it would just be a matter of validating with equal speed computers. Which would be much like the slow hand-helds/atoms/etc, where the deadlines could be allowed to be long (or even extended), but the number of tasks low, thus having little impact on the database. One could go farther and say tasks with >xyz flop_estimate (i.e. Shorties) go to slow cache first, and overflow to the fast one.
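The two-cache idea, including the "rob from the other cache when either is empty" fallback, can be mocked up in a few lines. This is only a toy sketch: the class, its method names, and the way a task gets assigned to a cache are all hypothetical, and the 1.5 day threshold is just the figure suggested above.

```python
from collections import deque

FAST_THRESHOLD_DAYS = 1.5  # turnaround split suggested above

class SplitCache:
    """Toy model of two ready-to-send caches with overflow between them."""

    def __init__(self):
        self.caches = {"fast": deque(), "slow": deque()}

    def add_task(self, task_id, cache):
        """Queue a task in the named cache ('fast' or 'slow')."""
        self.caches[cache].append(task_id)

    def next_task(self, host_avg_turnaround_days):
        """Serve from the host's own cache, else rob the other one."""
        if host_avg_turnaround_days < FAST_THRESHOLD_DAYS:
            order = ("fast", "slow")
        else:
            order = ("slow", "fast")
        for name in order:
            if self.caches[name]:
                return self.caches[name].popleft()
        return None  # both caches are empty

cache = SplitCache()
cache.add_task("wu_a", "fast")
cache.add_task("wu_b", "slow")
first = cache.next_task(0.5)    # fast host, served from the fast cache
second = cache.next_task(0.5)   # fast cache now empty, robs the slow cache
third = cache.next_task(0.5)    # nothing left
print(first, second, third)
```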

But it's all a moot point anyway, when I remember Eric saying - It's not going to change. That's that.

So I guess we can just make a v10 wish list.
ID: 1905313
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester
Joined: 7 Mar 03
Posts: 22158
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1905314 - Posted: 7 Dec 2017, 11:57:29 UTC

For such a system to work properly one would actually need to ensure that a representative sample of tasks was validated on both the "slow" and "fast" stream, which of course would mean more tasks hurtling around the system and increased load on the servers. Indeed, splitting users into two groups would result in a significant increase in the background user management processes - promoting & demoting users as their systems and situations change.
One thing that would be fairly light on the servers would be limits set by actual valid results returned. Doing so would "starve" the persistent error/invalid generators, while allowing the "good boys" a few more tasks. This is already done to an extent, but the half-life on returning to normal is pretty short. This might also have a benefit in helping to clear out ghosts (but I haven't really thought about that side of things).
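To make the "half-life on returning to normal" point concrete, here is a toy quota rule in which a bad result halves a host's quota but each valid result only restores it one step at a time, so recovery is deliberately slow. This is an illustrative assumption, not a description of the quota logic the project's servers actually use; the outcome labels and default values are invented.

```python
def update_quota(quota, outcome, floor=1, ceiling=100, recovery_step=1):
    """New per-host task quota after one reported result.

    outcome is "valid", "invalid" or "error" (hypothetical labels).
    Bad results cut the quota in half; valid results add recovery_step.
    """
    if outcome == "valid":
        return min(ceiling, quota + recovery_step)
    return max(floor, quota // 2)

# A host that returns three bad results and then two good ones.
quota = 100
history = []
for outcome in ["error", "error", "invalid", "valid", "valid"]:
    quota = update_quota(quota, outcome)
    history.append(quota)

print(history)
```

With a small `recovery_step`, a persistent error generator stays starved, while a host that had one bad spell still climbs back, only more slowly than it fell.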
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1905314
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1905323 - Posted: 7 Dec 2017, 14:39:45 UTC - in response to Message 1905309.  


If deadlines are reduced this may allow for more work to be sent to 'reliable' hosts so they don't run dry during the weekly outage. The 100 task limit is a pain for modern high-end GPUs and if that limit could be increased it may actually improve processing power.

Tom


How so? You are mixing the 100-task-per-device limit with the deadline. From what do you infer that shortening the deadline will automatically increase the 100-per-device limit?

If that limit were increased, yes, performance would increase, but there's no need "to mix salt with hot". Shorter deadlines could also just increase the rate of re-sends from broken hosts, because they would then refresh their locked tasks more often. And this could result in shortening the "100" limit instead of increasing it.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1905323
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1905324 - Posted: 7 Dec 2017, 14:45:06 UTC - in response to Message 1905314.  
Last modified: 7 Dec 2017, 14:46:08 UTC

One thing that would be fairly light on the servers would be limits set by actual valid results returned. Doing so would "starve" the persistent error/invalid generators, while allowing the "good boys" a few more tasks. This is already done to an extent, but the half-life on returning to normal is pretty short. This might also have a benefit in helping to clear out ghosts (but I haven't really thought about that side of things).


Improvement in quota management is definitely needed. For now, a host with one good and one bad GPU of the same vendor is fully invisible to quota management.
An especially bad case is when the slow GPU is good and the fast one is broken: a small rate of good tasks is enough to let the host trash many tasks at a good rate... giving almost nothing good in return.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1905324
BetelgeuseFive Project Donor
Volunteer tester
Joined: 6 Jul 99
Posts: 158
Credit: 17,117,787
RAC: 19
Netherlands
Message 1905332 - Posted: 7 Dec 2017, 15:37:10 UTC - in response to Message 1905323.  


If deadlines are reduced this may allow for more work to be sent to 'reliable' hosts so they don't run dry during the weekly outage. The 100 task limit is a pain for modern high-end GPUs and if that limit could be increased it may actually improve processing power.

Tom


How so? You are mixing the 100-task-per-device limit with the deadline. From what do you infer that shortening the deadline will automatically increase the 100-per-device limit?

If that limit were increased, yes, performance would increase, but there's no need "to mix salt with hot". Shorter deadlines could also just increase the rate of re-sends from broken hosts, because they would then refresh their locked tasks more often. And this could result in shortening the "100" limit instead of increasing it.


From what I read earlier I understand that the task limit is there because the total number of tasks out in the field is causing problems on the server. Reducing the deadline would reduce the number of tasks out in the field allowing for more tasks to be sent to reliable hosts. I think that some kind of flexible limit per device would be easier on the server than having multiple queues. And of course it would be possible to dramatically reduce the task limit for 'broken' hosts.
I don't see how shorter deadlines would increase the rate of re-sends from broken hosts. Maybe from very slow hosts, but from broken hosts they would need to be re-sent at some point anyway. There just should be a better mechanism for not sending many more tasks to these broken hosts.
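A back-of-the-envelope way to see how much of the "tasks out in the field" load comes from timeouts is to count task-days, using the figures from the first post in this thread. The sketch below rests on two loud assumptions: every timed-out or never-started task occupies its slot for the full deadline, and valid tasks keep returning in 0.98 days on average whatever the deadline is. It also only covers one user's wingmen for one month, so treat the output as illustrative.

```python
# Figures from the October 2017 sample in the first post.
VALID_TASKS = 86727
AVG_ACTUAL_DAYS = 0.98
TIMED_OUT_TASKS = 878 + 95      # "Timed out" plus "Not started by deadline"
AVG_ALLOWED_DAYS = 48.85

def timeout_share(deadline_days):
    """Share of all task-days taken up by tasks that never came back.

    Assumes timed-out tasks sit for the whole deadline and that valid
    tasks still take AVG_ACTUAL_DAYS on average.
    """
    valid_days = VALID_TASKS * AVG_ACTUAL_DAYS
    timeout_days = TIMED_OUT_TASKS * deadline_days
    return timeout_days / (valid_days + timeout_days)

current = timeout_share(AVG_ALLOWED_DAYS)
shorter = timeout_share(30.0)   # the 30-day deadline suggested earlier
print(f"current deadlines: {current:.1%} of task-days")
print(f"30-day deadline:   {shorter:.1%} of task-days")
```

The sketch ignores the handful of slow-but-valid tasks that a shorter deadline would turn into extra timeouts and re-sends, which is the cost raised on the other side of this argument.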

Tom
ID: 1905332


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.