The Server Issues / Outages Thread - Panic Mode On! (118)
Author | Message |
---|---|
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
For what it is worth, of the projects I participate in, Seti@Home has the most relaxed "due" schedule. Many of my other projects allow a week or less per task. . . Tom! Are you after my nick? Talk like that might get you excommunicated. ... Stephen :) |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
TomM proposes/suggests reducing the deadline to 2 weeks. I would vote for a less "ambitious" adjustment to the deadlines. I do observe that the AstroPulse tasks are issued with a 26-day deadline, as compared to the 60-day deadline for everything else. If the deadline were reduced to perhaps 40 or 50 days and allowed to remain there a couple of months (i.e. long enough to stabilize to some sort of equilibrium) that ought to give the project some hard data on the effects on database issues and resend statistics. Then decide whether it was a mistake - and revert to previous values; or, decide it was a positive move and, perhaps, continue adjusting deadlines in similar small steps. . . I would vote for 28 days myself. I remain convinced the project would be perfectly viable with an even shorter deadline but in the spirit of compromise 28 days seems way more than sufficient. Stephen . . |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304 |
So what does this mean? I don't believe that to be the case. The reason there are so many systems that take so long to return work is the existing long deadlines. We are allowing it to occur. And even so, the average turnaround time at present is only 34 hours! Even the slowest of the slow systems can return a WU within 2 days. Even allowing them to spend much of their time not actually processing work or working on another project, they can still return the longest-to-process WU within a week. But people do have issues: power, comms, system etc. So we set deadlines at 4 weeks. In that time the slowest of the slow that spends most of its time powered off will still be able to return several WUs. And even if there are floods, fires, storms etc. that make it impossible for systems to return the work within a week, people will still be able to return finished work before it times out by giving them that 28 day deadline. Grant Darwin NT |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
I would suggest that the sweet-spot may be deadlines around 40-50 days, where the impact on the slowest hosts is probably about as low as one can reasonably expect. . . Except that the majority of that 'delay' on the slow hosts is not due to their low productivity so much as their oversized caches. The reason they sit on tasks for 50 days is not because it takes them that long to process a task, but because WUs sit in their 'in progress' status for weeks on end before they get around to processing them. Shortening the deadline and if necessary reducing their work fetch limits would eliminate that unnecessary period of WUs sitting in purgatory. To avoid large numbers of time outs and system imposed work allocation limits they would have to actually administer their hosts more responsibly and reduce their caches to a size that matches their level of productivity. What a shame that would be ... Stephen :( |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
... oversized caches ... Now there's a challenge! I'll have a look through some of my pendings later, and see how many of my wingmates fall into that category. |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304 |
. . Except that the majority of that 'delay' on the slow hosts is not due to their low productivity so much as their oversized caches. The reason they sit on tasks for 50 days is not because it takes them that long to process a task, but because WUs sit in their 'in progress' status for weeks on end before they get around to processing them. In theory, if a WU is processed it should be done within 20 days (10+10 for cache settings).* Any longer than that, and still returned by that host, would most likely be due to outside factors (system, power, comms etc. issues), or a recently connected very slow host, possibly with more than one project with 10+10 cache settings still figuring things out. *Unless bunkering or other such user manipulation is at play. Grant Darwin NT |
rob smith Send message Joined: 7 Mar 03 Posts: 22203 Credit: 416,307,556 RAC: 380 |
Just take a look at the graph before making ANY assumption about "having no effect", "Because of long deadlines" - these two are totally and utterly WRONG. The truth, and some do not accept this, is that SETI@Home has a POLICY of supporting a very wide range of computer performance, and human activity such as holidays and forgetting to stop a host, infrequent processing and so on. Twenty days would mean about 40% of the tasks sent out would have to be resent, and, as these are probably on hosts that only do a very small number of tasks per year, that means alienating a very large proportion of the user base, which according to many reports is shrinking - do you want to decimate that base overnight? Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
Reduced server side limits would have no effect on those super slow hosts but would hurt fast GPUs disproportionately. When the limit was 100 per GPU, my cache was limited to less than 2 hours and I have just a cheap mid-range graphics card. There is something fishy in the graph. I have monitored the validating time of my tasks for a long time and 95% of all my tasks have been validated within 5 days of me originally downloading the task. 98% in 13 days. So a two week deadline would force at most 2% of the tasks to be resent and in practice a lot less because people would adjust their caches. I guess the graph shows a snapshot of tasks in the validation queue. Such a snapshot would show a disproportionately high percentage of long waiting tasks as they are the ones that get 'stuck' in the queue while the quickly validated ones don't wait in there to be seen. This is what the real validation time distribution looks like in graph form (x axis is days, y axis is percentage of tasks not validated yet): The sudden drop at 55 days is the result of the tasks expiring, getting resent and then getting validated fast in a scaled-down version of a similar curve. |
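The snapshot argument in the post above is a length-bias effect: a task that waits d days for validation is visible in the queue for d days, so a snapshot over-counts slow tasks. Here is a minimal Python sketch of that effect, assuming a steady issue rate and a purely hypothetical three-bucket mix. The 2, 10 and 55 day buckets and their shares are illustrative assumptions loosely inspired by the percentages quoted in the post, not project data, and `snapshot_shares` is a made-up helper name.

```python
def snapshot_shares(mix):
    """Return the share of a steady-state queue snapshot held by each bucket.

    mix maps validation time in days -> fraction of all issued tasks.
    A task that waits d days is present in the queue for d days, so its
    weight in a snapshot is proportional to fraction * d.
    """
    weights = {days: fraction * days for days, fraction in mix.items()}
    total = sum(weights.values())
    return {days: weight / total for days, weight in weights.items()}


# Hypothetical mix: most tasks validate quickly, a few wait until expiry.
mix = {2: 0.95, 10: 0.03, 55: 0.02}
shares = snapshot_shares(mix)

for days, fraction in mix.items():
    print(f"{days:>2} days: {fraction:6.1%} of issued tasks, "
          f"{shares[days]:6.1%} of a queue snapshot")
```

Under these assumed numbers, the 2% of tasks that wait 55 days make up about a third of what a snapshot would show, which is the kind of distortion the post describes.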
rob smith Send message Joined: 7 Mar 03 Posts: 22203 Credit: 416,307,556 RAC: 380 |
Have you ever worked out how much it would "hurt" the super-fast hosts? It's quite simple to do: How many tasks per hour does ONE GPU get through? Now work out the length of time a 150 task cache lasts for. Now work out what percentage of a week that is. Now let's see the "hurt" for a 4-hour and an 8-hour period where no tasks are sent to that GPU; remember that the first x minutes of that time is covered by the GPU's cache. (I've not done the sums, but I think you will be amazed by how small the figure is - less than 10% (for the 8-hour period) is my first guess, but prove me wrong) Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
Have you ever worked out how much it would "hurt" the super-fast hosts? You would be amazed how high the figure is. Tuesday downtimes are not the only periods when the caches deplete. In the last couple of months we have had a lot of periods of throttled work generation where my host spends several hours getting nothing until it gets lucky and gets some work. And then again a long time of nothing. 150 tasks last about 3 hours with my GPU but it's a slow one. High end GPUs will probably crunch those 150 tasks in less than an hour. |
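The sums asked for in the question quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 3-hour cache duration is the figure given in the reply, the 1-hour figure is an assumed upper bound for "less than an hour" on a high-end GPU, and the function name `idle_share_of_week` is hypothetical. It covers a single outage; several outages in one week would have to be summed.

```python
HOURS_PER_WEEK = 7 * 24


def idle_share_of_week(cache_hours, outage_hours):
    """Fraction of a week a GPU sits idle during one outage.

    The cache covers the first cache_hours of the outage; only the
    remainder (never negative) counts as lost crunching time.
    """
    idle_hours = max(0.0, outage_hours - cache_hours)
    return idle_hours / HOURS_PER_WEEK


# 3 h: mid-range GPU from the post; 1 h: assumed bound for a high-end GPU.
for cache_hours in (3.0, 1.0):
    for outage_hours in (4.0, 8.0):
        share = idle_share_of_week(cache_hours, outage_hours)
        print(f"cache {cache_hours:.0f} h, outage {outage_hours:.0f} h: "
              f"{share:.2%} of the week idle")
```

For a single outage, the idle share under these assumptions stays below 5% of the week. Whether the weekly total stays that small depends on how many such periods occur, which is the point of disagreement between the two posts.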
Lazydude Send message Joined: 17 Jan 01 Posts: 45 Credit: 96,158,001 RAC: 136 |
Please remember that the 10+10 days are per PROJECT, so my conclusion is deadlines of 20 days + 10 days (for the gremlins). I just tested with 10+10 and got 12.5 days of work in one request from Einstein, 150+150 WUs from SETI, and 1000 units (3 days) from Asteroids before I set NNT. So in a couple of days I suspect I will have a lot of tasks in high-priority mode. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Now there's a challenge! Took a sample from my host 7118033 - single GTX 1050 Ti, cache 0.8 days, turnround 0.81 days, 82 tasks in progress, 120 pending. 100 of those pending were less than 4 weeks old. Here are the other 20.
Workunit    Deadline   Wingmate  Turnround   Platform  CPU      Notes
3843402125  10-Mar-20  8011299   1.77 days   Ubuntu    i7       Block of ghosts on that day? Later work returned normally.
3838811280  06-Mar-20  6834070   0.06 days   Darwin    i5       No contact since that allocation. Stopped crunching?
3835694801  04-Mar-20  8882763   0.04 days   Win 10    Ryzen 5  No contact since that allocation. Stopped crunching?
3833579833  06-Mar-20  7862206)  17.02 days  Darwin    i7       Only contacts once a week. Nothing since 29 Jan
3833579839  03-Mar-20  7862206)
3831370022  02-Mar-20  8504851)  n/a         Win 7     Turion   Never re-contacted
3831369958  02-Mar-20  8504851)
3830290903  02-Mar-20  8623725)  0.48 days   Win 10    i7       No contact since that allocation. Stopped crunching?
3830290941  27-Feb-20  8623725)
3827620430  29-Feb-20  8879055   6.2 days    Win 10    i5       Last contact 12 Jan. Stopped crunching?
3826924227  25-Mar-20  8756342   1.21 days   Android   ?        Active, but many gaps in record.
3821828603  02-Mar-20  8871849   5.29 days   Win 10    i5       Last contact 5 Jan. Stopped crunching?
3821313504  26-Feb-20  8664947)  0.96 days   Win 10    Ryzen    Last contact 10 Feb. Stopped crunching?
3821313516  26-Feb-20  8664947)
3821313522  26-Feb-20  8664947)
3820902138  25-Feb-20  8665965   2.66 days   Win 7     i7       Last contact 6 Jan. Stopped crunching?
3819012955  15-Mar-20  8842969   2.75 days   Win 10    i7       Last contact 11 Jan. Stopped crunching?
3816054138  Timed out/resent. Should return today.
3808676716  14-Mar-20  8873865   53.85 days  Win 10    i5       Host still active, but not crunching. Hit his own bad wingmate!
3783208510  Timed out/resent. Should return
Apart from one Android and one Turion, all of those are perfectly good crunchers - should have no problem with deadlines. No sign of an excessive cache amongst them. The biggest problem is people who sign up, then leave without cleaning up behind them.
I'd say that supports a shorter (set of) deadlines - remember deadlines are variable. |
Tom M Send message Joined: 28 Nov 02 Posts: 5124 Credit: 276,046,078 RAC: 462 |
And for extremely slow, rarely-on systems, 1 month is plenty of time for them to return a WU. It's actually plenty of time for them to return many WUs. +42 A proud member of the OFA (Old Farts Association). |
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
The biggest problem is people who sign up, then leave without cleaning up behind them. I'd say that supports a shorter (set of) deadlines - remember deadlines are variable. Richard hit the target. Even the super slow hosts can crunch their WUs in less than a month. The ones that do not are the ones with problems: ghosts, stopped crunching, hardware failure, etc. I made a search (very painful due to the slow response from the servers) of my validation pending tasks (6380) and found something very close to Richard's findings: about 15% of them are WUs received and crunched at the beginning of January and still waiting for the wingmen. From the ones I was able to follow (it takes a long time to show a single page), I could say with high confidence that about 1/2 of them will not return before the deadline. If we extrapolate that 7.5% to the DB size, we are talking about a huge number. Then why not set the deadline to up to 1 month? That will give the project admins extra time to think about a more permanent solution. BTW, I still believe the only practical solution, with the available server hardware & software, is to limit the WU cache to something like 1 day of the host's actual rate of returning valid tasks, for all hosts, faster or slower. And even that will give only some extra time. The real permanent solution is a complete update of the project. Better hardware could help, obviously, but the bottleneck is the way the project works, which is still the same as >20 years ago, when it took more than a day to crunch a single WU and dial-up connections were the only ones available. I am trying to find one app from >20 years ago that is still running in the same way and I was unable to find one; maybe someone knows of one and could share. |
rob smith Send message Joined: 7 Mar 03 Posts: 22203 Credit: 416,307,556 RAC: 380 |
I am trying to find one app from >20 years ago that is still running in the same way and I was unable to find one; maybe someone knows of one and could share. I assume you mean "host" not "app", as the 20 year old application was the Classic (pre-BOINC) and those results were collated a long time ago - probably just after BOINC burst on the block in 2003. Grumpy Swede was, and possibly still is, using a Windows XP system, so that might be one of the oldest around. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
I am trying to find one app from >20 years ago that is still running in the same way and I was unable to find one; maybe someone knows of one and could share. Yes, my mistake; you know my English is bad. But you get the point: why insist on keeping the >2 month deadlines these days, when what we urgently need is to squeeze the DB size? |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
We were reminded recently of "Estimates and Deadlines revisited" from 2008 (just before GPUs were introduced). That link drops you in on the final outcome, but here's a summary of Joe's table of deadlines:
Angle     Deadline (days from issue)
0.001     23.25
0.05      23.25 (VLAR)
0.0501    27.16
0.22548   16.85
0.22549   32.23
0.295     27.76
0.385     24.38
0.41      23.70 (common from Arecibo)
1.12744    7.00 (VHAR)
10         7.00
Since then, we've had two big increases in crunching time, due to increases in search sensitivity, and each has been accompanied by an extension of deadlines. So the table now looks something like:
Angle     Deadline
0.05      52.75 (VLAR)
0.425     53.39 (nearest from Arecibo)
1.12744   20.46 (VHAR)
So, deadlines overall have more than doubled since 2008, without any allowance for the faster average computer available now. I think we could safely halve the current figures, as the simplest adjustment. |
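The "more than doubled" observation and the proposed halving can be checked directly from the two tables in the post. Here is a small Python sketch using only the three angle ranges that appear in both tables. Pairing the 2008 Arecibo row (angle 0.41) with the current one (angle 0.425) is an approximation, and the names `deadlines` and `growth_and_half` are hypothetical.

```python
# (2008 deadline, current deadline) in days, taken from the post's tables.
deadlines = {
    "VLAR": (23.25, 52.75),
    "Arecibo (typical angle)": (23.70, 53.39),  # angles 0.41 vs 0.425
    "VHAR": (7.00, 20.46),
}


def growth_and_half(old_days, new_days):
    """Return (growth factor since 2008, current deadline halved)."""
    return new_days / old_days, new_days / 2


for label, (old_days, new_days) in deadlines.items():
    growth, halved = growth_and_half(old_days, new_days)
    print(f"{label}: x{growth:.2f} since 2008, halved -> {halved:.2f} days")
```

The growth factors come out at roughly 2.27, 2.25 and 2.92, and halving the current figures gives roughly 26.4, 26.7 and 10.2 days, each still above the corresponding 2008 deadline.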
juan BFP Send message Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799 |
So, deadlines overall have more than doubled since 2008, without any allowance for the faster average computer available now. I think we could safely halve the current figures, as the simplest adjustment. Another reason to support the idea. Something must be done. Anyway, any changes to the deadlines will take weeks to take effect. Plenty of time to make any fine adjustments if necessary. |
Ville Saari Send message Joined: 30 Nov 00 Posts: 1158 Credit: 49,177,052 RAC: 82,530 |
Increasing deadlines when workunits become slower to crunch doesn't really make much sense. No one needs a long deadline because their host needs that long to crunch a single task. The need for long deadlines arises from things that have nothing to do with single task duration. AstroPulse tasks have a 25-day deadline despite being many times slower to crunch than MultiBeam. Why not drop the deadline of all tasks to this same 25 days? My validation time statistics say that 61% of the tasks that take longer than 25 days will eventually expire. And the remaining 39% is only 0.45% of all tasks. So currently less than one task in 200 would be returned in the time window that would be cut out by a deadline reduction to 25 days. And I believe most of that 0.45% won't really hit the new deadline because users who currently run their computers in a way that makes tasks take that long will adapt to the new deadline and will either reduce their cache sizes or increase the time they keep their computers powered on and crunching setiathome. |
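The percentages in the post above fit together as follows. This is a minimal Python sketch of the arithmetic, using only the two figures quoted (61% of late tasks eventually expire; the returned remainder is 0.45% of all tasks); the function name `late_task_breakdown` is hypothetical.

```python
def late_task_breakdown(expire_fraction_of_late, returned_late_share):
    """Derive overall shares from the two quoted statistics.

    expire_fraction_of_late: fraction of >25-day tasks that never return.
    returned_late_share: share of ALL tasks that return after 25 days.
    Returns (share of all tasks that are late, share that expire).
    """
    returned_fraction_of_late = 1.0 - expire_fraction_of_late
    late_share = returned_late_share / returned_fraction_of_late
    expired_share = late_share * expire_fraction_of_late
    return late_share, expired_share


late_share, expired_share = late_task_breakdown(0.61, 0.0045)
print(f"tasks taking longer than 25 days: {late_share:.2%} of all tasks")
print(f"of which never returned:          {expired_share:.2%} of all tasks")
print(f"returned late: about one task in {1 / 0.0045:.0f}")
```

With the quoted figures, about 1.15% of all tasks take longer than 25 days, about 0.70% expire regardless of the deadline, and the 0.45% that return late works out to roughly one task in 222, consistent with "less than one task in 200".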
rob smith Send message Joined: 7 Mar 03 Posts: 22203 Credit: 416,307,556 RAC: 380 |
How wrong can you be: Let's take an extreme example: A task takes 1 hour to run. The deadline is 1.5 hours. Now the task complexity is changed and it takes 2 hours to run, but the deadline remains at 1.5 hours. Therefore all tasks fail to complete within their deadline. As I said, that is a deliberately extreme example. You do however raise a reasonable question - why indeed do AstroPulse tasks have a deadline of 25 days, when the much faster to compute (but much more common) MultiBeam tasks have a deadline of over 50 days? I suspect the logic behind that is lost in the mists of time (or Richard will pop up with the answer). Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.