The Server Issues / Outages Thread - Panic Mode On! (118)

Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (118)

Previous · 1 . . . 90 · 91 · 92 · 93 · 94 · Next

Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2033737 - Posted: 24 Feb 2020, 9:43:47 UTC - in response to Message 2033719.  

For what it is worth, of the projects I participate in, Seti@Home has the most relaxed "due" schedule. Many of my other projects allow a week or less per task.
An experiment might be to reduce the deadline to, say, 2 weeks and see if the load drops off because tasks are not sitting idle in the DB.
If possible, could we split the deadlines, making the GPU tasks a week or under?
Tom


. . Tom! Are you after my nick? Talk like that might get you excommunicated. ...

Stephen

:)
ID: 2033737 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2033738 - Posted: 24 Feb 2020, 9:46:50 UTC - in response to Message 2033727.  

TomM suggests reducing the deadline to 2 weeks. I would vote for a less "ambitious" adjustment to the deadlines. I do observe that the AstroPulse tasks are issued with a 26-day deadline, as compared to the 60-day deadline for everything else. If the deadline were reduced to perhaps 40 or 50 days and allowed to remain there for a couple of months (i.e. long enough to stabilize to some sort of equilibrium), that ought to give the project some hard data on the effects on database issues and resend statistics. Then decide whether it was a mistake - and revert to the previous values - or decide it was a positive move and, perhaps, continue adjusting deadlines in similar small steps.


. . I would vote for 28 days myself. I remain convinced the project would be perfectly viable with an even shorter deadline, but in the spirit of compromise 28 days seems way more than sufficient.

Stephen

. .
ID: 2033738 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 2033739 - Posted: 24 Feb 2020, 9:49:42 UTC - in response to Message 2033734.  
Last modified: 24 Feb 2020, 9:55:48 UTC

So what does this mean?
A deadline reduction to 10 days would shove the resends through the roof, and turn a large number of currently useful-if-slow hosts into slow-but-useless hosts - they just wouldn't return their data in time and that data forms a very large proportion of the total
....
I don't believe that to be the case.
The reason there are so many systems that take so long to return work is the existing long deadlines. We are allowing it to occur. And even so, the average turnaround time at present is only 34 hours!

Even the slowest of the slow systems can return a WU within 2 days. Even allowing them to spend much of their time not actually processing work, or working on another project, they can still return the longest-to-process WU within a week. But people do have issues - power, comms, system etc.
So we set deadlines at 4 weeks. In that time even the slowest of the slow that spends most of its time powered off will still be able to return several WUs. And even if there are floods, fires, storms etc. that make it impossible for systems to return work within a week, that 28-day deadline will still let people return finished work before it times out.
Grant
Darwin NT
ID: 2033739 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2033740 - Posted: 24 Feb 2020, 9:58:56 UTC - in response to Message 2033734.  

I would suggest that the sweet-spot may be deadlines around 40-50 days, where the impact on the slowest hosts is probably about as low as one can reasonably expect.


. . Except that the majority of that 'delay' on the slow hosts is due not to their low productivity so much as their oversized caches. The reason they sit on tasks for 50 days is not that it takes them that long to process a task, but that WUs sit in 'in progress' status for weeks on end before the host gets around to processing them. Shortening the deadline, and if necessary reducing their work fetch limits, would eliminate that unnecessary period of WUs sitting in purgatory. To avoid large numbers of time-outs and system-imposed work allocation limits they would have to actually administer their hosts more responsibly and reduce their caches to a size that matches their level of productivity. What a shame that would be ...

Stephen

:(
ID: 2033740 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2033741 - Posted: 24 Feb 2020, 10:08:16 UTC - in response to Message 2033740.  

... oversized caches ...
Now there's a challenge! I'll have a look through some of my pendings later, and see how many of my wingmates fall into that category.
ID: 2033741 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 2033743 - Posted: 24 Feb 2020, 10:18:51 UTC - in response to Message 2033740.  

. . Except that the majority of that 'delay' on the slow hosts is due not to their low productivity so much as their oversized caches. The reason they sit on tasks for 50 days is not that it takes them that long to process a task, but that WUs sit in 'in progress' status for weeks on end before the host gets around to processing them.
In theory, if a WU is processed it should be done within 20 days (10+10 for cache settings).*
Any longer than that, and still returned by that host, would most likely be due to outside factors (system, power, comms etc. issues), or a recently connected very slow host, possibly with more than one project at 10+10 cache settings, still figuring things out.



*Unless bunkering or other such user manipulation is at play.
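Grant's 20-day bound can be written out as a quick sketch. This is only an illustration of the arithmetic he describes, using BOINC's two cache preferences ("store at least" plus "store up to an additional" days of work) at their 10+10 maximum:

```python
# Rough sketch of the 20-day bound: with maximum BOINC cache settings
# (10 days "store at least" plus 10 days "store up to an additional"),
# a freshly fetched task queues behind at most a full cache of earlier
# work, so it should be started within the combined cache depth.
def worst_case_days_to_start(store_at_least_days, additional_days):
    return store_at_least_days + additional_days

print(worst_case_days_to_start(10, 10))  # 20
```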
Grant
Darwin NT
ID: 2033743 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Send message
Joined: 7 Mar 03
Posts: 22204
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2033747 - Posted: 24 Feb 2020, 11:10:29 UTC

Just take a look at the graph before making ANY assumption about "having no effect" or "because of long deadlines" - those two claims are totally and utterly WRONG.

The truth is, and some do not accept this, that SETI@Home has a POLICY of supporting a very wide range of computer performance, and human activity such as holidays, forgetting to stop a host, infrequent processing and so on. Twenty days would mean about 40% of the tasks sent out would have to be resent, and, as these are probably on hosts that only do a very small number of tasks per year, that means alienating a very large proportion of the user base, which according to many reports is shrinking - do you want to decimate that base overnight?
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2033747 · Report as offensive
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2033748 - Posted: 24 Feb 2020, 11:15:58 UTC
Last modified: 24 Feb 2020, 11:46:46 UTC

Reduced server side limits would have no effect on those super slow hosts but would hurt fast GPUs disproportionately. When the limit was 100 per GPU, my cache was limited to less than 2 hours, and I have just a cheap mid-range graphics card.

There is something fishy in the graph. I have monitored the validating time of my tasks for a long time and 95% of all my tasks have been validated within 5 days of me originally downloading the task. 98% in 13 days. So a two week deadline would force at most 2% of the tasks to be resent and in practice a lot less because people would adjust their caches.

I guess the graph shows a snapshot of the tasks in the validation queue. Such a snapshot would show a disproportionately high percentage of long-waiting tasks, as they are the ones that get 'stuck' in the queue, while the quickly validated ones don't wait there long enough to be seen.
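The snapshot bias described here is classic length-biased sampling, and a toy calculation shows how strong it is. The mix below (95% of tasks validating in about a day, 5% taking about 40 days) is an assumed illustration, not Ville's actual data:

```python
# Toy model of length-biased sampling in a queue snapshot.
# Assumed mix: 95 tasks that wait 1 day to validate, 5 that wait 40 days.
wait_days = [1.0] * 95 + [40.0] * 5

# The chance a task appears in an instantaneous snapshot of the queue
# is proportional to how long it sits there.
total_wait = sum(wait_days)
slow_share_in_snapshot = sum(d for d in wait_days if d > 1.0) / total_wait

print(f"{slow_share_in_snapshot:.0%}")  # the 5% of slow tasks fill ~68% of the snapshot
```

So a queue snapshot can make a small slow tail look like the dominant population, which is exactly the distortion being claimed about the graph.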

This is what the real validation time distribution looks like in graph form (x axis is days, y axis is percentage of tasks not validated yet):


The sudden drop at 55 days is the result of tasks expiring, being resent and then getting validated quickly, in a scaled-down version of a similar curve.
ID: 2033748 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Send message
Joined: 7 Mar 03
Posts: 22204
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2033750 - Posted: 24 Feb 2020, 11:34:13 UTC
Last modified: 24 Feb 2020, 11:36:55 UTC

Have you ever worked out how much it would "hurt" the super-fast hosts?

It's quite simple to do:
How many tasks per hour does ONE GPU get through?
Now work out the length of time a 150-task cache lasts for.
Now work out what percentage of a week that is.
Now let's see the "hurt" for a 4-hour and an 8-hour period where no tasks are sent to that GPU - remember that the first x minutes of that time is covered by the GPU's cache.

(I've not done the sums, but I think you will be amazed by how small the figure is - less than 10% (for the 8-hour period) is my first guess, but prove me wrong)
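Those four steps can be run as a sketch. The throughput figure (60 tasks/hour) is an assumption chosen for illustration, not a measured value for any particular GPU:

```python
# rob smith's back-of-envelope, assuming a fast GPU doing 60 tasks/hour
# and the 150-task server-side cache limit.
def weekly_hurt(tasks_per_hour, cache_size, outage_hours, week_hours=168):
    """Fraction of a week's output lost to one outage, after the cache
    covers the first part of it."""
    cache_hours = cache_size / tasks_per_hour         # how long the cache lasts
    dry_hours = max(0.0, outage_hours - cache_hours)  # time truly starved of work
    return dry_hours / week_hours                     # share of the week lost

print(f"{weekly_hurt(60, 150, 4):.1%}")  # 4-hour outage
print(f"{weekly_hurt(60, 150, 8):.1%}")  # 8-hour outage: well under the 10% guess
```

Under these assumptions the cache covers the first 2.5 hours, leaving roughly 0.9% and 3.3% of a week lost for the 4- and 8-hour outages respectively.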
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2033750 · Report as offensive
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2033753 - Posted: 24 Feb 2020, 12:00:50 UTC - in response to Message 2033750.  

Have you ever worked out how much it would "hurt" the super-fast hosts?
(I've not done the sums, but I think you will be amazed by how small the figure is - less than 10% (for the 8-hour period) is my first guess, but prove me wrong)
You would be amazed how high the figure is. Tuesday downtimes are not the only periods when the caches deplete. In the last couple of months we have had a lot of periods of throttled work generation where my host spends several hours getting nothing until it gets lucky and receives some work. And then again a long time of nothing. 150 tasks last about 3 hours on my GPU, but it's a slow one. High-end GPUs will probably crunch those 150 tasks in less than an hour.
ID: 2033753 · Report as offensive
Lazydude
Volunteer tester

Send message
Joined: 17 Jan 01
Posts: 45
Credit: 96,158,001
RAC: 136
Sweden
Message 2033754 - Posted: 24 Feb 2020, 12:28:20 UTC
Last modified: 24 Feb 2020, 12:33:27 UTC

Please remember:
10+10 days are per PROJECT,
so my conclusion: deadlines of 20 days + 10 days (for the gremlins).

I just tested with 10+10 and got 12.5 days of work in one request from Einstein,
150+150 WUs from SETI,
and 1000 units (3 days' worth) from Asteroids before I set NNT.
So in a couple of days I suspect I will have a lot of tasks in high-priority mode.
ID: 2033754 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2033755 - Posted: 24 Feb 2020, 12:36:11 UTC - in response to Message 2033741.  

Now there's a challenge!
Took a sample from my host 7118033 - single GTX 1050 Ti, cache 0.8 days, turnround 0.81 days, 82 tasks in progress, 120 pending. 100 of those pending were less than 4 weeks old. Here are the other 20.

Workunit      Deadline     Wingmate   Turnround    Platform  CPU    
3843402125    10-Mar-20    8011299    1.77 days    Ubuntu    i7       Block of ghosts on that day? Later work returned normally.
3838811280    06-Mar-20    6834070    0.06 days    Darwin    i5       No contact since that allocation. Stopped crunching?
3835694801    04-Mar-20    8882763    0.04 days    Win 10    Ryzen 5  No contact since that allocation. Stopped crunching?
3833579833    06-Mar-20    7862206)  17.02 days    Darwin    i7       Only contacts once a week. Nothing since 29 Jan
3833579839    03-Mar-20    7862206)            	
3831370022    02-Mar-20    8504851)         n/a    Win 7     Turion   Never re-contacted
3831369958    02-Mar-20    8504851)            	
3830290903    02-Mar-20    8623725)   0.48 days    Win 10    i7       No contact since that allocation. Stopped crunching?
3830290941    27-Feb-20    8623725)            	
3827620430    29-Feb-20    8879055    6.2  days    Win 10    i5       Last contact 12 Jan. Stopped crunching?
3826924227    25-Mar-20    8756342    1.21 days    Android    ?       Active, but many gaps in record.
3821828603    02-Mar-20    8871849    5.29 days    Win 10    i5       Last contact 5 Jan. Stopped crunching?
3821313504    26-Feb-20    8664947)   0.96 days    Win 10    Ryzen    Last contact 10 Feb. Stopped crunching?
3821313516    26-Feb-20    8664947)            	
3821313522    26-Feb-20    8664947)            	
3820902138    25-Feb-20    8665965    2.66 days    Win 7     i7       Last contact 6 Jan. Stopped crunching?
3819012955    15-Mar-20    8842969    2.75 days    Win 10    i7       Last contact 11 Jan. Stopped crunching?
3816054138                                                            Timed out/resent. Should return today.
3808676716    14-Mar-20    8873865    53.85 days   Win 10    i5       Host still active, but not crunching. Hit his own bad wingmate!
3783208510                                                            Timed out/resent. Should return 
Apart from one Android and one Turion, all of those are perfectly good crunchers - should have no problem with deadlines. No sign of an excessive cache amongst them. The biggest problem is people who sign up, then leave without cleaning up behind them. I'd say that supports a shorter (set of) deadlines - remember deadlines are variable.
ID: 2033755 · Report as offensive
Profile Tom M
Volunteer tester

Send message
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 2033756 - Posted: 24 Feb 2020, 12:38:05 UTC - in response to Message 2033404.  

And for extremely slow, rarely-on systems, 1 month is plenty of time for them to return a WU. It's actually plenty of time for them to return many WUs.
While having deadlines as short as one week wouldn't affect such systems, it would affect those that are having problems - be it hardware, internet, or power supply (fires, floods, storms etc). A 1-month deadline reduces the time it takes to clear a WU from the database, but still allows people time to recover from problems and not lose any of the work they have processed.


+1


+42
A proud member of the OFA (Old Farts Association).
ID: 2033756 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2033761 - Posted: 24 Feb 2020, 13:39:43 UTC - in response to Message 2033755.  
Last modified: 24 Feb 2020, 13:48:38 UTC

The biggest problem is people who sign up, then leave without cleaning up behind them. I'd say that supports a shorter (set of) deadlines - remember deadlines are variable.

Richard hit the target. Even the super slow hosts can crunch their WU in less than a month. The ones who don't are the ones with problems: ghosts, stopped crunching, hardware failure, etc.

I made a search (very painful due to the slow response from the servers) of my Validation pending (6380) and found something very close to Richard's findings: about 15% of them are WUs received and crunched at the beginning of January and still waiting for the wingmen. Of the ones I was able to follow (it takes a long time to show a single page), I could say with high confidence about 1/2 of them will not return before the deadline.

If we extrapolate that 7.5% to the DB size, we are talking about a huge number.
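The 7.5% figure comes from multiplying the two estimates in the post; applied to the pending count quoted there (the share figures are the poster's guesses, so the result inherits their uncertainty):

```python
# juan's estimate applied to his own pending list.
pending = 6380            # his validation-pending count
old_share = 0.15          # crunched in early January, still waiting
miss_deadline = 0.5       # of those, estimated never to return in time
stuck = pending * old_share * miss_deadline
print(stuck)              # ~478 WUs stuck in his pendings alone
```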

Then why not set the deadline to 1 month at most?

That will give the project admins extra time to think about a more permanent solution.

BTW, I still believe the only practical solution with the available server hardware & software is to limit the WU cache to something like 1 day's worth of the host's actual rate of returning valid tasks, for all hosts, fastest or slowest. And even that will only buy some extra time.

The real permanent solution is a complete update of the project. Better hardware would obviously help, but the bottleneck is the way the project works - still the same way as >20 years ago, when a single WU took more than a day to crunch and dial-up connections were the only ones available.

I have been trying to find one app from >20 years ago that is still running in the same way, and I was unable to; maybe someone knows of one and could share.
ID: 2033761 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Send message
Joined: 7 Mar 03
Posts: 22204
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2033765 - Posted: 24 Feb 2020, 14:21:35 UTC

I have been trying to find one app from >20 years ago that is still running in the same way, and I was unable to; maybe someone knows of one and could share.


I assume you mean "host" not "app", as the 20-year-old application was the Classic (pre-BOINC) one, and those results were collated a long time ago - probably just after BOINC burst onto the scene in 2003.

Grumpy Swede was, and possibly still is, using a Windows XP system, so that might be one of the oldest around.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2033765 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2033767 - Posted: 24 Feb 2020, 14:51:32 UTC - in response to Message 2033765.  

I have been trying to find one app from >20 years ago that is still running in the same way, and I was unable to; maybe someone knows of one and could share.


I assume you mean "host" not "app", as the 20 year old application was the Classic (pre-BOINC) and those results have been collated a long time ago - probably just after BOINC burst on the block in 2003.

Grumpy Swede was, and possibly still is, using a Windows XP system, so that might be one of the oldest around.

Yes, my mistake - you know my English is bad.

But you get the point: why insist on keeping the >2 month deadlines these days, when what we urgently need is to squeeze the DB size?
ID: 2033767 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2033768 - Posted: 24 Feb 2020, 15:02:21 UTC

We were reminded recently of Estimates and Deadlines revisited from 2008 (just before GPUs were introduced). That link drops you in on the final outcome, but here's a summary of Joe's table of deadlines:

Angle     Deadline (days from issue)
0.001     23.25
0.05      23.25 (VLAR)
0.0501    27.16
0.22548   16.85
0.22549   32.23
0.295     27.76
0.385     24.38
0.41      23.70 (common from Arecibo)
1.12744    7.00 (VHAR)
10         7.00
Since then, we've had two big increases in crunching time, due to increases in search sensitivity, and each has been accompanied by an extension of deadlines. So the table now looks something like:

Angle     Deadline
0.05      52.75 (VLAR)
0.425     53.39 (nearest from Arecibo)
1.12744   20.46 (VHAR)
So, deadlines overall have more than doubled since 2008, without any allowance for the faster average computer available now. I think we could safely halve the current figures, as the simplest adjustment.
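The "more than doubled" claim can be checked against the two angle ranges that appear in both tables above (Joe's 2008 figures and the current ones):

```python
# Ratio of current to 2008 deadlines (days) for the two angle ranges
# present in both tables.
deadlines_2008 = {"VLAR (0.05)": 23.25, "VHAR (1.12744)": 7.00}
deadlines_now = {"VLAR (0.05)": 52.75, "VHAR (1.12744)": 20.46}

for name, old in deadlines_2008.items():
    ratio = deadlines_now[name] / old
    print(f"{name}: x{ratio:.2f}")  # both ratios come out above 2
```

The VLAR deadline has grown by a factor of about 2.27 and the VHAR one by about 2.92, so halving the current figures would still leave both above their 2008 values.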
ID: 2033768 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2033770 - Posted: 24 Feb 2020, 15:11:01 UTC - in response to Message 2033768.  
Last modified: 24 Feb 2020, 15:31:24 UTC

So, deadlines overall have more than doubled since 2008, without any allowance for the faster average computer available now. I think we could safely halve the current figures, as the simplest adjustment.

Another reason to support the idea. Something must be done.
Anyway, any changes on the deadlines will take weeks to make effect.
Plenty of time to make any fine adjust if necessary.
ID: 2033770 · Report as offensive
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2033775 - Posted: 24 Feb 2020, 16:07:29 UTC
Last modified: 24 Feb 2020, 16:23:09 UTC

Increasing deadlines when workunits become slower to crunch doesn't really make much sense. No one needs a long deadline because their host takes that long to crunch a single task. The need for long deadlines arises from things that have nothing to do with single-task duration.

AstroPulse tasks have a 25-day deadline despite being many times slower to crunch than MultiBeam. Why not drop the deadline of all tasks to the same 25 days?

My validation time statistics say that 61% of the tasks that take longer than 25 days will eventually expire. And the remaining 39% are only 0.45% of all tasks. So currently fewer than one task in 200 would be returned in the time window that would be cut off by a deadline reduction to 25 days.

And I believe most of that 0.45% wouldn't really hit the new deadline, because users who currently run their computers in a way that makes tasks take that long would adapt to the new deadline and either reduce their cache sizes or increase the time they keep their computers powered on and crunching setiathome.
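Those two percentages also imply how big the slow tail is; the ~1.15% figure below is derived from the stated numbers rather than quoted from them:

```python
# Working backwards from the stats above: 39% of the >25-day tail
# equals 0.45% of all tasks, so the tail itself is ~1.15% of tasks.
late_returned_share = 0.0045   # returned after day 25 (the 0.45%)
survive_rate = 0.39            # share of the tail that doesn't expire
tail_share = late_returned_share / survive_rate
print(f"{tail_share:.2%}")     # ~1.15% of tasks take longer than 25 days
print(late_returned_share < 1 / 200)  # "fewer than one task in 200": True
```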
ID: 2033775 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Send message
Joined: 7 Mar 03
Posts: 22204
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2033781 - Posted: 24 Feb 2020, 17:53:22 UTC

How wrong can you be:
Let's take an extreme example:
A task takes 1 hour to run.
The deadline is 1.5 hours
Now the task complexity is changed and it takes 2 hours to run, but the deadline remains at 1.5 hours.
Therefore all tasks fail to complete within their deadline.
As I said that is a deliberately extreme example.

You do however raise a reasonable question - why indeed do AstroPulse tasks have a deadline of 25 days, when the much faster to compute (but much more common) MultiBeam tasks have a deadline of over 50 days? I suspect the logic behind that is lost in the mists of time (or Richard will pop up with the answer).
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2033781 · Report as offensive



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.