The Server Issues / Outages Thread - Panic Mode On! (118)

Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (118)

Previous · 1 . . . 90 · 91 · 92 · 93 · 94 · Next

Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2033737 - Posted: 24 Feb 2020, 9:43:47 UTC - in response to Message 2033719.  

For what it is worth, of the projects I participate in, Seti@Home has the most relaxed "due" schedule. Many of my other projects allow a week or less per task.
An experiment might be to reduce the deadline to, say, 2 weeks and see if the load drops off because tasks are not sitting idle in the DB.
If possible, could we split the deadlines, making the GPU tasks a week or under?
Tom


. . Tom! Are you after my nick? Talk like that might get you excommunicated. ...

Stephen

:)
ID: 2033737 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2033738 - Posted: 24 Feb 2020, 9:46:50 UTC - in response to Message 2033727.  

TomM suggests reducing the deadline to 2 weeks. I would vote for a less "ambitious" adjustment to the deadlines. I do observe that the AstroPulse tasks are issued with a 26-day deadline, as compared to the 60-day deadline for everything else. If the deadline were reduced to perhaps 40 or 50 days and allowed to remain there for a couple of months (i.e. long enough to stabilize to some sort of equilibrium), that ought to give the project some hard data on the effects on database issues and resend statistics. Then decide whether it was a mistake - and revert to the previous values - or decide it was a positive move and, perhaps, continue adjusting deadlines in similar small steps.


. . I would vote for 28 days myself. I remain convinced the project would be perfectly viable with an even shorter deadline, but in the spirit of compromise 28 days seems way more than sufficient.

Stephen

. .
ID: 2033738 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 2033739 - Posted: 24 Feb 2020, 9:49:42 UTC - in response to Message 2033734.  
Last modified: 24 Feb 2020, 9:55:48 UTC

So what does this mean?
A deadline reduction to 10 days would shove the resends through the roof, and turn a large number of currently useful-if-slow hosts into slow-but-useless hosts - they just wouldn't return their data in time and that data forms a very large proportion of the total
....
I don't believe that to be the case.
The reason there are so many systems that take so long to return work is the existing long deadlines. We are allowing it to occur. And even so, the average turnaround time at present is only 34 hours!

Even the slowest of the slow systems can return a WU within 2 days. Even allowing them to spend much of their time not actually processing work, or working on another project, they can still return the longest-to-process WU within a week. But people do have issues - power, comms, system etc.
So we set deadlines at 4 weeks. In that time even the slowest of the slow that spends most of its time powered off will still be able to return several WUs. And even if there are floods, fires, storms etc. that make it impossible for systems to return work within a week, that 28-day deadline will still let people return finished work before it times out.
Grant
Darwin NT
ID: 2033739 · Report as offensive
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2033740 - Posted: 24 Feb 2020, 9:58:56 UTC - in response to Message 2033734.  

I would suggest that the sweet-spot may be deadlines around 40-50 days, where the impact on the slowest hosts is probably about as low as one can reasonably expect.


. . Except that the majority of that 'delay' on the slow hosts is due not to their low productivity so much as their oversized caches. The reason they sit on tasks for 50 days is not that it takes them that long to process a task, but that WUs sit in 'in progress' status for weeks on end before the host gets around to processing them. Shortening the deadline, and if necessary reducing their work fetch limits, would eliminate that unnecessary period of WUs sitting in purgatory. To avoid large numbers of time-outs and system-imposed work allocation limits they would have to actually administer their hosts more responsibly and reduce their caches to a size that matches their level of productivity. What a shame that would be ...

Stephen

:(
ID: 2033740 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2033741 - Posted: 24 Feb 2020, 10:08:16 UTC - in response to Message 2033740.  

... oversized caches ...
Now there's a challenge! I'll have a look through some of my pendings later, and see how many of my wingmates fall into that category.
ID: 2033741 · Report as offensive
Grant (SSSF)
Volunteer tester

Send message
Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 2033743 - Posted: 24 Feb 2020, 10:18:51 UTC - in response to Message 2033740.  

. . Except that the majority of that 'delay' on the slow hosts is due not to their low productivity so much as their oversized caches. The reason they sit on tasks for 50 days is not that it takes them that long to process a task, but that WUs sit in 'in progress' status for weeks on end before the host gets around to processing them.
In theory, if a WU is processed it should be done within 20 days (10+10 for cache settings).*
Any longer than that, and still returned by that host, would most likely be due to outside factors (system, power, comms etc. issues), or a recently connected very slow host, possibly with more than one project at 10+10 cache settings, still figuring things out.



*Unless bunkering or other such user manipulation is at play.
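Grant's 20-day bound can be written out as a quick sketch. This is only an illustration of the arithmetic he describes, using BOINC's two cache preferences ("store at least" plus "store up to an additional" days of work) at their 10+10 maximum:

```python
# Rough sketch of the 20-day bound: with maximum BOINC cache settings
# (10 days "store at least" plus 10 days "store up to an additional"),
# a freshly fetched task queues behind at most a full cache of earlier
# work, so it should be started within the combined cache depth.
def worst_case_days_to_start(store_at_least_days, additional_days):
    return store_at_least_days + additional_days

print(worst_case_days_to_start(10, 10))  # 20
```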
Grant
Darwin NT
ID: 2033743 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Send message
Joined: 7 Mar 03
Posts: 22204
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2033747 - Posted: 24 Feb 2020, 11:10:29 UTC

Just take a look at the graph before making ANY assumption about "having no effect" or "because of long deadlines" - those two claims are totally and utterly WRONG.

The truth is, and some do not accept this, that SETI@Home has a POLICY of supporting a very wide range of computer performance, and human activity such as holidays, forgetting to stop a host, infrequent processing and so on. Twenty days would mean about 40% of the tasks sent out would have to be resent, and, as these are probably on hosts that only do a very small number of tasks per year, that means alienating a very large proportion of the user base, which according to many reports is shrinking - do you want to decimate that base overnight?
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2033747 · Report as offensive
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2033748 - Posted: 24 Feb 2020, 11:15:58 UTC
Last modified: 24 Feb 2020, 11:46:46 UTC

Reduced server side limits would have no effect on those super slow hosts but would hurt fast GPUs disproportionately. When the limit was 100 per GPU, my cache was limited to less than 2 hours, and I have just a cheap mid-range graphics card.

There is something fishy in the graph. I have monitored the validating time of my tasks for a long time and 95% of all my tasks have been validated within 5 days of me originally downloading the task. 98% in 13 days. So a two week deadline would force at most 2% of the tasks to be resent and in practice a lot less because people would adjust their caches.

I guess the graph shows a snapshot of the tasks in the validation queue. Such a snapshot would show a disproportionately high percentage of long-waiting tasks, as they are the ones that get 'stuck' in the queue, while the quickly validated ones don't wait there long enough to be seen.
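The snapshot bias described here is classic length-biased sampling, and a toy calculation shows how strong it is. The mix below (95% of tasks validating in about a day, 5% taking about 40 days) is an assumed illustration, not Ville's actual data:

```python
# Toy model of length-biased sampling in a queue snapshot.
# Assumed mix: 95 tasks that wait 1 day to validate, 5 that wait 40 days.
wait_days = [1.0] * 95 + [40.0] * 5

# The chance a task appears in an instantaneous snapshot of the queue
# is proportional to how long it sits there.
total_wait = sum(wait_days)
slow_share_in_snapshot = sum(d for d in wait_days if d > 1.0) / total_wait

print(f"{slow_share_in_snapshot:.0%}")  # the 5% of slow tasks fill ~68% of the snapshot
```

So a queue snapshot can make a small slow tail look like the dominant population, which is exactly the distortion being claimed about the graph.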

This is what the real validation time distribution looks like in graph form (x axis is days, y axis is percentage of tasks not validated yet):


The sudden drop at 55 days is the result of tasks expiring, being resent and then getting validated quickly, in a scaled-down version of a similar curve.
ID: 2033748 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Send message
Joined: 7 Mar 03
Posts: 22204
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2033750 - Posted: 24 Feb 2020, 11:34:13 UTC
Last modified: 24 Feb 2020, 11:36:55 UTC

Have you ever worked out how much it would "hurt" the super-fast hosts?

It's quite simple to do:
How many tasks per hour does ONE GPU get through?
Now work out the length of time a 150-task cache lasts for.
Now work out what percentage of a week that is.
Now let's see the "hurt" for a 4-hour and an 8-hour period where no tasks are sent to that GPU - remember that the first x minutes of that time is covered by the GPU's cache.

(I've not done the sums, but I think you will be amazed by how small the figure is - less than 10% (for the 8-hour period) is my first guess, but prove me wrong)
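Those four steps can be run as a sketch. The throughput figure (60 tasks/hour) is an assumption chosen for illustration, not a measured value for any particular GPU:

```python
# rob smith's back-of-envelope, assuming a fast GPU doing 60 tasks/hour
# and the 150-task server-side cache limit.
def weekly_hurt(tasks_per_hour, cache_size, outage_hours, week_hours=168):
    """Fraction of a week's output lost to one outage, after the cache
    covers the first part of it."""
    cache_hours = cache_size / tasks_per_hour         # how long the cache lasts
    dry_hours = max(0.0, outage_hours - cache_hours)  # time truly starved of work
    return dry_hours / week_hours                     # share of the week lost

print(f"{weekly_hurt(60, 150, 4):.1%}")  # 4-hour outage
print(f"{weekly_hurt(60, 150, 8):.1%}")  # 8-hour outage: well under the 10% guess
```

Under these assumptions the cache covers the first 2.5 hours, leaving roughly 0.9% and 3.3% of a week lost for the 4- and 8-hour outages respectively.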
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2033750 · Report as offensive
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2033753 - Posted: 24 Feb 2020, 12:00:50 UTC - in response to Message 2033750.  

Have you ever worked out how much it would "hurt" the super-fast hosts?
(I've not done the sums, but I think you will be amazed by how small the figure is - less than 10% (for the 8-hour period) is my first guess, but prove me wrong)
You would be amazed how high the figure is. Tuesday downtimes are not the only periods when the caches deplete. In the last couple of months we have had a lot of periods of throttled work generation where my host spends several hours getting nothing until it gets lucky and receives some work. And then again a long time of nothing. 150 tasks last about 3 hours on my GPU, but it's a slow one. High-end GPUs will probably crunch those 150 tasks in less than an hour.
ID: 2033753 · Report as offensive
Lazydude
Volunteer tester

Send message
Joined: 17 Jan 01
Posts: 45
Credit: 96,158,001
RAC: 136
Sweden
Message 2033754 - Posted: 24 Feb 2020, 12:28:20 UTC
Last modified: 24 Feb 2020, 12:33:27 UTC

Please remember:
10+10 days are per PROJECT,
so my conclusion: deadlines of 20 days + 10 days (for the gremlins).

I just tested with 10+10 and got 12.5 days of work in one request from Einstein,
150+150 WUs from SETI,
and 1000 units (3 days' worth) from Asteroids before I set NNT.
So in a couple of days I suspect I will have a lot of tasks in high-priority mode.
ID: 2033754 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2033755 - Posted: 24 Feb 2020, 12:36:11 UTC - in response to Message 2033741.  

Now there's a challenge!
Took a sample from my host 7118033 - single GTX 1050 Ti, cache 0.8 days, turnround 0.81 days, 82 tasks in progress, 120 pending. 100 of those pending were less than 4 weeks old. Here are the other 20.

Workunit      Deadline     Wingmate   Turnround    Platform  CPU    
3843402125    10-Mar-20    8011299    1.77 days    Ubuntu    i7       Block of ghosts on that day? Later work returned normally.
3838811280    06-Mar-20    6834070    0.06 days    Darwin    i5       No contact since that allocation. Stopped crunching?
3835694801    04-Mar-20    8882763    0.04 days    Win 10    Ryzen 5  No contact since that allocation. Stopped crunching?
3833579833    06-Mar-20    7862206)  17.02 days    Darwin    i7       Only contacts once a week. Nothing since 29 Jan
3833579839    03-Mar-20    7862206)            	
3831370022    02-Mar-20    8504851)         n/a    Win 7     Turion   Never re-contacted
3831369958    02-Mar-20    8504851)            	
3830290903    02-Mar-20    8623725)   0.48 days    Win 10    i7       No contact since that allocation. Stopped crunching?
3830290941    27-Feb-20    8623725)            	
3827620430    29-Feb-20    8879055    6.2  days    Win 10    i5       Last contact 12 Jan. Stopped crunching?
3826924227    25-Mar-20    8756342    1.21 days    Android    ?       Active, but many gaps in record.
3821828603    02-Mar-20    8871849    5.29 days    Win 10    i5       Last contact 5 Jan. Stopped crunching?
3821313504    26-Feb-20    8664947)   0.96 days    Win 10    Ryzen    Last contact 10 Feb. Stopped crunching?
3821313516    26-Feb-20    8664947)            	
3821313522    26-Feb-20    8664947)            	
3820902138    25-Feb-20    8665965    2.66 days    Win 7     i7       Last contact 6 Jan. Stopped crunching?
3819012955    15-Mar-20    8842969    2.75 days    Win 10    i7       Last contact 11 Jan. Stopped crunching?
3816054138                                                            Timed out/resent. Should return today.
3808676716    14-Mar-20    8873865    53.85 days   Win 10    i5       Host still active, but not crunching. Hit his own bad wingmate!
3783208510                                                            Timed out/resent. Should return 
Apart from one Android and one Turion, all of those are perfectly good crunchers - should have no problem with deadlines. No sign of an excessive cache amongst them. The biggest problem is people who sign up, then leave without cleaning up behind them. I'd say that supports a shorter (set of) deadlines - remember deadlines are variable.
ID: 2033755 · Report as offensive
Profile Tom M
Volunteer tester

Send message
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 2033756 - Posted: 24 Feb 2020, 12:38:05 UTC - in response to Message 2033404.  

And for extremely slow, rarely-on systems, 1 month is plenty of time for them to return a WU. It's actually plenty of time for them to return many WUs.
While having deadlines as short as one week wouldn't affect such systems, it would affect those that are having problems - be it hardware, internet, or power supply (fires, floods, storms etc). A 1-month deadline reduces the time it takes to clear a WU from the database, but still allows people time to recover from problems and not lose any of the work they have processed.


+1


+42
A proud member of the OFA (Old Farts Association).
ID: 2033756 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2033761 - Posted: 24 Feb 2020, 13:39:43 UTC - in response to Message 2033755.  
Last modified: 24 Feb 2020, 13:48:38 UTC

The biggest problem is people who sign up, then leave without cleaning up behind them. I'd say that supports a shorter (set of) deadlines - remember deadlines are variable.

Richard hit the target. Even the super slow hosts can crunch their WU in less than a month. The ones who don't are the ones with problems: ghosts, stopped crunching, hardware failure, etc.

I made a search (very painful due to the slow response from the servers) of my Validation pending (6380) and found something very close to Richard's findings: about 15% of them are WUs received and crunched at the beginning of January and still waiting for the wingmen. Of the ones I was able to follow (it takes a long time to show a single page), I could say with high confidence about 1/2 of them will not return before the deadline.

If we extrapolate that 7.5% to the DB size, we are talking about a huge number.
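The 7.5% figure comes from multiplying the two estimates in the post; applied to the pending count quoted there (the share figures are the poster's guesses, so the result inherits their uncertainty):

```python
# juan's estimate applied to his own pending list.
pending = 6380            # his validation-pending count
old_share = 0.15          # crunched in early January, still waiting
miss_deadline = 0.5       # of those, estimated never to return in time
stuck = pending * old_share * miss_deadline
print(stuck)              # ~478 WUs stuck in his pendings alone
```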

Then why not set the deadline to 1 month at most?

That will give the project admins extra time to think about a more permanent solution.

BTW, I still believe the only practical solution with the available server hardware & software is to limit the WU cache to something like 1 day's worth of the host's actual rate of returning valid tasks, for all hosts, fastest or slowest. And even that will only buy some extra time.

The real permanent solution is a complete update of the project. Better hardware would obviously help, but the bottleneck is the way the project works - still the same way as >20 years ago, when a single WU took more than a day to crunch and dial-up connections were the only ones available.

I have been trying to find one app from >20 years ago that is still running in the same way, and I was unable to; maybe someone knows of one and could share.
ID: 2033761 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Send message
Joined: 7 Mar 03
Posts: 22204
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2033765 - Posted: 24 Feb 2020, 14:21:35 UTC

I have been trying to find one app from >20 years ago that is still running in the same way, and I was unable to; maybe someone knows of one and could share.


I assume you mean "host" not "app", as the 20-year-old application was the Classic (pre-BOINC) one, and those results were collated a long time ago - probably just after BOINC burst onto the scene in 2003.

Grumpy Swede was, and possibly still is, using a Windows XP system, so that might be one of the oldest around.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2033765 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2033767 - Posted: 24 Feb 2020, 14:51:32 UTC - in response to Message 2033765.  

I have been trying to find one app from >20 years ago that is still running in the same way, and I was unable to; maybe someone knows of one and could share.


I assume you mean "host" not "app", as the 20 year old application was the Classic (pre-BOINC) and those results have been collated a long time ago - probably just after BOINC burst on the block in 2003.

Grumpy Swede was, and possibly still is, using a Windows XP system, so that might be one of the oldest around.

Yes, my mistake - you know my English is bad.

But you get the point: why insist on keeping the >2 month deadlines these days, when what we urgently need is to squeeze the DB size?
ID: 2033767 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2033768 - Posted: 24 Feb 2020, 15:02:21 UTC

We were reminded recently of Estimates and Deadlines revisited from 2008 (just before GPUs were introduced). That link drops you in on the final outcome, but here's a summary of Joe's table of deadlines:

Angle     Deadline (days from issue)
0.001     23.25
0.05      23.25 (VLAR)
0.0501    27.16
0.22548   16.85
0.22549   32.23
0.295     27.76
0.385     24.38
0.41      23.70 (common from Arecibo)
1.12744    7.00 (VHAR)
10         7.00
Since then, we've had two big increases in crunching time, due to increases in search sensitivity, and each has been accompanied by an extension of deadlines. So the table now looks something like:

Angle     Deadline
0.05      52.75 (VLAR)
0.425     53.39 (nearest from Arecibo)
1.12744   20.46 (VHAR)
So, deadlines overall have more than doubled since 2008, without any allowance for the faster average computer available now. I think we could safely halve the current figures, as the simplest adjustment.
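The "more than doubled" claim can be checked against the two angle ranges that appear in both tables above (Joe's 2008 figures and the current ones):

```python
# Ratio of current to 2008 deadlines (days) for the two angle ranges
# present in both tables.
deadlines_2008 = {"VLAR (0.05)": 23.25, "VHAR (1.12744)": 7.00}
deadlines_now = {"VLAR (0.05)": 52.75, "VHAR (1.12744)": 20.46}

for name, old in deadlines_2008.items():
    ratio = deadlines_now[name] / old
    print(f"{name}: x{ratio:.2f}")  # both ratios come out above 2
```

The VLAR deadline has grown by a factor of about 2.27 and the VHAR one by about 2.92, so halving the current figures would still leave both above their 2008 values.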
ID: 2033768 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2033770 - Posted: 24 Feb 2020, 15:11:01 UTC - in response to Message 2033768.  
Last modified: 24 Feb 2020, 15:31:24 UTC

So, deadlines overall have more than doubled since 2008, without any allowance for the faster average computer available now. I think we could safely halve the current figures, as the simplest adjustment.

Another reason to support the idea. Something must be done.
Anyway, any changes on the deadlines will take weeks to make effect.
Plenty of time to make any fine adjust if necessary.
ID: 2033770 · Report as offensive
Ville Saari
Avatar

Send message
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2033775 - Posted: 24 Feb 2020, 16:07:29 UTC
Last modified: 24 Feb 2020, 16:23:09 UTC

Increasing deadlines when workunits become slower to crunch doesn't really make much sense. No one needs a long deadline because their host takes that long to crunch a single task. The need for long deadlines arises from things that have nothing to do with single-task duration.

AstroPulse tasks have a 25-day deadline despite being many times slower to crunch than MultiBeam. Why not drop the deadline of all tasks to the same 25 days?

My validation time statistics say that 61% of the tasks that take longer than 25 days will eventually expire. And the remaining 39% are only 0.45% of all tasks. So currently fewer than one task in 200 would be returned in the time window that would be cut off by a deadline reduction to 25 days.

And I believe most of that 0.45% wouldn't really hit the new deadline, because users who currently run their computers in a way that makes tasks take that long would adapt to the new deadline and either reduce their cache sizes or increase the time they keep their computers powered on and crunching setiathome.
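Those two percentages also imply how big the slow tail is; the ~1.15% figure below is derived from the stated numbers rather than quoted from them:

```python
# Working backwards from the stats above: 39% of the >25-day tail
# equals 0.45% of all tasks, so the tail itself is ~1.15% of tasks.
late_returned_share = 0.0045   # returned after day 25 (the 0.45%)
survive_rate = 0.39            # share of the tail that doesn't expire
tail_share = late_returned_share / survive_rate
print(f"{tail_share:.2%}")     # ~1.15% of tasks take longer than 25 days
print(late_returned_share < 1 / 200)  # "fewer than one task in 200": True
```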
ID: 2033775 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Send message
Joined: 7 Mar 03
Posts: 22204
Credit: 416,307,556
RAC: 380
United Kingdom
Message 2033781 - Posted: 24 Feb 2020, 17:53:22 UTC

How wrong can you be:
Let's take an extreme example:
A task takes 1 hour to run.
The deadline is 1.5 hours
Now the task complexity is changed and it takes 2 hours to run, but the deadline remains at 1.5 hours.
Therefore all tasks fail to complete within their deadline.
As I said that is a deliberately extreme example.

You do however raise a reasonable question - why indeed do AstroPulse tasks have a deadline of 25 days, when the much faster to compute (but much more common) MultiBeam tasks have a deadline of over 50 days? I suspect the logic behind that is lost in the mists of time (or Richard will pop up with the answer).
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 2033781 · Report as offensive



 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.