The Server Issues / Outages Thread - Panic Mode On! (119)

Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2045840 - Posted: 19 Apr 2020, 16:17:40 UTC
Last modified: 19 Apr 2020, 16:22:24 UTC

I got some early resends from April 20. What will happen on April 20?

Let's say I have done the WU, and the WU was also done by one of the original computers. It has matched and validated. It sits waiting on the other original tasks to hit the due date, and then what? Will it resend (wasteful, since it has already validated), or will it know it has validated and assimilate?

edit: some did not validate with the other WU, and they sit there, as they did not trigger a resend. So I'm hoping that the non-validated WUs will send out a wave of resends tomorrow.
ID: 2045840
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2045846 - Posted: 19 Apr 2020, 16:29:20 UTC - in response to Message 2045840.  

Let's say I have done the WU, and the WU was also done by one of the original computers. It has matched and validated. It sits waiting on the other original tasks to hit the due date, and then what? Will it resend (wasteful, since it has already validated), or will it know it has validated and assimilate?
If the workunit already has enough valid results to fill its quorum, then it will just wait for the remaining outstanding results to time out or be returned and when all the remaining results have done so, the workunit gets assimilated and purged.
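For readers unfamiliar with the server pipeline, here is a minimal sketch of the lifecycle as Ville describes it (Richard refines the assimilate/purge timing in the next post). It is illustrative Python, not BOINC's actual C++ transitioner; the class and field names are invented, except target_nresults, which is a real column on BOINC's workunit table.

```python
# Illustrative sketch of the workunit lifecycle described above.
# Not BOINC's real transitioner (which is C++); names are invented,
# except target_nresults, which is a real workunit field.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Result:
    state: str  # 'in_progress', 'valid', 'invalid', or 'timed_out'

@dataclass
class Workunit:
    target_nresults: int                      # quorum size, e.g. 2
    results: List[Result] = field(default_factory=list)

    def quorum_met(self) -> bool:
        return sum(r.state == 'valid' for r in self.results) >= self.target_nresults

    def all_settled(self) -> bool:
        return all(r.state != 'in_progress' for r in self.results)

def transition(wu: Workunit) -> str:
    """What happens to a workunit each time the transitioner visits it."""
    if wu.quorum_met():
        # Quorum filled: no resend is ever generated. The WU just waits
        # for outstanding copies to be returned or time out, then is
        # assimilated (and later purged).
        return 'assimilate' if wu.all_settled() else 'wait'
    if wu.all_settled():
        # Quorum not met and nothing still in the field: issue a resend
        # (a new _2, _3, ... task).
        return 'resend'
    return 'wait'
```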
ID: 2045846
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14653
Credit: 200,643,578
RAC: 874
United Kingdom
Message 2045863 - Posted: 19 Apr 2020, 17:51:03 UTC - in response to Message 2045846.  

If the workunit already has enough valid results to fill its quorum, then it will just wait for the remaining outstanding results to time out or be returned and when all the remaining results have done so, the workunit gets assimilated and purged.
Looking at the current state of the SSP ("Workunits waiting for assimilation: 2"), I think we've returned to normal: workunits are assimilated as soon as possible after the Validator has chosen a canonical result. But they are purged 'n' hours after the final replication has been declared 'over' - reported, timed out, cancelled or whatever.

'n' used to be 24 hours: during the non-normal times earlier this year it was substantially reduced.
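In other words, assimilation follows validation almost immediately, while purging runs on its own clock. A toy sketch of the timing Richard describes, assuming a configurable delay n_hours (the function and parameter names are invented; the real logic lives in the server's database purger):

```python
from datetime import datetime, timedelta

def ready_to_purge(final_result_over_at: datetime, n_hours: float,
                   now: datetime) -> bool:
    # A workunit is purged n hours after its last replication is 'over'
    # (reported, timed out or cancelled). n was 24 in normal times and
    # was cut sharply during the outages earlier in the year.
    return now >= final_result_over_at + timedelta(hours=n_hours)
```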
ID: 2045863
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2045864 - Posted: 19 Apr 2020, 18:17:48 UTC - in response to Message 2045863.  

But they are purged 'n' hours after the final replication has been declared 'over' - reported, timed out, cancelled or whatever.
'n' used to be 24 hours: during the non-normal times earlier this year it was substantially reduced.
It was reduced to 1 hour, or near it, to reduce the database bloat, and it still hasn't been put back to 24 hours although the database is nice and lean now. This makes it hard to see what happens to my results, because they disappear so fast after being returned :(
ID: 2045864
Profile Keith T.
Volunteer tester
Joined: 23 Aug 99
Posts: 962
Credit: 537,293
RAC: 9
United Kingdom
Message 2045865 - Posted: 19 Apr 2020, 18:39:22 UTC - in response to Message 2045864.  

But they are purged 'n' hours after the final replication has been declared 'over' - reported, timed out, cancelled or whatever.
'n' used to be 24 hours: during the non-normal times earlier this year it was substantially reduced.
It was reduced to 1 hour, or near it, to reduce the database bloat, and it still hasn't been put back to 24 hours although the database is nice and lean now. This makes it hard to see what happens to my results, because they disappear so fast after being returned :(


Agreed

When Eric is making his next adjustments, it would be great if he could put them back to 24h.

It might also be a good time to consider dropping the deadlines to something like 14 days for all future re-sends.


I also posted about "Anomalous Workunits that won't Validate without intervention" https://setiathome.berkeley.edu/forum_thread.php?id=85459 a few days ago. I'm fairly sure that the two I noticed are not the only stuck workunits in the database with no _2 task generated after an abnormal exit.
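A hypothetical scan for the kind of stuck workunit Keith means; the real check would be a query against the BOINC database, and the state names here are illustrative, not the real schema:

```python
# Hypothetical scan for 'stuck' workunits: an original task exited
# abnormally, but no replacement (_2) task was ever generated, so the
# quorum can never be filled. Assumes wu objects exposing .results
# (each with a .state) and .target_nresults, as in the sketch above.

def find_stuck(workunits):
    stuck = []
    for wu in workunits:
        valid = sum(r.state == 'valid' for r in wu.results)
        in_flight = sum(r.state == 'in_progress' for r in wu.results)
        errored = sum(r.state == 'error' for r in wu.results)
        # An error occurred, and even if everything still in the field
        # validates, the quorum cannot be reached: a resend is missing.
        if errored and valid + in_flight < wu.target_nresults:
            stuck.append(wu)
    return stuck
```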
ID: 2045865
Ville Saari
Joined: 30 Nov 00
Posts: 1158
Credit: 49,177,052
RAC: 82,530
Finland
Message 2045871 - Posted: 19 Apr 2020, 19:36:48 UTC - in response to Message 2045865.  

It might also be a good time to consider dropping the deadlines to something like 14 days for all future re-sends.
Such a sudden reduction in deadlines could be problematic for many hosts, especially those that also crunch other projects that already have short deadlines. If such a host suddenly receives a big batch of SETI resends with shorter-than-expected deadlines, it could find itself unable to crunch both those and the other projects' even shorter deadline tasks in time. Normal long-deadline SETI tasks let the client see congestion coming and throttle downloads of new tasks from the other projects.
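Ville's point about the client seeing congestion in advance can be made concrete with a toy earliest-deadline-first feasibility check (the real client runs a much more elaborate simulation; the task counts and hours below are made up):

```python
# Toy single-core, earliest-deadline-first feasibility check showing
# why a burst of short-deadline resends can break an already-full queue.
def misses_a_deadline(tasks):
    """tasks: list of (hours_of_work, deadline_in_hours_from_now)."""
    clock = 0.0
    for work, deadline in sorted(tasks, key=lambda t: t[1]):
        clock += work
        if clock > deadline:
            return True
    return False

queue = [(5, 24 * 49)] * 20           # long-deadline SETI tasks: fine
print(misses_a_deadline(queue))       # False
queue += [(5, 24 * 2)] * 12           # sudden batch of 2-day resends
print(misses_a_deadline(queue))       # True -> client must throttle
```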

The bursty nature of this setiathome decay phase is already challenging for BOINC client scheduling. A sudden deadline drop would make it even worse.

And most of the hosts that are likely to go MIA have already done so, and the remaining ones will probably stay to the bitter end, so the chance of a task actually hitting those long deadlines keeps getting lower, diminishing the potential gains from a deadline reduction.
ID: 2045871
Profile doublechaz

Joined: 17 Nov 00
Posts: 90
Credit: 76,455,865
RAC: 735
United States
Message 2045878 - Posted: 19 Apr 2020, 20:25:22 UTC

Also, the data phase has been 20 years long. Will it really help things to swing a hammer around in hopes of reducing the taper from 6 weeks to 4?
ID: 2045878
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 2045884 - Posted: 19 Apr 2020, 21:18:20 UTC

I'm betting they are trying to wrap it up so they can move the lot back into the closet...but that is just me...
ID: 2045884
Profile Keith T.
Volunteer tester
Joined: 23 Aug 99
Posts: 962
Credit: 537,293
RAC: 9
United Kingdom
Message 2045885 - Posted: 19 Apr 2020, 21:19:39 UTC - in response to Message 2045878.  
Last modified: 19 Apr 2020, 21:43:24 UTC

Also, the data phase has been 20 years long. Will it really help things to swing a hammer around in hopes of reducing the taper from 6 weeks to 4?


It's one of the many configuration options in https://boinc.berkeley.edu/trac/wiki/ProjectOptions
I know that some of the other BOINC projects that I have crunched for in the past sometimes use this option.
I seem to remember that SIMAP used to use it regularly.

Of course, if SETI doesn't have Nebula ready yet and can wait a few more months for the last few results to time out or arrive eventually, then [Accelerating retries] might not be necessary here.
The Project Staff could probably clean up the very last Workunits locally as well.
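For the curious, "accelerating retries" amounts to routing resends to provenly fast hosts and giving them a shortened deadline. A sketch under assumed names and thresholds; the real knobs are project config options documented on the ProjectOptions page linked above:

```python
# Sketch of the 'accelerating retries' idea: resends go only to hosts
# with a fast, reliable track record, and carry a reduced deadline.
# Field names and thresholds here are illustrative, not BOINC's.

def reliable(host, max_turnaround_h=48.0, min_consecutive_valid=10):
    return (host.avg_turnaround_hours <= max_turnaround_h
            and host.consecutive_valid_results >= min_consecutive_valid)

def retry_deadline_hours(normal_deadline_h, reduction=0.25):
    # e.g. a ~7-week MB deadline would shrink to under two weeks
    return normal_deadline_h * reduction
```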
ID: 2045885
Profile Keith T.
Volunteer tester
Joined: 23 Aug 99
Posts: 962
Credit: 537,293
RAC: 9
United Kingdom
Message 2045887 - Posted: 19 Apr 2020, 21:33:24 UTC - in response to Message 2045871.  
Last modified: 19 Apr 2020, 21:36:40 UTC

It might also be a good time to consider dropping the deadlines to something like 14 days for all future re-sends.
Such a sudden reduction in deadlines could be problematic for many hosts, especially those that also crunch other projects that already have short deadlines. If such a host suddenly receives a big batch of SETI resends with shorter-than-expected deadlines, it could find itself unable to crunch both those and the other projects' even shorter deadline tasks in time. Normal long-deadline SETI tasks let the client see congestion coming and throttle downloads of new tasks from the other projects.

The bursty nature of this setiathome decay phase is already challenging for BOINC client scheduling. A sudden deadline drop would make it even worse.

And most of the hosts that are likely to go MIA have already done so, and the remaining ones will probably stay to the bitter end, so the chance of a task actually hitting those long deadlines keeps getting lower, diminishing the potential gains from a deadline reduction.


I can see that it could be a problem for some. :)
Both of my active machines have received plenty of SETI work today, enough to last for several days. I have dropped the requests on Android to 2 days instead of 14; my Windows machine is still looking for up to 20 days of work, but a lot of the re-sends we are getting now are running close to the estimated time, instead of bombing out after 90 seconds.
ID: 2045887
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13745
Credit: 208,696,464
RAC: 304
Australia
Message 2045896 - Posted: 19 Apr 2020, 22:23:33 UTC - in response to Message 2045835.  

You can see that the "Results returned and awaiting validation" is now slightly below "Results out in the field" for SETI@home v8
Yeah, something I never thought I'd see.
Now almost 54,000 below.
Grant
Darwin NT
ID: 2045896
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13745
Credit: 208,696,464
RAC: 304
Australia
Message 2045898 - Posted: 19 Apr 2020, 22:28:24 UTC - in response to Message 2045871.  

It might also be a good time to consider dropping the deadlines to something like 14 days for all future re-sends.
Such a sudden reduction in deadlines could be problematic for many hosts.
Hence the suggestion for a reduction to 2 weeks, not a few days.
Only likely to be a problem (if it is a problem at all) for those with excessive cache settings & multiple projects. Seti only, huge cache - no problem. Multiple projects, small cache - no problem. Multiple projects & huge cache - maybe a problem.
The BOINC manager would sort it out.
Grant
Darwin NT
ID: 2045898
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2045903 - Posted: 19 Apr 2020, 22:52:58 UTC - in response to Message 2045898.  

It might also be a good time to consider dropping the deadlines to something like 14 days for all future re-sends.
Such a sudden reduction in deadlines could be problematic for many hosts.
Hence the suggestion for a reduction to 2 weeks, not a few days.
Only likely to be a problem (if it is a problem at all) for those with excessive cache settings & multiple projects. Seti only, huge cache - no problem. Multiple projects, small cache - no problem. Multiple projects & huge cache - maybe a problem.
The BOINC manager would sort it out.


. . +1

Stephen

:)
ID: 2045903
Profile Unixchick Project Donor
Joined: 5 Mar 12
Posts: 815
Credit: 2,361,516
RAC: 22
United States
Message 2045906 - Posted: 19 Apr 2020, 23:15:53 UTC

While the discussion here is about shortening deadlines, I think the solution the seti team came up with is much more elegant.

Looks like all WUs still out there will get a 3rd copy. Everyone still gets credit as long as it gets done before the deadline. I'm sure some WUs will get sent into a black hole, and at some later date they can either send out a 4th copy or do it themselves.

Kudos to the seti team for continuing to honor all the systems doing seti (even the slow ones).

After looking at other projects, I'm understanding and loving seti all the more!
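Mechanically, the fix Unixchick describes would correspond to raising the replication target on each workunit that still has a task in the field, so the transitioner issues a third copy now instead of waiting for a timeout. A toy sketch, and a guess at the mechanics rather than anything confirmed by the project (target_nresults is a real workunit field; the rest is invented):

```python
# Toy version of the 'extra wingman' fix: bump the replication target on
# every workunit still waiting on a task in the field, so the transitioner
# generates a third copy now instead of waiting for the deadline to pass.
# Assumes wu objects with .results (each with .state) and .target_nresults.

def add_extra_wingman(workunits):
    for wu in workunits:
        valid = sum(r.state == 'valid' for r in wu.results)
        in_flight = any(r.state == 'in_progress' for r in wu.results)
        if in_flight and valid < wu.target_nresults:
            wu.target_nresults += 1   # transitioner will create a new task
```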
ID: 2045906
juan BFP Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2045907 - Posted: 19 Apr 2020, 23:17:48 UTC - in response to Message 2045903.  
Last modified: 19 Apr 2020, 23:22:45 UTC

It might also be a good time to consider dropping the deadlines to something like 14 days for all future re-sends.
Such a sudden reduction in deadlines could be problematic for many hosts.
Hence the suggestion for a reduction to 2 weeks, not a few days.
Only likely to be a problem (if it is a problem at all) for those with excessive cache settings & multiple projects. Seti only, huge cache - no problem. Multiple projects, small cache - no problem. Multiple projects & huge cache - maybe a problem.
The BOINC manager would sort it out.


. . +1

Stephen

:)

Please forgive me if I don't agree. Let me explain why.

I don't believe that is relevant for those who use large caches, mainly because the ones who know how to create such large caches are advanced users, so they know what they are doing. Most of the hosts in this category are well managed by their users, and they are fast hosts too, so they will adapt their crunching speed to the new deadlines. For example, my own host, which is not one of the fastest, is already running with 1/4 of its GPUs, and the rest are slowed down to 50% of their crunching capacity. AFAIK I use one of the largest caches around (up to 150K), and I don't foresee a problem with such a reduction. And remember, what you see is not what the host really has, so seeing 2 GPUs doesn't mean it really has only 2.

On the other hand, the hosts that will have bigger problems are the slow ones that use the standard 150 WU cache with the 10+10 day configuration. A large number of them are left running without any user interaction (just set & go), and those hosts could have serious problems with this reduction.

But the extra wingmen generated in the last few days practically end the problem. On that I agree with Unixchick, and I applaud the SETI team for this solution.

my 0.02
ID: 2045907
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13745
Credit: 208,696,464
RAC: 304
Australia
Message 2045909 - Posted: 19 Apr 2020, 23:24:01 UTC - in response to Message 2045907.  

On the other hand, the hosts that will have bigger problems are the slow ones that use the standard 150 WU cache with the 10+10 day configuration. A large number of them are left running without any user interaction (just set & go), and those hosts could have serious problems with this reduction.
What problem? They downloaded work, then stopped contributing.
Their work isn't coming back - they are the problem.
Grant
Darwin NT
ID: 2045909
juan BFP Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2045912 - Posted: 19 Apr 2020, 23:33:03 UTC - in response to Message 2045909.  

On the other hand, the hosts that will have bigger problems are the slow ones that use the standard 150 WU cache with the 10+10 day configuration. A large number of them are left running without any user interaction (just set & go), and those hosts could have serious problems with this reduction.
What problem? They downloaded work, then stopped contributing.
Their work isn't coming back- they are the problem.

Exactly what you posted: 150 WUs per device on an unattended slow host with a 10+10 day cache and short deadlines means it won't return the work in time, at least not until it has had time to self-adjust, which takes a while.

The extra-wingman solution is more elegant: it keeps all hosts working until the end, no matter whether they are fast, slow or whatever.
ID: 2045912
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 2045913 - Posted: 19 Apr 2020, 23:38:46 UTC - in response to Message 2045909.  

On the other hand, the hosts that will have bigger problems are the slow ones that use the standard 150 WU cache with the 10+10 day configuration. A large number of them are left running without any user interaction (just set & go), and those hosts could have serious problems with this reduction.
What problem? They downloaded work, then stopped contributing.
Their work isn't coming back
. . Translation error ...
. . "Slow hosts using default settings and so having a cache of 150 Tasks per device with work request set to 10+10. Many of them are allowed to run with the SETI defaults and no further interaction from the host owner (set & go)" . In other words, the silent majority who are here is spirit only :) Not that I see this as a problem for them. If their excessive tasks time out then so be it. I still believe in the 24 hours of work cache. <shrug>

. . But to Juan, Grant acknowledged that those with large caches still doing only (or mainly) S@H work would not be at issue.

Stephen
ID: 2045913
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13745
Credit: 208,696,464
RAC: 304
Australia
Message 2045914 - Posted: 19 Apr 2020, 23:51:02 UTC - in response to Message 2045767.  

Server operators have two options available to them:
* Abort if unstarted
* Abort unconditionally
The server doesn't know which tasks are unstarted and which are not. The client reports the tasks it has in each scheduler request, but it doesn't tell their status: unstarted, running, and completed-but-not-yet-reported tasks all look the same. If the server offers operators the option to abort unstarted tasks, then the only way for that option to work is for the server to tell the client to do the aborting. Which means the aborting won't happen if the client is MIA.
Yes, that's right. The client has to initiate the conversation, and the server will send the 'abort if unstarted' message in the reply. Then it's up to the client to check if it has started, and act accordingly.

It happens quite a lot if you have a large cache, and you receive a resend of a task which has timed out. If you're running a standard client (without prioritisation of resends), and the original owner gets their report in a little bit late, then the server can send a "didn't need" conditional kill message. Rare to see it for SETI MB tasks - more common for AP, and at other projects.
What irks me with the server cancellations is that they count as errors. Missing a deadline - yeah, that should be an error.
But if the task is no longer needed, then cancel it, but record it as a valid result with 0 credit and no effect on task completion time estimates. Not a computation error.
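To make the handshake in the quoted exchange concrete, here is a sketch of the client-side handling; the message and field names are invented (the real scheduler reply is XML and the client is C++):

```python
# Sketch of the 'abort if unstarted' handshake described above. The
# client initiates the scheduler RPC; the reply may carry conditional
# abort requests; the client checks each task's local state before acting.
# Message and field names are invented, not the real BOINC protocol.

def handle_scheduler_reply(reply: dict, tasks: dict) -> None:
    for name in reply.get('abort_if_not_started', []):
        task = tasks.get(name)
        if task and not task.started:          # conditional kill
            task.abort('server: result no longer needed')
    for name in reply.get('abort', []):        # unconditional kill
        task = tasks.get(name)
        if task:
            task.abort('server: task cancelled')
```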
Grant
Darwin NT
ID: 2045914
juan BFP Crowdfunding Project Donor * Special Project $75 donor * Special Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 2045915 - Posted: 19 Apr 2020, 23:52:16 UTC - in response to Message 2045913.  

I still believe in the 24 hours of work cache.
You are absolutely right.
ID: 2045915