Message boards : Number crunching : The Server Issues / Outages Thread - Panic Mode On! (119)
Unixchick · Joined: 5 Mar 12 · Posts: 815 · Credit: 2,361,516 · RAC: 22
I got some early resends for the April 20 deadline. What will happen on April 20? Let's say I have done the WU, and it was also done by one of the original computers: it has matched and validated. It now sits waiting for the other original tasks to hit the due date, and then what? Will it resend (wasteful, since it has already validated), or will it know it has validated and assimilate?

edit: some did not validate against the other WU, and they sit there, as they did not trigger a resend. So I'm hoping that the non-validated WUs will send out a wave of resends tomorrow.
Ville Saari · Joined: 30 Nov 00 · Posts: 1158 · Credit: 49,177,052 · RAC: 82,530
> Let's say I have done the WU, and it was also done by one of the original computers: it has matched and validated. It now sits waiting for the other original tasks to hit the due date, and then what? Will it resend (wasteful, since it has already validated), or will it know it has validated and assimilate?

If the workunit already has enough valid results to fill its quorum, then it will just wait for the remaining outstanding results to time out or be returned, and when all of them have done so, the workunit gets assimilated and purged.
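As a rough illustration of the life cycle Ville describes, here is a minimal sketch in Python. All names (`Workunit`, `Result`, `on_result_event`, the outcome strings) are invented for illustration; this is not the actual BOINC server code or schema.

```python
from dataclasses import dataclass

@dataclass
class Result:
    outcome: str                      # "valid", "invalid", or "in_progress"

@dataclass
class Workunit:
    quorum: int
    results: list
    canonical: Result = None

def assimilate(wu): print("assimilated")       # stand-in for the back end
def schedule_purge(wu): print("purge queued")  # deleted 'n' hours later

def on_result_event(wu):
    valid = [r for r in wu.results if r.outcome == "valid"]
    outstanding = [r for r in wu.results if r.outcome == "in_progress"]

    # Validator: once the quorum is filled, choose a canonical result.
    if wu.canonical is None and len(valid) >= wu.quorum:
        wu.canonical = valid[0]
        assimilate(wu)

    # Purge only once nothing is left in the field: every remaining
    # result has been returned or has timed out.
    if wu.canonical is not None and not outstanding:
        schedule_purge(wu)

on_result_event(Workunit(quorum=2, results=[Result("valid"), Result("valid"),
                                            Result("in_progress")]))
# -> assimilated; the purge waits for the in-progress result to return or time out
```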
Richard Haselgrove · Joined: 4 Jul 99 · Posts: 14690 · Credit: 200,643,578 · RAC: 874
> If the workunit already has enough valid results to fill its quorum, then it will just wait for the remaining outstanding results to time out or be returned, and when all of them have done so, the workunit gets assimilated and purged.

Looking at the current state of the SSP ("Workunits waiting for assimilation: 2"), I think we've returned to normal: workunits are assimilated as soon as possible after the validator has chosen a canonical result. But they are purged 'n' hours after the final replication has been declared 'over' - reported, timed out, cancelled or whatever. 'n' used to be 24 hours; during the non-normal times earlier this year it was substantially reduced.
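For reference, this purge delay is a server-side setting. A minimal sketch of how it might appear in a project's config.xml, assuming the db_purge daemon's `--min_age_days` flag (names from memory of the BOINC server docs; values and the `-d 2` debug level are illustrative):

```xml
<!-- Hypothetical daemon entry. --min_age_days sets how long a finished
     workunit lingers in the database before db_purge deletes it:
     1.0 would be the old 24-hour window; ~0.04 days is about 1 hour
     (assuming fractional days are accepted). -->
<daemon>
  <cmd>db_purge -d 2 --min_age_days 1.0</cmd>
</daemon>
```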
Ville Saari · Joined: 30 Nov 00 · Posts: 1158 · Credit: 49,177,052 · RAC: 82,530
> But they are purged 'n' hours after the final replication has been declared 'over' - reported, timed out, cancelled or whatever.

It was reduced to 1 hour, or near it, to reduce the database bloat, and it still hasn't been put back to 24 hours even though the database is nice and lean now. This makes it hard to see what happens to my results, because they disappear so fast after being returned :(
Joined: 23 Aug 99 · Posts: 962 · Credit: 537,293 · RAC: 9
> > But they are purged 'n' hours after the final replication has been declared 'over' - reported, timed out, cancelled or whatever.
>
> It was reduced to 1 hour, or near it, to reduce the database bloat, and it still hasn't been put back to 24 hours even though the database is nice and lean now. This makes it hard to see what happens to my results, because they disappear so fast after being returned :(

Agreed. When Eric is making his next adjustments, it would be great if he could put that back to 24 h. It might also be a good time to consider dropping the deadlines to something like 14 days for all future re-sends.

I also posted about "Anomalous Workunits that won't Validate without intervention" (https://setiathome.berkeley.edu/forum_thread.php?id=85459) a few days ago. I'm fairly sure that the 2 I noticed are not the only stuck workunits in the database where no _2 was generated after an abnormal exit.
Ville Saari · Joined: 30 Nov 00 · Posts: 1158 · Credit: 49,177,052 · RAC: 82,530
> It might also be a good time to consider dropping the deadlines to something like 14 days for all future re-sends.

Such a sudden reduction in deadlines could be problematic for many hosts, especially those that also crunch other projects that already have short deadlines. If such a host suddenly receives a big bunch of Seti resends with shorter-than-expected deadlines, it could find itself unable to crunch both them and the other projects' even-shorter-deadline tasks in time. Normal long-deadline Seti tasks let the client see in advance whether there is congestion coming and throttle the download of new tasks from the other projects. The bursty nature of this setiathome decay phase is already challenging for boinc client scheduling; a sudden deadline drop would make it even worse.

And most of the hosts that are likely to go MIA have already done so, and the remaining ones will likely stay to the bitter end, so the chance of a task actually hitting those long deadlines gets lower and lower, diminishing the potential gains from a deadline reduction.
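The congestion check Ville alludes to is, roughly, the client's earliest-deadline-first simulation of its queue. A toy sketch, with made-up numbers and an invented function name; the real client logic is far more detailed:

```python
# Toy earliest-deadline-first check, loosely modelling the BOINC client's
# work-fetch congestion test.
def would_miss_deadline(tasks, ncpus=1):
    """tasks: list of (remaining_hours, deadline_in_hours_from_now)."""
    clock = 0.0
    for remaining, deadline in sorted(tasks, key=lambda t: t[1]):
        clock += remaining / ncpus       # crunch in deadline order
        if clock > deadline:
            return True                  # this task cannot finish in time
    return False

# 330 h of newly arrived 14-day (336 h) Seti resends, plus 20 h of another
# project's work due in 24 h: the short-deadline task still fits, but the
# resends no longer do.
print(would_miss_deadline([(330, 336), (20, 24)]))  # True
```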
Joined: 17 Nov 00 · Posts: 90 · Credit: 76,455,865 · RAC: 735
Also, the data phase has been 20 years long. Will it really help things to swing a hammer around in hopes of reducing the taper from 6 weeks to 4? |
Joined: 27 May 99 · Posts: 5517 · Credit: 528,817,460 · RAC: 242
I'm betting they are trying to wrap it up so they can move the lot back into the closet... but that is just me...
Joined: 23 Aug 99 · Posts: 962 · Credit: 537,293 · RAC: 9
> Also, the data phase has been 20 years long. Will it really help things to swing a hammer around in hopes of reducing the taper from 6 weeks to 4?

It's one of the many configuration options in https://boinc.berkeley.edu/trac/wiki/ProjectOptions. I know that some of the other BOINC projects I have crunched for in the past sometimes use this option; I seem to remember that SIMAP used to use it regularly.

Of course, if SETI doesn't have Nebula ready yet and can wait a few more months for the last few results to time out and arrive eventually, then [Accelerating retries] might not be necessary here. The project staff could probably clean up the very last workunits locally as well.
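For the curious, the "Accelerating retries" knobs on that ProjectOptions page look roughly like this (option names as I recall them from the BOINC docs; the values are illustrative, not SETI's actual settings):

```xml
<!-- Illustrative config.xml excerpt: retries go preferentially to
     "reliable" hosts, with a shortened deadline. -->
<config>
  <!-- a host counts as reliable if it turns work around fast... -->
  <reliable_max_avg_turnaround>75600</reliable_max_avg_turnaround><!-- ~21 h -->
  <!-- ...with a low error rate -->
  <reliable_max_error_rate>0.001</reliable_max_error_rate>
  <!-- send high-priority work (retries) to reliable hosts first -->
  <reliable_on_priority>1</reliable_on_priority>
  <!-- cut the retry deadline to a fraction of the normal delay bound -->
  <reliable_reduced_delay_bound>0.25</reliable_reduced_delay_bound>
</config>
```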
Joined: 23 Aug 99 · Posts: 962 · Credit: 537,293 · RAC: 9
> > It might also be a good time to consider dropping the deadlines to something like 14 days for all future re-sends.
>
> Such a sudden reduction in deadlines could be problematic for many hosts. Especially those that also crunch other projects that already have...

I can see that it could be a problem for some. :) Both of my active machines have received plenty of SETI work today to last for several days. I have dropped the requests on Android to 2 days instead of 14; my Windows machine is still looking for up to 20 days of work, but a lot of the re-sends we are getting now are running close to the estimated time, instead of bombing out after 90 seconds.
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13913 · Credit: 208,696,464 · RAC: 304
> You can see that the "Results returned and awaiting validation" is now slightly below "Results out in the field" for SETI@home v8

Yeah, something I never thought I'd see. Now almost 54,000 below.
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13913 · Credit: 208,696,464 · RAC: 304
> Hence the suggestion for a reduction to 2 weeks, not a few days.

> It might also be a good time to consider dropping the deadlines to something like 14 days for all future re-sends.

> Such a sudden reduction in deadlines could be problematic for many hosts.

Only likely to be a problem (if it is a problem at all) for those with excessive cache settings and multiple projects. Seti only, huge cache: no problem. Multiple projects, small cache: no problem. Multiple projects and huge cache: maybe a problem. The BOINC manager would sort it out.
Stephen "Heretic" ![]() ![]() ![]() ![]() Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 ![]() ![]() |
> Hence the suggestion for a reduction to 2 weeks, not a few days.

> It might also be a good time to consider dropping the deadlines to something like 14 days for all future re-sends.

> Such a sudden reduction in deadlines could be problematic for many hosts.

. . +1

Stephen :)
Unixchick · Joined: 5 Mar 12 · Posts: 815 · Credit: 2,361,516 · RAC: 22
While the discussion here is about shortening deadlines, I think the solution the Seti team came up with is much more elegant. Looks like all WUs still out there will get a 3rd copy. Everyone still gets credit as long as it gets done before the deadline. I'm sure some WUs will get sent into a black hole, and at some later date they can either send out a 4th copy or do it themselves. Kudos to the Seti team for continuing to honor all the systems doing Seti (even the slow ones). After looking at other projects, I'm understanding and loving Seti all the more!
juan BFP · Joined: 16 Mar 07 · Posts: 9786 · Credit: 572,710,851 · RAC: 3,799
> Hence the suggestion for a reduction to 2 weeks, not a few days.

> It might also be a good time to consider dropping the deadlines to something like 14 days for all future re-sends.

> Such a sudden reduction in deadlines could be problematic for many hosts.

Please forgive me if I don't agree. Let me explain why.

I don't believe this is relevant for those who use large caches, mainly because the ones who know how to create such large caches are advanced users, so they know what they are doing. Most of the hosts in this category are well managed by their users and are fast hosts too, so they will adapt their crunching speed to the new deadlines. For example, my own host, which is not one of the fastest, is already running with 1/4 of its GPUs, and the rest are slowed down to 50% of their crunching capacity. AFAIK I use one of the largest caches around (up to 150K), and I don't foresee a problem with such a reduction. You need to remember that what you see is not what the host really has: if you see 2 GPUs, that doesn't mean it really has only 2.

On the other hand, the hosts that will have the bigger problems must be the slow ones that use the standard 150-WU cache with the 10+10 days configuration. A large quantity of them are left without any user interaction (just set & go); these hosts could have serious problems with this reduction. But the extra wingmen generated in the last few days practically end the problem. On that I agree with Unixchick, and I send applause to the Seti team for this solution.

my 0.02
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13913 · Credit: 208,696,464 · RAC: 304
> On the other hand, the hosts that will have the bigger problems must be the slow ones that use the standard 150-WU cache with the 10+10 days configuration. A large quantity of them are left without any user interaction (just set & go); these hosts could have serious problems with this reduction.

What problem? They downloaded work, then stopped contributing. Their work isn't coming back; they are the problem.
juan BFP · Joined: 16 Mar 07 · Posts: 9786 · Credit: 572,710,851 · RAC: 3,799
> > On the other hand, the hosts that will have the bigger problems must be the slow ones that use the standard 150-WU cache with the 10+10 days configuration.
>
> What problem? They downloaded work, then stopped contributing.

Exactly what you posted. 150 WUs per device on an unattended slow host with a 10+10 days cache and short deadlines will make it not return the work in time, at least until it has had time to self-adjust, which takes a while. The extra-wingmate solution is more elegant because it keeps all hosts, whether fast, slow or whatever, working until the end.
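Back-of-the-envelope arithmetic for Juan's point, with invented numbers for a slow unattended host:

```python
# Illustrative only: a slow host left on the default-ish settings.
tasks_cached = 150        # per device
hours_per_task = 8        # a slow CPU core on full-length tasks
cores = 2
days_to_drain = tasks_cached * hours_per_task / cores / 24
print(days_to_drain)      # 25.0 -> blows through a 14-day deadline,
                          # but fits inside the old long deadlines
```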
Stephen "Heretic" ![]() ![]() ![]() ![]() Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 ![]() ![]() |
> > On the other hand, the hosts that will have the bigger problems must be the slow ones that use the standard 150-WU cache with the 10+10 days configuration. A large quantity of them are left without any user interaction (just set & go); these hosts could have serious problems with this reduction.
>
> What problem? They downloaded work, then stopped contributing. Their work isn't coming back

. . Translation error ...

. . "Slow hosts using default settings, and so having a cache of 150 tasks per device with work requests set to 10+10 days. Many of them are allowed to run with the SETI defaults and no further interaction from the host owner (set & go)."

. . In other words, the silent majority, who are here in spirit only :) Not that I see this as a problem for them. If their excessive tasks time out, then so be it. I still believe in the 24-hour work cache. <shrug>

. . But to Juan: Grant acknowledged that those with large caches still doing only (or mainly) S@H work would not be at issue.

Stephen
Grant (SSSF) · Joined: 19 Aug 99 · Posts: 13913 · Credit: 208,696,464 · RAC: 304
> What irks me with the Server cancellations is they count as errors. Missing a deadline, yeah that should be an error.

> Yes, that's right. The client has to initiate the conversation, and the server will send the 'abort if unstarted' message in the reply. Then it's up to the client to check if it has started, and act accordingly.

> > Server operators have two options available to them:
>
> The server doesn't know which tasks are unstarted and which are not. The client reports the tasks it has on a scheduler request, but it doesn't tell their status: unstarted, running, and completed-but-not-yet-reported tasks all look the same. If the server offers operators the option to abort unstarted tasks, then the only way for that option to work is the server telling the client to do the aborting. Which means the aborting won't happen if the client is MIA.

But if the task is no longer needed, then cancel it, but have it count as a valid result with 0 credit and 0 input on task completion times. Not a computation error.
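A sketch of the client-side flow Ville and Grant are describing. The two list names mirror the scheduler-reply elements as I recall them from the BOINC protocol (`<result_abort>`, `<result_abort_if_not_started>`); treat them, and everything else here, as assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class Task:
    started: bool = False
    aborted: bool = False
    def abort(self):
        self.aborted = True

def handle_scheduler_reply(result_abort, result_abort_if_not_started, tasks):
    """tasks maps result name -> Task; both lists come from the reply."""
    for name in result_abort:
        tasks[name].abort()              # unconditional server cancel
    for name in result_abort_if_not_started:
        t = tasks.get(name)
        if t and not t.started:          # only the client knows this
            t.abort()                    # started tasks keep crunching

# Example: the server asks to abort both tasks if unstarted; only the
# unstarted one is aborted. An MIA client never runs this code at all,
# which is why its tasks can only time out.
tasks = {"wu_1_0": Task(started=True), "wu_2_1": Task()}
handle_scheduler_reply([], ["wu_1_0", "wu_2_1"], tasks)
print(tasks["wu_1_0"].aborted, tasks["wu_2_1"].aborted)  # False True
```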
juan BFP · Joined: 16 Mar 07 · Posts: 9786 · Credit: 572,710,851 · RAC: 3,799
> I still believe in the 24-hour work cache.

You are absolutely right.