Completed Too Late To Validate - the Hostage AP 5.05 Work Units
Author | Message |
---|---|
Les Joined: 20 May 99 Posts: 53 Credit: 21,062,237 RAC: 18 |
Apparently the hostage situation of AP work units has been resolved by giving credit to those of us who completed those tasks in a timely manner, but our results are flagged as invalid: "Completed, too late to validate". So much for claims that the science is more important than the credit, since the results reported in a timely manner are treated as errors and invalid while the tasks returned days or weeks after the deadline are validated. Backwards to me – the errors were on the part of the delinquent computers or users, so those results should be the ones flagged as invalid and in error. While the number of work units being invalidated may be small on a per-user basis, I am still offended that the reward for properly completing work in a timely manner is to be flagged invalid while those creating the conflict and errors are rewarded. EDIT: To summarize the original issue: There have been several posts on this topic, but the short story is that many of us received AP work units when one or more computers failed to report results by the scheduled deadline. Some time after, be that days or weeks late, the offending overdue results were reported before those with the replacement work unit could report. As a result the system validated the results using the late work unit, and those of us who reported work in a timely manner had those results held hostage because no wing result could be generated or returned. These are some examples of threads discussing this problem: http://setiathome.berkeley.edu/forum_thread.php?id=67959 http://setiathome.berkeley.edu/forum_thread.php?id=67740 http://setiathome.berkeley.edu/forum_thread.php?id=67382 |
Horacio Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0 |
I think you are getting it wrong (or I'm not getting you ;D ). "Too late to validate" means that those WUs were already validated, so this last result cannot be used to perform a validation (or something along those lines), but it does not mean that you won't get credit... If you look at the details you'll see you are being issued the same credits that were awarded to the first 2 wingmen instead of the usual zero credit from the other errors. |
Les Joined: 20 May 99 Posts: 53 Credit: 21,062,237 RAC: 18 |
In the case of the AP work units being held hostage, this is a problem of flagging the wrong user as submitting results too late. There have been several posts on this topic, but the short story is that many of us received AP work units when one or more computers failed to report results by the scheduled deadline. Some time after, be that days or weeks late, the offending overdue results were reported before those with the replacement work unit could report. As a result the system validated the results using the late work unit, and those of us who reported work in a timely manner had those results held hostage because no wing result could be generated or returned. These are some examples of threads discussing this problem: http://setiathome.berkeley.edu/forum_thread.php?id=67959 http://setiathome.berkeley.edu/forum_thread.php?id=67740 http://setiathome.berkeley.edu/forum_thread.php?id=67382 Earlier today I noticed that my tasks in this situation have been invalidated, although credit has been granted. I will copy the above into my original post to keep everything together and avoid misinterpretation. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14649 Credit: 200,643,578 RAC: 874 |
"In the case of the AP work units being held hostage this is a problem of flagging the wrong user as submitting results too late." Thanks for the clear explanation, Les. Can you hang on to that idea, please, and keep it (plus some sample data) for use in evidence. With this sudden breaking of the log-jam, and the rapid database purge settings at SETI, there's a danger that lessons for the future may be lost: we haven't even tried to work out whether the workunits have been held hostage by a SETI mis-configuration, or some flaw in the underlying BOINC code. I had hoped that when we got down to zero, the staff would have a chance to turn a keen analytic eye on the problem, but instead they seem to have swept it under the carpet - until the time comes for APv6 to be replaced. |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
I agree that the "Completed, too late to validate" status looks like a black mark, but if BOINC managed to delete the canonical result before the last wingmate had uploaded and reported, it's the simple truth. I think the project staff is doing the civilized thing by granting credit. I had noticed a large surplus of "Results waiting for db purging" before; that was an indication that the associated files had been deleted. What wasn't clear was how that related to the large "Results returned and awaiting validation" count. For those users whose BOINC core client failed to get the tasks completed before deadline to be blamed, one has to assume the user had done something wrong. I think in most cases it reflected a flaw in BOINC instead, and giving credit for the uploaded results which matched makes more sense than punishing them. Joe |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14649 Credit: 200,643,578 RAC: 874 |
"I agree that the 'Completed, too late to validate' status looks like a black mark, but if BOINC managed to delete the canonical result before the last wingmate had uploaded and reported it's the simple truth. I think the project staff is doing the civilized thing by granting credit." I think this is a case where allowing the numbers to count down on the Server Status Page at the end of the run helped to clarify the cause. Having something in the low 40s of tasks 'in the field', with 12 thousand or more tasks awaiting validation, represents an unfeasibly large number of tasks per WU. I hope there's enough data left to enable some preventative code to be devised to stop it happening again. |
Les Joined: 20 May 99 Posts: 53 Credit: 21,062,237 RAC: 18 |
Thank you, Richard. I performed a screen capture of the work unit for tasks directly affecting me/my computer, but these tasks/work units will be purged by this time tomorrow. As I have complained in related posts, the additional problem is that since the task was validated without considering the work unit that was submitted in a timely manner, the assignment of the canonical result was flawed because valid work was ignored. While the differences between the actual analyses may be small, it precludes proper crediting of results because the best analysis was ignored. As Josef pointed out, there was a decision made as to whom to punish for this problem. I prefer that the offender be punished, not the person or computer that played by the rules and reported results in a timely manner. |
Wiggo Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489 |
I'll just keep my eye on these two v6 AP's to see if the same v505 problem occurs again. ;) http://setiathome.berkeley.edu/workunit.php?wuid=953687194 http://setiathome.berkeley.edu/workunit.php?wuid=961583113 Cheers. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14649 Credit: 200,643,578 RAC: 874 |
Thanks for the clear explanation, Les. Joe will be the best person to comment on this, but I don't think the validator has any concept of 'best' result, over and above the requirement that the 'canonical result' be strongly similar to at least one other result. Since, for the vast majority of WUs, there are only two results, and each is strongly similar to the other, there's an element of luck here. The only possible remaining element of 'punishment' here would be if, by some remote, remote chance, one of the affected WUs might contain the 'discovery' signal for some new pulsar or other astronomical phenomenon. If the problem has only been considered at the level of credit (Eric's Benevolence, as I once called it), isn't there a danger that somebody might be left off the list of co-authors when the scientific discovery paper comes to be written? |
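The selection rule described in the post above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual BOINC or Astropulse validator: the names `pick_canonical` and `close_enough`, the use of plain floats as results, and the tolerance are all hypothetical.

```python
from typing import Callable, Optional, Sequence


def pick_canonical(
    results: Sequence[float],
    strongly_similar: Callable[[float, float], bool],
) -> Optional[int]:
    """Return the index of the first result that agrees with any other result.

    Hypothetical illustration only: real results are signal lists, not floats.
    """
    for i, candidate in enumerate(results):
        for j, other in enumerate(results):
            if i != j and strongly_similar(candidate, other):
                return i  # first match wins; nothing ranks one result as "better"
    return None  # no agreeing pair yet, so no canonical result can be chosen


def close_enough(a: float, b: float) -> bool:
    # Assumed tolerance, chosen only for this example.
    return abs(a - b) <= 0.01
```

With only two agreeing results, which one becomes canonical in this sketch depends purely on their order, which is the "element of luck" mentioned above.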
Interstel Joined: 29 Nov 01 Posts: 23 Credit: 2,231,105 RAC: 0 |
I just had something similar occur. Nearly 2 1/2 months after I submitted it, it was finally awarded credit but marked "too late to validate", yet I had sent it back in about a week. http://setiathome.berkeley.edu/result.php?resultid=2302170780 Joined SETI@Home in 2001. Online since ARPANET days. First activity on Honeywell 1648 Series Mainframe in 1975 at age 12. |
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
"I'll just keep my eye on these two v6 AP's to see if the same v505 problem occurs again. ;)" It looks like task 953687194 is in that very state. I would have thought that returning a result after the deadline would mark the task as "too late to validate". Also, the validator running while there are tasks "in progress" doesn't make much sense. I imagine there is no logic to check for a condition that "shouldn't exist". However, it does. So hopefully the BOINC server dev guys have one or both defects in their queue for "will be fixed". SETI@home classic workunits: 93,865 CPU time: 863,447 hours. Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu |
Link Joined: 18 Sep 03 Posts: 834 Credit: 1,807,369 RAC: 0 |
"I would have thought that returning a result after the deadline would mark the task as 'too late to validate'. Also the validator running while there are tasks 'in progress' doesn't make much sense. I imagine there is no logic to check for a condition that 'shouldn't exist'. However it does. So hopefully the BOINC server dev guys have one or both defects in their queue for 'will be fixed'." Well, it works for MB: if a result is returned after the deadline, the validator waits for the resend and validates all 3 (or more) at once, hence we don't have such issues with MB. Optimal would be to pre-validate both results and, if they are OK, cancel the resend on the next scheduler request of the host which has it. BOINC can do that. Otherwise, if it gets reported on the next request, then revalidate all 3 together. |
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
"I would have thought that returning a result after the deadline would mark the task as 'too late to validate'. Also the validator running while there are tasks 'in progress' doesn't make much sense. I imagine there is no logic to check for a condition that 'shouldn't exist'. However it does. So hopefully the BOINC server dev guys have one or both defects in their queue for 'will be fixed'." It seems something would be better than the logic black hole these seem to fall into. |
MikeN Joined: 24 Jan 11 Posts: 319 Credit: 64,719,409 RAC: 85 |
"I would have thought that returning a result after the deadline would mark the task as 'too late to validate'. Also the validator running while there are tasks 'in progress' doesn't make much sense. I imagine there is no logic to check for a condition that 'shouldn't exist'. However it does. So hopefully the BOINC server dev guys have one or both defects in their queue for 'will be fixed'." I don't like the idea of BOINC cancelling the resend, especially with APs. On my slower PC, an AP takes 20 hours to process. If I had been processing it for 19.5 hours when the WU was cancelled, I would be most displeased about the waste of my bandwidth, electricity and PC resources, to put it mildly! |
W-K 666 Joined: 18 May 99 Posts: 19012 Credit: 40,757,560 RAC: 67 |
"I would have thought that returning a result after the deadline would mark the task as 'too late to validate'. Also the validator running while there are tasks 'in progress' doesn't make much sense. I imagine there is no logic to check for a condition that 'shouldn't exist'. However it does. So hopefully the BOINC server dev guys have one or both defects in their queue for 'will be fixed'." Cancelling resends is an available option within the BOINC server code, but it only cancels tasks NOT started, so you would be safe and get your credit if the task is started. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14649 Credit: 200,643,578 RAC: 874 |
"I would have thought that returning a result after the deadline would mark the task as 'too late to validate'. Also the validator running while there are tasks 'in progress' doesn't make much sense. I imagine there is no logic to check for a condition that 'shouldn't exist'. However it does. So hopefully the BOINC server dev guys have one or both defects in their queue for 'will be fixed'." BOINC has a couple of perfectly good bits of logic already, which normally cope with this without risk to the user. Say two tasks are created and sent out - the normal WU here at SETI. Say one is crunched and returned on time, but the other is late and not returned by deadline. As soon as the second task passes deadline, a replacement (third) task is created and put on the back of the queue for distribution - but it probably sits there for several hours. If the delayed task comes back in at this point, and validates, the third task can be cancelled without risk to anyone, before it has even been sent out. A little later, and the third task has been allocated and downloaded, but often enough it will remain unstarted in the new wingmate's cache for several hours or days. Even at this stage, BOINC can and will cancel the task if the belated original copy is returned and validated. BOINC will only do that if the new wingmate contacts the servers, and the servers therefore know that work hasn't started crunching - no CPU time is wasted, only a bit of download bandwidth. I think all the 'hostage' cases we're considering must be the final case: the original late copy is returned after the replacement has been created, allocated, downloaded, and crunching has already started. If that happens, BOINC lets work on the replacement continue until it has finished, which as we all know can be a long time for AP tasks. I really don't know why things go wrong for these few, but not insignificant number of, AP cases. It might be that BOINC, in general, doesn't cope well with the long delays we're seeing here, or it might be that SETI's AP validator (specifically) puts the wrong marker on the files involved when the original two results are - belatedly - validated. |
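The three cases in the post above can be sketched as a small decision function. This is only an illustration of the behaviour as described in this thread, not BOINC server code; the enum, the function name, and the returned strings are hypothetical.

```python
from enum import Enum


class ReplacementState(Enum):
    QUEUED = "created, not yet sent out"
    UNSTARTED = "downloaded, not yet started"
    STARTED = "crunching has begun"


def action_when_late_result_validates(
    state: ReplacementState, host_contacted_server: bool
) -> str:
    """What happens to the replacement task once the late original validates."""
    if state is ReplacementState.QUEUED:
        return "cancel"  # never left the server, so nothing is wasted
    if state is ReplacementState.UNSTARTED:
        # The server only learns the task is unstarted when the host contacts it.
        return "cancel" if host_contacted_server else "wait for contact"
    return "run to completion"  # the 'hostage' case discussed in this thread
```

Only the last branch leaves a finished result arriving after its workunit has already validated, which is where the "too late to validate" cases in this thread appear to come from.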
Les Joined: 20 May 99 Posts: 53 Credit: 21,062,237 RAC: 18 |
For the record or as a reminder - this has actually been going on for many years. For a long time the programming assigned zero credit and invalidated work units that fell into this situation, as opposed to yesterday's approach of invalidating the work done but awarding credit. It will probably be difficult or impossible to reconstruct the ignored results from over the years should subsequent analysis confirm some discovery such as the aforementioned new pulsar or other astronomical phenomenon. Over three years ago I inquired about this problem, and I am sure that others had done so as well, probably long before I encountered or at least noticed it. While the work units were being held hostage there was at least some hope of the hostages being freed and validated, so this was not as much of an issue prior to yesterday. http://setiathome.berkeley.edu/forum_thread.php?id=54250 or http://setiathome.berkeley.edu/forum_thread.php?id=52765 |
Horacio Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0 |
I don't get it... I don't see how it is possible to miss valuable data due to this situation... If there are 2 matching results that were validated and there is something different in the 3rd result, then this last result has to be the wrong one. If some data is going to be lost, it's not because of this hostage issue. It will be because of the differences between apps for different hardware, where it can happen that the matching results are not the best ones. But that will happen anyway, hostages or not... Is there something else that I'm not seeing/knowing? |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14649 Credit: 200,643,578 RAC: 874 |
"I don't get it..." You're right: nothing is lost, scientifically. Tasks have validated, and from the validated tasks, a canonical result for the WU as a whole has been chosen. That's as good as it gets. The rest of the concerns are for users who have lost something, or fear they might be at risk of losing something, or are worried they might lose something in the future. Things like: reputation (tasks marked 'invalid'); credits; electricity (for a part-crunched task); scientific kudos (a name-check in a discovery paper). |
SciManStev Joined: 20 Jun 99 Posts: 6651 Credit: 121,090,076 RAC: 0 |
I just found three 5.05 units that should have cleared last year, but finally came up invalid. http://setiathome.berkeley.edu/results.php?hostid=5483835&offset=0&show_names=0&state=4&appid= Steve. Warning, addicted to SETI crunching! Crunching as a member of GPU Users Group. |