Completed Too Late To Validate - the Hostage AP 5.05 Work Units

Les

Joined: 20 May 99
Posts: 53
Credit: 21,062,237
RAC: 18
United States
Message 1228824 - Posted: 7 May 2012, 21:53:25 UTC
Last modified: 7 May 2012, 22:48:13 UTC

Apparently the hostage situation of AP work units has been resolved in favor of giving credit to those of us that completed those tasks in a timely manner but our results are flagged as invalid, Completed Too Late To Validate. So much for claims that the science is more important than the credit since the results reported in a timely manner are treated as errors and invalid while the tasks returned days or weeks after the deadline are validated. Backwards to me – the errors were on the part of the delinquent computers or users, those results should be the ones flagged to be invalid and in error.

While the number of work units being invalidated may be small on a per user basis, I am still offended that the reward for properly completing work in a timely manner is to be flagged invalid while those creating the conflict and errors are rewarded.

EDIT:

To summarize the original issue: There have been several posts on this topic but the short story is that many of us received AP work units when one or more computers failed to report results by the scheduled deadline. Sometime after, be that days or weeks late, the offending overdue results were reported before those with the replacement work unit could report. As a result the system validated the results using the late work unit and those of us that reported work in a timely manner had those results held hostage because no wing result could be generated or returned. These are some examples of threads discussing this problem:

http://setiathome.berkeley.edu/forum_thread.php?id=67959
http://setiathome.berkeley.edu/forum_thread.php?id=67740
http://setiathome.berkeley.edu/forum_thread.php?id=67382
ID: 1228824 · Report as offensive
Horacio

Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1228841 - Posted: 7 May 2012, 22:27:55 UTC - in response to Message 1228824.  

I think you are getting it wrong (or I'm not getting you ;D ).

"Too late to validate" means that those WUs were already validated, so this last result cannot be used to perform a validation (or something along those lines), but it does not mean that you won't get credit... If you look at the details you'll see you are being issued the same credits that were awarded to the first 2 wingmen instead of the usual zero credit from the other errors.



ID: 1228841 · Report as offensive
Les

Joined: 20 May 99
Posts: 53
Credit: 21,062,237
RAC: 18
United States
Message 1228851 - Posted: 7 May 2012, 22:45:28 UTC - in response to Message 1228841.  

In the case of the AP work units being held hostage this is a problem of flagging the wrong user as submitting results too late.

There have been several posts on this topic but the short story is that many of us received AP work units when one or more computers failed to report results by the scheduled deadline. Sometime after, be that days or weeks late, the offending overdue results were reported before those with the replacement work unit could report. As a result the system validated the results using the late work unit and those of us that reported work in a timely manner had those results held hostage because no wing result could be generated or returned. These are some examples of threads discussing this problem:

http://setiathome.berkeley.edu/forum_thread.php?id=67959
http://setiathome.berkeley.edu/forum_thread.php?id=67740
http://setiathome.berkeley.edu/forum_thread.php?id=67382

Earlier today I noticed that my tasks in this situation have been invalidated although credit had been granted.

I will copy the above into my original post to keep everything together and avoid misinterpretation.
ID: 1228851 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1228860 - Posted: 7 May 2012, 22:55:57 UTC - in response to Message 1228851.  

Thanks for the clear explanation, Les.

Can you hang on to that idea, please, and keep it (plus some sample data) for use in evidence. With this sudden breaking of the log-jam, and the rapid database purge settings at SETI, there's a danger that lessons for the future may be lost: we haven't even tried to work out whether the workunits have been held hostage by a SETI mis-configuration, or some flaw in the underlying BOINC code. I had hoped that when we got down to zero, the staff would have a chance to turn a keen analytic eye on the problem, but instead they seem to have swept it under the carpet - until the time comes for APv6 to be replaced.
ID: 1228860 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1228864 - Posted: 7 May 2012, 23:00:23 UTC

I agree that the "Completed, too late to validate" status looks like a black mark, but if BOINC managed to delete the canonical result before the last wingmate had uploaded and reported, it's the simple truth. I think the project staff is doing the civilized thing by granting credit.

I had noticed a large surplus of "Results waiting for db purging" before, that was an indication that the associated files had been deleted. What isn't clear was how that related to the large "Results returned and awaiting validation" count.

To blame those users whose BOINC core client failed to get the tasks completed before the deadline, one has to assume the user had done something wrong. I think in most cases it reflected a flaw in BOINC instead, and giving credit for the uploaded results which matched makes more sense than punishing them.
Joe
ID: 1228864 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1228870 - Posted: 7 May 2012, 23:11:18 UTC - in response to Message 1228864.  

I think this is a case where allowing the numbers to count down on the Server Status Page at the end of the run helped to clarify the cause. Having something in the low 40s of tasks 'in the field', with 12 thousand or more tasks awaiting validation, represents an unfeasibly large number of tasks per WU. I hope there's enough data left to enable some preventative code to be devised to prevent it happening again.
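The arithmetic behind "unfeasibly large" can be checked roughly as below. This is only a back-of-the-envelope sketch: 40 and 12,000 are rounded versions of the figures quoted above, the function name is made up for illustration, and it assumes every task legitimately awaiting validation is waiting on a wingmate that is still in progress.

```python
def min_waiting_per_wu(awaiting_validation: int, in_progress: int) -> float:
    """Average number of waiting tasks per workunit, assuming each
    in-progress task belongs to a different workunit (the most
    generous case for explaining the backlog)."""
    return awaiting_validation / in_progress


# Rounded figures from the Server Status Page as described above.
print(min_waiting_per_wu(12_000, 40))  # 300.0
```

With a normal quorum of two, one waiting task per workunit would be expected, so an average in the hundreds suggests most of those tasks were not waiting on anything still in the field.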
ID: 1228870 · Report as offensive
Les

Joined: 20 May 99
Posts: 53
Credit: 21,062,237
RAC: 18
United States
Message 1228880 - Posted: 7 May 2012, 23:16:27 UTC - in response to Message 1228860.  
Last modified: 7 May 2012, 23:37:53 UTC


Thank You Richard. I performed a screen capture of the Work Unit for tasks directly affecting me/my computer but these task/work units will be purged by this time tomorrow.

As I have complained in related posts, the additional problem is that since the task was validated without considering the work unit that was submitted in a timely manner, the assignment of the canonical result was flawed because valid work was ignored. While the differences between the actual analyses may be small, this precludes proper crediting of results because the best analysis was ignored.

As Josef pointed out, there was a decision made as to whom to punish for this problem. I prefer that the offender be punished, not the person or computer that played by the rules and reported results in a timely manner.
ID: 1228880 · Report as offensive
Wiggo
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1228889 - Posted: 7 May 2012, 23:22:26 UTC - in response to Message 1228870.  

I'll just keep my eye on these two v6 AP's to see if the same v505 problem occurs again. ;)

http://setiathome.berkeley.edu/workunit.php?wuid=953687194

http://setiathome.berkeley.edu/workunit.php?wuid=961583113

Cheers.
ID: 1228889 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1228900 - Posted: 7 May 2012, 23:40:09 UTC - in response to Message 1228880.  

Joe will be the best person to comment on this, but I don't think the validator has any concept of 'best' result, over and above the requirement that the 'canonical result' be strongly similar to at least one other result. Since, for the vast majority of WUs, there are only two results, and each is strongly similar to the other, there's an element of luck here.
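To illustrate the "element of luck" point, here is a minimal sketch of choosing a canonical result by similarity alone. It is not the actual SETI or BOINC validator: the function name, the comparison callback and the "first match in list order wins" rule are all assumptions made for the example.

```python
from typing import Callable, Optional, Sequence, TypeVar

R = TypeVar("R")


def pick_canonical(
    results: Sequence[R],
    strongly_similar: Callable[[R, R], bool],
) -> Optional[R]:
    """Return the first result that is strongly similar to at least one
    other result, or None if no pair agrees. There is no notion of a
    'best' result here: list order alone decides between matching results."""
    for i, candidate in enumerate(results):
        for j, other in enumerate(results):
            if i != j and strongly_similar(candidate, other):
                return candidate
    return None
```

With two results that agree with each other, whichever happens to come first is picked, so swapping the order swaps the canonical result.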

The only possible remaining element of 'punishment' here would be if, by some remote, remote chance, one of the affected WUs might contain the 'discovery' signal for some new pulsar or other astronomical phenomenon. If the problem has only been considered at the level of credit (Eric's Benevolence, as I once called it), isn't there a danger that somebody might be left off the list of co-authors when the scientific discovery paper comes to be written?
ID: 1228900 · Report as offensive
Interstel
Joined: 29 Nov 01
Posts: 23
Credit: 2,231,105
RAC: 0
United States
Message 1228932 - Posted: 8 May 2012, 0:40:49 UTC - in response to Message 1228889.  

I just had something similar occur on this one. Nearly 2 1/2 months after I submitted it, it was finally awarded credit but marked too late to validate, yet I had sent it back in about a week.

http://setiathome.berkeley.edu/result.php?resultid=2302170780

Joined SETI@Home in 2001
Online since ArpNET days
First activity on Honeywell 1648
Series Mainframe in 1975 at age 12.
ID: 1228932 · Report as offensive
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1229097 - Posted: 8 May 2012, 13:27:17 UTC - in response to Message 1228889.  

It looks like task 953687194 is in that very state.

I would have thought that returning a result after the deadline would mark the task as "too late to validate". Also, the validator running while there are tasks "in progress" doesn't make much sense. I imagine there is no logic to check for a condition that "shouldn't exist". However, it does exist. So hopefully the BOINC server dev guys have one or both defects in their queue for "will be fixed".
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
ID: 1229097 · Report as offensive
Link
Joined: 18 Sep 03
Posts: 834
Credit: 1,807,369
RAC: 0
Germany
Message 1229099 - Posted: 8 May 2012, 13:36:49 UTC - in response to Message 1229097.  

Well, it works for MB: if a result is returned after the deadline, the validator waits for the resend and validates all 3 (or more) at once, hence we don't have such issues with MB.

The optimal approach would be to pre-validate both results and, if they are OK, cancel the resend on the next scheduler request from the host that has it. BOINC can do that. Otherwise, if the resend gets reported on the next request, then revalidate all 3 together.
ID: 1229099 · Report as offensive
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1229103 - Posted: 8 May 2012, 13:44:12 UTC - in response to Message 1229099.  

It seems something would be better than the logic black hole these seem to fall into.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
ID: 1229103 · Report as offensive
MikeN

Joined: 24 Jan 11
Posts: 319
Credit: 64,719,409
RAC: 85
United Kingdom
Message 1229107 - Posted: 8 May 2012, 13:53:53 UTC - in response to Message 1229099.  

I don't like the idea of BOINC cancelling the resend, especially with APs. On my slower PC, an AP takes 20 hours to process. If I had been processing it for 19.5 hours when the WU was cancelled I would be most displeased about the waste of my bandwidth, electricity and PC resources, to put it mildly!
ID: 1229107 · Report as offensive
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19012
Credit: 40,757,560
RAC: 67
United Kingdom
Message 1229112 - Posted: 8 May 2012, 14:05:58 UTC - in response to Message 1229107.  

Cancelling resends is an available option within the BOINC server code, but it only cancels tasks NOT started, so you would be safe and get your credit if the task is started.
ID: 1229112 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1229126 - Posted: 8 May 2012, 14:25:23 UTC - in response to Message 1229107.  

BOINC has a couple of perfectly good bits of logic already, which normally cope with this without risk to the user.

Say two tasks are created and sent out - the normal WU here at SETI. Say one is crunched and returned on time, but the other is late and not returned by deadline.

As soon as the second task passes deadline, a replacement (third) task is created and put on the back of the queue for distribution - but it probably sits there for several hours. If the delayed task comes back in at this point, and validates, the third task can be cancelled without risk to anyone, before it has even been sent out.

A little later, and the third task has been allocated and downloaded, but often enough it will remain unstarted in the new wingmate's cache for several hours or days. Even at this stage, BOINC can and will cancel the task if the belated original copy is returned and validated. BOINC will only do that if the new wingmate contacts the servers, and the servers therefore know that work hasn't started crunching - no CPU time is wasted, only a bit of download bandwidth.

I think all the 'hostage' cases we're considering must be the final case: the original late copy is returned after the replacement has been created, allocated, downloaded, and crunching has already started. If that happens, BOINC lets work on the replacement continue until it has finished, which as we all know can be a long time for AP tasks.

I really don't know why things go wrong for these few, but not insignificant number of, AP cases. It might be that BOINC, in general, doesn't cope well with the long delays we're seeing here; or it might be that SETI's AP validator (specifically) puts the wrong marker on the files involved when the original two results are - belatedly - validated.
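The three stages described above can be sketched roughly as follows. This only illustrates the behaviour as described in this post; it is not BOINC server code, and the names ReplacementState and on_late_result_validated are invented for the example.

```python
from enum import Enum, auto


class ReplacementState(Enum):
    """Hypothetical stages of the replacement (third) task."""
    UNSENT = auto()          # created, still waiting in the distribution queue
    SENT_UNSTARTED = auto()  # downloaded, sitting unstarted in the new wingmate's cache
    STARTED = auto()         # crunching has already begun


def on_late_result_validated(state: ReplacementState) -> str:
    """What happens to the replacement once the late original result
    has been returned and has validated against its wingmate."""
    if state is ReplacementState.UNSENT:
        # Nobody has the task yet, so it can be dropped without risk.
        return "cancel before sending"
    if state is ReplacementState.SENT_UNSTARTED:
        # Only possible once the host contacts the servers and reports
        # that work has not started; costs download bandwidth only.
        return "cancel on next scheduler contact"
    # Work is under way: let it finish. This is the 'hostage' case.
    return "let it run to completion"
```

Only the last branch leaves a finished result arriving after the workunit has already validated, which is where the "too late to validate" marker comes in.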
ID: 1229126 · Report as offensive
Les

Joined: 20 May 99
Posts: 53
Credit: 21,062,237
RAC: 18
United States
Message 1229131 - Posted: 8 May 2012, 14:50:34 UTC
Last modified: 8 May 2012, 15:06:47 UTC

For the record, or as a reminder: this has actually been going on for many years. For a long time the programming assigned zero credit and invalidated work units that fell into this situation, as opposed to yesterday's approach of invalidating the work done but awarding credit. It will probably be difficult or impossible to reconstruct the ignored results from over the years should subsequent analysis confirm some discovery such as the aforementioned new pulsar or other astronomical phenomenon.

Over three years ago I had inquired about this problem and I am sure that others had done so as well and probably long before I encountered or at least noticed it. While the work units were being held hostage there was at least some hope of the hostages being freed and validated so this was not as much of an issue prior to yesterday.

http://setiathome.berkeley.edu/forum_thread.php?id=54250
or
http://setiathome.berkeley.edu/forum_thread.php?id=52765
ID: 1229131 · Report as offensive
Horacio

Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1229141 - Posted: 8 May 2012, 15:12:23 UTC

I don't get it...

I don't see how it is possible to miss valuable data due to this situation...
If there are 2 matching results that were validated and there is something different in the 3rd result, then this last result has to be the wrong one.

If some data is going to be lost, it's not because of this hostage issue. It will be because of the differences between apps for different hardware, where it can happen that the matching results are not the best ones. But that will happen anyway, hostages or not...

Is there something else that I'm not seeing/knowing?
ID: 1229141 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1229149 - Posted: 8 May 2012, 15:34:11 UTC - in response to Message 1229141.  

You're right: nothing is lost, scientifically. Tasks have validated, and from the validated tasks, a canonical result for the WU as a whole has been chosen. That's as good as it gets.

The rest of the concerns are for users who have lost something, or fear they might be at risk of losing something, or are worried they might lose something in the future.

Things like:
* reputation (tasks marked 'invalid')
* credits
* electricity (for a part-crunched task)
* scientific kudos (a name-check in a discovery paper)
ID: 1229149 · Report as offensive
SciManStev Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer tester
Joined: 20 Jun 99
Posts: 6651
Credit: 121,090,076
RAC: 0
United States
Message 1229170 - Posted: 8 May 2012, 19:34:38 UTC

I just found three 5.05 units that should have cleared last year, but finally came up invalid.
http://setiathome.berkeley.edu/results.php?hostid=5483835&offset=0&show_names=0&state=4&appid=

Steve
Warning, addicted to SETI crunching!
Crunching as a member of GPU Users Group.
GPUUG Website
ID: 1229170 · Report as offensive


 