Suddenly BOINC Decides to Abandon 71 APs...WTH?

jason_gee (Volunteer developer, Volunteer tester), Joined: 24 Nov 06, Posts: 7489, Credit: 91,093,184, RAC: 0
Message 1696547 - Posted: 28 Jun 2015, 19:57:08 UTC - in response to Message 1696540. Last modified: 28 Jun 2015, 20:04:40 UTC

The concept behind the coding of BOINC is that it should be fault-tolerant, but cheating-intolerant. The problem here is that faults are being sent down the cheaters' pathway, which is far from ideal for anyone. The question is, what needs to change to route them down a fault-tolerant pathway?

My inclination would be for the scheduler to simply take no action at all on an out-of-sequence request, other than perhaps to send a response back to the requesting host that such a request was received. It would neither accept any reported completed tasks nor send out any new tasks when the request is out of sequence, and it certainly wouldn't abort everything in progress without alerting the host to that action.

I tend to agree that inaction, treating it as an authentication failure (we are still in the request authentication code), might be the better option. At the same time there is still the option of correcting the host, or perhaps a simple 'say what?' would do. I do think the timescales noted are worryingly beyond reasonable duration. The transactions should be atomic (all or nothing), there should be some way to keep track of time, and a request should have no effect upon authentication failure for any reason. Having any effect during/after an authentication failure could be considered a security vulnerability.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
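To make the "all or nothing" idea concrete, here is a minimal sketch of a request handler that touches no host or task state until every check has passed. It is not BOINC's scheduler code: `HostRecord`, `Request`, and `handle_request` are hypothetical names invented for illustration, and the rule that a lower `rpc_seqno` means "out of sequence" is an assumption taken from the discussion above.

```python
from dataclasses import dataclass, field


@dataclass
class HostRecord:
    """Hypothetical server-side record for one host."""
    host_id: int
    rpc_seqno: int
    in_progress: set = field(default_factory=set)


@dataclass
class Request:
    """Hypothetical scheduler request sent by a client."""
    host_id: int
    rpc_seqno: int
    completed: list = field(default_factory=list)


def handle_request(hosts: dict, req: Request) -> str:
    # Phase 1: validation only. Nothing is modified in this phase.
    host = hosts.get(req.host_id)
    if host is None:
        return "unknown host: no action taken"
    if req.rpc_seqno < host.rpc_seqno:
        # Assumed policy: acknowledge the request, but neither accept
        # reported tasks, send new work, nor abandon anything.
        return "out-of-sequence request received: no action taken"

    # Phase 2: the request is accepted, so state changes are now allowed.
    host.in_progress.difference_update(req.completed)
    host.rpc_seqno = req.rpc_seqno
    return "ok"
```

With this shape, a delayed request that arrives after a newer one has been processed gets a reply but leaves the host's in-progress tasks exactly as they were.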

Jeff Buck (Volunteer tester), Joined: 11 Feb 00, Posts: 1441, Credit: 148,764,870, RAC: 0
Message 1696549 - Posted: 28 Jun 2015, 19:59:52 UTC - in response to Message 1696543. Last modified: 28 Jun 2015, 20:02:11 UTC

Ah, good morning, Jason! Happy to provide some grist for your mill. ;^)

I think any of my tests last night where I was resetting the rpc_seqno to a lower number also always followed at least one manual update with the higher number. However, I don't recall doing any where I wasn't also tinkering with the hostid field, since that was the primary focus. Could it have been that the scheduler was dealing with the missing hostid first, and then ignoring the rpc_seqno once it finished correcting the hostid?

jason_gee (Volunteer developer, Volunteer tester), Joined: 24 Nov 06, Posts: 7489, Credit: 91,093,184, RAC: 0
Message 1696553 - Posted: 28 Jun 2015, 20:10:48 UTC - in response to Message 1696549.

Yes, there are the multiple conditions required there, which is when I got the image of space shuttle O-rings being connected to the same piece of metal. The exact sequence can go to one of two places the abandonments occur, and either one or both of them could need attention.

My current feeling is that the server shouldn't be doing anything to host records, or associated tasks, until authentication is completed successfully. You won't get your front door open if you insert a half-sucked lozenge into the lock before the key. This feels like a claymore connected to a lozenge detector.

[Edit:] OMG this code has goto statements in it; how quaint!

Jeff Buck (Volunteer tester), Joined: 11 Feb 00, Posts: 1441, Credit: 148,764,870, RAC: 0
Message 1696555 - Posted: 28 Jun 2015, 20:17:50 UTC - in response to Message 1696553.

Quoting Message 1696553:
Yes, there are the multiple conditions required there, which is when I got the image of space shuttle O-rings being connected to the same piece of metal. The exact sequence can go to one of two places the abandonments occur, and either one or both of them could need attention. My current feeling is that the server shouldn't be doing anything to host records, or associated tasks, until authentication is completed successfully. You won't get your front door open if you insert a half-sucked lozenge into the lock before the key. This feels like a claymore connected to a lozenge detector. [Edit:] OMG this code has goto statements in it; how quaint!

Nothing like spaghetti code to really make things interesting. Sort of like Alice sliding down the rabbit hole!

TBar (Volunteer tester), Joined: 22 May 99, Posts: 5204, Credit: 840,779,836, RAC: 2,768
Message 1696558 - Posted: 28 Jun 2015, 20:22:13 UTC

I went to check on Dave and Beta says it's down for Maintenance. Maintenance on a Sunday?

jason_gee (Volunteer developer, Volunteer tester), Joined: 24 Nov 06, Posts: 7489, Credit: 91,093,184, RAC: 0
Message 1696559 - Posted: 28 Jun 2015, 20:28:02 UTC - in response to Message 1696558. Last modified: 28 Jun 2015, 20:37:48 UTC

I think Raistmer's got some new apps out, and Eric's time juggling again. Just a theory.

[Edit:] There is the email I sent, enquiring about RAID status etc. Could be some sort of integrity checks. Will probably have to update with the new developments here.

[Edit2:] Updated Eric: Update: apparently Jeff Buck has been able to trigger the abandonment with some persistence. It'll probably take a while to verify the mechanism. Looks like either the detach/reattach or host migration sensing has some itchy trigger finger, fiddling with hosts/tasks before authentication is complete.

Richard Haselgrove (Volunteer tester), Joined: 4 Jul 99, Posts: 14690, Credit: 200,643,578, RAC: 874
Message 1696563 - Posted: 28 Jun 2015, 20:41:13 UTC - in response to Message 1696559.

Quoting Message 1696559:
I think Raistmer's got some new apps out, and Eric's time juggling again. Just a theory. [Edit:] There is the email I sent, enquiring about RAID status etc. Could be some sort of integrity checks. Will probably have to update with the new developments here. [Edit2:] Updated Eric: Update: apparently Jeff Buck has been able to trigger the abandonment with some persistence. It'll probably take a while to verify the mechanism. Looks like either the detach/reattach or host migration sensing has some itchy trigger finger, fiddling with hosts/tasks before authentication is complete.

Beta is back, with no new applications. It was Claggy that offered one this morning - Raistmer's went in on Friday.

TBar (Volunteer tester), Joined: 22 May 99, Posts: 5204, Credit: 840,779,836, RAC: 2,768
Message 1696566 - Posted: 28 Jun 2015, 20:56:53 UTC

It appears our Test Host is no longer receiving work. After suffering another Abandonment at 10:05:01 UTC, his last task was received at 28 Jun 2015, 12:00:39 UTC.

TBar (Volunteer tester), Joined: 22 May 99, Posts: 5204, Credit: 840,779,836, RAC: 2,768
Message 1696593 - Posted: 28 Jun 2015, 23:29:30 UTC

Well, I received a resend on one of my hosts and it had been Abandoned. I checked the host and found him again, on Main this time. Same same, still getting those Abandoned whatevers;

28 Jun 2015, 15:07:31 UTC - Abandoned

Amazing. Can someone check the Server Log on this Host, http://setiathome.berkeley.edu/show_host_detail.php?hostid=7206136

Back in business!

Jeff Buck (Volunteer tester), Joined: 11 Feb 00, Posts: 1441, Credit: 148,764,870, RAC: 0
Message 1696596 - Posted: 28 Jun 2015, 23:45:15 UTC - in response to Message 1696593.

Quoting Message 1696593:
Well, I received a resend on one of my hosts and it had been Abandoned. I checked the host and found him again, on Main this time. Same same, still getting those Abandoned whatevers; 28 Jun 2015, 15:07:31 UTC - Abandoned Amazing. Can someone check the Server Log on this Host, http://setiathome.berkeley.edu/show_host_detail.php?hostid=7206136 Back in business!

That's interesting. I have a database with almost all my tasks for the last 2+ years. I just checked it and found I've been paired with host 7206136 14 times in that span. Four of them were abandoned and one was a timeout, all of them in 2014: April 18, July 11 (the timeout), September 16, September 18, and October 5, so it appears to be a long-standing issue. No problems in the 7 WUs since then, although the last one was way back in February of this year.

I checked his other host, too, 1850030, but only shared 3 WUs, all last year, and all were fine.

jason_gee (Volunteer developer, Volunteer tester), Joined: 24 Nov 06, Posts: 7489, Credit: 91,093,184, RAC: 0
Message 1696634 - Posted: 29 Jun 2015, 2:57:56 UTC Last modified: 29 Jun 2015, 2:59:51 UTC

An interesting feature or side effect in the interior logic to factor in, in both places where the bulk abandonment can occur, is that if you set the cc_config option:

<allow_multiple_clients>1</allow_multiple_clients>

even if only using a single client and data directory, the bulk abandonment portions are disabled, though be aware that if both host lookups fail then a new hostid will be generated.

I suppose that is intended behaviour so as to make a multiple client host have several hostids, and that subsequently there'd be separate cpids and sequence numbers to match hosts for each instance. Not sure how that would factor into potential refinements, other than it's a valid form of operation.

Looking at the way the code is structured, I find it probable that mashing the two modes of operation in together, as opposed to a completely separate code block, may well be the source of some logic holes. To some extent that's a stylistic choice, but it also added spaghettification to the common case.
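For anyone who wants to try that option, the following sketch generates a minimal cc_config.xml containing it, using only Python's standard library. The helper name `build_cc_config` is hypothetical, and placing the flag inside an `<options>` element is an assumption based on the usual cc_config.xml layout; check your own client's documentation and merge with any existing settings rather than overwriting them.

```python
import xml.etree.ElementTree as ET


def build_cc_config(allow_multiple_clients: bool) -> str:
    """Return a minimal cc_config.xml document as a string."""
    root = ET.Element("cc_config")
    # Assumption: client options live under <options>.
    options = ET.SubElement(root, "options")
    flag = ET.SubElement(options, "allow_multiple_clients")
    flag.text = "1" if allow_multiple_clients else "0"
    return ET.tostring(root, encoding="unicode")


print(build_cc_config(True))
```

The client typically needs to re-read its config files (or be restarted) before a changed option takes effect.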

TBar (Volunteer tester), Joined: 22 May 99, Posts: 5204, Credit: 840,779,836, RAC: 2,768
Message 1696774 - Posted: 29 Jun 2015, 16:36:00 UTC Last modified: 29 Jun 2015, 16:36:41 UTC

Right now it appears the Host on Beta has stopped working, http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=71714

My guess is the one on main is working through All those Abandoned/Worthless tasks, there were many Abandoned yesterday. It looks as though he was issued a few more tasks, which were quickly Abandoned once again, http://setiathome.berkeley.edu/results.php?hostid=7206136

So what's the latest theory? Someone Please tell me SETI actually checks the Time Stamps on Requests before labeling them as out of sequence, Abandoning All your tasks, Not removing them from your Host, Not informing you of their actions, and leaving your Host to waste Time and Energy working Worthless tasks.

Jeff Buck (Volunteer tester), Joined: 11 Feb 00, Posts: 1441, Credit: 148,764,870, RAC: 0
Message 1696785 - Posted: 29 Jun 2015, 17:33:30 UTC - in response to Message 1696774.

Quoting Message 1696774:
Someone Please tell me SETI actually checks the Time Stamps on Requests before labeling them as out of sequence, Abandoning All your tasks, Not removing them from your Host, Not informing you of their actions, and leaving your Host to waste Time and Energy working Worthless tasks.

Why sure, TBar, and would you also like someone to tell you that the Easter Bunny and the Tooth Fairy are real? I could probably do that, all for the same price. And for just a nominal extra charge, I could throw in Santa Claus. ;^)

Seriously, though, I rather doubt that time stamps from local hosts would be a very reliable method of verification, even if the scheduler was storing them in the database. The time stamps would be outside of BOINC's control and subject to all sorts of adjustments that could occur for a variety of reasons on the local hosts. In addition to minor automatic syncs by the OS (or manual ones by the user), I can think of Daylight Saving Time changes (varying by locale and user option), traveling laptops whose owners like to have them on local time, dead or dying CMOS batteries, and probably many other manual manipulations for reasons that have nothing at all to do with trying to game BOINC.

No, I think that anything BOINC would use for sequence checking would have to be something that was pretty much completely in BOINC's control. But, again, I don't think the issue is as much detecting the out-of-sequence condition as it is how the scheduler deals with it.

TBar (Volunteer tester), Joined: 22 May 99, Posts: 5204, Credit: 840,779,836, RAC: 2,768
Message 1696791 - Posted: 29 Jun 2015, 18:04:34 UTC - in response to Message 1696785. Last modified: 29 Jun 2015, 18:08:55 UTC

You appear to be suggesting None of the Request arrived at the server in the 6:49 between the first Request and the second Request. I find that hard to believe considering the previous requests were taking a second or two. Even if SETI doesn't Trust the same Time Stamps other institutions such as Banks and Finance Trust, surely they can Trust when the First few packets arrived. Personally I find the suggestion that SETI can't Trust the same Time Stamps everyone else uses rather evasive and questionable. In any event, something should be changed in the way SETI responds to something as simple as a delayed packet, if that is the case.

Jeff Buck (Volunteer tester), Joined: 11 Feb 00, Posts: 1441, Credit: 148,764,870, RAC: 0
Message 1696799 - Posted: 29 Jun 2015, 18:28:43 UTC - in response to Message 1696791.

Quoting Message 1696791:
You appear to be suggesting None of the Request arrived at the server in the 6:49 between the first Request and the second Request. I find that hard to believe considering the previous requests were taking a second or two. Even if SETI doesn't Trust the same Time Stamps other institutions such as Banks and Finance Trust, surely they can Trust when the First few packets arrived. Personally I find the suggestion that SETI can't Trust the same Time Stamps everyone else uses rather evasive and questionable. In any event, something should be changed in the way SETI responds to something as simple as a delayed packet, if that is the case.

I'm only talking about a possible time stamp in the body of the actual request message that the scheduler receives, not header time stamps in individual packets (if, in fact, the scheduler request actually does get broken into separate packets). I'm hardly a communications expert, but I seriously doubt that the scheduler sees anything except the whole message body, once it's been completely received and, if necessary, reassembled from individual packets. It certainly can't do any processing on the message until it's got the whole thing.

In any event, using your example that began this thread, I really can't see where it would have made a difference whether it was using a time stamp or a request sequence number. Either way, the scheduler didn't receive the first request until after the second one had been successfully processed. Therefore, an out-of-sequence condition would have been raised no matter what. However, the assumption the scheduler apparently made, in deciding that there was some sort of detach/reattach scenario that required trashing all the tasks in progress, is what certainly seems to me to need fixing.

TBar (Volunteer tester), Joined: 22 May 99, Posts: 5204, Credit: 840,779,836, RAC: 2,768
Message 1696805 - Posted: 29 Jun 2015, 19:30:01 UTC

There it goes again, 9 more tasks Abandoned,

29 Jun 2015, 19:05:52 UTC - Abandoned

jason_gee (Volunteer developer, Volunteer tester), Joined: 24 Nov 06, Posts: 7489, Credit: 91,093,184, RAC: 0
Message 1696816 - Posted: 29 Jun 2015, 20:50:07 UTC - in response to Message 1696774.

I've not come across any timestamp checking/comparison in the authentication code up to this point of bulk detachments, only sequence number, hostids, and some other conditions like that multiple clients flag I mentioned. I'm not sure a timestamp from the client would be more useful than the sequence number, but I imagine if there is a server receipt timestamp it might be. It could facilitate a timeout to drop the request before another server request can be made, which would probably be safer than spontaneous detach/reattach and abandonment after a 9-minute delay.
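A rough sketch of that timeout idea follows, assuming the server records a receipt time when a request first reaches it and checks the request's age before acting on it. Everything here is hypothetical: the names `is_stale` and `handle`, and the 300-second limit, are illustrative choices, not values or functions from the BOINC scheduler.

```python
# Hypothetical limit: requests older than this are dropped untouched.
MAX_REQUEST_AGE_SECONDS = 300.0


def is_stale(received_at: float, now: float,
             max_age: float = MAX_REQUEST_AGE_SECONDS) -> bool:
    """True if the request has waited longer than max_age seconds."""
    return (now - received_at) > max_age


def handle(received_at: float, now: float) -> str:
    # Both times come from the server clock, so client clock drift,
    # time zone changes, or a dying CMOS battery cannot affect the check.
    if is_stale(received_at, now):
        return "dropped: request too old, no action taken"
    return "processed"
```

Under these assumptions a request that sat for 9 minutes (540 seconds) before being acted on would be dropped, while one handled within a second or two would go through as usual. Note that this only helps if the delay happens after the server has stamped the request, not while it is still in transit.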

jason_gee (Volunteer developer, Volunteer tester), Joined: 24 Nov 06, Posts: 7489, Credit: 91,093,184, RAC: 0
Message 1696817 - Posted: 29 Jun 2015, 21:17:02 UTC - in response to Message 1696816. Last modified: 29 Jun 2015, 21:39:48 UTC

One possible fix comes to mind. On a lower sequence number, instead of immediately enacting detach/reattach and abandon, it could set a flag for the host 'abandon_if_nextrequest_RPCseqno_follows_this_one' and do nothing apart from store the old (greater) sequence number, and the new lower one as current.

Then on subsequent contact, if the sequence number follows the current one, but not the earlier-greater one, the detach/reattach should be genuine, or at least have a much higher probability of being genuine. Do the detach if so, and in either case finally clear the stored flags before continuing with the full request.

Storing that extra little bit of data may not need to happen in the main hosts table, but in a small lookup table or file called suspect_contacts_for_possible_reattach.
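The two-step check proposed above can be sketched as a small state machine. This is one reading of the proposal, not scheduler code: `HostState`, `on_request`, and the returned strings are hypothetical, and "follows" is interpreted here as "is greater than", which is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HostState:
    """Hypothetical per-host sequence tracking."""
    rpc_seqno: int                          # current sequence number
    suspect_prev: Optional[int] = None      # earlier, greater number if flagged


def on_request(state: HostState, seqno: int) -> str:
    if seqno < state.rpc_seqno:
        # Lower than expected: do nothing except remember both numbers.
        if state.suspect_prev is None:
            state.suspect_prev = state.rpc_seqno
        state.rpc_seqno = seqno
        return "flagged"

    if state.suspect_prev is not None:
        earlier = state.suspect_prev
        state.suspect_prev = None           # clear the flag in either case
        state.rpc_seqno = seqno
        if seqno > earlier:
            # Continues the old, higher sequence: the low one was a straggler.
            return "process"
        # Continues only the lower sequence: likely a genuine reattach.
        return "detach_then_process"

    state.rpc_seqno = seqno
    return "process"
```

For a delayed request, the server is at 12, the straggler 11 arrives and is flagged, and the next request 13 is processed with nothing abandoned. For a client restored from an old copy, the server is at 12, request 5 is flagged, and the following request 6 triggers the detach.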

TBar (Volunteer tester), Joined: 22 May 99, Posts: 5204, Credit: 840,779,836, RAC: 2,768
Message 1696820 - Posted: 29 Jun 2015, 21:20:29 UTC - in response to Message 1696816.

Quoting Message 1696816:
I've not come across any timestamp checking/comparison in the authentication code up to this point of bulk detachments, only sequence number, hostids, and some other conditions like that multiple clients flag I mentioned. I'm not sure a timestamp from the client would be more useful than the sequence number, but I imagine if there is a server receipt timestamp it might be. It could facilitate a timeout to drop the request before another server request can be made, which would probably be safer than spontaneous detach/reattach and abandonment after a 9-minute delay.

Yes, that would prevent a slow request from being completed late and out of sequence. Sounds like a logical procedure, already working on the client...

Jeff Buck (Volunteer tester), Joined: 11 Feb 00, Posts: 1441, Credit: 148,764,870, RAC: 0
Message 1696833 - Posted: 29 Jun 2015, 21:48:41 UTC - in response to Message 1696817.

Quoting Message 1696817:
One possible fix comes to mind. On a lower sequence number, instead of immediately enacting detach/reattach and abandon, it could set a flag for the host 'abandon_if_nextrequest_RPCseqno_follows_this_one' and do nothing apart from store the old (greater) sequence number, and the new lower one (+1) as current. Then on subsequent contact, if the sequence number follows the current one, but not the earlier-greater one, the detach/reattach should be genuine, or at least have a much higher probability of being genuine. Do the detach if so, and in either case finally clear the stored flags before continuing with the full request. Storing that extra little bit of data may not need to happen in the main hosts table, but in a small lookup table or file called suspect_contacts_for_possible_reattach.

I'm curious as to what the worst consequence would be if an out-of-sequence request resulted in no scheduler action at all, other than perhaps a message to the requesting host to that effect, which would be posted in the event log. To me it seems that if the situation was not an actual detach/reattach, but simply an in-transit delay like we've been discussing, there would be no consequences at all, since the later request (which was successfully processed earlier) would have already taken care of any completed task reporting and new task retrieval. Subsequent requests would just continue as normal after that one.

For a legitimate detach/reattach, wouldn't a non-action simply leave those "in progress" tasks on the server until they time out? The only thing the forced abandonment seems to accomplish is that those tasks get resent to new hosts more quickly.
