Message boards : Number crunching : Suddenly BOINC Decides to Abandon 71 APs...WTH?
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
Yeah, well fingers crossed Eric has some insight into that linked beta host. I have indicated that we poked at the mechanism from client side, and it looks to be a more internal issue. The code not distinguishing between an error and a host not found is a red flag to wave around. It changes the meaning of the logic quite a bit.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions.
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874
> And here's another "what if" that occurred to me shortly after my head hit the pillow. (Maddening when that happens!)

I don't think that's likely to be any trouble. When we had the old, slow data link to the lab, we used to have regular discussions about message log (as it was then) entries like:

1/15/2008 8:06:02 AM|SETI@home|Message from server: Completed result 13ja07ab.5741.11526.16.6.230_0 refused: result already reported as success

The words 'result refused by server???' tended to scare people, so David did his usual thing with a bad server message: he removed it completely, so you never know when a result has been reported twice. But I'm sure it still happens, and the server isn't fazed by it.
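The behaviour Richard describes, where a second report of an already-credited result is quietly accepted rather than treated as an error, amounts to an idempotent reporting path. A minimal sketch of that idea, using a plain dict in place of the server database and invented names (handle_report is not a real BOINC function):

```python
# Hypothetical sketch of idempotent result reporting: a duplicate
# report of an already-credited result is simply acknowledged rather
# than treated as a failure. Names are illustrative, not BOINC's.

def handle_report(db, result_name):
    """Return a status string; never fail on a double report."""
    state = db.get(result_name)
    if state == "success":
        # Already credited: quietly accept the duplicate report.
        return "already reported as success"
    db[result_name] = "success"
    return "accepted"
```

The point of the design is that the client can safely retry a report it is unsure about, and the server stays unfazed either way.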
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4
> And here's another "what if" that occurred to me shortly after my head hit the pillow. (Maddening when that happens!)

Also, at one point, a double report of a task would initiate a "resend lost tasks" event, even if resend lost tasks was disabled; not sure if that still works. (I'm only running Jason's BOINC 6.10.58 client on one host, and that's shut down atm)

Claggy
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
Hmmm, reminds me to have a look and see if I have any ghosts floating around. I just had a PM from someone who didn't know what they were, so it'd be interesting if ye olde resend trigger still works. [Edit:] Hmm, no ghosts here apparently...
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
> And here's another "what if" that occurred to me shortly after my head hit the pillow. (Maddening when that happens!)

Okay, well, it was just one of those twilight zone thoughts that hit me. ;^) Thanks for the explanation, Richard. One less quirky situation to consider, then.
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
Okaaaaaaaaay! In the spirit of "keep hitting it with the hammer until it breaks", here's what I did to actually get it to abandon the tasks in progress.

I copied my whole BOINC folder to a temp folder. I then let BOINC continue to run through a couple of manual update cycles (no tasks reported, none actually requested). That bumped the rpc_seqno up by a couple and probably changed a couple of other fields in client_state as well. I then restored the original BOINC folder from the temp copy and restarted BOINC. After a minute or so, I did another manual update. BINGO!! All four in-progress tasks abandoned. (They are still on my host, with 3 of them currently running and one ready to start.)

So, it looks to me like the out-of-sequence condition truly is what triggers the abandonment, but perhaps there's something other than the rpc_seqno which has to be out of sync to trigger the scheduler's reaction.

Oh, and here's the link to my host again, so you don't have to go digging for it: 6949656
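Jeff's experiment suggests the scheduler compares the rpc_seqno in the request against the last value it recorded for the host, and treats a lower number as grounds for abandoning everything in progress. A rough sketch of that suspected check, with invented names (this is not the actual scheduler code, just the logic the experiment seems to expose):

```python
# Hypothetical sketch of the suspected server-side sequence check.
# host_record stands in for the host row in the server database;
# all names are guesses for illustration, not BOINC identifiers.

def check_rpc_seqno(host_record, request_seqno, tasks_in_progress):
    """If the request's seqno is lower than the last recorded one,
    the host looks stale or duplicated, and (in the behaviour seen
    here) everything in progress gets abandoned."""
    if request_seqno < host_record["rpc_seqno"]:
        abandoned = list(tasks_in_progress)  # the drastic response
        tasks_in_progress.clear()
        return abandoned
    # Normal case: record the new sequence number and carry on.
    host_record["rpc_seqno"] = request_seqno
    return []
```

With Jeff's snapshots (server at seqno 4, restored client at seqno 2), this sketch abandons all four in-progress tasks, matching what he observed.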
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
And here are some client_state snapshots for comparison. I've just included the bulk of the <project> sections. If anything else might be relevant, let me know.

1) Following the last normal update request to the scheduler:

<project>
    <master_url>http://setiathome.berkeley.edu/</master_url>
    <project_name>SETI@home</project_name>
    <symstore></symstore>
    <user_name>Jeff Buck</user_name>
    <team_name></team_name>
    <host_venue></host_venue>
    ...
    <cpid_time>950313059.000000</cpid_time>
    <user_total_credit>43486089.613761</user_total_credit>
    <user_expavg_credit>21439.790464</user_expavg_credit>
    <user_create_time>950313059.000000</user_create_time>
    <rpc_seqno>4</rpc_seqno>
    <userid>829043</userid>
    <teamid>0</teamid>
    <hostid>6949656</hostid>
    <host_total_credit>977939.805310</host_total_credit>
    <host_expavg_credit>3.120990</host_expavg_credit>
    <host_create_time>1364095238.000000</host_create_time>
    <nrpc_failures>0</nrpc_failures>
    <master_fetch_failures>0</master_fetch_failures>
    <min_rpc_time>1435511501.484375</min_rpc_time>
    <next_rpc_time>0.000000</next_rpc_time>
    <rec>269.355469</rec>
    <rec_time>1435511600.078125</rec_time>
    <resource_share>100.000000</resource_share>
    <desired_disk_usage>0.000000</desired_disk_usage>
    <duration_correction_factor>1.000000</duration_correction_factor>
    <sched_rpc_pending>0</sched_rpc_pending>
    <send_time_stats_log>0</send_time_stats_log>
    <send_job_log>0</send_job_log>
    <dont_use_dcf/>
    <dont_request_more_work/>
    <rsc_backoff_time>
        <name>CPU</name>
        <value>0.000000</value>
    </rsc_backoff_time>
    <rsc_backoff_interval>
        <name>CPU</name>
        <value>0.000000</value>
    </rsc_backoff_interval>
    <rsc_backoff_time>
        <name>NVIDIA</name>
        <value>0.000000</value>
    </rsc_backoff_time>
    <rsc_backoff_interval>
        <name>NVIDIA</name>
        <value>0.000000</value>
    </rsc_backoff_interval>

2) From the backup copy which triggered the abandonment:

<project>
    <master_url>http://setiathome.berkeley.edu/</master_url>
    <project_name>SETI@home</project_name>
    <symstore></symstore>
    <user_name>Jeff Buck</user_name>
    <team_name></team_name>
    <host_venue></host_venue>
    ...
    <cpid_time>950313059.000000</cpid_time>
    <user_total_credit>43464542.042580</user_total_credit>
    <user_expavg_credit>20386.186532</user_expavg_credit>
    <user_create_time>950313059.000000</user_create_time>
    <rpc_seqno>2</rpc_seqno>
    <userid>829043</userid>
    <teamid>0</teamid>
    <hostid>6949656</hostid>
    <host_total_credit>977822.869161</host_total_credit>
    <host_expavg_credit>0.000024</host_expavg_credit>
    <host_create_time>1364095238.000000</host_create_time>
    <nrpc_failures>0</nrpc_failures>
    <master_fetch_failures>0</master_fetch_failures>
    <min_rpc_time>0.000000</min_rpc_time>
    <next_rpc_time>0.000000</next_rpc_time>
    <rec>247.832912</rec>
    <rec_time>1435511902.343750</rec_time>
    <resource_share>100.000000</resource_share>
    <desired_disk_usage>0.000000</desired_disk_usage>
    <duration_correction_factor>1.000000</duration_correction_factor>
    <sched_rpc_pending>0</sched_rpc_pending>
    <send_time_stats_log>0</send_time_stats_log>
    <send_job_log>0</send_job_log>
    <dont_use_dcf/>
    <dont_request_more_work/>
    <rsc_backoff_time>
        <name>CPU</name>
        <value>0.000000</value>
    </rsc_backoff_time>
    <rsc_backoff_interval>
        <name>CPU</name>
        <value>0.000000</value>
    </rsc_backoff_interval>
    <rsc_backoff_time>
        <name>NVIDIA</name>
        <value>0.000000</value>
    </rsc_backoff_time>
    <rsc_backoff_interval>
        <name>NVIDIA</name>
        <value>0.000000</value>
    </rsc_backoff_interval>

3) Finally, what it looks like now, following the abandonment:

<project>
    <master_url>http://setiathome.berkeley.edu/</master_url>
    <project_name>SETI@home</project_name>
    <symstore></symstore>
    <user_name>Jeff Buck</user_name>
    <team_name></team_name>
    <host_venue></host_venue>
    ...
    <cpid_time>950313059.000000</cpid_time>
    <user_total_credit>43486127.734700</user_total_credit>
    <user_expavg_credit>21407.806651</user_expavg_credit>
    <user_create_time>950313059.000000</user_create_time>
    <rpc_seqno>0</rpc_seqno>
    <userid>829043</userid>
    <teamid>0</teamid>
    <hostid>6949656</hostid>
    <host_total_credit>977939.805310</host_total_credit>
    <host_expavg_credit>3.120990</host_expavg_credit>
    <host_create_time>1364095238.000000</host_create_time>
    <nrpc_failures>0</nrpc_failures>
    <master_fetch_failures>0</master_fetch_failures>
    <min_rpc_time>1435512272.625000</min_rpc_time>
    <next_rpc_time>0.000000</next_rpc_time>
    <rec>249.186601</rec>
    <rec_time>1435511962.531250</rec_time>
    <resource_share>100.000000</resource_share>
    <desired_disk_usage>0.000000</desired_disk_usage>
    <duration_correction_factor>1.000000</duration_correction_factor>
    <sched_rpc_pending>0</sched_rpc_pending>
    <send_time_stats_log>0</send_time_stats_log>
    <send_job_log>0</send_job_log>
    <dont_use_dcf/>
    <dont_request_more_work/>
    <rsc_backoff_time>
        <name>CPU</name>
        <value>0.000000</value>
    </rsc_backoff_time>
    <rsc_backoff_interval>
        <name>CPU</name>
        <value>0.000000</value>
    </rsc_backoff_interval>
    <rsc_backoff_time>
        <name>NVIDIA</name>
        <value>0.000000</value>
    </rsc_backoff_time>
    <rsc_backoff_interval>
        <name>NVIDIA</name>
        <value>0.000000</value>
    </rsc_backoff_interval>
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874
Was the <host_cpid> present all three times, and did it change? (That tag is right up at the top, in the <host_info> section - 'cp' stands for 'cross project', so it's not associated with any one project section)
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
> Was the <host_cpid> present all three times, and did it change?

It was present all 3 times, but it did change after the abandonment.

1st 2 times: <host_cpid>c9dd1af0e69e282b49b5818720bd74e3</host_cpid>
3rd one: <host_cpid>069035a1ccbf06d5edfd3e86541e8cb9</host_cpid>

EDIT: BTW, the cross_project_id in the project section, which I guess I edited out along with the email hash in my earlier post, stayed the same: <cross_project_id>cdb7be9e83b9d839e62188ac5c224d3c</cross_project_id>
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
That's pretty much what has been suspected from the beginning. For some reason a request is being held up, so that by the time it is finally acted upon it is out of sequence. I don't buy the 'bouncing around the internet' theory. More likely it is being held by the server for some reason.
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
> That's pretty much what has been suspected from the beginning. For some reason a request is being held until when it is finally acted upon it is out of sequence. I don't buy the 'bouncing around the internet' theory. More likely it is being held by the Server for some reason.

Yeah, see, I tend to think that an Internet glitch is more likely than a scheduler glitch. There are so many way stations between Points A and B that one transient anomaly somewhere, affecting just one packet in a scheduler request (which may or may not comprise multiple packets), is all it would take to create the out-of-sequence situation. Even with an extremely reliable ISP at both ends, a host could still get dinged once in a great while. On the other hand, someone with a notoriously bad ISP or local node (such as, perhaps, your Beta buddy) could get dinged quite often. If it were the scheduler, I'd think the pain would be spread more evenly across all users.
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
> That's pretty much what has been suspected from the beginning. For some reason a request is being held until when it is finally acted upon it is out of sequence. I don't buy the 'bouncing around the internet' theory. More likely it is being held by the Server for some reason.

The next question is why packets that have time stamps could be deemed out of sequence even if they arrive late. Simply checking the time stamp would identify the sequence. The server does check time stamps, doesn't it?
kittyman Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004
I'll bet that it doesn't... just my gut feeling. I'll bet that it takes them as they come in, and the assumption is made that no packet is likely to pass up an earlier one on the way to the server.

"Freedom is just Chaos, with better lighting." Alan Dean Foster
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
> The next question is why packets that have time stamps could be deemed out of sequence even if they arrive late. Simply checking the time stamp would identify the sequence. The Server does check time stamps, doesn't it?

To do that, it would have to store the time stamp from the previous message in the DB in order to have something to compare it with. I kind of doubt that it does, since it thinks that the rpc_seqno addresses the issue.

In any event, I think that the underlying problem is not so much that the requests arrive out of sequence (regardless of the reason); it's that the scheduler applies such a drastic solution when that happens. I would think that could be improved.
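Jeff's point, that a timestamp check would itself require storing the previous timestamp per host, can be made concrete. A hypothetical sketch (field and function names are invented, not from the BOINC scheduler):

```python
# Hypothetical timestamp-based ordering check. Recognising a
# late-arriving request requires persisting one extra field per
# host (the last request time) in the server database.

def is_stale(host_record, request_time):
    """Return True if this request is older than the last one seen
    for the host; otherwise record it as the newest and return False."""
    last = host_record.get("last_request_time", 0.0)
    if request_time <= last:
        return True   # arrived out of order: could be ignored, not punished
    host_record["last_request_time"] = request_time
    return False
```

The sketch also shows the cost of the idea: one more stored field and one more comparison per scheduler request, which is cheap, but it only helps if the server then chooses a gentler response than abandoning work.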
Rasputin42 Joined: 25 Jul 08 Posts: 412 Credit: 5,834,661 RAC: 0
Does the time stamp really help? Isn't it the case that the request needs to be linked to the response? So how does it know which response belongs to which request? Maybe I am missing something here, so please tell.
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
> The next question is why packets that have time stamps could be deemed out of sequence even if they arrive late. Simply checking the time stamp would identify the sequence. The Server does check time stamps, doesn't it?

That part I agree with 110%.
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874
The concept behind the coding of BOINC is that it should be fault-tolerant, but cheating-intolerant. The problem here is that faults are being sent down the cheaters' pathway, which is far from ideal for anyone. The question is, what needs to change to route them down a fault-tolerant pathway?
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
> The concept behind the coding of BOINC is that it should be fault-tolerant, but cheating-intolerant.

Checking time stamps would definitely solve any sequence confusion.
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
> The concept behind the coding of BOINC is that it should be fault-tolerant, but cheating-intolerant.

My inclination would be for the scheduler to simply take no action at all on an out-of-sequence request, other than perhaps to send a response back to the requesting host that such a request was received. It would neither accept any reported completed tasks nor send out any new tasks when the request is out of sequence, and it certainly wouldn't abort everything in progress without alerting the host to that action.
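Jeff's suggested gentler policy, taking no action on an out-of-sequence request beyond telling the host, might look something like this sketch (purely illustrative; the function and field names are invented, not BOINC code):

```python
# Hypothetical conservative policy for an out-of-sequence request:
# accept nothing, send nothing, abandon nothing; just tell the host.

def handle_out_of_sequence(request_seqno, host_seqno):
    """Return a decision record for the scheduler reply."""
    if request_seqno < host_seqno:
        return {
            "accept_reports": False,
            "send_new_tasks": False,
            "abandon_in_progress": False,
            "message": "request out of sequence; no action taken",
        }
    return {
        "accept_reports": True,
        "send_new_tasks": True,
        "abandon_in_progress": False,
        "message": "ok",
    }
```

The key difference from the observed behaviour is the third flag: in-progress work is never abandoned on a sequence mismatch, so a transient glitch costs one scheduler contact instead of a cache full of tasks.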
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
Hah! That's what I get for getting some sleep (and I need a bit more yet); someone goes and breaks the toys :D

Superficially, I feel the key was the successful RPCs in between. They would allow the server's image of the host to update (including the sequence number); consider it a virtual representation of the client, one that wasn't getting updated in the prior attempts. In any case, the process and details there should at the very least help identify which of the two known abandonment paths is being reached, at least when my eyeballs and brain are awake and cooperating.

I don't see either codepath labelled in any way suggesting it's there for cheat-prevention purposes, only explicitly for detach/reattach and host-migration purposes. That doesn't mean it couldn't have such an intended function, but frankly there'd be far easier ways to cheat than this, IMO (not that I'll describe any ;) )

For fault tolerance, yeah, I agree the server side is supposed to be fault-tolerant and assume everything up to it might be broken or tampered with. From that perspective, I suspect a better response might be a better check of the assumptions being made against the known good info on hand. The new cpid would seem to be telling in that regard. Needless database bloat and whacking of perfectly good tasks in progress are probably counterproductive to fault tolerance, at least when multiplied by some percentage of active hosts.
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.