Message boards : Number crunching : Suddenly BOINC Decides to Abandon 71 APs...WTH?
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
Yeah, well fingers crossed Eric has some insight into that linked beta host. I have indicated that we poked at the mechanism from client side, and it looks to be a more internal issue. The code not distinguishing between an error and a host not found is a red flag to wave around. It changes the meaning of the logic quite a bit.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions.
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874
> And here's another "what if" that occurred to me shortly after my head hit the pillow. (Maddening when that happens!)

I don't think that's likely to be any trouble. When we had the old, slow data link to the lab, we used to have regular discussions about message log (as it was then) entries like:

1/15/2008 8:06:02 AM|SETI@home|Message from server: Completed result 13ja07ab.5741.11526.16.6.230_0 refused: result already reported as success

The words 'result refused by server???' tended to scare people, so David did his usual thing with a bad server message: he removed it completely, so you never know when a result has been reported twice. But I'm sure it still happens, and the server isn't fazed by it.
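The behaviour Richard describes, where a second report of an already-credited result is quietly accepted rather than treated as an error, amounts to an idempotent reporting path. A minimal sketch of that idea, using a plain dict in place of the server database and invented names (handle_report is not a real BOINC function):

```python
# Hypothetical sketch of idempotent result reporting: a duplicate
# report of an already-credited result is simply acknowledged rather
# than treated as a failure. Names are illustrative, not BOINC's.

def handle_report(db, result_name):
    """Return a status string; never fail on a double report."""
    state = db.get(result_name)
    if state == "success":
        # Already credited: quietly accept the duplicate report.
        return "already reported as success"
    db[result_name] = "success"
    return "accepted"
```

The point of the design is that the client can safely retry a report it is unsure about, and the server stays unfazed either way.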
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4
> And here's another "what if" that occurred to me shortly after my head hit the pillow. (Maddening when that happens!)

Also, at one point, a double report of a task would initiate a "resend lost tasks" event, even if resend lost tasks was disabled; not sure if that still works. (I'm only running Jason's BOINC 6.10.58 client on one host, and that's shut down atm)

Claggy
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
Hmmm, reminds me to have a look and see if I have any ghosts floating around. I just had a PM from someone who didn't know what they were, so it'd be interesting if ye olde resend trigger still works. [Edit:] Hmm, no ghosts here apparently...
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
> And here's another "what if" that occurred to me shortly after my head hit the pillow. (Maddening when that happens!)

Okay, well, it was just one of those twilight zone thoughts that hit me. ;^) Thanks for the explanation, Richard. One less quirky situation to consider, then.
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
Okaaaaaaaaay! In the spirit of "keep hitting it with the hammer until it breaks", here's what I did to actually get it to abandon the tasks in progress.

I copied my whole BOINC folder to a temp folder. I then let BOINC continue to run through a couple of manual update cycles (no tasks reported, none actually requested). That bumped the rpc_seqno up by a couple and probably changed a couple of other fields in client_state as well. I then restored the original BOINC folder from the temp copy and restarted BOINC. After a minute or so, I did another manual update. BINGO!! All four in-progress tasks abandoned. (They are still on my host, with 3 of them currently running and one ready to start.)

So, it looks to me like the out-of-sequence condition truly is what triggers the abandonment, but perhaps there's something other than the rpc_seqno which has to be out of sync to trigger the scheduler's reaction.

Oh, and here's the link to my host again, so you don't have to go digging for it: 6949656
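Jeff's experiment suggests the scheduler compares the rpc_seqno in the request against the last value it recorded for the host, and treats a lower number as grounds for abandoning everything in progress. A rough sketch of that suspected check, with invented names (this is not the actual scheduler code, just the logic the experiment seems to expose):

```python
# Hypothetical sketch of the suspected server-side sequence check.
# host_record stands in for the host row in the server database;
# all names are guesses for illustration, not BOINC identifiers.

def check_rpc_seqno(host_record, request_seqno, tasks_in_progress):
    """If the request's seqno is lower than the last recorded one,
    the host looks stale or duplicated, and (in the behaviour seen
    here) everything in progress gets abandoned."""
    if request_seqno < host_record["rpc_seqno"]:
        abandoned = list(tasks_in_progress)  # the drastic response
        tasks_in_progress.clear()
        return abandoned
    # Normal case: record the new sequence number and carry on.
    host_record["rpc_seqno"] = request_seqno
    return []
```

With Jeff's snapshots (server at seqno 4, restored client at seqno 2), this sketch abandons all four in-progress tasks, matching what he observed.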
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
And here are some client_state snapshots for comparison. I've just included the bulk of the <project> sections. If anything else might be relevant, let me know.

1) Following the last normal update request to the scheduler:

<project>
    <master_url>http://setiathome.berkeley.edu/</master_url>
    <project_name>SETI@home</project_name>
    <symstore></symstore>
    <user_name>Jeff Buck</user_name>
    <team_name></team_name>
    <host_venue></host_venue>
    ...
    <cpid_time>950313059.000000</cpid_time>
    <user_total_credit>43486089.613761</user_total_credit>
    <user_expavg_credit>21439.790464</user_expavg_credit>
    <user_create_time>950313059.000000</user_create_time>
    <rpc_seqno>4</rpc_seqno>
    <userid>829043</userid>
    <teamid>0</teamid>
    <hostid>6949656</hostid>
    <host_total_credit>977939.805310</host_total_credit>
    <host_expavg_credit>3.120990</host_expavg_credit>
    <host_create_time>1364095238.000000</host_create_time>
    <nrpc_failures>0</nrpc_failures>
    <master_fetch_failures>0</master_fetch_failures>
    <min_rpc_time>1435511501.484375</min_rpc_time>
    <next_rpc_time>0.000000</next_rpc_time>
    <rec>269.355469</rec>
    <rec_time>1435511600.078125</rec_time>
    <resource_share>100.000000</resource_share>
    <desired_disk_usage>0.000000</desired_disk_usage>
    <duration_correction_factor>1.000000</duration_correction_factor>
    <sched_rpc_pending>0</sched_rpc_pending>
    <send_time_stats_log>0</send_time_stats_log>
    <send_job_log>0</send_job_log>
    <dont_use_dcf/>
    <dont_request_more_work/>
    <rsc_backoff_time>
        <name>CPU</name>
        <value>0.000000</value>
    </rsc_backoff_time>
    <rsc_backoff_interval>
        <name>CPU</name>
        <value>0.000000</value>
    </rsc_backoff_interval>
    <rsc_backoff_time>
        <name>NVIDIA</name>
        <value>0.000000</value>
    </rsc_backoff_time>
    <rsc_backoff_interval>
        <name>NVIDIA</name>
        <value>0.000000</value>
    </rsc_backoff_interval>

2) From the backup copy which triggered the abandonment:

<project>
    <master_url>http://setiathome.berkeley.edu/</master_url>
    <project_name>SETI@home</project_name>
    <symstore></symstore>
    <user_name>Jeff Buck</user_name>
    <team_name></team_name>
    <host_venue></host_venue>
    ...
    <cpid_time>950313059.000000</cpid_time>
    <user_total_credit>43464542.042580</user_total_credit>
    <user_expavg_credit>20386.186532</user_expavg_credit>
    <user_create_time>950313059.000000</user_create_time>
    <rpc_seqno>2</rpc_seqno>
    <userid>829043</userid>
    <teamid>0</teamid>
    <hostid>6949656</hostid>
    <host_total_credit>977822.869161</host_total_credit>
    <host_expavg_credit>0.000024</host_expavg_credit>
    <host_create_time>1364095238.000000</host_create_time>
    <nrpc_failures>0</nrpc_failures>
    <master_fetch_failures>0</master_fetch_failures>
    <min_rpc_time>0.000000</min_rpc_time>
    <next_rpc_time>0.000000</next_rpc_time>
    <rec>247.832912</rec>
    <rec_time>1435511902.343750</rec_time>
    <resource_share>100.000000</resource_share>
    <desired_disk_usage>0.000000</desired_disk_usage>
    <duration_correction_factor>1.000000</duration_correction_factor>
    <sched_rpc_pending>0</sched_rpc_pending>
    <send_time_stats_log>0</send_time_stats_log>
    <send_job_log>0</send_job_log>
    <dont_use_dcf/>
    <dont_request_more_work/>
    <rsc_backoff_time>
        <name>CPU</name>
        <value>0.000000</value>
    </rsc_backoff_time>
    <rsc_backoff_interval>
        <name>CPU</name>
        <value>0.000000</value>
    </rsc_backoff_interval>
    <rsc_backoff_time>
        <name>NVIDIA</name>
        <value>0.000000</value>
    </rsc_backoff_time>
    <rsc_backoff_interval>
        <name>NVIDIA</name>
        <value>0.000000</value>
    </rsc_backoff_interval>

3) Finally, what it looks like now, following the abandonment:

<project>
    <master_url>http://setiathome.berkeley.edu/</master_url>
    <project_name>SETI@home</project_name>
    <symstore></symstore>
    <user_name>Jeff Buck</user_name>
    <team_name></team_name>
    <host_venue></host_venue>
    ...
    <cpid_time>950313059.000000</cpid_time>
    <user_total_credit>43486127.734700</user_total_credit>
    <user_expavg_credit>21407.806651</user_expavg_credit>
    <user_create_time>950313059.000000</user_create_time>
    <rpc_seqno>0</rpc_seqno>
    <userid>829043</userid>
    <teamid>0</teamid>
    <hostid>6949656</hostid>
    <host_total_credit>977939.805310</host_total_credit>
    <host_expavg_credit>3.120990</host_expavg_credit>
    <host_create_time>1364095238.000000</host_create_time>
    <nrpc_failures>0</nrpc_failures>
    <master_fetch_failures>0</master_fetch_failures>
    <min_rpc_time>1435512272.625000</min_rpc_time>
    <next_rpc_time>0.000000</next_rpc_time>
    <rec>249.186601</rec>
    <rec_time>1435511962.531250</rec_time>
    <resource_share>100.000000</resource_share>
    <desired_disk_usage>0.000000</desired_disk_usage>
    <duration_correction_factor>1.000000</duration_correction_factor>
    <sched_rpc_pending>0</sched_rpc_pending>
    <send_time_stats_log>0</send_time_stats_log>
    <send_job_log>0</send_job_log>
    <dont_use_dcf/>
    <dont_request_more_work/>
    <rsc_backoff_time>
        <name>CPU</name>
        <value>0.000000</value>
    </rsc_backoff_time>
    <rsc_backoff_interval>
        <name>CPU</name>
        <value>0.000000</value>
    </rsc_backoff_interval>
    <rsc_backoff_time>
        <name>NVIDIA</name>
        <value>0.000000</value>
    </rsc_backoff_time>
    <rsc_backoff_interval>
        <name>NVIDIA</name>
        <value>0.000000</value>
    </rsc_backoff_interval>
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874
Was the <host_cpid> present all three times, and did it change? (That tag is right up at the top, in the <host_info> section - 'cp' stands for 'cross project', so it's not associated with any one project section)
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
> Was the <host_cpid> present all three times, and did it change?

It was present all 3 times, but it did change after the abandonment.

1st 2 times: <host_cpid>c9dd1af0e69e282b49b5818720bd74e3</host_cpid>
3rd one: <host_cpid>069035a1ccbf06d5edfd3e86541e8cb9</host_cpid>

EDIT: BTW, the cross_project_id in the project section, which I guess I edited out along with the email hash in my earlier post, stayed the same: <cross_project_id>cdb7be9e83b9d839e62188ac5c224d3c</cross_project_id>
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
That's pretty much what has been suspected from the beginning. For some reason a request is being held up, so that by the time it is finally acted upon it is out of sequence. I don't buy the 'bouncing around the internet' theory. More likely it is being held by the server for some reason.
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
> That's pretty much what has been suspected from the beginning. For some reason a request is being held until when it is finally acted upon it is out of sequence. I don't buy the 'bouncing around the internet' theory. More likely it is being held by the Server for some reason.

Yeah, see, I tend to think that an Internet glitch is more likely than a scheduler glitch. There are so many way stations between Points A and B that one transient anomaly somewhere, affecting just one packet in a scheduler request (which may or may not comprise multiple packets), is all it would take to create the out-of-sequence situation. Even with an extremely reliable ISP at both ends, a host could still get dinged once in a great while. On the other hand, someone with a notoriously bad ISP or local node (such as, perhaps, your Beta buddy) could get dinged quite often. If it were the scheduler, I'd think the pain would be spread more evenly across all users.
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
> That's pretty much what has been suspected from the beginning. For some reason a request is being held until when it is finally acted upon it is out of sequence. I don't buy the 'bouncing around the internet' theory. More likely it is being held by the Server for some reason.

The next question is why packets that have time stamps could be deemed out of sequence even if they arrive late. Simply checking the time stamp would identify the sequence. The server does check time stamps, doesn't it?
kittyman Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004
I'll bet that it doesn't... just my gut feeling. I'll bet that it takes them as they come in, and the assumption is made that no packet is likely to pass up an earlier one on the way to the server.

"Freedom is just Chaos, with better lighting." Alan Dean Foster
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
> The next question is why packets that have time stamps could be deemed out of sequence even if they arrive late. Simply checking the time stamp would identify the sequence. The Server does check time stamps, doesn't it?

To do that, it would have to store the time stamp from the previous message in the DB in order to have something to compare it with. I kind of doubt that it does, since it thinks that the rpc_seqno addresses the issue.

In any event, I think that the underlying problem is not so much that the requests arrive out of sequence (regardless of the reason); it's that the scheduler applies such a drastic solution when that happens. I would think that could be improved.
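Jeff's point, that a timestamp check would itself require storing the previous timestamp per host, can be made concrete. A hypothetical sketch (field and function names are invented, not from the BOINC scheduler):

```python
# Hypothetical timestamp-based ordering check. Recognising a
# late-arriving request requires persisting one extra field per
# host (the last request time) in the server database.

def is_stale(host_record, request_time):
    """Return True if this request is older than the last one seen
    for the host; otherwise record it as the newest and return False."""
    last = host_record.get("last_request_time", 0.0)
    if request_time <= last:
        return True   # arrived out of order: could be ignored, not punished
    host_record["last_request_time"] = request_time
    return False
```

The sketch also shows the cost of the idea: one more stored field and one more comparison per scheduler request, which is cheap, but it only helps if the server then chooses a gentler response than abandoning work.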
Rasputin42 Joined: 25 Jul 08 Posts: 412 Credit: 5,834,661 RAC: 0
Does the time stamp really help? Isn't it the case that the request needs to be linked to the response? So how does it know which response belongs to which request? Maybe I am missing something here, so please tell.
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
> The next question is why packets that have time stamps could be deemed out of sequence even if they arrive late. Simply checking the time stamp would identify the sequence. The Server does check time stamps, doesn't it?

That part I agree with 110%.
Richard Haselgrove Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874
The concept behind the coding of BOINC is that it should be fault-tolerant, but cheating-intolerant. The problem here is that faults are being sent down the cheaters' pathway, which is far from ideal for anyone. The question is, what needs to change to route them down a fault-tolerant pathway?
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768
> The concept behind the coding of BOINC is that it should be fault-tolerant, but cheating-intolerant.

Checking time stamps would definitely solve any sequence confusion.
Jeff Buck Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0
> The concept behind the coding of BOINC is that it should be fault-tolerant, but cheating-intolerant.

My inclination would be for the scheduler to simply take no action at all on an out-of-sequence request, other than perhaps to send a response back to the requesting host that such a request was received. It would neither accept any reported completed tasks nor send out any new tasks when the request is out of sequence, and it certainly wouldn't abort everything in progress without alerting the host to that action.
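Jeff's suggested gentler policy, taking no action on an out-of-sequence request beyond telling the host, might look something like this sketch (purely illustrative; the function and field names are invented, not BOINC code):

```python
# Hypothetical conservative policy for an out-of-sequence request:
# accept nothing, send nothing, abandon nothing; just tell the host.

def handle_out_of_sequence(request_seqno, host_seqno):
    """Return a decision record for the scheduler reply."""
    if request_seqno < host_seqno:
        return {
            "accept_reports": False,
            "send_new_tasks": False,
            "abandon_in_progress": False,
            "message": "request out of sequence; no action taken",
        }
    return {
        "accept_reports": True,
        "send_new_tasks": True,
        "abandon_in_progress": False,
        "message": "ok",
    }
```

The key difference from the observed behaviour is the third flag: in-progress work is never abandoned on a sequence mismatch, so a transient glitch costs one scheduler contact instead of a cache full of tasks.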
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
Hah! That's what I get for getting some sleep (and I need a bit more yet); someone goes and breaks the toys :D

Superficially, I feel the key was the successful RPCs in between. They would allow the server's image of the host to update (including the sequence number); consider it a virtual representation of the client, one that wasn't getting updated in the prior attempts. In any case, the process and details there should at the very least help identify which of the two known abandonment paths is being reached, at least when my eyeballs and brain are awake and cooperating.

I don't see either codepath labelled in any way suggesting it's there for cheat-prevention purposes, only explicitly for detach/reattach and host-migration purposes. That doesn't mean it couldn't have such an intended function, but frankly there'd be far easier ways to cheat than this, IMO (not that I'll describe any ;) )

For fault tolerance, yeah, I agree the server side is supposed to be fault-tolerant and assume everything up to it might be broken or tampered with. From that perspective, I suspect a better response might be a better check of the assumptions being made against the known good info on hand. The new cpid would seem to be telling in that regard. Needless database bloat and whacking of perfectly good tasks in progress are probably counterproductive to fault tolerance, at least when multiplied by some percentage of active hosts.
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.