Suddenly BOINC Decides to Abandon 71 APs...WTH?

Author	Message
jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1696848 - Posted: 29 Jun 2015, 22:52:33 UTC - in response to Message 1696833. True. What I'm trying to picture is any situation that scaled across many occurrences would indicate a source of compounding bloat. To me, in that context, freeing up the tasks makes sense, but only really after the detach/reattach is certain. Doing nothing is of course simpler to code and maintain :D "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1696848 ·

TBar Volunteer tester Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 1696859 - Posted: 29 Jun 2015, 23:40:50 UTC Whatever is done, it would be nice if it were soon; 29 Jun 2015, 23:22:28 UTC - Abandoned ID: 1696859 ·

Jeff Buck Volunteer tester Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1696862 - Posted: 30 Jun 2015, 0:56:57 UTC - in response to Message 1696848. True. What I'm trying to picture is any situation that scaled across many occurrences would indicate a source of compounding bloat. I guess you'd have to have some way of finding out how often legitimate detach/reattach events get caught by this code trap. But if database bloat was a serious concern of the powers that be, I can think of several other ongoing problem areas that could probably cut into that load significantly, IF they were ever addressed, starting with all the runaway hosts that maintain a revolving stash of thousands of Invalid tasks. To me, in that context, freeing up the tasks makes sense, but only really after the detach/reattach is certain. Which it certainly doesn't seem like it's currently accomplishing. By the way, have you figured out why the first group of tests with both doctored hostid and rpc_seqno fields didn't trigger the abandonment, while my final test with the lower rpc_seqno but an untouched hostid field was successful? Was the hostid check executed first, and then the rpc_seqno check bypassed after the hostid was corrected? Doing nothing is of course simpler to code and maintain :D Now there's a worthy goal!! ID: 1696862 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1696865 - Posted: 30 Jun 2015, 1:32:52 UTC - in response to Message 1696862. Last modified: 30 Jun 2015, 1:46:10 UTC Which it certainly doesn't seem like it's currently accomplishing. By the way, have you figured out why the first group of tests with both doctored hostid and rpc_seqno fields didn't trigger the abandonment, while my final test with the lower rpc_seqno but an untouched hostid field was successful? Was the hostid check executed first, and then the rpc_seqno check bypassed after the hostid was corrected? Yeah, hostid lookup is first. Personally I would have made user authentication first so as to reduce the exposure to DoS attacks, but that's a side issue for these purposes.
In the first case [no abandonment]:
- lookup by hostid (fails)
-- lookup by rpc seqno in user's hosts (fails, goto (!) lookup_user_and_make_new_host)
lookup_user_and_make_new_host:
- lookup user, match authenticators
- if cpid is present, scan the user's hosts and match it (succeeds, last-ditch attempt)
In the second case [tasks abandoned]:
- lookup by hostid (succeeds)
- lookup the user (succeeds)
- Authenticate (succeeds)
- rpc sequence number check (fails, goto (!) make_new_host)
make_new_host:
- Final attempt to locate host by scanning back through user's hosts matching hostname, IP, processor and amount of RAM (succeeds, next do *)
* if found (it was), use the existing record AND mark results as over (except if allow_multiple_clients is enabled)
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1696865 ·
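The two paths listed above can be sketched roughly as follows. This is only an illustration of the steps as described in the post, not the actual BOINC scheduler code: the `Host` and `Request` classes, `handle_request`, `new_host`, and the `fingerprint` tuple are all hypothetical names, the "lookup by rpc seqno in user's hosts" sub-step is left out, and the assumption that a request fails the check when its rpc_seqno is lower than the stored one is taken from the test described in the thread.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    hostid: int
    userid: int
    rpc_seqno: int
    cpid: str
    fingerprint: tuple  # hypothetical: (hostname, ip, processor, ram)
    in_progress: list = field(default_factory=list)


@dataclass
class Request:
    hostid: int
    authenticator: str
    rpc_seqno: int
    cpid: str
    fingerprint: tuple


def new_host(req, userid, hosts):
    """Create a fresh host record with the next free id (illustrative only)."""
    host = Host(max(hosts, default=0) + 1, userid, req.rpc_seqno,
                req.cpid, req.fingerprint)
    hosts[host.hostid] = host
    return host


def handle_request(req, hosts, users, allow_multiple_clients=False):
    """Return (host_record, abandoned_tasks) for one scheduler request."""
    host = hosts.get(req.hostid)          # hostid lookup comes first
    userid = users.get(req.authenticator)
    if userid is None:
        raise PermissionError("unknown authenticator")

    if host is None:
        # First case: hostid lookup fails -> lookup_user_and_make_new_host.
        # A cpid match reuses the old record and leaves its tasks alone.
        for h in hosts.values():
            if h.userid == userid and req.cpid and h.cpid == req.cpid:
                return h, []
        return new_host(req, userid, hosts), []

    if host.userid != userid:
        raise PermissionError("host does not belong to this user")

    if req.rpc_seqno >= host.rpc_seqno:
        host.rpc_seqno = req.rpc_seqno    # normal in-sequence request
        return host, []

    # Second case: sequence check fails -> make_new_host.
    # A hostname/IP/processor/RAM match reuses the record AND marks its
    # results as over, unless allow_multiple_clients is enabled.
    for h in hosts.values():
        if h.userid == userid and h.fingerprint == req.fingerprint:
            abandoned = []
            if not allow_multiple_clients:
                abandoned, h.in_progress = h.in_progress, []
            return h, abandoned
    return new_host(req, userid, hosts), []
```

Fed a doctored hostid plus a low rpc_seqno, this sketch takes the first branch and keeps the tasks; fed the real hostid with a low rpc_seqno, it takes the second branch and abandons them, which matches the test results reported in the thread.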

TBar Volunteer tester Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 1696869 - Posted: 30 Jun 2015, 1:45:05 UTC - in response to Message 1696865. Is there anyway something simple could work. Such as having the client send a cc: to the Server when it Times Out a Request? You know, when it logs a timeout on the host send a copy to the Server informing the Server it is canceling the request. ID: 1696869 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1696873 - Posted: 30 Jun 2015, 1:49:16 UTC - in response to Message 1696869. Last modified: 30 Jun 2015, 1:53:07 UTC Not sure. The transactions need to be atomic, and a two way dialog going on over several requests might be out. but again, not sure on that. Have to keep existing client behaviour in mind too. Separate issues raised by my prior post: I'm thinking adding the cpid search into the second procedure somewhere would salvage the situation for some cases, but possibly it's left out intentionally to generate a new host if the IP etc changed. This happening if you are on DHCP could create new hostids spontaneously, possibly a bit too easily IMO "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1696873 ·

Jeff Buck Volunteer tester Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1696876 - Posted: 30 Jun 2015, 2:00:38 UTC - in response to Message 1696865. Which it certainly doesn't seem like it's currently accomplishing. By the way, have you figured out why the first group of tests with both doctored hostid and rpc_seqno fields didn't trigger the abandonment, while my final test with the lower rpc_seqno but an untouched hostid field was successful? Was the hostid check executed first, and then the rpc_seqno check bypassed after the hostid was corrected? Yeah, hostid lookup is first. Personally I would have made user authentication first so as to reduce the exposure to DoS attacks, but that's a side issue for these purposes. In the first case [no abandonment]: - lookup by hostid (fails) -- lookup by rpc seqno in user's hosts (fails, goto (!) lookup_user_and_make_new_host) lookup_user_and_make_new_host: - lookup user, match authenticators - if cpid is present, scan the user's hosts and match it (succeeds, last-ditch attempt) In the second case [tasks abandoned]: - lookup by hostid (succeeds) - lookup the user (succeeds) - Authenticate (succeeds) - rpc sequence number check (fails, goto (!) make_new_host) make_new_host: - Final attempt to locate host by scanning back through user's hosts matching hostname, IP, processor and amount of RAM (succeeds, next do *) * if found (it was), use the existing record AND mark results as over (except if allow_multiple_clients is enabled) LOL! Yep, that explains it. It seems kind of mystifying in the second case for it to have to try to "locate" the host after the rpc sequence number check fails, when it's already succeeded in looking up the hostid and the user, and authenticating the request. Why in the world does it have to do all that additional scanning and then, only when it succeeds, trash the tasks in progress? Definite weirdness!
I certainly hope you can convince "someone" to make some changes! ;^) ID: 1696876 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1696877 - Posted: 30 Jun 2015, 2:10:07 UTC - in response to Message 1696876. Last modified: 30 Jun 2015, 2:11:37 UTC LOL! Yep, that explains it. It seems kind of mystifying in the second case for it to have to try to "locate" the host after the rpc sequence number check fails, when it's already succeeded in looking up the hostid and the user, and authenticating the request. Why in the world does it have to do all that additional scanning and then, only when it succeeds, trash the tasks in progress? Definite weirdness! I certainly hope you can convince "someone" to make some changes! ;^) Yeah, definitely oddball logic. The best I can fathom of the intent, from the comments and code, is that the idea is to punish you for moving the client state to another host. I'd have to think that if that's the case, then the collateral damage for a legitimately scrambled rpc sequence is too high. A less destructive choice, in my mind, is that they could use cpid match AND token match other elements, but leave out local IP, as those can be dynamic and especially volatile under communication stresses (that may cause a scrambled rpc sequence number). "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1696877 ·
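A minimal sketch of that less destructive match, assuming hosts are compared as plain dictionaries. The field names (`cpid`, `hostname`, `cpu`, `ram_bytes`, `ip`) and the function `is_same_machine` are made up for illustration and are not taken from the BOINC source; the only point is that the local IP is deliberately not part of the comparison.

```python
# Fields assumed stable across network hiccups; the local IP is left out on purpose.
STABLE_FIELDS = ("cpid", "hostname", "cpu", "ram_bytes")


def is_same_machine(known: dict, reported: dict) -> bool:
    """True when the cpid is present and every stable field matches."""
    if not known.get("cpid"):
        return False  # no cpid on record: nothing trustworthy to match on
    return all(known.get(f) == reported.get(f) for f in STABLE_FIELDS)
```

Under this rule a host whose DHCP lease changed would still be recognised as the same machine, while a different cpid would not match.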

Jeff Buck Volunteer tester Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1696879 - Posted: 30 Jun 2015, 2:20:53 UTC - in response to Message 1696869. Is there anyway something simple could work. Such as having the client send a cc: to the Server when it Times Out a Request? You know, when it logs a timeout on the host send a copy to the Server informing the Server it is canceling the request. In a sense, simply sending the next request should serve as that kind of notification, IF the higher rpc_seqno would cause the scheduler to ignore any request that it receives later but with the lower sequence number. Then, again, who's to say that the second request (or some other notification like you suggest) will always get to the scheduler before the first request. Even with the timeout, the first one could still conceivably get there first. The bottleneck might not cause a 9+ minute delay but maybe just a long enough delay that clears about the same time the host reaches its timeout deadline or, for that matter, anytime during that minute and a half between the timeout and the sending of the next request. Of course, the next request might also happen to hit a similar bottleneck. I don't really know how they could reliably synchronize requests for every possible situation. ID: 1696879 ·
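The "ignore the late arrival" idea could look something like the sketch below. It is hypothetical (the class `SeqnoGate` and its `admit` method are invented here, and this is not how the scheduler currently behaves according to the thread): a request whose rpc_seqno is lower than the highest one already seen for that host is simply dropped rather than treated as a sign of a moved or copied client.

```python
class SeqnoGate:
    """Tracks the highest rpc_seqno seen per host and rejects stale requests."""

    def __init__(self):
        self.highest = {}  # hostid -> highest rpc_seqno admitted so far

    def admit(self, hostid: int, rpc_seqno: int) -> bool:
        if rpc_seqno < self.highest.get(hostid, 0):
            return False  # stale: a newer request already got through
        self.highest[hostid] = rpc_seqno
        return True
```

As noted in the post, this only helps when the retry overtakes the delayed request. If the delayed request still arrives first, both are admitted in order and the server may have assigned work the client never heard about.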

TBar Volunteer tester Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 1696880 - Posted: 30 Jun 2015, 2:32:58 UTC - in response to Message 1696879. Last modified: 30 Jun 2015, 2:39:02 UTC Is there anyway something simple could work. Such as having the client send a cc: to the Server when it Times Out a Request? You know, when it logs a timeout on the host send a copy to the Server informing the Server it is canceling the request. In a sense, simply sending the next request should serve as that kind of notification, IF the higher rpc_seqno would cause the scheduler to ignore any request that it receives later but with the lower sequence number. Then, again, who's to say that the second request (or some other notification like you suggest) will always get to the scheduler before the first request. Even with the timeout, the first one could still conceivably get there first. The bottleneck might not cause a 9+ minute delay but maybe just a long enough delay that clears about the same time the host reaches its timeout deadline or, for that matter, anytime during that minute and a half between the timeout and the sending of the next request. Of course, the next request might also happen to hit a similar bottleneck. I don't really know how they could reliably synchronize requests for every possible situation. Seems the only solution would be to establish contact with the Server and then make the request. If they want to be that particular about it, go the whole nine yards. That way people won't get punished over a simple delayed packet. Here's a few you may have missed; 30 Jun 2015, 0:50:17 UTC - Abandoned 30 Jun 2015, 1:02:51 UTC - Abandoned 30 Jun 2015, 2:30:13 UTC - Abandoned http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=71714 There's still a Lot of 30 Jun left to go. ID: 1696880 ·

Jeff Buck Volunteer tester Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1696881 - Posted: 30 Jun 2015, 2:34:31 UTC - in response to Message 1696877. LOL! Yep, that explains it. It seems kind of mystifying in the second case for it to have to try to "locate" the host after the rpc sequence number check fails, when it's already succeeded in looking up the hostid and the user, and authenticating the request. Why in the world does it have to do all that additional scanning and then, only when it succeeds, trash the tasks in progress? Definite weirdness! I certainly hope you can convince "someone" to make some changes! ;^) Yeah, definitely oddball logic. The best I can fathom of the intent, from the comments and code, is that the idea is to punish you for moving the client state to another host. I'd have to think that if that's the case, then the collateral damage for a legitimately scrambled rpc sequence is too high. A less destructive choice, in my mind, is that they could use cpid match AND token match other elements, but leave out local IP, as those can be dynamic and especially volatile under communication stresses (that may cause a scrambled rpc sequence number). It'd be one thing if it failed one of those matches for "hostname, IP, processor and amount of RAM", but to abandon tasks when it was successful seems awfully strange. I agree with you, too, about relying on the IP lookup as part of the validation. Personally, I don't use DHCP, and the static IPs I've assigned rarely change. (That host I tested with shows "same the last 1874 times".) But I could conceivably shuffle some IPs if I make a change, and DHCP would certainly seem like a crapshoot for those using it, especially when adding or deleting a device. ID: 1696881 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1696885 - Posted: 30 Jun 2015, 3:02:01 UTC - in response to Message 1696881. Last modified: 30 Jun 2015, 3:12:32 UTC Hmmm, yeah, definitely seems backwards. Perhaps it wasn't fully thought out. In any case, I think the basic trigger, assuming that a low rpc number followed by a host match means the user is juggling hosts/folders (the reason to leave out the cpid search in this path), is pretty thin logic. If you transfer the data folder to an identical host [name it the same], adjust the local IP to the old one, and the key hardware is the same, why care? Maybe it's assuming you copied the client state and forgot the rest of the data folder? [Even then, the sequence number would be fine...] "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1696885 ·

Jeff Buck Volunteer tester Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1696888 - Posted: 30 Jun 2015, 3:24:17 UTC - in response to Message 1696885. Hmmm, yeah, definitely seems backwards. Perhaps it wasn't fully thought out. In any case, I think the basic trigger, assuming that a low rpc number followed by a host match means the user is juggling hosts/folders (the reason to leave out the cpid search in this path), is pretty thin logic. If you transfer the data folder to an identical host [name it the same], adjust the local IP to the old one, and the key hardware is the same, why care? Maybe it's assuming you copied the client state and forgot the rest of the data folder? [Even then, the sequence number would be fine...] It almost seems like that "make_new_host" logic was originally written for another purpose, then just co-opted later for use by the rpc_seqno checking. Are there other routines that perform, or "goto", that code? (BTW, I'm a retired dinosaur, and if you ever really want to try a brain-bender, take a crack at following the logic of an old COBOL program with ALTER statements in it, or the equivalent in ALC. AAAAARRRRGGGGHHH!) ID: 1696888 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1696890 - Posted: 30 Jun 2015, 3:37:01 UTC - in response to Message 1696888. lol, yeah, thankfully I skipped COBOL, and started with Fortran, Pascal and C well after various assembly languages. I'm not uncomfortable with branching etc., but fail to see why the higher level language features (like functions, lol, inlined for efficiency if necessary) aren't used for this purpose. I've known of two legitimate uses for goto statements from back in comp sci days. The first was in jumping out of a deeply nested parsing or lexing piece of code. The other is in memory management cleanup, but even that has been supplanted by switch/case statements lately. Definitely not generally used, AFAIK, in authentication logic, or anything else go/no-go like it. Yeah, looks like a number of locations branch down to those, after which, on successful retrieval, there is a goto got_host (lol). Could just all be signs of bandaid induced entropy. It is sometimes hard to know when to throw out semi-working code and replace it. That authentication procedure should be rethought and rewritten at a higher level (while keeping it working with existing clients). Not that big a deal when you see how I was able to list the steps relatively easily. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1696890 ·

Jeff Buck Volunteer tester Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0	Message 1696892 - Posted: 30 Jun 2015, 4:00:12 UTC - in response to Message 1696890. Could just all be signs of bandaid induced entropy. I think that description pretty much fits any programs that have been around for more than, oh say, six months, especially if they have more than one person's fingerprints on them. Well, I don't think there's anything more I can really offer here so, as I said before, I sure hope you can convince "someone" to make, or at least implement, some changes. ;^) Good luck! ID: 1696892 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1696896 - Posted: 30 Jun 2015, 4:06:06 UTC - in response to Message 1696892. Last modified: 30 Jun 2015, 4:06:28 UTC Yeah, I'll probably bounce it around for a while with the team, then eventually post something to boinc_dev. Not sure of the best solution yet, but I think we can all see that an out of sequence rpc alone isn't a good trigger for radical actions. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1696896 ·

Cactus Bob Send message Joined: 19 May 99 Posts: 209 Credit: 10,924,287 RAC: 29	Message 1696918 - Posted: 30 Jun 2015, 6:03:02 UTC In the late 80's my first programming experience was with GW Basic. I seem to remember using GOTO a LOT. If Then and GOTO line. I dabbled in hexi a tiny bit. Used windows basic for a couple months but never jumped to C. So I guess that makes me quaint..lol. More likely just an old codger who never mastered anything useful. I did write a few useful programs in GW basic and even made several hundred bucks on a couple programs. One I remember was a scheduling program for employees at a convenience store. Been too busy doing graphic design for the last 20 years to program anything. Oh well, maybe now that I am "retired / divorced" I should look at the whole programming thing again. Bob Sometimes I wonder, what happened to all the people I gave directions to? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ SETI@home classic workunits 4,321 SETI@home classic CPU time 22,169 hours ID: 1696918 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874	Message 1696935 - Posted: 30 Jun 2015, 7:21:17 UTC - in response to Message 1696877. Yeah definitely oddball logic. The best I can fathom of the intent from the comments and code, is that the idea is to punish you for moving the client state to another host. I'd have to think if that's the case, then the collateral damage for legitimately scrambled rpc sequence is too high. I don't think there is any problem with moving client state (plus the rest of the data folder) to another machine - I've certainly done that after hardware failures, to clean up tasks in progress while I fix the hardware. And then I've moved it back again to its original hardware home when the repair is complete. Where I can see that problems might occur, and some sort of problem resolution is needed in code, is when client state is copied to another machine, and both instances are left active. BOINC requires that HostIDs are unique (and that's probably a good idea): maybe the problem is that the server code is too good at finding and re-using old HostIDs - a copied state file is one situation where one of them really does need to be assigned a new identity. ID: 1696935 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1696951 - Posted: 30 Jun 2015, 8:58:51 UTC - in response to Message 1696935. If that's the intention, then there are still options. Time out scheduler transactions (ensuring they are atomic and completely rolled back on failure) before the client rpc timeout interval expires. This ensures that, for the normal non-copied or non-moved case, both sides agree on the state. Not that hard to do. Anything that makes a modification just gets queued and done at once, as a quick (and as narrow as possible) string of tasks, as opposed to an assortment of updates scattered amongst reads/lookups and processing. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1696951 ·
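The "queue the modifications, then apply them in one narrow transaction" idea can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module purely as a stand-in database; the `host` and `result` tables, their columns, the `run_request` function and the `fail` flag (which simulates the scheduler hitting its own timeout) are all assumptions made for the example, not the real BOINC schema or code.

```python
import sqlite3


def run_request(conn, hostid, new_seqno, fail=False):
    """Do all reads first, queue the writes, then apply them atomically."""
    pending = []  # queued (sql, params) pairs; nothing is written yet

    # Read/lookup phase: no modifications scattered in here.
    row = conn.execute(
        "SELECT rpc_seqno FROM host WHERE id = ?", (hostid,)).fetchone()
    if row is None:
        raise LookupError("unknown host")
    pending.append(
        ("UPDATE host SET rpc_seqno = ? WHERE id = ?", (new_seqno, hostid)))
    pending.append(
        ("UPDATE result SET state = 'sent' "
         "WHERE hostid = ? AND state = 'unsent'", (hostid,)))

    # Write phase: one short transaction. The connection context manager
    # commits on success and rolls everything back if an exception escapes.
    with conn:
        for sql, params in pending:
            conn.execute(sql, params)
        if fail:
            raise TimeoutError("simulated scheduler-side timeout")
```

If the simulated timeout fires, neither the sequence number nor the task state changes, so a client that gave up on the request and the server would still agree on the state.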

William Volunteer tester Send message Joined: 14 Feb 13 Posts: 2037 Credit: 17,689,662 RAC: 0	Message 1696953 - Posted: 30 Jun 2015, 9:09:48 UTC - in response to Message 1696935. Last modified: 30 Jun 2015, 9:11:01 UTC Yeah definitely oddball logic. The best I can fathom of the intent from the comments and code, is that the idea is to punish you for moving the client state to another host. I'd have to think if that's the case, then the collateral damage for legitimately scrambled rpc sequence is too high. I don't think there is any problem with moving client state (plus the rest of the data folder) to another machine - I've certainly done that after hardware failures, to clean up tasks in progress while I fix the hardware. And then I've moved it back again to its original hardware home when the repair is complete. Where I can see that problems might occur, and some sort of problem resolution is needed in code, is when client state is copied to another machine, and both instances are left active. BOINC requires that HostIDs are unique (and that's probably a good idea): maybe the problem is that the server code is too good at finding and re-using old HostIDs - a copied state file is one situation where one of them really does need to be assigned a new identity. Moving complete data folders around hosts isn't a problem. Then again, I run them CLI only. And I certainly didn't run the same folder (and with that hostID) on two different hosts at the same time. A person who won't read has no advantage over one who can't read. (Mark Twain) ID: 1696953 ·

