Suddenly BOINC Decides to Abandon 71 APs...WTH?
Author | Message |
---|---|
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
True. What I'm trying to picture is any situation that scaled across many occurrences would indicate a source of compounding bloat. To me, in that context, freeing up the tasks makes sense, but only really after the detach/reattach is certain. Doing nothing is of course simpler to code and maintain :D "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Whatever is done, it would be nice if it were soon; 29 Jun 2015, 23:22:28 UTC - Abandoned |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
"True. What I'm trying to picture is any situation that scaled across many occurrences would indicate a source of compounding bloat." I guess you'd have to have some way of finding out how often legitimate detach/reattach events get caught by this code trap. But if database bloat was a serious concern of the powers that be, I can think of several other ongoing problem areas that could probably cut into that load significantly, IF they were ever addressed, starting with all the runaway hosts that maintain a revolving stash of thousands of Invalid tasks. "To me, in that context, freeing up the tasks makes sense, but only really after the detach/reattach is certain." Which it certainly doesn't seem like it's currently accomplishing. By the way, have you figured out why the first group of tests with both doctored hostid and rpc_seqno fields didn't trigger the abandonment, while my final test with the lower rpc_seqno but an untouched hostid field was successful? Was the hostid check executed first, and then the rpc_seqno check bypassed after the hostid was corrected? "Doing nothing is of course simpler to code and maintain :D" Now there's a worthy goal!! |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"Which it certainly doesn't seem like it's currently accomplishing. By the way, have you figured out why the first group of tests with both doctored hostid and rpc_seqno fields didn't trigger the abandonment, while my final test with the lower rpc_seqno but an untouched hostid field was successful? Was the hostid check executed first, and then the rpc_seqno check bypassed after the hostid was corrected?" Yeah, hostid lookup is first. Personally I would have made user authentication first so as to reduce the exposure to DoS attacks, but that's a side issue for these purposes. In the first case [no abandonment]: (1) lookup by hostid (fails); (2) lookup by rpc seqno in user's hosts (fails, goto (!) lookup_user_and_make_new_host). At lookup_user_and_make_new_host: (3) look up user, match authenticators; (4) if cpid is present, scan the user's hosts and match it (succeeds, last-ditch attempt). In the second case [tasks abandoned]: (1) lookup by hostid (succeeds); (2) look up the user (succeeds); (3) authenticate (succeeds); (4) rpc sequence number check (fails, goto (!) make_new_host). At make_new_host: (5) final attempt to locate host by scanning back through user's hosts matching hostname, IP, processor and amount of RAM (succeeds, next do ***). *** If found (it was), use the existing record AND mark results as over (except if allow_multiple_clients is enabled). |
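For readers trying to follow the two paths listed in the post above, here is a rough Python sketch of the control flow as described. It is not the BOINC scheduler code: the `Host` and `Request` records, the `locate_host` function, and the in-memory `hosts`/`users` dictionaries are all hypothetical stand-ins, and the "lookup by rpc seqno in user's hosts" sub-step of the first case is left out for brevity. Only the label names and `allow_multiple_clients` come from the post.

```python
from dataclasses import dataclass


@dataclass
class Host:  # hypothetical stand-in for a host database row
    hostid: int
    userid: int
    rpc_seqno: int
    cpid: str
    hostname: str
    ip: str
    cpu: str
    ram_mb: int


@dataclass
class Request:  # hypothetical stand-in for a scheduler request
    hostid: int
    authenticator: str
    rpc_seqno: int
    cpid: str
    hostname: str
    ip: str
    cpu: str
    ram_mb: int


def locate_host(req, hosts, users, allow_multiple_clients=False):
    """Return (host, tasks_abandoned) following the two paths in the post.

    hosts maps hostid -> Host; users maps authenticator -> userid.
    """
    host = hosts.get(req.hostid)
    userid = users.get(req.authenticator)
    if userid is None:
        return None, False  # authentication failed

    if host is None:
        # First case: hostid lookup fails, so fall through to what the post
        # calls lookup_user_and_make_new_host and try a cpid match.
        for h in hosts.values():
            if h.userid == userid and h.cpid == req.cpid:
                return h, False  # existing host reused, tasks untouched
        return None, False  # a real server would create a new host here

    if host.userid != userid:
        return None, False

    if req.rpc_seqno >= host.rpc_seqno:
        return host, False  # normal request, nothing unusual

    # Second case: sequence number is lower than expected, so fall through to
    # make_new_host and scan the user's hosts by name, IP, CPU and RAM.
    wanted = (req.hostname, req.ip, req.cpu, req.ram_mb)
    for h in hosts.values():
        if h.userid == userid and (h.hostname, h.ip, h.cpu, h.ram_mb) == wanted:
            # Existing record is reused AND in-progress results are marked
            # over, unless allow_multiple_clients is enabled.
            return h, not allow_multiple_clients
    return None, False  # a real server would create a new host here
```

Under these assumptions, a request with a doctored hostid and a low rpc_seqno comes back as `(host, False)`, while one with the correct hostid and a low rpc_seqno comes back as `(host, True)`, which matches the test results reported earlier in the thread.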
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Is there any way something simple could work? Such as having the client send a cc: to the Server when it Times Out a Request? You know, when it logs a timeout on the host, send a copy to the Server informing the Server it is canceling the request. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Not sure. The transactions need to be atomic, and a two-way dialog going on over several requests *might* be out. But again, not sure on that. Have to keep existing client behaviour in mind too. Separate issues raised by my prior post: I'm thinking adding the cpid search into the second procedure somewhere would salvage the situation for some cases, but possibly it's left out intentionally to generate a new host if the IP etc. changed. **This happening if you are on DHCP could create new hostids spontaneously**, possibly a bit too easily IMO. |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
"Which it certainly doesn't seem like it's currently accomplishing. By the way, have you figured out why the first group of tests with both doctored hostid and rpc_seqno fields didn't trigger the abandonment, while my final test with the lower rpc_seqno but an untouched hostid field was successful? Was the hostid check executed first, and then the rpc_seqno check bypassed after the hostid was corrected?" LOL! Yep, that explains it. It seems kind of mystifying in the second case for it to have to try to "locate" the host after the rpc sequence number check fails, when it's already succeeded in looking up the hostid and the user, and authenticating the request. Why in the world does it have to do all that additional scanning and then, only when it succeeds, trash the tasks in progress? Definite weirdness! I certainly hope you can convince "someone" to make some changes! ;^) |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"LOL! Yep, that explains it. It seems kind of mystifying in the second case for it to have to try to 'locate' the host after the rpc sequence number check fails, when it's already succeeded in looking up the hostid and the user, and authenticating the request. Why in the world does it have to do all that additional scanning and then, only when it succeeds, trash the tasks in progress? Definite weirdness! I certainly hope you can convince 'someone' to make some changes! ;^)" Yeah, definitely oddball logic. The best I can fathom of the intent from the comments and code is that the idea is to punish you for moving the client state to another host. I'd have to think that if that's the case, then the collateral damage for a legitimately scrambled rpc sequence is too high. A less destructive choice, in my mind, is that they could use a cpid match AND token match other elements, but leave out local IPs, as they can be dynamic and especially volatile under communication stresses (that may cause a scrambled rpc sequence number). |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
"Is there any way something simple could work? Such as having the client send a cc: to the Server when it Times Out a Request? You know, when it logs a timeout on the host, send a copy to the Server informing the Server it is canceling the request." In a sense, simply sending the next request should serve as that kind of notification, IF the higher rpc_seqno would cause the scheduler to ignore any request that it receives later but with the lower sequence number. Then, again, who's to say that the second request (or some other notification like you suggest) will always get to the scheduler before the first request? Even with the timeout, the first one could still conceivably get there first. The bottleneck might not cause a 9+ minute delay but maybe just a long enough delay that clears about the same time the host reaches its timeout deadline or, for that matter, anytime during that minute and a half between the timeout and the sending of the next request. Of course, the next request might also happen to hit a similar bottleneck. I don't really know how they could reliably synchronize requests for every possible situation. |
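The "ignore the older request" idea in the post above can be sketched in a few lines of Python. This is only an illustration of the suggestion, not anything the scheduler is known to do: the `SeqnoGate` class and its `accept` method are hypothetical names, and a real server would keep the last seen number in the host record rather than in memory.

```python
class SeqnoGate:
    """Accept a request only if its rpc_seqno is newer than the last one seen.

    Hypothetical helper illustrating the suggestion; not part of BOINC.
    """

    def __init__(self):
        self.last_seen = {}  # hostid -> highest rpc_seqno accepted so far

    def accept(self, hostid, rpc_seqno):
        last = self.last_seen.get(hostid, -1)
        if rpc_seqno <= last:
            return False  # stale or duplicate request: ignore, change nothing
        self.last_seen[hostid] = rpc_seqno
        return True


gate = SeqnoGate()
# Request 5 is stuck in a bottleneck, the client times out and sends request 6.
overtaken = [gate.accept(71714, 6), gate.accept(71714, 5)]

# The caveat from the post: if the delayed request 5 still arrives first,
# the scheduler processes both, even though the client gave up on 5.
gate2 = SeqnoGate()
in_order = [gate2.accept(71714, 5), gate2.accept(71714, 6)]
```

In the first ordering the late request 5 is simply dropped instead of triggering abandonment; in the second ordering the gate cannot help, which is the synchronization problem the post points out.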
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
"Is there any way something simple could work? Such as having the client send a cc: to the Server when it Times Out a Request? You know, when it logs a timeout on the host, send a copy to the Server informing the Server it is canceling the request." Seems the only solution would be to establish contact with the Server and then make the request. If they want to be that particular about it, go the whole nine yards. That way people won't get punished over a simple delayed packet. Here's a few you may have missed: 30 Jun 2015, 0:50:17 UTC - Abandoned; 30 Jun 2015, 1:02:51 UTC - Abandoned; 30 Jun 2015, 2:30:13 UTC - Abandoned. http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=71714 There's still a lot of 30 Jun left to go. |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
"LOL! Yep, that explains it. It seems kind of mystifying in the second case for it to have to try to 'locate' the host after the rpc sequence number check fails, when it's already succeeded in looking up the hostid and the user, and authenticating the request. Why in the world does it have to do all that additional scanning and then, only when it succeeds, trash the tasks in progress? Definite weirdness! I certainly hope you can convince 'someone' to make some changes! ;^)" It'd be one thing if it failed one of those matches for "hostname, IP, processor and amount of RAM", but to abandon tasks when it was successful seems awfully strange. I agree with you, too, about relying on the IP lookup as part of the validation. Personally, I don't use DHCP, and the static IPs I've assigned rarely change. (That host I tested with shows "same the last 1874 times".) But I could conceivably shuffle some IPs if I make a change, and DHCP would certainly seem like a crapshoot for those using it, especially when adding or deleting a device. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Hmmm, yeah, definitely seems backwards. Perhaps it wasn't fully thought out. In any case, I think the basic trigger of assuming that a low rpc number, followed by a host match, means the user is juggling hosts/folders, which is the reason to leave out the cpid search in this path, is pretty thin logic. If you transfer the data folder to an identical host [name it the same], adjust the local IP to the old one, and the key hardware is the same, why care? Maybe it's assuming you copied the client state and forgot the rest of the data folder? [Even then, the sequence number would be fine...] |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
"Hmmm, yeah, definitely seems backwards. Perhaps it wasn't fully thought out." It almost seems like that "make_new_host" logic was originally written for another purpose, then just co-opted later for use by the rpc_seqno checking. Are there other routines that perform, or "goto", that code? (BTW, I'm a retired dinosaur, and if you ever really want to try a brain-bender, take a crack at following the logic of an old COBOL program with ALTER statements in it, or the equivalent in ALC. AAAAARRRRGGGGHHH!) |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Lol, yeah, thankfully I skipped COBOL, and started with Fortran, Pascal and C well after various assembly languages. I'm not uncomfortable with branching etc., but fail to see why the higher-level language features (like functions, lol, inlined for efficiency if necessary) aren't used for this purpose. I've known of two legitimate uses for goto statements from back in comp sci days. The first was in jumping out of a deeply nested parsing or lexing piece of code. The other is in memory management cleanup, but even that has been supplanted by switch/case statements lately. Definitely not generally used, AFAIK, in authentication logic, or anything else go/no-go like it. Yeah, looks like a number of locations branch down to those, and after successful retrievals there is a goto got_host (lol). Could just all be signs of bandaid-induced entropy. It is sometimes hard to know when to throw out semi-working code and replace it. That authentication procedure should be rethought and rewritten at a higher level (while keeping it working with existing clients). Not that big a deal when you see how I was able to list the steps relatively easily. |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
"Could just all be signs of bandaid induced entropy." I think that description pretty much fits any programs that have been around for more than, oh say, six months, especially if they have more than one person's fingerprints on them. Well, I don't think there's anything more I can really offer here so, as I said before, I sure hope you can convince "someone" to make, or at least implement, some changes. ;^) Good luck! |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Yeah, I'll probably bounce it around for a while with the team, then eventually post something to boinc_dev. Not sure of the best solution yet, but I think we can all see that an out of sequence rpc alone isn't a good trigger for radical actions. |
Cactus Bob Send message Joined: 19 May 99 Posts: 209 Credit: 10,924,287 RAC: 29 |
In the late 80's my first programming experience was with GW Basic. I seem to remember using GOTO a LOT. If Then and GOTO line. I dabbled in hex a tiny bit. Used Windows Basic for a couple of months but never jumped to C. So I guess that makes me quaint..lol. More likely just an old codger who never mastered anything useful. I did write a few useful programs in GW Basic and even made several hundred bucks on a couple of programs. One I remember was a scheduling program for employees at a convenience store. Been too busy doing graphic design for the last 20 years to program anything. Oh well, maybe now that I am "retired / divorced" I should look at the whole programming thing again. Bob Sometimes I wonder, what happened to all the people I gave directions to? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ SETI@home classic workunits 4,321 SETI@home classic CPU time 22,169 hours |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
"Yeah definitely oddball logic. The best I can fathom of the intent from the comments and code, is that the idea is to punish you for moving the client state to another host. I'd have to think if that's the case, then the collateral damage for legitimately scrambled rpc sequence is too high." I don't think there is any problem with moving client state (plus the rest of the data folder) to another machine - I've certainly done that after hardware failures, to clean up tasks in progress while I fix the hardware. And then I've moved it back again to its original hardware home when the repair is complete. Where I can see that problems might occur, and some sort of problem resolution is needed in code, is when client state is copied to another machine, and both instances are left active. BOINC requires that HostIDs are unique (and that's probably a good idea): maybe the problem is that the server code is too good at finding and re-using old HostIDs - a copied state file is one situation where one of them really does need to be assigned a new identity. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
If that's the intention, then there are still options. Time out scheduler transactions (ensuring they are atomic and completely rolled back on failure) before the client rpc timeout interval expires. This ensures that, for the normal non-copied or non-moved case, both sides agree on the state. Not that hard to do. Anything that makes a modification just gets queued and done at once, as a quick (and as narrow as possible) string of tasks, as opposed to an assortment of updates scattered amongst reads/lookups and processing. |
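As a rough illustration of the approach suggested in the post above (queue the modifications, apply them in one narrow transaction, roll everything back if the server-side deadline has passed), here is a minimal Python sketch using the standard `sqlite3` module. The table layout, `make_db`, `handle_request` and the `deadline` parameter are all assumptions made for the example; the real scheduler uses its own database layer and schema.

```python
import sqlite3
import time


def make_db():
    """Build a tiny in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE host (id INTEGER PRIMARY KEY, rpc_seqno INTEGER)")
    conn.execute("CREATE TABLE result (id INTEGER PRIMARY KEY, hostid INTEGER)")
    conn.execute("INSERT INTO host VALUES (1, 10)")
    conn.execute("INSERT INTO result VALUES (100, NULL)")
    conn.commit()
    return conn


def handle_request(conn, hostid, new_seqno, result_ids, deadline):
    """Read first, queue every write, then apply the queue atomically.

    deadline is a time.monotonic() value chosen to expire before the
    client's own rpc timeout, so both sides agree on the outcome.
    """
    # Phase 1: lookups and processing only, no writes yet.
    row = conn.execute("SELECT rpc_seqno FROM host WHERE id = ?", (hostid,)).fetchone()
    if row is None:
        raise LookupError("unknown host")
    pending = [("UPDATE host SET rpc_seqno = ? WHERE id = ?", (new_seqno, hostid))]
    for rid in result_ids:
        pending.append(("UPDATE result SET hostid = ? WHERE id = ?", (hostid, rid)))

    # Phase 2: one short transaction. The connection context manager commits
    # on success and rolls back if an exception escapes the block.
    with conn:
        for sql, params in pending:
            conn.execute(sql, params)
        if time.monotonic() > deadline:
            raise TimeoutError("server deadline passed, rolling back")
```

With this shape, a request that runs past its deadline leaves both the host's sequence number and the task assignments exactly as they were, so a retry from the client starts from a state both sides agree on.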
William Send message Joined: 14 Feb 13 Posts: 2037 Credit: 17,689,662 RAC: 0 |
"Yeah definitely oddball logic. The best I can fathom of the intent from the comments and code, is that the idea is to punish you for moving the client state to another host. I'd have to think if that's the case, then the collateral damage for legitimately scrambled rpc sequence is too high." Moving complete data folders around hosts isn't a problem. Then again, I run them CLI only. And I certainly didn't run the same folder (and with that hostID) on two different hosts at the same time. A person who won't read has no advantage over one who can't read. (Mark Twain) |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.