Suddenly BOINC Decides to Abandon 71 APs...WTH?

Message boards : Number crunching : Suddenly BOINC Decides to Abandon 71 APs...WTH?
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1696848 - Posted: 29 Jun 2015, 22:52:33 UTC - in response to Message 1696833.  

True. What I'm trying to picture is any situation that scaled across many occurrences would indicate a source of compounding bloat. To me, in that context, freeing up the tasks makes sense, but only really after the detach/reattach is certain. Doing nothing is of course simpler to code and maintain :D
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1696848 · Report as offensive
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1696859 - Posted: 29 Jun 2015, 23:40:50 UTC

Whatever is done, it would be nice if it were soon;
29 Jun 2015, 23:22:28 UTC - Abandoned
ID: 1696859 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1696862 - Posted: 30 Jun 2015, 0:56:57 UTC - in response to Message 1696848.  

True. What I'm trying to picture is any situation that scaled across many occurrences would indicate a source of compounding bloat.
I guess you'd have to have some way of finding out how often legitimate detach/reattach events get caught by this code trap. But if database bloat was a serious concern of the powers that be, I can think of several other ongoing problem areas that could probably cut into that load significantly, IF they were ever addressed, starting with all the runaway hosts that maintain a revolving stash of thousands of Invalid tasks.

To me, in that context, freeing up the tasks makes sense, but only really after the detach/reattach is certain.
Which it certainly doesn't seem like it's currently accomplishing. By the way, have you figured out why the first group of tests with both doctored hostid and rpc_seqno fields didn't trigger the abandonment, while my final test with the lower rpc_seqno but an untouched hostid field was successful? Was the hostid check executed first, and then the rpc_seqno check bypassed after the hostid was corrected?

Doing nothing is of course simpler to code and maintain :D
Now there's a worthy goal!!
ID: 1696862 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1696865 - Posted: 30 Jun 2015, 1:32:52 UTC - in response to Message 1696862.  
Last modified: 30 Jun 2015, 1:46:10 UTC

Which it certainly doesn't seem like it's currently accomplishing. By the way, have you figured out why the first group of tests with both doctored hostid and rpc_seqno fields didn't trigger the abandonment, while my final test with the lower rpc_seqno but an untouched hostid field was successful? Was the hostid check executed first, and then the rpc_seqno check bypassed after the hostid was corrected?


Yeah, hostid lookup is first. Personally I would have made user authentication first so as to reduce the exposure to DoS attacks, but that's a side issue for these purposes.

In the first case [no abandonment]:
- lookup by hostid (fails)
-- lookup by rpc seqno in user's hosts (fails, goto (!) lookup_user_and_make_new_host)
lookup_user_and_make_new_host:
- lookup user, match authenticators
- if cpid is present, scan the user's hosts and match it. (succeeds, last ditch attempt)

In the second case [tasks abandoned]:
- lookup by hostid (succeeds)
- lookup the user (succeeds)
- Authenticate (succeeds)
- rpc sequence number check (fails, goto (!) make_new_host)
make_new_host:
- Final attempt to locate host by scanning back through user's hosts matching hostname, IP, processor and amount of RAM. (succeeds; next do ***)
*** if found (it was), use the existing record AND mark results as over (except if allow_multiple_clients is enabled)
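
As a rough illustration only, the two paths listed above could be sketched in Python as follows. This is not the BOINC scheduler code: `Host`, `handle_request` and the field names are hypothetical, authentication and the intermediate "lookup by rpc seqno" step are left out, and the exact comparison used for the sequence number check is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    hostid: int
    userid: int
    rpc_seqno: int
    cpid: str
    name: str
    ip: str
    cpu: str
    ram: int
    in_progress: list = field(default_factory=list)


def handle_request(hosts, req, allow_multiple_clients=False):
    """Return (host_record, abandoned_tasks) for a scheduler request.

    hosts: dict mapping hostid -> Host (the server's records).
    req:   a Host-shaped snapshot of what the client sent.
    """
    host = hosts.get(req.hostid)
    if host is None:
        # First case: hostid lookup fails, so fall back to a cpid scan
        # of the user's hosts. Nothing is abandoned on this path.
        for h in hosts.values():
            if h.userid == req.userid and h.cpid == req.cpid:
                return h, []
        return None, []  # a real scheduler would create a new host here

    if req.rpc_seqno >= host.rpc_seqno:  # assumed form of the check
        host.rpc_seqno = req.rpc_seqno
        return host, []

    # Second case: sequence number went backwards -> "make_new_host" path.
    # Scan the user's hosts for a hostname/IP/processor/RAM match.
    wanted = (req.userid, req.name, req.ip, req.cpu, req.ram)
    for h in hosts.values():
        if (h.userid, h.name, h.ip, h.cpu, h.ram) == wanted:
            abandoned = []
            if not allow_multiple_clients:
                # Reuse the record AND mark its results as over.
                abandoned, h.in_progress = h.in_progress, []
            return h, abandoned
    return None, []  # no match: a new host record would be created
```

With one stored host, a request carrying a doctored hostid and a low sequence number lands on the cpid scan and keeps its tasks, while a request with the correct hostid and a low sequence number lands on the hardware/IP scan and loses them, which matches the test results described in this thread.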
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1696865 · Report as offensive
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1696869 - Posted: 30 Jun 2015, 1:45:05 UTC - in response to Message 1696865.  

Is there any way something simple could work? Such as having the client send a cc: to the Server when it times out a request? You know, when it logs a timeout on the host, send a copy to the Server informing the Server it is canceling the request.
ID: 1696869 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1696873 - Posted: 30 Jun 2015, 1:49:16 UTC - in response to Message 1696869.  
Last modified: 30 Jun 2015, 1:53:07 UTC

Not sure. The transactions need to be atomic, and a two-way dialog going on over several requests *might* be out, but again, not sure on that. Have to keep existing client behaviour in mind too.

Separate issues raised by my prior post:
I'm thinking adding the cpid search into the second procedure somewhere would salvage the situation for some cases, but possibly it's left out intentionally to generate a new host if the IP etc changed.

**This happening if you are on DHCP could create new hostids spontaneously**, possibly a bit too easily IMO
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1696873 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1696876 - Posted: 30 Jun 2015, 2:00:38 UTC - in response to Message 1696865.  

Which it certainly doesn't seem like it's currently accomplishing. By the way, have you figured out why the first group of tests with both doctored hostid and rpc_seqno fields didn't trigger the abandonment, while my final test with the lower rpc_seqno but an untouched hostid field was successful? Was the hostid check executed first, and then the rpc_seqno check bypassed after the hostid was corrected?


Yeah, hostid lookup is first. Personally I would have made user authentication first so as to reduce the exposure to DoS attacks, but that's a side issue for these purposes.

In the first case [no abandonment]:
- lookup by hostid (fails)
-- lookup by rpc seqno in user's hosts (fails, goto (!) lookup_user_and_make_new_host)
lookup_user_and_make_new_host:
- lookup user, match authenticators
- if cpid is present, scan the user's hosts and match it. (succeeds, last ditch attempt)

In the second case [tasks abandoned]:
- lookup by hostid (succeeds)
- lookup the user (succeeds)
- Authenticate (succeeds)
- rpc sequence number check (fails, goto (!) make_new_host)
make_new_host:
- Final attempt to locate host by scanning back through user's hosts matching hostname, IP, processor and amount of RAM. (succeeds; next do ***)
*** if found (it was), use the existing record AND mark results as over (except if allow_multiple_clients is enabled)

LOL! Yep, that explains it. It seems kind of mystifying in the second case for it to have to try to "locate" the host after the rpc sequence number check fails, when it's already succeeded in looking up the hostid and the user, and authenticating the request. Why in the world does it have to do all that additional scanning and then, only when it succeeds, trash the tasks in progress? Definite weirdness! I certainly hope you can convince "someone" to make some changes! ;^)
ID: 1696876 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1696877 - Posted: 30 Jun 2015, 2:10:07 UTC - in response to Message 1696876.  
Last modified: 30 Jun 2015, 2:11:37 UTC

LOL! Yep, that explains it. It seems kind of mystifying in the second case for it to have to try to "locate" the host after the rpc sequence number check fails, when it's already succeeded in looking up the hostid and the user, and authenticating the request. Why in the world does it have to do all that additional scanning and then, only when it succeeds, trash the tasks in progress? Definite weirdness! I certainly hope you can convince "someone" to make some changes! ;^)


Yeah definitely oddball logic. The best I can fathom of the intent from the comments and code, is that the idea is to punish you for moving the client state to another host. I'd have to think if that's the case, then the collateral damage for legitimately scrambled rpc sequence is too high.

A less destructive choice, in my mind, is that they could use a cpid match AND a token match on other elements, but leave out local IP, as those can be dynamic and especially volatile under communication stresses (which may cause a scrambled rpc sequence number).
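
A minimal sketch of that idea, under stated assumptions: the function name `same_host`, the dictionary keys, and the "cpid plus at least two stable attributes" threshold are all hypothetical choices for illustration, not anything taken from the BOINC source. The point is only that the local IP never enters the comparison.

```python
def same_host(record, request, min_matches=2):
    """Hypothetical matcher: the cpid must match, plus at least
    `min_matches` of the stable attributes. Local IP is ignored."""
    if not record.get("cpid") or record.get("cpid") != request.get("cpid"):
        return False
    stable = ("hostname", "processor", "ram_bytes")
    matches = sum(
        1
        for key in stable
        if record.get(key) is not None and record.get(key) == request.get(key)
    )
    return matches >= min_matches
```

Under this rule a host whose DHCP lease changed its address would still be recognised as the same machine, while a request with a different cpid would not.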
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1696877 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1696879 - Posted: 30 Jun 2015, 2:20:53 UTC - in response to Message 1696869.  

Is there any way something simple could work? Such as having the client send a cc: to the Server when it times out a request? You know, when it logs a timeout on the host, send a copy to the Server informing the Server it is canceling the request.

In a sense, simply sending the next request should serve as that kind of notification, IF the higher rpc_seqno would cause the scheduler to ignore any request that it receives later but with the lower sequence number. Then, again, who's to say that the second request (or some other notification like you suggest) will always get to the scheduler before the first request. Even with the timeout, the first one could still conceivably get there first. The bottleneck might not cause a 9+ minute delay but maybe just a long enough delay that clears about the same time the host reaches its timeout deadline or, for that matter, anytime during that minute and a half between the timeout and the sending of the next request. Of course, the next request might also happen to hit a similar bottleneck. I don't really know how they could reliably synchronize requests for every possible situation.
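
To make the "ignore the older request" idea concrete, here is a small sketch, assuming the scheduler simply remembers the highest sequence number it has accepted per host. `SeqnoGate` and its method are hypothetical names, and this is not how the current scheduler behaves.

```python
class SeqnoGate:
    """Hypothetical filter that drops requests arriving out of order."""

    def __init__(self):
        self.last_seen = {}  # hostid -> highest rpc_seqno accepted so far

    def accept(self, hostid, rpc_seqno):
        last = self.last_seen.get(hostid, -1)
        if rpc_seqno <= last:
            # A delayed, older request: ignore it instead of treating it
            # as evidence of a detach/reattach.
            return False
        self.last_seen[hostid] = rpc_seqno
        return True
```

This only covers the case where the newer request arrives first. If the delayed request happens to arrive before its replacement, both are accepted in order, so the synchronization problem described above remains.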
ID: 1696879 · Report as offensive
TBar
Volunteer tester

Send message
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1696880 - Posted: 30 Jun 2015, 2:32:58 UTC - in response to Message 1696879.  
Last modified: 30 Jun 2015, 2:39:02 UTC

Is there any way something simple could work? Such as having the client send a cc: to the Server when it times out a request? You know, when it logs a timeout on the host, send a copy to the Server informing the Server it is canceling the request.

In a sense, simply sending the next request should serve as that kind of notification, IF the higher rpc_seqno would cause the scheduler to ignore any request that it receives later but with the lower sequence number. Then, again, who's to say that the second request (or some other notification like you suggest) will always get to the scheduler before the first request. Even with the timeout, the first one could still conceivably get there first. The bottleneck might not cause a 9+ minute delay but maybe just a long enough delay that clears about the same time the host reaches its timeout deadline or, for that matter, anytime during that minute and a half between the timeout and the sending of the next request. Of course, the next request might also happen to hit a similar bottleneck. I don't really know how they could reliably synchronize requests for every possible situation.

Seems the only solution would be to establish contact with the Server and then make the request. If they want to be that particular about it, go the whole nine yards. That way people won't get punished over a simple delayed packet.

Here's a few you may have missed;
30 Jun 2015, 0:50:17 UTC - Abandoned
30 Jun 2015, 1:02:51 UTC - Abandoned
30 Jun 2015, 2:30:13 UTC - Abandoned
http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=71714
There's still a Lot of 30 Jun left to go.
ID: 1696880 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1696881 - Posted: 30 Jun 2015, 2:34:31 UTC - in response to Message 1696877.  

LOL! Yep, that explains it. It seems kind of mystifying in the second case for it to have to try to "locate" the host after the rpc sequence number check fails, when it's already succeeded in looking up the hostid and the user, and authenticating the request. Why in the world does it have to do all that additional scanning and then, only when it succeeds, trash the tasks in progress? Definite weirdness! I certainly hope you can convince "someone" to make some changes! ;^)


Yeah definitely oddball logic. The best I can fathom of the intent from the comments and code, is that the idea is to punish you for moving the client state to another host. I'd have to think if that's the case, then the collateral damage for legitimately scrambled rpc sequence is too high.

A less destructive choice, in my mind, is that they could use a cpid match AND a token match on other elements, but leave out local IP, as those can be dynamic and especially volatile under communication stresses (which may cause a scrambled rpc sequence number).

It'd be one thing if it failed one of those matches for "hostname, IP, processor and amount of RAM", but to abandon tasks when it was successful seems awfully strange.

I agree with you, too, about relying on the IP lookup as part of the validation. Personally, I don't use DHCP, and the static IPs I've assigned rarely change. (That host I tested with shows "same the last 1874 times".) But I could conceivably shuffle some IPs if I make a change, and DHCP would certainly seem like a crapshoot for those using it, especially when adding or deleting a device.
ID: 1696881 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1696885 - Posted: 30 Jun 2015, 3:02:01 UTC - in response to Message 1696881.  
Last modified: 30 Jun 2015, 3:12:32 UTC

Hmmm, yeah, definitely seems backwards. Perhaps it wasn't fully thought out.

In any case, the basic trigger assumes that a low rpc number, followed by a host match, means the user is juggling hosts/folders. As a reason to leave the cpid search out of this path, I think that's pretty thin logic.

If you transfer the data folder to an identical host [name it the same], adjust the local IP to the old one, and the key hardware is the same, why care ? Maybe it's assuming you copied the client state and forgot the rest of the data folder ?

[Even then, the sequence number would be fine...]
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1696885 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1696888 - Posted: 30 Jun 2015, 3:24:17 UTC - in response to Message 1696885.  

Hmmm, yeah, definitely seems backwards. Perhaps it wasn't fully thought out.

In any case, the basic trigger assumes that a low rpc number, followed by a host match, means the user is juggling hosts/folders. As a reason to leave the cpid search out of this path, I think that's pretty thin logic.

If you transfer the data folder to an identical host [name it the same], adjust the local IP to the old one, and the key hardware is the same, why care ? Maybe it's assuming you copied the client state and forgot the rest of the data folder ?

[Even then, the sequence number would be fine...]

It almost seems like that "make_new_host" logic was originally written for another purpose, then just co-opted later for use by the rpc_seqno checking. Are there other routines that perform, or "goto", that code? (BTW, I'm a retired dinosaur, and if you ever really want to try a brain-bender, take a crack at following the logic of an old COBOL program with ALTER statements in it, or the equivalent in ALC. AAAAARRRRGGGGHHH!)
ID: 1696888 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1696890 - Posted: 30 Jun 2015, 3:37:01 UTC - in response to Message 1696888.  

lol, yeah thankfully I skipped COBOL, and started with Fortran, Pascal and C well after various assembly languages. I'm not uncomfortable with branching etc, but fail to see why the higher level language features ( like functions, lol, inlined for efficiency if necessary) aren't used for this purpose.

I've known of two legitimate uses for goto statements from back in comp sci days. The first was in jumping out of a deeply nested parsing or lexing piece of code. The other is in memory management cleanup, but even that has been supplanted by switch/case statements lately. Definitely not generally used AFAIK in authentication logic, or anything else go/go-go like it.

Yeah, looks like a number of locations branch down to those, and after a successful retrieval there is a goto got_host (lol).

Could just all be signs of bandaid induced entropy. It is sometimes hard to know when to throw out semi-working code and replace it. That authentication procedure should be rethought and rewritten at a higher level (while keeping it working with existing clients). Not that big a deal when you see how I was able to list the steps relatively easily.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1696890 · Report as offensive
Profile Jeff Buck Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1696892 - Posted: 30 Jun 2015, 4:00:12 UTC - in response to Message 1696890.  

Could just all be signs of bandaid induced entropy.

I think that description pretty much fits any programs that have been around for more than, oh say, six months, especially if they have more than one person's fingerprints on them.

Well, I don't think there's anything more I can really offer here so, as I said before, I sure hope you can convince "someone" to make, or at least implement, some changes. ;^) Good luck!
ID: 1696892 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1696896 - Posted: 30 Jun 2015, 4:06:06 UTC - in response to Message 1696892.  
Last modified: 30 Jun 2015, 4:06:28 UTC

Yeah, I'll probably bounce it around for a while with the team, then eventually post something to boinc_dev. Not sure of the best solution yet, but I think we can all see that an out of sequence rpc alone isn't a good trigger for radical actions.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1696896 · Report as offensive
Profile Cactus Bob
Avatar

Send message
Joined: 19 May 99
Posts: 209
Credit: 10,924,287
RAC: 29
Canada
Message 1696918 - Posted: 30 Jun 2015, 6:03:02 UTC

In the late 80's my first programming experience was with GW Basic. I seem to remember using GOTO a LOT. If Then and GOTO line. I dabbled in hexi a tiny bit. Used windows basic for a couple months but never jumped to C. So I guess that makes me quaint..lol. More likely just an old codger who never mastered anything useful.

I did write a few useful programs in GW basic and even made several hundred bucks on a couple programs. One I remember was a scheduling program for employees at a convenience store.

Been too busy doing graphic design for the last 20 years to program anything. Oh well maybe now that I am "retired / divorced" I should look at the whole programming thing again.

Bob
Sometimes I wonder, what happened to all the people I gave directions to?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
SETI@home classic workunits 4,321
SETI@home classic CPU time 22,169 hours
ID: 1696918 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Send message
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1696935 - Posted: 30 Jun 2015, 7:21:17 UTC - in response to Message 1696877.  

Yeah definitely oddball logic. The best I can fathom of the intent from the comments and code, is that the idea is to punish you for moving the client state to another host. I'd have to think if that's the case, then the collateral damage for legitimately scrambled rpc sequence is too high.

I don't think there is any problem with moving client state (plus the rest of the data folder) to another machine - I've certainly done that after hardware failures, to clean up tasks in progress while I fix the hardware. And then I've moved it back again to its original hardware home when the repair is complete.

Where I can see that problems might occur, and some sort of problem resolution is needed in code, is when client state is copied to another machine, and both instances are left active. BOINC requires that HostIDs are unique (and that's probably a good idea): maybe the problem is that the server code is too good at finding and re-using old HostIDs - a copied state file is one situation where one of them really does need to be assigned a new identity.
ID: 1696935 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1696951 - Posted: 30 Jun 2015, 8:58:51 UTC - in response to Message 1696935.  

If that's the intention, then there are still options. Time out scheduler transactions (ensuring they are atomic and completely rolled back on failure) before the client rpc timeout interval expires. This ensures that, for the normal non-copied or non-moved case, both sides agree on the state.

Not that hard to do. Anything that makes a modification just gets queued and done at once, as a quick (and as narrow as possible) string of tasks, as opposed to an assortment of updates scattered amongst reads/lookups and processing.
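
A minimal sketch of the "queue the modifications, then apply them in one go" approach, with heavy assumptions: it uses Python's built-in `sqlite3` purely as a stand-in database (not what a BOINC project actually runs), and `run_transaction`, the `host` table and the deadline value are hypothetical.

```python
import sqlite3
import time


def run_transaction(conn, updates, deadline_s):
    """Apply queued (sql, params) updates as one transaction.

    Everything is rolled back if the deadline passes or a statement fails,
    so the stored state either reflects all the updates or none of them.
    """
    start = time.monotonic()
    try:
        with conn:  # commits on success, rolls back on any exception
            for sql, params in updates:
                if time.monotonic() - start > deadline_s:
                    raise TimeoutError("scheduler deadline exceeded")
                conn.execute(sql, params)
        return True
    except (sqlite3.Error, TimeoutError):
        return False
```

The intent is that `deadline_s` would be chosen shorter than the client's own rpc timeout, so that whenever the client gives up on a request, the server side has also discarded it rather than half-applying it.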
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1696951 · Report as offensive
Profile William
Volunteer tester
Avatar

Send message
Joined: 14 Feb 13
Posts: 2037
Credit: 17,689,662
RAC: 0
Message 1696953 - Posted: 30 Jun 2015, 9:09:48 UTC - in response to Message 1696935.  
Last modified: 30 Jun 2015, 9:11:01 UTC

Yeah definitely oddball logic. The best I can fathom of the intent from the comments and code, is that the idea is to punish you for moving the client state to another host. I'd have to think if that's the case, then the collateral damage for legitimately scrambled rpc sequence is too high.

I don't think there is any problem with moving client state (plus the rest of the data folder) to another machine - I've certainly done that after hardware failures, to clean up tasks in progress while I fix the hardware. And then I've moved it back again to its original hardware home when the repair is complete.

Where I can see that problems might occur, and some sort of problem resolution is needed in code, is when client state is copied to another machine, and both instances are left active. BOINC requires that HostIDs are unique (and that's probably a good idea): maybe the problem is that the server code is too good at finding and re-using old HostIDs - a copied state file is one situation where one of them really does need to be assigned a new identity.

Moving complete data folders around hosts isn't a problem. Then again, I run them CLI only. And I certainly didn't run the same folder (and with that hostID) on two different hosts at the same time.
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 1696953 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.