Suddenly BOINC Decides to Abandon 71 APs...WTH?

jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1696957 - Posted: 30 Jun 2015, 9:16:31 UTC - in response to Message 1696953.  
Last modified: 30 Jun 2015, 9:17:04 UTC

Moving complete data folders around hosts isn't a problem. Then again, I run them CLI only. And I certainly didn't run the same folder (and with that hostID) on two different hosts at the same time.


Here we're concerned with a specific situation: an initial RPC times out on the client but, for unknown reasons, takes its time on the server, and the client then initiates another request (it hadn't received a success reply, so it hasn't incremented the RPC sequence number).

Some (arbitrary) time later, whichever of the two requests completes first, the second will carry a lower sequence number than the current one, and so trigger the logic discussed.
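
To make the ordering problem concrete, here is a minimal sketch in Python. It is not BOINC code. It assumes the server keeps the highest sequence number it has accepted for a host and flags any request that arrives carrying a lower one. `ServerHost`, `receive` and the numbers 41 and 42 are invented for illustration; which request carries which number in a real failure is exactly the detail that still needs confirming from a test host.

```python
from dataclasses import dataclass


@dataclass
class ServerHost:
    """Hypothetical server-side record for one host (not a BOINC structure)."""
    rpc_seqno: int  # highest sequence number accepted so far

    def receive(self, seqno: int) -> str:
        """Classify an incoming request by its sequence number."""
        if seqno < self.rpc_seqno:
            return "stale"  # the case that triggers the logic discussed
        self.rpc_seqno = seqno
        return "ok"


# Made-up numbers: two requests from the same host finish out of order.
host = ServerHost(rpc_seqno=41)
first = host.receive(42)   # the request that completes first
second = host.receive(41)  # the delayed one lands afterwards
print(first, second)
```

The point of the toy is only that the verdict depends on arrival order, not on anything the host did wrong.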
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1696957
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14655
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1696958 - Posted: 30 Jun 2015, 9:22:53 UTC - in response to Message 1696957.  

Moving complete data folders around hosts isn't a problem. Then again, I run them CLI only. And I certainly didn't run the same folder (and with that hostID) on two different hosts at the same time.

Here we're concerned with a specific situation: an initial RPC times out on the client but, for unknown reasons, takes its time on the server, and the client then initiates another request (it hadn't received a success reply, so it hasn't incremented the RPC sequence number).

Some (arbitrary) time later, whichever of the two requests completes first, the second will carry a lower sequence number than the current one, and so trigger the logic discussed.

I wonder if we need to pay attention to the sched_request files, as well as client_state. That's what's sent to the server, after all. It will have some subset of CS data, though exactly what's in/ex-cluded, I don't know. And I imagine I'd go cross-eyed trying to compare them.
ID: 1696958
William
Volunteer tester
Joined: 14 Feb 13
Posts: 2037
Credit: 17,689,662
RAC: 0
Message 1696959 - Posted: 30 Jun 2015, 9:23:07 UTC

Doesn't 'lower sequence number' trigger 'new host ID' ?
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 1696959
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1696961 - Posted: 30 Jun 2015, 9:27:15 UTC - in response to Message 1696959.  

Doesn't 'lower sequence number' trigger 'new host ID' ?


Not quite straight away. There are a bunch of attempts to locate the original hostID first and let you through; then (in the second logic example I stepped out for Jeff Buck earlier) it decides that a lower RPC sequence number means keep the same host and abandon all the tasks.
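
As a rough illustration of that flow, here is a sketch in Python rather than the scheduler's own code. The function name, the dictionary it returns and the host ID are all hypothetical; the three outcomes are only the ones described in this thread (let the request through, re-use the host but abandon its tasks, or create a new host).

```python
from typing import Optional


def handle_request(request_seqno: int, stored_seqno: int,
                   matched_host_id: Optional[int]) -> dict:
    """Return a hypothetical action for one scheduler request.

    matched_host_id stands for the outcome of the attempts to locate the
    original hostID: an existing ID, or None if nothing matched.
    """
    if request_seqno >= stored_seqno:
        return {"action": "proceed", "abandon_tasks": False}
    if matched_host_id is not None:
        # Lower sequence number, but the host was found again:
        # keep the same host ID and abandon its tasks.
        return {"action": "reuse_host", "host_id": matched_host_id,
                "abandon_tasks": True}
    # Nothing matched: fall through to a brand-new host record.
    return {"action": "new_host", "abandon_tasks": False}


# Made-up values for illustration only.
print(handle_request(41, 42, matched_host_id=7654321))
```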
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1696961
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14655
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1696962 - Posted: 30 Jun 2015, 9:27:55 UTC - in response to Message 1696959.  

Doesn't 'lower sequence number' trigger 'new host ID' ?

I remember that being a problem too, but I haven't seen a report about it for years. I wonder if the increasing thoroughness of the 'find existing host in the database' search (when was <host_cpid> introduced?) means that 'make new HostID (and bloat the database)' has effectively been replaced by 're-use old ID, but wipe the slate'?
ID: 1696962
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1696963 - Posted: 30 Jun 2015, 9:30:12 UTC - in response to Message 1696958.  

I wonder if we need to pay attention to the sched_request files, as well as client_state. That's what's sent to the server, after all. It will have some subset of CS data, though exactly what's in/ex-cluded, I don't know. And I imagine I'd go cross-eyed trying to compare them.


Yes, later, after mulling the logic over more, it'd be good to have the RPC sequence numbers grabbed from a willing test victim, for illustration/communication. We did illustrate the process to ourselves clearly enough, I feel, that for us what's broken is clear. What isn't clear is the complete set of options for solutions, without removing the intention. That needs more thought.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1696963
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14655
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1696964 - Posted: 30 Jun 2015, 9:33:02 UTC - in response to Message 1696963.  

I wonder if we need to pay attention to the sched_request files, as well as client_state. That's what's sent to the server, after all. It will have some subset of CS data, though exactly what's in/ex-cluded, I don't know. And I imagine I'd go cross-eyed trying to compare them.

Yes, later, after mulling the logic over more, it'd be good to have the RPC sequence numbers grabbed from a willing test victim, for illustration/communication. We did illustrate the process to ourselves clearly enough, I feel, that for us what's broken is clear. What isn't clear is the complete set of options for solutions, without removing the intention. That needs more thought.

And remember no copy of sched_request is ever saved - a new one is (over-)written for each RPC. So we would want a host which isn't constantly needing to top-up a cache.
ID: 1696964
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1696965 - Posted: 30 Jun 2015, 9:33:13 UTC - in response to Message 1696962.  

Doesn't 'lower sequence number' trigger 'new host ID' ?

I remember that being a problem too, but I haven't seen a report about it for years. I wonder if the increasing thoroughness of the 'find existing host in the database' search (when was <host_cpid> introduced?) means that 'make new HostID (and bloat the database)' has effectively been replaced by 're-use old ID, but wipe the slate'?


There is a second codepath, where the cpid isn't present either, that will result in a new hostID. As found earlier, this codepath will also result in a new hostID if you happen to get a different IP via your network DHCP server/gateway. In that (slightly different) case, a likely conjunction of network problems or a machine restart after an RPC failure will spawn a new hostID.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1696965
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1696966 - Posted: 30 Jun 2015, 9:36:03 UTC - in response to Message 1696964.  

And remember no copy of sched_request is ever saved - a new one is (over-)written for each RPC. So we would want a host which isn't constantly needing to top-up a cache.


Yeah. In my experience, with BOINC involved, 'De Nile' is a river in Egypt also. Because he understands the triggers, I'd suggest we mount a thorough campaign, involving Jeff Buck if possible, drafting a complete report on the issues.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1696966
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1696967 - Posted: 30 Jun 2015, 9:38:05 UTC - in response to Message 1696951.  

If that's the intention, then there are still options. Time out scheduler transactions (ensuring they are atomic and completely rolled back on failure) before the client RPC timeout interval expires. This ensures that, for the normal non-copied, non-moved case, both sides agree on the state.

Not that hard to do. Anything that makes a modification just gets queued and done at once, as a quick (and as narrow as possible) string of tasks, as opposed to an assortment of updates scattered amongst reads/lookups and processing.

I would suggest if the Server is going to take punitive action on a Host over a Request which went Unacknowledged by the Server then force the Server to acknowledge the Request. This would ensure both sides are aware of the status of the Request. Having the Host continue after an assumed Request Failure doesn't seem to work very well. Have the Host contact the Server, make the Request, and then have the Server Acknowledge the Request before contact has ended. If the Server doesn't Acknowledge the Request Do Not Make another request. If someone can devise a method where the Host isn't operating on assumptions then fine, but don't penalize the host when the Server Fails to Acknowledge a Request.
ID: 1696967
William
Volunteer tester
Joined: 14 Feb 13
Posts: 2037
Credit: 17,689,662
RAC: 0
Message 1696968 - Posted: 30 Jun 2015, 9:41:03 UTC

The only case I can think of in the normal way of things, is if you (have to) restore the boinc data folder from a backup.
In that case it makes sense to wipe the slate, especially since 'send lost tasks' is optional.
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 1696968
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14655
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1696969 - Posted: 30 Jun 2015, 9:41:32 UTC - in response to Message 1696965.  

Doesn't 'lower sequence number' trigger 'new host ID' ?

I remember that being a problem too, but I haven't seen a report about it for years. I wonder if the increasing thoroughness of the 'find existing host in the database' search (when was <host_cpid> introduced?) means that 'make new HostID (and bloat the database)' has effectively been replaced by 're-use old ID, but wipe the slate'?

There is a second codepath, where the cpid isn't present either, that will result in a new hostID. As found earlier, this codepath will also result in a new hostID if you happen to get a different IP via your network DHCP server/gateway. In that (slightly different) case, a likely conjunction of network problems or a machine restart after an RPC failure will spawn a new hostID.

My machines are on DHCP for both internal (LAN IPs assigned by my router) and external (WAN IP assigned to router by ISP) IP addresses. Is it clear which is checked by code? (or are both?). Again, I have had spontaneous new HostIDs in the past - and now you come to mention it, I think I once saw a case here at SETI where a working machine appeared under a new HostID in my computer list, but then went back spontaneously to the old HostID and carried on working on current tasks. But I think that was years ago, when the lab had comms problems.

I think I deleted the 'ghost hosts', but I'll check.
ID: 1696969
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1696970 - Posted: 30 Jun 2015, 9:43:49 UTC - in response to Message 1696969.  

It'll be the LAN IP as displayed on the host details, along with the hostname, CPU, and RAM.
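
A sketch of how such a match could behave, in Python and under loud assumptions: `HostInfo`, `find_existing` and every value below are hypothetical, and the real lookup may compare more or fewer fields. It tries the cpid first and otherwise falls back to hostname, LAN IP, CPU and RAM, which is enough to show why a fresh DHCP lease can end in a new hostID when the cpid is missing.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class HostInfo:
    """Hypothetical subset of what a host reports about itself."""
    host_cpid: Optional[str]
    hostname: str
    lan_ip: str
    cpu: str
    ram_bytes: int


def find_existing(req: HostInfo, known: Dict[int, HostInfo]) -> Optional[int]:
    """Return the ID of a matching known host, or None (meaning: new hostID)."""
    # First try the cross-project ID, if the request carries one.
    if req.host_cpid is not None:
        for host_id, h in known.items():
            if h.host_cpid == req.host_cpid:
                return host_id
    # Otherwise fall back to comparing the descriptive fields.
    wanted = (req.hostname, req.lan_ip, req.cpu, req.ram_bytes)
    for host_id, h in known.items():
        if (h.hostname, h.lan_ip, h.cpu, h.ram_bytes) == wanted:
            return host_id
    return None


# Invented host: same machine, no cpid in the request, before and after a new lease.
known = {1001: HostInfo("abc123", "cruncher", "192.168.1.20", "Example CPU", 8 * 2**30)}
same_ip = HostInfo(None, "cruncher", "192.168.1.20", "Example CPU", 8 * 2**30)
new_ip = HostInfo(None, "cruncher", "192.168.1.37", "Example CPU", 8 * 2**30)
print(find_existing(same_ip, known), find_existing(new_ip, known))
```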
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1696970
William
Volunteer tester
Joined: 14 Feb 13
Posts: 2037
Credit: 17,689,662
RAC: 0
Message 1696972 - Posted: 30 Jun 2015, 9:46:10 UTC

Somebody complained about ghost hosts not too long ago.
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 1696972
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14655
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1696975 - Posted: 30 Jun 2015, 9:48:16 UTC - in response to Message 1696967.  

If that's the intention, then there are still options. Time out scheduler transactions (ensuring they are atomic and completely rolled back on failure) before the client RPC timeout interval expires. This ensures that, for the normal non-copied, non-moved case, both sides agree on the state.

Not that hard to do. Anything that makes a modification just gets queued and done at once, as a quick (and as narrow as possible) string of tasks, as opposed to an assortment of updates scattered amongst reads/lookups and processing.

I would suggest if the Server is going to take punitive action on a Host over a Request which went Unacknowledged by the Server then force the Server to acknowledge the Request. This would ensure both sides are aware of the status of the Request. Having the Host continue after an assumed Request Failure doesn't seem to work very well. Have the Host contact the Server, make the Request, and then have the Server Acknowledge the Request before contact has ended. If the Server doesn't Acknowledge the Request Do Not Make another request. If someone can devise a method where the Host isn't operating on assumptions then fine, but don't penalize the host when the Server Fails to Acknowledge a Request.

That's pretty much what happens already. Request. Reply. You have to have some sort of timeout/'reset comms' fallback, otherwise every time a lightning strike hits a router, we'd all stop talking to the server for ever.

The gap in the protocol is the client not sending an ack for the sched-reply: that's what leads to ghost tasks. Maybe the client should be sending ack (got your reply) or nack (I've given up listening - timeout), and the server should 'repeat last reply' on nack. What are the overheads in that?
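
A toy version of that ack/nack idea, in Python, just to see what state it needs. `Scheduler`, its methods and the task numbers are hypothetical and nothing here reflects how the real scheduler stores replies. The server keeps the last reply per host until it is acknowledged and repeats it on nack instead of issuing new work.

```python
class Scheduler:
    """Hypothetical server that remembers the last unacknowledged reply per host."""

    def __init__(self):
        self.last_reply = {}  # host_id -> reply still awaiting an ack
        self.next_task = 1

    def request(self, host_id):
        """Assign one new task and remember the reply until it is acknowledged."""
        reply = {"tasks": [self.next_task]}
        self.next_task += 1
        self.last_reply[host_id] = reply
        return reply

    def ack(self, host_id):
        """Client got the reply: the stored copy can be dropped."""
        self.last_reply.pop(host_id, None)

    def nack(self, host_id):
        """Client gave up listening: repeat the stored reply, assign nothing new."""
        return self.last_reply.get(host_id)


server = Scheduler()
lost = server.request(host_id=1)  # pretend this reply never reached the client
resent = server.nack(host_id=1)   # client timed out and says so
server.ack(host_id=1)             # the repeat arrived this time
print(resent == lost, server.next_task)
```

In this toy model the overhead is one stored reply per host with an outstanding request, plus one extra short message per RPC.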
ID: 1696975
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1696976 - Posted: 30 Jun 2015, 9:50:19 UTC - in response to Message 1696972.  

Somebody complained about ghost hosts not too long ago.


Yeah, we're in that section of spaghetti code, just a different strand. See what you think of the code in the scheduler handle_request.cpp, function authenticate_user(). I get a giggle from the liberal use of goto statements, which is pretty unusual in C code.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1696976
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1696977 - Posted: 30 Jun 2015, 9:53:54 UTC - in response to Message 1696975.  

What are the overheads in that?


Basically a fair amount of moving randomly spread record updates into a single queue for monolithic execution, preferably using a lock/flag during the single update, which isn't executed if the timer's expired (the timer should be shorter than the client timeout value).
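
Roughly what I mean, as a Python sketch rather than scheduler code. `UpdateBatch`, the record keys and the 5 second deadline are made up for illustration, and the clock is injected so the expiry can be shown without waiting. Updates are queued while the request is processed and then written in one locked step, or dropped entirely if the server-side timer has already run out.

```python
import threading
import time


class UpdateBatch:
    """Collect record updates and apply them in one step, or not at all."""

    def __init__(self, deadline_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + deadline_s  # meant to be shorter than the client timeout
        self._pending = []
        self._lock = threading.Lock()

    def queue(self, key, value):
        """Remember an update without touching the records yet."""
        self._pending.append((key, value))

    def commit(self, records):
        """Apply every queued update unless the timer has expired."""
        with self._lock:
            if self._clock() >= self._expires_at:
                self._pending.clear()  # nothing was written, so nothing to undo
                return False
            records.update(self._pending)  # the single, narrow write step
            self._pending.clear()
            return True


# Fake clock and invented record values for illustration.
now = [0.0]
records = {"rpc_seqno": 41, "tasks_in_progress": 71}
batch = UpdateBatch(deadline_s=5.0, clock=lambda: now[0])
batch.queue("rpc_seqno", 42)
batch.queue("tasks_in_progress", 72)
now[0] = 6.0  # processing overran the deadline
late = batch.commit(records)
print(late, records)
```

Because nothing is written before `commit`, a timed-out request leaves the records exactly as the client last saw them.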
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1696977
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14655
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1696978 - Posted: 30 Jun 2015, 9:57:17 UTC - in response to Message 1696970.  

It'll be the LAN IP as displayed on the host details, along with the hostname, CPU, and RAM.

The external IP address is shown on the host details page too, but doesn't appear in sched_request. BOINC must get it from Apache/nginx on receipt.
ID: 1696978
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1696979 - Posted: 30 Jun 2015, 9:58:20 UTC - in response to Message 1696975.  

If that's the intention, then there are still options. Time out scheduler transactions (ensuring they are atomic and completely rolled back on failure) before the client RPC timeout interval expires. This ensures that, for the normal non-copied, non-moved case, both sides agree on the state.

Not that hard to do. Anything that makes a modification just gets queued and done at once, as a quick (and as narrow as possible) string of tasks, as opposed to an assortment of updates scattered amongst reads/lookups and processing.

I would suggest if the Server is going to take punitive action on a Host over a Request which went Unacknowledged by the Server then force the Server to acknowledge the Request. This would ensure both sides are aware of the status of the Request. Having the Host continue after an assumed Request Failure doesn't seem to work very well. Have the Host contact the Server, make the Request, and then have the Server Acknowledge the Request before contact has ended. If the Server doesn't Acknowledge the Request Do Not Make another request. If someone can devise a method where the Host isn't operating on assumptions then fine, but don't penalize the host when the Server Fails to Acknowledge a Request.

That's pretty much what happens already. Request. Reply. You have to have some sort of timeout/'reset comms' fallback, otherwise every time a lightning strike hits a router, we'd all stop talking to the server for ever.

The gap in the protocol is the client not sending an ack for the sched-reply: that's what leads to ghost tasks. Maybe the client should be sending ack (got your reply) or nack (I've given up listening - timeout), and the server should 'repeat last reply' on nack. What are the overheads in that?

...make the Request, and then have the Server Acknowledge the Request before contact has ended. If the Server doesn't Acknowledge the Request Do Not Make another NEW request, keep Hammering on the Old request.
Fixed it for you.
If the request is acknowledged by both parties I don't see any need for the client to reply with another acknowledgement.
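
Something like the following is what I have in mind, sketched in Python with made-up names. `send_until_acknowledged`, the `send` callback and the request contents are hypothetical, and the attempt limit is my own assumption so the host does not hammer for ever. The same request object is resent each time; no new request is built while the old one is unacknowledged.

```python
def send_until_acknowledged(send, request, max_attempts=5):
    """Resend the same request until the server acknowledges it.

    send(request) is a hypothetical transport call returning True on
    acknowledgement. Returns the attempt number that succeeded, or None.
    """
    for attempt in range(1, max_attempts + 1):
        if send(request):
            return attempt
    return None  # still unacknowledged, and no new request was created


# Fake server that drops the first two attempts.
outcomes = iter([False, False, True])
seen = []


def flaky_send(request):
    seen.append(request)
    return next(outcomes)


request = {"seqno": 42}  # invented value
attempts = send_until_acknowledged(flaky_send, request)
print(attempts, len(seen))
```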
ID: 1696979
William
Volunteer tester
Joined: 14 Feb 13
Posts: 2037
Credit: 17,689,662
RAC: 0
Message 1696980 - Posted: 30 Jun 2015, 9:59:09 UTC
Last modified: 30 Jun 2015, 10:03:30 UTC

I'm wondering if 'mark_results_over' is ever called...

edit: ah found it...
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 1696980


 