Suddenly BOINC Decides to Abandon 71 APs...WTH?
Author | Message |
---|---|
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Moving complete data folders around hosts isn't a problem. Then again I run them CLI only. And I certainly didn't run the same folder (and with that hostID) on two different hosts at the same time. Here we're concerned about a specific situation, where an initial RPC times out on the client but for unknown reasons takes its time getting to the server, then the client initiates another request (and it hadn't received success so hasn't incremented the RPC sequence number). Some (arbitrary) time later, whichever of the two requests completes first, the second will have a lower sequence number than current, and so triggers the logic discussed. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14655 Credit: 200,643,578 RAC: 874 |
Moving complete data folders around hosts isn't a problem. Then again I run them CLI only. And I certainly didn't run the same folder (and with that hostID) on two different hosts at the same time. I wonder if we need to pay attention to the sched_request files, as well as client_state. That's what's sent to the server, after all. It will have some subset of CS data, though exactly what's in/ex-cluded, I don't know. And I imagine I'd go cross-eyed trying to compare them. |
William Send message Joined: 14 Feb 13 Posts: 2037 Credit: 17,689,662 RAC: 0 |
Doesn't 'lower sequence number' trigger 'new host ID' ? A person who won't read has no advantage over one who can't read. (Mark Twain) |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Doesn't 'lower sequence number' trigger 'new host ID' ? Not quite straight away. There's a bunch of attempts to locate the original hostID first and let you through, then (in the second logic example I stepped out for Jeff Buck earlier) it decides that a lower RPC sequence number means keep the same host and abandon all the tasks. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14655 Credit: 200,643,578 RAC: 874 |
Doesn't 'lower sequence number' trigger 'new host ID' ? I remember that being a problem too, but I haven't seen a report about it for years. I wonder if the increasing thoroughness of the 'find existing host in the database' search (when was <host_cpid> introduced?) means that 'make new HostID (and bloat the database)' has effectively been replaced by 're-use old ID, but wipe the slate'? |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
I wonder if we need to pay attention to the sched_request files, as well as client_state. That's what's sent to the server, after all. It will have some subset of CS data, though exactly what's in/ex-cluded, I don't know. And I imagine I'd go cross-eyed trying to compare them. Yes, later, after mulling the logic over more, it'd be good to have the RPC sequence numbers grabbed from a willing test victim, for illustration/communication. We did illustrate the process to ourselves clearly enough, I feel, that for us what's broken is clear. What isn't clear is the complete set of options for solutions, without removing the intention. That needs more thought. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14655 Credit: 200,643,578 RAC: 874 |
I wonder if we need to pay attention to the sched_request files, as well as client_state. That's what's sent to the server, after all. It will have some subset of CS data, though exactly what's in/ex-cluded, I don't know. And I imagine I'd go cross-eyed trying to compare them. And remember no copy of sched_request is ever saved - a new one is (over-)written for each RPC. So we would want a host which isn't constantly needing to top up a cache. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Doesn't 'lower sequence number' trigger 'new host ID' ? There is a second codepath where the cpid isn't present either, that will result in a new hostID. As found earlier, this codepath will also result in a new HostID if you happen to get a different IP via your network DHCP server/gateway. In that (slightly different) case, a likely conjunction of network problems or a machine restart, after an RPC failure, will spawn a new hostID. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
And remember no copy of sched_request is ever saved - a new one is (over-)written for each RPC. So we would want a host which isn't constantly needing to top-up a cache. Yeah. In my experience, with BOINC involved, 'De Nile' is a river in Egypt also. Because he understands the triggers, I'd suggest we mount a thorough campaign involving Jeff Buck if possible, drafting a complete report on the issues. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
If that's the intention, then there are still options. Time out scheduler transactions (ensuring they are atomic and completely rolled back on failure) before the client RPC timeout interval expires. This ensures that, for the normal non-copied or non-moved case, both sides agree on the state. I would suggest if the Server is going to take punitive action on a Host over a Request which went Unacknowledged by the Server then force the Server to acknowledge the Request. This would ensure both sides are aware of the status of the Request. Having the Host continue after an assumed Request Failure doesn't seem to work very well. Have the Host contact the Server, make the Request, and then have the Server Acknowledge the Request before contact has ended. If the Server doesn't Acknowledge the Request Do Not Make another request. If someone can devise a method where the Host isn't operating on assumptions then fine, but don't penalize the host when the Server Fails to Acknowledge a Request. |
William Send message Joined: 14 Feb 13 Posts: 2037 Credit: 17,689,662 RAC: 0 |
The only case I can think of in the normal way of things, is if you (have to) restore the boinc data folder from a backup. In that case it makes sense to wipe the slate, especially since 'send lost tasks' is optional. A person who won't read has no advantage over one who can't read. (Mark Twain) |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14655 Credit: 200,643,578 RAC: 874 |
Doesn't 'lower sequence number' trigger 'new host ID' ? My machines are on DHCP for both internal (LAN IPs assigned by my router) and external (WAN IP assigned to router by ISP) IP addresses. Is it clear which is checked by code? (or are both?). Again, I have had spontaneous new HostIDs in the past - and now you come to mention it, I think I once saw a case here at SETI where a working machine appeared under a new HostID in my computer list, but then went back spontaneously to the old HostID and carried on working on current tasks. But I think that was years ago, when the lab had comms problems. I think I deleted the 'ghost hosts', but I'll check. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
It'll be the LAN IP as displayed on the host details, along with the hostname, CPU, and RAM. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
William Send message Joined: 14 Feb 13 Posts: 2037 Credit: 17,689,662 RAC: 0 |
Somebody complained about ghost hosts not too long ago. A person who won't read has no advantage over one who can't read. (Mark Twain) |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14655 Credit: 200,643,578 RAC: 874 |
If that's the intention, then there are still options. Time out scheduler transactions (ensuring they are atomic and completely rolled back on failure) before the client RPC timeout interval expires. This ensures that, for the normal non-copied or non-moved case, both sides agree on the state. That's pretty much what happens already. Request. Reply. You have to have some sort of timeout/'reset comms' fallback, otherwise every time a lightning strike hits a router, we'd all stop talking to the server for ever. The gap in the protocol is the client not sending an ack for the sched-reply: that's what leads to ghost tasks. Maybe the client should be sending ack (got your reply) or nack (I've given up listening - timeout), and the server should 'repeat last reply' on nack. What are the overheads in that? |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Somebody complained about ghost hosts not too long ago. Yeah, we're in that section of spaghetti code, just a different strand. See what you think of the code in the scheduler handle_request.cpp, function authenticate_user(). I get a giggle from the liberal use of goto statements, which is pretty unusual in C code. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
What are the overheads in that? Basically a fair amount of moving randomly spread record updates into a single queue for monolithic execution, preferably using a lock/flag during the single update, that isn't executed if the timer's expired (should be shorter than the client timeout value). "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14655 Credit: 200,643,578 RAC: 874 |
It'll be the LAN IP as displayed on the host details, along with the hostname, CPU, and RAM. The external IP address is shown on the host details page too, but doesn't appear in sched_request. BOINC must get it from Apache/nginx on receipt. |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
If that's the intention, then there are still options. Time out scheduler transactions (ensuring they are atomic and completely rolled back on failure) before the client RPC timeout interval expires. This ensures that, for the normal non-copied or non-moved case, both sides agree on the state. ...make the Request, and then have the Server Acknowledge the Request before contact has ended. If the Server doesn't Acknowledge the Request Do Not Make another NEW request, keep Hammering on the Old request. Fixed it for you. If the request is acknowledged by both parties I don't see any need for the client to reply with another acknowledgement. |
William Send message Joined: 14 Feb 13 Posts: 2037 Credit: 17,689,662 RAC: 0 |
I'm wondering if 'mark_results_over' is ever called... edit: ah found it... A person who won't read has no advantage over one who can't read. (Mark Twain) |
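
The 'repeat last reply' and 'keep hammering on the old request' ideas raised in the thread can be sketched as a toy model. This is a minimal Python illustration under stated assumptions, not BOINC's scheduler code: `ToyScheduler`, `handle`, and the task names are hypothetical, and it assumes a retried request carries the same sequence number as the request that timed out. A retry then gets the cached reply replayed, while a genuinely lower sequence number (for example a data folder restored from backup) still wipes the slate.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class ToyScheduler:
    """Hypothetical per-host server state; not BOINC's real schema."""
    last_seqno: int = 0
    last_reply: Optional[List[str]] = None
    next_task: int = 1
    assigned: Set[str] = field(default_factory=set)

    def handle(self, seqno: int, want: int) -> Tuple[str, List[str]]:
        # Same sequence number as the last processed request: assume the
        # client never saw our reply, so repeat it instead of punishing it.
        if seqno == self.last_seqno and self.last_reply is not None:
            return ("replayed", list(self.last_reply))

        # Lower sequence number: treat the client state as stale and
        # abandon everything the server thinks this host is holding.
        if seqno < self.last_seqno:
            abandoned = sorted(self.assigned)
            self.assigned.clear()
            self.last_seqno = seqno
            self.last_reply = None
            return ("abandoned", abandoned)

        # Normal case: hand out new work and remember the reply.
        tasks = [f"task{n}" for n in range(self.next_task, self.next_task + want)]
        self.next_task += want
        self.assigned.update(tasks)
        self.last_seqno = seqno
        self.last_reply = tasks
        return ("assigned", tasks)


server = ToyScheduler()
print(server.handle(seqno=1, want=2))  # first attempt; reply is lost in transit
print(server.handle(seqno=1, want=2))  # client retries the same request
print(server.handle(seqno=2, want=1))  # next request proceeds as usual
print(server.handle(seqno=1, want=1))  # stale state: slate is wiped
```

In this sketch the retry costs the server one cached reply per host and never duplicates or abandons work; whether that bookkeeping is acceptable on a real project database is exactly the overhead question asked above.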