Suddenly BOINC Decides to Abandon 71 APs...WTH?

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1695744 - Posted: 25 Jun 2015, 23:40:52 UTC - in response to Message 1695730. I find the stress on 'when a host has detached' interesting. The mind naturally wanders to the old spontaneous detach problem. I think the crucial (but probably false) word is 'evidence'. True 'Evidence' is a fantastic thing to call on when you have control of proceedings. In the common cases you will need to either defer to the provided evidence or provide your own, which still doesn't attest to the truth unless you are god. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1695744 ·

TBar Volunteer tester Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 1695750 - Posted: 26 Jun 2015, 0:13:03 UTC - in response to Message 1695741. Last modified: 26 Jun 2015, 0:58:22 UTC ...We do have one member of the 'team' who has access to server logs, and I caught his attention with a possibly related case a couple of weeks ago: Immediate timeout? Missing deadline?. But so far, as you can see, no diagnosis or resolution. To me that appears to be a different problem, something occurring with the deadline. The problem I had occurred when after a Timed Out scheduler contact the Server decided My host had been detached and trashed All of my tasks. Possibly something wrong with the method the Server uses to determine if a host has detached. Maybe the Server should wait for another contact before making such a rash decision? !!!!!!!!!!!!!!!!!!!!!!!!!!!!! Yep, he just got Whacked again, 26 Jun 2015, 0:47:27 UTC Abandoned I hate it when that happens... ID: 1695750 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1695790 - Posted: 26 Jun 2015, 4:33:59 UTC - in response to Message 1695750. Last modified: 26 Jun 2015, 4:34:47 UTC Yeah Richard's right in that we need evidence really, but we'd need some theories to throw in the ring first. Deadline issues sounds like a theory, which gels with my wacky estimate one, and spontaneous detach is something we've seen before. Not sure they are connected to one another, though I suppose that doesn't mean they can't be related underneath, or that a host couldn't manifest multiple symptoms at the same time. I wonder how I would go about inducing a spontaneous detach ? "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1695790 ·

TBar Volunteer tester Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 1695797 - Posted: 26 Jun 2015, 5:25:46 UTC - in response to Message 1695790. A while back when this was being discussed it was found that the tasks were Abandoned after the following two events:
1) Scheduler request failed: Timeout was reached
[sched_op] Deferring communication for 00:01:25
[sched_op] Reason: Scheduler request failed
2) Not sending work - last request too recent: 149 sec
There is never any record of the 'Ghost' request that caused the other request to be too recent. It was theorized that the Timed Out request arrived late and out of sequence, causing BOINC to Abandon the tasks. A solution would be to find out what is contained in that 'Ghost' request and Bar BOINC from Abandoning tasks based on that type of 'Ghost' whatever. You might find out what is in that 'Ghost' whatever by looking at the Server Log of that host on Beta. He will be suffering the Event in a few more hours if the pattern continues. You need to work fast though, I doubt the owner is aware that his machine has been spending most of its time working Abandoned, Worthless Tasks and will probably be angry when he discovers all the waste. He might just stop working any tasks... ID: 1695797 ·

Cosmic_Ocean Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13	Message 1695813 - Posted: 26 Jun 2015, 5:55:27 UTC Something is standing out to me from the first post in this thread. Look at the timestamps for the chain of events.
14:26:47 Request 1
14:32:01 Reply 1 timed-out (5m 14s after above, backing off for 1m 25s)
14:33:31 Request 2 (1m 30s after above)
14:33:34 Reply 2 completed (3s after above)
14:36:13 server marked a pile of tasks as abandoned.
14:38:40 Request 3 (5m 6s after Reply 2)
14:38:42 Reply 3 not sending work, last request was 149s ago (????) (2s after above)
14:43:47 Request 4 (5m 5s after above)
14:43:49 Reply 4 (2s after above)
Best I can gather.. that first request that timed-out for whatever reason.. BOINC gave up on it, but the request ended up finally being processed by the server 9m 34s after it originated. *AND* there was a complete request that happened between request 1 and everything being abandoned. The only thing that seems to make sense is that the scheduler received a request that had an out-of-sequence counter/timestamp in it (because it was older), so it just defaults to purging everything because obviously something must be very wrong. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving-up) ID: 1695813 ·
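As a rough cross-check of the timeline in the post above, the arithmetic can be redone mechanically. This is only a sketch: it assumes the client and server clocks agree, and that the "149 sec" figure is measured at the moment Reply 3 was generated. The helper names `t` and `fmt` are made up for illustration.

```python
from datetime import datetime, timedelta

def t(hms: str) -> datetime:
    """Parse an HH:MM:SS clock time (the date does not matter for same-day gaps)."""
    return datetime.strptime(hms, "%H:%M:%S")

def fmt(delta: timedelta) -> str:
    """Render a timedelta as 'Xm Ys'."""
    total = int(delta.total_seconds())
    return f"{total // 60}m {total % 60}s"

request_1 = t("14:26:47")   # request that timed out on the client
abandoned = t("14:36:13")   # tasks marked abandoned on the server
reply_3 = t("14:38:42")     # "last request too recent: 149 sec"

# Work backwards from Reply 3 to the contact the server thinks it saw.
ghost_contact = reply_3 - timedelta(seconds=149)

print("implied server-side contact:", ghost_contact.strftime("%H:%M:%S"))
print("gap from Request 1 to abandonment:", fmt(abandoned - request_1))
```

By the timestamps as listed, subtracting 149 seconds from Reply 3 lands exactly on 14:36:13, the moment the tasks were marked abandoned, and the gap from Request 1 to that moment comes out at 9m 26s.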

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1695820 - Posted: 26 Jun 2015, 6:00:28 UTC - in response to Message 1695797. True. That host is a beast, and I'd be pretty disappointed munching away for nothing also. If multiplied across enough hosts, Could something going on with the server logic here explain at least partially the radically increased server load when resend lost tasks is enabled ? "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1695820 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1695822 - Posted: 26 Jun 2015, 6:03:16 UTC - in response to Message 1695813. Last modified: 26 Jun 2015, 6:04:42 UTC The only thing that seems to make sense is that the scheduler received a request that had an out-of-sequence counter/timestamp in it (because it was older), so it just defaults to purging everything because obviously something must be very wrong. As resend lost tasks is disabled, does the client [on a later request] at least get something to indicate the tasks should be aborted ? [Edit: I imagine not, might take a stroll through scheduler logic a bit later, when beer stocks are in] "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1695822 ·

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51540 Credit: 1,018,363,574 RAC: 1,004	Message 1695836 - Posted: 26 Jun 2015, 6:28:25 UTC Ahh,,, the WTF syndrome.... Kitties, come hither...and witness this. Meow? Meow? Meow? Pay attention, kitties, this could happen to YOU. "Time is simply the mechanism that keeps everything from happening all at once." ID: 1695836 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874	Message 1695893 - Posted: 26 Jun 2015, 8:17:27 UTC - in response to Message 1695790. Yeah Richard's right in that we need evidence really, but we'd need some theories to throw in the ring first. Deadline issues sounds like a theory, which gels with my wacky estimate one, and spontaneous detach is something we've seen before. Not sure they are connected to one another, though I suppose that doesn't mean they can't be related underneath, or that a host couldn't manifest multiple symptoms at the same time. I wonder how I would go about inducing a spontaneous detach ? My logic in raising (and linking) the Einstein deadline issue was something like: Einstein don't do variable deadlines. So something has changed the deadline - changed the database record - after the task was first allocated. Einstein also run very much older server code. So, perhaps there's a common cause - something like the out-of-sequence RPC processing, which I think Cosmic_Ocean has analysed very clearly - which causes the server to modify the task records. But the different server generations perform different actions - we abandon, they re-deadline. But unfortunately for that theory, mark_results_over() was added to the server code on 6 June 2006 (SVN 10258), which is probably too long ago. ID: 1695893 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874	Message 1695897 - Posted: 26 Jun 2015, 8:34:32 UTC I had a PM about a first-hand observation of the 'abandon' phenomenon overnight. I think the contents were intended as a contribution to the bug-hunt - the PM was just because the writer was too busy to be drawn into the extended conversation (as was said in the bit I'm not publishing). I had a couple of these problems, about a year ago. My son had installed Steam as required by a certain game. We concluded the internet connection problem to Seti was caused by the Steam player to player comms sub-program. We stopped Steam loading by default, so that it was only used when playing the game, which we set to stop BOINC in the "exclusive programs" tab. It's not happened since, even though some here were not convinced that Steam was the problem. ID: 1695897 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1695899 - Posted: 26 Jun 2015, 8:45:44 UTC - in response to Message 1695897. Last modified: 26 Jun 2015, 8:47:30 UTC In that case I'd have a good look at what version(s) of curl or c-ares are being used for the communications [and how they are being used...] . As it updates frequently I'd suspect steam would be using a fairly new version (of those or something similar), and there may be compatibility issues between the new and the old running at the same time. So it wouldn't necessarily be only steam that could raise the symptoms. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1695899 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874	Message 1695903 - Posted: 26 Jun 2015, 8:55:59 UTC - in response to Message 1695899. In that case I'd have a good look at what version(s) of curl or c-ares are being used for the communications [and how they are being used...] . As it updates frequently I'd suspect steam would be using a fairly new version (of those or something similar), and there may be compatibility issues between the new and the old running at the same time. So it wouldn't necessarily be only steam that could raise the symptoms. And since curl/c-ares are distributed pre-compiled within BOINC, that brings into play "how often do people update their BOINC clients?" - in other words, is there any correlation (either way) between abandonment and the version of BOINC in use, either older or newer. ID: 1695903 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1695937 - Posted: 26 Jun 2015, 11:25:22 UTC - in response to Message 1695903. Yes that came to mind, though the linked Einstein machine said I think 7.4.42 or 7.2.42 ? Well the development world (including mine) is changing in many ways to address the need for continual improvement right through to the user. Whether Boinc decides to wait to adopt the methods, or stick with the "download major traumatic update" route probably isn't on their agenda. It'll be interesting to see if they'll be able to keep up with the OSes they're targeting... "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1695937 ·

TBar Volunteer tester Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 1696004 - Posted: 26 Jun 2015, 16:50:21 UTC Last modified: 26 Jun 2015, 16:58:20 UTC The old Steam case. That sounds good until you discover others are not using Steam, or even playing games on their computer for that matter. Same with the WiFi, sounds good until someone using Fiber Optics has the same problem. BTW, with FIOS the TV uses the router and will not work correctly if there is something wrong with the router, not to mention the other 3 computers using the same router. Then you find it happens with Old versions of BOINC as well as New. I think it would be more constructive to determine why BOINC suddenly, minutes after a successful contact, determines a Host has detached even though No Detach signals have been sent by the Host. Instead of Immediately trashing All work perhaps BOINC can wait a few seconds and conduct another contact? It would also be nice if a notification could be sent to the Host if All the tasks are Trashed. Currently if the Owner doesn't look at the Host Results Webpage he has No Clue All the tasks have been trashed, the event log just continues as normal as if all is well. Not Nice. BOINC certainly doesn't have any trouble sending meaningless notifications such as this; SETI@home Beta Test: Notice from BOINC Your app_config.xml file refers to an unknown application 'astropulse_v7'. Known applications: 'setiathome_v7' Friday, June 26, 2015 'AMt' 02:08:03 AM You would think the fact your computer is wasting resources on worthless tasks would be higher on the priority list. ID: 1696004 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874	Message 1696012 - Posted: 26 Jun 2015, 17:56:44 UTC - in response to Message 1696004. I think it would be more constructive to determine why BOINC suddenly, minutes after a successful contact, determines a Host has detached even though No Detach signals have been sent by the Host. Yes. That's what we're trying to do. Do note that if you actively, deliberately, detach a host - no signal is sent. Try it sometime - any tasks in progress will just sit there until their deadline. As far as the server is concerned, the host has just been switched off for an extended holiday. Of course, if you re-attach the host, the server does get a signal, and reacts accordingly. Maybe it sends you the old tasks back again (but not if 'resend lost results' is turned off, as it usually seems to be these days). Or it might send you new tasks, and leave the old ones to meet their deadline. But I don't think that anyone has ever reported seeing the tasks marked as 'abandoned' when they have consciously and deliberately clicked the 'detach' button. Hence my emphasis on the word 'evidence' in my reply to Jason about the source code comment. The server is actioning a different sort of evidence-based detach from the normal, silent one that occurs when the user clicks the detach button. So, what is that evidence, and what causes it to appear without deliberate action by the user? ID: 1696012 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1696013 - Posted: 26 Jun 2015, 18:19:42 UTC - in response to Message 1696004. Last modified: 26 Jun 2015, 18:20:27 UTC Yes, to clarify the procedure it's one of 'differential diagnosis', whereby you throw as many possibilities forward that could explain the symptoms, and discount them until you are either left with one standing, or none (which means you missed something or need to try another method altogether). The possible obscure connection of a detach and marking tasks as abandoned could be a clue, because I'd expect those two situations to occur nearby in few places, so can look. There's still the possibility that it's a conjunction of two or more unrelated issues, as opposed to one condition, but eliminating the easier ones first narrows the field. The 'Steam' one I would call a symptom rather than a cause, and remains on the board. If steam were doing that on one particular host or many isn't really the point so much as that other programs happily use the network alongside steam, so why not Boinc? IMO more than likely just one situation that triggers underlying problems with Boinc client &/or server, rather than Steam's fault. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1696013 ·

TBar Volunteer tester Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768	Message 1696016 - Posted: 26 Jun 2015, 18:32:18 UTC - in response to Message 1696012. I think it would be more constructive to determine why BOINC suddenly, minutes after a successful contact, determines a Host has detached even though No Detach signals have been sent by the Host. Yes. That's what we're trying to do. Do note that if you actively, deliberately, detach a host - no signal is sent. Try it sometime - any tasks in progress will just sit there until their deadline. As far as the server is concerned, the host has just been switched off for an extended holiday. Of course, if you re-attach the host, the server does get a signal, and reacts accordingly. Maybe it sends you the old tasks back again (but not if 'resend lost results' is turned off, as it usually seems to be these days). Or it might send you new tasks, and leave the old ones to meet their deadline. But I don't think that anyone has ever reported seeing the tasks marked as 'abandoned' when they have consciously and deliberately clicked the 'detach' button. Hence my emphasis on the word 'evidence' in my reply to Jason about the source code comment. The server is actioning a different sort of evidence-based detach from the normal, silent one that occurs when the user clicks the detach button. So, what is that evidence, and what causes it to appear without deliberate action by the user? Maybe you can rephrase the question. 
To me the evidence is the Server saying it received a request that doesn't appear in the Host log;
Thu Jun 25 14:32:01 2015 | SETI@home | Scheduler request failed: Timeout was reached
Thu Jun 25 14:32:01 2015 | SETI@home | [sched_op] Deferring communication for 00:01:25
Thu Jun 25 14:32:01 2015 | SETI@home | [sched_op] Reason: Scheduler request failed
Thu Jun 25 14:33:31 2015 | SETI@home | [sched_op] Starting scheduler request
Thu Jun 25 14:33:31 2015 | SETI@home | Sending scheduler request: To report completed tasks.
Thu Jun 25 14:33:31 2015 | SETI@home | Reporting 1 completed tasks
Thu Jun 25 14:33:31 2015 | SETI@home | Requesting new tasks for AMD/ATI GPU
Thu Jun 25 14:33:31 2015 | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
Thu Jun 25 14:33:31 2015 | SETI@home | [sched_op] AMD/ATI GPU work request: 660690.66 seconds; 0.00 devices
Thu Jun 25 14:33:34 2015 | SETI@home | Scheduler request completed: got 0 new tasks
Thu Jun 25 14:33:34 2015 | SETI@home | [sched_op] Server version 707
Thu Jun 25 14:33:34 2015 | SETI@home | No tasks sent
Thu Jun 25 14:33:34 2015 | SETI@home | No tasks are available for AstroPulse v7
Thu Jun 25 14:33:34 2015 | SETI@home | Tasks for CPU are available, but your preferences are set to not accept them
Thu Jun 25 14:33:34 2015 | SETI@home | Project requested delay of 303 seconds
Thu Jun 25 14:33:34 2015 | SETI@home | [sched_op] handle_scheduler_reply(): got ack for task ap_07fe15aa_B2_P1_00161_20150624_15414.wu_1
Thu Jun 25 14:33:34 2015 | SETI@home | [sched_op] Deferring communication for 00:05:03
Thu Jun 25 14:33:34 2015 | SETI@home | [sched_op] Reason: requested by project
Thu Jun 25 14:38:19 2015 | SETI@home | Message from task: 0
Thu Jun 25 14:38:19 2015 | SETI@home | Computation for task ap_07fe15aa_B3_P0_00092_20150624_17213.wu_1 finished
Thu Jun 25 14:38:19 2015 | SETI@home | Starting task ap_06fe15aa_B6_P1_00291_20150624_15160.wu_1
Thu Jun 25 14:38:22 2015 | SETI@home | Started upload of ap_07fe15aa_B3_P0_00092_20150624_17213.wu_1_0
Thu Jun 25 14:38:25 2015 | SETI@home | Finished upload of ap_07fe15aa_B3_P0_00092_20150624_17213.wu_1_0
Thu Jun 25 14:38:40 2015 | SETI@home | [sched_op] Starting scheduler request
Thu Jun 25 14:38:40 2015 | SETI@home | Sending scheduler request: To report completed tasks.
Thu Jun 25 14:38:40 2015 | SETI@home | Reporting 1 completed tasks
Thu Jun 25 14:38:40 2015 | SETI@home | Requesting new tasks for AMD/ATI GPU
Thu Jun 25 14:38:40 2015 | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
Thu Jun 25 14:38:40 2015 | SETI@home | [sched_op] AMD/ATI GPU work request: 662485.94 seconds; 0.00 devices
Thu Jun 25 14:38:42 2015 | SETI@home | Scheduler request completed: got 0 new tasks
Thu Jun 25 14:38:42 2015 | SETI@home | [sched_op] Server version 707
Thu Jun 25 14:38:42 2015 | SETI@home | Not sending work - last request too recent: 149 sec
In every case I've seen where a log was supplied it is the same. There wasn't any action initiated by the host at the stipulated time. The Server is literally pulling a request out of thin air without deliberate action by the user. ID: 1696016 ·
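For anyone who wants to check their own event log for the same pattern, something along these lines might do. It is a minimal sketch, not a BOINC tool: it assumes pipe-separated log lines in the format shown in the post, an English locale for the day and month names, and client and server clocks that roughly agree. The function name `find_ghost_contacts` and the 5 second tolerance are invented for the example. It takes each "last request too recent: N sec" message, works out when the server believes the previous contact happened, and reports that time if the client has no "Starting scheduler request" entry near it.

```python
import re
from datetime import datetime, timedelta

# A few lines from the host log above, enough to show the pattern.
LOG = """\
Thu Jun 25 14:32:01 2015 | SETI@home | Scheduler request failed: Timeout was reached
Thu Jun 25 14:33:31 2015 | SETI@home | [sched_op] Starting scheduler request
Thu Jun 25 14:33:34 2015 | SETI@home | Scheduler request completed: got 0 new tasks
Thu Jun 25 14:38:40 2015 | SETI@home | [sched_op] Starting scheduler request
Thu Jun 25 14:38:42 2015 | SETI@home | Not sending work - last request too recent: 149 sec
"""

TOO_RECENT = re.compile(r"last request too recent: (\d+) sec")

def parse(line):
    """Split one log line into (timestamp, message)."""
    stamp, _project, message = (part.strip() for part in line.split("|", 2))
    return datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y"), message

def find_ghost_contacts(log_text, tolerance=timedelta(seconds=5)):
    """Return implied server-side contact times that the client never logged."""
    entries = [parse(line) for line in log_text.splitlines() if line.strip()]
    client_requests = [ts for ts, msg in entries if "Starting scheduler request" in msg]
    ghosts = []
    for ts, msg in entries:
        match = TOO_RECENT.search(msg)
        if match is None:
            continue
        implied = ts - timedelta(seconds=int(match.group(1)))
        if not any(abs(implied - req) <= tolerance for req in client_requests):
            ghosts.append(implied)
    return ghosts

ghosts = find_ghost_contacts(LOG)
for ghost in ghosts:
    print("server saw a contact the client never logged at", ghost.strftime("%H:%M:%S"))
```

Note that "Starting scheduler request" only appears when the sched_op debug flag is enabled on the client, so without that flag a different marker line would be needed.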

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1696020 - Posted: 26 Jun 2015, 18:51:14 UTC Last modified: 26 Jun 2015, 19:03:27 UTC One interesting & probably unrelated eyeopener has turned up already, looking at the basic authentication.
// If the seqno from the host is less than what we expect,
// the user must have copied the state file to a different host.
// Make a new host record.
//
It'd take a lot of convincing for me to believe that was a good way to handle out of sequence requests, rather than a recipe for strange issues. [Edit:] and further down, a possible strike:
// If host CPID is present,
// scan backwards through this user's hosts,
// looking for one with the same host CPID.
// If we find one, it means the user detached and reattached.
// Use the existing host record,
// and mark in-progress results as over.
Can probably work some theories from this chestnut. This was in the section where no hostid was given or it was invalid. I wonder how long it would take to scan the whole hosts table looking for an entry that wasn't there, then check all the user's hosts for cpids, then mark the results as over. Long enough to explain the time gap ? "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1696020 ·
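To make the theory concrete, here is a toy model of how those two commented branches could combine to abandon work after a late, out-of-order request. It is a sketch under loud assumptions and not the BOINC server code: the `Host` class, `handle_request`, and the branch labels are all invented, it assumes the low-seqno case falls through to the same "make a new host record" path where the CPID scan happens, and it assumes each client request carries a higher sequence number than the one before.

```python
from dataclasses import dataclass, field

@dataclass
class Host:
    host_id: int
    cpid: str
    expected_seqno: int = 0
    in_progress: list = field(default_factory=list)
    abandoned: list = field(default_factory=list)

def handle_request(hosts, host_id, cpid, seqno):
    """Toy scheduler front end; returns a label for the branch taken."""
    host = hosts.get(host_id)
    if host is not None and seqno >= host.expected_seqno:
        host.expected_seqno = seqno + 1
        return "normal"
    # ASSUMPTION: a missing host id and a lower-than-expected seqno both end
    # up on the "make a new host record" path described in the comments.
    # Scan backwards through the known hosts for a matching CPID.
    for candidate in sorted(hosts.values(), key=lambda h: h.host_id, reverse=True):
        if candidate.cpid == cpid:
            # "It means the user detached and reattached": reuse the record
            # and mark in-progress results as over.
            candidate.abandoned.extend(candidate.in_progress)
            candidate.in_progress.clear()
            candidate.expected_seqno = seqno + 1
            return "assumed_reattach"
    new_id = max(hosts, default=0) + 1
    hosts[new_id] = Host(new_id, cpid, seqno + 1)
    return "new_host"

hosts = {1: Host(1, "cpid-abc", expected_seqno=10,
                 in_progress=["ap_task_1", "ap_task_2"])}

# Request 1 (seqno 10) is delayed in transit; request 2 (seqno 11) arrives first.
first = handle_request(hosts, host_id=1, cpid="cpid-abc", seqno=11)
second = handle_request(hosts, host_id=1, cpid="cpid-abc", seqno=10)
print(first, second, hosts[1].abandoned)
```

If the real code behaves anything like this, a single delayed request would be enough to trash a host's whole queue without the host ever detaching, which fits the symptoms reported in this thread. Whether the real server actually follows this path is exactly what still needs checking against the source and the server logs.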

HAL9000 Volunteer tester Send message Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57	Message 1696022 - Posted: 26 Jun 2015, 19:09:13 UTC - in response to Message 1696020. One interesting & probably unrelated eyeopener has turned up already, looking at the basic authentication.
// If the seqno from the host is less than what we expect,
// the user must have copied the state file to a different host.
// Make a new host record.
//
It'd take a lot of convincing for me to believe that was a good way to handle out of sequence requests, rather than a recipe for strange issues. [Edit:] and further down, a possible strike:
// If host CPID is present,
// scan backwards through this user's hosts,
// looking for one with the same host CPID.
// If we find one, it means the user detached and reattached.
// Use the existing host record,
// and mark in-progress results as over.
Can probably work some theories from this chestnut. This was in the section where no hostid was given or it was invalid. I wonder how long it would take to scan the whole hosts table looking for an entry that wasn't there, then check all the user's hosts for cpids, then mark the results as over. Long enough to explain the time gap ? I wonder how often a false "host not found" occurs. Which would possibly explain why the server will sometimes randomly assign a new CPID to a host. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the BP6/VP6 User Group (http://tinyurl.com/8y46zvu) ID: 1696022 ·

jason_gee Volunteer developer Volunteer tester Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 1696024 - Posted: 26 Jun 2015, 19:13:04 UTC - in response to Message 1696022. I wonder how often a false "host not found" occurs. Which would possibly explain why the server will sometimes randomly assign a new CPID to a host. Yes, I'm trying to get my mind around the assumptions those comments indicate, along with the corresponding code. I would argue that an out of sequence sequence number indicates nothing more than something was out of sequence (for any number of reasons), and that a missing hostid means a missing hostid. I can see the logic in then looking at the CPID, but not the additional assumptions. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. ID: 1696024 ·
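One way to picture the narrower reading argued for in the post above: keep the two conditions separate and draw no further conclusion from either. This is purely illustrative and not a proposed patch; `Verdict` and `classify` are hypothetical names, and nothing here reflects how the BOINC server is actually structured.

```python
from enum import Enum

class Verdict(Enum):
    PROCESS = "process normally"
    STALE = "out of sequence: log it, leave results alone"
    FIND_BY_CPID = "no host id: look the host up by CPID"

def classify(host_id_valid: bool, seqno: int, expected_seqno: int) -> Verdict:
    """Treat each condition as meaning only what it literally says."""
    if not host_id_valid:
        # A missing host id means a missing host id; a CPID lookup is reasonable.
        return Verdict.FIND_BY_CPID
    if seqno < expected_seqno:
        # Out of sequence means out of sequence, nothing more.
        return Verdict.STALE
    return Verdict.PROCESS

print(classify(True, 10, 12).value)
```

The obvious trade-off is that this would no longer catch the copied state file case the original comment was written for, so some other signal would be needed to detect that.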
