Suddenly BOINC Decides to Abandon 71 APs...WTH?
Author | Message |
---|---|
Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
I find the emphasis on that interesting, when it's a matter of a host having detached. The mind naturally wanders to the old spontaneous detach problem. True 'evidence' is a fantastic thing to call on when you have control of proceedings. In the common cases you will need to either defer to the provided evidence or provide your own, which still doesn't attest to the truth unless you are God. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
[quote]...We do have one member of the 'team' who has access to server logs, and I caught his attention with a possibly related case a couple of weeks ago: Immediate timeout? Missing deadline?. But so far, as you can see, no diagnosis or resolution.[/quote] To me that appears to be a different problem, something occurring with the deadline. The problem I had occurred when, after a Timed Out scheduler contact, the Server decided My host had been detached and trashed All of my tasks. Possibly there is something wrong with the method the Server uses to determine whether a host has detached. Maybe the Server should wait for another contact before making such a rash decision? !!!!!!!!!!!!!!!!!!!!!!!!!!!!! Yep, he just got Whacked again: 26 Jun 2015, 0:47:27 UTC, Abandoned. I hate it when that happens... |
Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Yeah Richard's right in that we need evidence really, but we'd need some theories to throw in the ring first. Deadline issues sound like a theory, which gels with my wacky estimate one, and spontaneous detach is something we've seen before. Not sure they are connected to one another, though I suppose that doesn't mean they can't be related underneath, or that a host couldn't manifest multiple symptoms at the same time. I wonder how I would go about inducing a spontaneous detach? |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
A while back, when this was being discussed, it was found that the tasks were Abandoned after the following two events: 1) `Scheduler request failed: Timeout was reached`, `[sched_op] Deferring communication for 00:01:25`, `[sched_op] Reason: Scheduler request failed`; 2) `Not sending work - last request too recent: 149 sec`. There is never any record of the 'Ghost' request that caused the other request to be too recent. It was theorized that the Timed Out request arrived late and out of sequence, causing BOINC to Abandon the tasks. A solution would be to find out what is contained in that 'Ghost' request and bar BOINC from Abandoning tasks based on that type of 'Ghost' whatever. You might find out what is in that 'Ghost' whatever by looking at the Server Log of that host on Beta. He will be suffering the Event in a few more hours if the pattern continues. You need to work fast though; I doubt the owner is aware that his machine has been spending most of its time working Abandoned, Worthless Tasks, and he will probably be angry when he discovers all the waste. He might just stop working any tasks... |
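
One way to check a host's event log for the two-event pattern described above is sketched here. It is only an illustration: the first sample line is the log line quoted later in this thread, the other two sample lines are assembled from timestamps in the thread with an assumed message layout, and the function names and the 15-minute window are hypothetical choices, not anything BOINC provides.

```python
from datetime import datetime

TIMEOUT_MARK = "Scheduler request failed: Timeout was reached"
TOO_RECENT_MARK = "last request too recent"


def parse_line(line):
    """Split 'timestamp | project | message' into (datetime, message)."""
    stamp, _project, message = (part.strip() for part in line.split("|", 2))
    return datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y"), message


def find_suspect_pairs(lines, window_seconds=900):
    """Return (timeout_time, too_recent_time) pairs seen within the window."""
    pairs = []
    last_timeout = None
    for line in lines:
        when, message = parse_line(line)
        if TIMEOUT_MARK in message:
            last_timeout = when
        elif TOO_RECENT_MARK in message and last_timeout is not None:
            if (when - last_timeout).total_seconds() <= window_seconds:
                pairs.append((last_timeout, when))
            last_timeout = None
    return pairs


# Sample lines: only the first is quoted verbatim in the thread; the layout
# and wording of the other two are assumptions made for illustration.
sample_log = [
    "Thu Jun 25 14:32:01 2015 | SETI@home | Scheduler request failed: Timeout was reached",
    "Thu Jun 25 14:33:34 2015 | SETI@home | Scheduler request completed",
    "Thu Jun 25 14:38:42 2015 | SETI@home | Not sending work - last request too recent: 149 sec",
]
suspect_pairs = find_suspect_pairs(sample_log)
print(suspect_pairs)
```

A hit from a scan like this only says the pattern occurred on the client side. Whether tasks were actually abandoned still has to be confirmed on the host's results page.
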
Cosmic_Ocean Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13 |
Something is standing out to me from the first post in this thread. Look at the timestamps for the chain of events: 14:26:47 Request 1; 14:32:01 Reply 1 timed out (5m 14s after above, backing off for 1m 25s); 14:33:31 Request 2 (1m 30s after above); 14:33:34 Reply 2 completed (3s after above); 14:36:13 server marked a pile of tasks as abandoned; 14:38:40 Request 3 (5m 6s after Reply 2); 14:38:42 Reply 3, not sending work, last request was 149s ago (????) (2s after above); 14:43:47 Request 4 (5m 5s after above); 14:43:49 Reply 4 (2s after above). Best I can gather, BOINC gave up on that first request when it timed out for whatever reason, but the request ended up finally being processed by the server 9m 26s after it originated (14:26:47 to 14:36:13). AND there was a complete request that happened between Request 1 and everything being abandoned. The only thing that seems to make sense is that the scheduler received a request that had an out-of-sequence counter/timestamp in it (because it was older), so it just defaults to purging everything because obviously something must be very wrong. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving-up) |
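
The arithmetic behind that reading of the timeline can be checked with a few lines of Python. This is a sketch under two assumptions: the events fall on 25 Jun 2015 (the date in the log line quoted elsewhere in the thread), and the client and server clocks agree closely enough to compare. The helper name `at` is made up for the example.

```python
from datetime import datetime, timedelta


def at(clock):
    """Build a datetime on 25 Jun 2015 from an HH:MM:SS string."""
    return datetime.strptime("2015-06-25 " + clock, "%Y-%m-%d %H:%M:%S")


request_1 = at("14:26:47")   # first request leaves the client
abandoned = at("14:36:13")   # server marks the tasks as abandoned
reply_3 = at("14:38:42")     # "not sending work, last request was 149s ago"

# Work backwards from Reply 3 to the request the server thinks it saw.
ghost_seen = reply_3 - timedelta(seconds=149)

# Gap between Request 1 leaving the client and the tasks being abandoned.
delay = abandoned - request_1

print(ghost_seen.time(), ghost_seen == abandoned)
print(int(delay.total_seconds()), "seconds")
```

Counting back 149 seconds from Reply 3 lands on 14:36:13, the same second the tasks were marked abandoned, and that moment is 566 seconds (9m 26s) after Request 1. This is consistent with the late-arriving request theory, though it does not prove it.
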
Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
True. That host is a beast, and I'd be pretty disappointed to be munching away for nothing too. If multiplied across enough hosts, could something going on with the server logic here explain, at least partially, the radically increased server load when 'resend lost tasks' is enabled? |
Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
[quote]The only thing that seems to make sense is that the scheduler received a request that had an out-of-sequence counter/timestamp in it (because it was older), so it just defaults to purging everything because obviously something must be very wrong.[/quote] As resend lost tasks is disabled, does the client [on a later request] at least get something to indicate the tasks should be aborted? [Edit: I imagine not, might take a stroll through scheduler logic a bit later, when beer stocks are in] |
kittyman Joined: 9 Jul 00 Posts: 51540 Credit: 1,018,363,574 RAC: 1,004 |
Ahh,,, the WTF syndrome.... Kitties, come hither...and witness this. Meow? Meow? Meow? Pay attention, kitties, this could happen to YOU. "Time is simply the mechanism that keeps everything from happening all at once." |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874 |
[quote]Yeah Richard's right in that we need evidence really, but we'd need some theories to throw in the ring first. Deadline issues sound like a theory, which gels with my wacky estimate one, and spontaneous detach is something we've seen before. Not sure they are connected to one another, though I suppose that doesn't mean they can't be related underneath, or that a host couldn't manifest multiple symptoms at the same time.[/quote] My logic in raising (and linking) the Einstein deadline issue was something like this: Einstein don't do variable deadlines. So something has changed the deadline - changed the database record - after the task was first allocated. Einstein also run very much older server code. So, perhaps there's a common cause - something like the out-of-sequence RPC processing, which I think Cosmic_Ocean has analysed very clearly - which causes the server to modify the task records. But the different server generations perform different actions: we abandon, they re-deadline. Unfortunately for that theory, mark_results_over() was added to the server code on 6 June 2006 (SVN 10258), which is probably too long ago. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874 |
I had a PM about a first-hand observation of the 'abandon' phenomenon overnight. I think the contents were intended as a contribution to the bug-hunt - the PM was just because the writer was too busy to be drawn into the extended conversation (as was said in the bit I'm not publishing). [quote]I had a couple of these problems, about a year ago. My son had installed Steam as required by a certain game. We concluded the internet connection problem to Seti was caused by the Steam player to player comms sub-program. We stopped Steam loading by default, so that it was only used when playing the game, which we set to stop BOINC in the "exclusive programs" tab.[/quote] |
Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
In that case I'd have a good look at what version(s) of curl or c-ares are being used for the communications [and how they are being used...]. As it updates frequently, I'd suspect Steam would be using a fairly new version (of those or something similar), and there may be compatibility issues between the new and the old running at the same time. So it wouldn't necessarily be only Steam that could raise the symptoms. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874 |
[quote]In that case I'd have a good look at what version(s) of curl or c-ares are being used for the communications [and how they are being used...]. As it updates frequently, I'd suspect Steam would be using a fairly new version (of those or something similar), and there may be compatibility issues between the new and the old running at the same time. So it wouldn't necessarily be only Steam that could raise the symptoms.[/quote] And since curl/c-ares are distributed pre-compiled within BOINC, that brings into play "how often do people update their BOINC clients?" - in other words, is there any correlation (either way) between abandonment and the version of BOINC in use, either older or newer. |
Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Yes, that came to mind, though the linked Einstein machine said, I think, 7.4.42 or 7.2.42? Well, the development world (including mine) is changing in many ways to address the need for continual improvement right through to the user. Whether BOINC decides to wait to adopt the methods, or stick with the "download major traumatic update" route, probably isn't on their agenda. It'll be interesting to see if they'll be able to keep up with the OSes they're targeting... |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
The old Steam case. That sounds good until you discover others are not using Steam, or even playing games on their computer for that matter. Same with the WiFi: sounds good until someone using Fiber Optics has the same problem. BTW, with FIOS the TV uses the router and will not work correctly if there is something wrong with the router, not to mention the other 3 computers using the same router. Then you find it happens with Old versions of BOINC as well as New. I think it would be more constructive to determine why BOINC suddenly, minutes after a successful contact, determines a Host has detached even though No Detach signals have been sent by the Host. Instead of Immediately trashing All work, perhaps BOINC can wait a few seconds and conduct another contact? It would also be nice if a notification could be sent to the Host if All the tasks are Trashed. Currently, if the Owner doesn't look at the Host Results Webpage, he has No Clue All the tasks have been trashed; the event log just continues as normal, as if all is well. Not Nice. BOINC certainly doesn't have any trouble sending meaningless notifications such as this: SETI@home Beta Test: Notice from BOINC. You would think the fact your computer is wasting resources on worthless tasks would be higher on the priority list. |
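
The two suggestions in this post (wait for a confirming contact before trashing work, and tell the host when tasks are trashed) could look roughly like the sketch below. Everything in it is hypothetical: `HostRecord`, `handle_detach_evidence`, `handle_normal_contact` and the two-strike threshold are invented for illustration and are not taken from the BOINC server code.

```python
from dataclasses import dataclass, field


@dataclass
class HostRecord:
    """Hypothetical server-side view of one host."""
    tasks_in_progress: list
    pending_detach_evidence: int = 0
    notices: list = field(default_factory=list)


def handle_detach_evidence(host, confirmations_required=2):
    """Abandon tasks only after repeated evidence that the host detached."""
    host.pending_detach_evidence += 1
    if host.pending_detach_evidence < confirmations_required:
        return "deferred"  # wait for another contact before acting
    abandoned = list(host.tasks_in_progress)
    host.tasks_in_progress.clear()
    # Queue a notice so the owner learns about it from the client.
    host.notices.append(f"{len(abandoned)} task(s) were marked abandoned by the server")
    host.pending_detach_evidence = 0
    return "abandoned"


def handle_normal_contact(host):
    """A well-formed contact contradicts earlier detach evidence."""
    host.pending_detach_evidence = 0


host = HostRecord(tasks_in_progress=[f"ap_task_{i}" for i in range(71)])
first = handle_detach_evidence(host)    # a single stray request: deferred
handle_normal_contact(host)             # normal contact clears the suspicion
second = handle_detach_evidence(host)   # counted from scratch: deferred
third = handle_detach_evidence(host)    # second strike in a row: abandoned
print(first, second, third, host.notices)
```

The design choice here is that one odd request is never enough on its own, and any ordinary contact in between resets the count. A real implementation would have to decide what counts as confirming evidence and how long to wait for it.
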
Richard Haselgrove Joined: 4 Jul 99 Posts: 14690 Credit: 200,643,578 RAC: 874 |
[quote]I think it would be more constructive to determine why BOINC suddenly, minutes after a successful contact, determines a Host has detached even though No Detach signals have been sent by the Host.[/quote] Yes. That's what we're trying to do. Do note that if you actively, deliberately, detach a host - no signal is sent. Try it sometime - any tasks in progress will just sit there until their deadline. As far as the server is concerned, the host has just been switched off for an extended holiday. Of course, if you re-attach the host, the server does get a signal, and reacts accordingly. Maybe it sends you the old tasks back again (but not if 'resend lost results' is turned off, as it usually seems to be these days). Or it might send you new tasks, and leave the old ones to meet their deadline. But I don't think that anyone has ever reported seeing the tasks marked as 'abandoned' when they have consciously and deliberately clicked the 'detach' button. Hence my emphasis on the word 'evidence' in my reply to Jason about the source code comment. The server is actioning a different sort of evidence-based detach from the normal, silent one that occurs when the user clicks the detach button. So, what is that evidence, and what causes it to appear without deliberate action by the user? |
Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Yes, to clarify: the procedure is one of 'differential diagnosis', whereby you put forward as many possibilities as could explain the symptoms, and discount them until you are left with either one standing or none (which means you missed something or need to try another method altogether). The possible obscure connection between a detach and marking tasks as abandoned could be a clue, because I'd expect those two situations to occur close together in only a few places, so I can look. There's still the possibility that it's a conjunction of two or more unrelated issues, as opposed to one condition, but eliminating the easier ones first narrows the field. The 'Steam' one I would call a symptom rather than a cause, and it remains on the board. Whether Steam was doing that on one particular host or many isn't really the point, so much as that other programs happily use the network alongside Steam, so why not BOINC? IMO it is more than likely just one situation that triggers underlying problems with the BOINC client and/or server, rather than Steam's fault. |
TBar Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
[quote]I think it would be more constructive to determine why BOINC suddenly, minutes after a successful contact, determines a Host has detached even though No Detach signals have been sent by the Host.[/quote] Maybe you can rephrase the question. To me the evidence is the Server saying it received a request that doesn't appear in the Host log: Thu Jun 25 14:32:01 2015 | SETI@home | Scheduler request failed: Timeout was reached. In every case I've seen where a log was supplied it is the same. There wasn't any action initiated by the host at the stipulated time. The Server is literally pulling a request out of thin air without deliberate action by the user. |
Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
One interesting & probably unrelated eyeopener has turned up already, looking at the basic authentication. `// If the seqno from the host is less than what we expect,` It'd take a lot of convincing for me to believe that was a good way to handle out-of-sequence requests, rather than a recipe for strange issues. [Edit:] And further down, a possible strike: `// If host CPID is present,` Can probably work some theories from this chestnut. This was in the section where no hostid was given or it was invalid. I wonder how long it would take to scan the whole hosts table looking for an entry that wasn't there, then check all the user's hosts for CPIDs, then mark the results as over. Long enough to explain the time gap? |
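
For contrast, here is a minimal sketch of a more conservative way to treat a sequence number that is lower than expected: log it as a late or duplicate request and leave the host's tasks alone. The quoted source comment is cut off, so what the scheduler actually does in that branch is not shown in this thread. The function names and the `host_state` dictionary below are invented for illustration only.

```python
def classify_request(expected_seqno, received_seqno):
    """Label a request by comparing its sequence number with the expected one."""
    if received_seqno < expected_seqno:
        return "stale"      # older than something already processed
    if received_seqno == expected_seqno:
        return "in_order"
    return "ahead"          # one or more requests never arrived (yet)


def process(host_state, received_seqno):
    """Handle one request; a stale request changes nothing except a log entry."""
    kind = classify_request(host_state["expected_seqno"], received_seqno)
    if kind == "stale":
        host_state["ignored"].append(received_seqno)
        return kind
    host_state["expected_seqno"] = received_seqno + 1
    return kind


# Replay of the pattern discussed in this thread: request 1 times out on the
# client, request 2 is processed first, then request 1 finally reaches the server.
host_state = {"expected_seqno": 1, "ignored": [], "tasks": ["task_a", "task_b"]}
outcome_second = process(host_state, 2)   # arrives first
outcome_first = process(host_state, 1)    # arrives late
print(outcome_second, outcome_first, host_state)
```

Under this policy the late request is recorded and dropped, and the task list is untouched. Whether that is safe in every case the real scheduler has to cover (a host restored from an old backup, for example) is a separate question this sketch does not answer.
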
Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
[quote]One interesting & probably unrelated eyeopener has turned up already, looking at the basic authentication.[/quote] I wonder how often a false "host not found" occurs. Which would possibly explain why the server will sometimes randomly assign a new CPID to a host. SETI@home classic workunits: 93,865 CPU time: 863,447 hours |
Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
[quote]I wonder how often a false "host not found" occurs. Which would possibly explain why the server will sometimes randomly assign a new CPID to a host.[/quote] Yes, I'm trying to get my mind around the assumptions those comments indicate, along with the corresponding code. I would argue that an out-of-sequence sequence number indicates nothing more than that something was out of sequence (for any number of reasons), and that a missing hostid means a missing hostid. I can see the logic in then looking at the CPID, but not the additional assumptions. |