Message boards :
Number crunching :
Suddenly BOINC Decides to Abandon 71 APs...WTH?
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
My goodness, you boys have been busy while some of us slept! Seems to have gotten a lot more complicated since the last installment I read, where we were simply discussing the handling of out-of-sequence requests. Let's see, what caught my eye.... "edit2: I still think it's exceedingly impertinent to insinuate that you were doing something dodgy, when the most probable cause is having reverted to a backup for some reason." Actually, the probable cause that triggered this discussion was having an initial request get hung up in transmission, time out, then have a second request arrive at the scheduler before the first one eventually trundles in. My successful test of the abandonment simply used a backup copy to simulate the out-of-sequence condition. "exactly - so just check the host really hasn't anything running before we ditch the lot." I like that even better than my "do nothing except report the out-of-sequence condition" suggestion. If active tasks are included in the request, why abandon them? I have a few other thoughts, but with the outage looming, I think I'll just post this quick. EDIT: Actually, one minor issue with that last suggestion just occurred to me. If the second request (arriving first) generated new tasks, those tasks would not show up in the later-arriving first request. You wouldn't want to abandon those new ones. |
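Jeff's "check for active tasks before abandoning" idea could be sketched roughly like this. This is a hypothetical Python sketch, not actual BOINC scheduler code: the names `rpc_seqno` and `other_results` echo fields that appear in real scheduler requests, but the classes and return strings are pure illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Host:
    rpc_seqno: int = 0          # last sequence number the server accepted

@dataclass
class Request:
    rpc_seqno: int
    # Tasks the client says it still has in progress (illustrative stand-in
    # for the task list a real scheduler request carries).
    other_results: list = field(default_factory=list)

def handle_request(host, req):
    # In-sequence request: accept it and record the new seqno.
    if req.rpc_seqno > host.rpc_seqno:
        host.rpc_seqno = req.rpc_seqno
        return "normal"
    # Out-of-sequence, but the client still reports tasks in progress:
    # just note the condition, abandon nothing.
    if req.other_results:
        return "out_of_sequence_ignored"
    # Client reports nothing at all: treat as a genuine detach/reset,
    # where abandoning the server-side records is the right call.
    return "abandon_tasks"
```

The point of the middle branch is exactly William's "check the host really hasn't anything running before we ditch the lot".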
William Send message Joined: 14 Feb 13 Posts: 2037 Credit: 17,689,662 RAC: 0 |
Well, we actually have two conditions - one is the backup scenario, the other is the 'ask twice - RPCs get out of order' one that keeps killing the cache of some people here. The bits of code triggered are the same. A person who won't read has no advantage over one who can't read. (Mark Twain) |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
"since we know that one easy way to generate a new hostid (and thereby wipe silly APR entries) was to trigger the low rpc seqno code, we know that area of code is fairly new." Except - Testing at CPDN (Computers for UserID 216408), I did create a new HostID simply by changing the seqno (Yay!), but it didn't mark the ghosts I created earlier as abandoned (Boo!). Testing at FiND (Computers for UserID 124464), even faking the HostID wasn't enough, but setting a fake HostID with allow_multiple_clients was (and revealed a Boinc Bug along the way). In both cases, the host marker for this experiment is the 8-processor Xeon E5320 - I've only got one of those, the others are clones. |
Jeff Buck Send message Joined: 11 Feb 00 Posts: 1441 Credit: 148,764,870 RAC: 0 |
"well we actually have two conditions - one is the backup scenario the other is the 'ask twice - rpcs get out of order' that keeps killing the cache of some people here. the bits of code triggered are the same." Yes, so should the response to those two conditions be different? And where does the fraud-blocking that was discussed earlier come in? |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Tested at Einstein just before the outage started. Computers for UserID 144054. Same outcome as at CPDN: messing with the seqno produced a new HostID (always on the default venue, not inheriting previous settings), but didn't abandon the previous tasks. I've now reverted it to the previous HostID, and am successfully crunching both pre-existing and newly-fetched work. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Note Ivan's tale of woe in 'rm -rf *' considered harmful, especially the [edit] to the opening post. In that case, deleting all (or a substantial proportion) of the files was equivalent to a true 'detach', so the action taken was appropriate and indeed desirable. We need to be sure we keep the handler for that situation. |
ivan Send message Joined: 5 Mar 01 Posts: 783 Credit: 348,560,338 RAC: 223 |
Note Ivan's tale of woe in 'rm -rf *' considered harmful, especially the [edit] to the opening post. I'm glad someone found some good in the situation. :-) Me, I'm off-campus for two days, will revisit the problems on Friday. |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Expanding on my earlier suggestion: I see the problem that started this thread as the Client making an Assumption. After the Request timed out, the Client Assumed the Request was not and will not be acted upon by the Server; this is a Very Bad assumption. It then made the mistake of sending another request while the current one was still in progress. One way to solve the problem would be to Not send another request until it hears from the Server or a suitable period of time passes. In the cases I'm aware of, 10 minutes is usually the maximum time it takes to clear the request. So, a safe procedure would resemble: 1) Request Times Out; Request Server Acknowledge Current Request; wait 2.5 minutes. 2) Request Server Acknowledge Current Request; wait 2.5 minutes. 3) Request Server Acknowledge Current Request; wait 2.5 minutes. 4) Send New Request. That should be enough time for the first Request to clear, if it's going to clear. |
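As a rough sketch (hypothetical, not real BOINC client code), TBar's procedure amounts to a retry loop that re-asks the server about the current request before ever sending a new one; `send_request` here stands in for whatever RPC the client would actually use, and the "acknowledge" message type is an assumption of this sketch, not an existing protocol message.

```python
import time

def fetch_with_retry(send_request, wait_s=150, max_acks=3):
    """After a timed-out scheduler request: keep asking the server to
    acknowledge the *current* request rather than firing off a new one."""
    for _ in range(max_acks):
        reply = send_request("acknowledge_current")
        if reply is not None:
            return reply          # server answered; no duplicate request sent
        time.sleep(wait_s)        # steps 1-3: wait 2.5 minutes between asks
    # ~7.5 minutes with no acknowledgement: assume the original request
    # will never clear, and only now send a genuinely new one (step 4).
    return send_request("new_request")
```

The key property is that a second, genuinely new request only goes out once the client is reasonably sure the first one is dead, which avoids the out-of-order arrival at the scheduler.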
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Yes, ideally timeouts should occur at both ends, server first preferably, and any changes rolled back. The code and database aren't quite structured that way, I think, but something to think about for the future. To clarify the concept, perhaps: the scheduler request represents a virtual image of your host sitting in line on the server. It's not a fire sale at K-Mart, so if you get bored and leave, unfortunately your clone is still sitting in line. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
Now What!!! I just had Beta Abandon a slew of CPU tasks on a Different system. I've never had this before; now, within a Week, it's happened while running both Mavericks and Mountain Lion. Here: http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=71141 This is Not looking good. Here are the 'Events' from 30 Jun from another computer on Beta: 30 Jun 2015, 0:50:17 UTC Abandoned 30 Jun 2015, 1:02:51 UTC Abandoned 30 Jun 2015, 2:30:13 UTC Abandoned 30 Jun 2015, 4:18:31 UTC Abandoned 30 Jun 2015, 6:09:59 UTC Abandoned 30 Jun 2015, 7:12:37 UTC Abandoned 30 Jun 2015, 11:01:06 UTC Abandoned 30 Jun 2015, 12:06:22 UTC Abandoned 30 Jun 2015, 13:03:15 UTC Abandoned http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=71714 All I see in the Log is the Failed Server request...nothing else. 01-Jul-2015 00:30:26 [SETI@home Beta Test] [sched_op] Starting scheduler request |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Not much you can do, afaict so far. I looked to see if there was a client config rpc timeout setting, or similar. Nothing visible in the docs I saw, but I'll take a look and see if there is anything undocumented hiding in the code a bit later. We can have all sorts of questions/theories about why you wouldn't be getting a timely response to start with, and adding server timeouts would take great effort. I'm still considering locking - refusing subsequent requests until prior ones are complete - as possibly the simplest and most solid way to keep the weird client state copy logic but protect legit client RPC timeouts. Haven't come across gotchas in that yet, but still looking. [Edit:] The fact that the original request may have completed, and you never received a reply, is something that would be of concern, as locking alone wouldn't help (probably not without server-side atomic transactions and timeout-rollback). Would setting your clients to disable network, and scripting a periodic enable+update+disable, help? Seems a bit rubbish to me too, if that's any consolation. I'm yet to fathom a pattern in the hosts that explains why some might get the issue and not others. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
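The locking idea - refuse a second RPC while the first is still being processed - might look something like this on the server side. This is purely a hypothetical sketch: BOINC's scheduler isn't structured this way today, and the 600-second staleness cutoff is borrowed from TBar's observed 10-minute clearing time, not from any real config value.

```python
import time

class RpcGate:
    """Per-host lock: a scheduler RPC arriving while a previous one from
    the same host is still being processed gets refused outright, instead
    of being treated as a replacement request."""
    def __init__(self, stale_after_s=600):
        self.in_flight = {}              # host_id -> time the RPC started
        self.stale_after_s = stale_after_s

    def try_begin(self, host_id, now=None):
        now = time.time() if now is None else now
        started = self.in_flight.get(host_id)
        if started is not None and now - started < self.stale_after_s:
            return False                 # prior RPC still active: refuse
        self.in_flight[host_id] = now    # free (or stale): take the lock
        return True

    def finish(self, host_id):
        self.in_flight.pop(host_id, None)  # release when the reply is sent
```

The staleness cutoff matters for exactly the case flagged in the [Edit:] above: if the server crashed mid-request and never called `finish()`, the host would otherwise be locked out forever.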
William Send message Joined: 14 Feb 13 Posts: 2037 Credit: 17,689,662 RAC: 0 |
@TBar: kindly mention your UTC offset when posting log messages. I can work out the offset for Europe and the UK, but when a country has several timezones... @Jeff: I don't see what kind of fraud might mess up the rpc seqno. Being me, i.e. not paranoid, my base assumption is that whatever people do is down to ignorance, stupidity and bad luck, and not to criminal energy. The science has to be protected from criminals [though we don't really do a good job at that ourselves; whenever money (or fame) is involved, some people will cheat]. There are other parts of the code that deal with willful cheating. Until somebody comes up with an example of how fraud would mess up the rpc seqno that's not completely far-fetched, I will consider the matter as resulting from the timeout problem discussed earlier, or from using a backup, and therefore not in need of cheatproofing. @ivan: rm -rf * eh? Been there, done that - not on purpose; a slight typo, ended up with a ' ' in between the filename start and the * ... Thankfully just my user dir vanished and we were doing daily backups. But walk up to the sysadmin and tell him why you need your backup restored... @Richard: please run the scenario where task abandonment was the right choice past me again. I didn't really become involved until yesterday. @Jason: I agree, improving client-server comms is desirable. Getting that BOINC improvement coded and accepted, however, is a different kettle of fish. I'll take the path of least resistance first. @TBar: Oh, it's your thread. Never mind. A person who won't read has no advantage over one who can't read. (Mark Twain) |
TBar Send message Joined: 22 May 99 Posts: 5204 Credit: 840,779,836 RAC: 2,768 |
"Not much you can do afaict so far..." Well, I thought about it for a while. Thought about how it's only happening on this one machine. Thought about any recent changes...and then took action: 01-Jul-2015 02:04:49 [---] Local time is UTC -4 hours. I never had it happen with 7.2.33. The Only reason I upgraded was because SETI has some BS going on requiring Macs to have at least 7.2.42 before they can download any Stock Apps or download work for stock apps. So, I suppose I won't be testing any Stock Apps at Beta anytime soon, unless someone fixes whatever it is requiring Macs to have 7.2.42. I always run Anonymous platform on Main anyway, and you Don't Need 7.2.42 to run Anonymous platform on Main or Beta, proving it's something in the Server making Macs run 7.2.42....something Deep, Dark, and Secret. So, we'll see how 7.2.33 works. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
"Not much you can do afaict so far. I looked to see if there was a client config rpc timeout setting, or similar. Nothing visible in the docs I saw, but I'll take a look and see if there is anything undocumented hiding in the code a bit later." The RPC timeout is controlled by <http_transfer_timeout>seconds</http_transfer_timeout> - see the Client configuration page. I tested and reported that during some previous crisis, when Eric mistakenly thought that the timeout message referred to a server config, and tweaked that instead. |
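For anyone wanting to try it, the option goes in the client's cc_config.xml; a minimal example (the 1800-second value is only an illustration picked to match the half-hour suggestion later in the thread, not the default):

```xml
<cc_config>
  <options>
    <!-- Give up on an HTTP transfer (including scheduler RPCs) after
         this many seconds of no progress. 1800 = half an hour. -->
    <http_transfer_timeout>1800</http_transfer_timeout>
  </options>
</cc_config>
```

The client re-reads cc_config.xml on restart, or on a "read config file" request from the Manager.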
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Alright, sounds reasonable, as it's quite possible they fiddled with the RPC timeouts, either making them shorter or breaking them. For the sake of eliminating the easy stuff that we're all very certain there is no problem with... Is there any Mac equivalent to the Windows pathping or Linux mtr command? Some example pinging your local router that looks like this would just confirm it's Boinc's/Seti's issue, for everyone to see:

    C:\Users\Jason>pathping 192.168.0.1

    Tracing route to 192.168.0.1 over a maximum of 30 hops
      0  Apollo [192.168.0.10]
      1  192.168.0.1

    Computing statistics for 25 seconds...
                Source to Here   This Node/Link
    Hop  RTT    Lost/Sent = Pct  Lost/Sent = Pct  Address
      0                                           Apollo [192.168.0.10]
                                    0/ 100 =  0%   |
      1    0ms     0/ 100 =  0%     0/ 100 =  0%  192.168.0.1

    Trace complete.

    C:\Users\Jason>pathping setiathome.berkeley.edu

    Tracing route to setiathome.berkeley.edu [169.229.217.150] over a maximum of 30 hops:
      0  Apollo [192.168.0.10]
      1  192.168.0.1
      2  *  *  *

    Computing statistics for 25 seconds...
                Source to Here   This Node/Link
    Hop  RTT    Lost/Sent = Pct  Lost/Sent = Pct  Address
      0                                           Apollo [192.168.0.10]
                                    0/ 100 =  0%   |
      1    0ms     0/ 100 =  0%     0/ 100 =  0%  192.168.0.1

    Trace complete.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"Not much you can do afaict so far. I looked to see if there was a client config rpc timeout setting, or similar. Nothing visible in the docs I saw, but I'll take a look and see if there is anything undocumented hiding in the code a bit later." Hmmm, that's where I looked for the word timeout. Duh, must be time for coffee. I would set that to half an hour. If the responses never arrive in that time, then something's broken. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
William Send message Joined: 14 Feb 13 Posts: 2037 Credit: 17,689,662 RAC: 0 |
"hmmm, that's where I looked for the word timeout, duh, must be time for coffee" ROFLMAO. Just a side note: IIRC, while boinc waits for a response, nothing much else happens, especially not contacts to other projects. You might want to keep that in mind if you run more than just Seti :) A person who won't read has no advantage over one who can't read. (Mark Twain) |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
"@Richard please run the scenario where task abandonment was the right choice past me again. I didn't really become involved until yesterday." If the client has been genuinely detached, or if for some other reason (like Ivan's finger-fumble, or installing an anonymous platform app with the wrong version/plan_class combo) all tasks once present on the host are no longer available, the question arises: what to do with the records of those tasks still present in the server's database? There are three choices: 1) Wait until they time out naturally at deadline. 2) Send them back to the host as lost tasks (requires costly server resources to identify). 3) Bring the deadline forward, or otherwise mark them as 'never going to report', so that wingmates can take over and complete the WU. |
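The three options could be caricatured in a few lines (a hypothetical sketch, not BOINC server code; the dict-based 'task record' and policy names stand in for the real database row and server logic):

```python
from datetime import datetime

def handle_orphaned_task(task, policy, now=None):
    """Apply one of the three options above to a task record whose host
    no longer has the task."""
    now = now or datetime.now()
    if policy == "wait":
        pass                            # option 1: let it expire at deadline
    elif policy == "resend":
        task["resend_lost"] = True      # option 2: flag for resend-lost-tasks
                                        # (costly: the scheduler must identify them)
    elif policy == "abandon":
        task["deadline"] = now          # option 3: bring the deadline forward
        task["state"] = "abandoned"     # so a wingmate can take over the WU
    return task
```

Option 1 costs nothing but delays the workunit for the full deadline; options 2 and 3 trade server effort against turnaround time, which is the tension the rest of this exchange is about.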
William Send message Joined: 14 Feb 13 Posts: 2037 Credit: 17,689,662 RAC: 0 |
"@Richard please run the scenario where task abandonment was the right choice past me again. I didn't really become involved until yesterday." The safety check I am proposing checks whether the client reports 'other_tasks' - unless you managed to lose the files but keep the CS entries, you then still have files, and I propose not to abandon them. If the client doesn't have anything, it's truly lost and the server can kill it. Files present in CS but not physically present are covered by 'resend lost tasks'. edit: Good point - I'll recheck the 'genuine detach' logic again and test it, when I'm more awake and some coffee has kicked in. A person who won't read has no advantage over one who can't read. (Mark Twain) |
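A sketch of that safety check (hypothetical names throughout: `reported_tasks` stands in for the 'other_tasks' list in the request, `db_tasks` for the tasks the server believes the host holds; real BOINC code would work on database rows, not name lists):

```python
def safety_check(reported_tasks, db_tasks):
    """Compare what the out-of-sequence request says the client still has
    against the server's database, and decide per task."""
    # Client reports nothing at all: treat as a genuine detach/reset,
    # so the server may safely abandon everything it holds for this host.
    if not reported_tasks:
        return {name: "abandon" for name in db_tasks}
    reported = set(reported_tasks)
    decisions = {}
    for name in db_tasks:
        if name in reported:
            decisions[name] = "keep"         # still on the client: leave it alone
        else:
            decisions[name] = "resend_lost"  # in the DB but not on the client
    return decisions
```

This captures the three-way outcome William describes: keep what the client still has, resend what exists only in the database, and abandon everything only when the client reports nothing.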
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
"@Richard please run the scenario where task abandonment was the right choice past me again. I didn't really become involved until yesterday." I still haven't had time to study the code/pseudocode samples buried a long way back down this thread. At the moment, 'resend lost tasks' is, I believe, a global switch on the server - either it's on for all RPCs, which clobbers server performance, or it's off. Edit: and when it's off, tasks hang around for the very long deadlines we run here. What we need - and it might well be in your code - is a one-off 'resend lost tasks' check triggered as part of the seqno/detach/authentication path we're exploring. Edit2: depending on the sequence of validation checks, there may need to be extra validation for the one-off 'resend lost tasks' - that the reported 'tasks on host' validly belong to the HostID which is being recycled. |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.