Suddenly BOINC Decides to Abandon 71 APs...WTH?

Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1697076 - Posted: 30 Jun 2015, 15:38:18 UTC
Last modified: 30 Jun 2015, 15:46:34 UTC

My goodness, you boys have been busy while some of us slept! Seems to have gotten a lot more complicated since the last installment I read, where we were simply discussing the handling of out-of-sequence requests. Let's see, what caught my eye....

edit2: I still think it's exceedingly impertinent to insinuate that you were doing something dodgy, when the most probable cause is having reverted to a backup for some reason.
Actually, the probable cause that triggered this discussion was having an initial request get hung up in transmission, timeout, then have a second request arrive at the scheduler before the first one eventually trundles in. My successful test of the abandonment simply used a backup copy to simulate the out-of-sequence condition.

exactly - so just check the host really hasn't anything running before we ditch the lot.
If you want to be more sophisticated, clean out what's really not there.
I like that even better than my "do nothing except report the out-of-sequence condition" suggestion. If active tasks are included in the request, why abandon them?

I have a few other thoughts, but with the outage looming, I think I'll just post this quick.

EDIT: Actually, one minor issue with that last suggestion just occurred to me. If the second request (arriving first) generated new tasks, those tasks would not show up in the later-arriving first request. You wouldn't want to abandon those new ones.
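
In rough C++-style pseudocode, the kind of filter being suggested might look something like this - purely an illustrative sketch, not actual BOINC scheduler code, and the names (InProgressTask, sent_time, stale_request_time) are invented for the example:

#include <set>
#include <string>
#include <vector>

// One row of what the server DB thinks is in progress on this host.
struct InProgressTask {
    std::string name;
    double sent_time;   // when the server sent this task to the host
};

// Decide which DB entries are safe to mark abandoned when a stale,
// out-of-sequence request finally arrives.
std::vector<std::string> tasks_safe_to_abandon(
    const std::vector<InProgressTask>& db_tasks,
    const std::set<std::string>& reported_by_client,  // tasks listed in the stale request
    double stale_request_time)                        // when that request was built
{
    std::vector<std::string> to_abandon;
    for (const auto& t : db_tasks) {
        // Keep anything the client says it still has.
        if (reported_by_client.count(t.name)) continue;
        // Keep anything sent after the stale request was generated:
        // the interleaved (newer) request may already have delivered it.
        if (t.sent_time > stale_request_time) continue;
        to_abandon.push_back(t.name);
    }
    return to_abandon;
}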
ID: 1697076
William
Volunteer tester
Joined: 14 Feb 13
Posts: 2037
Credit: 17,689,662
RAC: 0
Message 1697077 - Posted: 30 Jun 2015, 15:42:48 UTC

Well, we actually have two conditions - one is the backup scenario, the other is the 'ask twice - RPCs get out of order' case that keeps killing the cache of some people here. The bits of code triggered are the same.
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 1697077
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1697078 - Posted: 30 Jun 2015, 15:45:04 UTC - in response to Message 1697068.  

Since we know that one easy way to generate a new hostid (and thereby wipe silly APR entries) was to trigger the low rpc seqno code, we know that area of code is fairly new.
I expect CPDN and Einstein to hand out a fresh hostid - actually, marking the old ones on the old hostid as abandoned then makes sense, since you are not accessing that DB entry any more. But it still leaves the problem that you have stale tasks on the host.

New, better server code doesn't reach conservative projects.
New, better client code doesn't reach conservative users.

As Richard suggested, I think it's best to check out other projects and then try several independent improvements.

Small, easy-to-understand, easy-to-do things have the best chance of getting done ;) [at least if you're not doing it yourself and going through the whole 'git-pull' diplomacy nightmare]

Except -

Testing at CPDN (Computers for UserID 216408), I did create a new HostID simply by changing the seqno (Yay!), but it didn't mark the ghosts I created earlier as abandoned (Boo!).

Testing at FiND (Computers for UserID 124464), even faking the HostID wasn't enough, but setting a fake HostID with allow_multiple_clients was (and it revealed a BOINC bug along the way).

In both cases, the host marker for this experiment is the 8-processor Xeon E5320 - I've only got one of those; the others are clones.
ID: 1697078
Jeff Buck
Volunteer tester
Joined: 11 Feb 00
Posts: 1441
Credit: 148,764,870
RAC: 0
United States
Message 1697079 - Posted: 30 Jun 2015, 15:50:55 UTC - in response to Message 1697077.  

Well, we actually have two conditions - one is the backup scenario, the other is the 'ask twice - RPCs get out of order' case that keeps killing the cache of some people here. The bits of code triggered are the same.

Yes, so should the response to those two conditions be different? And where does the fraud-blocking that was discussed earlier come in?
ID: 1697079
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1697081 - Posted: 30 Jun 2015, 21:39:49 UTC - in response to Message 1697078.  

Tested at Einstein just before the outage started.

Computers for UserID 144054

Same outcome as at CPDN: messing with the seqno produced a new HostID (always on the default venue, not inheriting previous settings), but didn't abandon the previous tasks. I've now reverted it back to the previous HostID, and am successfully crunching both pre-existing and newly-fetched work.
ID: 1697081
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1697119 - Posted: 30 Jun 2015, 22:58:30 UTC

Note Ivan's tale of woe in 'rm -rf *' considered harmful, especially the [edit] to the opening post.

In that case, deleting all (or a substantial proportion) of the files was equivalent to a true 'detach', so the action taken was appropriate and indeed desirable. We need to be sure we keep the handler for that situation.
ID: 1697119
ivan
Volunteer tester
Joined: 5 Mar 01
Posts: 783
Credit: 348,560,338
RAC: 223
United Kingdom
Message 1697126 - Posted: 30 Jun 2015, 23:06:21 UTC - in response to Message 1697119.  

Note Ivan's tale of woe in 'rm -rf *' considered harmful, especially the [edit] to the opening post.

In that case, deleting all (or a substantial proportion) of the files was equivalent to a true 'detach', so the action taken was appropriate and indeed desirable. We need to be sure we keep the handler for that situation.

I'm glad someone found some good in the situation. :-) Me, I'm off-campus for two days, will revisit the problems on Friday.
ID: 1697126
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1697135 - Posted: 30 Jun 2015, 23:23:50 UTC

Expanding on my earlier suggestion, I see the problem that started this thread as the Client making an Assumption.
After the Request timed out, the Client Assumed the Request had not been and would not be acted upon by the Server; this is a Very Bad assumption. It then made the mistake of sending another request while the current one was still in progress. One way to solve the problem would be to Not send another request until it hears from the Server, or until a suitable period of time passes. In the cases I'm aware of, 10 minutes is usually the maximum time it takes to clear the request.
So, a safe procedure would resemble:
1) Request Times Out, Request Server Acknowledge Current Request, wait 2.5 minutes
2) Request Server Acknowledge current Request, wait 2.5 minutes
3) Request Server Acknowledge current Request, wait 2.5 minutes
4) Send New Request
That should be enough time for the first Request to clear, if it's going to clear.
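
As a client-side sketch only - not actual BOINC client code; the two callbacks are stand-ins for whatever the real client would use to check for an acknowledgement and to issue a request - the procedure above boils down to something like:

#include <chrono>
#include <functional>
#include <thread>

// After a scheduler RPC times out, poll for an acknowledgement of the
// outstanding request every 2.5 minutes (three checks in total) before
// allowing a fresh request to go out.
void poll_then_resend(const std::function<bool()>& server_acknowledged,
                      const std::function<void()>& send_new_request)
{
    using namespace std::chrono_literals;
    for (int i = 0; i < 3; ++i) {              // steps 1) to 3) above
        if (server_acknowledged()) break;      // first request has cleared
        std::this_thread::sleep_for(150s);     // wait 2.5 minutes
    }
    send_new_request();                        // step 4): only now send a new request
}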
ID: 1697135
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1697150 - Posted: 1 Jul 2015, 2:26:37 UTC - in response to Message 1697135.  

Yes, ideally timeouts should occur at both ends, server first preferably, and any changes rolled back. The code and database aren't quite structured that way, I think, but it's something to think about for the future.

To clarify the concept, perhaps: the scheduler request represents a virtual image of your host sitting in line on the server. It's not a fire sale at K-Mart, so you get bored and leave; unfortunately, your clone is still sitting in line.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1697150
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1697202 - Posted: 1 Jul 2015, 5:24:33 UTC
Last modified: 1 Jul 2015, 6:24:29 UTC

Now What!!!
I just had Beta Abandon a slew of CPU tasks on a Different system. I've never had this happen before; now, within a Week, it's happened while running both Mavericks and Mountain Lion. Here: http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=71141

This is Not looking good.
Here are the 'Events' from 30 Jun from another computer on Beta:
30 Jun 2015, 0:50:17 UTC Abandoned
30 Jun 2015, 1:02:51 UTC Abandoned
30 Jun 2015, 2:30:13 UTC Abandoned
30 Jun 2015, 4:18:31 UTC Abandoned
30 Jun 2015, 6:09:59 UTC Abandoned
30 Jun 2015, 7:12:37 UTC Abandoned
30 Jun 2015, 11:01:06 UTC Abandoned
30 Jun 2015, 12:06:22 UTC Abandoned
30 Jun 2015, 13:03:15 UTC Abandoned
http://setiweb.ssl.berkeley.edu/beta/results.php?hostid=71714


All I see in the Log is the Failed Server request...nothing else.

01-Jul-2015 00:30:26 [SETI@home Beta Test] [sched_op] Starting scheduler request
01-Jul-2015 00:30:26 [SETI@home Beta Test] Sending scheduler request: To fetch work.
01-Jul-2015 00:30:26 [SETI@home Beta Test] Requesting new tasks for CPU
01-Jul-2015 00:30:26 [SETI@home Beta Test] [sched_op] CPU work request: 12699.08 seconds; 0.00 devices
01-Jul-2015 00:30:26 [SETI@home Beta Test] [sched_op] AMD/ATI GPU work request: 0.00 seconds; 0.00 devices
01-Jul-2015 00:32:24 [SETI@home Beta Test] Message from task: 0
01-Jul-2015 00:32:24 [SETI@home Beta Test] Computation for task 12jl12ab.5489.28594.261993005064.16.255_2 finished
01-Jul-2015 00:32:24 [SETI@home Beta Test] Starting task 12jl12ab.5489.28594.261993005064.16.254_1
01-Jul-2015 00:32:26 [SETI@home Beta Test] Started upload of 12jl12ab.5489.28594.261993005064.16.255_2_0
01-Jul-2015 00:32:28 [SETI@home Beta Test] Finished upload of 12jl12ab.5489.28594.261993005064.16.255_2_0
01-Jul-2015 00:35:16 [SETI@home Beta Test] Message from task: 0
01-Jul-2015 00:35:16 [SETI@home Beta Test] Computation for task 12jl12ab.5489.75183.438086664200.16.25_2 finished
01-Jul-2015 00:35:16 [SETI@home Beta Test] Starting task 12jl12ab.5489.75183.438086664200.16.128_0
01-Jul-2015 00:35:18 [SETI@home Beta Test] Started upload of 12jl12ab.5489.75183.438086664200.16.25_2_0
01-Jul-2015 00:35:21 [SETI@home Beta Test] Message from task: 0
01-Jul-2015 00:35:21 [SETI@home Beta Test] Finished upload of 12jl12ab.5489.75183.438086664200.16.25_2_0
01-Jul-2015 00:35:21 [SETI@home Beta Test] Computation for task 12jl12ab.5489.75183.438086664200.16.128_0 finished
01-Jul-2015 00:35:21 [SETI@home Beta Test] Starting task 12jl12ab.5489.75183.438086664200.16.133_0
01-Jul-2015 00:35:23 [SETI@home Beta Test] Started upload of 12jl12ab.5489.75183.438086664200.16.128_0_0
01-Jul-2015 00:35:25 [SETI@home Beta Test] Finished upload of 12jl12ab.5489.75183.438086664200.16.128_0_0
01-Jul-2015 00:35:39 [SETI@home Beta Test] Scheduler request failed: Timeout was reached
01-Jul-2015 00:35:39 [SETI@home Beta Test] [sched_op] Deferring communication for 00:01:20
01-Jul-2015 00:35:39 [SETI@home Beta Test] [sched_op] Reason: Scheduler request failed
01-Jul-2015 00:37:00 [SETI@home Beta Test] [sched_op] Starting scheduler request
01-Jul-2015 00:37:00 [SETI@home Beta Test] Sending scheduler request: To fetch work.
01-Jul-2015 00:37:00 [SETI@home Beta Test] Reporting 3 completed tasks
01-Jul-2015 00:37:00 [SETI@home Beta Test] Requesting new tasks for CPU
01-Jul-2015 00:37:00 [SETI@home Beta Test] [sched_op] CPU work request: 117965.96 seconds; 0.00 devices
01-Jul-2015 00:37:00 [SETI@home Beta Test] [sched_op] AMD/ATI GPU work request: 0.00 seconds; 0.00 devices
01-Jul-2015 00:37:03 [SETI@home Beta Test] Scheduler request completed: got 1 new tasks
01-Jul-2015 00:37:03 [SETI@home Beta Test] [sched_op] Server version 707
01-Jul-2015 00:37:03 [SETI@home Beta Test] Project requested delay of 7 seconds
01-Jul-2015 00:37:03 [SETI@home Beta Test] [sched_op] estimated total CPU task duration: 49142 seconds
01-Jul-2015 00:37:03 [SETI@home Beta Test] [sched_op] estimated total AMD/ATI GPU task duration: 0 seconds
01-Jul-2015 00:37:03 [SETI@home Beta Test] [sched_op] handle_scheduler_reply(): got ack for task 12jl12ab.5489.75183.438086664200.16.25_2
01-Jul-2015 00:37:03 [SETI@home Beta Test] [sched_op] handle_scheduler_reply(): got ack for task 12jl12ab.5489.28594.261993005064.16.255_2
01-Jul-2015 00:37:03 [SETI@home Beta Test] [sched_op] handle_scheduler_reply(): got ack for task 12jl12ab.5489.75183.438086664200.16.128_0
01-Jul-2015 00:37:03 [SETI@home Beta Test] [sched_op] Deferring communication for 00:00:07
01-Jul-2015 00:37:03 [SETI@home Beta Test] [sched_op] Reason: requested by project
01-Jul-2015 00:37:05 [SETI@home Beta Test] Started download of 12jl12ab.15887.28594.261993005066.16.133
01-Jul-2015 00:37:07 [SETI@home Beta Test] Finished download of 12jl12ab.15887.28594.261993005066.16.133
01-Jul-2015 00:37:13 [SETI@home Beta Test] [sched_op] Starting scheduler request
01-Jul-2015 00:37:13 [SETI@home Beta Test] Sending scheduler request: To fetch work.
01-Jul-2015 00:37:13 [SETI@home Beta Test] Requesting new tasks for CPU
01-Jul-2015 00:37:13 [SETI@home Beta Test] [sched_op] CPU work request: 71372.66 seconds; 0.00 devices
01-Jul-2015 00:37:13 [SETI@home Beta Test] [sched_op] AMD/ATI GPU work request: 0.00 seconds; 0.00 devices
01-Jul-2015 00:37:15 [SETI@home Beta Test] Scheduler request completed: got 1 new tasks
01-Jul-2015 00:37:15 [SETI@home Beta Test] [sched_op] Server version 707
01-Jul-2015 00:37:15 [SETI@home Beta Test] Project requested delay of 7 seconds
01-Jul-2015 00:37:15 [SETI@home Beta Test] [sched_op] estimated total CPU task duration: 49128 seconds
01-Jul-2015 00:37:15 [SETI@home Beta Test] [sched_op] estimated total AMD/ATI GPU task duration: 0 seconds
01-Jul-2015 00:37:15 [SETI@home Beta Test] [sched_op] Deferring communication for 00:00:07
01-Jul-2015 00:37:15 [SETI@home Beta Test] [sched_op] Reason: requested by project
01-Jul-2015 00:37:17 [SETI@home Beta Test] Started download of 12jl12ab.15887.28594.261993005066.16.115
01-Jul-2015 00:37:19 [SETI@home Beta Test] Finished download of 12jl12ab.15887.28594.261993005066.16.115
01-Jul-2015 00:37:25 [SETI@home Beta Test] [sched_op] Starting scheduler request
01-Jul-2015 00:37:25 [SETI@home Beta Test] Sending scheduler request: To fetch work.
01-Jul-2015 00:37:25 [SETI@home Beta Test] Requesting new tasks for CPU
01-Jul-2015 00:37:25 [SETI@home Beta Test] [sched_op] CPU work request: 29321.88 seconds; 0.00 devices
01-Jul-2015 00:37:25 [SETI@home Beta Test] [sched_op] AMD/ATI GPU work request: 0.00 seconds; 0.00 devices
01-Jul-2015 00:37:26 [SETI@home Beta Test] Scheduler request completed: got 1 new tasks
01-Jul-2015 00:37:26 [SETI@home Beta Test] [sched_op] Server version 707
01-Jul-2015 00:37:26 [SETI@home Beta Test] Project requested delay of 7 seconds
01-Jul-2015 00:37:26 [SETI@home Beta Test] [sched_op] estimated total CPU task duration: 49121 seconds
01-Jul-2015 00:37:26 [SETI@home Beta Test] [sched_op] estimated total AMD/ATI GPU task duration: 0 seconds
01-Jul-2015 00:37:26 [SETI@home Beta Test] [sched_op] Deferring communication for 00:00:07
01-Jul-2015 00:37:26 [SETI@home Beta Test] [sched_op] Reason: requested by project
01-Jul-2015 00:37:28 [SETI@home Beta Test] Started download of 12jl12ab.15887.28594.261993005066.16.216
01-Jul-2015 00:37:30 [SETI@home Beta Test] Finished download of 12jl12ab.15887.28594.261993005066.16.216
01-Jul-2015 00:39:36 [SETI@home Beta Test] [sched_op] Starting scheduler request
01-Jul-2015 00:39:36 [SETI@home Beta Test] Sending scheduler request: To fetch work.
01-Jul-2015 00:39:36 [SETI@home Beta Test] Requesting new tasks for CPU
01-Jul-2015 00:39:36 [SETI@home Beta Test] [sched_op] CPU work request: 9594.96 seconds; 0.00 devices
01-Jul-2015 00:39:36 [SETI@home Beta Test] [sched_op] AMD/ATI GPU work request: 0.00 seconds; 0.00 devices
01-Jul-2015 00:39:37 [SETI@home Beta Test] Scheduler request completed: got 1 new tasks
01-Jul-2015 00:39:37 [SETI@home Beta Test] [sched_op] Server version 707
01-Jul-2015 00:39:37 [SETI@home Beta Test] Resent lost task 12jl12ab.15887.28594.261993005066.16.120_0
01-Jul-2015 00:39:37 [SETI@home Beta Test] Project requested delay of 7 seconds
01-Jul-2015 00:39:37 [SETI@home Beta Test] [sched_op] estimated total CPU task duration: 49031 seconds
01-Jul-2015 00:39:37 [SETI@home Beta Test] [sched_op] estimated total AMD/ATI GPU task duration: 0 seconds
01-Jul-2015 00:39:37 [SETI@home Beta Test] [sched_op] Deferring communication for 00:00:07
01-Jul-2015 00:39:37 [SETI@home Beta Test] [sched_op] Reason: requested by project
01-Jul-2015 00:39:39 [SETI@home Beta Test] Started download of 12jl12ab.15887.28594.261993005066.16.120
01-Jul-2015 00:39:41 [SETI@home Beta Test] Finished download of 12jl12ab.15887.28594.261993005066.16.120
01-Jul-2015 00:42:00 [SETI@home Beta Test] [sched_op] Starting scheduler request
01-Jul-2015 00:42:00 [SETI@home Beta Test] Sending scheduler request: To fetch work.
01-Jul-2015 00:42:00 [SETI@home Beta Test] Requesting new tasks for CPU
01-Jul-2015 00:42:00 [SETI@home Beta Test] [sched_op] CPU work request: 9333.99 seconds; 0.00 devices
01-Jul-2015 00:42:00 [SETI@home Beta Test] [sched_op] AMD/ATI GPU work request: 0.00 seconds; 0.00 devices
01-Jul-2015 00:42:01 [SETI@home Beta Test] Scheduler request completed: got 1 new tasks
01-Jul-2015 00:42:01 [SETI@home Beta Test] [sched_op] Server version 707
01-Jul-2015 00:42:01 [SETI@home Beta Test] Project requested delay of 7 seconds
01-Jul-2015 00:42:01 [SETI@home Beta Test] [sched_op] estimated total CPU task duration: 48935 seconds
01-Jul-2015 00:42:01 [SETI@home Beta Test] [sched_op] estimated total AMD/ATI GPU task duration: 0 seconds
01-Jul-2015 00:42:01 [SETI@home Beta Test] [sched_op] Deferring communication for 00:00:07
01-Jul-2015 00:42:01 [SETI@home Beta Test] [sched_op] Reason: requested by project
01-Jul-2015 00:42:03 [SETI@home Beta Test] Started download of 12jl12ab.15887.28594.261993005066.16.145
01-Jul-2015 00:42:22 [SETI@home Beta Test] Finished download of 12jl12ab.15887.28594.261993005066.16.145
01-Jul-2015 00:48:02 [SETI@home Beta Test] [sched_op] Starting scheduler request
01-Jul-2015 00:48:02 [SETI@home Beta Test] Sending scheduler request: To fetch work.
01-Jul-2015 00:48:02 [SETI@home Beta Test] Requesting new tasks for CPU
01-Jul-2015 00:48:02 [SETI@home Beta Test] [sched_op] CPU work request: 13843.06 seconds; 0.00 devices
01-Jul-2015 00:48:02 [SETI@home Beta Test] [sched_op] AMD/ATI GPU work request: 0.00 seconds; 0.00 devices
01-Jul-2015 00:48:04 [SETI@home Beta Test] Scheduler request completed: got 1 new tasks
01-Jul-2015 00:48:04 [SETI@home Beta Test] [sched_op] Server version 707
01-Jul-2015 00:48:04 [SETI@home Beta Test] Project requested delay of 7 seconds
01-Jul-2015 00:48:04 [SETI@home Beta Test] [sched_op] estimated total CPU task duration: 48632 seconds
01-Jul-2015 00:48:04 [SETI@home Beta Test] [sched_op] estimated total AMD/ATI GPU task duration: 0 seconds
01-Jul-2015 00:48:04 [SETI@home Beta Test] [sched_op] Deferring communication for 00:00:07
01-Jul-2015 00:48:04 [SETI@home Beta Test] [sched_op] Reason: requested by project
01-Jul-2015 00:48:06 [SETI@home Beta Test] Started download of 12jl12ab.15887.28594.261993005066.16.232
01-Jul-2015 00:48:09 [SETI@home Beta Test] Finished download of 12jl12ab.15887.28594.261993005066.16.232
ID: 1697202
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1697230 - Posted: 1 Jul 2015, 6:46:18 UTC - in response to Message 1697202.  
Last modified: 1 Jul 2015, 7:02:26 UTC

Not much you can do afaict so far. I looked to see if there was a client config rpc timeout setting, or similar. Nothing visible in the docs I saw, but I'll take a look and see if there is anything undocumented hiding in the code a bit later.

We can have all sorts of questions/theories about why you wouldn't be getting a timely response to start with, and about adding server timeouts at great effort.

I'm still considering locking - refusing subsequent requests until prior ones are complete - as possibly the simplest and most solid way to keep the weird client-state-copy logic but protect legitimate client RPC timeouts (rough sketch below). I haven't come across any gotchas in that yet, but I'm still looking.

[Edit:] The fact that the original request may have completed without you ever receiving a reply is a concern, as locking alone wouldn't help there (probably not without server-side atomic transactions and timeout rollback).

Would setting your clients to disable network activity, and scripting a periodic enable + update + disable, help? It seems a bit rubbish to me too, if that's any consolation. I've yet to fathom a pattern in the hosts that explains why some get the issue and others don't.
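
For what the 'refuse while busy' idea might look like in the abstract - a sketch only, not the actual scheduler; since real scheduler instances run as separate processes, the flag would really have to live in the database (or a file lock) rather than in process memory:

#include <mutex>
#include <set>

// One-RPC-at-a-time gate per host: a second request arriving while the
// first is still being processed gets told "busy, retry later" instead of
// running the lost-task/abandonment logic against a half-updated host record.
class HostRequestGate {
public:
    // Returns true if the caller may process this host's request now.
    bool try_acquire(int host_id) {
        std::lock_guard<std::mutex> lock(m_);
        return in_flight_.insert(host_id).second;   // false if already in flight
    }
    void release(int host_id) {
        std::lock_guard<std::mutex> lock(m_);
        in_flight_.erase(host_id);
    }
private:
    std::mutex m_;
    std::set<int> in_flight_;
};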
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1697230
William
Volunteer tester
Joined: 14 Feb 13
Posts: 2037
Credit: 17,689,662
RAC: 0
Message 1697239 - Posted: 1 Jul 2015, 7:06:22 UTC

@TBar: kindly mention your UTC offset when posting log messages. I can work out the offset for Europe and the UK, but when a country has several timezones...

@Jeff: I don't see what kind of fraud might mess up the rpc seqno. Being me, i.e. not paranoid, my base assumption is that whatever people do is down to ignorance, stupidity and bad luck, not to criminal energy. The science has to be protected from criminals [though we don't really do a good job of that ourselves; whenever money (or fame) is involved, some people will cheat]. There are other parts of the code that deal with willful cheating.
Until somebody comes up with an example of how fraud would mess up the rpc seqno that's not completely far-fetched, I will consider the matter as resulting from the timeout problem discussed earlier or from using a backup, and therefore not in need of cheatproofing.

@ivan: rm -rf *, eh? Been there, done that - not on purpose; a slight typo left a ' ' in between the filename start and the * ... Thankfully just my user dir vanished, and we were doing daily backups. But walk up to the sysadmin and tell him why you need your backup restored...

@Richard: please run the scenario where task abandonment was the right choice past me again. I didn't really become involved until yesterday.

@Jason: I agree, improving client-server comms is desirable. Getting that BOINC improvement coded and accepted, however, is a different kettle of fish. I'll take the path of least resistance first.

@TBar: Oh, it's your thread. Never mind.
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 1697239
TBar
Volunteer tester
Joined: 22 May 99
Posts: 5204
Credit: 840,779,836
RAC: 2,768
United States
Message 1697240 - Posted: 1 Jul 2015, 7:06:31 UTC - in response to Message 1697230.  
Last modified: 1 Jul 2015, 7:12:14 UTC

Not much you can do afaict so far...

Well, I thought about it for a while. Thought about how it's only happening on this one machine. Thought about any recent changes... and then took action:
01-Jul-2015 02:04:49 [---] Local time is UTC -4 hours
01-Jul-2015 02:04:49 [---] Version change (7.4.36 -> 7.2.33)

Never had it happen with 7.2.33. The Only reason I upgraded was that SETI has some BS going on requiring Macs to have at least 7.2.42 before they can download any Stock Apps or download work for stock apps. So I suppose I won't be testing any Stock Apps at Beta anytime soon, unless someone fixes whatever it is that requires Macs to have 7.2.42. I always run Anonymous platform on Main anyway, and you Don't Need 7.2.42 to run Anonymous platform on Main or Beta, which proves it's something in the Server making Macs run 7.2.42... something Deep, Dark, and Secret.

So, we'll see how 7.2.33 works.
ID: 1697240
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1697241 - Posted: 1 Jul 2015, 7:09:04 UTC - in response to Message 1697230.  

Not much you can do afaict so far. I looked to see if there was a client config rpc timeout setting, or similar. Nothing visible in the docs I saw, but I'll take a look and see if there is anything undocumented hiding in the code a bit later.

RPC timeout is controlled by

<http_transfer_timeout>seconds</http_transfer_timeout>
Abort HTTP transfers if idle for this many seconds; default 300.

Client configuration

I tested and reported that during some previous crisis, when Eric mistakenly thought that the timeout message referred to a server config, and tweaked that instead.
ID: 1697241
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1697243 - Posted: 1 Jul 2015, 7:13:23 UTC - in response to Message 1697240.  

Alright, sounds reasonable, as it's quite possible they fiddled with the RPC timeouts, either making them shorter or breaking them.

For the sake of eliminating the easy stuff that we're all very certain there is no problem with... is there any Mac equivalent to the Windows pathping or Linux mtr command?

Some example output pinging your local router that looks like this would confirm, for everyone to see, that it's BOINC's/SETI's issue:
C:\Users\Jason>pathping 192.168.0.1

Tracing route to 192.168.0.1 over a maximum of 30 hops

  0  Apollo [192.168.0.10]
  1  192.168.0.1

Computing statistics for 25 seconds...
            Source to Here   This Node/Link
Hop  RTT    Lost/Sent = Pct  Lost/Sent = Pct  Address
  0                                           Apollo [192.168.0.10]
                                0/ 100 =  0%   |
  1    0ms     0/ 100 =  0%     0/ 100 =  0%  192.168.0.1

Trace complete.

C:\Users\Jason>pathping setiathome.berkeley.edu

Tracing route to setiathome.berkeley.edu [169.229.217.150]
over a maximum of 30 hops:
  0  Apollo [192.168.0.10]
  1  192.168.0.1
  2     *        *        *
Computing statistics for 25 seconds...
            Source to Here   This Node/Link
Hop  RTT    Lost/Sent = Pct  Lost/Sent = Pct  Address
  0                                           Apollo [192.168.0.10]
                                0/ 100 =  0%   |
  1    0ms     0/ 100 =  0%     0/ 100 =  0%  192.168.0.1

Trace complete.

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1697243
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1697244 - Posted: 1 Jul 2015, 7:15:10 UTC - in response to Message 1697241.  
Last modified: 1 Jul 2015, 7:15:51 UTC

Not much you can do afaict so far. I looked to see if there was a client config rpc timeout setting, or similar. Nothing visible in the docs I saw, but I'll take a look and see if there is anything undocumented hiding in the code a bit later.

RPC timeout is controlled by

<http_transfer_timeout>seconds</http_transfer_timeout>
Abort HTTP transfers if idle for this many seconds; default 300.

Client configuration

I tested and reported that during some previous crisis, when Eric mistakenly thought that the timeout message referred to a server config, and tweaked that instead.



hmmm, that's where I looked for the word timeout, duh, must be time for coffee

I would set that to half an hour. If the responses never arrive in that time then something's broken.
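
For anyone who wants to try that: the option Richard quoted goes in cc_config.xml in the BOINC data directory, inside the <options> section - something like the following (if I have the layout right), followed by a client restart or 'Options -> Read config files':

<cc_config>
  <options>
    <http_transfer_timeout>1800</http_transfer_timeout>
  </options>
</cc_config>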
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1697244
William
Volunteer tester
Joined: 14 Feb 13
Posts: 2037
Credit: 17,689,662
RAC: 0
Message 1697247 - Posted: 1 Jul 2015, 7:20:25 UTC - in response to Message 1697244.  

hmmm, that's where I looked for the word timeout, duh, must be time for coffee

I would set that to half an hour. If the responses never arrive in that time then something's broken.

ROFLMAO

Just a side note: IIRC, while BOINC waits for a response, nothing much else happens, especially not contacts to other projects. You might want to keep that in mind if you run more than just SETI :)
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 1697247
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1697249 - Posted: 1 Jul 2015, 7:20:55 UTC - in response to Message 1697239.  

@Richard: please run the scenario where task abandonment was the right choice past me again. I didn't really become involved until yesterday.

If the client has been genuinely detached, or if for some other reason (like Ivan's finger-fumble, or installing an anonymous platform app with the wrong version/plan_class combo) all tasks once present on the host are no longer available, the question arises: what to do with the records of those tasks still present in the server's database?

There are three choices:

Wait until they time out naturally at deadline
Send them back to the host as lost tasks (requires costly server resources to identify)
Bring forward deadline, or otherwise mark them as 'never going to report', so that wingmates can take over and complete the WU.
ID: 1697249
William
Volunteer tester
Joined: 14 Feb 13
Posts: 2037
Credit: 17,689,662
RAC: 0
Message 1697253 - Posted: 1 Jul 2015, 7:29:40 UTC - in response to Message 1697249.  
Last modified: 1 Jul 2015, 7:31:26 UTC

@Richard: please run the scenario where task abandonment was the right choice past me again. I didn't really become involved until yesterday.

If the client has been genuinely detached, or if for some other reason (like Ivan's finger-fumble, or installing an anonymous platform app with the wrong version/plan_class combo) all tasks once present on the host are no longer available, the question arises: what to do with the records of those tasks still present in the server's database?

There are three choices:

Wait until they time out naturally at deadline
Send them back to the host as lost tasks (requires costly server resources to identify)
Bring forward deadline, or otherwise mark them as 'never going to report', so that wingmates can take over and complete the WU.

The safety check I am proposing (sketched below) checks whether the client reports 'other_tasks' - unless you managed to lose the files but kept the CS (client_state) entries, you then still have the files, and I propose not to abandon them. If the client doesn't have anything, it's truly lost and the server can kill it. Files present in CS but not physically on disk are covered by 'resend lost tasks'.

edit: good point - I'll recheck the 'genuine detach' logic and test it again, when I'm more awake and some coffee has kicked in.
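
In sketch form - not the actual patch; the request structure and names here are simplified stand-ins for whatever the scheduler really carries - the safety check amounts to:

#include <string>
#include <vector>

// Simplified stand-in for the incoming scheduler request.
struct SchedulerRequest {
    int rpc_seqno = 0;
    std::vector<std::string> other_results;  // tasks the client says it still has
};

enum class MismatchAction { TreatAsDetach, ResendLostTasks };

// On an rpc_seqno mismatch: only treat the host as detached/reset when the
// client reports no tasks at all; otherwise reconcile instead of wiping.
MismatchAction on_seqno_mismatch(const SchedulerRequest& req) {
    if (req.other_results.empty()) {
        return MismatchAction::TreatAsDetach;   // client has nothing: safe to abandon
    }
    return MismatchAction::ResendLostTasks;     // client still holds work: don't abandon it
}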
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 1697253
Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1697255 - Posted: 1 Jul 2015, 7:34:50 UTC - in response to Message 1697253.  
Last modified: 1 Jul 2015, 7:45:29 UTC

@Richard: please run the scenario where task abandonment was the right choice past me again. I didn't really become involved until yesterday.

If the client has been genuinely detached, or if for some other reason (like Ivan's finger-fumble, or installing an anonymous platform app with the wrong version/plan_class combo) all tasks once present on the host are no longer available, the question arises: what to do with the records of those tasks still present in the server's database?

There are three choices:

Wait until they time out naturally at deadline
Send them back to the host as lost tasks (requires costly server resources to identify)
Bring forward deadline, or otherwise mark them as 'never going to report', so that wingmates can take over and complete the WU.

The safety check I am proposing (sketched below) checks whether the client reports 'other_tasks' - unless you managed to lose the files but kept the CS (client_state) entries, you then still have the files, and I propose not to abandon them. If the client doesn't have anything, it's truly lost and the server can kill it. Files present in CS but not physically on disk are covered by 'resend lost tasks'.

I still haven't had time to study the code/pseudocode samples buried a long way back down this thread. At the moment, 'resend lost tasks' is, I believe, a global switch on the server - either it's on for all RPCs, which clobbers server performance, or it's off. Edit - and when it's off, tasks hang around for the very long deadlines we run here.

What we need - and it might well be in your code - is a one-off "resend lost tasks" check triggered as part of the seqno/detach/authentication path we're exploring.

Edit2: depending on the sequence of validation checks, there may need to be extra validation for the one-off 'resend lost tasks' that the reported 'tasks on host' validly belong to the HostID which is being recycled.
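
A sketch of that extra validation step - again illustrative only, with a made-up lookup standing in for the real database query - might be:

#include <string>
#include <unordered_map>
#include <vector>

// Made-up lookup: result name -> host id the DB currently records for it.
using ResultOwners = std::unordered_map<std::string, int>;

// One-off 'resend lost tasks' for this RPC only: re-send just those results
// that the client reports AND that the DB really assigned to the HostID
// being recycled; anything else is ignored.
std::vector<std::string> results_to_resend(
    const std::vector<std::string>& reported_by_client,
    const ResultOwners& db_owner_of,
    int host_id)
{
    std::vector<std::string> resend;
    for (const auto& name : reported_by_client) {
        auto it = db_owner_of.find(name);
        if (it != db_owner_of.end() && it->second == host_id) {
            resend.push_back(name);
        }
    }
    return resend;
}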
ID: 1697255