Suddenly BOINC Decides to Abandon 71 APs...WTH?

Message boards : Number crunching : Suddenly BOINC Decides to Abandon 71 APs...WTH?

AuthorMessage
Profile William
Volunteer tester
Joined: 14 Feb 13
Posts: 2037
Credit: 17,689,662
RAC: 0
Message 1697031 - Posted: 30 Jun 2015, 12:17:37 UTC - in response to Message 1697030.  

I could be mistaken, but I believe he wants to invalidate all of them, because he believes you did something dodgy.

restoring from a backup isn't dodgy!

Especially for a multi-week or multi-month project like CPDN (which will generate RPCseqnos via trickle reports while running). There is detailed advice on what to edit, somewhere on their boards.

And since when does the average user edit CS? Not to mention read the boards?
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 1697031 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1697032 - Posted: 30 Jun 2015, 12:19:55 UTC - in response to Message 1697024.  

I might be mistaken, but IIRC the client sends a list of all tasks (of that project) on board in the request.
So what you actually need to do is mark as abandoned only those tasks that are not present (because you really did transfer some stale folder).
The ones that are there can peacefully continue being processed.

The server will currently mark all current tasks as abandoned, unconditionally. Nothing left over.

So on the first successful RPC after that, the nack/abort reply could be sent for all tasks listed as present/running, before assessing the need for new tasks to replace them.

But better not to abandon them in the first place, of course.
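
A minimal sketch of the "only abandon what is really missing" idea, assuming the server can get at two sets of result names: what it believes is in progress on the host, and what the client listed in its request. The function name and the use of plain strings are assumptions for illustration, not the actual scheduler code.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: results the server thinks are in progress on a host,
// minus the results the client reported as present in its request.
// Only this difference is a candidate for being marked as abandoned.
std::vector<std::string> results_to_abandon(
    const std::set<std::string>& in_progress_on_server,
    const std::set<std::string>& reported_by_client)
{
    std::vector<std::string> missing;
    // std::set iterates in sorted order, which std::set_difference requires.
    std::set_difference(in_progress_on_server.begin(), in_progress_on_server.end(),
                        reported_by_client.begin(), reported_by_client.end(),
                        std::back_inserter(missing));
    // We can never abandon more than the server had on record.
    assert(missing.size() <= in_progress_on_server.size());
    return missing;
}
```

With a server list of {ap_1, ap_2, ap_3} and a client report of {ap_2, ap_3}, only ap_1 would be marked; everything the host still holds keeps running.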
ID: 1697032 · Report as offensive
Profile William
Volunteer tester
Joined: 14 Feb 13
Posts: 2037
Credit: 17,689,662
RAC: 0
Message 1697033 - Posted: 30 Jun 2015, 12:21:18 UTC

I mean the original problem is 'tasks are being marked as abandoned that are still present on the host' [and being processed there quite in vain]

Either tell the client to wipe them [bad for AP], or have the server do something more sensible than now.
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 1697033 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1697034 - Posted: 30 Jun 2015, 12:21:39 UTC - in response to Message 1697031.  
Last modified: 30 Jun 2015, 12:21:53 UTC

I could be mistaken, but I believe he wants to invalidate all of them, because he believes you did something dodgy.

restoring from a backup isn't dodgy!

Especially for a multi-week or multi-month project like CPDN (which will generate RPCseqnos via trickle reports while running). There is detailed advice on what to edit, somewhere on their boards.

And since when does the average user edit CS? Not to mention read the boards?

Some people at CPDN used to nurse their long ones (up to four months) very assiduously, even rewinding them and trying again if they crashed. It's a different culture in the different projects.
ID: 1697034 · Report as offensive
Profile William
Volunteer tester
Joined: 14 Feb 13
Posts: 2037
Credit: 17,689,662
RAC: 0
Message 1697035 - Posted: 30 Jun 2015, 12:30:46 UTC - in response to Message 1697034.  

I could be mistaken, but I believe he wants to invalidate all of them, because he believes you did something dodgy.

restoring from a backup isn't dodgy!

Especially for a multi-week or multi-month project like CPDN (which will generate RPCseqnos via trickle reports while running). There is detailed advice on what to edit, somewhere on their boards.

And since when does the average user edit CS? Not to mention read the boards?

Some people at CPDN used to nurse their long ones (up to four months) very assiduously, even rewinding them and trying again if they crashed. It's a different culture in the different projects.

Even at CPDN the users editing CS will be a minority.
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 1697035 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1697036 - Posted: 30 Jun 2015, 12:42:09 UTC - in response to Message 1697027.  

Hmmm. Yeah, to me the lock option still looks the most likely candidate, and not that hard if you can do it without recording active locks in the host table. I can't see a reason the client should have 2 or more requests in progress.

Any idea how many simultaneous RPCs would need to be active at one time? (Concurrent connections can be capped using the same shortlist anyway.)

Only one RPC at a time. Even if a client is asked to contact several projects at the same time (multi-select and update), it asks them sequentially - won't start a new one until the previous one has either replied or timed out. Uploads and downloads can be multiplexed, but not RPCs. Probably a question of managing the listening socket and buffering/parsing the reply - dumping packets to a disk file is cheap by comparison.

OK, taking a break for lunch and a walk in our surprisingly nice sunshine. Back to pick up the threads in about an hour.



The way I would do it is something along these lines:

in some_config.h:
#include <atomic>
#define DEFAULT_RPCHOSTLOCK_TIMEOUT  1800  // seconds, generous for crusty ol' servers
#define DEFAULT_RPCHOSTLOCK_CHECK_INTERVAL  10 // how often to garbage collect locks

// atomics because rpc threads could be running on different threads/CPUs etc (volatile alone doesn't make access thread-safe)
std::atomic<int> rpclockinterval(DEFAULT_RPCHOSTLOCK_CHECK_INTERVAL);
std::atomic<int> rpclocktimeout(DEFAULT_RPCHOSTLOCK_TIMEOUT);
  
at start of authentication, as soon as you have a valid userid and hostid:


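// check and add under one mutex hold, so two rpcs for the same host can't both see "free"
// (and the mutex is always released, whichever way the check goes)
acquire_some_mutex();
bool busy = myrpclockhostlist.hasentry(hostid);
if ( !busy ) myrpclockhostlist.addentry(hostid, timestamp); // no mutex inside this one, we already hold it
release_some_mutex();

if ( busy )
{
    reject_rpcWithMessage(...);
}
else
{
    do_rpc_things();
    myrpclockhostlist.removeentry(hostid); // uses mutexes inside to access the shared lock list
}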


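and in a timer thread, garbage collect the locks at rpclockinterval intervals:

some_timer_thread()
{
   acquire_some_mutex();
   // != rather than <=, so we never touch end(); only advance when nothing was removed
   for ( i = myrpclockhostlist.begin(); i != myrpclockhostlist.end(); )
   {
    if ( currenttimedate - myrpclockhostlist.gettimestamp(i) > rpclocktimeout )
            i = myrpclockhostlist.removeentry(i); // must not re-take the mutex we already hold; returns the next entry
    else
            i++;
   }
   release_some_mutex();
}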
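
For anyone who wants to try the latch idea, here is a compilable sketch of the same thing using a standard mutex and a hash map. It assumes all RPCs are served by threads inside one process, which may not match how a given scheduler is actually deployed (separate processes would need the table in shared storage instead). The class and method names are made up for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <mutex>
#include <unordered_map>

// Hypothetical per-host "occupied" latch: one entry per host with an RPC in flight.
class HostLockTable {
public:
    using Clock = std::chrono::steady_clock;

    explicit HostLockTable(std::chrono::seconds timeout) : timeout_(timeout) {
        assert(timeout.count() > 0);
    }

    // Check and insert under a single mutex hold. Returns false if the
    // host already has an RPC in progress (the caller should reject it).
    bool try_lock_host(int hostid, Clock::time_point now = Clock::now()) {
        std::lock_guard<std::mutex> guard(mutex_);
        return locks_.emplace(hostid, now).second;
    }

    // Called when the RPC has finished, successfully or not.
    void unlock_host(int hostid) {
        std::lock_guard<std::mutex> guard(mutex_);
        locks_.erase(hostid);
    }

    // Garbage-collect locks older than the timeout; returns how many were dropped.
    std::size_t expire_stale(Clock::time_point now = Clock::now()) {
        std::lock_guard<std::mutex> guard(mutex_);
        std::size_t dropped = 0;
        for (auto it = locks_.begin(); it != locks_.end(); ) {
            if (now - it->second > timeout_) {
                it = locks_.erase(it);  // erase returns the next valid iterator
                ++dropped;
            } else {
                ++it;
            }
        }
        return dropped;
    }

private:
    std::mutex mutex_;
    std::unordered_map<int, Clock::time_point> locks_;
    std::chrono::seconds timeout_;
};
```

A request handler would call try_lock_host() right after authentication, reject the RPC if it returns false, and call unlock_host() when done; a timer thread would call expire_stale() periodically so a crashed handler cannot leave a host locked forever.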

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1697036 · Report as offensive
Profile William
Volunteer tester
Joined: 14 Feb 13
Posts: 2037
Credit: 17,689,662
RAC: 0
Message 1697037 - Posted: 30 Jun 2015, 12:48:49 UTC

sched/handle_request.cpp#L387

has more consistent logic - it actually checks whether the client reported tasks before abandoning the lot.

But THAT is the code that should get called if we are looking at a detach/reattach, where no tasks should be present. That's when you really want to mark tasks as abandoned: you detached (silently, the server never gets told, tasks idle out), you reattach, and the DB is cleaned up.
So he does a security check there, but not when the rpc_seqno goes out of sync?!

So what if

if ((g_request->allow_multiple_clients != 1)
    && (g_request->other_results.size() == 0)
) {
    mark_results_over(host);
}

i.e. the check for tasks on the host is put into line 426 that currently has no such security check?
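
To make the proposed guard concrete, here is a small standalone sketch. The Request struct is a simplified stand-in and NOT the real scheduler type; only the two field names are borrowed from the snippet above, and the helper name is made up.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-in for the scheduler request, not the real BOINC type.
struct Request {
    int allow_multiple_clients = 0;
    std::vector<std::string> other_results;  // tasks the client says it has on board
};

// Hypothetical guard: wiping everything is only reasonable when multiple
// clients are not allowed AND the client reported no tasks at all.
bool should_mark_all_results_over(const Request& req) {
    return req.allow_multiple_clients != 1 && req.other_results.empty();
}

// Usage example: a host restored from backup still reports its tasks,
// so the guard refuses to abandon the lot.
void example() {
    Request restored_from_backup;
    restored_from_backup.other_results = {"ap_task_a", "ap_task_b"};
    assert(!should_mark_all_results_over(restored_from_backup));

    Request freshly_reattached;  // nothing on board
    assert(should_mark_all_results_over(freshly_reattached));
}
```

The point of the design is that the expensive, irreversible action (abandoning every task) needs positive evidence that the host is empty, rather than being the default when the sequence numbers disagree.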
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 1697037 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1697038 - Posted: 30 Jun 2015, 12:54:49 UTC - in response to Message 1697037.  
Last modified: 30 Jun 2015, 12:55:06 UTC

We considered adding the cpid search into the other case where it uses the IPs etc, but doing so would destroy the logic that prevents copying client state (because the cpid would match).
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1697038 · Report as offensive
Profile William
Volunteer tester
Joined: 14 Feb 13
Posts: 2037
Credit: 17,689,662
RAC: 0
Message 1697039 - Posted: 30 Jun 2015, 12:58:28 UTC - in response to Message 1697036.  


The way I would do it is something along these lines:

looks complicated ;)

Yes, it deals with the problem of server and client talk going out of sync.

I'd prefer my simpler approach of just making sure the server does proper checks ;)

You can still smooth out server-client connections - that might help with some of the other hiccups we keep getting with flaky comms.
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 1697039 · Report as offensive
Profile William
Volunteer tester
Joined: 14 Feb 13
Posts: 2037
Credit: 17,689,662
RAC: 0
Message 1697040 - Posted: 30 Jun 2015, 13:00:29 UTC - in response to Message 1697038.  
Last modified: 30 Jun 2015, 13:06:31 UTC

We considered adding the cpid search into the other case where it uses the IPs etc, but doing so would destroy the logic that prevents copying client state (because the cpid would match).

[in German] Well, I still find it an outrageous impertinence to insinuate that someone is cheating, when the most likely explanation is that they restored a sh**** backup.

edit: yes yes, I'm translating that outburst

edit2: I still think it's exceedingly impertinent to insinuate that you were doing something dodgy, when the most probable cause is having reverted to a backup for some reason.

edit3: and I certainly have no trouble telling him THAT
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 1697040 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1697041 - Posted: 30 Jun 2015, 13:04:53 UTC - in response to Message 1697039.  

looks complicated ;)


Not as complicated as it looks. It's one of those spinny latches labelled 'occupied', like on a public toilet door. I presume they have public toilets in Germany.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1697041 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1697042 - Posted: 30 Jun 2015, 13:06:25 UTC - in response to Message 1697040.  
Last modified: 30 Jun 2015, 13:08:27 UTC

edit2: I still think it's exceedingly impertinent to insinuate that you were doing something dodgy, when the most probable cause is having reverted to a backup for some reason.


No, the whole thing is built around the idea that hosts and users are unreliable. I can live with that. What I can't live with is the insinuation that the server is right/complete in every common case it should handle reasonably [when the code clearly says it cannot].
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1697042 · Report as offensive
Profile William
Volunteer tester
Joined: 14 Feb 13
Posts: 2037
Credit: 17,689,662
RAC: 0
Message 1697043 - Posted: 30 Jun 2015, 13:10:11 UTC - in response to Message 1697042.  

edit2: I still think it's exceedingly impertinent to insinuate that you were doing something dodgy, when the most probable cause is having reverted to a backup for some reason.


No, the whole thing is built around the idea that hosts and users are unreliable. I can live with that. What I can't live with is the insinuation that the server is right/complete in every common case it should handle reasonably.


exactly - so just check the host really hasn't anything running before we ditch the lot.
If you want to be more sophisticated, clean out what's really not there.
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 1697043 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1697044 - Posted: 30 Jun 2015, 13:18:32 UTC - in response to Message 1697043.  
Last modified: 30 Jun 2015, 13:20:08 UTC

edit2: I still think it's exceedingly impertinent to insinuate that you were doing something dodgy, when the most probable cause is having reverted to a backup for some reason.


No, the whole thing is built around the idea that hosts and users are unreliable. I can live with that. What I can't live with is the insinuation that the server is right/complete in every common case it should handle reasonably.


exactly - so just check the host really hasn't anything running before we ditch the lot.
If you want to be more sophisticated, clean out what's really not there.


Well, except for the legitimate no move/copy case. The entrance to the public toilet cloned you: you walked into a public toilet with no door and your clone followed you in a couple of minutes later, or you walked in on the clone. Which one should security shoot?

[Hint: always use public toilets with latching doors ]
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1697044 · Report as offensive
Profile William
Volunteer tester
Joined: 14 Feb 13
Posts: 2037
Credit: 17,689,662
RAC: 0
Message 1697046 - Posted: 30 Jun 2015, 13:23:16 UTC - in response to Message 1697044.  

Well, except for the legitimate no move/copy case. The entrance to the public toilet cloned you: you walked into a public toilet with no door and your clone followed you in a couple of minutes later, or you walked in on the clone. Which one should security shoot?

[Hint: always use public toilets with latching doors ]

Doesn't help if you are using the urinal, pardon me, row of buckets.

I'd shoot the door and merge the clones...
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 1697046 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1697047 - Posted: 30 Jun 2015, 13:27:11 UTC - in response to Message 1697046.  
Last modified: 30 Jun 2015, 13:35:22 UTC

I'd shoot the door and merge the clones...


Well, the BOINC technique appears to be no doors, make another clone and shoot the first two occupants, leaving the mess behind.

[Edit:] Zombie hosts! lol
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1697047 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1697048 - Posted: 30 Jun 2015, 13:35:01 UTC - in response to Message 1697035.  
Last modified: 30 Jun 2015, 13:39:29 UTC

I could be mistaken, but I believe he wants to invalidate all of them, because he believes you did something dodgy.

restoring from a backup isn't dodgy!

Especially for a multi-week or multi-month project like CPDN (which will generate RPCseqnos via trickle reports while running). There is detailed advice on what to edit, somewhere on their boards.

And since when does the average user edit CS? Not to mention read the boards?

Some people at CPDN used to nurse their long ones (up to four months) very assiduously, even rewinding them and trying again if they crashed. It's a different culture in the different projects.

Even at CPDN the users editing CS will be a minority.

Thinking while I walked - it would be interesting to see how older server code (like Einstein - old - and CPDN - even older) handles the 'faked seqno' test. I can do CPDN (probably got most experience with that project, out of this little group) - anyone willing for Einstein, or should I do that myself, as well?

But I still need to eat the fruits of my walk - lunch!

Edit - talking of CPDN, they've launched a new set of tasks today:

http://www.climateprediction.net/new-experiment-launched-weatherhome-2015-western-us-drought/
ID: 1697048 · Report as offensive
Profile William
Volunteer tester
Joined: 14 Feb 13
Posts: 2037
Credit: 17,689,662
RAC: 0
Message 1697049 - Posted: 30 Jun 2015, 13:38:48 UTC - in response to Message 1697048.  


Thinking while I walked - it would be interesting to see how older server code (like Einstein - old - and CPDN - even older) handles the 'faked seqno' test. I can do CPDN (probably got most experience with that project, out of this little group) - anyone willing for Einstein, or should I do that myself, as well?


I'm not running either right now and any experiments would have to wait for the next window of opportunity.
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 1697049 · Report as offensive
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1697050 - Posted: 30 Jun 2015, 13:39:18 UTC - in response to Message 1697048.  

Thinking while I walked - it would be interesting to see how older server code (like Einstein - old - and CPDN - even older) handles the 'faked seqno' test. I can do CPDN (probably got most experience with that project, out of this little group) - anyone willing for Einstein, or should I do that myself, as well?

But I still need to eat the fruits of my walk - lunch!


You're it :). Could be worth bouncing my pseudocode off Oliver/Bernd if they are still prodding at the issue. Feel free. It's not complete code but they'd understand it, and the concept of resource locks, I believe, well and truly enough to adapt to their needs.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1697050 · Report as offensive
Profile William
Volunteer tester
Joined: 14 Feb 13
Posts: 2037
Credit: 17,689,662
RAC: 0
Message 1697068 - Posted: 30 Jun 2015, 15:07:00 UTC

Since we know that one easy way to generate a new hostid (and thereby wipe silly APR entries) was to trigger the low rpc seqno code, we know that area of code is fairly new.
I expect CPDN and Einstein to hand out a fresh hostid. Marking the old tasks on the old hostid as abandoned then actually makes sense, since you are not accessing that DB entry any more. But it still leaves the problem that you have stale tasks on the host.
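
For reference, the kind of test being discussed can be sketched as a plain comparison of sequence numbers. This is an assumption about the shape of the check based on this thread, not a copy of the server code; the enum and function names are made up.

```cpp
#include <cassert>

enum class SeqnoVerdict { InSync, LowerThanServer };

// Hypothetical check: compare the sequence number the client sent with the
// last one the server recorded for that host. A restored backup carries an
// older (lower) number, which is what trips the "low rpc seqno" path.
SeqnoVerdict check_rpc_seqno(int last_seen_by_server, int sent_by_client) {
    assert(last_seen_by_server >= 0 && sent_by_client >= 0);
    return sent_by_client < last_seen_by_server ? SeqnoVerdict::LowerThanServer
                                                : SeqnoVerdict::InSync;
}
```

Note that the comparison alone cannot tell a copied client state from a restored backup: both simply show up as a number lower than the one the server remembers.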

New, better server code doesn't reach conservative projects.
New, better client code doesn't reach conservative users.

As Richard suggested I think it's best to check out other projects and then try several independent improvements.

Small, easy to understand, easy to do things have the best chance of getting done ;) [at least if you're not doing it yourself and going through the whole 'git-pull' diplomacy nightmare]
A person who won't read has no advantage over one who can't read. (Mark Twain)
ID: 1697068 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.