Abandoned tasks - Ongoing issue

Horacio

Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1347979 - Posted: 18 Mar 2013, 4:53:58 UTC

After a week working with the scripts, I've found that the root cause of these abandoned tasks is a network issue... For the last 4 days I've been using only one of my ISPs, and it was working fine until yesterday...
Then it started to lose the link and at times it was very, very slow, and in the middle of these issues one of the hosts got a bunch of abandoned tasks after almost 4 days with no issues...

It would also explain why I get this so often while others don't... I doubt anybody else has such crappy ISPs... Sadly, there are no other ISPs in my area besides the ones I'm using...

But the question remains (and probably will forever)... Why does a network error confuse the servers so badly that they think the tasks have to be marked as abandoned?
(BTW, it doesn't happen with Einstein: if the network is not working well, it just fails the RPCs or the transfers, but I've never seen an abandoned task there...)
ID: 1347979
trader
Volunteer tester
Joined: 25 Jun 00
Posts: 126
Credit: 4,968,173
RAC: 0
United States
Message 1347983 - Posted: 18 Mar 2013, 5:40:35 UTC - in response to Message 1341666.  

The "last request too recent" messages are certainly suggestive of some other computer attempting to contact the scheduler with the same HostID number. For those of you afflicted with this problem - it doesn't seem to affect all of us - it just might be helpful to keep an eye on the IP addresses shown on the host details page for that host (available to logged-in users only): if there is a 'ghost host' contacting the scheduler, that should change, and the interloper's IP address might help the staff to track it down.



I think I can answer this one (for once I'm answering and not asking).

I just built this new rig, so now I have 2 crunchers. When looking at the log file I saw this message many, many times, so I went searching for the answer, and this is what I found. All my crunchers access the internet through 1 router and, by default, the same IP. So cruncher A asks for work and gets work... 10 seconds later cruncher B asks for work and gets the message. It waits a minute, asks again, and gets work. I noticed in my log files and in my router activity that almost every time I got this message, my other cruncher had just contacted SETI.

Now, I could be wrong, but I think this is the reason.
I RTFM and it was WYSIWYG then i found out it was a PEBKAC error
ID: 1347983
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1348068 - Posted: 18 Mar 2013, 13:49:52 UTC - in response to Message 1347979.  

After a week working with the scripts, I've found that the root cause of these abandoned tasks is a network issue... For the last 4 days I've been using only one of my ISPs, and it was working fine until yesterday...
Then it started to lose the link and at times it was very, very slow, and in the middle of these issues one of the hosts got a bunch of abandoned tasks after almost 4 days with no issues...

It would also explain why I get this so often while others don't... I doubt anybody else has such crappy ISPs... Sadly, there are no other ISPs in my area besides the ones I'm using...

But the question remains (and probably will forever)... Why does a network error confuse the servers so badly that they think the tasks have to be marked as abandoned?
(BTW, it doesn't happen with Einstein: if the network is not working well, it just fails the RPCs or the transfers, but I've never seen an abandoned task there...)

SETI@Home is a test bed for the latest, & not always greatest, BOINC server code. So we can have issues here that no other projects will ever see.

SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
ID: 1348068
Bernie Vine
Volunteer moderator
Volunteer tester
Joined: 26 May 99
Posts: 9954
Credit: 103,452,613
RAC: 328
United Kingdom
Message 1348071 - Posted: 18 Mar 2013, 13:59:52 UTC

Perhaps it is due to a combination of network problems at the client end and totally maxed network at the servers. I don't believe any other project has the same network congestion as SETI@Home, so it is difficult to compare.


ID: 1348071
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1348077 - Posted: 18 Mar 2013, 14:10:26 UTC

How many people are seeing this problem, or are seeing it currently / recently / repeatedly?

I'll go through the thread - Horacio comes to mind, of course - and I've had another notification by PM - one which doesn't seem to match the 'bad network' theory, strangely.

If I can find an example of a regular repeater - the private one might do it - I think the time has come to ask David or Eric to go through the server logs for a host/time, and see if they can spot anything which matches our observations from out here.
ID: 1348077
KWSN Ekky Ekky Ekky
Joined: 25 May 99
Posts: 944
Credit: 52,956,491
RAC: 67
United Kingdom
Message 1348082 - Posted: 18 Mar 2013, 14:18:03 UTC
Last modified: 18 Mar 2013, 14:24:15 UTC

"Hair today, Goon tomorrow" (Popeye).
I'm abandoning nothing, but piles of work are being abandoned for me.
What's more, these all seem to be work that was never actually sent to me. Most odd.
Total is exactly 200 tasks, ranging from "sent" on 15th March to 17th and all "abandoned" on 18th.

ID: 1348082
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1348084 - Posted: 18 Mar 2013, 14:23:53 UTC - in response to Message 1348082.  

"Hair today, Goon tomorrow" (Popeye).
I'm abandoning nothing, but piles of work are being abandoned for me.
What's more, these all seem to be work that was never actually sent to me. Most odd.

Looks like it's only happening on one of your two computers - Error tasks for computer 6860030.

Are both the same machines on the same internet connection, or are they in different places?
ID: 1348084
KWSN Ekky Ekky Ekky
Joined: 25 May 99
Posts: 944
Credit: 52,956,491
RAC: 67
United Kingdom
Message 1348085 - Posted: 18 Mar 2013, 14:25:15 UTC - in response to Message 1348084.  

Looks like it's only happening on one of your two computers - Error tasks for computer 6860030.

Are both the same machines on the same internet connection, or are they in different places?


Both computers here on the same wifi network.

ID: 1348085
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1348105 - Posted: 18 Mar 2013, 15:23:10 UTC

Well, I've looked through the whole thread, and I'm stumped. There seems to be no pattern to it at all.

We've got people with multiple hosts on the same network. Sometimes only one is affected, sometimes more.

We've got people with one event, never repeated: others for whom once it starts, it happens again and again and again.

We've got hosts with ATI GPUs, NV GPUs, and no GPU at all.

We've got people running stock apps, and people running optimised.

We've got people running BOINC 6.10, 6.12, and 7.0

No rhyme or reason. Time to call in the cavalry, I think.
ID: 1348105
KWSN Ekky Ekky Ekky
Joined: 25 May 99
Posts: 944
Credit: 52,956,491
RAC: 67
United Kingdom
Message 1348107 - Posted: 18 Mar 2013, 15:24:49 UTC
Last modified: 18 Mar 2013, 15:26:37 UTC

This gets stranger and stranger.
One task was reported at 1:34:30 UTC today and successfully validated.
Ever since then, every task reported has been marked as abandoned.
What goes on ?

PS
What's more, every single "abandoned" task is marked at 1:34:59 UTC, whenever it was reported. Weird!

ID: 1348107
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1348123 - Posted: 18 Mar 2013, 16:18:39 UTC - in response to Message 1348107.  

This gets stranger and stranger.
One task was reported at 1:34:30 UTC today and successfully validated.
Ever since then, every task reported has been marked as abandoned.
What goes on ?

PS
What's more, every single "abandoned" task is marked at 1:34:59 UTC, whenever it was reported. Weird!

That is interesting.

I'm not surprised that there is some sort of 'abandonment' event which affects a whole lot of tasks at once, but it may very well be significant that it can happen 29 seconds after a normal report, like task 2876483636
ID: 1348123
KWSN Ekky Ekky Ekky
Joined: 25 May 99
Posts: 944
Credit: 52,956,491
RAC: 67
United Kingdom
Message 1348127 - Posted: 18 Mar 2013, 16:31:15 UTC
Last modified: 18 Mar 2013, 16:37:16 UTC

I thought I ought to check to make sure - two tasks that were about to report had already been marked "abandoned".
Effectively, all bar one task today has been done in vain and I have so far "lost" a huge amount of credit.
Will I get it back???

[edit]More recently downloaded work has not been marked abandoned, incidentally.

ID: 1348127
Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1348135 - Posted: 18 Mar 2013, 16:54:45 UTC - in response to Message 1348105.  
Last modified: 18 Mar 2013, 17:43:44 UTC

Richard, just tell me if there is something I can do (in the sense of logs, data mining or whatever) that may help to build a better picture of what's going on...

I still think that some network error (not necessarily or exclusively on the SETI pipes or campus) is what triggers the abandoned tasks...

Why? Well, after a full week in which all my SETI hosts were having this issue at least once a day (at different hours), it stopped completely for 5 days...
What's different between the first week and the next?
Only one thing: in the first week my main ISP was not working well, so I needed to enable the other ISPs through the load balancer router (which in my case, more than balancing, does something closer to "choose the functional ISP").
Then, as I noticed that this ISP was working well, I disabled the other ISPs, and while the connection was working well there were no abandoned tasks...
Last weekend the main ISP was failing again, and without enabling the other ISPs I again got abandoned tasks on one of the hosts... [EDIT: And today on the other 2 hosts also]

About the message "last request too recent": I've found that sometimes the time at which the RPC starts and the time at which the servers register it differ by several minutes (not due to clock differences; my hosts are within about 15 secs of the servers' time)... When this happens there is a high probability that the client times out... and sometimes that difference in times (server-client) is long enough to make the 5 mins that the client waited look too short from the server side...
I'm not sure if this last issue has any relation to the abandoned tasks, because there were also abandoned tasks at a time between two successful RPCs...
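
The timing argument above can be illustrated with a small sketch. It assumes a hypothetical server that rejects a request registered less than MIN_INTERVAL seconds after the previous one, and a client that waits 300 seconds (the 5 minutes mentioned above) counted from when it sent the first request. The function names and delay values are made up for illustration and are not taken from the BOINC code.

```python
MIN_INTERVAL = 300  # assumed server-side minimum gap between requests, in seconds
CLIENT_WAIT = 300   # the 5 minutes the client waits before retrying


def server_sees_gap(first_delay, second_delay, client_wait=CLIENT_WAIT):
    """Return the gap (seconds) between two requests as registered by the server.

    first_delay / second_delay: hypothetical network or scheduler delay
    between the client sending each request and the server registering it.
    """
    first_sent = 0
    second_sent = first_sent + client_wait
    first_registered = first_sent + first_delay
    second_registered = second_sent + second_delay
    return second_registered - first_registered


def too_recent(first_delay, second_delay):
    """True if the server would consider the second request too recent."""
    return server_sees_gap(first_delay, second_delay) < MIN_INTERVAL


# First request held up for 4 minutes, retry registered after 2 seconds:
# the client waited 300 s, but the server sees only 62 s between requests.
print(server_sees_gap(240, 2), too_recent(240, 2))
# Both requests delivered promptly: the server sees the full 300 s.
print(server_sees_gap(1, 1), too_recent(1, 1))
```

The point of the sketch is only that a long delay on the first request shrinks the gap the server measures, even though the client waited the full time.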
ID: 1348135
Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1348138 - Posted: 18 Mar 2013, 17:02:01 UTC - in response to Message 1348127.  

I thought I ought to check to make sure - two tasks that were about to report had already been marked "abandoned".
Effectively, all bar one task today has been done in vain and I have so far "lost" a huge amount of credit.
Will I get it back???

[edit]More recently downloaded work has not been marked abandoned, incidentally.

No, once the tasks are marked abandoned, all the work done on them is wasted: the results are discarded, which means that someone else has to do it again and you won't get the credits.

If you have any doubt about the tasks you have in hand, the easiest way to get rid of the abandoned ones is to do a reset, but if your cache is small it may be better to check them one by one against the server and abort only the ones marked as abandoned...
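
The one-by-one check amounts to comparing two lists. A minimal sketch, assuming you have already collected the task names held locally and the names the website shows as abandoned (the names below are made-up examples, and how you gather the lists is up to you):

```python
def tasks_to_abort(local_tasks, abandoned_on_server):
    """Return local task names that the server has already marked abandoned."""
    abandoned = set(abandoned_on_server)
    # Keep the local order so the result is easy to compare with the manager's list.
    return [name for name in local_tasks if name in abandoned]


def tasks_to_keep(local_tasks, abandoned_on_server):
    """Return local task names that are still valid and worth crunching."""
    abandoned = set(abandoned_on_server)
    return [name for name in local_tasks if name not in abandoned]


# Hypothetical task names, for illustration only.
local = ["task_a_0", "task_b_1", "task_c_0"]
abandoned = ["task_b_1", "task_z_9"]

print(tasks_to_abort(local, abandoned))  # ['task_b_1']
print(tasks_to_keep(local, abandoned))   # ['task_a_0', 'task_c_0']
```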
ID: 1348138
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1348159 - Posted: 18 Mar 2013, 18:05:17 UTC - in response to Message 1348077.  
Last modified: 18 Mar 2013, 18:27:27 UTC

How many people are seeing this problem, or are seeing it currently / recently / repeatedly?

I've had it occur once or twice, probably several months ago now.
I can't remember if it occurred on only one or both of my machines.

EDIT- I think it was around the time of the Scheduler timeout issues.
Grant
Darwin NT
ID: 1348159
KWSN Ekky Ekky Ekky
Joined: 25 May 99
Posts: 944
Credit: 52,956,491
RAC: 67
United Kingdom
Message 1348229 - Posted: 18 Mar 2013, 20:03:27 UTC

Just aborted over 100 tasks.
Miserable about that but at least I have got rid of all the "abandoned" ones :-(


ID: 1348229
Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1348237 - Posted: 18 Mar 2013, 20:20:51 UTC - in response to Message 1348229.  

In fact you only aborted 15 tasks; the others were already abandoned... so it's not a big issue...
To avoid aborting valid pending tasks, a reset is the easy way... the ones that are not abandoned will be resent to you, and only those...
ID: 1348237
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1348251 - Posted: 18 Mar 2013, 20:54:50 UTC

Back in July 2009 I made an attempt to get a sanity check before abandoning tasks (then shown as "Client detached"). The boinc_dev thread starts at http://lists.ssl.berkeley.edu/pipermail/boinc_dev/2009-July/014662.html.

Unfortunately the check I proposed isn't workable: the "other_results" list has entries for tasks from other projects as well as S@H. There are other parts of the work request which could be checked, but I don't know what would be quick and reliable.
                                                                   Joe
ID: 1348251
Horacio
Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1348264 - Posted: 18 Mar 2013, 21:28:08 UTC - in response to Message 1348251.  

Interesting:

On 27 Jul 2009 at 9:54, Jonathan wrote:

> Hi!
> Just read this.. and have a few questions/ideas:
>
> Is this happening throughout every kind of computer?
> I'm asking, because here with SIMAP, we see one such a happening every
> now and then,
> with a Mac connecting to the scheduler and then getting a reattachment.
>
> We have traced it to a *very* long running scheduler request that seems
> to be on hold for several hours,
> during that time, the Mac makes some more scheduler requests, increasing
> the request_sequence_id;
> Then suddenly, (why, we don't know or understand - or even can guess)
> the Mac picks up the long standing scheduler request,
> which suddenly returns and complains about not-in-order
> request_sequence_ids, effectively detaching the host.

>
> Might this be the case on Seti too?
>
> However, I don't see a chance for the seti-guys to track this down,
> because it took me - with our considerably smaller
> database/hostcount/workload -
> about three month to track it down. Though this lengthy it was only
> 'cause of our cyclic work-distribution scheme.
>
> Best
> -Jonathan
> from the BoincSIMAP team

That's more or less the same as what I guessed in an earlier post... maybe some RPC delayed on the networks (or in the scheduler) for a really long time?
(and by "long" I mean really long on a human scale, not on a "computing" scale...)
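
The sequence Jonathan describes can be sketched as a toy model. It assumes a server that remembers the highest request_sequence_id it has seen from a host and treats any lower number as a sign that the host was detached. The class and method names are hypothetical, and this is not the actual scheduler logic.

```python
class ToyScheduler:
    """Toy model: a request with an out-of-order sequence id detaches the host."""

    def __init__(self):
        self.last_seq = 0
        self.in_progress = set()
        self.abandoned = set()

    def assign(self, task):
        self.in_progress.add(task)

    def handle_request(self, seq):
        """Process a request; return 'ok' or 'detached'."""
        if seq < self.last_seq:
            # Out-of-order id: assume the client was detached and
            # mark everything it had in progress as abandoned.
            self.abandoned |= self.in_progress
            self.in_progress.clear()
            return "detached"
        self.last_seq = seq
        return "ok"


sched = ToyScheduler()
sched.assign("task_1")
sched.assign("task_2")

# Request 5 is sent but gets stuck somewhere for hours.
stuck_request = 5
# Meanwhile the client keeps making requests with higher ids.
print(sched.handle_request(6))  # ok
print(sched.handle_request(7))  # ok
# The stuck request finally arrives, now out of order.
print(sched.handle_request(stuck_request))  # detached
print(sorted(sched.abandoned))  # ['task_1', 'task_2']
```

In this model a single late request is enough to abandon every task in progress at once, which would fit tasks all being marked at the same moment.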
ID: 1348264
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1348276 - Posted: 18 Mar 2013, 21:59:40 UTC - in response to Message 1348251.  

Back in July 2009 I made an attempt to get a sanity check before abandoning tasks (then shown as "Client detached"). The boinc_dev thread starts at http://lists.ssl.berkeley.edu/pipermail/boinc_dev/2009-July/014662.html.

Unfortunately the check I proposed isn't workable: the "other_results" list has entries for tasks from other projects as well as S@H. There are other parts of the work request which could be checked, but I don't know what would be quick and reliable.
                                                                   Joe

I've been doing some code-walking too, and found that there are only two places in the code where the routine which sets RESULT_OUTCOME_CLIENT_DETACHED is called. They're both in handle_request.cpp, and the more likely one says:
// If host CPID is present,
// scan backwards through this user's hosts,
// looking for one with the same host CPID.
// If we find one, it means the user detached and reattached.
// Use the existing host record,
// and mark in-progress results as over.

So, I'm pointing the finger at host CPID, too, and it worries me.
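
As a rough illustration of what that comment describes, here is a sketch with made-up record types: walk the user's hosts from newest to oldest, and if one has the same host CPID as the incoming request, reuse it and mark its in-progress results as over. The field names, the `find_host_by_cpid` helper and the string outcomes are assumptions for illustration; only the overall steps come from the quoted comment.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    name: str
    state: str = "in_progress"  # or "over"
    outcome: str = ""


@dataclass
class Host:
    host_id: int
    cpid: str
    results: list = field(default_factory=list)


def find_host_by_cpid(hosts, request_cpid):
    """Scan backwards through the user's hosts for a matching host CPID.

    On a match, treat it as a detach/reattach: reuse the record and
    mark its in-progress results as over. Returns the host or None.
    """
    if not request_cpid:
        return None
    for host in reversed(hosts):
        if host.cpid == request_cpid:
            for r in host.results:
                if r.state == "in_progress":
                    r.state = "over"
                    r.outcome = "client_detached"
            return host
    return None


# Hypothetical hosts and results, for illustration only.
hosts = [
    Host(1, "aaa", [Result("wu_1")]),
    Host(2, "bbb", [Result("wu_2"), Result("wu_3", state="over", outcome="success")]),
]
match = find_host_by_cpid(hosts, "bbb")
print(match.host_id)
print([(r.name, r.outcome) for r in match.results])
```

If the CPID comparison in a routine like this ever matched when it should not, perfectly good work would be marked as abandoned in one go.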

If you look at the host list for any of the people who have reported the problem in this thread, you'll see that below each HostID number, there's a link through to BOINCstats for that host. That link attempts to hook-up by CPID, and every one I've tried, BOINCstats has ended up in a dead end: whichever way I try it, I haven't been able to get back to a sane-looking stats page for the right user. On the other hand, the older hosts on my own account (up to Q6600 3751792) have clean links to BOINCstats - it starts going wrong with Q9300 4292666.

All of which could simply be a stats problem, but given all the weirdness in this thread, I'm beginning to wonder if we might have database corruption - I think that's more likely than the same hosts getting a bad network connection, again and again, when other hosts on the same network are OK. That would really be a big bugger, both to diagnose and to fix.

What we need at this point is for the problem to strike a user who has real cold, hard, forensic, code-walking skills. I think that such a user, armed with their own logged-in account page (which gives access to the user account key), and the sched_request file from the host giving problems, could walk through authenticate_user() (line 242 of http://boinc.berkeley.edu/trac/browser/boinc/sched/handle_request.cpp), and find out whether they end up properly authenticating host and user IDs in the database. It ain't going to be easy - any takers?
ID: 1348276