Abandoned tasks - Ongoing issue
Author | Message |
---|---|
Horacio Send message Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0 |
After a week working with the scripts, I've found that the root cause of these abandoned tasks is a network issue... For the last 4 days I've been using only one of my ISPs and it was working fine until yesterday... Then it started to lose the link and at times it was very, very slow, and in the middle of these issues one of the hosts got a bunch of abandoned tasks after almost 4 days with no issues... It would also explain why I get this so often while others don't... I doubt anybody else has such crappy ISPs... Sadly, there are no other ISPs in my area besides the ones I'm using... But the question remains (and probably will forever)... Why does a network error confuse the servers so badly that they think the tasks have to be marked as abandoned? (BTW, it doesn't happen with Einstein: if the network is not working well, it just fails the RPCs or the transfers, but I've never seen an abandoned task there...) |
trader Send message Joined: 25 Jun 00 Posts: 126 Credit: 4,968,173 RAC: 0 |
The "last request too recent" messages are certainly suggestive of some other computer attempting to contact the scheduler with the same HostID number. For those of you afflicted with this problem - it doesn't seem to affect all of us - it just might be helpful to keep an eye on the IP addresses shown on the host details page for that host (available to logged-in users only): if there is a 'ghost host' contacting the scheduler, that should change, and the interloper's IP address might help the staff to track it down. I think I can answer this one (for once I'm answering and not asking). I just built this new rig, so now I have 2 crunchers. When looking at the log file I saw this many, many times. I went searching for the answer and this is what I found: all my crunchers access the internet through 1 router and, by default, the same IP. So cruncher A asks for work and gets work... 10 seconds later cruncher B asks for work and gets the message. It waits a minute, asks again and gets work. I noticed in my log files and in my router activity that almost every time I got this message, my other cruncher had just contacted SETI. Now, I could be wrong, but I think this is the reason. I RTFM and it was WYSIWYG then i found out it was a PEBKAC error |
HAL9000 Send message Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
After a week working with the scripts, I've found that the root cause of these abandoned tasks is a network issue... For the last 4 days I've been using only one of my ISPs and it was working fine until yesterday... SETI@Home is a test bed for the latest, & not always greatest, BOINC server code. So we can have issues here that no other projects will ever see. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the BP6/VP6 User Group (http://tinyurl.com/8y46zvu) |
Bernie Vine Send message Joined: 26 May 99 Posts: 9954 Credit: 103,452,613 RAC: 328 |
Perhaps it is due to a combination of network problems at the client end and totally maxed network at the servers. I don't believe any other project has the same network congestion as SETI@Home, so it is difficult to compare. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
How many people are seeing this problem, or are seeing it currently / recently / repeatedly? I'll go through the thread - Horacio comes to mind, of course - and I've had another notification by PM - one which doesn't seem to match the 'bad network' theory, strangely. If I can find an example of a regular repeater - the private one might do it - I think the time has come to ask David or Eric to go through the server logs for a host/time, and see if they can spot anything which matches our observations from out here. |
KWSN Ekky Ekky Ekky Send message Joined: 25 May 99 Posts: 944 Credit: 52,956,491 RAC: 67 |
"Hair today, Goon tomorrow" (Popeye). I'm abandoning nothing but piles of work is being abandoned for me. What's more, these all seem to be work that was never actually sent to me. Most odd. Total is exactly 200 tasks, ranging from "sent" on 15th March to 17th and all "abandoned" on 18th. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
"Hair today, Goon tomorrow" (Popeye). Looks like it's only happening on one of your two computers - Error tasks for computer 6860030. Are both the same machines on the same internet connection, or are they in different places? |
KWSN Ekky Ekky Ekky Send message Joined: 25 May 99 Posts: 944 Credit: 52,956,491 RAC: 67 |
Looks like it's only happening on one of your two computers - Error tasks for computer 6860030. Both computers here on the same wifi network. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Well, I've looked through the whole thread, and I'm stumped. There seems to be no pattern to it at all. We've got people with multiple hosts on the same network. Sometimes only one is affected, sometimes more. We've got people with one event, never repeated; others for whom, once it starts, it happens again and again and again. We've got hosts with ATI GPUs, NV GPUs, and no GPU at all. We've got people running stock apps, and people running optimised. We've got people running BOINC 6.10, 6.12, and 7.0. No rhyme or reason. Time to call in the cavalry, I think. |
KWSN Ekky Ekky Ekky Send message Joined: 25 May 99 Posts: 944 Credit: 52,956,491 RAC: 67 |
This gets stranger and stranger. One task was reported at 1:34:30 UTC today and successfully validated. Ever since then, every task reported has been marked as abandoned. What goes on? PS What's more, every single "abandoned" task is marked at 1:34:59 UTC, whenever it was reported. Weird! |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
This gets stranger and stranger. That is interesting. I'm not surprised that there is some sort of 'abandonment' event which affects a whole lot of tasks at once, but it may very well be significant that it can happen 29 seconds after a normal report, like task 2876483636 |
KWSN Ekky Ekky Ekky Send message Joined: 25 May 99 Posts: 944 Credit: 52,956,491 RAC: 67 |
Horacio Send message Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0 |
Richard, just tell me if there is something I can do (logs, data mining or whatever) that may help to build a better picture of what's going on... I still think that some network error (not necessarily or exclusively on the SETI pipes or campus) is what triggers the abandoned tasks... Why? Well, after a full week in which all my SETI hosts were having this issue at least once a day (at different hours), it stopped completely for 5 days... What's different between the first week and the next? Only one thing: in the first week my main ISP was not working well, so I needed to enable the other ISPs through the load balancer router (which in my case does less balancing and more "choose the functional ISP"). Then, as I noticed that the main ISP was working well again, I disabled the other ISPs, and while the connection was working well there were no abandoned tasks... Last weekend the main ISP was failing again and, without enabling the other ISPs, I got abandoned tasks again on one of the hosts... [EDIT: And today on the other 2 hosts also] About the message "last request too recent": I've found that sometimes the time at which the RPC starts and the time at which the servers register it differ by several minutes (not due to clock differences; my hosts are within about 15 secs of the servers' time)... When this happens there is a high probability that the client times out... and sometimes that difference in the times (server vs. client) is long enough that the 5 mins the client waited looks too short from the server side... I'm not sure if this last issue has any relation to the abandoned tasks, because there were also abandoned tasks in a period between two successful RPCs... |
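The timing effect described in the post above can be modelled with a small Python sketch. Everything in it is an assumption made for illustration: the 300-second minimum interval, the function name `server_verdicts`, and the rule that the server compares each arrival with the previous arrival. It is not taken from the BOINC scheduler code.

```python
MIN_INTERVAL = 300  # assumed server-side minimum gap between requests, in seconds


def server_verdicts(requests, min_interval=MIN_INTERVAL):
    """Toy model: requests is a list of (client_send_time, network_delay).

    The server only sees arrival times (send time + delay), so a slow first
    request followed by a fast retry can look "too recent" to the server even
    though the client waited the full interval.
    """
    # Order requests by the time they reach the server, not by send time.
    arrivals = sorted((sent + delay, sent) for sent, delay in requests)
    verdicts = []
    last_seen = None
    for arrived, sent in arrivals:
        if last_seen is not None and arrived - last_seen < min_interval:
            verdicts.append((sent, "last request too recent"))
        else:
            verdicts.append((sent, "ok"))
        # Assumption: the server remembers every arrival, accepted or not.
        last_seen = arrived
    return verdicts
```

For example, a request sent at t=0 that takes 240 seconds to register, followed by a retry sent at t=300 that takes 5 seconds, reaches the server only 65 seconds apart, so the retry is rejected in this model.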
Horacio Send message Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0 |
I thought I ought to check to make sure - two tasks that were about to report had already been marked "abandoned". No. Once the tasks are marked abandoned, all the work done on them is wasted, because the results are discarded, which means that someone else has to do it again and you won't get the credits. If you have any doubt about the tasks you have on hand, the easiest way to get rid of the abandoned ones is to do a reset, but if your cache is small it may be better to check them one by one against the server and abort just the ones marked as abandoned... |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304 |
How many people are seeing this problem, or are seeing it currently / recently / repeatedly? I've had it occur once or twice, probably several months ago now. I can't remember if it occurred on only one or both of my machines. EDIT- I think it was around the time of the Scheduler timeout issues. Grant Darwin NT |
KWSN Ekky Ekky Ekky Send message Joined: 25 May 99 Posts: 944 Credit: 52,956,491 RAC: 67 |
Just aborted over 100 tasks. Miserable about that but at least I have got rid of all the "abandoned" ones :-( |
Horacio Send message Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0 |
In fact you only aborted 15 tasks; the others were already abandoned... so it's not a big issue... To avoid aborting valid pending tasks, a reset is the easy way... those that are not abandoned will be resent to you, and only those... |
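The "check them one by one against the server" approach amounts to a set comparison. A minimal Python sketch, assuming you have already collected two lists by hand (the task names held by the client and the task names the website shows as abandoned); the function name and the example task names are hypothetical:

```python
def split_local_tasks(local_tasks, abandoned_on_server):
    """Return (to_abort, to_keep) for the tasks currently held by the client.

    Only tasks that are both held locally and marked abandoned on the server
    are worth aborting; everything else is still valid work.
    """
    local = set(local_tasks)
    abandoned = set(abandoned_on_server)
    to_abort = sorted(local & abandoned)  # held locally and already abandoned
    to_keep = sorted(local - abandoned)   # still valid, keep crunching
    return to_abort, to_keep
```

With `local_tasks=["a", "b", "c"]` and `abandoned_on_server=["b", "x"]`, only `"b"` would be aborted; `"x"` is ignored because the client no longer holds it.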
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Back in July 2009 I made an attempt to get a sanity check before abandoning tasks (then shown as "Client detached"). The boinc_dev thread starts at http://lists.ssl.berkeley.edu/pipermail/boinc_dev/2009-July/014662.html. Unfortunately the check I proposed isn't workable: the "other_results" list has entries for tasks from other projects as well as S@H. There are other parts of the work request which could be checked, but I don't know what would be quick and reliable. Joe |
Horacio Send message Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0 |
Interesting: On 27 Jul 2009 at 9:54, Jonathan wrote: > Hi! > Just read this.. and have a few questions/ideas: > > Is this happening throughout every kind of computer? > I'm asking, because here with SIMAP, we see one such a happening every > now and then, > with a Mac connecting to the scheduler and then getting a reattachment. > > We have traced it to a *very* long running scheduler request that seems > to be on hold for several hours, > during that time, the Mac makes some more scheduler requests, increasing > the request_sequence_id; > Then suddenly, (why, we don't know or understand - or even can guess) > the Mac picks up the long standing scheduler request, > which suddenly returns and complains about not-in-order > request_sequence_ids, effectively detaching the host. > > Might this be the case on Seti too? > > However, I don't see a chance for the seti-guys to track this down, > because it took me - with our considerably smaller > database/hostcount/workload - > about three months to track it down. Though this lengthy it was only > 'cause of our cyclic work-distribution scheme. > > Best > -Jonathan > from the BoincSIMAP team That's more or less what I guessed in an earlier post... maybe some RPC is delayed on the networks (or in the scheduler) for a really long time? (and by "long" I mean really long on a human scale, not on a "computing" scale...) |
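The stalled-request scenario in the quoted mail can be illustrated with a toy Python model. It assumes, purely for illustration, that the server remembers the highest sequence id it has seen for a host and treats any lower id as a sign of a detach; the function name and the verdict strings are hypothetical and this is not the scheduler's actual code.

```python
def process_arrivals(arrival_order):
    """Toy model: arrival_order lists sequence ids in the order the server sees them.

    A request that stalls for hours arrives after later requests have already
    raised the stored sequence id, so it looks out of order.
    """
    last_seen = 0
    verdicts = []
    for seq in arrival_order:
        if seq < last_seen:
            # Assumed reaction: the host is treated as detached/reattached.
            verdicts.append((seq, "out of order: treated as detach"))
        else:
            verdicts.append((seq, "ok"))
            last_seen = seq
    return verdicts
```

If the client sends requests 1, 2, 3, 4 but request 2 stalls and arrives last, the server sees 1, 3, 4, 2 and flags only the late arrival, which matches the "suddenly returns and complains" behaviour described in the mail.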
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Back in July 2009 I made an attempt to get a sanity check before abandoning tasks (then shown as "Client detached"). The boinc_dev thread starts at http://lists.ssl.berkeley.edu/pipermail/boinc_dev/2009-July/014662.html.
I've been doing some code-walking too, and found that there are only two places in the code where the routine which sets RESULT_OUTCOME_CLIENT_DETACHED is called. They're both in handle_request.cpp, and the more likely one says:
// If host CPID is present,
// scan backwards through this user's hosts,
// looking for one with the same host CPID.
// If we find one, it means the user detached and reattached.
// Use the existing host record,
// and mark in-progress results as over.
So, I'm pointing the finger at host CPID, too, and it worries me. If you look at the host list for any of the people who have reported the problem in this thread, you'll see that below each HostID number, there's a link through to BOINCstats for that host. That link attempts to hook up by CPID, and for every one I've tried, BOINCstats has ended up at a dead end: whichever way I try it, I haven't been able to get back to a sane-looking stats page for the right user. On the other hand, the older hosts on my own account (up to Q6600 3751792) have clean links to BOINCstats - it starts going wrong with Q9300 4292666.
All of which could simply be a stats problem, but given all the weirdness in this thread, I'm beginning to wonder if we might have database corruption - I think that's more likely than the same hosts getting a bad network connection, again and again, when other hosts on the same network are OK. That would really be a big bugger, both to diagnose and to fix.
What we need at this point is for the problem to strike a user who has real cold, hard, forensic, code-walking skills. I think that such a user, armed with their own logged-in account page (which gives access to the user account key), and the sched_request file from the host giving problems, could walk through authenticate_user() (line 242 of http://boinc.berkeley.edu/trac/browser/boinc/sched/handle_request.cpp), and find out whether they end up properly authenticating host and user IDs in the database. It ain't going to be easy - any takers? |
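The logic described by the quoted source comment can be sketched in Python as follows. This is only a model of the comment's wording, not a translation of handle_request.cpp: the `Host` class, its fields, and `find_host_by_cpid` are hypothetical names, and the conditions under which the real scheduler reaches this path are not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    host_id: int
    cpid: str
    in_progress: list = field(default_factory=list)
    abandoned: list = field(default_factory=list)


def find_host_by_cpid(user_hosts, request_cpid):
    """Scan the user's hosts newest-first for a matching host CPID.

    On a match the model assumes a detach/reattach happened: it reuses the
    existing record and marks all of its in-progress results as abandoned.
    """
    if not request_cpid:
        return None  # no host CPID in the request, nothing to match
    for host in reversed(user_hosts):
        if host.cpid == request_cpid:
            host.abandoned.extend(host.in_progress)
            host.in_progress.clear()
            return host
    return None
```

The point of the sketch is that the abandonment is a side effect of the lookup itself: if this path is ever taken for a host that never actually detached, every task it currently holds is marked as over in one go, which would fit the single shared "abandoned" timestamp reported earlier in the thread.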