Abandoned tasks - Ongoing issue

Horacio

Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1348279 - Posted: 18 Mar 2013, 22:07:44 UTC - in response to Message 1348276.  
Last modified: 18 Mar 2013, 22:17:38 UTC

What we need at this point is for the problem to strike a user who has real cold, hard, forensic, code-walking skills. I think that such a user, armed with their own logged-in account page (which gives access to the user account key), and the sched_request file from the host giving problems, could walk through authenticate_user() (line 242 of http://boinc.berkeley.edu/trac/browser/boinc/sched/handle_request.cpp), and find out whether they end up properly authenticating host and user IDs in the database. It ain't going to be easy - any takers?


I've seen that issue with my hosts not having records in the stats before a certain date... And I know they had data for those old dates some time ago...

I'm not sure I meet all those requirements, but I can try...
Of course, I need more specific instructions...

Interesting:

make_new_host:
// One final attempt to locate an existing host record:
// scan backwards through this user's hosts,
// looking for one with the same host name,
// IP address, processor and amount of RAM.
// If found, use the existing host record,
// and mark in-progress results as over.
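
To make the quoted comment concrete, here is a minimal Python sketch of the matching logic it describes. It is not the BOINC code (which is C++): the Host fields, the function names and the sample data are all made up for illustration, and only the matching rule itself comes from the comment.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Host:
    # Hypothetical fields; the real BOINC host table has many more columns.
    host_id: int
    domain_name: str
    ip_addr: str
    cpu_model: str
    ram_bytes: int


def find_matching_host(known_hosts: List[Host], incoming: Host) -> Optional[Host]:
    """Scan the user's hosts backwards, looking for one with the same
    host name, IP address, processor and amount of RAM."""
    for host in reversed(known_hosts):
        if (host.domain_name == incoming.domain_name
                and host.ip_addr == incoming.ip_addr
                and host.cpu_model == incoming.cpu_model
                and host.ram_bytes == incoming.ram_bytes):
            return host
    return None


def reuse_or_create(known_hosts: List[Host], in_progress: Dict[str, int],
                    incoming: Host, next_id: int) -> Tuple[int, List[str]]:
    """Return (host_id to use, names of tasks marked as over)."""
    match = find_matching_host(known_hosts, incoming)
    if match is None:
        # No look-alike found: a genuinely new host record would be created.
        return next_id, []
    # Look-alike found: keep its ID, but its in-progress tasks are given up.
    abandoned = [name for name, hid in in_progress.items() if hid == match.host_id]
    return match.host_id, abandoned


# Made-up sample data: two existing hosts and three tasks in progress.
hosts = [Host(1, "q6600", "10.0.0.5", "Q6600", 4 << 30),
         Host(2, "q9300", "10.0.0.6", "Q9300", 8 << 30)]
tasks = {"wu_a": 2, "wu_b": 2, "wu_c": 1}
request = Host(0, "q9300", "10.0.0.6", "Q9300", 8 << 30)

host_id, abandoned = reuse_or_create(hosts, tasks, request, next_id=3)
print(host_id, abandoned)
```

In this toy model the host keeps its old ID while every task still in progress on it is marked as over, which is the combination the comment describes.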

ID: 1348279 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1348287 - Posted: 18 Mar 2013, 22:24:34 UTC - in response to Message 1348279.  

Interesting:

make_new_host:
// One final attempt to locate an existing host record:
// scan backwards through this user's hosts,
// looking for one with the same host name,
// IP address, processor and amount of RAM.
// If found, use the existing host record,
// and mark in-progress results as over.

That's exactly what I was looking for in the code - where and when does that happen? Answer - when the scheduler request gets through to the server, but it can't validate the HostID, UserID and security key (authenticator) against the data held on the server.
ID: 1348287 · Report as offensive
Horacio

Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1348290 - Posted: 18 Mar 2013, 22:36:42 UTC - in response to Message 1348287.  
Last modified: 18 Mar 2013, 22:41:04 UTC

Also...

// If the seqno from the host is less than what we expect,
// the user must have copied the state file to a different host.
// Make a new host record.
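
As a rough illustration of that check (a Python sketch, not the actual scheduler code), assume "what we expect" means the last sequence number the server stored for the host. The HostRecord class and check_seqno function are hypothetical names invented for this example.

```python
class HostRecord:
    """Hypothetical stand-in for the server-side host row."""

    def __init__(self, host_id: int, last_seqno: int = 0):
        self.host_id = host_id
        self.last_seqno = last_seqno


def check_seqno(record: HostRecord, request_seqno: int) -> bool:
    """Return True if the request is accepted as coming from the known host.

    False corresponds to the branch described in the comment: the seqno is
    lower than expected, so the server would go on to make a new host record.
    """
    if request_seqno < record.last_seqno:
        return False
    # Assumption: an accepted request updates the stored sequence number.
    record.last_seqno = request_seqno
    return True


rec = HostRecord(host_id=42, last_seqno=10)
print(check_seqno(rec, 11))  # next request in sequence
print(check_seqno(rec, 9))   # older than what the server has seen
```

Note that the check only knows that the number went backwards. It cannot tell a copied state file from any other reason for a stale sequence number.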

And the first thing inside make_new_host is what I've posted before...


EDIT: Today, BOINCstats is saying that my hosts are hidden for SETI when they are not... But the truth is that the host info on BOINCstats is always weird, and I've seen changes from one day to another even when I was not having any issue with SETI...
ID: 1348290 · Report as offensive
Profile Wiggo

Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1348308 - Posted: 18 Mar 2013, 23:41:33 UTC - in response to Message 1348287.  

.....
If you look at the host list for any of the people who have reported the problem in this thread, you'll see that below each HostID number, there's a link through to BOINCstats for that host. That link attempts to hook-up by CPID, and every one I've tried, BOINCstats has ended up in a dead end: whichever way I try it, I haven't been able to get back to a sane-looking stats page for the right user. On the other hand, the older hosts on my own account (up to Q6600 3751792) have clean links to BOINCstats - it starts going wrong with Q9300 4292666......


Richard, I have not had this particular problem but something is certainly going funny over at BOINCstats as over the last few days they have not been updating the stats for my rigs and now today they are listed as being hidden.

Cheers.
ID: 1348308 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1348318 - Posted: 19 Mar 2013, 0:40:36 UTC - in response to Message 1348308.  

.....
If you look at the host list for any of the people who have reported the problem in this thread, you'll see that below each HostID number, there's a link through to BOINCstats for that host. That link attempts to hook-up by CPID, and every one I've tried, BOINCstats has ended up in a dead end: whichever way I try it, I haven't been able to get back to a sane-looking stats page for the right user. On the other hand, the older hosts on my own account (up to Q6600 3751792) have clean links to BOINCstats - it starts going wrong with Q9300 4292666......

Richard, I have not had this particular problem but something is certainly going funny over at BOINCstats as over the last few days they have not been updating the stats for my rigs and now today they are listed as being hidden.

Cheers.

And having had a scout round, it's affecting Einstein data too. That's a worthwhile health warning: at this stage we don't know whether that's cause, effect, or a complete red herring. All I'm doing is, slightly rashly, plucking straws from the wind, laying them out in front of you all, and trying to make sense of them.

If we're going to crack this at all, I think we're going to need that cold, hard, forensic approach. But if anybody has some facts (as opposed to speculative opinion) to feed in, that'll all help.
ID: 1348318 · Report as offensive
Horacio

Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1348320 - Posted: 19 Mar 2013, 0:59:48 UTC - in response to Message 1348318.  

My conjecture is that in some way, sometimes, one RPC gets delayed (on the internet, on my ISP, on the servers, or maybe even in subspace...)

TBH, I think this explains everything: it explains why the abandoned tasks do not exactly match an RPC time, it explains the weird "last contact too recent", and, as the delay could last any arbitrary time, it explains why in some cases the RPCs close to the time of the abandoned tasks are normal and successful ones... not to mention that the code that handles an out-of-order RPC does exactly what I've been seeing since the beginning: it keeps the hostId but abandons all the tasks...

So, if there is a way to prove or discard that conjecture and/or to prove or discard the possibility of the IDs weirdness, I'm all for it, just tell me what I have to do!
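
The conjecture can be modelled in a few lines of Python. This is only a toy model under stated assumptions, not evidence either way: it assumes the server rejects any request whose sequence number is lower than the highest one it has already seen, and the replay and delay helpers are invented for this sketch.

```python
from typing import List


def delay(sent: List[int], seqno: int, slots: int) -> List[int]:
    """Return the arrival order if one request arrives `slots` positions late."""
    order = list(sent)
    i = order.index(seqno)
    order.pop(i)
    order.insert(min(i + slots, len(order)), seqno)
    return order


def replay(arrival_order: List[int]) -> List[int]:
    """Return the seqnos that would trip a 'less than expected' check."""
    highest_seen = 0
    tripped = []
    for seqno in arrival_order:
        if seqno < highest_seen:
            tripped.append(seqno)
        else:
            highest_seen = seqno
    return tripped


sent = [1, 2, 3, 4, 5, 6]
late = delay(sent, seqno=3, slots=2)
print(late)
print(replay(sent))
print(replay(late))
```

In this model only the late request trips the check, while the requests on either side of it look perfectly normal, which is the pattern the conjecture is trying to explain.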


ID: 1348320 · Report as offensive
Profile trader
Volunteer tester

Joined: 25 Jun 00
Posts: 126
Credit: 4,968,173
RAC: 0
United States
Message 1348347 - Posted: 19 Mar 2013, 3:39:34 UTC - in response to Message 1348320.  

My conjecture is that in some way, sometimes, one RPC gets delayed (on the internet, on my ISP, on the servers, or maybe even in subspace...)

TBH, I think this explains everything: it explains why the abandoned tasks do not exactly match an RPC time, it explains the weird "last contact too recent", and, as the delay could last any arbitrary time, it explains why in some cases the RPCs close to the time of the abandoned tasks are normal and successful ones... not to mention that the code that handles an out-of-order RPC does exactly what I've been seeing since the beginning: it keeps the hostId but abandons all the tasks...

So, if there is a way to prove or discard that conjecture and/or to prove or discard the possibility of the IDs weirdness, I'm all for it, just tell me what I have to do!



LOL @ Horacio....could be the chipmunks get tired now and then
ID: 1348347 · Report as offensive
Profile Uli
Volunteer tester

Joined: 6 Feb 00
Posts: 10923
Credit: 5,996,015
RAC: 1
Germany
Message 1348382 - Posted: 19 Mar 2013, 6:24:27 UTC

I didn't follow the whole thread, but could Daylight Saving Time be an issue here?
Pluto will always be a planet to me.

Seti Ambassador
Not to late to order an Anni Shirt
ID: 1348382 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1348390 - Posted: 19 Mar 2013, 6:37:06 UTC - in response to Message 1348318.  

Richard Haselgrove wrote:
And having had a scout round, it's affecting Einstein data too. That's a worthwhile health warning: at this stage we don't know whether that's cause, effect, or a complete red herring.
...

ISTR that Einstein had a way for users to view the Scheduler logs. If so, the log messages from handle_request.cpp might show and you'd be able to correlate with client indications.
Joe
ID: 1348390 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1348434 - Posted: 19 Mar 2013, 10:39:19 UTC - in response to Message 1348390.  

Richard Haselgrove wrote:
And having had a scout round, it's affecting Einstein data too. That's a worthwhile health warning: at this stage we don't know whether that's cause, effect, or a complete red herring.
...

ISTR that Einstein had a way for users to view the Scheduler logs. If so, the log messages from handle_request.cpp might show and you'd be able to correlate with client indications.
Joe

Ah - slight misunderstanding: it's only the host data at BOINCstats that I'm seeing messed up for Einstein. There have been no reports there of these large-scale and regular task abandonments, so no target hosts to investigate.

Yes, Einstein server logs are accessible online (and very helpful it is too), but you only see the log for the single most recent RPC for each host. It would be quite time-consuming to find a host which was suffering, and then catch it in the act. If abandonments did start appearing there, I think I'd go straight to Bernd and ask the staff to take a look - they have more time available for that sort of thing than our long-suffering boyz do.
ID: 1348434 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1348440 - Posted: 19 Mar 2013, 11:11:00 UTC - in response to Message 1348320.  

My conjecture is that in some way, sometimes, one RPC gets delayed (on the internet, on my ISP, on the servers, or maybe even in subspace...)

TBH, I think this explains everything: it explains why the abandoned tasks do not exactly match an RPC time, it explains the weird "last contact too recent", and, as the delay could last any arbitrary time, it explains why in some cases the RPCs close to the time of the abandoned tasks are normal and successful ones... not to mention that the code that handles an out-of-order RPC does exactly what I've been seeing since the beginning: it keeps the hostId but abandons all the tasks...

So, if there is a way to prove or discard that conjecture and/or to prove or discard the possibility of the IDs weirdness, I'm all for it, just tell me what I have to do!

It's hard to see where in cyberspace a whole RPC could hide for long enough to get out of sequence, but it's an interesting idea to keep in mind.

I'm not sure I can give you a mechanical recipe for how to go about the search: in the nature of things, we don't know exactly what we're looking for. This sort of thing tends to turn into "spread every number you can find out on the table, and look for the one which 'feels wrong'".

One suggestion: if anyone would be prepared to trust me with access to their account on this website (I would need to log in to see the detailed version of the host records, but no more than that), I could run the numbers here. We would need to exchange email addresses by PM, and you would need to zip up and send me a 'sched_request_setiathome.berkeley.edu.xml' and 'sched_reply_setiathome.berkeley.edu.xml' file from an affected computer - ideally ones from close to an abandonment event.
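
For anyone who would rather look at their own numbers first, a small Python sketch along these lines could pull the interesting values out of a sched_request file. Treat the details as assumptions: the hostid and rpc_seqno tag names and the scheduler_request root element are my recollection of the file layout rather than something confirmed in this thread, the sample values are invented, and a real file may not parse as strict XML.

```python
import xml.etree.ElementTree as ET

# Invented sample standing in for sched_request_setiathome.berkeley.edu.xml.
# Tag names other than <authenticator> are assumptions about the file layout.
SAMPLE = """<scheduler_request>
    <authenticator>0123456789abcdef</authenticator>
    <hostid>1234567</hostid>
    <rpc_seqno>42</rpc_seqno>
</scheduler_request>"""


def summarize_request(xml_text: str) -> dict:
    """Extract the IDs worth comparing against the website's host record."""
    root = ET.fromstring(xml_text)

    def text_of(tag: str):
        node = root.find(tag)
        if node is None or node.text is None:
            return None
        return node.text.strip()

    return {
        "hostid": text_of("hostid"),
        "rpc_seqno": text_of("rpc_seqno"),
        # Report only whether the account key is present, never its value.
        "authenticator_present": text_of("authenticator") is not None,
    }


summary = summarize_request(SAMPLE)
print(summary)
```

The sketch deliberately never prints the authenticator itself, since that value is the account key and gives access to the account on this site.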

Since we have a 24-hour outage coming up (normal server maintenance, followed by campus network maintenance, as per the front page), it would be good to start this asap, if anyone is game.
ID: 1348440 · Report as offensive
Profile trader
Volunteer tester

Joined: 25 Jun 00
Posts: 126
Credit: 4,968,173
RAC: 0
United States
Message 1348455 - Posted: 19 Mar 2013, 12:40:05 UTC - in response to Message 1348440.  


One suggestion: if anyone would be prepared to trust me with access to their account on this website (I would need to log in to see the detailed version of the host records, but no more than that)


hmmmm.... I think giving out account passwords is number one on a list out there somewhere entitled "a really bad idea"
ID: 1348455 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1348459 - Posted: 19 Mar 2013, 13:07:58 UTC - in response to Message 1348455.  

One suggestion: if anyone would be prepared to trust me with access to their account on this website (I would need to log in to see the detailed version of the host records, but no more than that)

hmmmm.... I think giving out account passwords is number one on a list out there somewhere entitled "a really bad idea"

Exactly. That's why I was up front and explicit about what the suggestion would entail.

And I didn't mention a password. I would be logging in to the website using the 'authenticator' (account key) contained within the sched_request file, so I only gain access to this site (not anything else that might share a password). I don't actually even need to know an email address - that's just a mechanism for file exchange: a secure dropbox would work just as well (though I'd see the email address as soon as I logged in, of course, so that wouldn't gain much).

But it's up to users. A trade off between how much they want this problem investigated, vs. how much they trust me. My record has 7,340 public posts (plus many thousand more on other BOINC-related sites) for them to evaluate me by.
ID: 1348459 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1348495 - Posted: 19 Mar 2013, 14:48:02 UTC

OK, I've got files from two users and four hosts now. That's enough for today, thanks.
ID: 1348495 · Report as offensive
Profile HAL9000
Volunteer tester

Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1348520 - Posted: 19 Mar 2013, 15:40:03 UTC

I hope my one machine that did this last week will repeat its performance this week. If it does, I can load up Wireshark and start logging everything on it.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
ID: 1348520 · Report as offensive
rob smith
Volunteer moderator
Volunteer tester

Joined: 7 Mar 03
Posts: 22190
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1353435 - Posted: 4 Apr 2013, 20:41:48 UTC

Earlier today I got a pile of new tasks - which was good news.
Now I look at my account page and see they are marked as "abandoned", but they are still on the cruncher, and being processed.

http://setiathome.berkeley.edu/results.php?hostid=6452693

Something strange going on here...
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1353435 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester

Joined: 17 Feb 01
Posts: 34257
Credit: 79,922,639
RAC: 80
Germany
Message 1353436 - Posted: 4 Apr 2013, 20:56:37 UTC - in response to Message 1353435.  

Earlier today I got a pile of new tasks - which was good news.
Now I look at my account page and see they are marked as "abandoned", but they are still on the cruncher, and being processed.

http://setiathome.berkeley.edu/results.php?hostid=6452693

Something strange going on here...


I had this happen a few weeks ago.



With each crime and every kindness we birth our future.
ID: 1353436 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14650
Credit: 200,643,578
RAC: 874
United Kingdom
Message 1353437 - Posted: 4 Apr 2013, 21:02:27 UTC - in response to Message 1353435.  

Earlier today I got a pile of new tasks - which was good news.
Now I look at my account page and see they are marked as "abandoned", but they are still on the cruncher, and being processed.

http://setiathome.berkeley.edu/results.php?hostid=6452693

Something strange going on here...

Ah-ha. I've been waiting to see if that would still happen after the move - and clearly it has.

Remember Horacio's Abandoned tasks - Ongoing issue thread? I've left that one rather on a back burner while all the changes happened - but it's obviously time to get back on the case.

Could you possibly move yourself (or make a copy post) into that thread, please? And could you find the section of your local message/event log that covers 4 Apr 2013, 20:17:50 UTC - the time when they were marked as abandoned: +/- 15 minutes should be enough. If it's not too verbose, perhaps you could post it in thread - otherwise PM or email me.
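
If cutting that window out by hand is a chore, something like the following Python sketch would do it. The assumptions are loud ones: it assumes each log line starts with a timestamp such as "04-Apr-2013 20:17:50" in English month abbreviations, the sample lines are invented placeholders rather than real client output, and the client log is normally in local time, so the UTC time above would need converting before it is used as the centre of the window.

```python
from datetime import datetime, timedelta
from typing import List

# Assumed timestamp layout at the start of each log line (20 characters).
STAMP_FORMAT = "%d-%b-%Y %H:%M:%S"
STAMP_LENGTH = 20


def lines_in_window(log_lines: List[str], center: datetime, minutes: int = 15) -> List[str]:
    """Keep lines whose leading timestamp is within +/- `minutes` of `center`."""
    lo = center - timedelta(minutes=minutes)
    hi = center + timedelta(minutes=minutes)
    kept = []
    for line in log_lines:
        try:
            stamp = datetime.strptime(line[:STAMP_LENGTH], STAMP_FORMAT)
        except ValueError:
            # Lines without a parseable timestamp are skipped in this sketch.
            continue
        if lo <= stamp <= hi:
            kept.append(line)
    return kept


# Invented placeholder lines, not real BOINC messages.
LOG = [
    "04-Apr-2013 19:55:10 [project] placeholder message A",
    "04-Apr-2013 20:05:02 [project] placeholder message B",
    "04-Apr-2013 20:17:50 [project] placeholder message C",
    "a continuation line with no timestamp",
    "04-Apr-2013 20:40:00 [project] placeholder message D",
]

kept = lines_in_window(LOG, datetime(2013, 4, 4, 20, 17, 50))
for line in kept:
    print(line)
```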

I think I'm going to have to ask David Anderson to look into the server logs for this one - I've already had some useful pointers from a coder who prefers to contact me by PM, and I've found a bit of code which is no longer doing what it describes itself as doing, so I have something to go on. Any other evidence - especially hosts which repeatedly abandon tasks - would be most helpful.
ID: 1353437 · Report as offensive
Horacio

Joined: 14 Jan 00
Posts: 536
Credit: 75,967,266
RAC: 0
Argentina
Message 1353473 - Posted: 5 Apr 2013, 1:36:57 UTC

Richard, in my case it's happening a lot less than before, but mainly because my main ISP has been working as intended, which allowed me to not depend on the others...
Also, I thought that it was not a good time to debug this until the servers and all the tuning were finished, or at least until things got more or less stable... But if you think that we are ready to resume the testing, I can plug in the other ISPs to see if the issue arises as often as before...
ID: 1353473 · Report as offensive
MikeN

Joined: 24 Jan 11
Posts: 319
Credit: 64,719,409
RAC: 85
United Kingdom
Message 1353589 - Posted: 5 Apr 2013, 9:01:31 UTC

My main cruncher abandoned 98 tasks at 7:45 last night. This has not happened to me since Christmas and I was hoping the issue had been solved; I was down to my last 2 abandoned tasks from the Christmas problems. Nothing wrong with the PC: all temps were normal, and it continued crunching the abandoned tasks quite happily until I noticed and reset the project.

The most annoying thing is that I did not notice until 9 AM this morning, so that's over 12 hours of wasted crunching.

Still, one good thing: with the super-duper download speeds we get now, at least I was able to get more WUs quickly.
ID: 1353589 · Report as offensive