Abandoned tasks - Ongoing issue
Author | Message |
---|---|
Horacio Send message Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0 |
What we need at this point is for the problem to strike a user who has real cold, hard, forensic, code-walking skills. I think that such a user, armed with their own logged-in account page (which gives access to the user account key), and the sched_request file from the host giving problems, could walk through authenticate_user() (line 242 of http://boinc.berkeley.edu/trac/browser/boinc/sched/handle_request.cpp), and find out whether they end up properly authenticating host and user IDs in the database. It ain't going to be easy - any takers? I've seen that issue with my hosts not having records in the stats before a certain date... And I know they had data for those old dates some time ago... I'm not sure I qualify with all those requirements, but I can try... Of course I need more specific instructions... Interesting: make_new_host: |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Interesting: That's exactly what I was looking for in the code - where and when does that happen? Answer - when the scheduler request gets through to the server, but it can't validate the HostID, UserID and security key (authenticator) against the data held on the server. |
Horacio Send message Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0 |
Also... // If the seqno from the host is less than what we expect, // the user must have copied the state file to a different host. // Make a new host record. And the first thing inside make_new_host is what I've posted before... EDIT: Today, Boinc Stats is saying that my hosts are hidden for SETI when they are not... But the truth is that the host info on Boinc Stats is always weird, and I've seen changes from one day to another even when I was not having any issue with SETI... |
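The check described in the quoted comment can be modelled in a few lines. This is a minimal Python sketch of the behaviour as the thread describes it, not the BOINC C++ source: `HostRecord`, `handle_rpc`, and the field names are hypothetical, and the rule that an accepted request updates the expected sequence number is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class HostRecord:
    """Hypothetical stand-in for the server's view of one host."""
    host_id: int
    expected_seqno: int
    in_progress: list = field(default_factory=list)
    abandoned: list = field(default_factory=list)


def handle_rpc(host: HostRecord, request_seqno: int) -> str:
    """Apply the sequence-number check to one scheduler request."""
    if request_seqno < host.expected_seqno:
        # The server concludes the state file was copied to another host.
        # As reported in this thread, the host ID survives but every
        # in-progress task is marked abandoned.
        host.abandoned.extend(host.in_progress)
        host.in_progress.clear()
        return "abandoned"
    # Assumption: an accepted request becomes the new expected value.
    host.expected_seqno = request_seqno
    return "ok"
```

With `HostRecord(host_id=1, expected_seqno=10, in_progress=["a", "b"])`, a request carrying seqno 11 is accepted, while a later request carrying seqno 9 moves both tasks to `abandoned` and leaves `host_id` unchanged.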
Wiggo Send message Joined: 24 Jan 00 Posts: 34744 Credit: 261,360,520 RAC: 489 |
..... Richard, I have not had this particular problem but something is certainly going funny over at BOINCstats as over the last few days they have not been updating the stats for my rigs and now today they are listed as being hidden. Cheers. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
..... And having had a scout round, it's affecting Einstein data too. That's a worthwhile health warning: at this stage we don't know whether that's cause, effect, or a complete red herring. All I'm doing is, slightly rashly, plucking straws from the wind, laying them out in front of you all, and trying to make sense of them. If we're going to crack this at all, I think we're going to need that cold, hard, forensic approach. But if anybody has some facts (as opposed to speculative opinion) to feed in, that'll all help. |
Horacio Send message Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0 |
My conjecture is that in some way, sometimes, one RPC gets delayed (on the internet, on my ISP, on the servers, or maybe even in the subspace...) TBH, I think this explains everything: it explains why the abandoned tasks do not match exactly an RPC time, it explains the weird "last contact too recent", and, as the delay could last any arbitrary time, it also explains why in some cases the RPCs close to the time of the abandoned tasks are normal and successful ones... not to mention that the code that handles an out-of-order RPC does exactly what I've been seeing since the beginning: it keeps the hostId but abandons all the tasks... So, if there is a way to prove or discard that conjecture and/or to prove or discard the possibility of the IDs weirdness, I'm all for it, just tell me what I have to do! |
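The delayed-RPC conjecture can be played out with a toy simulation. The sketch below is Python written purely for illustration: `find_out_of_order` and the `(seqno, sent_at, delay)` tuple layout are hypothetical, times are in seconds, and the server is assumed to flag any request whose sequence number is lower than the highest one it has already processed.

```python
def find_out_of_order(rpcs):
    """Return (seqno, arrival_time) for each request that arrives late.

    rpcs is a list of (seqno, sent_at, delay) tuples. A request is
    flagged when it reaches the server after a request with a higher
    sequence number has already been processed.
    """
    # Order requests by the moment they reach the server, not by send time.
    arrivals = sorted((sent_at + delay, seqno) for seqno, sent_at, delay in rpcs)
    highest_seen = -1
    flagged = []
    for arrived_at, seqno in arrivals:
        if seqno < highest_seen:
            flagged.append((seqno, arrived_at))
        else:
            highest_seen = seqno
    return flagged
```

For `[(1, 0, 1), (2, 300, 400), (3, 600, 1)]` the function returns `[(2, 700)]`: request 2 is sent at 300 but only arrives at 700, after request 3 was handled at 601. The flagged time (700) matches none of the client's send times (0, 300, 600), which is the mismatch the conjecture is trying to explain.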
trader Send message Joined: 25 Jun 00 Posts: 126 Credit: 4,968,173 RAC: 0 |
My conjecture is that in some way, sometimes, one RPC gets delayed (on the internet, on my ISP, on the servers, or maybe even in the subspace...) LOL @ Horacio....could be the chipmunks get tired now and then |
Uli Send message Joined: 6 Feb 00 Posts: 10923 Credit: 5,996,015 RAC: 1 |
I didn't follow the whole thread, but could Daylight Savings Time here be an issue? Pluto will always be a planet to me. Seti Ambassador Not too late to order an Anni Shirt |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Richard Haselgrove wrote: And having had a scout round, it's affecting Einstein data too. That's a worthwhile health warning: at this stage we don't know whether that's cause, effect, or a complete red herring. ISTR that Einstein had a way for users to view the Scheduler logs. If so, the log messages from handle_request.cpp might show and you'd be able to correlate with client indications. Joe |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Richard Haselgrove wrote: And having had a scout round, it's affecting Einstein data too. That's a worthwhile health warning: at this stage we don't know whether that's cause, effect, or a complete red herring. Ah - slight misunderstanding: it's only the host data at BOINCstats that I'm seeing messed up for Einstein. There have been no reports of these large-scale and regular task abandonments, so no target hosts to investigate. Yes, Einstein server logs are accessible online (and very helpful it is too), but you only see the log for the single most recent RPC for each host. It would be quite time-consuming to find a host which was suffering, and then catch it in the act. If abandonments did start appearing there, I think I'd go straight to Bernd and ask the staff to take a look - they have more time available for that sort of thing than our long-suffering boyz do. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
My conjecture is that in some way, sometimes, one RPC gets delayed (on the internet, on my ISP, on the servers, or maybe even in the subspace...) It's hard to see where in cyberspace a whole RPC could hide for long enough to get out of sequence, but it's an interesting idea to keep in mind. I'm not sure I can give you a mechanical recipe for how to go about the search: in the nature of things, we don't know exactly what we're looking for. This sort of thing tends to turn into "spread every number you can find out on the table, and look for the one which 'feels wrong'". One suggestion: if anyone would be prepared to trust me with access to their account on this website (I would need to log in to see the detailed version of the host records, but no more than that), I could run the numbers here. We would need to exchange email addresses by PM, and you would need to zip up and send me a 'sched_request_setiathome.berkeley.edu.xml' and 'sched_reply_setiathome.berkeley.edu.xml' file from an affected computer - ideally ones from close to an abandonment event. Since we have a 24-hour outage coming up (normal server maintenance, followed by campus network maintenance, as per the front page), it would be good to start this asap, if anyone is game. |
trader Send message Joined: 25 Jun 00 Posts: 126 Credit: 4,968,173 RAC: 0 |
hmmmm.... i think giving out account passwords is number one on a list out there somewhere entitled "a really bad idea" |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
One suggestion: if anyone would be prepared to trust me with access to their account on this website (I would need to log in to see the detailed version of the host records, but no more than that) Exactly. That's why I was up front and explicit about what the suggestion would entail. And I didn't mention a password. I would be logging in to the website using the 'authenticator' (account key) contained within the sched_request file, so I only gain access to this site (not anything else that might share a password). I don't actually even need to know an email address - that's just a mechanism for file exchange: a secure dropbox would work just as well (though I'd see the email address as soon as I logged in, of course, so that wouldn't gain much). But it's up to users. A trade off between how much they want this problem investigated, vs. how much they trust me. My record has 7,340 public posts (plus many thousand more on other BOINC-related sites) for them to evaluate me by. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
OK, I've got files from two users and four hosts now. That's enough for today, thanks. |
HAL9000 Send message Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
I hope my one machine that did this last week will repeat its performance this week. If it will I can load up wireshark and start logging everything on it. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the BP6/VP6 User Group (http://tinyurl.com/8y46zvu) |
rob smith Send message Joined: 7 Mar 03 Posts: 22190 Credit: 416,307,556 RAC: 380 |
Earlier today I got a pile of new tasks - which was good news. Now I look at my account page and see they are marked as "abandoned", but they are still on the cruncher, and being processed. http://setiathome.berkeley.edu/results.php?hostid=6452693 Something strange going on here... Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Mike Send message Joined: 17 Feb 01 Posts: 34257 Credit: 79,922,639 RAC: 80 |
Earlier today I got a pile of new tasks - which was good news. I had this happen a few weeks ago. With each crime and every kindness we birth our future. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
Earlier today I got a pile of new tasks - which was good news. Ah-ha. I've been waiting to see if that would still happen after the move - and clearly it has. Remember Horacio's Abandoned tasks - Ongoing issue thread? I've left that one rather on a back burner while all the changes happened - but it's obviously time to get back on the case. Could you possibly move yourself (or make a copy post) into that thread, please? And could you find the section of your local message/event log that covers 4 Apr 2013, 20:17:50 UTC - the time when they were marked as abandoned: +/- 15 minutes should be enough. If it's not too verbose, perhaps you could post it in thread - otherwise PM or email me. I think I'm going to have to ask David Anderson to look into the server logs for this one - I've already had some useful pointers from a coder who prefers to contact me by PM, and I've found a bit of code which is no longer doing what it describes itself as doing, so I have something to go on. Any other evidence - especially hosts which repeatedly abandon tasks - would be most helpful. |
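Pulling the requested 15-minute window out of a long event log is easy to script. The following Python sketch rests on assumptions: it supposes each log line starts with a local timestamp such as `04-Apr-2013 20:17:50` (check your own log before relying on that), and `lines_near` and its parameters are hypothetical names. English month abbreviations are also assumed, since `%b` depends on the locale.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout at the start of each event-log line.
LOG_TIME_FORMAT = "%d-%b-%Y %H:%M:%S"


def lines_near(lines, event_utc, utc_offset_hours=0, window_minutes=15):
    """Keep log lines stamped within window_minutes of event_utc.

    utc_offset_hours is the offset of the log's local clock from UTC
    (for example 1 if the log runs one hour ahead of UTC).
    """
    window = timedelta(minutes=window_minutes)
    offset = timedelta(hours=utc_offset_hours)
    selected = []
    for line in lines:
        # The first two whitespace-separated fields are date and time.
        stamp = " ".join(line.split(maxsplit=2)[:2])
        try:
            local_time = datetime.strptime(stamp, LOG_TIME_FORMAT)
        except ValueError:
            continue  # not a timestamped line, skip it
        if abs((local_time - offset) - event_utc) <= window:
            selected.append(line)
    return selected
```

Called with `event_utc=datetime(2013, 4, 4, 20, 17, 50)`, it keeps lines stamped between 20:02:50 and 20:32:50 UTC inclusive, after shifting the log's local times by the given offset.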
Horacio Send message Joined: 14 Jan 00 Posts: 536 Credit: 75,967,266 RAC: 0 |
Richard, in my case it's happening a lot less than before, but mainly because my main ISP has been working as intended, which allowed me to not depend on the others... Also, I thought that it was not a good time to debug this until the servers and all the tuning was finished, or at least until things get more or less stable... But if you think that we are ready to resume the testing, I can plug in the other ISPs to see if the issue arises as often as before... |
MikeN Send message Joined: 24 Jan 11 Posts: 319 Credit: 64,719,409 RAC: 85 |
My main cruncher abandoned 98 tasks at 7.45 last night. This has not happened to me since Christmas and I was hoping the issue had been solved; I was down to my last 2 abandoned from the Christmas problems. Nothing wrong with the PC, all temps normal, and it continued crunching the abandoned tasks quite happily until I noticed and reset the project. Most annoying thing is I did not notice until 9AM this morning, so that's over 12 hours of wasted crunching. Still, one good thing: with the superdooper download speeds we get now, at least I was able to get more WUs quickly. |