Aborted by project?
Author | Message |
---|---|
speedimic Send message Joined: 28 Sep 02 Posts: 362 Credit: 16,590,653 RAC: 0 |
ahhh - "client state" on the website! LOL - I searched the complete boinc manager for that link.... Thanx Jim, Alinator! mic. |
Sutaru Tsureku Send message Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
"I've changed the text displayed for these results on the web site to make it clear that it shouldn't be considered an error." So it can no longer be confused with 'user aborted WUs'! Much nicer now! |
Sutaru Tsureku Send message Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
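"I've changed the text displayed for these results on the web site to make it clear that it shouldn't be considered an error." There are two different 'Client states'? Result 569572020 (WU 140356119): sent 11 Jul 2007 16:35:11 UTC, reported 12 Jul 2007 0:17:00 UTC, server state Over, outcome Redundant result, client state "Cancelled by server", CPU time 0.00, claimed credit -, granted credit -. Result 569572016 (WU 140356113): sent 11 Jul 2007 16:35:11 UTC, reported 12 Jul 2007 1:19:49 UTC, server state Over, outcome Redundant result, client state "Done", CPU time 0.00, claimed credit -, granted credit -. |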
Alinator Send message Joined: 19 Apr 05 Posts: 4178 Credit: 4,647,982 RAC: 0 |
Yeah, that happens on 'regular' WU failures as well. Sometimes they will show as a 'Compute Error', and other times as 'Done'. I guess it has to do with what the project gets back for a status word from the CC, and apparently it depends on what the failure is and how the app and/or the CC handles it as it exits/recovers from the failure. Alinator |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14656 Credit: 200,643,578 RAC: 874 |
I had an 'aborted by project' showing in BoincView, which turned into a 'ready to report' when I had to restart BOINC for some reason before reporting. I think that one showed as 'Done' on the website too. Testable hypothesis: it's the BOINC restart wot causes it? |
Stefan Send message Joined: 28 Oct 02 Posts: 13 Credit: 671,275 RAC: 0 |
Hi folks, ...just a question about the new Boinc-clients (5.10.x) and to check if I got it right: A workunit reported as "redundant result, cancelled by server" is of no use for the project and never gets any credit (due to 2 other valid results of the same WU reported earlier to Seti with similar outcome). I read this thread carefully to find solutions to that issue. Reducing the cache down to 0 was the best I found (and I did it) but I'm still wondering if this could really avoid redundancies. Imagine: even if there's just one WU in my queue, couldn't it still happen that it becomes obsolete if the other 2 machines working on exactly the same WU finish their work earlier? So if I'm a "lucky" guy preferably catching the wrong (= "older") WU's from the project or just using slow machines, isn't there an increasing probability of producing large numbers of waste WU's the longer it takes for me to finish one? It seems to me a bit like a lottery. Either you're always(!)... - faster than the others (no doubt that's the way...;-) or - you always catch the right WU's (brand new of course - if I only could choose...:-) or - your results (and your spare cycles) will be of no use for anyone. I know looking at it that way is just the worst case scenario - it won't be soooo bad of course. But is there a solution to that issue? Thanks for any reply! Stefan Stefan Greetings from Saarbrucken, Germany |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
The "redundant, cancelled" work units don't get any credit, but they don't use any CPU time either -- if you started the work unit it won't be aborted, and you will get credit as long as it is delivered on time. If you are using a 5.10.x BOINC client, you can set a short connection interval (e.g. 0.25 days) and 10 days of additional cache -- your machine will connect 4 times each day, some work (that you haven't started) will get aborted because it's no longer needed, and BOINC will "top up" to keep about 10 days handy, just in case there is an outage. Works very well. -- Ned |
Stefan Send message Joined: 28 Oct 02 Posts: 13 Credit: 671,275 RAC: 0 |
Hi Ned, ...wow, didn't expect to wait only a minute! Thanks for your reply. I just had a few of those "redundant" WU's and indeed, I didn't get any credit so far. Note: I know there are two kinds of remotely aborted WU's, you mentioned the first one: 1) already cached WU's (downloaded but not in progress yet, returned with status "aborted by project", initial. Indeed, these are the "lossless" abortions, with no credit of course but also without waste of computing time) 2) finished WU's (though completed, nevertheless returned as third of two results, therefore status "redundant, aborted by project" - valid for older clients, finished in time, usual CPU-time, *no* credit, *no* validation from project, useless for Seti) I posted to learn more about the second ones and how to avoid them. No kidding, these WU's exist - there's *no* credit provided for them and they're simply a waste of time for the project. I know, this is against everything that's been claimed earlier in this thread. Anyway, your suggestion (short connection interval, 10 days additional cache) is great, guess I'll try it! I think you're right, it's much likelier that way to get more of the "good" abortions (the lossless type). Thanks again! Stefan Stefan Greetings from Saarbrucken, Germany |
Alinator Send message Joined: 19 Apr 05 Posts: 4178 Credit: 4,647,982 RAC: 0 |
If I'm reading your posts correctly, there should never be a case 2 as you described it. Regardless of CC version, if your host has started a result and returns it to the project on time you should get credit for it no matter what. This assumes it's at least weakly similar to the canonical result for the WU, of course. The only time a result in progress would get aborted by the project is called the 'Unconditional Abort' and should only happen if your host is over the deadline and the WU has been validated. If your host is in this condition, then the next time it contacts the scheduler it would be issued an unconditional abort command. If the result had finished but was overdue, the scheduler will accept the report, but of course you don't get credit. If the WU is still in the BOINC database it will be marked 'Too late to validate'. The only exception to that is if there's no quorum yet and you beat the reissued result back. If it aborts while in progress for any reason other than what I said above (barring a real compute error), that was a malfunction of some kind at the project end, and you should post about it right away so we can take a look at it. HTH, Alinator |
1mp0£173 Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0 |
Stefan, your #2 above does not exist. That was my main point. BOINC does not abort work that has been started and will get credit, so there is no risk of losing a partial work unit that would have been returned on time or has a chance of being part of a quorum. If you have an example of your #2, I'd like to see a link.... -- Ned |
Stefan Send message Joined: 28 Oct 02 Posts: 13 Credit: 671,275 RAC: 0 |
"Stefan, Your #2 above does not exist. That was my main point. BOINC does not abort work that has been started and will get credit, so there is no risk of losing a partial work unit that would have been returned on time or has a chance of being part of a Quorum. If you have an example of your #2, I'd like to see a link.... -- Ned" "If it aborts while in progress for any reason other than what I said above (barring a real compute error), that was a malfunction of some kind at the project end, and you should post about it right away so we can take a look at it. -- Alinator" Hi, @Ned: ...yes, sure there's an example for "#2": Work Unit ID: 140465139 The CPU-time of this WU was usual for my Mac (something around 14,000 sec - not shown in the WU data), BOINC-Client 5.10.10 for Mac, Mac OS X 10.4.10 for G4/G5-processors (= Darwin 8.10.0). Note: if you compare my result to the others you might find this "Mac Error -5000" in "stderr out" of the other participant using a Mac. Don't care about that. The new Boinc client for Mac generates two new user groups and I think it's just an access violation of stderr.txt that occurs using Alex Kan's (older) G4/G5-optimized Seti apps which do not support these changes yet - it has no influence on a successful outcome or validation of a result (I've got the same messages). @Alinator: ...ooops! Hopefully I didn't make a mistake. Though I'm crunching for years now, I'm not an expert on the deep secrets of the Seti computing processes. The errors you described are all time and technology related, caused by delays or real compute errors. This is not what happened here. I can only assume that there might(!) be a malfunction at Seti. Anyway, I'm glad if I can help to solve the issue. Stefan Stefan Greetings from Saarbrucken, Germany |
Stefan Send message Joined: 28 Oct 02 Posts: 13 Credit: 671,275 RAC: 0 |
Hi again, ...just noticed that the link in my last post is dead, sorry. Here we are: http://setiathome.berkeley.edu/workunit.php?wuid=140465139 Guess, that was my 8th post or so, ever...;-) Stefan Stefan Greetings from Saarbrucken, Germany |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14656 Credit: 200,643,578 RAC: 874 |
Are you absolutely, 100% sure that your result 569909333 is the one you spent all that time on? The result itself says 0 seconds, and it was reported just 20 seconds after result 569669991, which ran for 14,772.54 seconds and was awarded 61.63 credits. You may have to look back in your manager's message log and check the WU file names - see if you can find download / start / finish / upload messages for '29fe00aa.690.14112.959660.3.61' (the WU aborted by the server). |
Alinator Send message Joined: 19 Apr 05 Posts: 4178 Credit: 4,647,982 RAC: 0 |
I've looked over the host summary here as well, and tracking the return dates and deadlines of the results, it would seem unlikely a single CPU host would have started execution on the result which has been 221'ed. Also, the stderr listing which appears on the Result summary is only showing the start message the CC puts in when it orders the execution to begin initially, but there's no sign of any app generated messages acknowledging execution has begun like there should be if the result actually had been run at all. I'm not saying there isn't a problem here yet, only that I'm not seeing any conclusive evidence of it from where we're sitting. One thing to keep in mind, which makes it a little difficult to track things like this, is the CC references your work stream by the filename of the WU, whereas the project summary pages reference it by the WUID and RID. I've ended up chasing the wrong data when troubleshooting in the past due to this. HTH, Alinator <edit> @ Stefan: I think I have a way to test this to satisfy yourself. 1.) Set BOINC to disable all Network Communication. 2.) Manually suspend each task onboard in turn so that BOINC will force some execution to happen on every result. Make a note of all the WUID's for the results which you did this for and make sure to confirm the relationship between filename and WUID. 3.) Re-enable all the tasks so that BOINC will return to normal execution of the results. 4.) Re-enable Network access for BOINC. If I haven't messed up theoretically here, this should not cause scheduling jams based on the relatively short CI you seem to be running. However, if any of the WUID's you wrote down gets aborted, then you will have confirmed that a problem at the server exists. Alinator |
meshmar Send message Joined: 15 May 99 Posts: 6 Credit: 19,624,041 RAC: 42 |
I have a wu that was completed by one of my computers well before the deadline. It never was aborted; was uploaded with no problems ... and does not show up on my account at all. It is in my job_log (1184161907.234375 ue 28076.816934 ct 17450.250000 fe 68177929660454.297000 nm 10jn00ab.1781.13922.192314.3.152_0) - but NOTHING for credit or even mention in my account. Is this a bug in the new system - or some other issue ... |
Alinator Send message Joined: 19 Apr 05 Posts: 4178 Credit: 4,647,982 RAC: 0 |
What was the original sent date and how long did it run (approximately, or what was its AR if that was logged)? Short deadline results get purged from the database rather quickly nowadays once they are fully completed (due to a project side choice to keep the DB as small as possible). So it's possible you did get credit, but just didn't get a chance to see it listed as complete before it got purged. Also, you could check at BOINCStats. You should be able to track your credit history for the last sixty days there. Alinator |
meshmar Send message Joined: 15 May 99 Posts: 6 Credit: 19,624,041 RAC: 42 |
"What was the original sent date and how long did it run (approximately, or what was its AR if that was logged)?" Original sent date was somewhere in the 9-10 July time frame. It ran off and on and reported 12 July. An interesting point that may have some bearing ... I was issued another wu on 11 July that was due before the problematic wu. This other wu is showing with no problems. If you mean two days or less as rather quickly, I may have missed it, but for no record of it to even show up anywhere in my account seems a little odd. If xx wu have been completed - even if the details aren't kept - then I would expect to see xx wu. |
Alinator Send message Joined: 19 Apr 05 Posts: 4178 Credit: 4,647,982 RAC: 0 |
OK, one complicating factor to consider here is whether your result was part of the initial replication or was it a reissue for a failure of one of the other hosts? If it was a reissue, then it's possible for the WU to get purged even quicker than normal (6 to 12 hours after the last outstanding result is returned is not uncommon). Like I said before, it looks like you will have to use the third party sites to reconstruct the credit history for the host for the last week or so to verify this one way or the other. HTH, Alinator |
meshmar Send message Joined: 15 May 99 Posts: 6 Credit: 19,624,041 RAC: 42 |
It must have been purged very quickly - BoincStats shows 64 credits on 12 July, and I never saw it listed at S@H. I still think I should have an actual track record of completed wu - not just unpurged ones. I don't expect to have all the data available - just the fact that a wu was actually done/awarded credit would help prevent a misunderstanding like this. |
Alinator Send message Joined: 19 Apr 05 Posts: 4178 Credit: 4,647,982 RAC: 0 |
LOL... I hear ya! I like to log all my results in a database of my own and frequently have to delete records because results got purged from the project database if I don't get a chance to extract the data you can only get from the website at least every 12 hours or so. OTOH, we have all seen what happens when the BOINC database sees fit to get too unwieldy, and that can be pretty ugly at times. ;-) The bright side here is at least it looks like you did get credit for the result. :-) Alinator |
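
For anyone who wants to check a job_log entry like the one meshmar quoted, here is a minimal Python sketch that splits such a line into its fields. It assumes the line is a Unix timestamp followed by key/value pairs (`ue`, `ct`, `fe`, `nm`); reading `ct` as CPU time and `nm` as the result name is an assumption based on the quoted line, not something confirmed in this thread, and `parse_job_log_line` is a made-up helper name.

```python
from datetime import datetime, timezone

def parse_job_log_line(line):
    """Split one job_log line into a dict (hypothetical helper)."""
    parts = line.split()
    # First token: assumed to be the completion time as a Unix timestamp.
    entry = {"finished": datetime.fromtimestamp(float(parts[0]), tz=timezone.utc)}
    # Remaining tokens are read as key/value pairs: ue, ct, fe, nm.
    for key, value in zip(parts[1::2], parts[2::2]):
        entry[key] = value if key == "nm" else float(value)
    return entry

# The line quoted in meshmar's post.
line = ("1184161907.234375 ue 28076.816934 ct 17450.250000 "
        "fe 68177929660454.297000 nm 10jn00ab.1781.13922.192314.3.152_0")
entry = parse_job_log_line(line)
print(entry["finished"].isoformat(), entry["nm"], entry["ct"])
```

The `nm` value is the name the client uses locally, so it is the string to search for in the message log when matching a local task to a WUID on the website.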
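
The abort rules Ned and Alinator describe earlier in the thread can be written down as a small decision function. This is only a sketch of the behaviour as stated in their posts, not the BOINC scheduler's actual code; the function name, parameter names and return strings are all made up for illustration.

```python
def server_action(started, past_deadline, wu_validated):
    """Model what the posts say happens to an outstanding result.

    All names are hypothetical; this mirrors the thread, not BOINC source.
    """
    if not started:
        # Unstarted work is cancelled once the WU no longer needs it.
        return "abort (redundant)" if wu_validated else "keep"
    if past_deadline and wu_validated:
        # The 'Unconditional Abort': in progress, overdue, WU already validated.
        return "unconditional abort"
    # Started work that can still come back on time is left alone.
    return "keep"

# Stefan's case 2: started, on time, WU already validated by two other hosts.
print(server_action(started=True, past_deadline=False, wu_validated=True))
```

Under these rules Stefan's case 2 comes out as "keep", which is why both Ned and Alinator treat an abort in that situation as a sign of a project-side malfunction.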