Aborted by project?

Profile speedimic
Volunteer tester
Joined: 28 Sep 02
Posts: 362
Credit: 16,590,653
RAC: 0
Germany
Message 601899 - Posted: 11 Jul 2007, 18:19:11 UTC

ahhh - "client state" on the website!

LOL - I searched the complete boinc manager for that link....

Thanx Jim, Alinator!
mic.


ID: 601899 · Report as offensive
Profile Sutaru Tsureku
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 602072 - Posted: 11 Jul 2007, 22:46:54 UTC - in response to Message 601856.  

I've changed the text displayed for these results on the web site to make it clear that it shouldn't be considered an error.

Let me know what you think.

Eric



No more confusion with 'user aborted WUs'!

Much nicer now!


ID: 602072 · Report as offensive
Profile Sutaru Tsureku
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 602278 - Posted: 12 Jul 2007, 12:14:17 UTC - in response to Message 602072.  
Last modified: 12 Jul 2007, 12:17:30 UTC

I've changed the text displayed for these results on the web site to make it clear that it shouldn't be considered an error.

Let me know what you think.

Eric



No more confusion with 'user aborted WUs'!

Much nicer now!




There are two different 'Client states'?

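Result ID | Work unit ID | Sent | Time reported or deadline | Server state | Outcome | Client state | CPU time (sec) | Claimed credit | Granted credit
569572020 | 140356119 | 11 Jul 2007 16:35:11 UTC | 12 Jul 2007 0:17:00 UTC | Over | Redundant result | Cancelled by server | 0.00 | - | -
569572016 | 140356113 | 11 Jul 2007 16:35:11 UTC | 12 Jul 2007 1:19:49 UTC | Over | Redundant result | Done | 0.00 | - | -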


ID: 602278 · Report as offensive
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 602324 - Posted: 12 Jul 2007, 14:18:40 UTC
Last modified: 12 Jul 2007, 14:18:52 UTC

Yeah, that happens on 'regular' WU failures as well. Sometimes they will show as a 'Compute Error', and other times as 'Done'.

I guess it has to do with what the project gets back as a status word from the CC, and apparently it depends on what the failure is and how the app and/or the CC handles it as it exits or recovers from the failure.

Alinator
ID: 602324 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14656
Credit: 200,643,578
RAC: 874
United Kingdom
Message 602463 - Posted: 12 Jul 2007, 19:23:18 UTC

I had an 'aborted by project' showing in BoincView, which turned into a 'ready to report' when I had to restart BOINC for some reason before reporting. I think that one showed as 'Done' on the website too.

Testable hypothesis: it's the BOINC restart wot causes it?
ID: 602463 · Report as offensive
Profile Stefan
Joined: 28 Oct 02
Posts: 13
Credit: 671,275
RAC: 0
Germany
Message 603126 - Posted: 14 Jul 2007, 2:09:03 UTC

Hi folks,

...just a question about the new BOINC clients (5.10.x), to check whether I've got it right:

A workunit reported as "redundant result, cancelled by server" is of no use for the project and never gets any credit (due to 2 other valid results of the same WU reported earlier to Seti with similar outcome).

I read this thread carefully to find solutions to that issue. Reducing the cache to 0 was the best I found (and I did it), but I'm still wondering whether this can really avoid redundancies.

Imagine:

even if there's just one WU in my queue, couldn't it still happen that it becomes obsolete if the other 2 machines working exactly on the same WU finish their work earlier? So if I'm a "lucky" guy preferably catching the wrong (= "older") WU's from the project or just using slow machines, isn't there an increasing probability of producing large numbers of waste WU's the longer it takes for me to finish one?

It seems to me a bit like a lottery. Either you're always(!)...

- faster than the others (no doubt that's the way...;-) or
- you always catch the right WU's (brand-new of course - if only I could choose...:-) or
- your results (and your spare cycles) will be of no use to anyone.

I know looking at it that way is just the worst-case scenario - it won't turn out that bad, of course.

But is there a solution to that issue?


Thanks for any reply!

Stefan






Stefan
Greetings from Saarbrucken, Germany
ID: 603126 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 603131 - Posted: 14 Jul 2007, 2:15:46 UTC - in response to Message 603126.  

Hi folks,

...just a question about the new BOINC clients (5.10.x), to check whether I've got it right:

A workunit reported as "redundant result, cancelled by server" is of no use for the project and never gets any credit (due to 2 other valid results of the same WU reported earlier to Seti with similar outcome).

I read this thread carefully to find solutions to that issue. Reducing the cache to 0 was the best I found (and I did it), but I'm still wondering whether this can really avoid redundancies.

<snip>

Thanks for any reply!

Stefan

The "redundant, cancelled" work units don't get any credit, but they don't use any CPU time either -- If you started the work unit it won't be aborted, and you will get credit as long as it is delivered on-time.

If you are using a 5.10.x BOINC client, you can set a short connection interval (i.e. 0.25 days) and 10 days of additional cache -- your machine will connect 4 times each day, some work (that you haven't started) will get aborted because it's no longer needed, and BOINC will "top up" to keep about 10 days handy, just in case there is an outage.
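
As a rough sketch of the settings described above, the Python below writes the two values into the body of a preferences override file and works out how often the client would connect. The tag names `work_buf_min_days` and `work_buf_additional_days` are what I understand BOINC's `global_prefs_override.xml` to use, and the helper names are made up for illustration, so treat all of them as assumptions and check your client's documentation before relying on them.

```python
import xml.etree.ElementTree as ET


def build_override(connect_interval_days: float, extra_cache_days: float) -> str:
    """Return override XML text holding the two work-buffer preferences.

    Assumption: the element names below match the BOINC override file.
    """
    root = ET.Element("global_preferences")
    ET.SubElement(root, "work_buf_min_days").text = f"{connect_interval_days:g}"
    ET.SubElement(root, "work_buf_additional_days").text = f"{extra_cache_days:g}"
    return ET.tostring(root, encoding="unicode")


def contacts_per_day(connect_interval_days: float) -> float:
    """How many scheduler contacts per day a given interval implies."""
    return 1.0 / connect_interval_days


# Values taken from the post: 0.25 day interval, 10 days of additional cache.
xml_text = build_override(0.25, 10)
print(xml_text)
print(contacts_per_day(0.25), "contacts per day")
```

With a 0.25 day interval the client gets four chances a day to be told that unstarted work is no longer needed.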

Works very well.

-- Ned
ID: 603131 · Report as offensive
Profile Stefan
Joined: 28 Oct 02
Posts: 13
Credit: 671,275
RAC: 0
Germany
Message 603165 - Posted: 14 Jul 2007, 3:47:11 UTC - in response to Message 603131.  

Hi Ned,

...wow, didn't expect to wait only a minute! Thanks for your reply.

I just had a few of those "redundant" WU's and indeed, I didn't get any credit so far.

Note: I know there are two kinds of remotely aborted WU's; you mentioned the first one:

1) already cached WU's (downloaded but not in progress yet, returned with status "aborted by project", initial. Indeed, these are the "lossless" abortions, with no credit of course but also without waste of computing time)

2) finished WU's (though completed, nevertheless returned as third of two results, therefore status "redundant, aborted by project" - valid for older clients, finished in time, usual CPU-time, *no* credit, *no* validation from project, useless for Seti)

I posted to learn more about the second kind and how to avoid them. No kidding, these WU's exist - there's *no* credit provided for them and they're simply a waste of time for the project. I know this contradicts everything that's been claimed earlier in this thread.

Anyway, your suggestion (short connection interval, 10 days additional cache) is great; I guess I'll try it! I think you're right, that way it's much likelier to get more of the "good" abortions (the lossless type).

Thanks again!

Stefan


Stefan
Greetings from Saarbrucken, Germany
ID: 603165 · Report as offensive
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 603191 - Posted: 14 Jul 2007, 4:45:03 UTC
Last modified: 14 Jul 2007, 4:52:11 UTC

If I'm reading your posts correctly;

There should never be a case 2 as you described it. Regardless of CC version, if your host has started a result and returns it to the project on time you should get credit for it no matter what. This assumes it's at least weakly similar to the canonical result for the WU, of course.

The only case in which the project aborts a result in progress is called the 'Unconditional Abort', and it should only happen if your host is over the deadline and the WU has been validated. If your host is in this condition, then the next time it contacts the scheduler it will be issued an unconditional abort command. If the result had finished but was overdue, the scheduler will accept the report, but of course you don't get credit. If the WU is still in the BOINC database it will be marked 'Too late to validate'. The only exception to that is if there's no quorum yet and you beat the reissued result back.

If it aborts while in progress for any reason other than what I said above (barring a real compute error), that was a malfunction of some kind at the project end, and you should post about it right away so we can take a look at it.
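
The rules described in this thread can be summarised as a small decision function. This is only a sketch of the behaviour as posters describe it here, not the scheduler's actual code; the class, field and return-value names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ResultState:
    started: bool        # has the host begun crunching this result?
    past_deadline: bool  # is the report deadline already over?
    wu_validated: bool   # does the work unit already have a validated quorum?


def server_action(r: ResultState) -> str:
    """Return what the project is described as doing on the next scheduler contact."""
    if not r.wu_validated:
        # No quorum yet, so the result may still be needed.
        return "keep"
    if not r.started:
        # Redundant and untouched: cancelled with no CPU time lost.
        return "cancel_unstarted"
    if r.past_deadline:
        # Started, overdue, and no longer needed: the 'Unconditional Abort'.
        return "unconditional_abort"
    # Started and still on time: left alone, credit expected on return.
    return "keep"


print(server_action(ResultState(started=True, past_deadline=False, wu_validated=True)))
```

Under these rules a result that has been started and is still inside its deadline is never aborted, which is the point made above.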

HTH,

Alinator
ID: 603191 · Report as offensive
1mp0£173
Volunteer tester

Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 603200 - Posted: 14 Jul 2007, 5:12:45 UTC - in response to Message 603165.  

Hi Ned,

...wow, didn't expect to wait only a minute! Thanks for your reply.

I just had a few of those "redundant" WU's and indeed, I didn't get any credit so far.

Note: I know there are two kinds of remotely aborted WU's; you mentioned the first one:

1) already cached WU's (downloaded but not in progress yet, returned with status "aborted by project", initial. Indeed, these are the "lossless" abortions, with no credit of course but also without waste of computing time)

2) finished WU's (though completed, nevertheless returned as third of two results, therefore status "redundant, aborted by project" - valid for older clients, finished in time, usual CPU-time, *no* credit, *no* validation from project, useless for Seti)

I posted to learn more about the second kind and how to avoid them. No kidding, these WU's exist - there's *no* credit provided for them and they're simply a waste of time for the project. I know this contradicts everything that's been claimed earlier in this thread.

Anyway, your suggestion (short connection interval, 10 days additional cache) is great; I guess I'll try it! I think you're right, that way it's much likelier to get more of the "good" abortions (the lossless type).

Thanks again!

Stefan


Stefan,

Your #2 above does not exist. That was my main point.

BOINC does not abort work that has been started and will get credit, so there is no risk of losing a partial work unit that would have been returned on time or has a chance of being part of a Quorum.

If you have an example of your #2, I'd like to see a link....

-- Ned
ID: 603200 · Report as offensive
Profile Stefan
Joined: 28 Oct 02
Posts: 13
Credit: 671,275
RAC: 0
Germany
Message 603339 - Posted: 14 Jul 2007, 14:52:27 UTC - in response to Message 603200.  


Stefan,

Your #2 above does not exist. That was my main point.

BOINC does not abort work that has been started and will get credit, so there is no risk of losing a partial work unit that would have been returned on time or has a chance of being part of a Quorum.

If you have an example of your #2, I'd like to see a link....

-- Ned

If it aborts while in progress for any reason other than what I said above (barring a real compute error), that was a malfunction of some kind at the project end, and you should post about it right away so we can take a look at it.

-- Alinator

Hi,

@Ned:

...yes, sure there's an example for "#2":

Work Unit ID: 140465139

The CPU-time of this WU was usual for my Mac (something around 14,000 sec - not shown in the WU data), BOINC-Client 5.10.10 for Mac, Mac OS X 10.4.10 for G4/G5-processors (= Darwin 8.10.0).

Note: if you compare my result to the others you might find this "Mac Error -5000" in "stderr out" of the other participant using a Mac. Don't care about that. The new Boinc client for Mac generates two new user groups and I think it's just an access violation of stderr.txt that occurs using Alex Kan's (older) G4/G5-optimized Seti apps which do not support these changes yet - it has no influence on a successful outcome or validation of a result (I've got the same messages).

@Alinator:

...ooops! Hopefully I didn't make a mistake. Though I'm crunching for years now, I'm not an expert on the deep secrets of the Seti computing processes. The errors you described are all time and technology related, caused by delays or real compute errors. This is not what happened here. I can only assume that there might(!) be a malfunction at Seti. Anyway, I'm glad if I can help to solve the issue.

Stefan
Stefan
Greetings from Saarbrucken, Germany
ID: 603339 · Report as offensive
Profile Stefan
Joined: 28 Oct 02
Posts: 13
Credit: 671,275
RAC: 0
Germany
Message 603341 - Posted: 14 Jul 2007, 14:59:39 UTC - in response to Message 603339.  

Hi again,

...just noticed that the link in my last post is dead, sorry.

Here we are:

http://setiathome.berkeley.edu/workunit.php?wuid=140465139

Guess, that was my 8th post or so, ever...;-)

Stefan
Stefan
Greetings from Saarbrucken, Germany
ID: 603341 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14656
Credit: 200,643,578
RAC: 874
United Kingdom
Message 603360 - Posted: 14 Jul 2007, 15:35:22 UTC

Are you absolutely, 100% sure that your result 569909333 is the one you spent all that time on? The result itself says 0 seconds: and it was reported just 20 seconds after result 569669991, which ran for 14,772.54 seconds and was awarded 61.63 credits. You may have to look back in your manager's message log and check the WU file names - see if you can find download / start / finish / upload messages for '29fe00aa.690.14112.959660.3.61' (the WU aborted by the server).
ID: 603360 · Report as offensive
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 603433 - Posted: 14 Jul 2007, 16:55:41 UTC
Last modified: 14 Jul 2007, 17:20:31 UTC

I've looked over the host summary here as well, and tracking the return dates and deadlines of the results, it would seem unlikely a single CPU host would have started execution on the result which has been 221'ed.

Also, the stderr listing which appears on the Result summary is only showing the start message the CC puts in when it orders the execution to begin initially, but there's no sign of any app generated messages acknowledging execution has begun like there should be if the result actually had been run at all.

I'm not saying there isn't a problem here yet, only that I'm not seeing any conclusive evidence of it from where we're sitting. One thing to keep in mind, which makes it a little difficult to track things like this, is that the CC references your work stream by the filename of the WU, whereas the project summary pages reference it by the WUID and RID. I've ended up chasing the wrong data when troubleshooting in the past because of this.
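
To avoid mixing the two naming schemes up, it can help to keep a small lookup table from WU filename to the server-side IDs. The sketch below uses the one pairing that can be read out of the posts in this thread; the helper names are hypothetical, and the assumption that a task name is the WU filename plus an `_N` suffix is based only on the job_log entry quoted later in this thread.

```python
from typing import NamedTuple, Optional


class ServerIds(NamedTuple):
    wuid: int  # work unit ID shown on the project pages
    rid: int   # result ID shown on the project pages


# Pairing as read from the posts above (WU 140465139, result 569909333).
name_to_ids = {
    "29fe00aa.690.14112.959660.3.61": ServerIds(wuid=140465139, rid=569909333),
}


def wu_name_of(task_name: str) -> str:
    """Strip a trailing '_N' replica suffix, if present (assumed naming scheme)."""
    base, sep, suffix = task_name.rpartition("_")
    return base if sep and suffix.isdigit() else task_name


def lookup(task_name: str) -> Optional[ServerIds]:
    """Find the server-side IDs for a client-side task or WU name."""
    return name_to_ids.get(wu_name_of(task_name))


print(lookup("29fe00aa.690.14112.959660.3.61"))
```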

HTH,

Alinator

<edit> @ Stefan: I think I have a way to test this to satisfy yourself.

1.) Set BOINC to disable all Network Communication.

2.) Manually suspend each task onboard in turn so that BOINC will force some execution to happen on every result. Make a note of the WUIDs of all the results you did this for, and make sure to confirm the relationship between filename and WUID.

3.) Re-enable all the tasks so that BOINC will return to normal execution of the results.

4.) Re-enable Network access for BOINC.

If I haven't messed up theoretically here, this should not cause scheduling jams based on the relatively short CI you seem to be running. However, if any of the WUID's you wrote down gets aborted, then you will have confirmed that a problem at the server exists.
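
The final check in the procedure above comes down to comparing two lists. A minimal sketch, with a hypothetical helper name and made-up WUIDs used purely as placeholders:

```python
def aborted_after_start(noted_wuids, aborted_wuids):
    """Return the WUIDs that were started (and noted) but later aborted."""
    return sorted(set(noted_wuids) & set(aborted_wuids))


# Placeholder IDs for illustration only; substitute the WUIDs you wrote down
# in step 2 and the ones the website later shows as aborted by the project.
noted = [1001, 1002, 1003]
aborted = [1003, 2000]

suspicious = aborted_after_start(noted, aborted)
print(suspicious if suspicious else "no started result was aborted")
```

Any WUID in the returned list would be a started result that was aborted anyway, which is the case worth reporting.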

Alinator
ID: 603433 · Report as offensive
Profile meshmar

Joined: 15 May 99
Posts: 6
Credit: 19,624,041
RAC: 42
United States
Message 603453 - Posted: 14 Jul 2007, 17:32:33 UTC

I have a wu that was completed by one of my computers well before the deadline. It never was aborted; was uploaded with no problems ... and does not show up on my account at all. It is in my job_log (1184161907.234375 ue 28076.816934 ct 17450.250000 fe 68177929660454.297000 nm 10jn00ab.1781.13922.192314.3.152_0) - but NOTHING for credit or even mention in my account. Is this a bug in the new system - or some other issue ...
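
For anyone wanting to pull entries like this out of their own job_log, here is a minimal parsing sketch. It treats the line as a timestamp followed by key/value pairs; reading `ct` as CPU time, `ue` and `fe` as estimates and `nm` as the task name is my assumption, not something confirmed in this thread, and the function name is hypothetical.

```python
from datetime import datetime, timezone


def parse_job_log_line(line: str) -> dict:
    """Split one job_log line into a completion time plus key/value fields."""
    tokens = line.split()
    record = {
        # First token is taken to be a Unix timestamp in seconds.
        "completed": datetime.fromtimestamp(float(tokens[0]), tz=timezone.utc),
    }
    # Remaining tokens alternate: key, value, key, value, ...
    for key, value in zip(tokens[1::2], tokens[2::2]):
        record[key] = value if key == "nm" else float(value)
    return record


line = ("1184161907.234375 ue 28076.816934 ct 17450.250000 "
        "fe 68177929660454.297000 nm 10jn00ab.1781.13922.192314.3.152_0")
rec = parse_job_log_line(line)
print(rec["completed"].isoformat(), rec["nm"], rec["ct"])
```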
ID: 603453 · Report as offensive
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 603461 - Posted: 14 Jul 2007, 17:48:01 UTC
Last modified: 14 Jul 2007, 17:50:12 UTC

What was the original sent date and how long did it run (approximately, or what was its AR if that was logged)?

Short deadline results get purged from the database rather quickly nowadays once they are fully completed (due to a project side choice to keep the DB as small as possible).

So it's possible you did get credit, but just didn't get a chance to see it listed as complete before it got purged.

Also, you could check at BOINCStats. You should be able to track your credit history for the last sixty days there.

Alinator
ID: 603461 · Report as offensive
Profile meshmar

Joined: 15 May 99
Posts: 6
Credit: 19,624,041
RAC: 42
United States
Message 603476 - Posted: 14 Jul 2007, 18:15:37 UTC - in response to Message 603461.  

What was the original sent date and how long did it run (approximately, or what was its AR if that was logged)?

Short deadline results get purged from the database rather quickly nowadays once they are fully completed (due to a project side choice to keep the DB as small as possible).

So it's possible you did get credit, but just didn't get a chance to see it listed as complete before it got purged.

Also, you could check at BOINCStats. You should be able to track your credit history for the last sixty days there.

Alinator


Original sent date was somewhere in the 9-10 July time frame. It ran off and on and reported 12 July. An interesting point that may have some bearing ... I was issued another wu on 11 July that was due before the problematic wu. This other wu is showing with no problems.

If you mean two days or less as rather quickly, I may have missed it, but for no record of it to even show up anywhere in my account seems a little odd. If xx wu have been completed - even if the details aren't kept - then I would expect to see xx wu.

ID: 603476 · Report as offensive
Alinator
Volunteer tester

Send message
Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 603481 - Posted: 14 Jul 2007, 18:24:30 UTC
Last modified: 14 Jul 2007, 18:27:13 UTC

OK, one complicating factor to consider here is whether your result was part of the initial replication or was it a reissue for a failure of one of the other hosts?

If it was a reissue, then it's possible for the WU to get purged even quicker than normal (6 to 12 hours after the last outstanding result is returned is not uncommon).

Like I said before, it looks like you will have to use the third party sites to reconstruct the credit history for the host for the last week or so to verify this one way or the other.

HTH,

Alinator
ID: 603481 · Report as offensive
Profile meshmar

Joined: 15 May 99
Posts: 6
Credit: 19,624,041
RAC: 42
United States
Message 603484 - Posted: 14 Jul 2007, 18:36:50 UTC

It must have been purged very quickly - BoincStats shows 64 credits on 12 July, and I never saw it listed at S@H. I still think I should have an actual track record of completed wu - not just unpurged ones. I don't expect to have all the data available - just the fact that a wu was actually done/awarded credit would help prevent a misunderstanding like this.
ID: 603484 · Report as offensive
Alinator
Volunteer tester

Joined: 19 Apr 05
Posts: 4178
Credit: 4,647,982
RAC: 0
United States
Message 603499 - Posted: 14 Jul 2007, 19:10:52 UTC
Last modified: 14 Jul 2007, 19:11:26 UTC

LOL...

I hear ya! I log all my results in a database of my own, and I frequently have to delete records because the results were purged from the project database before I could extract the data that is only available from the website. That happens whenever I don't get a chance to check at least every 12 hours or so.

OTOH, we have all seen what happens when the BOINC database sees fit to get too unwieldy, and that can be pretty ugly at times. ;-)

The bright side here is that at least it looks like you did get credit for the result. :-)

Alinator
ID: 603499 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.