Work Unit problem

Message boards : Number crunching : Work Unit problem
H Elzinga
Volunteer tester

Joined: 20 Aug 99
Posts: 125
Credit: 8,277,116
RAC: 0
Netherlands
Message 623857 - Posted: 21 Aug 2007, 7:44:54 UTC - in response to Message 623772.  

I have been receiving WUs with to-completion times of 119+ hours. However, as the WU is crunched, the to-completion time drops dramatically. It ends up taking only 5 or 6 hours to process.

If you've run one of the nasty WUs to completion, BOINC will have adjusted your DCF (Duration Correction Factor) to match its extended time. It will gradually come down as you do normal work, taking about 20 to 30 WUs to get back to normal.

If you want to fix it quickly, you can close down BOINC, find the <duration_correction_factor> entry for <project_name>SETI@home in the client_state.xml file, and edit it. Given a shown estimate of 119 hours for an unstarted WU which will actually take 5 hours, multiply the value by 5/119, or 0.042. That should be close enough: if it's slightly too low, it will fully correct when one WU completes; if it's slightly high, it will creep down as usual.
Joe
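
For anyone who would rather not do the arithmetic by hand, here is a minimal sketch of the calculation described above. It works only on a text fragment, not on the real file. The helper names, the sample fragment, and the starting value of 23.8 are hypothetical, and the layout of client_state.xml is assumed rather than taken from BOINC documentation. A real file can hold several projects, so the SETI@home block would have to be located first, and BOINC should be shut down before the file is touched.

```python
import re

def scaled_dcf(current_dcf, shown_hours, actual_hours):
    """Scale the DCF by actual/shown, e.g. 5/119 for the case in this thread."""
    return current_dcf * actual_hours / shown_hours

def replace_dcf(xml_text, new_dcf):
    """Rewrite the first <duration_correction_factor> value in a text fragment."""
    pattern = r"(<duration_correction_factor>)[^<]*(</duration_correction_factor>)"
    return re.sub(
        pattern,
        lambda m: f"{m.group(1)}{new_dcf:.6f}{m.group(2)}",
        xml_text,
        count=1,
    )

# Hypothetical fragment; the real file has many more elements per project.
fragment = (
    "<project_name>SETI@home</project_name>\n"
    "<duration_correction_factor>23.800000</duration_correction_factor>\n"
)

new_value = scaled_dcf(23.8, shown_hours=119, actual_hours=5)
print(replace_dcf(fragment, new_value))
```

The multiplier alone is scaled_dcf(1.0, 119, 5), which is the 5/119 factor quoted above.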


I also did this fix and was "rewarded" with a huge increase in computing time.
What I noticed was that one slow unit (that's all I have had until now) could raise the time instantly.
Correctly processed units seem to have only a minor influence on scaling the time down.
Is this a design fault, or a feature whose logic I fail to see?
ID: 623857
CougarKy
Joined: 20 Aug 01
Posts: 5
Credit: 4,076,741
RAC: 1
United States
Message 623901 - Posted: 21 Aug 2007, 10:52:50 UTC - in response to Message 623772.  

Thank you for the help.
ID: 623901
Jim-R.
Volunteer tester
Joined: 7 Feb 06
Posts: 1494
Credit: 194,148
RAC: 0
United States
Message 623925 - Posted: 21 Aug 2007, 12:11:14 UTC - in response to Message 623857.  

It is a feature. The reason is that the estimated time to completion is supposed to be slightly on the high side, so that you don't download a bunch of work that you can't finish before the deadline.

This has actually happened. We have had "runs" of various angle ranges which take different times to complete. When we have a run of angle ranges with very short running times, the Duration Correction Factor (DCF) will slowly drop. With a long run of these, it can get "used" to the low crunch times and start downloading more work to compensate. Then a "run" of work with very long running times may come along. BOINC thinks that these will run for approximately the same time as the others, so it downloads a bunch of them. This results in your computer going into "Earliest Deadline First" (panic) mode, ignoring everything else just to get these work units crunched.

If the DCF were to decrease immediately upon completing one of these very short running time units, it would immediately download more work and possibly end up in EDF (Earliest deadline first) mode. So it's designed to recover quickly from a low value by jumping immediately to the value of a longer running unit, but decrease slowly from the longer times to shorter ones.
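
The asymmetry can be sketched in a few lines. This is only an illustration of the behaviour described above, not the actual BOINC client code: the function name and the 0.1 smoothing weight are assumptions chosen to show the shape of the response, and the ratio of 23.8 is just the 119-hour versus 5-hour example from earlier in the thread.

```python
def update_dcf(dcf, ratio, weight=0.1):
    """Return the new DCF after one WU finishes.

    ratio is the actual run time divided by the uncorrected estimate.
    weight is a hypothetical smoothing factor, not a documented constant.
    """
    if ratio > dcf:
        # A WU that ran longer than expected: jump straight up to it.
        return ratio
    # A WU that ran shorter than expected: move only part of the way down.
    return (1.0 - weight) * dcf + weight * ratio

dcf = update_dcf(1.0, 23.8)      # one very slow WU
history = [dcf]
for _ in range(5):               # followed by five normal WUs
    dcf = update_dcf(dcf, 1.0)
    history.append(dcf)

print([round(v, 2) for v in history])
```

One slow WU moves the factor all the way up in a single step, while each normal WU afterwards removes only a fraction of the excess, which is why the estimates take many WUs to settle.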
Jim

Some people plan their life out and look back at the wealth they've had.
Others live life day by day and look back at the wealth of experiences and enjoyment they've had.
ID: 623925
H Elzinga
Volunteer tester
Joined: 20 Aug 99
Posts: 125
Credit: 8,277,116
RAC: 0
Netherlands
Message 623945 - Posted: 21 Aug 2007, 14:17:38 UTC - in response to Message 623925.  

I see.

The client, completely unaware of the error, assumes this is the first of a set of similar (long) units.
ID: 623945
Jim-R.
Volunteer tester
Joined: 7 Feb 06
Posts: 1494
Credit: 194,148
RAC: 0
United States
Message 623962 - Posted: 21 Aug 2007, 14:50:58 UTC - in response to Message 623945.  
Last modified: 21 Aug 2007, 14:52:49 UTC


Exactly, so it will take a while to get the estimated time back down to normal. That's why editing the client_state.xml file was suggested.
Jim

Some people plan their life out and look back at the wealth they've had.
Others live life day by day and look back at the wealth of experiences and enjoyment they've had.
ID: 623962
HTH
Volunteer tester
Joined: 8 Jul 00
Posts: 691
Credit: 909,237
RAC: 0
Finland
Message 624243 - Posted: 22 Aug 2007, 6:26:31 UTC
Last modified: 22 Aug 2007, 6:27:46 UTC

WU: 147512707.

0.26 cobblestones? Is this correct? The third guy didn't get credit at all. What's wrong? It is the WU that crunched very very slowly.

Manned mission to Mars in 2019 Petition <-- Sign this, please.
ID: 624243
bounty.hunter
Volunteer tester
Joined: 22 Mar 04
Posts: 442
Credit: 459,063
RAC: 0
India
Message 624260 - Posted: 22 Aug 2007, 8:16:48 UTC - in response to Message 624243.  

The third guy aborted the WU manually.
ID: 624260
mdpagel
Joined: 18 Sep 99
Posts: 53
Credit: 2,619,543
RAC: 0
United States
Message 624269 - Posted: 22 Aug 2007, 8:41:18 UTC

http://setiathome.berkeley.edu/workunit.php?wuid=147603991

This was the first of 3 units that were taking 24 hours to process without actually completing. My typical runtime on a unit is 1.5 hrs. I had actually chalked it up to signing up for E@h and getting the executables somehow mangled in the memory of BOINC, so I detached from E@h and deleted my S@h executable - which of course screwed up the execution of other WUs.

In any event, only one user claims to have processed that unit, and is making claims for other WUs on the order of 90 cobblestones. He's using a client for Darwin. Is there any chance that the main Windows app has a bug that the Darwin one doesn't?
ID: 624269
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 624361 - Posted: 22 Aug 2007, 15:53:31 UTC - in response to Message 624269.  

That Mac did manage to get to Pulse overflow before BOINC killed the task for Maximum CPU time exceeded. I think that's mainly because the BOINC benchmarks for those quad systems are not higher by nearly as much as their actual capability to crunch SETI work is. That puts the maximum time limit relatively further out. The other possibility is that the compiler for those Mac builds may be producing more efficient code for the triplet finding loop.

The high claims are of course because the stock Mac builds have the 3.81 multiplier from Beta.
Joe
ID: 624361
Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 626237 - Posted: 25 Aug 2007, 8:33:38 UTC
Last modified: 25 Aug 2007, 8:34:09 UTC

Is this a 'bad' WU too?

It had been running ~1.5 hours and was at ~15% (not stopped!), with ~2.5 hours remaining.

(Normally my PC needs ~1.5 hours for this AR.)

New Rev. 2.4 from Crunch3r.

http://setiathome.berkeley.edu/workunit.php?wuid=149328091
ID: 626237
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 626476 - Posted: 25 Aug 2007, 17:35:28 UTC - in response to Message 626237.  



It was created 19 Aug 2007 9:38:04 UTC, long after the splitter problem was cured. Another (wuid 149328099) from the same splitter group processed normally, so the thresholds are almost certainly correct. Looking in the WU or result would of course provide the best evidence, if you saved information before aborting.
Joe
ID: 626476
[B^S] madmac
Volunteer tester
Joined: 9 Feb 04
Posts: 1175
Credit: 4,754,897
RAC: 0
United Kingdom
Message 627080 - Posted: 26 Aug 2007, 14:17:10 UTC

I too have got another one, 04mr07ab.10282.4980.3.4.87_2, and I know it is a -9 one: it has been going for 2 hrs and is only at 0.06 again. I will leave it until 16:00 BST and then abort it. Sorry to the other person waiting on this.
ID: 627080
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 627084 - Posted: 26 Aug 2007, 14:26:15 UTC - in response to Message 627080.  
Last modified: 26 Aug 2007, 14:41:10 UTC

That one ran for just over 3 hours on a P4 2.4GHz - a bit slower than yours.

If you could bear to run it for just a little bit longer, you could kill it for good - seems a shame not to put it out of its misery, now you've already spent so much time on it.

Edit - I should have commented on it being a 'past deadline' re-issue. D**n. We could be seeing a lot of these - all hands to the boards!
ID: 627084
top1214
Joined: 18 Oct 06
Posts: 1
Credit: 44,898
RAC: 0
United States
Message 627141 - Posted: 26 Aug 2007, 16:36:55 UTC

I suspended my latest bad task (04mr07ab.7106.5798.10.4.216_3) as soon as I got it. SETI isn't sending me any more work though. Is that normal for task suspension?
ID: 627141
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 627147 - Posted: 26 Aug 2007, 16:48:23 UTC - in response to Message 627141.  
Last modified: 26 Aug 2007, 16:48:43 UTC

Yes it is, and I see you've had the WU for over a week - pity you didn't ask earlier.

If you feel up to performing Joe Segur's "technically adventurous" surgery described in this post, you could get it to run very quickly to completion - since it's already been processed by someone else, that would get rid of it for good. But you'll have to be quick: the deadline expires in under an hour, and after that it'll be put in the queue for issuing to someone else.

If you don't feel that adventurous, just abort it now - it's so close to deadline that it'll hardly make any difference.
ID: 627147
[B^S] madmac
Volunteer tester
Joined: 9 Feb 04
Posts: 1175
Credit: 4,754,897
RAC: 0
United Kingdom
Message 627230 - Posted: 26 Aug 2007, 18:48:24 UTC - in response to Message 627080.  

In the end it took 2 hrs 45 mins to complete, and that is with the latest version of chicken.

ID: 627230
Jesse Viviano
Joined: 27 Feb 00
Posts: 100
Credit: 3,949,583
RAC: 0
United States
Message 627595 - Posted: 27 Aug 2007, 3:01:26 UTC - in response to Message 621449.  

For those of you technically adventurous, here's another option to handle the WUs with negative triplet threshold.

The negative threshold means no triplets can possibly be found; a high enough positive threshold has the same effect. But the very high positive threshold doesn't make crunching slow. It actually makes it slightly faster than a normal threshold, because triplet finding has less work to do. So the workaround is:

1. Ensure the WU is not in use by shutting down BOINC. (If your preferences are to have suspended work removed from memory, then Suspending the work would be enough.)
2. Open the WU in an editor. NOT a word processor or anything else which may change more than you intend.
3. Find the <triplet_thresh>-x.xxxxxx</triplet_thresh> line.
4. Change it to <triplet_thresh>99</triplet_thresh> .
5. Save the WU file.
6. Restart BOINC or Resume the WU.
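
As a sketch of steps 3 and 4 only, the substitution can be expressed as a single regular expression over the WU text. The function name and the sample string are hypothetical, reading and writing the actual WU file is deliberately left out, and the caveats given in this post about modifying WUs apply just as much to doing it with a script as with an editor.

```python
import re

def raise_triplet_threshold(wu_text, new_value="99"):
    """Replace a negative <triplet_thresh> value with new_value.

    Non-negative thresholds are left untouched, so a normal WU passes
    through unchanged.
    """
    pattern = r"(<triplet_thresh>)\s*-[0-9.]+\s*(</triplet_thresh>)"
    return re.sub(pattern, lambda m: m.group(1) + new_value + m.group(2), wu_text)

# Hypothetical one-line sample standing in for the relevant WU header line.
sample = "<triplet_thresh>-1.234567</triplet_thresh>"
print(raise_triplet_threshold(sample))
```

Matching only values with a leading minus sign keeps the change limited to the faulty WUs discussed here.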

That WU may not start running. You could force it by Suspending others, or simply let BOINC get to it whenever.

When it does run, it is likely to overflow on Pulses. The result should match that from someone who has allowed the WU to creep to that naturally. Because of the high threshold your credit claim will be lower, but probably by less than 1 cobblestone.

I did this to all 4 I had: wuid 148063182 has validated against a full run; wuid 148063170, wuid 148063178, and wuid 148140560 don't have other completed work yet.

I am not urging anyone else to try this, and can't think of another situation in which I'd consider modifying a WU. Normally any change to a WU would lead to an invalid result; this is a very unusual exception. Even so, I considered long and thoroughly before posting this, and will not be at all unhappy if someone from Berkeley decides to hide this post.
Joe

I am not sure if this is a good idea. If a work unit is discarded due to too many errors, this might notify the administrators that something needs to be done about the problem WU. Once Eric finally fixes the splitter (what the admins did looks like a band-aid on code that was not their specialty, so their patch may have broken the splitter in a way that they might not have seen), he will know that this work unit errored out, and can have it resplit with the corrected splitter. If you modify the work unit, this flag might not be generated.
ID: 627595
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.