Work Unit problem

Author	Message
H Elzinga Volunteer tester Send message Joined: 20 Aug 99 Posts: 125 Credit: 8,277,116 RAC: 0	Message 623857 - Posted: 21 Aug 2007, 7:44:54 UTC - in response to Message 623772.
I have been receiving WUs with "to completion" times of 119+ hours. However, as the WU is crunched, the to-completion time drops dramatically. It ends up taking only 5 or 6 hours to process.
If you've run one of the nasty WUs to completion, BOINC will have adjusted your DCF (Duration Correction Factor) to match its extended time. It will gradually come down as you do normal work, taking about 20 to 30 WUs to get back to normal. If you want to fix it quickly, you can close down BOINC, find the <duration_correction_factor> entry for <project_name>SETI@home in the client_state.xml file, and edit it. Given a shown estimate of 119 hours for an unstarted WU which will actually take 5 hours, multiply the value by 5/119, or 0.042. That should be close enough: if it's slightly too low it will fully correct when one WU completes; if it's slightly high it will creep down as usual. Joe
I also did this fix and was "rewarded" with a huge increase in computing time. What I noticed was that one slow unit (that's all I have had until now) could raise the time instantly. Correctly processed units seem to have only a minor influence on scaling the time down. Is this a design fault, or a feature whose logic I fail to see? ID: 623857 ·
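
The correction Joe describes is a single multiplication. A minimal sketch of the arithmetic, assuming the estimate shown for an unstarted WU scales linearly with the DCF; the function name `corrected_dcf` is hypothetical and not part of BOINC:

```python
def corrected_dcf(current_dcf: float, shown_hours: float, actual_hours: float) -> float:
    """Scale the stored DCF so the shown estimate matches the real run time.

    Assumption: the estimate shown for an unstarted WU is proportional to the DCF.
    """
    if shown_hours <= 0 or actual_hours <= 0:
        raise ValueError("hours must be positive")
    return current_dcf * (actual_hours / shown_hours)


# Example from the thread: 119 hours shown, about 5 hours actual.
scale = corrected_dcf(1.0, 119, 5)
print(round(scale, 3))  # 0.042
```

The result is the value to write back into the <duration_correction_factor> entry while BOINC is shut down; as Joe notes, it only needs to be roughly right.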

CougarKy Send message Joined: 20 Aug 01 Posts: 5 Credit: 4,076,741 RAC: 1	Message 623901 - Posted: 21 Aug 2007, 10:52:50 UTC - in response to Message 623772. Thank you for the help. ID: 623901 ·

Jim-R. Volunteer tester Send message Joined: 7 Feb 06 Posts: 1494 Credit: 194,148 RAC: 0	Message 623925 - Posted: 21 Aug 2007, 12:11:14 UTC - in response to Message 623857.
It is a feature. The estimated time to completion is supposed to be slightly on the high side so that you don't download a bunch of work that you can't finish before the deadline. This has actually happened. We have had "runs" of various angle ranges which take different times to complete. When we have a run of angle ranges with very short running times, the Duration Correction Factor (DCF) will slowly drop. With a long run of these, it can get "used" to the low crunch times and start downloading more work to compensate. Then a "run" of work with very long running times may come along.
BOINC thinks that these will run in approximately the same time as the others, so it downloads a bunch of them. This results in your computer going into "Earliest Deadline First" (EDF, or panic) mode, ignoring everything else just to get these work units crunched. If the DCF were to decrease immediately upon completing one very short unit, BOINC would immediately download more work and possibly end up in EDF mode. So it's designed to recover quickly from a low value by jumping immediately to the value of a longer-running unit, but to decrease slowly from the longer times to shorter ones. Jim Some people plan their life out and look back at the wealth they've had. Others live life day by day and look back at the wealth of experiences and enjoyment they've had. ID: 623925 ·
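
The asymmetry Jim describes (jump up at once, creep down slowly) can be sketched as a one-sided smoothing rule. This is an illustration only: the function name `update_dcf` and the 0.1 smoothing weight are assumptions made for the example, not values taken from the BOINC source.

```python
def update_dcf(dcf: float, estimated_hours: float, actual_hours: float,
               down_weight: float = 0.1) -> float:
    """Return a new DCF after one completed work unit.

    estimated_hours is the uncorrected estimate (before the DCF is applied).
    down_weight is a hypothetical smoothing weight for downward moves.
    """
    ratio = actual_hours / estimated_hours
    if ratio > dcf:
        # A slow unit: adopt its ratio immediately to avoid over-fetching work.
        return ratio
    # A fast unit: move only part of the way down.
    return (1 - down_weight) * dcf + down_weight * ratio


dcf = 1.0
dcf = update_dcf(dcf, estimated_hours=5, actual_hours=119)  # one bad WU
print(dcf)  # 23.8
for _ in range(30):  # thirty normal WUs
    dcf = update_dcf(dcf, estimated_hours=5, actual_hours=5)
print(dcf > 1.0)  # True: still above normal, approaching 1.0 from above
```

Under this rule a single slow unit sets the estimate at once, while each normal unit removes only a fraction of the excess, which is the behaviour reported earlier in the thread.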

H Elzinga Volunteer tester Send message Joined: 20 Aug 99 Posts: 125 Credit: 8,277,116 RAC: 0	Message 623945 - Posted: 21 Aug 2007, 14:17:38 UTC - in response to Message 623925. I see. The client, completely unaware of the error, assumes this is the first of a set of similar (long) units. ID: 623945 ·

Jim-R. Volunteer tester Send message Joined: 7 Feb 06 Posts: 1494 Credit: 194,148 RAC: 0	Message 623962 - Posted: 21 Aug 2007, 14:50:58 UTC - in response to Message 623945. Last modified: 21 Aug 2007, 14:52:49 UTC
I See. The client completely unaware of the error assumes this is the first one of a set of similar (long) units.
Exactly, so it will take a while to get the estimated time back down to normal. That's why editing the client_state.xml file was suggested. Jim Some people plan their life out and look back at the wealth they've had. Others live life day by day and look back at the wealth of experiences and enjoyment they've had. ID: 623962 ·

HTH Volunteer tester Send message Joined: 8 Jul 00 Posts: 691 Credit: 909,237 RAC: 0	Message 624243 - Posted: 22 Aug 2007, 6:26:31 UTC Last modified: 22 Aug 2007, 6:27:46 UTC WU: 147512707. 0.26 cobblestones? Is this correct? The third guy didn't get credit at all. What's wrong? It is the WU that crunched very very slowly. Manned mission to Mars in 2019 Petition <-- Sign this, please. ID: 624243 ·

bounty.hunter Volunteer tester Send message Joined: 22 Mar 04 Posts: 442 Credit: 459,063 RAC: 0	Message 624260 - Posted: 22 Aug 2007, 8:16:48 UTC - in response to Message 624243. WU: 147512707. 0.26 cobblestones? Is this correct? The third guy didn't get credit at all. What's wrong? It is the WU that crunched very very slowly. The third guy aborted the WU manually. ID: 624260 ·

mdpagel Send message Joined: 18 Sep 99 Posts: 53 Credit: 2,619,543 RAC: 0	Message 624269 - Posted: 22 Aug 2007, 8:41:18 UTC http://setiathome.berkeley.edu/workunit.php?wuid=147603991 this was the first of 3 units that was taking 24 hours to process without actually completing itself. My typical runtime on a unit is 1.5 hrs. I actually had chalked it up to signing up for E@h and getting the executables somehow mangled in the memory of BOINC, so I detached from E@h and deleted my S@h executable - which of course screwed up the execution of other WUs. In any event, only one user claims to have processed that unit, and is making claims for other WUs along the order of 90 cobblestones. He's using a client for Darwin. Is there any chance that the main windows app has a bug that Darwin doesn't? ID: 624269 ·

Josef W. Segur Volunteer developer Volunteer tester Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0	Message 624361 - Posted: 22 Aug 2007, 15:53:31 UTC - in response to Message 624269.
http://setiathome.berkeley.edu/workunit.php?wuid=147603991 this was the first of 3 units that was taking 24 hours to process without actually completing itself. My typical runtime on a unit is 1.5 hrs. I actually had chalked it up to signing up for E@h and getting the executables somehow mangled in the memory of BOINC, so I detached from E@h and deleted my S@h executable - which of course screwed up the execution of other WUs. In any event, only one user claims to have processed that unit, and is making claims for other WUs along the order of 90 cobblestones. He's using a client for Darwin. Is there any chance that the main windows app has a bug that Darwin doesn't?
That Mac did manage to get to Pulse overflow before BOINC killed the task for Maximum CPU time exceeded. I think that's mainly because the BOINC benchmarks for those quad systems understate how much faster they actually crunch SETI work, which puts the maximum time limit relatively further out. The other possibility is that the compiler for those Mac builds produces more efficient code for the triplet-finding loop. The high claims are of course because the stock Mac builds have the 3.81 multiplier from Beta. Joe ID: 624361 ·

Sutaru Tsureku Volunteer tester Send message Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5	Message 626237 - Posted: 25 Aug 2007, 8:33:38 UTC Last modified: 25 Aug 2007, 8:34:09 UTC Is this a 'bad' WU too? It had been running for ~1.5 hours and was at ~15% (not stopped!), with ~2.5 hours remaining. (Normally my PC needs ~1.5 hours for this AR.) This is with the new Rev. 2.4 from Crunch3r. http://setiathome.berkeley.edu/workunit.php?wuid=149328091 ID: 626237 ·

Josef W. Segur Volunteer developer Volunteer tester Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0	Message 626476 - Posted: 25 Aug 2007, 17:35:28 UTC - in response to Message 626237. Is this a 'bad' WU too? It was running ~ 1.5 hours, it was at ~ 15 % (not stopped!), ~ 2.5 hours (remaining time) (Normally my PC need ~ 1.5 hours for this AR..) New Rev. 2.4 from Crunch3r.. http://setiathome.berkeley.edu/workunit.php?wuid=149328091 It was created 19 Aug 2007 9:38:04 UTC, long after the splitter problem was cured. Another (wuid 149328099) from the same splitter group processed normally, so the thresholds are almost certainly correct. Looking in the WU or result would of course provide the best evidence, if you saved information before aborting. Joe ID: 626476 ·

[B^S] madmac Volunteer tester Send message Joined: 9 Feb 04 Posts: 1175 Credit: 4,754,897 RAC: 0	Message 627080 - Posted: 26 Aug 2007, 14:17:10 UTC I too have got another one, 04mr07ab.10282.4980.3.4.87_2, and I know it is a -9 one: it has been going for 2 hrs and is at only 0.06 again. I will leave it until 16:00 BST and then abort it. Sorry to the other person waiting on this. ID: 627080 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14661 Credit: 200,643,578 RAC: 874	Message 627084 - Posted: 26 Aug 2007, 14:26:15 UTC - in response to Message 627080. Last modified: 26 Aug 2007, 14:41:10 UTC I too have got another one 04mr07ab.10282.4980.3.4.87_2 and I know it is a -9 one going 2hrs and only 0.06 again. Will leave it to 16:00 BST and then abort it sorry to the other person waiting on this. That one ran for just over 3 hours on a P4 2.4GHz - a bit slower than yours. If you could bear to run it for just a little bit longer, you could kill it for good - seems a shame not to put it out of its misery, now you've already spent so much time on it. Edit - I should have commented on it being a 'past deadline' re-issue. D**n. We could be seeing a lot of these - all hands to the boards! ID: 627084 ·

top1214 Send message Joined: 18 Oct 06 Posts: 1 Credit: 44,898 RAC: 0	Message 627141 - Posted: 26 Aug 2007, 16:36:55 UTC I suspended my latest bad task (04mr07ab.7106.5798.10.4.216_3) as soon as I got it. SETI isn't sending me any more work though. Is that normal for task suspension? ID: 627141 ·

Richard Haselgrove Volunteer tester Send message Joined: 4 Jul 99 Posts: 14661 Credit: 200,643,578 RAC: 874	Message 627147 - Posted: 26 Aug 2007, 16:48:23 UTC - in response to Message 627141. Last modified: 26 Aug 2007, 16:48:43 UTC I suspended my latest bad task (04mr07ab.7106.5798.10.4.216_3) as soon as I got it. SETI isn't sending me any more work though. Is that normal for task suspension? Yes it is, and I see you've had the WU for over a week - pity you didn't ask earlier. If you feel up to performing Joe Segur's "technically adventurous" surgery described in this post, you could get it to run very quickly to completion - since it's already been processed by someone else, that would get rid of it for good. But you'll have to be quick: the deadline expires in under an hour, and after that it'll be put in the queue for issuing to someone else. If you don't feel that adventurous, just abort it now - it's so close to deadline that it'll hardly make any difference. ID: 627147 ·

[B^S] madmac Volunteer tester Send message Joined: 9 Feb 04 Posts: 1175 Credit: 4,754,897 RAC: 0	Message 627230 - Posted: 26 Aug 2007, 18:48:24 UTC - in response to Message 627080.
I too have got another one 04mr07ab.10282.4980.3.4.87_2 and I know it is a -9 one going 2hrs and only 0.06 again. Will leave it to 16:00 BST and then abort it sorry to the other person waiting on this.
In the end it took 2 hrs 45 mins to complete, and that is with the latest version of chicken. ID: 627230 ·

Jesse Viviano Send message Joined: 27 Feb 00 Posts: 100 Credit: 3,949,583 RAC: 0	Message 627595 - Posted: 27 Aug 2007, 3:01:26 UTC - in response to Message 621449.
For those of you technically adventurous, here's another option to handle the WUs with negative triplet threshold. The negative threshold means no triplets can possibly be found, a high enough positive threshold has the same effect. But the very high positive threshold doesn't make crunching slow, it actually makes it slightly faster than a normal threshold because triplet finding has less work to do. So the workaround is:
1. Ensure the WU is not in use by shutting down BOINC. (If your preferences are to have suspended work removed from memory, then Suspending the work would be enough.)
2. Open the WU in an editor. NOT a word processor or anything else which may change more than you intend.
3. Find the <triplet_thresh>-x.xxxxxx</triplet_thresh> line.
4. Change it to <triplet_thresh>99</triplet_thresh> .
5. Save the WU file.
6. Restart BOINC or Resume the WU.
That WU may not start running. You could force it by Suspending others, or simply let BOINC get to it whenever. When it does run, it is likely to overflow on Pulses. The result should match that from someone who has allowed the WU to creep to that naturally. Because of the high threshold your credit claim will be lower, but probably by less than 1 cobblestone. I did this to all 4 I had, wuid 148063182 has validated against a full run, wuid 148063170, wuid 148063178, and wuid 148140560 don't have other completed work yet. I am not urging anyone else to try this, and can't think of another situation in which I'd consider modifying a WU. Normally any change to a WU would lead to an invalid result, this is a very unusual exception. Even so, I considered long and thoroughly before posting this, and will not be at all unhappy if someone from Berkeley decides to hide this post. Joe
I am not sure if this is a good idea. If a work unit is discarded due to too many errors, this might notify the administrators that something needed to be done about the problem WU. Once Eric finally fixes the splitter (what the admins did looks like a band-aid on code that was not their specialty so their patch may have broken the splitter in a way that they might not have seen), he will know that this work unit errored out, and have it resplit with the corrected splitter. If you modify the work unit, this flag might not be generated. ID: 627595 ·
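
For readers who want to see what steps 3 and 4 of the quoted workaround amount to, here is a minimal sketch of the text substitution alone. It assumes the WU file is plain text containing a <triplet_thresh> element as quoted; the function name and the sample fragment are made up for illustration, and the caveats raised in both posts above still apply.

```python
import re

# Matches a triplet_thresh element holding a negative number.
NEGATIVE_THRESH = re.compile(r"<triplet_thresh>\s*-[0-9.eE+-]*\s*</triplet_thresh>")


def raise_negative_triplet_thresh(wu_text: str, new_value: str = "99") -> str:
    """Replace a negative <triplet_thresh> value; leave everything else untouched."""
    return NEGATIVE_THRESH.sub(f"<triplet_thresh>{new_value}</triplet_thresh>", wu_text)


# Hypothetical fragment; a real WU file contains much more than this.
sample = "<x>\n<triplet_thresh>-1.234567</triplet_thresh>\n</x>\n"
print(raise_negative_triplet_thresh(sample))
```

A positive threshold is left alone, so running the function on a healthy WU changes nothing. Reading and writing the file, and shutting BOINC down first, are left to the caller.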
