Work Unit problem

Author	Message
gomeyer Volunteer tester Joined: 21 May 99 Posts: 488 Credit: 50,370,425 RAC: 0
Message 620960 - Posted: 17 Aug 2007, 10:36:22 UTC - in response to Message 620775.

> Because there are 256 WUs with identical thresholds in each group, only the first 3 fields of the WU name are needed. Here's mine plus those already mentioned in the thread:
>
> 04mr07ab.10282.4980
> 04mr07ab.14840.4980
> 04mr07ab.32128.5798
> 05mr07aa.12591.24612
> 05mr07aa.15859.24612
> 05mr07ab.7301.368637
>
> Joe

And mine:

04mr07ab.7106.5389
05mr07ab.6072.369046 (3 of these)
04mr07ab.7106.5798
05mr07aa.12210.24612
04mr07ab.7106.6207
05mr07aa.3769.20522 (5 of these)
04mr07ab.7106.6616

Gus
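
For anyone tallying these by hand, here is a minimal sketch of the "first 3 fields" grouping Joe describes. It assumes WU names are dot-separated, as in 04mr07ab.14840.4980.6.4.243; the function names are made up for illustration and are not part of BOINC or SETI@home.

```python
from collections import Counter


def wu_group(wu_name):
    """Return the first three dot-separated fields of a WU name."""
    return ".".join(wu_name.split(".")[:3])


def count_groups(wu_names):
    """Count how many of the given WU names fall into each group."""
    return Counter(wu_group(name) for name in wu_names)


if __name__ == "__main__":
    # The trailing fields below are hypothetical, for illustration only.
    names = [
        "04mr07ab.7106.5389.6.4.10",
        "04mr07ab.7106.5389.6.4.11",
        "05mr07aa.12210.24612.6.4.200",
    ]
    for group, count in sorted(count_groups(names).items()):
        print(group, count)
```

Paste in your own task names in place of the sample list to see which groups you are holding.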

Jesse Viviano Joined: 27 Feb 00 Posts: 100 Credit: 3,949,583 RAC: 0
Message 621005 - Posted: 17 Aug 2007, 12:14:57 UTC

It seems that a few of them do not overflow with a -9 overflow message, but exit with a compute error stating that the maximum CPU time limit was exceeded, returning a -177 (0xffffff4f) error. Look at mine, 04mr07ab.10282.4980, here, which is the result for a work unit I posted above. If you run across it, just abort it until there are so many errors that the work unit is tossed as trash.

Bob Nadler Joined: 3 Sep 99 Posts: 7 Credit: 726,368 RAC: 0
Message 621012 - Posted: 17 Aug 2007, 12:39:06 UTC - in response to Message 621005.

> It seems that a few of them do not overflow with a -9 overflow message, but exit with a compute error stating that the maximum CPU time limit was exceeded, returning a -177 (0xffffff4f) error. Look at mine, 04mr07ab.10282.4980, here, which is the result for a work unit I posted above. If you run across it, just abort it until there are so many errors that the work unit is tossed as trash.

I would agree. I am running 04mr07ab.14840.4980.6.4.243 on a Linux system with 2 GHz Xeon CPUs, BOINC v5.8.16 and SETI v5.27. It has 13 hours of CPU time and is .035% done :-\ I suspended and restarted it, so now it is .030% done and has 14 hours to completion.

http://setiathome.berkeley.edu/workunit.php?wuid=147539328

What is the maximum amount of CPU time the project allows on a workunit? Will this time out or just run past the report deadline (which is 25 Aug)? I would rather let this go if it would be the last time anyone gets this WU.

Thanks!
Bob

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874
Message 621020 - Posted: 17 Aug 2007, 13:09:39 UTC - in response to Message 621005.

> It seems that a few of them do not overflow with a -9 overflow message, but exit with a compute error stating that the maximum CPU time limit was exceeded, returning a -177 (0xffffff4f) error. Look at mine, 04mr07ab.10282.4980, here, which is the result for a work unit I posted above. If you run across it, just abort it until there are so many errors that the work unit is tossed as trash.

I thought for a moment I'd got your hand-me-downs! But it was a different re-issue - from result 590582939 - with the same symptoms as you describe. I've suspended mine at 3 hours 20 minutes (0.089%) in case anyone wants details. (Joe - <triplet_thresh>-0.764051318</triplet_thresh> - ???).

I'm happy to keep it out of circulation for the weekend (I've got other long-running projects on the box), and we can ask the lab on Monday what the chances are of a scripted bulk cancellation. [Anyone know whether Eric is due back next week, or has he still got two, three, ..., weeks of vacation to go?]

W-K 666 Volunteer tester Joined: 18 May 99 Posts: 19063 Credit: 40,757,560 RAC: 67
Message 621033 - Posted: 17 Aug 2007, 13:46:14 UTC - in response to Message 621020.

>> It seems that a few of them do not overflow with a -9 overflow message, but exit with a compute error stating that the maximum CPU time limit was exceeded, returning a -177 (0xffffff4f) error. Look at mine, 04mr07ab.10282.4980, here, which is the result for a work unit I posted above. If you run across it, just abort it until there are so many errors that the work unit is tossed as trash.

> I thought for a moment I'd got your hand-me-downs! But it was a different re-issue - from result 590582939 - with the same symptoms as you describe. I've suspended mine at 3 hours 20 minutes (0.089%) in case anyone wants details. (Joe - <triplet_thresh>-0.764051318</triplet_thresh> - ???). I'm happy to keep it out of circulation for the weekend (I've got other long-running projects on the box), and we can ask the lab on Monday what the chances are of a scripted bulk cancellation. [Anyone know whether Eric is due back next week, or has he still got two, three, ..., weeks of vacation to go?]

It would probably be better to get a bulk cancellation ASAP, because it must be part of the connection problems etc. One guesses that there could have been up to 500,000 original bad units, and if they are all re-issued until they reach max errors then that would be up to 750,000 extra copies. I've already had one re-issue that has gone to 5 copies.

Andy

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874
Message 621043 - Posted: 17 Aug 2007, 14:05:36 UTC - in response to Message 621033.

>> [Anyone know whether Eric is due back next week, or has he still got two, three, ..., weeks of vacation to go?]

> It would probably be better to get a bulk cancellation ASAP, because it must be part of the connection problems etc. One guesses that there could have been up to 500,000 original bad units, and if they are all re-issued until they reach max errors then that would be up to 750,000 extra copies. I've already had one re-issue that has gone to 5 copies. Andy

Pappa has posted in Beta that Eric is on vacation (holiday) for another week .... not much else will get done for at least another week and 2 weekends. So it would have to be Matt or Jeff.

I'm not sure what their thinking would be on that - I suppose it depends whether disk/database space, or download throughput, is seen as more important at the moment: at least the 'sticky' WUs don't have a very fast turnover! I guess it'll have to be their call in the end.

Pablo_ZPM Joined: 13 Jul 01 Posts: 3 Credit: 367,720 RAC: 0
Message 621050 - Posted: 17 Aug 2007, 14:22:18 UTC

Hi all, I used to have stuck units from time to time, but now I'm experiencing the opposite: see post http://setiathome.berkeley.edu/forum_thread.php?id=41585&nowrap=true#621028

Any ideas? :o)

Seems to me I'm generating a lot of traffic / requests for new work and, hey! I do get new units from time to time. Right now my box is finishing another one that took 2.5 hours to process (usually it takes 11 - 18 hours to churn out a result), so it is hollering for more work. I've had several units that took 90 seconds to process, and then it was back for more work, sic. :o(

Pablo_ZPM

Jim-R. Volunteer tester Joined: 7 Feb 06 Posts: 1494 Credit: 194,148 RAC: 0
Message 621055 - Posted: 17 Aug 2007, 14:35:46 UTC - in response to Message 621050.

> Hi all, I used to have stuck units from time to time, but now I'm experiencing the opposite: see post http://setiathome.berkeley.edu/forum_thread.php?id=41585&nowrap=true#621028 Any ideas? :o) Seems to me I'm generating a lot of traffic / requests for new work and, hey! I do get new units from time to time. Right now my box is finishing another one that took 2.5 hours to process (usually it takes 11 - 18 hours to churn out a result), so it is hollering for more work. I've had several units that took 90 seconds to process, and then it was back for more work, sic. :o( Pablo_ZPM

From checking about a dozen of your last results it seems you have run into quite a large number of "high" angle range work units, ar = 1.49xxx. These will run quicker than the more "normal" 0.42xxx ar's. Also there have been quite a few that have been what we call "-9 overflow" or "noisy" work units. The time it takes to crunch these work units just depends on how "noisy" they are. If they reach the maximum number of results quickly (extremely "noisy") they will end quickly. If they are not too noisy they might take a bit longer to error out.

So your computer is still doing good work and there's nothing to worry about. Once we get out of the high angle ranges and start issuing some more "normal" ranges things will settle down.

Jim

Some people plan their life out and look back at the wealth they've had. Others live life day by day and look back at the wealth of experiences and enjoyment they've had.

Pablo_ZPM Joined: 13 Jul 01 Posts: 3 Credit: 367,720 RAC: 0
Message 621089 - Posted: 17 Aug 2007, 15:47:27 UTC - in response to Message 621055.

>> Hi all, I used to have stuck units from time to time, but now I'm experiencing the opposite: see post http://setiathome.berkeley.edu/forum_thread.php?id=41585&nowrap=true#621028 Any ideas? :o) Seems to me I'm generating a lot of traffic / requests for new work and, hey! I do get new units from time to time. Right now my box is finishing another one that took 2.5 hours to process (usually it takes 11 - 18 hours to churn out a result), so it is hollering for more work. I've had several units that took 90 seconds to process, and then it was back for more work, sic. :o( Pablo_ZPM

> From checking about a dozen of your last results it seems you have run into quite a large number of "high" angle range work units, ar = 1.49xxx. These will run quicker than the more "normal" 0.42xxx ar's. Also there have been quite a few that have been what we call "-9 overflow" or "noisy" work units. The time it takes to crunch these work units just depends on how "noisy" they are. If they reach the maximum number of results quickly (extremely "noisy") they will end quickly. If they are not too noisy they might take a bit longer to error out. So your computer is still doing good work and there's nothing to worry about. Once we get out of the high angle ranges and start issuing some more "normal" ranges things will settle down.

Thanks, I've now finally got some units which seem to be perfectly "ordinary" (rate of progress vs. processing time), and that, together with your answer, has quieted my worries about creating excessive demand on the s@h servers.

Everything back to normal, if searching for "little green men" fits that description... :o)

jason_gee Volunteer developer Volunteer tester Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
Message 621105 - Posted: 17 Aug 2007, 16:06:25 UTC - in response to Message 621089. Last modified: 17 Aug 2007, 16:07:14 UTC

Well, it seems all my 04mr07ab units (about 4 dozen of them) crunched through and all '-9 overflowed' in from 6 to 30 CPU seconds [none stuck]. All gone now and I seem to be crunching much healthier workunits :D

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.

Josef W. Segur Volunteer developer Volunteer tester Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0
Message 621172 - Posted: 17 Aug 2007, 17:24:45 UTC - in response to Message 621012.

> ... What is the maximum amount of CPU time the project allows on a workunit? Will this time out or just run past the report deadline (which is 25 Aug)? I would rather let this go if it would be the last time anyone gets this WU.

For your hosts with Whetstone MIPS in the 1720 range, BOINC will kill these high angle range WUs after about 1.5 days. The exact amount can be found from values in the client_state.xml file. Find the <rsc_fpops_bound> value for the WU and divide it by the host <p_fpops> value to get the CPU time limit in seconds. Unfortunately, that would be reported as an error and the servers would reissue the WU to another host.

The splitter problem which causes triplet_thresh to go negative also causes pulse_thresh to be lower than normal, so probably most of these glacially slow WUs will overflow on pulses if run long enough. But if the data is actually quiet enough that may not happen.

Joe
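
The division Joe describes can be sketched in a few lines. This is only an illustration: the two tag names come from his post, but the function name and the sample numbers are assumptions, so read the real values out of your own client_state.xml.

```python
def cpu_time_limit_seconds(rsc_fpops_bound, p_fpops):
    """CPU time limit = <rsc_fpops_bound> of the WU / <p_fpops> of the host."""
    if p_fpops <= 0:
        raise ValueError("p_fpops must be positive")
    return rsc_fpops_bound / p_fpops


if __name__ == "__main__":
    # Illustrative values only, not taken from any real client_state.xml:
    # a host benchmarked at about 1720 Whetstone MIPS (1.72e9 flops/s)
    # and a made-up <rsc_fpops_bound> for the workunit.
    limit = cpu_time_limit_seconds(rsc_fpops_bound=2.25e14, p_fpops=1.72e9)
    print(f"{limit:.0f} s = {limit / 3600:.2f} h")
```

The result is in CPU seconds; divide by 3600 for hours, as in the print statement.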

Bob Nadler Joined: 3 Sep 99 Posts: 7 Credit: 726,368 RAC: 0
Message 621204 - Posted: 17 Aug 2007, 17:59:48 UTC - in response to Message 621172.

>> ... What is the maximum amount of CPU time the project allows on a workunit? Will this time out or just run past the report deadline (which is 25 Aug)? I would rather let this go if it would be the last time anyone gets this WU.

> For your hosts with Whetstone MIPS in the 1720 range, BOINC will kill these high angle range WUs after about 1.5 days. The exact amount can be found from values in the client_state.xml file. Find the <rsc_fpops_bound> value for the WU and divide it by the host <p_fpops> value to get the CPU time limit in seconds. Unfortunately, that would be reported as an error and the servers would reissue the WU to another host. The splitter problem which causes triplet_thresh to go negative also causes pulse_thresh to be lower than normal, so probably most of these glacially slow WUs will overflow on pulses if run long enough. But if the data is actually quiet enough that may not happen. Joe

Thanks Joe! I calculate that out for this workunit on my system to be 36.35 CPU hours. Unless someone knows otherwise, I guess I can let that run to see if it reaches that threshold or finishes. Maybe this is the WU with a real ET signal ;-)

Bob

Josef W. Segur Volunteer developer Volunteer tester Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0
Message 621449 - Posted: 17 Aug 2007, 23:08:51 UTC

For those of you technically adventurous, here's another option to handle the WUs with negative triplet threshold.

The negative threshold means no triplets can possibly be found; a high enough positive threshold has the same effect. But the very high positive threshold doesn't make crunching slow; it actually makes it slightly faster than a normal threshold because triplet finding has less work to do. So the workaround is:

1. Ensure the WU is not in use by shutting down BOINC. (If your preferences are to have suspended work removed from memory, then Suspending the work would be enough.)
2. Open the WU in an editor. NOT a word processor or anything else which may change more than you intend.
3. Find the <triplet_thresh>-x.xxxxxx</triplet_thresh> line.
4. Change it to <triplet_thresh>99</triplet_thresh> .
5. Save the WU file.
6. Restart BOINC or Resume the WU.

That WU may not start running. You could force it by Suspending others, or simply let BOINC get to it whenever. When it does run, it is likely to overflow on Pulses. The result should match that from someone who has allowed the WU to creep to that naturally. Because of the high threshold your credit claim will be lower, but probably by less than 1 cobblestone.

I did this to all 4 I had. wuid 148063182 has validated against a full run; wuid 148063170, wuid 148063178, and wuid 148140560 don't have other completed work yet.

I am not urging anyone else to try this, and can't think of another situation in which I'd consider modifying a WU. Normally any change to a WU would lead to an invalid result; this is a very unusual exception. Even so, I considered long and thoroughly before posting this, and will not be at all unhappy if someone from Berkeley decides to hide this post.

Joe
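
For anyone with more than a handful of these, steps 2 to 5 above could be scripted rather than done by hand. What follows is a rough sketch, not a tested tool: the function names and the ".bak" backup suffix are made up for illustration, and it assumes the WU file holds the tag literally as <triplet_thresh>-x.xxxxxx</triplet_thresh>, as in the post. It works on raw bytes so that nothing else in the file is touched, and it leaves a positive threshold alone. BOINC must still be shut down (or the task suspended) first, as in step 1.

```python
import re
from pathlib import Path

# Matches only a NEGATIVE value between the triplet_thresh tags.
_NEGATIVE_THRESH = re.compile(
    rb"(<triplet_thresh>)\s*-[0-9][0-9.eE+-]*\s*(</triplet_thresh>)"
)


def patch_triplet_thresh(wu_bytes, new_value=b"99"):
    """Return (patched_bytes, count); count is 0 if nothing was negative."""
    replacement = rb"\g<1>" + new_value + rb"\g<2>"
    return _NEGATIVE_THRESH.subn(replacement, wu_bytes, count=1)


def patch_wu_file(path):
    """Patch one WU file in place, keeping a backup copy next to it."""
    wu_path = Path(path)
    original = wu_path.read_bytes()
    patched, count = patch_triplet_thresh(original)
    if count:
        # Hypothetical backup naming; any safe location would do.
        wu_path.with_name(wu_path.name + ".bak").write_bytes(original)
        wu_path.write_bytes(patched)
    return count
```

Calling patch_wu_file() on the path of a WU file in your BOINC project directory returns 1 if the threshold was changed and 0 if the file was left untouched.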

gomeyer Volunteer tester Joined: 21 May 99 Posts: 488 Credit: 50,370,425 RAC: 0
Message 621497 - Posted: 18 Aug 2007, 0:05:04 UTC - in response to Message 621449.

> For those of you technically adventurous . . . Joe

Shweet! Works as advertised. Thanks for sticking with this Joe, and for coming up with a viable workaround. I was up to 19 of these and my BoincView display was getting a little messy.

Regards,
Gus

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874
Message 621850 - Posted: 18 Aug 2007, 10:17:51 UTC - in response to Message 621449.

> For those of you technically adventurous...

Sounds like a considered and well-reasoned argument: under the circumstances, I can't imagine anyone from Berkeley arguing against it. Given Matt's figure of 50% spurious -9s, I think the whole 'tape' will have to be put in Matt's recycling 'box' for re-scrutiny at a future date (I wonder how he'll manage that filing system now it's all on hard drives and remote archival storage, LOL).

Anyway, I've performed surgery on my four, but they'll just have to wait their turn in the queue - now I've unsuspended them, there is indeed a queue, which is good to see. Glad to do my little bit towards tidying up my corner of the Berkeley BOINC database.

Claggy Volunteer tester Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4
Message 621878 - Posted: 18 Aug 2007, 12:05:03 UTC - in response to Message 621850.

>> For those of you technically adventurous...

> Sounds like a considered and well-reasoned argument: under the circumstances, I can't imagine anyone from Berkeley arguing against it. Given Matt's figure of 50% spurious -9s, I think the whole 'tape' will have to be put in Matt's recycling 'box' for re-scrutiny at a future date (I wonder how he'll manage that filing system now it's all on hard drives and remote archival storage, LOL). Anyway, I've performed surgery on my four, but they'll just have to wait their turn in the queue - now I've unsuspended them, there is indeed a queue, which is good to see. Glad to do my little bit towards tidying up my corner of the Berkeley BOINC database.

Done my surgery on my three, but one of them had already done nine hours and was at 0.12%, so my RDCF (Result duration correction factor) is now 3.15, and now all my MB WUs are reported as going to take 35 hours or so, more than 10 times what they should. I tried lowering it to 0.9 (in sched_request_setiathome.berkeley.edu.xml), but when I report it goes back to what it was. Any ideas? Or do I have to wait until it goes down by itself?

Claggy.

Astro Volunteer tester Joined: 16 Apr 02 Posts: 8026 Credit: 600,015 RAC: 0
Message 621880 - Posted: 18 Aug 2007, 12:16:15 UTC

Well, you could edit it in the project section of the "client_state.xml" file instead. You'll get better results.

Richard Haselgrove Volunteer tester Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874
Message 621881 - Posted: 18 Aug 2007, 12:16:43 UTC - in response to Message 621878.

> Done my surgery on my three, but one of them had already done nine hours and was at 0.12%, so my RDCF (Result duration correction factor) is now 3.15, and now all my MB WUs are reported as going to take 35 hours or so, more than 10 times what they should. I tried lowering it to 0.9 (in sched_request_setiathome.berkeley.edu.xml), but when I report it goes back to what it was. Any ideas? Or do I have to wait until it goes down by itself? Claggy.

Personally, I'm letting mine sort itself out in its own time - I think that there are currently still some issues with WU runtime estimates in MB, so there would be no such thing as a "correct" RDCF for all work.

But if you want to give it a helping hand, the file to edit is client_state.xml (stop BOINC first, use extreme care and a text-only editor, make sure you edit the right project's RDCF figure, backups are always a good idea, etc., etc.)
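
As a rough illustration of "make sure you edit the right project's RDCF figure", here is a sketch that changes the value only inside the <project> block whose <master_url> matches. Two things here are assumptions and not stated in the thread: that the RDCF is stored in a <duration_correction_factor> tag, and that the project is identified by a <master_url> tag inside <project>. Check your own client_state.xml before trusting either. The function name is made up, BOINC must be stopped first, and you should keep a backup.

```python
import re

_PROJECT_BLOCK = re.compile(r"<project>.*?</project>", re.DOTALL)
# Assumed tag name for the RDCF; verify it against your own file.
_DCF = re.compile(
    r"(<duration_correction_factor>)[^<]*(</duration_correction_factor>)"
)


def set_dcf(state_text, master_url, new_dcf):
    """Return state_text with the DCF changed for one project only."""
    wanted = f"<master_url>{master_url}</master_url>"

    def fix_block(match):
        block = match.group(0)
        if wanted not in block:
            return block  # some other project: leave it alone
        return _DCF.sub(
            lambda m: f"{m.group(1)}{new_dcf:.6f}{m.group(2)}", block, count=1
        )

    return _PROJECT_BLOCK.sub(fix_block, state_text)
```

Read client_state.xml into a string, pass it through set_dcf() with the project URL and the value you want, and write the result back only after saving a copy of the original.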

W-K 666 Volunteer tester Joined: 18 May 99 Posts: 19063 Credit: 40,757,560 RAC: 67
Message 621882 - Posted: 18 Aug 2007, 12:17:11 UTC - in response to Message 621878.

>>> For those of you technically adventurous...

>> Sounds like a considered and well-reasoned argument: under the circumstances, I can't imagine anyone from Berkeley arguing against it. Given Matt's figure of 50% spurious -9s, I think the whole 'tape' will have to be put in Matt's recycling 'box' for re-scrutiny at a future date (I wonder how he'll manage that filing system now it's all on hard drives and remote archival storage, LOL). Anyway, I've performed surgery on my four, but they'll just have to wait their turn in the queue - now I've unsuspended them, there is indeed a queue, which is good to see. Glad to do my little bit towards tidying up my corner of the Berkeley BOINC database.

> Done my surgery on my three, but one of them had already done nine hours and was at 0.12%, so my RDCF (Result duration correction factor) is now 3.15, and now all my MB WUs are reported as going to take 35 hours or so, more than 10 times what they should. I tried lowering it to 0.9 (in sched_request_setiathome.berkeley.edu.xml), but when I report it goes back to what it was. Any ideas? Or do I have to wait until it goes down by itself? Claggy.

Did you exit BOINC during edit of client state?

Claggy Volunteer tester Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4
Message 621886 - Posted: 18 Aug 2007, 12:23:01 UTC - in response to Message 621882.

>>>> For those of you technically adventurous...

>>> Sounds like a considered and well-reasoned argument: under the circumstances, I can't imagine anyone from Berkeley arguing against it. Given Matt's figure of 50% spurious -9s, I think the whole 'tape' will have to be put in Matt's recycling 'box' for re-scrutiny at a future date (I wonder how he'll manage that filing system now it's all on hard drives and remote archival storage, LOL). Anyway, I've performed surgery on my four, but they'll just have to wait their turn in the queue - now I've unsuspended them, there is indeed a queue, which is good to see. Glad to do my little bit towards tidying up my corner of the Berkeley BOINC database.

>> Done my surgery on my three, but one of them had already done nine hours and was at 0.12%, so my RDCF (Result duration correction factor) is now 3.15, and now all my MB WUs are reported as going to take 35 hours or so, more than 10 times what they should. I tried lowering it to 0.9 (in sched_request_setiathome.berkeley.edu.xml), but when I report it goes back to what it was. Any ideas? Or do I have to wait until it goes down by itself? Claggy.

> Did you exit BOINC during edit of client state?

Yep, and put a backup on the desktop.

Claggy.
