Work Unit problem

gomeyer
Volunteer tester

Joined: 21 May 99
Posts: 488
Credit: 50,370,425
RAC: 0
United States
Message 620960 - Posted: 17 Aug 2007, 10:36:22 UTC - in response to Message 620775.  


Because there are 256 WUs with identical thresholds in each group, only the first 3 fields of the WU name are needed. Here's mine plus those already mentioned in the thread:

04mr07ab.10282.4980
04mr07ab.14840.4980
04mr07ab.32128.5798
05mr07aa.12591.24612
05mr07aa.15859.24612
05mr07ab.7301.368637
                                                               Joe

And mine
04mr07ab.7106.5389
05mr07ab.6072.369046 (3 of these)
04mr07ab.7106.5798
05mr07aa.12210.24612
04mr07ab.7106.6207
05mr07aa.3769.20522 (5 of these)
04mr07ab.7106.6616
Gus
ID: 620960
Jesse Viviano

Joined: 27 Feb 00
Posts: 100
Credit: 3,949,583
RAC: 0
United States
Message 621005 - Posted: 17 Aug 2007, 12:14:57 UTC

Seems that a few of them do not overflow with a -9 overflow message, but instead exit with a compute error stating that the maximum CPU time limit was exceeded, returning a -177 (0xffffff4f) error. Look at mine, 04mr07ab.10282.4980, here, which is the result of a work unit I posted above. If you run across it, just abort it until there are so many errors that the work unit is tossed as trash.
ID: 621005
Bob Nadler

Joined: 3 Sep 99
Posts: 7
Credit: 726,368
RAC: 0
United States
Message 621012 - Posted: 17 Aug 2007, 12:39:06 UTC - in response to Message 621005.  

Seems that a few of them do not overflow with a -9 overflow message, but instead exit with a compute error stating that the maximum CPU time limit was exceeded, returning a -177 (0xffffff4f) error. Look at mine, 04mr07ab.10282.4980, here, which is the result of a work unit I posted above. If you run across it, just abort it until there are so many errors that the work unit is tossed as trash.



I would agree. I am running 04mr07ab.14840.4980.6.4.243 on a Linux system with 2 GHz Xeon CPUs, BOINC v5.8.16 and SETI v5.27. It has 13 hours of CPU time and is 0.035% done :-\ I suspended and restarted it, so now it is 0.030% done and shows 14 hours to completion.

http://setiathome.berkeley.edu/workunit.php?wuid=147539328

What is the maximum amount of CPU time the project allows on a workunit? Will this time out or just run past the report deadline (which is 25 Aug)? I would rather let this go if it would be the last time anyone gets this WU.

Thanks!

Bob
ID: 621012
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 621020 - Posted: 17 Aug 2007, 13:09:39 UTC - in response to Message 621005.  

Seems that a few of them do not overflow with a -9 overflow message, but instead exit with a compute error stating that the maximum CPU time limit was exceeded, returning a -177 (0xffffff4f) error. Look at mine, 04mr07ab.10282.4980, here, which is the result of a work unit I posted above. If you run across it, just abort it until there are so many errors that the work unit is tossed as trash.

I thought for a moment I'd got your hand-me-downs!

But it was a different re-issue - from result 590582939 - with the same symptoms as you describe.

I've suspended mine at 3 hours 20 minutes (0.089%) in case anyone wants details. (Joe - <triplet_thresh>-0.764051318</triplet_thresh> - ???). I'm happy to keep it out of circulation for the weekend (I've got other long-running projects on the box), and we can ask the lab on Monday what the chances are of a scripted bulk cancellation.

[Anyone know whether Eric is due back next week, or has he still got two, three, ..., weeks of vacation to go?]
ID: 621020
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19012
Credit: 40,757,560
RAC: 67
United Kingdom
Message 621033 - Posted: 17 Aug 2007, 13:46:14 UTC - in response to Message 621020.  

Seems that a few of them do not overflow with a -9 overflow message, but instead exit with a compute error stating that the maximum CPU time limit was exceeded, returning a -177 (0xffffff4f) error. Look at mine, 04mr07ab.10282.4980, here, which is the result of a work unit I posted above. If you run across it, just abort it until there are so many errors that the work unit is tossed as trash.

I thought for a moment I'd got your hand-me-downs!

But it was a different re-issue - from result 590582939 - with the same symptoms as you describe.

I've suspended mine at 3 hours 20 minutes (0.089%) in case anyone wants details. (Joe - <triplet_thresh>-0.764051318</triplet_thresh> - ???). I'm happy to keep it out of circulation for the weekend (I've got other long-running projects on the box), and we can ask the lab on Monday what the chances are of a scripted bulk cancellation.

[Anyone know whether Eric is due back next week, or has he still got two, three, ..., weeks of vacation to go?]

It would probably be better to get a bulk cancellation ASAP, because these WUs must be adding to the connection problems etc. One guesses that there could have been up to 500,000 original bad units, and if they are all re-issued until they reach the maximum error count, that would be up to 750,000 extra copies.

I've already had one re-issue that has gone to 5 copies.

Andy
ID: 621033
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 621043 - Posted: 17 Aug 2007, 14:05:36 UTC - in response to Message 621033.  

[Anyone know whether Eric is due back next week, or has he still got two, three, ..., weeks of vacation to go?]

It would probably be better to get a bulk cancellation ASAP, because these WUs must be adding to the connection problems etc. One guesses that there could have been up to 500,000 original bad units, and if they are all re-issued until they reach the maximum error count, that would be up to 750,000 extra copies.

I've already had one re-issue that has gone to 5 copies.

Andy

Pappa has posted in Beta that
Eric is on vacation (holiday) for another week .... not much else will get done for at least another week and 2 weekends.

so it would have to be Matt or Jeff. I'm not sure what their thinking would be on that - I suppose it depends whether disk/database space, or download throughput, is seen as more important at the moment: at least the 'sticky' WUs don't have a very fast turnover!

I guess it'll have to be their call in the end.
ID: 621043
Pablo_ZPM

Joined: 13 Jul 01
Posts: 3
Credit: 367,720
RAC: 0
Poland
Message 621050 - Posted: 17 Aug 2007, 14:22:18 UTC

Hi all,
I used to have stuck units from time to time, but now I'm experiencing the opposite: see post http://setiathome.berkeley.edu/forum_thread.php?id=41585&nowrap=true#621028
Any ideas? :o) Seems to me I'm generating a lot of traffic / requests for new work and, hey! I do get new units from time to time. Right now my box is finishing another one that took 2.5 hours to process (usually it takes 11 - 18 hours to churn out a result), so it is hollering for more work. I've had several units that took 90 seconds to process, and then it's straight back for more work, sic. :o(
Pablo_ZPM
ID: 621050
Jim-R.
Volunteer tester
Joined: 7 Feb 06
Posts: 1494
Credit: 194,148
RAC: 0
United States
Message 621055 - Posted: 17 Aug 2007, 14:35:46 UTC - in response to Message 621050.  

Hi all,
I used to have stuck units from time to time, but now I'm experiencing the opposite: see post http://setiathome.berkeley.edu/forum_thread.php?id=41585&nowrap=true#621028
Any ideas? :o) Seems to me I'm generating a lot of traffic / requests for new work and, hey! I do get new units from time to time. Right now my box is finishing another one that took 2.5 hours to process (usually it takes 11 - 18 hours to churn out a result), so it is hollering for more work. I've had several units that took 90 seconds to process, and then it's straight back for more work, sic. :o(
Pablo_ZPM

From checking about a dozen of your last results it seems you have run into quite a large number of "high" angle range work units, ar= 1.49xxx. These will run quicker than the more "normal" 0.42xxx ar's. Also there have been quite a few that have been what we call "-9 overflow" or "noisy" work units. The time it takes to crunch these work units just depends on how "noisy" they are. If they reach the maximum number of results quickly (extremely "noisy") they will end quickly. If they are not too noisy they might take a bit longer to error out. So your computer is still doing good work and there's nothing to worry about. Once we get out of the high angle ranges and start issuing some more "normal" ranges things will settle down.
Jim

Some people plan their life out and look back at the wealth they've had.
Others live life day by day and look back at the wealth of experiences and enjoyment they've had.
ID: 621055
Pablo_ZPM

Joined: 13 Jul 01
Posts: 3
Credit: 367,720
RAC: 0
Poland
Message 621089 - Posted: 17 Aug 2007, 15:47:27 UTC - in response to Message 621055.  

Hi all,
I used to have stuck units from time to time, but now I'm experiencing the opposite: see post http://setiathome.berkeley.edu/forum_thread.php?id=41585&nowrap=true#621028
Any ideas? :o) Seems to me I'm generating a lot of traffic / requests for new work and, hey! I do get new units from time to time. Right now my box is finishing another one that took 2.5 hours to process (usually it takes 11 - 18 hours to churn out a result), so it is hollering for more work. I've had several units that took 90 seconds to process, and then it's straight back for more work, sic. :o(
Pablo_ZPM

From checking about a dozen of your last results it seems you have run into quite a large number of "high" angle range work units, ar= 1.49xxx. These will run quicker than the more "normal" 0.42xxx ar's. Also there have been quite a few that have been what we call "-9 overflow" or "noisy" work units. The time it takes to crunch these work units just depends on how "noisy" they are. If they reach the maximum number of results quickly (extremely "noisy") they will end quickly. If they are not too noisy they might take a bit longer to error out. So your computer is still doing good work and there's nothing to worry about. Once we get out of the high angle ranges and start issuing some more "normal" ranges things will settle down.

Thanks, I've now finally got some units which seem to be perfectly "ordinary" in their rate of progress vs. processing time, and that, together with your answer, quieted my worries about creating excessive demand on the s@h servers. Everything back to normal, if searching for "little green men" fits that description... :o)
ID: 621089
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 621105 - Posted: 17 Aug 2007, 16:06:25 UTC - in response to Message 621089.  
Last modified: 17 Aug 2007, 16:07:14 UTC

Well, it seems all my 04mr07ab units (about 4 dozen of them) crunched through and all '-9 overflowed' in anywhere from 6 to 30 CPU seconds [none stuck]. All gone now and I seem to be crunching much healthier workunits :D

"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 621105
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 621172 - Posted: 17 Aug 2007, 17:24:45 UTC - in response to Message 621012.  

...
What is the maximum amount of CPU time the project allows on a workunit? Will this time out or just run past the report deadline (which is 25 Aug)? I would rather let this go if it would be the last time anyone gets this WU.


For your hosts with Whetstone MIPs in the 1720 range, BOINC will kill these high angle range WUs after about 1.5 days. The exact amount can be found from values in the client_state.xml file. Find the <rsc_fpops_bound> value for the WU and divide it by the host <p_fpops> value to get the CPU time limit in seconds.
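
If you'd rather script that check than hunt through the XML by hand, a minimal sketch along these lines does the division for every WU in the state file (Python; it assumes the script sits in your BOINC data directory and scans with regular expressions, since old client_state.xml files aren't always strictly well-formed XML):

    # Sketch: print the implied CPU time limit for each workunit in client_state.xml.
    # Assumes it is run from the BOINC data directory.
    import re

    text = open("client_state.xml").read()

    # Host benchmark: floating point operations per second
    p_fpops = float(re.search(r"<p_fpops>([-+\d.eE]+)</p_fpops>", text).group(1))

    for block in re.findall(r"<workunit>.*?</workunit>", text, re.S):
        name = re.search(r"<name>(.*?)</name>", block).group(1)
        bound = float(re.search(r"<rsc_fpops_bound>([-+\d.eE]+)</rsc_fpops_bound>", block).group(1))
        print("%s: about %.1f CPU hours allowed" % (name, bound / p_fpops / 3600.0))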

Unfortunately, that would be reported as an error and the servers would reissue the WU to another host.

The splitter problem which causes triplet_thresh to go negative also causes pulse_thresh to be lower than normal, so probably most of these glacially slow WUs will overflow on pulses if run long enough. But if the data is actually quiet enough that may not happen.
                                                                Joe

ID: 621172
Bob Nadler

Joined: 3 Sep 99
Posts: 7
Credit: 726,368
RAC: 0
United States
Message 621204 - Posted: 17 Aug 2007, 17:59:48 UTC - in response to Message 621172.  

...
What is the maximum amount of CPU time the project allows on a workunit? Will this time out or just run past the report deadline (which is 25 Aug)? I would rather let this go if it would be the last time anyone gets this WU.


For your hosts with Whetstone MIPs in the 1720 range, BOINC will kill these high angle range WUs after about 1.5 days. The exact amount can be found from values in the client_state.xml file. Find the <rsc_fpops_bound> value for the WU and divide it by the host <p_fpops> value to get the CPU time limit in seconds.

Unfortunately, that would be reported as an error and the servers would reissue the WU to another host.

The splitter problem which causes triplet_thresh to go negative also causes pulse_thresh to be lower than normal, so probably most of these glacially slow WUs will overflow on pulses if run long enough. But if the data is actually quiet enough that may not happen.
                                                                Joe




Thanks Joe!

I calculate that out for this workunit on my system to be 36.35 CPU hours. Unless someone knows otherwise, I guess I can let it run and see whether it reaches that threshold or finishes.
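
(Back-of-the-envelope check, with the fpops figures assumed rather than read from the file: at roughly 1.72e9 fpops/sec for a 1720 MIPS Whetstone benchmark, 36.35 hours works out to 36.35 × 3600 × 1.72e9, or about 2.25e14 operations for <rsc_fpops_bound>, which fits Joe's "about 1.5 days".)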

Maybe this is the WU with a real ET signal ;-)

Bob
ID: 621204
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 621449 - Posted: 17 Aug 2007, 23:08:51 UTC

For those of you technically adventurous, here's another option to handle the WUs with negative triplet threshold.

The negative threshold means no triplets can possibly be found; a high enough positive threshold has the same effect. But a very high positive threshold doesn't make crunching slow; it actually makes it slightly faster than a normal threshold, because triplet finding has less work to do. So the workaround is:

1. Ensure the WU is not in use by shutting down BOINC. (IF your preferences are to have suspended work removed from memory, then Suspending the work would be enough.)
2. Open the WU in an editor. NOT a word processor or anything else which may change more than you intend.
3. Find the <triplet_thresh>-x.xxxxxx</triplet_thresh> line.
4. Change it to <triplet_thresh>99</triplet_thresh> .
5. Save the WU file.
6. Restart BOINC or Resume the WU.
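
If you'd rather script steps 2 to 5, a minimal sketch is below (Python; the WU file path is passed on the command line, and the naming/location is an assumption about a stock setup, e.g. projects/setiathome.berkeley.edu/<WU name>):

    # Sketch: flip a negative triplet_thresh to 99 in a SETI@home workunit file.
    # Do step 1 first (BOINC shut down, or the suspended work removed from memory).
    import re, shutil, sys

    wu_path = sys.argv[1]                        # path to the workunit file
    shutil.copyfile(wu_path, wu_path + ".bak")   # keep a backup copy

    data = open(wu_path).read()
    fixed = re.sub(r"<triplet_thresh>-[\d.]+</triplet_thresh>",
                   "<triplet_thresh>99</triplet_thresh>", data)

    if fixed != data:
        open(wu_path, "w").write(fixed)
        print("triplet_thresh patched; now do step 6")
    else:
        print("no negative triplet_thresh found; file left untouched")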

That WU may not start running. You could force it by Suspending others, or simply let BOINC get to it whenever.

When it does run, it is likely to overflow on Pulses. The result should match that from someone who has let the WU creep to that point naturally. Because of the high threshold your credit claim will be lower, but probably by less than 1 cobblestone.

I did this to all 4 I had; wuid 148063182 has validated against a full run, while wuid 148063170, wuid 148063178, and wuid 148140560 don't have other completed work yet.

I am not urging anyone else to try this, and can't think of another situation in which I'd consider modifying a WU. Normally any change to a WU would lead to an invalid result, this is a very unusual exception. Even so, I considered long and thoroughly before posting this, and will not be at all unhappy if someone from Berkeley decides to hide this post.
                                                                 Joe
ID: 621449
gomeyer
Volunteer tester

Joined: 21 May 99
Posts: 488
Credit: 50,370,425
RAC: 0
United States
Message 621497 - Posted: 18 Aug 2007, 0:05:04 UTC - in response to Message 621449.  

For those of you technically adventurous . . .
                                                                 Joe

Shweet! Works as advertised. Thanks for sticking with this, Joe, and for coming up with a viable workaround. I was up to 19 of these and my BoincView display was getting a little messy.
Regards,
Gus
ID: 621497
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 621850 - Posted: 18 Aug 2007, 10:17:51 UTC - in response to Message 621449.  

For those of you technically adventurous...

Sounds like a considered and well-reasoned argument: under the circumstances, I can't imagine anyone from Berkeley arguing against it. Given Matt's figure of 50% spurious -9s, I think the whole 'tape' will have to be put in Matt's recycling 'box' for re-scrutiny at a future date (I wonder how he'll manage that filing system now it's all on hard drives and remote archival storage, LOL).

Anyway, I've performed surgery on my four, but they'll just have to wait their turn in the queue - now I've unsuspended them, there is indeed a queue, which is good to see.

Glad to do my little bit towards tidying up my corner of the Berkeley BOINC database.
ID: 621850
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 621878 - Posted: 18 Aug 2007, 12:05:03 UTC - in response to Message 621850.  

For those of you technically adventurous...

Sounds like a considered and well-reasoned argument: under the circumstances, I can't imagine anyone from Berkeley arguing against it. Given Matt's figure of 50% spurious -9s, I think the whole 'tape' will have to be put in Matt's recycling 'box' for re-scrutiny at a future date (I wonder how he'll manage that filing system now it's all on hard drives and remote archival storage, LOL).

Anyway, I've performed surgery on my four, but they'll just have to wait their turn in the queue - now I've unsuspended them, there is indeed a queue, which is good to see.

Glad to do my little bit towards tidying up my corner of the Berkeley BOINC database.


Done my surgery on my three, but one of them had already done nine hours and was at 0.12%, so my RDCF (Result Duration Correction Factor) is now 3.15. Now all my MB WUs are reported as going to take 35 hours or so, more than 10 times what they should. I tried lowering it to 0.9 (in sched_request_setiathome.berkeley.edu.xml), but when I report, it goes back to what it was. Any ideas, or do I have to wait until it goes down by itself?

Claggy.
ID: 621878
Astro
Volunteer tester
Joined: 16 Apr 02
Posts: 8026
Credit: 600,015
RAC: 0
Message 621880 - Posted: 18 Aug 2007, 12:16:15 UTC

Well, you could edit it in the project section of the "client_state.xml" file instead. You'll get better results.
ID: 621880
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14649
Credit: 200,643,578
RAC: 874
United Kingdom
Message 621881 - Posted: 18 Aug 2007, 12:16:43 UTC - in response to Message 621878.  

Done my surgery on my three, but one of them had already done nine hours and was at 0.12%, so my RDCF (Result Duration Correction Factor) is now 3.15. Now all my MB WUs are reported as going to take 35 hours or so, more than 10 times what they should. I tried lowering it to 0.9 (in sched_request_setiathome.berkeley.edu.xml), but when I report, it goes back to what it was. Any ideas, or do I have to wait until it goes down by itself?

Claggy.

Personally, I'm letting mine sort itself out in its own time - I think that there are currently still some issues with WU runtime estimates in MB, so there would be no such thing as a "correct" RDCF for all work.

But if you want to give it a helping hand, the file to edit is client_state.xml (stop BOINC first, use extreme care and a text-only editor, make sure you edit the right project's RDCF figure, backups are always a good idea, etc., etc.)
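
For anyone curious what that hand edit amounts to, here's a minimal sketch (Python; the <duration_correction_factor> tag and project-block layout are the usual BOINC 5.x ones, but check your own client_state.xml before trusting it, and as above stop BOINC first and keep a backup):

    # Sketch: set the SETI@home duration_correction_factor in client_state.xml.
    # Run from the BOINC data directory with BOINC stopped.
    import re, shutil

    STATE = "client_state.xml"
    URL = "setiathome.berkeley.edu"   # project whose RDCF should change
    NEW_DCF = 0.9

    shutil.copyfile(STATE, STATE + ".bak")
    text = open(STATE).read()

    def patch(block):
        # Only touch the <project> block belonging to the chosen project
        if URL not in block:
            return block
        return re.sub(r"<duration_correction_factor>[\d.]+</duration_correction_factor>",
                      "<duration_correction_factor>%f</duration_correction_factor>" % NEW_DCF,
                      block)

    text = re.sub(r"<project>.*?</project>", lambda m: patch(m.group(0)), text, flags=re.S)
    open(STATE, "w").write(text)
    print("duration_correction_factor set to %s for %s" % (NEW_DCF, URL))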
ID: 621881
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19012
Credit: 40,757,560
RAC: 67
United Kingdom
Message 621882 - Posted: 18 Aug 2007, 12:17:11 UTC - in response to Message 621878.  

For those of you technically adventurous...

Sounds like a considered and well-reasoned argument: under the circumstances, I can't imagine anyone from Berkeley arguing against it. Given Matt's figure of 50% spurious -9s, I think the whole 'tape' will have to be put in Matt's recycling 'box' for re-scrutiny at a future date (I wonder how he'll manage that filing system now it's all on hard drives and remote archival storage, LOL).

Anyway, I've performed surgery on my four, but they'll just have to wait their turn in the queue - now I've unsuspended them, there is indeed a queue, which is good to see.

Glad to do my little bit towards tidying up my corner of the Berkeley BOINC database.


Done my surgery on my three, but one of them had already done nine hours and was at 0.12%, so my RDCF (Result Duration Correction Factor) is now 3.15. Now all my MB WUs are reported as going to take 35 hours or so, more than 10 times what they should. I tried lowering it to 0.9 (in sched_request_setiathome.berkeley.edu.xml), but when I report, it goes back to what it was. Any ideas, or do I have to wait until it goes down by itself?

Claggy.

Did you exit BOINC during the edit of client_state.xml?
ID: 621882
Claggy
Volunteer tester

Joined: 5 Jul 99
Posts: 4654
Credit: 47,537,079
RAC: 4
United Kingdom
Message 621886 - Posted: 18 Aug 2007, 12:23:01 UTC - in response to Message 621882.  

For those of you technically adventurous...

Sounds like a considered and well-reasoned argument: under the circumstances, I can't imagine anyone from Berkeley arguing against it. Given Matt's figure of 50% spurious -9s, I think the whole 'tape' will have to be put in Matt's recycling 'box' for re-scrutiny at a future date (I wonder how he'll manage that filing system now it's all on hard drives and remote archival storage, LOL).

Anyway, I've performed surgery on my four, but they'll just have to wait their turn in the queue - now I've unsuspended them, there is indeed a queue, which is good to see.

Glad to do my little bit towards tidying up my corner of the Berkeley BOINC database.


Done my surgery on my three, but one of them had already done nine hours and was at 0.12%, so my RDCF (Result Duration Correction Factor) is now 3.15. Now all my MB WUs are reported as going to take 35 hours or so, more than 10 times what they should. I tried lowering it to 0.9 (in sched_request_setiathome.berkeley.edu.xml), but when I report, it goes back to what it was. Any ideas, or do I have to wait until it goes down by itself?

Claggy.

Did you exit BOINC during the edit of client_state.xml?


Yep, and put a backup on the desktop.

Claggy.

ID: 621886