Work Unit problem
Author | Message |
---|---|
gomeyer Joined: 21 May 99 Posts: 488 Credit: 50,370,425 RAC: 0 |
And mine 04mr07ab.7106.5389 05mr07ab.6072.369046 (3 of these) 04mr07ab.7106.5798 05mr07aa.12210.24612 04mr07ab.7106.6207 05mr07aa.3769.20522 (5 of these) 04mr07ab.7106.6616 Gus |
Jesse Viviano Joined: 27 Feb 00 Posts: 100 Credit: 3,949,583 RAC: 0 |
Seems that a few of them do not overflow with a -9 overflow message, but exit with a compute error that states that the maximum CPU time limit was exceeded and returned a -187 (0xffffff4f) error. Look at mine, 04mr07ab.10282.4980, here, which is the result to a work unit I posted above. If you run across it, just abort it until there are so many errors that the work unit is tossed as trash. |
Bob Nadler Joined: 3 Sep 99 Posts: 7 Credit: 726,368 RAC: 0 |
Quoting Jesse Viviano: "Seems that a few of them do not overflow with a -9 overflow message, but exit with a compute error that states that the maximum CPU time limit was exceeded and returned a -187 (0xffffff4f) error. Look at mine, 04mr07ab.10282.4980, here, which is the result to a work unit I posted above. If you run across it, just abort it until there are so many errors that the work unit is tossed as trash." I would agree. I am running 04mr07ab.14840.4980.6.4.243 on a Linux system with 2 GHz Xeon CPUs, BOINC v5.8.16 and SETI v5.27. It has 13 hours of CPU time and is .035% done :-\ I suspended and restarted it, so now it is .030% done and has 14 hours to completion. http://setiathome.berkeley.edu/workunit.php?wuid=147539328 What is the maximum amount of CPU time the project allows on a workunit? Will this time out or just run past the report deadline (which is 25 Aug)? I would rather let this go if it would be the last time anyone gets this WU. Thanks! Bob |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Quoting Jesse Viviano: "Seems that a few of them do not overflow with a -9 overflow message, but exit with a compute error that states that the maximum CPU time limit was exceeded and returned a -187 (0xffffff4f) error. Look at mine, 04mr07ab.10282.4980, here, which is the result to a work unit I posted above. If you run across it, just abort it until there are so many errors that the work unit is tossed as trash." I thought for a moment I'd got your hand-me-downs! But it was a different re-issue - from result 590582939 - with the same symptoms as you describe. I've suspended mine at 3 hours 20 minutes (0.089%) in case anyone wants details. (Joe - <triplet_thresh>-0.764051318</triplet_thresh> - ???). I'm happy to keep it out of circulation for the weekend (I've got other long-running projects on the box), and we can ask the lab on Monday what the chances are of a scripted bulk cancellation. [Anyone know whether Eric is due back next week, or has he still got two, three, ..., weeks of vacation to go?] |
W-K 666 Joined: 18 May 99 Posts: 19406 Credit: 40,757,560 RAC: 67 |
Quoting Jesse Viviano: "Seems that a few of them do not overflow with a -9 overflow message, but exit with a compute error that states that the maximum CPU time limit was exceeded and returned a -187 (0xffffff4f) error. Look at mine, 04mr07ab.10282.4980, here, which is the result to a work unit I posted above. If you run across it, just abort it until there are so many errors that the work unit is tossed as trash." It would probably be better to get a bulk cancellation ASAP, because this must be part of the problem with connecting etc. One guesses that there could have been up to 500,000 original bad units, and if they are all re-issued until they reach the maximum number of errors, that would be up to 750,000 extra copies. I've already had one re-issue that has gone to 5 copies. Andy |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
[Anyone know whether Eric is due back next week, or has he still got two, three, ..., weeks of vacation to go?] Pappa has posted in Beta that Eric is on vacation (holiday) for another week .... not much else will get done for at least another week and 2 weekends. so it would have to be Matt or Jeff. I'm not sure what their thinking would be on that - I suppose it depends whether disk/database space, or download throughput, is seen as more important at the moment: at least the 'sticky' WUs don't have a very fast turnover! I guess it'll have to be their call in the end. |
Pablo_ZPM Joined: 13 Jul 01 Posts: 3 Credit: 367,720 RAC: 0 |
Hi all, I used to have stuck units from time to time, but now I'm experiencing the opposite: see post http://setiathome.berkeley.edu/forum_thread.php?id=41585&nowrap=true#621028 Any ideas? :o) Seems to me I'm generating a lot of traffic / requests for new work and, hey! I do get new units from time to time. Right now my box is finishing another one that took 2.5 hours to process (usually it takes 11 - 18 hours to churn out a result), so it is hollering for more work. I've had several units that took 90 seconds to process, and then it was back for more work. :o( Pablo_ZPM |
Jim-R. Joined: 7 Feb 06 Posts: 1494 Credit: 194,148 RAC: 0 |
Hi all, From checking about a dozen of your last results it seems you have run into quite a large number of "high" angle range work units, ar= 1.49xxx. These will run quicker than the more "normal" 0.42xxx ar's. Also there have been quite a few that have been what we call "-9 overflow" or "noisy" work units. The time it takes to crunch these work units just depends on how "noisy" they are. If they reach the maximum number of results quickly (extremely "noisy") they will end quickly. If they are not too noisy they might take a bit longer to error out. So your computer is still doing good work and there's nothing to worry about. Once we get out of the high angle ranges and start issuing some more "normal" ranges things will settle down. Jim Some people plan their life out and look back at the wealth they've had. Others live life day by day and look back at the wealth of experiences and enjoyment they've had. |
Pablo_ZPM Joined: 13 Jul 01 Posts: 3 Credit: 367,720 RAC: 0 |
Hi all, Thanks, I now finally got some units which seem to be perfectly "ordinary" - rate of progress vs. processing time and that with your answer quieted my worries about creating excessive demand on s@h servers. Everything back to normal, if searching for "little green men" fits that description... :o) |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Well, it seems all my 04mr07ab units (about 4 dozen of them) crunched through and all '-9 overflowed' in 6 to 30 CPU seconds [none stuck]. All gone now and I seem to be crunching much healthier workunits :D "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
... For your hosts with Whetstone MIPS in the 1720 range, BOINC will kill these high angle range WUs after about 1.5 days. The exact amount can be found from values in the client_state.xml file. Find the <rsc_fpops_bound> value for the WU and divide it by the host <p_fpops> value to get the CPU time limit in seconds. Unfortunately, that would be reported as an error and the servers would reissue the WU to another host. The splitter problem which causes triplet_thresh to go negative also causes pulse_thresh to be lower than normal, so probably most of these glacially slow WUs will overflow on pulses if run long enough. But if the data is actually quiet enough, that may not happen. Joe |
Bob Nadler Joined: 3 Sep 99 Posts: 7 Credit: 726,368 RAC: 0 |
... Thanks Joe! I calculate that out for this workunit on my system to be 36.35 CPU hours.. Unless someone knows otherwise I guess I can let that run to see if it reaches that threshold or finishes.. Maybe this is the WU with a real ET signal ;-) Bob |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
For those of you technically adventurous, here's another option to handle the WUs with negative triplet threshold. The negative threshold means no triplets can possibly be found; a high enough positive threshold has the same effect. But the very high positive threshold doesn't make crunching slow; it actually makes it slightly faster than a normal threshold because triplet finding has less work to do. So the workaround is: 1. Ensure the WU is not in use by shutting down BOINC. (If your preferences are to have suspended work removed from memory, then Suspending the work would be enough.) 2. Open the WU in an editor, NOT a word processor or anything else which may change more than you intend. 3. Find the <triplet_thresh>-x.xxxxxx</triplet_thresh> line. 4. Change it to <triplet_thresh>99</triplet_thresh>. 5. Save the WU file. 6. Restart BOINC or Resume the WU. That WU may not start running straight away. You could force it by Suspending others, or simply let BOINC get to it whenever. When it does run, it is likely to overflow on Pulses. The result should match that from someone who has allowed the WU to creep to that naturally. Because of the high threshold your credit claim will be lower, but probably by less than 1 cobblestone. I did this to all 4 I had: wuid 148063182 has validated against a full run; wuid 148063170, wuid 148063178, and wuid 148140560 don't have other completed work yet. I am not urging anyone else to try this, and can't think of another situation in which I'd consider modifying a WU. Normally any change to a WU would lead to an invalid result; this is a very unusual exception. Even so, I considered long and thoroughly before posting this, and will not be at all unhappy if someone from Berkeley decides to hide this post. Joe |
gomeyer Joined: 21 May 99 Posts: 488 Credit: 50,370,425 RAC: 0 |
Quoting Josef W. Segur: "For those of you technically adventurous . . . Joe" Shweet! Works as advertised. Thanks for sticking with this Joe, and for coming up with a viable workaround. I was up to 19 of these and my BoincView display was getting a little messy. Regards, Gus |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Quoting Josef W. Segur: "For those of you technically adventurous..." Sounds like a considered and well-reasoned argument: under the circumstances, I can't imagine anyone from Berkeley arguing against it. Given Matt's figure of 50% spurious -9s, I think the whole 'tape' will have to be put in Matt's recycling 'box' for re-scrutiny at a future date (I wonder how he'll manage that filing system now it's all on hard drives and remote archival storage, LOL). Anyway, I've performed surgery on my four, but they'll just have to wait their turn in the queue - now I've unsuspended them, there is indeed a queue, which is good to see. Glad to do my little bit towards tidying up my corner of the Berkeley BOINC database. |
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
Quoting Josef W. Segur: "For those of you technically adventurous..." Done my surgery on my three, but one of them had already done nine hours and was at 0.12%, so my RDCF (Result duration correction factor) is now 3.15. Now all my MB WUs are reported as going to take 35 hours or so, more than 10 times what they should. I tried lowering it to 0.9 (in sched_request_setiathome.berkeley.edu.xml), but when I report, it goes back to what it was. Any ideas, or do I have to wait until it goes down by itself? Claggy. |
Astro Joined: 16 Apr 02 Posts: 8026 Credit: 600,015 RAC: 0 |
Well, you could edit it in the project section of the "client_state.xml" file instead. You'll get better results. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
Quoting Claggy: "Done my Surgery on my three, but one of them had already done nine hours and was at 0.12%, so my RDCF (Result duration correction factor) is now 3.15," Personally, I'm letting mine sort itself out in its own time - I think that there are currently still some issues with WU runtime estimates in MB, so there would be no such thing as a "correct" RDCF for all work. But if you want to give it a helping hand, the file to edit is client_state.xml (stop BOINC first, use extreme care and a text-only editor, make sure you edit the right project's RDCF figure, backups are always a good idea, etc., etc.) |
W-K 666 Joined: 18 May 99 Posts: 19406 Credit: 40,757,560 RAC: 67 |
Quoting Josef W. Segur: "For those of you technically adventurous..." Did you exit BOINC during edit of client state? |
Claggy Joined: 5 Jul 99 Posts: 4654 Credit: 47,537,079 RAC: 4 |
Quoting Josef W. Segur: "For those of you technically adventurous..." Yep, and put a back up on the desktop. Claggy. |
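
For anyone who prefers a script to hand editing, Joe's two suggestions in this thread (dividing <rsc_fpops_bound> by <p_fpops> to get the CPU time limit, and raising a negative <triplet_thresh> to 99) could be sketched roughly as below. This is only an illustration under stated assumptions, not anything supplied by the project: the function names are hypothetical, the `.bak` backup name is an arbitrary choice, and it assumes the WU file holds the threshold as a plain `<triplet_thresh>-x.xxxxxx</triplet_thresh>` element as quoted in the posts. As with the manual steps, BOINC should be shut down first, and it only makes sense for the negative-threshold WUs discussed here.

```python
import re
from pathlib import Path


def cpu_time_limit_seconds(rsc_fpops_bound: float, p_fpops: float) -> float:
    """CPU time limit in seconds: <rsc_fpops_bound> divided by <p_fpops>."""
    if p_fpops <= 0:
        raise ValueError("p_fpops must be positive")
    return rsc_fpops_bound / p_fpops


# Matches a <triplet_thresh> element only when its value starts with a minus sign.
# Works on bytes so that nothing else in the file is re-encoded or altered.
NEGATIVE_TRIPLET = re.compile(
    rb"(<triplet_thresh>)\s*-[0-9.eE+-]+\s*(</triplet_thresh>)"
)


def raise_negative_triplet_thresh(data: bytes, new_value: bytes = b"99"):
    """Return (patched_data, replacement_count); positive thresholds are untouched."""
    return NEGATIVE_TRIPLET.subn(
        lambda m: m.group(1) + new_value + m.group(2), data
    )


def patch_wu_file(path: Path) -> int:
    """Patch one WU file in place, saving the original next to it as <name>.bak.

    Hypothetical helper: run it only while BOINC is shut down.
    """
    original = path.read_bytes()
    patched, count = raise_negative_triplet_thresh(original)
    if count:
        path.with_name(path.name + ".bak").write_bytes(original)  # keep a backup
        path.write_bytes(patched)
    return count
```

To get hours rather than seconds, divide the result of `cpu_time_limit_seconds` by 3600, using the two values read from your own client_state.xml. `patch_wu_file` returns 0 and leaves the file alone when no negative threshold is found, so a healthy WU passed to it by mistake is not modified.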
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.