Long Work Units

Kevster
Joined: 11 Jan 01
Posts: 33
Credit: 1,548,476
RAC: 0
Canada
Message 729873 - Posted: 24 Mar 2008, 16:33:50 UTC

99.9% of the work units on my computer take just over 4 hours. Once in a while I get strange work units that take many times longer. If I stop BOINC Manager and restart, the work unit will often finish, sometimes not. Other times things will not make sense, like after 10 hours a work unit is 10% complete with only 16 hours left. You do the math, it just doesn't work out. I check my CPU status, and my computer isn't working on anything else. I basically use my computer for email, so it's not like it's busy trying to solve the answer to world peace now and then. What is going on?
ID: 729873 · Report as offensive
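A quick way to see why the figures in the post above look inconsistent is a naive linear extrapolation from elapsed time and fraction done. This is only a sketch: the function name is hypothetical, and the linear model is an assumption for illustration, not a description of how BOINC computes its own remaining-time figure.

```python
def linear_remaining_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Naive estimate: assume progress continues at the average rate so far."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    total_hours = elapsed_hours / fraction_done  # projected total run time
    return total_hours - elapsed_hours           # what is still left to do


# Figures quoted in the post: 10 hours elapsed, 10% complete, 16 hours shown as left.
estimate = linear_remaining_hours(10.0, 0.10)
print(f"linear estimate: {estimate:.0f} h remaining; reported: 16 h")
```

Under that assumption the task would need roughly 90 more hours, not 16, which is the mismatch the post describes.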
Keith T.
Volunteer tester
Joined: 23 Aug 99
Posts: 962
Credit: 537,293
RAC: 9
United Kingdom
Message 730063 - Posted: 24 Mar 2008, 23:04:13 UTC - in response to Message 729873.  

Different WU's have different Angle Ranges.

http://www.boinc-wiki.info/True_Angle_Range
http://www.boinc-wiki.info/SETI%40Home_FAQ:_The_SETI%40Home_Project#Why_is_there_so_much_variability_in_work_unit_completion_time_with_version_3.x.3F
Sir Arthur C Clarke 1917-2008
ID: 730063 · Report as offensive
PhonAcq
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 735952 - Posted: 7 Apr 2008, 22:58:17 UTC

I've gotten a few of these, and have one now. It seems that the last one(s) finished in about 10x the time it was projected to take. And my wingman finished it in a 'standard' time. So I lost about 9 wu's equivalent by letting it finish. This was not an AR issue.

So what is suggested? Do we abort the errant wu's when we spot them, or let them loop almost forever and waste the compute time? Anyone??
ID: 735952 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 735959 - Posted: 7 Apr 2008, 23:04:05 UTC - in response to Message 735952.  

Do we get a clue? A WU ID, perhaps, or a Task ID? Even a host ID? Your cache size and batch-processing mode make hunting for a needle in the proverbial a tad difficult.
ID: 735959 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 735962 - Posted: 7 Apr 2008, 23:11:21 UTC

Kevster's task 796655083 restarted again, and again, and again, from the beginning. Setting 'keep applications in memory when suspended' can help avoid this - though since the RAC on his other two projects is 0, that's probably not the problem here.
ID: 735962 · Report as offensive
PhonAcq
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 735970 - Posted: 7 Apr 2008, 23:17:39 UTC

Could be one of them: http://setiathome.berkeley.edu/result.php?resultid=795910157 . It is long, but maybe not 'too' long.

The one currently running is 29mr07af.22054.23794.15.7.156_1; it is 3h into a projected 6h run that normally takes 1.5-2h.

The egregious one has been cleared by the system. It matched its partner, but had way too many hours. I think it completed on or after 4/4/08.

Sorry, I can't quote more.
ID: 735970 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 735972 - Posted: 7 Apr 2008, 23:22:29 UTC - in response to Message 735970.  

A Q6700 @ 2.66 GHz takes less than half the time of a P4 @ 3.00 GHz? Doesn't seem to be terribly much wrong with that.
ID: 735972 · Report as offensive
PhonAcq
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 735973 - Posted: 7 Apr 2008, 23:27:26 UTC - in response to Message 735972.  

Like I said, it could have been one of them. The bad boy which has disappeared from the on-line query was about 8-10x longer than its wingman. If I catch a fresh one, like the one running now, I'll try to post the info here.
ID: 735973 · Report as offensive
PhonAcq
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 735976 - Posted: 7 Apr 2008, 23:31:46 UTC - in response to Message 735973.  

Oh, yes, I listed this one because it took a long time and I got half the credit I would have expected. The ones issued to me at the same time completed in 6000 s and received 70 cs's. This one took 12K s and got 60 cs's.
ID: 735976 · Report as offensive
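The comparison in the post above can be put on a common footing by converting each task to credit per CPU hour. The sketch below uses only the figures quoted in the post (70 credits in 6000 s, 60 credits in 12,000 s); the helper name is hypothetical.

```python
def credit_per_hour(credits: float, cpu_seconds: float) -> float:
    """Credit earned per hour of CPU time."""
    return credits / cpu_seconds * 3600.0


typical = credit_per_hour(70, 6_000)   # tasks issued at the same time
slow = credit_per_hour(60, 12_000)     # the task being questioned
print(f"typical: {typical:.1f}/h  slow: {slow:.1f}/h  ratio: {slow / typical:.2f}")
```

On these numbers the slow task earns a little under half the hourly rate of the others, which matches the "half the credit" complaint when it is read as credit per unit of time.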
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 735977 - Posted: 7 Apr 2008, 23:33:18 UTC - in response to Message 735976.  

Like Keith T said, different ARs perform differently. It's not linear.
ID: 735977 · Report as offensive
GeoCochlea
Joined: 4 Feb 05
Posts: 35
Credit: 31,021,410
RAC: 0
Denmark
Message 736162 - Posted: 8 Apr 2008, 12:51:38 UTC

I've had the same problem for about a year on my laptop (and ONLY on my laptop). Sometimes a WU gets stuck in processing, the progress indicator stops, and the '% to completion' goes up instead of down. I suspected Crunch3r's optimized app and changed to Simon's Chicken app, but then it happened again…and again…

I feel that this particular problem has nothing to do with AR, and because it doesn’t happen on all computers, I think it is somehow software related. Recently there was the 'headless' batch (13feb08ac.8515.4162.3.7.xxx) but that was just a batch of dud WUs.
Whether it’s Windows, the optimized app, or something different, I don’t know, but if somebody could think of a cure I would certainly be grateful.

I haven’t noticed if the problem occurs in random WUs or if the problem is associated with certain batches. I only had a few in the past month, but in January/February I had several every week.

Right now I can find only one bad WU in my work list (but I cancelled it when I saw it on the computer): 762186011.
The laptop is a Lenovo/IBM T60 Thinkpad.

/Mark
ID: 736162 · Report as offensive
PhonAcq
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 736177 - Posted: 8 Apr 2008, 13:25:29 UTC

http://setiathome.berkeley.edu/workunit.php?wuid=242589114
This took 55K sec to earn what the wingman earned in 7K sec, using essentially the same CPU (C2Q). I suspect there is some errant code somewhere, perhaps corners cut in order to obtain performance?


http://setiathome.berkeley.edu/workunit.php?wuid=242590925
This one took 12K sec to earn what I usually earn in 7Ks. I am really beginning to have disdain for this credit granting system.

ID: 736177 · Report as offensive
Richard Haselgrove Project Donor
Volunteer tester
Joined: 4 Jul 99
Posts: 14690
Credit: 200,643,578
RAC: 874
United Kingdom
Message 736181 - Posted: 8 Apr 2008, 13:41:39 UTC - in response to Message 736177.  
Last modified: 8 Apr 2008, 13:56:16 UTC

WU 242589114 this took 55K sec to earn what the wingman earned in 7Ks, using essentially the same cpu. (C2Q) I suspect there is some errant code somewhere, perhaps corners cut in order to obtain performance?

Crunch3r 2.4V 64-bit on Vista SP1 - quite a lot of variables to look at there. But no sign of restarting.

WU 242590925 This one took 12K sec to earn what I usually earn in 7Ks. I am really beginning to have disdain for this credit granting system.

Likewise.

I saw a rant from someone on one of the boards who claimed that Vista SP1 had knocked back his RAC, but no facts: no examples, no WU links, no 'before and after' comparisons. So I ignored it.

But I did put SP1 on my own Vista (32-bit) box on Sunday, and that is one that I monitor and log quite closely. If anything shows up in the way of extended crunch times, I'll notice it and post about it. But so far, no problems at all.

PS You saw what I did with your links?

Edit - just re-vac'd the chart, and I've got one at 11,407 seconds against a usual range of 6K - 8K. But it was one of the rare AR=0.242808 ones, claiming 106 credits, so I don't think we can blame Vista for that (and anyway, it was a few days ago, and it's been purged now). Everything else is within the same scatter as before.
ID: 736181 · Report as offensive
PhonAcq
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 739348 - Posted: 15 Apr 2008, 11:56:08 UTC

Here is another one that ran anomalously. And this one ran to normal completion but failed to validate, which is very very rare for this box.

I think that SETI and/or BOINC has a hard time playing with other tasks in the Vista 64b SP1 playground. It seems that the workunits have trouble, leading to long runtimes and once in a while an invalid result, when my backup runs or when I install software (like MS updates). Both of these actions seem to lock out other activities on the computer for short periods of time, and BOINC seems to choke. The difficulty can be seen in the BOINC log. See this thread for further information.
ID: 739348 · Report as offensive
Clyde C. Phillips, III
Joined: 2 Aug 00
Posts: 1851
Credit: 5,955,047
RAC: 0
United States
Message 739898 - Posted: 16 Apr 2008, 17:38:03 UTC

Recently I did see two 20,000-second workunits, which only gave as much as 10,000-second ones for my PD950s. It looked like at least one restarted late in cycle but not the other. There were only two so I'm not gonna worry. Don't know what kind of error would have caused this. If I see more I will try to find out why. With error-free units crunchtimes are supposed to vary with the credits awarded but that proportion holds up poorly because it's difficult to impossible to accommodate the program to all processors and projects. Fortunately RAC doesn't vary too much because of the variety of workunits over a period of a week or so.
ID: 739898 · Report as offensive
PhonAcq
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 740436 - Posted: 17 Apr 2008, 15:02:11 UTC

And here is another one. My quad finished it in 34K sec and received 31 credits, while the wingman finished it in 5K sec.

There is a loose cannon in the code and it is hurting the project. I really wish someone competent would investigate. In this example, I could have completed 6 other wu's in the time wasted.

Isn't anyone (but me) PO'd?
ID: 740436 · Report as offensive
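The "6 other wu's" figure in the post above follows from simple arithmetic, sketched here. It assumes that the wingman's run time is what the task would normally have taken on this host, which the thread itself does not establish; the function name is hypothetical.

```python
def tasks_forgone(slow_seconds: float, normal_seconds: float) -> float:
    """Number of normal-length tasks that fit into the extra time spent."""
    return (slow_seconds - normal_seconds) / normal_seconds


# Figures from this post: 34K s on the quad vs 5K s for the wingman.
print(f"{tasks_forgone(34_000, 5_000):.1f} tasks forgone")
# Figures from the earlier post (Message 736177): 55K s vs 7K s.
print(f"{tasks_forgone(55_000, 7_000):.1f} tasks forgone")
```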
Clyde C. Phillips, III
Joined: 2 Aug 00
Posts: 1851
Credit: 5,955,047
RAC: 0
United States
Message 740915 - Posted: 18 Apr 2008, 17:14:17 UTC

I saw at least three more long units today, three times longer than other typical units, all on one of my two PD950s. One unit is 28mr07al...59 and another is 27mr08al...219. One was a reset, the other wasn't. The third one was a "shortie" that required the typical time for a longer unit. What's goin' on, anyway? If something's wrong with that machine all of its units should be long. Not so!
ID: 740915 · Report as offensive
PhonAcq
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 741048 - Posted: 18 Apr 2008, 23:07:24 UTC
Last modified: 18 Apr 2008, 23:07:55 UTC

And another one, where the wingman spent 11K sec on a C2D and I spent 34K sec on a C2Q. (Should have matched better if something wasn't wrong, right?) No restarts listed for my machine, by the way:

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<stderr_txt>
Optimized SETI@Home Enhanced application
Optimizers: Ben Herndon, Josef Segur, Alex Kan, Simon Zadra
Version: Windows SSSE3 64-bit based on S@H V5.15 'Noo? No - Ni!'
Revision: R-2.4V|xT|FFT:IPP_SSSE3|Ben-Joe
CPUID: Intel(R) Core(TM)2 Quad CPU Q6700 @ 2.66GHz
Speed: 4 x 2660 MHz
Cache: L1=64K L2=4096K
Features: MMX SSE SSE2 SSE3 x86_64

Work Unit Info
WU Credit multi. is: 2.85
WU True angle range: 0.436959

Spikes Pulses Triplets Gaussians Flops
0 0 2 1 15754773052931

</stderr_txt>
]]>


I do notice that when this happens, boinc manager recomputes all the expected times and reports some wild expected run-times for the wu's ready to start. These estimates slowly return to 'normal' as more wu's are computed, however.
ID: 741048 · Report as offensive
RandyC
Joined: 20 Oct 99
Posts: 714
Credit: 1,704,345
RAC: 0
United States
Message 741066 - Posted: 18 Apr 2008, 23:46:45 UTC - in response to Message 741048.  
Last modified: 18 Apr 2008, 23:48:08 UTC

That's your DCF (duration correction factor) being adjusted based on expected vs actual run time.
[edit]correction: estimated vs actual, not expected vs actual[/edit]
ID: 741066 · Report as offensive
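To illustrate how one long task could inflate every pending estimate and then fade out slowly, here is a toy model of a duration correction factor. The update rule (jump up at once, drift down gradually), the 0.1 decay constant, and the 7K s base estimate are all assumptions made for illustration; they are not taken from the thread or presented as BOINC's actual implementation. Only the 34K s run time comes from the posts above.

```python
def update_dcf(dcf: float, estimated_s: float, actual_s: float,
               decay: float = 0.1) -> float:
    """Hypothetical update rule: rise immediately, fall back slowly."""
    raw_ratio = actual_s / estimated_s          # actual vs uncorrected estimate
    if raw_ratio > dcf:
        return raw_ratio                        # one long task inflates the factor
    return dcf + decay * (raw_ratio - dcf)      # normal tasks pull it back gradually


dcf = 1.0
history = []
# One 34K s task against an assumed 7K s estimate, then ten normal tasks.
for actual in [34_000] + [7_000] * 10:
    dcf = update_dcf(dcf, 7_000, actual)
    history.append(round(dcf, 2))
print(history)
```

In this toy model the factor jumps to nearly 5 after the long task and is still above 2 after ten normal tasks, which is in line with the observation that the wild estimates only slowly return to normal.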
PhonAcq
Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 741358 - Posted: 19 Apr 2008, 12:42:13 UTC

Yes, I agree it is the DCF. But the wild estimates (4-8x too large) are due to these errant long work units I get. Probably the dcf is being computed correctly.
ID: 741358 · Report as offensive