Message boards : Number crunching : Shorties estimate up from three minutes to six hours after today's outage!
Oddbjornik · Joined: 15 May 99 · Posts: 220 · Credit: 349,610,548 · RAC: 1,728
I just got a batch of shorties, i.e. work units with a two-week time limit. Such units normally finish in three minutes on CUDA. Only the ones I got just after today's outage are estimated to take 05:59:09. Naturally BOINC has abruptly stopped requesting new work. Perhaps this is a trick to temporarily lower the load on the download pipe? I fear the units will run in three minutes as before, and that only the estimate is wrong.
Josef W. Segur · Joined: 30 Oct 99 · Posts: 4504 · Credit: 1,414,761 · RAC: 0
[quote]I just got a batch of shorties, i.e. work units with a two-week time limit. Such units normally finish in three minutes on CUDA.[/quote] That probably indicates that the server software has been rebuilt with changeset [trac]changeset:24128[/trac]. If so, each 3-minute run will reduce the estimates by 1% as the Duration Correction Factor (DCF) gradually comes down; once the estimates get below 30 minutes it will adapt somewhat faster. CPU work will also be overestimated until DCF gets into a lower range, and after that CPU work will tend to fight against the adaptation for GPU work. The purpose of the change is admirable, its implementation unrealistic. Joe
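Joe's 1%-per-task figure implies a long tail before estimates recover. A rough back-of-the-envelope sketch (the rates here are assumptions taken from his description: DCF creeps 1% per completed task toward the true runtime ratio, and somewhat faster, assumed 10%, once the estimate drops below 30 minutes):

```python
# Sketch of the DCF behaviour Joe describes. The 1% and 10% adaptation
# rates are assumptions from his post, not the client's exact code.

def tasks_until_estimate(base_est_s, actual_s, target_s, dcf=1.0):
    """Count completed tasks until base_est_s * dcf falls to target_s."""
    ratio = actual_s / base_est_s          # where DCF is heading
    tasks = 0
    while base_est_s * dcf > target_s:
        # assumed: 1% steps normally, 10% once estimate < 30 minutes
        rate = 0.01 if base_est_s * dcf >= 30 * 60 else 0.10
        dcf += rate * (ratio - dcf)        # move a fraction toward the ratio
        tasks += 1
    return tasks

# 6-hour estimate vs. 3-minute actual runtime, until estimates read 10 min:
n = tasks_until_estimate(6 * 3600, 180, 10 * 60)
print(n, "completed shorties needed")
```

In this toy model a couple of hundred completed shorties are needed before the estimates look sane again, which matches the "gradual" adaptation Joe describes.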
Khangollo · Joined: 1 Aug 00 · Posts: 245 · Credit: 36,410,524 · RAC: 0
I just started receiving WUs with massively overestimated times, too. Won't that wreak havoc with the BOINC client's DCF? What will happen when these units cause local DCF to drop too much? As far as I understand, the client will keep requesting more and more work, much more than cache preferences allow, until DCF stabilizes around 1 again. For people with large caches, this might be a nightmare. Or am I wrong?
Josef W. Segur · Joined: 30 Oct 99 · Posts: 4504 · Credit: 1,414,761 · RAC: 0
[quote]I just started receiving WUs with massively overestimated times, too.[/quote] Luckily you're wrong. The gross overestimation means BOINC thinks you already have a lot of work. DCF will eventually come down as work completes and the cache gets near empty. That will allow at least minimal work fetch requests, but it shouldn't get so low that overfetches by a large amount are likely. If DCF were able to stabilize at a "correct" level, work fetch and estimates would be normal. Unfortunately, most users of optimized apps are using two or more app versions, and the local DCF cannot track more than one correctly. So there is the possibility of overfetching CPU work because DCF has been pulled lower by GPU work. Then when a CPU task finishes, DCF jumps up because it took considerably longer than BOINC was expecting. From there it gradually drops again as GPU tasks finish. We're basically back in the same situation as before the server-side scaling based on what's displayed as "Average processing rate" was deployed. Those running anonymous platform have to cope as well as possible until something better comes along. Joe
perryjay · Joined: 20 Aug 02 · Posts: 3377 · Credit: 20,676,751 · RAC: 0
I just got four 6.10 CUDAs. Three at 6 hours estimated and the fourth at almost 17 hours. I'm running a couple of APs on my GPU right now so I won't worry about them for a while. PROUD MEMBER OF Team Starfire World BOINC
Sutaru Tsureku · Joined: 6 Apr 07 · Posts: 7105 · Credit: 147,663,825 · RAC: 5
I don't understand all this new stuff... So is it time for anonymous-platform members to use <flops> entries in the app_info.xml file again? - Best regards! - Sutaru Tsureku, team seti.international founder. - Optimize your PC for higher RAC. - SETI@home needs your help. -
Josef W. Segur · Joined: 30 Oct 99 · Posts: 4504 · Credit: 1,414,761 · RAC: 0
[quote]I don't understand all this new stuff...[/quote] Those having a mix of work, where some was sent before the change and still has reasonable estimates, should not take any such panic action. Setting <flops> affects all project work on the host. Considering how many got into problems trying to use <flops> before, only those who understand how BOINC uses them ought to consider making hasty changes. However, it is true that the change assumes that those running anonymous platform will have reasonably accurate <flops> in their app_info.xml. If that were true, nobody would be seeing any problem because of the change. Setting <flops>, and adjusting them when trying different operating conditions (such as changes in how many tasks GPUs should be doing at the same time), doesn't require higher math skills. Simple arithmetic is adequate, but needs care. Joe
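For those who do decide to set <flops>, the "simple arithmetic" is commonly done by dividing a task's estimated operation count (<rsc_fpops_est>) by the runtime actually observed for that app version. A sketch; the function name and sample numbers are illustrative, not from Joe's post:

```python
# One common way (an assumption, not Joe's exact recipe) to derive a
# <flops> value: estimated operations divided by real per-task runtime.

def flops_for_app(rsc_fpops_est, measured_runtime_s, concurrent_tasks=1):
    """<flops> that makes the predicted runtime match the measured one.

    concurrent_tasks: if a GPU runs N tasks at once, per-task throughput
    drops, so divide by N (re-derive when you change that setting).
    """
    return rsc_fpops_est / (measured_runtime_s * concurrent_tasks)

# e.g. a task estimated at 2.7e13 operations that really takes 180 s:
single = flops_for_app(2.7e13, 180)
paired = flops_for_app(2.7e13, 180, concurrent_tasks=2)
print("%.3e one-at-a-time, %.3e running two at once" % (single, paired))
```

Since estimated runtime is derived from <rsc_fpops_est> divided by <flops>, a value measured this way makes new estimates land near the observed runtimes, which is exactly the accuracy the server-side change assumes.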
Dave Stegner · Joined: 20 Oct 04 · Posts: 540 · Credit: 65,583,328 · RAC: 27
How does one determine what the correct <flops> entry is? Dave
Geek@Play · Joined: 31 Jul 01 · Posts: 2467 · Credit: 86,146,931 · RAC: 0
I removed the <flops> entry in my app_info file when I found out that the SETI servers are calculating a flops value for each science app and using that value when sending work. Removed the <flops> value 3 or 4 months ago. In my view a lot of problems are caused by the various rescheduler programs moving work from CPU to GPU or from GPU to CPU. The Berkeley servers expect the work to be done by the app they sent it out to, not a different app. So I stopped rescheduling work also. Right now my system is getting work with proper time estimates. I have not, however, received any work since today's outage. My opinion is still to leave the <flops> out of the app_info file for now and wait and see what happens. Boinc....Boinc....Boinc....Boinc....
HAL9000 · Joined: 11 Sep 99 · Posts: 6534 · Credit: 196,805,888 · RAC: 57
[quote]I just got a batch of shorties, i.e. work units with a two-week time limit. Such units normally finish in three minutes on CUDA.[/quote] I guess a side effect of this will help stabilize the huge payouts in the new credit system from machines that completed a lot of shorties and then did some large complex jobs right after. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[/url]
Gatekeeper · Joined: 14 Jul 04 · Posts: 887 · Credit: 176,479,616 · RAC: 0
My VHARs are at 6:30 and change. My 603s are from 14 to 22 hours. Haven't got any APs since the update, but can only imagine what they'll look like. Sheesh.
Josef W. Segur · Joined: 30 Oct 99 · Posts: 4504 · Credit: 1,414,761 · RAC: 0
[quote]Dave Stegner wrote: How does one determine what the correct <flops> entry is?[/quote] For your domain controllers there isn't any, unfortunately. BOINC 5.10.45 doesn't report app_version flops, so the servers will just use the host Whetstone benchmark, and there's no practical way to make that reflect how fast a CPU is doing S@h work. Because those hosts don't have usable GPUs either, local DCF will prove adequate to eventually get estimates reasonably close. Once most of the work on such a host is the overestimated kind, you could edit the <duration_correction_factor> field for this project in its client_state.xml to whatever small fraction gives reasonable estimates for that new work. That's to speed the adaptation, but if you do it too soon, the older work, which will then have tiny estimates, could cause BOINC to fetch more work than you really want; worst case, more work than could be done by deadline. Anything you do to try to correct the situation could turn out unwise if Dr. Anderson recognizes the problem and does some corrective modification. Sitting tight and hoping for the best might indeed be the best policy, as Geek@Play advises. Joe
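The "whatever small fraction" can be computed rather than guessed: estimates scale linearly with DCF, so rescale the current value by the ratio of realistic to displayed runtime. A sketch with illustrative numbers (and remember to stop the BOINC client before editing client_state.xml, or it will overwrite the edit):

```python
# Arithmetic for Joe's client_state.xml edit: pick the DCF that turns the
# currently displayed estimate into a realistic one. Numbers illustrative.

def corrected_dcf(current_dcf, displayed_estimate_s, realistic_estimate_s):
    """Estimates scale linearly with DCF, so rescale proportionally."""
    return current_dcf * realistic_estimate_s / displayed_estimate_s

# Tasks shown at 6 h that really take ~3 min, with DCF currently 1.0:
new_dcf = corrected_dcf(1.0, 6 * 3600, 180)
print(round(new_dcf, 4))
```

The caveat Joe gives still applies: write this value too early and any remaining old-style tasks will be estimated at a tiny fraction of their true runtime, inviting an overfetch.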
Josef W. Segur · Joined: 30 Oct 99 · Posts: 4504 · Credit: 1,414,761 · RAC: 0
[quote]I guess a side effect of this will help stabilize the huge payouts in the new credit system from machines that completed a lot of shorties and then did some large complex jobs right after.[/quote] No, the only change is that estimated runtimes are too big as seen on the client. It doesn't affect the server-side averages or the actual runtimes at all, so it has no effect on the credit calculations. Joe
W-K 666 · Joined: 18 May 99 · Posts: 19048 · Credit: 40,757,560 · RAC: 67
Do wish the BOINC devs would do a proper fix and not baling wire and chewing gum. Think I might go and look at the southern sky.
W-K 666 · Joined: 18 May 99 · Posts: 19048 · Credit: 40,757,560 · RAC: 67
If you have selected to do AP tasks as well as MB, I would suggest you de-select AP 505 until the computer has sorted out the estimates and DCF. If not, you will face months of DCF variations, because each time an AP task completes it will punch the DCF up into the stratosphere and the computer will stop requesting work. [edit] For devs, remember KISS: use server-side estimates OR DCF, not server-side estimates AND DCF.
Dave Stegner · Joined: 20 Oct 04 · Posts: 540 · Credit: 65,583,328 · RAC: 27
I have 19 machines crunching SETI. Some AP only, some MB only, some mixed, and for good measure I threw 3 low-level GPU cards into the mix a month ago. I have been running opti apps for over a year and have never had an issue with incorrect time estimates. I did have a small issue when I added the GPU cards, but it fixed itself in about 2 weeks. After 2 weeks, they were correctly estimating times even though the machines were doing MB on the GPU and MB and AP on the CPU. ALL TIME ESTIMATES FOR ALL TYPES OF WORK (DEDICATED OR MIXED-MODE MACHINES) WERE ACCURATE. If I understand Joe's analysis of what to expect in the future (ping-pong, herky-jerky time estimates, over- and under-subscribing, unless you run AP only or MB only, and CPU only or GPU only), I do NOT understand why this change was made. For my machines, it creates a problem where none existed before. Does anyone know how big the problem, as described in the changeset, was? Dave
W-K 666 · Joined: 18 May 99 · Posts: 19048 · Credit: 40,757,560 · RAC: 67
[quote]I have 19 machines crunching SETI. Some AP only, some MB only, some mixed, and for good measure I threw 3 low-level GPU cards into the mix a month ago. I have been running opti apps for over a year and have never had an issue with incorrect time estimates. I did have a small issue when I added the GPU cards, but it fixed itself in about 2 weeks. After 2 weeks, they were correctly estimating times even though the machines were doing MB on the GPU and MB and AP on the CPU.[/quote] I still have a problem with AP task estimates on the quad I rebuilt (mobo failure), which I updated to Win 7 and fitted with some new components; it was first connected on the 1st Aug. In the first ten AP tasks there were at least three that had "too much blanking" and completed after 30 sec. This made the initial APR about twice what it should be. It is still about 50% out, as the present estimates for the tasks in progress are 8h:45m (duration correction factor 1.055596) when the true estimate should be 12h:30m minimum. This present patch is another band-aid and chewing-gum attempt, when the real patch should detect early completions (-9 overflows, too much blanking, etc.) and remove those times from the APR calculation.
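W-K's point about 30-second exits polluting APR can be illustrated with a toy average. The thread doesn't show the real server-side formula, so the aggregation below (total estimated operations over total runtime) is only one plausible assumption:

```python
# Toy illustration of "too much blanking" exits inflating APR. The real
# server-side averaging is assumed, not taken from this thread.

def apr(samples):
    """Rate as total estimated operations over total runtime."""
    total_fpops = sum(f for f, _ in samples)
    total_secs = sum(s for _, s in samples)
    return total_fpops / total_secs

fpops = 2.0e15                       # nominal AP task size (illustrative)
normal = [(fpops, 12.5 * 3600)] * 7  # seven genuine ~12.5 h runs
blanked = [(fpops, 30)] * 3          # three 30 s early exits, full fpops

inflation = apr(normal + blanked) / apr(normal)
print(inflation)   # how many times too high the APR looks
```

Even under this mild averaging the three early exits visibly inflate the rate, and an inflated APR means underestimated runtimes for every subsequent task, which is exactly why W-K wants outliers excluded from the calculation.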
skildude · Joined: 4 Oct 00 · Posts: 9541 · Credit: 50,759,529 · RAC: 60
I got a few WUs from 25jn11ab that estimated at 133 hours on my CPU. Final times were around 3 hours. That is an unusually long time for even a VLAR on my PC. These WUs weren't marked as VLARs either. I have to wonder if the estimates are specific to the AR, so if BOINC sees an AR that it hasn't processed yet, it gives wildly exaggerated estimates. In a rich man's house there is no place to spit but his face. Diogenes of Sinope
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.