Shorties estimate up from three minutes to six hours after today's outage!

Oddbjornik
Volunteer tester
Joined: 15 May 99
Posts: 220
Credit: 349,610,548
RAC: 1,728
Norway
Message 1151791 - Posted: 13 Sep 2011, 19:22:28 UTC

I just got a batch of shorties, i.e. work units with a two-week time limit. Such units normally finish in three minutes on CUDA.

Only the ones I got just after today's outage are estimated to take 05:59:09.

Naturally BOINC has abruptly stopped requesting new work. Perhaps this is a trick to temporarily lower the load on the download pipe? I fear the units will run in three minutes as before, and that only the estimate is wrong.
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1151797 - Posted: 13 Sep 2011, 19:34:37 UTC - in response to Message 1151791.  

I just got a batch of shorties, i.e. work units with a two week time limit. Such units normally finish in three minutes on cuda.

Only the ones I got just after today's outage are estimated to take 05:59:09.

Naturally Boinc has abruptly stopped requesting new work. Perhaps this is a trick to temporarily lower the load on the download pipe? I fear the units will run in three minutes as before, and that only the estimate is wrong.

That probably indicates that the server software has been rebuilt with changeset [trac]changeset:24128[/trac]. If so, each 3 minute run will reduce the estimates by 1% as Duration Correction Factor (DCF) gradually reduces until the estimates get below 30 minutes, then it will adapt somewhat faster. CPU work will also be overestimated until DCF gets into a lower range, after that CPU work will tend to fight against the adaptation for GPU work.
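As a rough illustration of the 1%-per-result convergence Joe describes, here is a toy Python sketch. The update rule, the 10% convergence cutoff, and the task numbers are simplifying assumptions for illustration, not the actual BOINC client code:

```python
# Toy model of Duration Correction Factor (DCF) adaptation for overestimated
# shorties: each completed task pulls DCF 1% of the way toward the true
# runtime/estimate ratio. Illustrative simplification, not real BOINC code.

def simulate_dcf(raw_estimate_s, actual_runtime_s, dcf=1.0):
    """Yield the displayed estimate after each completed task."""
    target = actual_runtime_s / raw_estimate_s   # the "correct" DCF
    while raw_estimate_s * dcf > actual_runtime_s * 1.1:
        dcf += 0.01 * (target - dcf)             # 1% step per result
        yield raw_estimate_s * dcf

# A shorty that really takes 3 minutes but is estimated at ~6 hours:
estimates = list(simulate_dcf(raw_estimate_s=6 * 3600, actual_runtime_s=180))
print(f"tasks needed to converge:  {len(estimates)}")
print(f"estimate after 1 task:    {estimates[0] / 3600:.2f} h")
print(f"estimate after 100 tasks: {estimates[99] / 3600:.2f} h")
```

With these numbers the estimate creeps down only a few minutes per task at first, which is why the adaptation takes hundreds of shorties to settle.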

The purpose of the change is admirable; its implementation is unrealistic.
                                                                  Joe
Khangollo
Joined: 1 Aug 00
Posts: 245
Credit: 36,410,524
RAC: 0
Slovenia
Message 1151819 - Posted: 13 Sep 2011, 20:21:22 UTC - in response to Message 1151797.  

I just started receiving WUs with massively overestimated times, too.
Won't that wreak havoc with the BOINC client's DCF? What will happen when these units cause the local DCF to drop too much? As far as I understand, the client will keep requesting more and more work, much more than the cache preferences allow, until DCF stabilizes around 1 again. For people with large caches, this might be a nightmare. Or am I wrong?
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1151847 - Posted: 13 Sep 2011, 20:58:48 UTC - in response to Message 1151819.  

I just started receiving WUs with massively overestimated times, too.
Won't that wreak havoc with boinc client's DCF? What will happen when these units cause local DCF to drop too much? As far as I understand, the client will keep requesting more and more work, much more than cache preferences allow until DCF stabilizes around 1 again. For people with large caches, this might be a nightmare. Or am I wrong?

Luckily you're wrong. The gross overestimation means BOINC thinks you already have a lot of work. DCF will eventually come down as work completes and the cache gets near empty. That will allow at least minimal work fetch requests, but it shouldn't get so low that overfetches by a large amount are likely.

If DCF were able to stabilize at a "correct" level, work fetch and estimates would be normal. Unfortunately, most users of optimized apps are using two or more app versions and the local DCF cannot track more than one correctly. So there is the possibility of overfetching CPU work because DCF has been pulled lower by GPU work. Then when a CPU task finishes DCF jumps up because it took considerably longer than BOINC was expecting. From there it gradually drops again as GPU tasks finish.
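The CPU/GPU tug-of-war described above can be sketched with a toy model of a single shared DCF. The jump-up/creep-down rule is a simplification of the client's behaviour, and all task numbers are invented:

```python
# Toy model of one shared DCF fought over by GPU shorties and CPU tasks.
# Underestimates make DCF jump up at once; overestimates only creep it
# down 1% per result. Simplified illustration, not actual BOINC code.

def update_dcf(dcf, raw_estimate_s, actual_s):
    ratio = actual_s / raw_estimate_s
    if ratio > dcf:
        return ratio                      # task ran long: jump up immediately
    return dcf + 0.01 * (ratio - dcf)     # task ran short: creep down slowly

dcf = 1.0
tasks = ([("gpu", 21600, 180)] * 50      # fifty 3-minute shorties estimated at 6 h
         + [("cpu", 7200, 10800)]        # one CPU task that runs longer than estimated
         + [("gpu", 21600, 180)] * 50)
history = []
for kind, raw, actual in tasks:
    dcf = update_dcf(dcf, raw, actual)
    history.append(dcf)

print(f"DCF after 50 GPU shorties:   {history[49]:.3f}")  # dragged well below 1
print(f"DCF after one long CPU task: {history[50]:.3f}")  # snaps back up
```

The single CPU task undoes weeks of gradual GPU adaptation in one step, which is the sawtooth pattern Joe describes.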

We're basically back in the situation as before the server-side scaling based on what's displayed as "Average processing rate" was deployed. Those running anonymous platform have to cope as well as possible until something better comes along.
                                                                  Joe
perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 1151849 - Posted: 13 Sep 2011, 20:59:04 UTC - in response to Message 1151819.  

I just got four 6.10 cudas. Three at 6 hours estimated and the fourth at almost 17 hours. I'm running a couple of APs on my GPU right now so I won't worry about them for awhile.


PROUD MEMBER OF Team Starfire World BOINC
arkayn
Volunteer tester
Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 1151853 - Posted: 13 Sep 2011, 21:08:07 UTC - in response to Message 1151849.  

I just got four 6.10 cudas. Three at 6 hours estimated and the fourth at almost 17 hours. I'm running a couple of APs on my GPU right now so I won't worry about them for awhile.


They are showing 23 hours for the GTX560.

Sutaru Tsureku
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1151873 - Posted: 13 Sep 2011, 22:03:10 UTC

I don't understand all this new stuff..

So it's time for anonymous members to use <flops> entries in app_info.xml file again?


- Best regards! - Sutaru Tsureku, team seti.international founder. - Optimize your PC for higher RAC. - SETI@home needs your help. -
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1151890 - Posted: 13 Sep 2011, 22:41:03 UTC - in response to Message 1151873.  

I don't understand all this new stuff..

So it's time for anonymous members to use <flops> entries in app_info.xml file again?

Best regards! Sutaru Tsureku

Those having a mix of work, where some was sent before the change and still has reasonable estimates, should not take any such panic action. Setting <flops> affects all project work on the host. Considering how many got into problems trying to use <flops> before, only those who understand how BOINC uses them ought to consider making hasty changes.

However, it is true that the change assumes that those running Anonymous platform will have reasonably accurate <flops> in their app_info.xml. If that were true, nobody would be seeing any problem because of the change.

Setting <flops>, and adjusting them when trying different operating conditions (such as changes in how many tasks GPUs should be doing at the same time), doesn't require higher math skills. Simple arithmetic is adequate, but needs care.
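The "simple arithmetic" can be sketched like this, assuming BOINC's basic relation estimated_runtime ≈ rsc_fpops_est / flops. The workunit numbers below are hypothetical, not real S@h task values:

```python
# Sketch of choosing a <flops> value for app_info.xml so estimates match
# reality. Assumes estimated_runtime = rsc_fpops_est / flops; the numbers
# here are invented for illustration.

def flops_for_app(rsc_fpops_est, typical_runtime_s, tasks_at_once=1):
    """Running N tasks at once on a GPU stretches each task's wall-clock
    time, so the effective per-task rate is divided by N."""
    return rsc_fpops_est / (typical_runtime_s * tasks_at_once)

# Hypothetical shorty: ~28 Tflop of estimated work done in 180 s on CUDA.
flops = flops_for_app(rsc_fpops_est=2.8e13, typical_runtime_s=180)
print(f"<flops>{flops:.4e}</flops>")  # value to paste into the app's entry
```

Changing operating conditions (say, going from one to two tasks per GPU) just means recomputing with the new typical runtime, which is the "needs care" part.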
                                                                  Joe
arkayn
Volunteer tester
Joined: 14 May 99
Posts: 4438
Credit: 55,006,323
RAC: 0
United States
Message 1151895 - Posted: 13 Sep 2011, 23:00:46 UTC - in response to Message 1151853.  

I just got four 6.10 cudas. Three at 6 hours estimated and the fourth at almost 17 hours. I'm running a couple of APs on my GPU right now so I won't worry about them for awhile.


They are showing 23 hours for the GTX560.


And took 3 minutes.

Dave Stegner
Volunteer tester
Joined: 20 Oct 04
Posts: 540
Credit: 65,583,328
RAC: 27
United States
Message 1151897 - Posted: 13 Sep 2011, 23:06:19 UTC

How does one determine what the correct <flops> entry is?
Dave

Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 1151902 - Posted: 13 Sep 2011, 23:23:53 UTC

I removed the <flops> entry from my app_info file when I found out that the SETI servers calculate a flops value for each science app and use that value when sending work. I removed the <flops> value 3 or 4 months ago.

In my view a lot of problems are caused by the various rescheduler programs moving work from CPU to GPU or from GPU to CPU. The Berkeley servers expect the work to be done by the app they sent it to, not a different app. So I stopped rescheduling work as well.

Right now my system is getting work with proper time estimates. I have not, however, received any work since today's outage.

My opinion is still to leave the <flops> out of the app_info file for now and wait and see what happens.
Boinc....Boinc....Boinc....Boinc....
HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6533
Credit: 196,805,888
RAC: 57
United States
Message 1151927 - Posted: 14 Sep 2011, 1:02:00 UTC - in response to Message 1151797.  

I just got a batch of shorties, i.e. work units with a two week time limit. Such units normally finish in three minutes on cuda.

Only the ones I got just after today's outage are estimated to take 05:59:09.

Naturally Boinc has abruptly stopped requesting new work. Perhaps this is a trick to temporarily lower the load on the download pipe? I fear the units will run in three minutes as before, and that only the estimate is wrong.

That probably indicates that the server software has been rebuilt with changeset [trac]changeset:24128[/trac]. If so, each 3 minute run will reduce the estimates by 1% as Duration Correction Factor (DCF) gradually reduces until the estimates get below 30 minutes, then it will adapt somewhat faster. CPU work will also be overestimated until DCF gets into a lower range, after that CPU work will tend to fight against the adaptation for GPU work.

The purpose of the change is admirable; its implementation is unrealistic.
                                                                  Joe

I guess a side effect of this will help stabilize the huge payouts in the new credit system from machines that completed a lot of shorties and then did some large complex jobs right after.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group today!
Gatekeeper
Joined: 14 Jul 04
Posts: 887
Credit: 176,479,616
RAC: 0
United States
Message 1151935 - Posted: 14 Sep 2011, 1:29:38 UTC
Last modified: 14 Sep 2011, 1:30:47 UTC

My VHARs are at 6:30 and change. My 603s are from 14 to 22 hours. I haven't got any APs since the update, but can only imagine what they'll look like. Sheesh.
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1151946 - Posted: 14 Sep 2011, 2:14:16 UTC - in response to Message 1151897.  

Dave Stegner wrote:
How does one determine what the correct <flops> entry is?

For your domain controllers there isn't any, unfortunately. BOINC 5.10.45 doesn't report app_version flops so the servers will just use the host Whetstone benchmark, and there's no practical way to make that reflect how fast a CPU is doing S@h work. Because those hosts don't have usable GPUs either, local DCF will prove adequate to eventually get estimates reasonably close.

Once most of the work on such a host is the overestimated kind, you could edit the <duration_correction_factor> field for this project in its client_state.xml to whatever small fraction gives reasonable estimates for that new work. That's to speed the adaptation, but if you do it too soon the older work which will then have tiny estimates could cause BOINC to fetch more work than you really want; worst case more work than could be done by deadline.
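Since estimates scale linearly with DCF, the "small fraction" can be computed rather than guessed. A hedged sketch, with example numbers only:

```python
# Arithmetic for hand-picking a <duration_correction_factor> value in
# client_state.xml: estimates scale linearly with DCF, so scale the
# current DCF by realistic_runtime / displayed_estimate. Example numbers.

def corrected_dcf(current_dcf, displayed_estimate_s, realistic_runtime_s):
    """Return the DCF that would make the displayed estimate realistic."""
    return current_dcf * realistic_runtime_s / displayed_estimate_s

# Shorties displayed at ~6 h that really take 3 minutes, DCF currently 1.0:
new_dcf = corrected_dcf(1.0, displayed_estimate_s=6 * 3600,
                        realistic_runtime_s=180)
print(f"<duration_correction_factor>{new_dcf:.6f}</duration_correction_factor>")
```

As Joe warns, applying such a small DCF while older, correctly-estimated work is still in the cache shrinks those estimates too, which is exactly the overfetch risk.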

Anything you do to try to correct the situation could turn out unwise if Dr. Anderson recognizes the problem and does some corrective modification. Sitting tight and hoping for the best might indeed be the best policy, as Geek@Play advises.
                                                                  Joe
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1151950 - Posted: 14 Sep 2011, 2:18:29 UTC - in response to Message 1151927.  

I guess a side effect of this will help stabilize the huge payouts in the new credit system from machines that completed a lot of shorties and then did some large complex jobs right after.

No, the only change is that estimated runtimes are too big as seen on the client. It doesn't affect the server-side averages at all nor actual runtimes so has no effect on the credit calculations.
                                                                   Joe
W-K 666
Volunteer tester
Joined: 18 May 99
Posts: 17617
Credit: 40,757,560
RAC: 67
United Kingdom
Message 1151989 - Posted: 14 Sep 2011, 3:57:45 UTC - in response to Message 1151797.  

I do wish the BOINC devs would do a proper fix rather than baling wire and chewing gum. Think I might go and look at the southern sky.
W-K 666
Volunteer tester
Joined: 18 May 99
Posts: 17617
Credit: 40,757,560
RAC: 67
United Kingdom
Message 1152006 - Posted: 14 Sep 2011, 4:59:52 UTC
Last modified: 14 Sep 2011, 5:03:12 UTC

If you have selected to do AP tasks as well as MB, I would suggest you de-select AP 505 until the computer has sorted out the estimates and DCF. If not, you will face months of DCF variations, because each time an AP task completes it will punch the DCF up into the stratosphere and the computer will stop requesting work.

[edit] For devs, remember KISS: use server-side estimates OR DCF, not server-side estimates AND DCF.
Dave Stegner
Volunteer tester
Joined: 20 Oct 04
Posts: 540
Credit: 65,583,328
RAC: 27
United States
Message 1152033 - Posted: 14 Sep 2011, 7:16:59 UTC
Last modified: 14 Sep 2011, 7:39:33 UTC

I have 19 machines crunching Seti. Some AP only, some MB only, some mixed, and for good measure I threw 3 low level GPU cards in the mix a month ago. I have been running opti apps for over a year and have never had an issue with incorrect time estimates. I did have a small issue when I added the GPU cards but, it fixed itself in about 2 weeks. After 2 weeks, they were correctly estimating times even though the machines were doing MB GPU and MB and AP CPU.

ALL TIME ESTIMATES FOR ALL TYPE OF WORK (DEDICATED OR MIXED MODE MACHINES) WERE ACCURATE.

If I understand Joe's analysis of what to expect in the future (ping-pong, herky-jerky time estimates, over- and under-subscribing unless you run AP only or MB only, and CPU only or GPU only), I do NOT understand why this change was made.

For my machines, it creates a problem where none existed before.

Does anyone know how big the problem described in the changeset actually was?
Dave

W-K 666
Volunteer tester
Joined: 18 May 99
Posts: 17617
Credit: 40,757,560
RAC: 67
United Kingdom
Message 1152057 - Posted: 14 Sep 2011, 9:46:49 UTC - in response to Message 1152033.  

I have 19 machines crunching Seti. Some AP only, some MB only, some mixed, and for good measure I threw 3 low level GPU cards in the mix a month ago. I have been running opti apps for over a year and have never had an issue with incorrect time estimates. I did have a small issue when I added the GPU cards but, it fixed itself in about 2 weeks. After 2 weeks, they were correctly estimating times even though the machines were doing MB GPU and MB and AP CPU.

ALL TIME ESTIMATES FOR ALL TYPE OF WORK (DEDICATED OR MIXED MODE MACHINES) WERE ACCURATE.

If I understand Joe's analysis of what to expect in the future, ping pong / herky jerky time estimates, over and under subscribing, unless you run AP only or MB only and CPU only or GPU only; I do NOT understand why this change was made.

For my machines, it creates a problem where none existed before.

Does anyone know how big the problem, as described in the changeset, was ??

I still have a problem with AP task estimates on the quad I rebuilt (mobo failure), which I updated to Win 7 along with some new components; it was first connected on 1 Aug.
In the first ten AP tasks there were at least three that had "too much blanking" and completed after 30 seconds. This made the initial APR about twice what it should be. It is still about 50% out, as the present estimates for the tasks in progress are 8h:45m (duration correction factor 1.055596) when the true estimate should be at least 12h:30m.

This present patch is another band-aid-and-chewing-gum attempt; the real fix should detect early completions (-9 overflows, too much blanking, etc.) and exclude those times from the APR calculation.
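The suggested fix could look something like a trimmed average. The 600-second threshold and all task numbers below are invented assumptions (the real APR bookkeeping lives server-side), but they show how a few abnormal exits skew a plain mean:

```python
# Sketch of excluding abnormal early exits (-9 overflows, "too much
# blanking") before computing an average processing rate. The threshold
# and all task numbers are illustrative assumptions.

def average_processing_rate(fpops_est, runtimes_s, min_runtime_s=600):
    """Mean rate (fpops/s) over tasks, skipping suspiciously short runs."""
    valid = [t for t in runtimes_s if t >= min_runtime_s]
    return sum(fpops_est / t for t in valid) / len(valid) if valid else None

# Ten AP tasks: three ended after ~30 s due to blanking, seven ran ~12.5 h.
fpops_est = 1e15
runtimes = [30, 30, 30] + [45000] * 7
naive = sum(fpops_est / t for t in runtimes) / len(runtimes)
trimmed = average_processing_rate(fpops_est, runtimes)
print(f"naive APR:   {naive:.3e} fpops/s")    # inflated by the 30 s exits
print(f"trimmed APR: {trimmed:.3e} fpops/s")  # reflects the real tasks
```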
skildude
Joined: 4 Oct 00
Posts: 9541
Credit: 50,759,529
RAC: 60
Yemen
Message 1152088 - Posted: 14 Sep 2011, 12:47:19 UTC - in response to Message 1152057.  

I got a few WUs from 25jn11ab that were estimated at 133 hours on my CPU. Final times were around 3 hours. That estimate is unusually long even for a VLAR on my PC, and these WUs weren't marked as VLARs either.

I have to wonder if the estimates are specific to the AR, so that if BOINC sees an AR it hasn't processed yet, it gives wildly exaggerated estimates.


In a rich man's house there is no place to spit but his face.
Diogenes Of Sinope
©2022 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.