Server side DCF for anon platforms

Gatekeeper
Joined: 14 Jul 04
Posts: 887
Credit: 176,479,616
RAC: 0
United States
Message 1019862 - Posted: 26 Jul 2010, 23:15:25 UTC

My MB WUs seem to be pretty close to expected, but the AP WUs are off by a factor of 2, i.e., twice what the run time "should" be.
ID: 1019862
Dirk Sadowski
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1019864 - Posted: 26 Jul 2010, 23:22:39 UTC - in response to Message 1019863.  

(...)

People better watch out for a -700 error shi* storm.


You mean the -177 errors?

;-)

ID: 1019864
Tim Norton
Volunteer tester
Joined: 2 Jun 99
Posts: 835
Credit: 33,540,164
RAC: 0
United Kingdom
Message 1019875 - Posted: 26 Jul 2010, 23:48:08 UTC

Strange

The last set of downloads has come in more in line with what I would expect.

Most of today they have been fluctuating wildly, from 40 seconds to 7 hours, when they should be more like 2:30 hours.

And now I am seeing a difference between GPU and CPU which is also realistic,

so mileage must vary.

Will see what they are like in the morning before the cutoff :)
Tim

ID: 1019875
perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 1019883 - Posted: 27 Jul 2010, 0:00:23 UTC

My flops count in my app_info seems to be keeping mine steady. I haven't noticed anything wild going on yet.


PROUD MEMBER OF Team Starfire World BOINC
ID: 1019883
JohnDK
Volunteer tester
Joined: 28 May 00
Posts: 1222
Credit: 451,243,443
RAC: 1,127
Denmark
Message 1019907 - Posted: 27 Jul 2010, 1:00:45 UTC

I have DCF and flops adjusted. The latest WUs downloaded seem to have around 50% longer estimated times.
ID: 1019907
Cruncher-American
Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 1019916 - Posted: 27 Jul 2010, 1:41:44 UTC
Last modified: 27 Jul 2010, 1:50:48 UTC

My latest CPU MB WUs are way underestimated - like 17 minutes and 6 minutes; this will cause catastrophic failures (-177, here we come) going forward.

I have to agree that DA seems to have fumbled the ball again. And it's a pity, as things seemed to be working quite well the last week or two.

I have about 100 of these now; I have suspended new tasks until I find a way to handle these. Must I abort them, because they will all error -177 out? Or is there something simple I can do with them?

Thanks for your help!
ID: 1019916
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1019928 - Posted: 27 Jul 2010, 2:15:47 UTC - in response to Message 1019862.  

My MB WUs seem to be pretty close to expected, but the AP WUs are off by a factor of 2, i.e., twice what the run time "should" be.

The server-side scaling for an application version does not start until "Number of tasks completed" reaches 10, and Dr. Anderson has noted there's a bug, which he hasn't yet located, that causes the counting to fail for AP tasks. So all hosts which do both MB and AP work are going to have poor AP estimates until that bug is fixed.

Count yourself lucky that it's only off by a factor of 2.
                                                              Joe
ID: 1019928
perryjay
Volunteer tester
Joined: 20 Aug 02
Posts: 3377
Credit: 20,676,751
RAC: 0
United States
Message 1019932 - Posted: 27 Jul 2010, 2:31:01 UTC

Ok, now I'm starting to see times way underestimated. Should I remove my flops counts and let it sort itself out before I get to the bad ones? Will they sort themselves out or will it just cause even more to lose their right timing? I've got a few days before I get to the underestimated ones, will that give me enough time to get the estimates right?


PROUD MEMBER OF Team Starfire World BOINC
ID: 1019932
rebest
Volunteer tester
Joined: 16 Apr 00
Posts: 1296
Credit: 45,357,093
RAC: 0
United States
Message 1019938 - Posted: 27 Jul 2010, 2:44:32 UTC
Last modified: 27 Jul 2010, 2:45:10 UTC

Damn! I must have had an error. Estimate is now 522 hours for Astropulse and 45 hours for MB CUDA. No more new work for me!

Just lovely.

Join the PACK!
ID: 1019938
hiamps
Volunteer tester
Joined: 23 May 99
Posts: 4292
Credit: 72,971,319
RAC: 0
United States
Message 1019940 - Posted: 27 Jul 2010, 2:47:49 UTC - in response to Message 1019938.  

Damn! I must have had an error. Estimate is now 522 hours for Astropulse and 45 hours for MB CUDA. No more new work for me!

Just lovely.

Can always adjust your DCF...been doing it all night.
Official Abuser of Boinc Buttons...
And no good credit hound!
ID: 1019940
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1019942 - Posted: 27 Jul 2010, 2:53:07 UTC - in response to Message 1019916.  

My latest CPU MB WUs are way underestimated - like 17 minutes and 6 minutes; this will cause catastrophic failures (-177, here we come) going forward.

I have to agree that DA seems to have fumbled the ball again. And it's a pity, as things seemed to be working quite well the last week or two.

I have about 100 of these now; I have suspended new tasks until I find a way to handle these. Must I abort them, because they will all error -177 out? Or is there something simple I can do with them?

Thanks for your help!

There are at least two relatively simple fixes. The more sophisticated one is Fred M's new rescheduler which you can get from http://www.efmer.eu/forum_tt/index.php?topic=428.0. It can boost the rsc_fpops_bound values for all S@H MB tasks to 5e17 which amounts to more than a year on even the fastest hosts. That removes the protection against a hung task which the bound is meant to provide, but there's no other downside AFAIK.

The even simpler alternative is to shut BOINC down completely and do a global replace in client_state.xml of all <rsc_fpops_bound> with <rsc_fpops_bound>3. That boosts the bound by a factor of 4 at least, but affects all tasks for all projects. If you can wait until the beginning of the outage, doing that just twice gives a boost of at least 34. That should be sufficient protection against -177 errors.
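That global replace can be sketched in a few lines of Python; this is a hypothetical helper, not an official tool, assuming BOINC is fully shut down first and with the path to client_state.xml left as a placeholder for your own BOINC data directory:

```python
# Sketch of the global replace described above. Prefixing each
# <rsc_fpops_bound> value with a "3" multiplies it by at least 4
# (worst case: 9.9e14 -> 39.9e14); running it twice gives a factor
# of at least 34. Only run this while BOINC is completely shut down.
def boost_fpops_bound(path="client_state.xml"):
    with open(path) as f:
        state = f.read()
    # The closing tag </rsc_fpops_bound> is untouched: the "/" breaks the match.
    state = state.replace("<rsc_fpops_bound>", "<rsc_fpops_bound>3")
    with open(path, "w") as f:
        f.write(state)
```

Because only the opening tag matches, the XML stays well-formed and every bound in the file, for every project, gets the boost.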
                                                                Joe
ID: 1019942
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1019970 - Posted: 27 Jul 2010, 5:26:37 UTC - in response to Message 1019932.  

Ok, now I'm starting to see times way underestimated. Should I remove my flops counts and let it sort itself out before I get to the bad ones? Will they sort themselves out or will it just cause even more to lose their right timing? I've got a few days before I get to the underestimated ones, will that give me enough time to get the estimates right?

The server-side scaling assumes a DCF of 1.0 on the host. If yours is much lower it makes those adjusted estimates look too short, but when your host does the first of the adjusted tasks the DCF ought to jump up to near the 1.0 level. That's Dr. Anderson's theory, anyhow.

You can check if it may work out OK by dividing an apparently low estimate by your current DCF. IOW, if the DCF is 0.1 and it's showing a 2 minute estimate on a task which will actually run about 20 minutes then when the host finishes the first underestimated task DCF should jump up to about 1 and the rest of the tasks should then show nearly correct estimates.
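That check is a single division; a toy illustration using the numbers from the example above:

```python
# Joe's sanity check: dividing the displayed estimate by the current DCF
# approximates what the estimate becomes once DCF jumps back up to ~1.0.
def corrected_estimate(shown_minutes, dcf):
    return shown_minutes / dcf

# A 2-minute estimate at DCF 0.1 corresponds to roughly
# 20 minutes of actual runtime.
print(corrected_estimate(2.0, 0.1))
```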

OTOH, the server may have a bad average of seconds/flop for either MB application perhaps because you had to reschedule a lot of VLARs. In that case, probably nothing will help the estimates much, the best thing to do may be just protect against -177 errors and hope the averages will adapt when work done during this outage is reported.

I think <flops> in an app_info.xml ought to be kept. Without those to indicate the relative speed of CPU and GPU, the estimate scaling cannot work well. But if they were chosen based on the threads which aimed at stabilizing DCF near 0.2, that's fighting against the server-side assumption of 1.0 DCF.

The problem is that it's a delayed feedback system: the current <flops> settings were used in scaling the estimates for work sent today, but the seconds/flop average is only adjusted as results are checked by the Validator, and the average changes only by 1% of the difference between it and the seconds/flop of each new included result.

My best guess is that <flops> entries for 0.2 DCF should be immediately multiplied by 5 to make <flops> entries for 1.0 DCF, and in client_state.xml the DCF should be set to 1.0 and rsc_fpops_bound boosted to avoid any possible -177 errors. That ought to make future estimates fairly good. The tasks with low estimates will boost DCF well above 1.0 though, but if those can be finished during the outage then resetting DCF to 1.0 prior to getting new work Friday would give the best chance of sensible work fetch.
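As a sketch of that conversion (my illustration, with a hypothetical old <flops> value; substitute your own):

```python
# Joe's suggested rescale for <flops> entries that were tuned to hold
# DCF near 0.2: dividing by the old DCF (i.e. multiplying by 5) yields
# entries consistent with the server's assumed DCF of 1.0.
OLD_DCF = 0.2  # assumption: flops were tuned to stabilize DCF near 0.2

def rescale_flops(old_flops, old_dcf=OLD_DCF):
    return old_flops / old_dcf

# Hypothetical example value; afterwards set the DCF to 1.0 in
# client_state.xml and boost rsc_fpops_bound as discussed above.
old_cpu_flops = 5.6e9
new_cpu_flops = rescale_flops(old_cpu_flops)  # -> 2.8e10
```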
                                                                  Joe
ID: 1019970
Dirk Sadowski
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1019984 - Posted: 27 Jul 2010, 5:56:58 UTC - in response to Message 1019970.  
Last modified: 27 Jul 2010, 5:58:55 UTC

Is there an easy how-to somewhere for how I must now calculate my flops entries in my app_info.xml file?

Thanks!


EDIT: Which will work well with Fred's BOINC Rescheduler tool.
ID: 1019984
Cruncher-American
Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 1019995 - Posted: 27 Jul 2010, 6:33:22 UTC - in response to Message 1019942.  

There are at least two relatively simple fixes. The more sophisticated one is Fred M's new rescheduler which you can get from http://www.efmer.eu/forum_tt/index.php?topic=428.0. It can boost the rsc_fpops_bound values for all S@H MB tasks to 5e17 which amounts to more than a year on even the fastest hosts. That removes the protection against a hung task which the bound is meant to provide, but there's no other downside AFAIK.

The even simpler alternative is to shut BOINC down completely and do a global replace in client_state.xml of all <rsc_fpops_bound> with <rsc_fpops_bound>3. That boosts the bound by a factor of 4 at least, but affects all tasks for all projects. If you can wait until the beginning of the outage, doing that just twice gives a boost of at least 34. That should be sufficient protection against -177 errors.
                                                                Joe


Thanks, Joe.
I think I'll try Fred's new rescheduler - I can't remember a hung WU on either of my systems, so maybe I'm immune to that problem, and all will be well.
ID: 1019995
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1020023 - Posted: 27 Jul 2010, 8:35:28 UTC - in response to Message 1019984.  

Is there an easy how-to somewhere for how I must now calculate my flops entries in my app_info.xml file?

Thanks!


EDIT: Which will work well with Fred's BOINC Rescheduler tool.

For those who had achieved a stable DCF with <flops> for the old standard, there's a very easy conversion: divide the old <flops> by the old DCF to produce the new <flops>. Mathematically the new DCF is found the same way, dividing the old DCF by the old DCF to produce 1.0.

To work more directly, I suggest first figuring out a reasonably accurate estimate of the ratio of GPU to CPU speed. Checking both VHAR and midrange is sensible. For your E7600 Core 2 duo and GTX 260 I found comparables which indicate the GPU is about 17 times faster than the CPU at VHAR, about 12 times faster at midrange. Splitting the difference, 14.5 should work as an overall estimate.

The actual times on that host for a comparable VHAR at AR=1.395 were 1858 seconds for CPU and 107.1 seconds for GPU. That's a 17.3 ratio, if we increase the GPU time and decrease the CPU time proportionally to get the ratio down to 14.5, about 1700 seconds for CPU and 117 for GPU matches well. Since 17.3 is about 19% more than 14.5, I adjusted each time by about 9.5%.

ALL VHAR tasks are given the same rsc_fpops_est value of 4.756e13 by the splitter. To calculate <flops>, we simply divide that by those adjusted times and get about 2.798e10 for CPU and 4.065e11 for GPU. Enter those in app_info.xml and set DCF to 1.0 in client_state.xml. For those doing Astropulse on CPU, its <flops> should be about 2.5 times the S@H Enhanced CPU <flops>.
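That arithmetic can be reproduced in a few lines. This sketch uses Joe's measured times and his 14.5 target ratio; splitting the correction evenly between the CPU and GPU times is my reading of his ~9.5% adjustment:

```python
VHAR_FPOPS_EST = 4.756e13  # rsc_fpops_est the splitter assigns every VHAR task

def flops_entries(cpu_seconds, gpu_seconds, target_ratio):
    """Nudge both measured times proportionally until their ratio hits
    target_ratio, then convert to <flops> values for app_info.xml."""
    actual_ratio = cpu_seconds / gpu_seconds
    adjust = (actual_ratio / target_ratio) ** 0.5  # split correction evenly
    cpu_adj = cpu_seconds / adjust   # ~1700 s in Joe's example
    gpu_adj = gpu_seconds * adjust   # ~117 s
    return VHAR_FPOPS_EST / cpu_adj, VHAR_FPOPS_EST / gpu_adj

cpu_flops, gpu_flops = flops_entries(1858.0, 107.1, 14.5)
# cpu_flops ~ 2.8e10, gpu_flops ~ 4.06e11; an Astropulse CPU <flops>
# would then be about 2.5 * cpu_flops.
```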
                                                               Joe
ID: 1020023
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13835
Credit: 208,696,464
RAC: 304
Australia
Message 1020024 - Posted: 27 Jul 2010, 8:37:06 UTC - in response to Message 1019860.  

I think DA has implemented the new server-side DCF calculations even for anon platforms again. I just got a bunch of WUs with totally off estimated times. Just like when he tried it the first time.

Yep, same here. It also explains the extended period of full network bandwidth usage: clients are trying to fill their caches based on the shortened completion times.
Grant
Darwin NT
ID: 1020024
Miep
Volunteer moderator
Joined: 23 Jul 99
Posts: 2412
Credit: 351,996
RAC: 0
Message 1020030 - Posted: 27 Jul 2010, 9:37:10 UTC

Right. SCREAM. OK, now that's out of the system:

Any estimate of how far too small the estimates have to be to trigger that -177 error? I think mine are off by a factor of 6-7.

I think it's pretty pointless to put flops in with the amount of testing I'm currently doing, but I don't want them to error out.

Right, found the bound entry: 1.8e14. What dimension is that? Seconds? No, can't be, that would be 5e6 years...
So, is that likely to be enough, or do I have to edit? (Current DCF is 0.35.)

P.S. I'm a woman, I'm allowed emotional responses this time of the month.
Carola
-------
I'm multilingual - I can misunderstand people in several languages!
ID: 1020030
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1020042 - Posted: 27 Jul 2010, 10:06:37 UTC - in response to Message 1020030.  

Right. SCREAM. OK, now that's out of the system:

Any estimate of how far too small the estimates have to be to trigger that -177 error? I think mine are off by a factor of 6-7.

I think it's pretty pointless to put flops in with the amount of testing I'm currently doing, but I don't want them to error out.

Right, found the bound entry: 1.8e14. What dimension is that? Seconds? No, can't be, that would be 5e6 years...
So, is that likely to be enough, or do I have to edit? (Current DCF is 0.35.)

P.S. I'm a woman, I'm allowed emotional responses this time of the month.

The fpops term means "floating point operations". If you divide it by a "floating point operations per second" value the result is seconds. The BOINC developers take the position that Whetstones are equivalent to "floating point operations per second" for practical purposes. IOW, in the absence of flops in the app_info.xml, that bound is 1.8e14/2.05214e9 = 87713 seconds for your computer.

Because the bound is just 10 times the estimate, the danger point is when actual crunch time would jump the DCF up to 10 or above. Having underestimates around 6 to 7 times is not dangerous when DCF is 0.35, but if it were already above 1 some close examination would be needed.
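As a sketch of that arithmetic (the Whetstone figure is the one quoted above for this particular host):

```python
# rsc_fpops_bound divided by the benchmark "flops" (Whetstone) gives the
# wall-clock limit after which the client aborts the task with -177.
def bound_seconds(rsc_fpops_bound, whetstone_flops):
    return rsc_fpops_bound / whetstone_flops

# The bound is 10x the estimate, so the danger point is when the actual
# runtime would push DCF to 10 or above.
def would_error_177(underestimate_factor, dcf):
    return underestimate_factor * dcf >= 10.0

limit = bound_seconds(1.8e14, 2.05214e9)   # ~87713 seconds
danger = would_error_177(7, 0.35)          # 7 * 0.35 = 2.45 -> safe
```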

I think all of us are entitled to some emotional response to these Monday surprises. I keep thinking of what's supposed to be an old Chinese curse which translates as "May you live in interesting times."
                                                                Joe
ID: 1020042
Miep
Volunteer moderator
Joined: 23 Jul 99
Posts: 2412
Credit: 351,996
RAC: 0
Message 1020046 - Posted: 27 Jul 2010, 10:39:19 UTC - in response to Message 1020042.  

Thanks Joe, most appreciated!
Carola
-------
I'm multilingual - I can misunderstand people in several languages!
ID: 1020046
Dirk Sadowski
Volunteer tester
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1020052 - Posted: 27 Jul 2010, 11:11:26 UTC - in response to Message 1020023.  

Is there an easy how-to somewhere for how I must now calculate my flops entries in my app_info.xml file?

Thanks!


EDIT: Which will work well with Fred's BOINC Rescheduler tool.

For those who had achieved a stable DCF with <flops> for the old standard, there's a very easy conversion: divide the old <flops> by the old DCF to produce the new <flops>. Mathematically the new DCF is found the same way, dividing the old DCF by the old DCF to produce 1.0.

To work more directly, I suggest first figuring out a reasonably accurate estimate of the ratio of GPU to CPU speed. Checking both VHAR and midrange is sensible. For your E7600 Core 2 duo and GTX 260 I found comparables which indicate the GPU is about 17 times faster than the CPU at VHAR, about 12 times faster at midrange. Splitting the difference, 14.5 should work as an overall estimate.

The actual times on that host for a comparable VHAR at AR=1.395 were 1858 seconds for CPU and 107.1 seconds for GPU. That's a 17.3 ratio, if we increase the GPU time and decrease the CPU time proportionally to get the ratio down to 14.5, about 1700 seconds for CPU and 117 for GPU matches well. Since 17.3 is about 19% more than 14.5, I adjusted each time by about 9.5%.

ALL VHAR tasks are given the same rsc_fpops_est value of 4.756e13 by the splitter. To calculate <flops>, we simply divide that by those adjusted times and get about 2.798e10 for CPU and 4.065e11 for GPU. Enter those in app_info.xml and set DCF to 1.0 in client_state.xml. For those doing Astropulse on CPU, its <flops> should be about 2.5 times the S@H Enhanced CPU <flops>.
                                                               Joe


This really is an easy how-to. Thanks a lot!

I now use Fred's BOINC Rescheduler and have a min DCF of 0.5 and a max DCF of 1.5 set in the config.xml file.

Currently I have a DCF of 1.4x.

What's strange is that I have .vlar WUs renamed from GPU to CPU which have estimated times of ~25 hours. Normally a VLAR WU needs a little more than 2 hours.
I also have a lot of CPU WUs which have estimated times of ~6 mins. After searching, they are AR 0.31x WUs. I don't know the crunching time for this AR. An AR 0.44x WU would need ~100 mins, a shorty ~30 mins, so I guess an AR 0.31x WU would need ~1 hour.

So I have a lot of CPU WUs in my BOINC with both much longer and much shorter estimated times.
I guess these are all renamed GPU WUs.

Will they result in the famous -177 error?
How could I bypass this?
Do I need to add something (in the config.xml file) for Fred's BOINC Rescheduler?


In the past it was much easier to crunch SETI@home. ;-)

ID: 1020052

©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.