Message boards :
Number crunching :
Server side DCF for anon platforms
Gatekeeper (Joined: 14 Jul 04, Posts: 887, Credit: 176,479,616, RAC: 0)

My MB WU's seem to be pretty close to expected, but the AP WU's are off by a factor of 2, i.e., twice what "should" be the run time.
Dirk Sadowski (Joined: 6 Apr 07, Posts: 7105, Credit: 147,663,825, RAC: 5)

(...) You mean the -177 errors? ;-)
Tim Norton (Joined: 2 Jun 99, Posts: 835, Credit: 33,540,164, RAC: 0)

Strange, the last set of downloads have come in more in line with what I would expect. Most of today they have been fluctuating wildly from 40 sec to 7 hours, when they should be more like 2:30 hours. Now I am also seeing a difference between GPU and CPU, which is realistic, so mileage must vary. Will see what they are like in the morning before the cutoff :) Tim
perryjay (Joined: 20 Aug 02, Posts: 3377, Credit: 20,676,751, RAC: 0)

My flops count in my app_info seems to be keeping mine steady. I haven't noticed anything wild going on yet.

PROUD MEMBER OF Team Starfire World BOINC
JohnDK (Joined: 28 May 00, Posts: 1222, Credit: 451,243,443, RAC: 1,127)

I have DCF and flops adjusted. The latest WUs downloaded seem to have around 50% longer estimated times.
Cruncher-American (Joined: 25 Mar 02, Posts: 1513, Credit: 370,893,186, RAC: 340)

My latest CPU MB WUs are way underestimated - like 17 minutes and 6 minutes; this will cause catastrophic failures (-177, here we come) going forward. I have to agree that DA seems to have fumbled the ball again. And it's a pity, as things seemed to be working quite well the last week or two. I have about 100 of these now and have suspended new tasks until I find a way to handle them. Must I abort them, because they will all error out with -177? Or is there something simple I can do with them? Thanks for your help!
Josef W. Segur (Joined: 30 Oct 99, Posts: 4504, Credit: 1,414,761, RAC: 0)

> My MB WU's seem to be pretty close to expected, but the AP WU's are off by a factor of 2, i.e., twice what "should" be the run time.

The server side scaling for an application version does not start until "Number of tasks completed" reaches 10, and Dr. Anderson has noted there's a bug, which he hasn't found, that causes the counting to fail for AP tasks. So all hosts which do both MB and AP work are going to have poor AP estimates until that bug is fixed. Count yourself lucky that it's only off by a factor of 2. Joe
perryjay (Joined: 20 Aug 02, Posts: 3377, Credit: 20,676,751, RAC: 0)

OK, now I'm starting to see times that are way underestimated. Should I remove my flops counts and let it sort itself out before I get to the bad ones? Will they sort themselves out, or will it just cause even more to lose their correct timing? I've got a few days before I get to the underestimated ones; will that give me enough time to get the estimates right?

PROUD MEMBER OF Team Starfire World BOINC
rebest (Joined: 16 Apr 00, Posts: 1296, Credit: 45,357,093, RAC: 0)

Damn! I must have had an error. Estimate is now 522 hours for Astropulse and 45 hours for MB CUDA. No more new work for me! Just lovely.

Join the PACK!
hiamps (Joined: 23 May 99, Posts: 4292, Credit: 72,971,319, RAC: 0)

> Damn! I must have had an error. Estimate is now 522 hours for Astropulse and 45 hours for MB CUDA. No more new work for me!

Can always adjust your DCF... been doing it all night.

Official Abuser of Boinc Buttons... And no good credit hound!
Josef W. Segur (Joined: 30 Oct 99, Posts: 4504, Credit: 1,414,761, RAC: 0)

> My latest CPU MB WUs are way underestimated - like 17 minutes and 6 minutes; this will cause catastrophic failures (-177, here we come) going forward.

There are at least two relatively simple fixes. The more sophisticated one is Fred M's new rescheduler, which you can get from http://www.efmer.eu/forum_tt/index.php?topic=428.0. It can boost the rsc_fpops_bound values for all S@H MB tasks to 5e17, which amounts to more than a year on even the fastest hosts. That removes the protection against a hung task which the bound is meant to provide, but there's no other downside AFAIK.

The even simpler alternative is to shut BOINC down completely and do a global replace in client_state.xml of all <rsc_fpops_bound> with <rsc_fpops_bound>3. That boosts the bound by a factor of 4 at least, but affects all tasks for all projects. If you can wait until the beginning of the outage, doing that just twice gives a boost of at least 34. That should be sufficient protection against -177 errors. Joe
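[Editor's note] Joe's global-replace trick can be sketched as a small script. This is a hedged illustration, not code from the thread: the helper name is invented, the file path varies per install, and BOINC must be completely shut down (and client_state.xml backed up) before touching the file.

```python
# Sketch of the "global replace" fix from the post above: prepending a
# digit to every <rsc_fpops_bound> value multiplies each bound by at
# least ~4x (worst case 9.9e14 -> 39.9e14) and up to ~31x.
# boost_fpops_bounds is a hypothetical helper; run it only while BOINC
# is fully shut down, and keep a backup of client_state.xml.
from pathlib import Path

def boost_fpops_bounds(state_file: str, passes: int = 1) -> None:
    path = Path(state_file)
    text = path.read_text()
    for _ in range(passes):
        # e.g. "<rsc_fpops_bound>1.8e14" becomes "<rsc_fpops_bound>31.8e14" (~17x)
        text = text.replace("<rsc_fpops_bound>", "<rsc_fpops_bound>3")
    path.write_text(text)

# Two passes, as the post suggests, boost every bound by at least ~34x.
```

Prepending a digit rather than rewriting the number is what keeps the edit a simple text replace with no XML parsing needed.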
Josef W. Segur (Joined: 30 Oct 99, Posts: 4504, Credit: 1,414,761, RAC: 0)

> Ok, now I'm starting to see times way underestimated. Should I remove my flops counts and let it sort itself out before I get to the bad ones? Will they sort themselves out or will it just cause even more to lose their right timing? I've got a few days before I get to the underestimated ones, will that give me enough time to get the estimates right?

The server-side scaling assumes a DCF of 1.0 on the host. If yours is much lower, it makes those adjusted estimates look too short, but when your host does the first of the adjusted tasks the DCF ought to jump up to near the 1.0 level. That's Dr. Anderson's theory, anyhow. You can check whether it may work out OK by dividing an apparently low estimate by your current DCF. IOW, if the DCF is 0.1 and it's showing a 2 minute estimate on a task which will actually run about 20 minutes, then when the host finishes the first underestimated task DCF should jump up to about 1 and the rest of the tasks should then show nearly correct estimates.

OTOH, the server may have a bad average of seconds/flop for either MB application, perhaps because you had to reschedule a lot of VLARs. In that case probably nothing will help the estimates much; the best thing to do may be just to protect against -177 errors and hope the averages will adapt when work done during this outage is reported.

I think <flops> in an app_info.xml ought to be kept. Without those to indicate the relative speed of CPU and GPU, the estimate scaling cannot work well. But if they were chosen based on the threads which aimed at stabilizing DCF near 0.2, that's fighting against the server-side assumption of 1.0 DCF. The problem is that it's a delayed feedback system: the current <flops> settings were used in scaling the estimates for work sent today, but the seconds/flop average is only adjusted as results are checked by the Validator, and the average changes only by 1% of the difference between it and the seconds/flop of each new included result.

My best guess is that <flops> entries for 0.2 DCF should be immediately multiplied by 5 to make <flops> entries for 1.0 DCF, and in client_state.xml the DCF should be set to 1.0 and rsc_fpops_bound boosted to avoid any possible -177 errors. That ought to make future estimates fairly good. The tasks with low estimates will boost DCF well above 1.0, though, but if those can be finished during the outage then resetting DCF to 1.0 prior to getting new work Friday would give the best chance of sensible work fetch. Joe
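[Editor's note] The arithmetic in Joe's post (on-screen estimates shrunk by the host's DCF, and <flops> tuned for a 0.2 DCF needing a 5x boost) can be checked with two one-line helpers. A sketch only; the function names and sample numbers are illustrative, not from any particular host.

```python
# Sanity check of the DCF arithmetic described above (illustrative numbers).

def flops_for_unit_dcf(old_flops: float, old_dcf: float) -> float:
    """Rescale an app_info.xml <flops> entry for the server's assumed DCF of 1.0."""
    return old_flops / old_dcf

def true_runtime_s(shown_estimate_s: float, dcf: float) -> float:
    """Approximate real run time from a DCF-deflated on-screen estimate."""
    return shown_estimate_s / dcf

# A <flops> value tuned for the old 0.2 DCF target must grow by 1.0/0.2 = 5x:
print(flops_for_unit_dcf(2.0e10, 0.2))   # 5x the old 2.0e10
# A 2-minute estimate shown while DCF is 0.1 really means about 20 minutes:
print(true_runtime_s(120, 0.1) / 60)     # ~20 minutes
```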
Dirk Sadowski (Joined: 6 Apr 07, Posts: 7105, Credit: 147,663,825, RAC: 5)

Is there an easy how-to somewhere for how I should now calculate the flops entries in my app_info.xml file? Thanks! EDIT: One which will work well with Fred's BOINC Rescheduler tool.
Cruncher-American (Joined: 25 Mar 02, Posts: 1513, Credit: 370,893,186, RAC: 340)

> There are at least two relatively simple fixes. The more sophisticated one is Fred M's new rescheduler which you can get from http://www.efmer.eu/forum_tt/index.php?topic=428.0. It can boost the rsc_fpops_bound values for all S@H MB tasks to 5e17 which amounts to more than a year on even the fastest hosts. That removes the protection against a hung task which the bound is meant to provide, but there's no other downside AFAIK.

Thanks, Joe. I think I'll try Fred's new rescheduler - I can't remember a hung WU on either of my systems, so maybe I'm immune to that problem, and all will be well.
Josef W. Segur (Joined: 30 Oct 99, Posts: 4504, Credit: 1,414,761, RAC: 0)

> Is there an easy how-to somewhere for how I should now calculate the flops entries in my app_info.xml file?

For those who had achieved a stable DCF with <flops> for the old standard, there's a very easy conversion: divide the old <flops> by the old DCF to produce the new <flops>. Mathematically the new DCF is found the same way, dividing the old DCF by the old DCF to produce 1.0.

To work more directly, I suggest first figuring out a reasonably accurate estimate of the ratio of GPU to CPU speed. Checking both VHAR and midrange is sensible. For your E7600 Core 2 Duo and GTX 260 I found comparables which indicate the GPU is about 17 times faster than the CPU at VHAR, and about 12 times faster at midrange. Splitting the difference, 14.5 should work as an overall estimate.

The actual times on that host for a comparable VHAR at AR=1.395 were 1858 seconds for CPU and 107.1 seconds for GPU. That's a 17.3 ratio; if we increase the GPU time and decrease the CPU time proportionally to get the ratio down to 14.5, about 1700 seconds for CPU and 117 for GPU matches well. Since 17.3 is about 19% more than 14.5, I adjusted each time by about 9.5%.

ALL VHAR tasks are given the same rsc_fpops_est value of 4.756e13 by the splitter. To calculate <flops>, we simply divide that by those adjusted times and get about 2.798e10 for CPU and 4.065e11 for GPU. Enter those in app_info.xml and set DCF to 1.0 in client_state.xml.

For those doing Astropulse on CPU, its <flops> should be about 2.5 times the S@H Enhanced CPU <flops>. Joe
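[Editor's note] Joe's worked numbers can be reproduced step by step. This is an editorial sketch of his calculation, not code from the thread; only the measured times, the 17x/12x speed ratios, and the 4.756e13 rsc_fpops_est come from his post.

```python
import math

# Reproducing the <flops> calculation in the post above for the
# E7600 + GTX 260 example.
RSC_FPOPS_EST_VHAR = 4.756e13   # same for every VHAR task (set by the splitter)

cpu_time_s = 1858.0             # measured CPU time for a VHAR at AR=1.395
gpu_time_s = 107.1              # measured GPU time for the same AR

measured_ratio = cpu_time_s / gpu_time_s   # ~17.3
target_ratio = (17.0 + 12.0) / 2.0         # 14.5: average of VHAR and midrange

# Scale both times by sqrt(measured/target) so their ratio becomes 14.5;
# this is the "adjust each time by about 9.5%" step.
k = math.sqrt(measured_ratio / target_ratio)
adj_cpu_s = cpu_time_s / k                 # ~1700 s
adj_gpu_s = gpu_time_s * k                 # ~117 s

cpu_flops = RSC_FPOPS_EST_VHAR / adj_cpu_s  # ~2.80e10, matching Joe's 2.798e10
gpu_flops = RSC_FPOPS_EST_VHAR / adj_gpu_s  # ~4.06e11, matching Joe's 4.065e11
print(f"CPU <flops> ~ {cpu_flops:.3e}, GPU <flops> ~ {gpu_flops:.3e}")
```

Splitting the adjustment symmetrically between CPU and GPU (rather than scaling only one of them) is what keeps both estimates close to their measured times while forcing the 14.5 ratio.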
Grant (SSSF) (Joined: 19 Aug 99, Posts: 13835, Credit: 208,696,464, RAC: 304)

> I think DA has implemented the new server side DCF calculations even for anon platforms again. I just got a bunch of WU's with really totally off calculated est times. Just like when he tried it the first time.

Yep, same here - it also explains the extended period of full network bandwidth usage: clients are trying to fill their caches based on the shortened completion times.

Grant
Darwin NT
Miep (Joined: 23 Jul 99, Posts: 2412, Credit: 351,996, RAC: 0)

Right. SCREAM. OK, now that's out of the system: any estimate on how much the estimates have to be too small to trigger that -177 error? I think mine are off by a factor of 6-7. I think it's pretty pointless to put flops in with the amount of testing I'm currently doing, but I don't want them to error out. Right, found the bound entry: 1.8e14 - what dimension is that? Seconds? No, can't be, that would be 5e6 years... So, is that likely to be enough or do I have to edit? (current DCF 0.35)

P.S. I'm a woman, I'm allowed emotional responses this time of the month. Carola

-------
I'm multilingual - I can misunderstand people in several languages!
Josef W. Segur (Joined: 30 Oct 99, Posts: 4504, Credit: 1,414,761, RAC: 0)

> Right. SCREAM. OK, now that's out of the system:

The fpops term means "floating point operations". If you divide it by a "floating point operations per second" value, the result is seconds. The BOINC developers take the position that Whetstones are equivalent to "floating point operations per second" for practical purposes. IOW, in the absence of flops in the app_info.xml, that bound is 1.8e14 / 2.05214e9 = 87713 seconds for your computer.

Because the bound is just 10 times the estimate, the danger point is when actual crunch time would jump the DCF up to 10 or above. Having underestimates around 6 to 7 times is not dangerous when DCF is 0.35, but if it were already above 1 some close examination would be needed.

I think all of us are entitled to some emotional response to these Monday surprises. I keep thinking of what's supposed to be an old Chinese curse which translates as "May you live in interesting times." Joe
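[Editor's note] The unit conversion Joe describes is a single division. As a hedged sketch (the helper name is invented; the bound and Whetstone figures are the ones from his post):

```python
# Converting a raw rsc_fpops_bound into wall-clock seconds using the
# host's Whetstone benchmark as its "floating point ops per second",
# as described in the post above.
def bound_seconds(rsc_fpops_bound: float, whetstone_flops: float) -> float:
    """Seconds a task may crunch before being killed with error -177."""
    return rsc_fpops_bound / whetstone_flops

# Carola's numbers: a 1.8e14 bound on a 2.05214e9 Whetstone host.
print(round(bound_seconds(1.8e14, 2.05214e9)))   # 87713 seconds, about a day
```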
Miep (Joined: 23 Jul 99, Posts: 2412, Credit: 351,996, RAC: 0)

Thanks Joe, most appreciated! Carola

-------
I'm multilingual - I can misunderstand people in several languages!
Dirk Sadowski (Joined: 6 Apr 07, Posts: 7105, Credit: 147,663,825, RAC: 5)

> Is there an easy how-to somewhere for how I should now calculate the flops entries in my app_info.xml file?

This really is an easy how-to... thanks a lot! I now use Fred's BOINC Rescheduler and have a min DCF of 0.5 and max DCF of 1.5 set in the config.xml file. Currently I have a DCF of 1.4x.

What's strange is that I have .vlar WUs renamed from GPU to CPU which show estimated times of ~25 hours; normally a VLAR WU needs a little more than 2 hours. I also have a lot of CPU WUs with estimated times of ~6 mins. After searching, they are AR 0.31x WUs. I don't know the crunching time for this AR. An AR 0.44x WU would need ~100 mins and a shorty ~30 mins, so I guess an AR 0.31x WU would need ~1 hour.

So I have a lot of CPU WUs in my BOINC with estimated times that are both much too high and much too low. I guess these are all renamed GPU WUs. Will they result in the famous -177 error? How can I bypass this? Do I need to add something in the config.xml file for Fred's BOINC Rescheduler?

In the past it was much easier to crunch SETI@home. ;-)
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.