Server side DCF for anon platforms

Author	Message
Gatekeeper Joined: 14 Jul 04 Posts: 887 Credit: 176,479,616 RAC: 0	Message 1019862 - Posted: 26 Jul 2010, 23:15:25 UTC
My MB WU's seem to be pretty close to expected, but the AP WU's are off by a factor of 2, i.e., twice what "should" be the run time.

Sutaru Tsureku Volunteer tester Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5	Message 1019864 - Posted: 26 Jul 2010, 23:22:39 UTC - in response to Message 1019863.
> (...) People better watch out for a -700 error shi* storm.
You mean the -177 errors? ;-)

Tim Norton Volunteer tester Joined: 2 Jun 99 Posts: 835 Credit: 33,540,164 RAC: 0	Message 1019875 - Posted: 26 Jul 2010, 23:48:08 UTC
Strange: the last set of downloads has come in more in line with what I would expect. Most of today the estimates have been fluctuating wildly, from 40 sec to 7 hours, when they should be more like 2:30 hours. Now I am also seeing a difference between GPU and CPU, which is realistic. So mileage must vary. I will see what they are like in the morning before the cut-off :)
Tim

perryjay Volunteer tester Joined: 20 Aug 02 Posts: 3377 Credit: 20,676,751 RAC: 0	Message 1019883 - Posted: 27 Jul 2010, 0:00:23 UTC
My flops count in my app_info seems to be keeping mine steady. I haven't noticed anything wild going on yet.
PROUD MEMBER OF Team Starfire World BOINC

JohnDK Volunteer tester Joined: 28 May 00 Posts: 1222 Credit: 451,243,443 RAC: 1,127	Message 1019907 - Posted: 27 Jul 2010, 1:00:45 UTC
I have DCF and flops adjusted. The latest WUs downloaded seem to have estimated times around 50% longer.

Cruncher-American Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 340	Message 1019916 - Posted: 27 Jul 2010, 1:41:44 UTC Last modified: 27 Jul 2010, 1:50:48 UTC
My latest CPU MB WUs are way underestimated - like 17 minutes and 6 minutes; this will cause catastrophic failures (-177, here we come) going forward. I have to agree that DA seems to have fumbled the ball again. And it's a pity, as things seemed to be working quite well the last week or two.
I have about 100 of these now; I have suspended new tasks until I find a way to handle these. Must I abort them, because they will all error out with -177? Or is there something simple I can do with them? Thanks for your help!

Josef W. Segur Volunteer developer Volunteer tester Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0	Message 1019928 - Posted: 27 Jul 2010, 2:15:47 UTC - in response to Message 1019862.
> My MB WU's seem to be pretty close to expected, but the AP WU's are off by a factor of 2, i.e., twice what "should" be the run time.
The server side scaling for an application version does not start until "Number of tasks completed" reaches 10, and Dr. Anderson has noted there's a bug, not yet found, which causes the counting to fail for AP tasks. So all hosts which do both MB and AP work are going to have poor AP estimates until that bug is fixed. Count yourself lucky that it's only off by a factor of 2.
Joe

perryjay Volunteer tester Joined: 20 Aug 02 Posts: 3377 Credit: 20,676,751 RAC: 0	Message 1019932 - Posted: 27 Jul 2010, 2:31:01 UTC
Ok, now I'm starting to see times way underestimated. Should I remove my flops counts and let it sort itself out before I get to the bad ones? Will they sort themselves out or will it just cause even more to lose their right timing? I've got a few days before I get to the underestimated ones, will that give me enough time to get the estimates right?
PROUD MEMBER OF Team Starfire World BOINC

rebest Volunteer tester Joined: 16 Apr 00 Posts: 1296 Credit: 45,357,093 RAC: 0	Message 1019938 - Posted: 27 Jul 2010, 2:44:32 UTC Last modified: 27 Jul 2010, 2:45:10 UTC
Damn! I must have had an error. Estimate is now 522 hours for Astropulse and 45 hours for MB CUDA. No more new work for me! Just lovely.
Join the PACK!

hiamps Volunteer tester Joined: 23 May 99 Posts: 4292 Credit: 72,971,319 RAC: 0	Message 1019940 - Posted: 27 Jul 2010, 2:47:49 UTC - in response to Message 1019938.
> Damn! I must have had an error. Estimate is now 522 hours for Astropulse and 45 hours for MB CUDA. No more new work for me! Just lovely.
Can always adjust your DCF...been doing it all night.
Official Abuser of Boinc Buttons... And no good credit hound!

Josef W. Segur Volunteer developer Volunteer tester Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0	Message 1019942 - Posted: 27 Jul 2010, 2:53:07 UTC - in response to Message 1019916.
> My latest CPU MB WUs are way underestimated - like 17 minutes and 6 minutes; this will cause catastrophic failures (-177, here we come) going forward. I have to agree that DA seems to have fumbled the ball again. And it's a pity, as things seemed to be working quite well the last week or two. I have about 100 of these now; I have suspended new tasks until I find a way to handle these. Must I abort them, because they will all error out with -177? Or is there something simple I can do with them? Thanks for your help!
There are at least two relatively simple fixes. The more sophisticated one is Fred M's new rescheduler which you can get from http://www.efmer.eu/forum_tt/index.php?topic=428.0. It can boost the rsc_fpops_bound values for all S@H MB tasks to 5e17 which amounts to more than a year on even the fastest hosts. That removes the protection against a hung task which the bound is meant to provide, but there's no other downside AFAIK.
The even simpler alternative is to shut BOINC down completely and do a global replace in client_state.xml of all <rsc_fpops_bound> with <rsc_fpops_bound>3. Prepending a 3 to each value boosts the bound by a factor of 4 at least, but affects all tasks for all projects. If you can wait until the beginning of the outage, doing that just twice gives a boost of at least 34. That should be sufficient protection against -177 errors.
Joe
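
The "factor of 4 at least" and "at least 34" figures can be checked with a few lines of Python. This is only a sketch of the arithmetic, not a tool for editing client_state.xml: the helper name `boost_factor` and the one-line XML fragment are made up for illustration, and it assumes each bound is stored as a plain positive decimal number with nothing between the tag and the first digit.

```python
def boost_factor(bound_text: str, times: int = 1) -> float:
    """Factor by which a bound grows when '3' is prepended `times` times.

    Assumes bound_text is a plain positive decimal such as '1.8e14'.
    """
    original = float(bound_text)
    boosted = float("3" * times + bound_text)
    return boosted / original


# Hypothetical fragment; the real file layout may differ.
sample = "<rsc_fpops_bound>1.8e14</rsc_fpops_bound>"
edited = sample.replace("<rsc_fpops_bound>", "<rsc_fpops_bound>3")

if __name__ == "__main__":
    print(edited)
    # A leading digit of 1 gains the most, a leading 9 the least.
    for text in ("1.8e14", "9.99e14"):
        print(text, boost_factor(text), boost_factor(text, times=2))
```

The gain depends on the leading digits of the old value: the closer the old mantissa is to 10, the closer the boost gets to the floor of 4 (or 34 for two passes).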

Josef W. Segur Volunteer developer Volunteer tester Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0	Message 1019970 - Posted: 27 Jul 2010, 5:26:37 UTC - in response to Message 1019932.
> Ok, now I'm starting to see times way underestimated. Should I remove my flops counts and let it sort itself out before I get to the bad ones? Will they sort themselves out or will it just cause even more to lose their right timing? I've got a few days before I get to the underestimated ones, will that give me enough time to get the estimates right?
The server-side scaling assumes a DCF of 1.0 on the host. If yours is much lower it makes those adjusted estimates look too short, but when your host does the first of the adjusted tasks the DCF ought to jump up to near the 1.0 level. That's Dr. Anderson's theory, anyhow.
You can check if it may work out OK by dividing an apparently low estimate by your current DCF. IOW, if the DCF is 0.1 and it's showing a 2 minute estimate on a task which will actually run about 20 minutes then when the host finishes the first underestimated task DCF should jump up to about 1 and the rest of the tasks should then show nearly correct estimates.
OTOH, the server may have a bad average of seconds/flop for either MB application, perhaps because you had to reschedule a lot of VLARs. In that case, probably nothing will help the estimates much; the best thing to do may be just to protect against -177 errors and hope the averages will adapt when work done during this outage is reported.
I think <flops> in an app_info.xml ought to be kept. Without those to indicate the relative speed of CPU and GPU, the estimate scaling cannot work well. But if they were chosen based on the threads which aimed at stabilizing DCF near 0.2, that's fighting against the server-side assumption of 1.0 DCF.
The problem is that it's a delayed feedback system: the current <flops> settings were used in scaling the estimates for work sent today, but the seconds/flop average is only adjusted as results are checked by the Validator, and the average changes only by 1% of the difference between it and the seconds/flop of each new included result.
My best guess is that <flops> entries for 0.2 DCF should be immediately multiplied by 5 to make <flops> entries for 1.0 DCF, and in client_state.xml the DCF should be set to 1.0 and rsc_fpops_bound boosted to avoid any possible -177 errors. That ought to make future estimates fairly good. The tasks with low estimates will still boost DCF well above 1.0, but if those can be finished during the outage then resetting DCF to 1.0 prior to getting new work Friday would give the best chance of sensible work fetch.
Joe
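
Two pieces of arithmetic from the post above can be tried out in Python: the "divide the estimate by your current DCF" check, and the 1% update of the seconds/flop average. This is a rough sketch of the description in the post, not BOINC's actual code. All function names are hypothetical, and the averaging model assumes every new validated result has the same seconds/flop value.

```python
def estimate_at_dcf_one(shown_estimate_s: float, current_dcf: float) -> float:
    """Estimate the client would show if its DCF were 1.0."""
    if current_dcf <= 0:
        raise ValueError("DCF must be positive")
    return shown_estimate_s / current_dcf


def update_average(average: float, sample: float, weight: float = 0.01) -> float:
    """Move the running average 1% of the way toward the new sample."""
    return average + weight * (sample - average)


def results_until_gap_closed(fraction: float, weight: float = 0.01) -> int:
    """Count validated results needed before the average has closed
    `fraction` of the distance to a new, constant seconds/flop value."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be between 0 and 1")
    average, target = 0.0, 1.0  # normalised so the gap starts at 1.0
    count = 0
    while (target - average) > (1.0 - fraction):
        average = update_average(average, target, weight)
        count += 1
    return count


if __name__ == "__main__":
    # A 2 minute estimate shown at DCF 0.1, expressed in minutes at DCF 1.0.
    print(estimate_at_dcf_one(120, 0.1) / 60)
    print(results_until_gap_closed(0.5))
    print(results_until_gap_closed(0.9))
```

Under these assumptions it takes on the order of 70 validated results to close half the gap and a couple of hundred to close 90% of it, which is one way to see why the feedback feels so delayed.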

Sutaru Tsureku Volunteer tester Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5	Message 1019984 - Posted: 27 Jul 2010, 5:56:58 UTC - in response to Message 1019970. Last modified: 27 Jul 2010, 5:58:55 UTC
Is there an easy how-to somewhere on how I should now calculate the flops entries in my app_info.xml file? Thanks!
EDIT: One that will work well with Fred's BOINC Rescheduler tool.

Cruncher-American Joined: 25 Mar 02 Posts: 1513 Credit: 370,893,186 RAC: 340	Message 1019995 - Posted: 27 Jul 2010, 6:33:22 UTC - in response to Message 1019942.
> There are at least two relatively simple fixes. The more sophisticated one is Fred M's new rescheduler which you can get from http://www.efmer.eu/forum_tt/index.php?topic=428.0. It can boost the rsc_fpops_bound values for all S@H MB tasks to 5e17 which amounts to more than a year on even the fastest hosts. That removes the protection against a hung task which the bound is meant to provide, but there's no other downside AFAIK. The even simpler alternative is to shut BOINC down completely and do a global replace in client_state.xml of all <rsc_fpops_bound> with <rsc_fpops_bound>3. That boosts the bound by a factor of 4 at least, but affects all tasks for all projects. If you can wait until the beginning of the outage, doing that just twice gives a boost of at least 34. That should be sufficient protection against -177 errors. Joe
Thanks, Joe. I think I'll try Fred's new rescheduler - I can't remember a hung WU on either of my systems, so maybe I'm immune to that problem, and all will be well.

Josef W. Segur Volunteer developer Volunteer tester Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0	Message 1020023 - Posted: 27 Jul 2010, 8:35:28 UTC - in response to Message 1019984.
> Is there an easy how-to somewhere on how I should now calculate the flops entries in my app_info.xml file? Thanks! EDIT: One that will work well with Fred's BOINC Rescheduler tool.
For those who had achieved a stable DCF with <flops> for the old standard, there's a very easy conversion: divide the old <flops> by the old DCF to produce the new <flops>. Mathematically the new DCF is found the same way, dividing the old DCF by the old DCF to produce 1.0.
To work more directly, I suggest first figuring out a reasonably accurate estimate of the ratio of GPU to CPU speed. Checking both VHAR and midrange is sensible. For your E7600 Core 2 duo and GTX 260 I found comparables which indicate the GPU is about 17 times faster than the CPU at VHAR, about 12 times faster at midrange. Splitting the difference, 14.5 should work as an overall estimate.
The actual times on that host for a comparable VHAR at AR=1.395 were 1858 seconds for CPU and 107.1 seconds for GPU. That's a 17.3 ratio. If we increase the GPU time and decrease the CPU time proportionally to get the ratio down to 14.5, about 1700 seconds for CPU and 117 for GPU matches well. Since 17.3 is about 19% more than 14.5, I adjusted each time by about 9.5%.
ALL VHAR tasks are given the same rsc_fpops_est value of 4.756e13 by the splitter. To calculate <flops>, we simply divide that by those adjusted times and get about 2.798e10 for CPU and 4.065e11 for GPU. Enter those in app_info.xml and set DCF to 1.0 in client_state.xml.
For those doing Astropulse on CPU, its <flops> should be about 2.5 times the S@H Enhanced CPU <flops>.
Joe
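
The <flops> arithmetic in the post above is easy to reproduce for other hosts. Here is a minimal Python sketch using only the numbers Joe gives (rsc_fpops_est of 4.756e13 for VHAR tasks, adjusted run times of 1700 s and 117 s, and the 2.5 factor for Astropulse on CPU). The function names are invented for illustration and are not part of BOINC.

```python
RSC_FPOPS_EST_VHAR = 4.756e13  # value the post says the splitter gives all VHAR tasks


def flops_from_run_time(rsc_fpops_est: float, run_time_s: float) -> float:
    """<flops> value that makes the raw estimate equal run_time_s at DCF 1.0."""
    return rsc_fpops_est / run_time_s


def convert_old_flops(old_flops: float, old_dcf: float) -> float:
    """Shortcut from the post: divide the old <flops> by the old stable DCF."""
    return old_flops / old_dcf


cpu_flops = flops_from_run_time(RSC_FPOPS_EST_VHAR, 1700.0)  # adjusted CPU time
gpu_flops = flops_from_run_time(RSC_FPOPS_EST_VHAR, 117.0)   # adjusted GPU time
ap_cpu_flops = 2.5 * cpu_flops  # Astropulse on CPU, per the post

if __name__ == "__main__":
    print(f"S@H Enhanced CPU <flops>: {cpu_flops:.4g}")
    print(f"S@H Enhanced GPU <flops>: {gpu_flops:.4g}")
    print(f"Astropulse CPU <flops>:   {ap_cpu_flops:.4g}")
    print(f"GPU/CPU ratio:            {gpu_flops / cpu_flops:.2f}")
```

To use it for another host, substitute that host's own adjusted CPU and GPU run times for a comparable VHAR task; the 1700 s and 117 s figures apply only to the E7600 and GTX 260 example.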

Grant (SSSF) Volunteer tester Joined: 19 Aug 99 Posts: 13727 Credit: 208,696,464 RAC: 304	Message 1020024 - Posted: 27 Jul 2010, 8:37:06 UTC - in response to Message 1019860.
> I think DA has implemented the new server side DCF calculations even for anon platforms again. I just got a bunch of WU's with really totally off calculated est times. Just like when he tried it the first time.
Yep, same here. It also explains the extended period of full network bandwidth usage: clients are trying to fill their caches based on the shortened completion times.
Grant
Darwin NT

Miep Volunteer moderator Joined: 23 Jul 99 Posts: 2412 Credit: 351,996 RAC: 0	Message 1020030 - Posted: 27 Jul 2010, 9:37:10 UTC
Right. SCREAM
OK, now that's out of the system: any estimate on how much the estimates have to be too small to trigger that -177 error? I think mine are off by a factor of 6-7. I think it's pretty pointless to put flops in with the amount of testing I'm currently doing, but I don't want them to error out.
Right, found the bound entry: 1.8e14 - what dimension is that? Seconds? No, can't be, that would be 5e6 years... So, is that likely to be enough or do I have to edit? (current DCF 0.35)
P.S. I'm a woman, I'm allowed emotional responses this time of the month.
Carola
-------
I'm multilingual - I can misunderstand people in several languages!

Josef W. Segur Volunteer developer Volunteer tester Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0	Message 1020042 - Posted: 27 Jul 2010, 10:06:37 UTC - in response to Message 1020030.
> Right. SCREAM OK, now that's out of the system: any estimate on how much the estimates have to be too small to trigger that -177 error? I think mine are off by a factor of 6-7. I think it's pretty pointless to put flops in with the amount of testing I'm currently doing, but I don't want them to error out. Right, found the bound entry: 1.8e14 - what dimension is that? Seconds? No, can't be, that would be 5e6 years... So, is that likely to be enough or do I have to edit? (current DCF 0.35) P.S. I'm a woman, I'm allowed emotional responses this time of the month.
The fpops term means "floating point operations". If you divide it by a "floating point operations per second" value the result is seconds. The BOINC developers take the position that Whetstones are equivalent to "floating point operations per second" for practical purposes. IOW, in the absence of flops in the app_info.xml, that bound is 1.8e14/2.05214e9 = 87713 seconds for your computer.
Because the bound is just 10 times the estimate, the danger point is when actual crunch time would jump the DCF up to 10 or above. Having underestimates around 6 to 7 times is not dangerous when DCF is 0.35, but if it were already above 1 some close examination would be needed.
I think all of us are entitled to some emotional response to these Monday surprises. I keep thinking of what's supposed to be an old Chinese curse which translates as "May you live in interesting times."
Joe
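
For anyone who wants to run the same check on their own numbers, here is a small Python sketch of the two calculations in the post above. It rests on one reading of that post: the bound is 10 times the raw (DCF 1.0) estimate, so trouble starts when the underestimate factor multiplied by the current DCF reaches 10. The function names are hypothetical, and the Whetstone figure is the one quoted for Carola's host.

```python
def bound_seconds(rsc_fpops_bound: float, flops: float) -> float:
    """Run time limit in seconds: bound (operations) / speed (operations per second)."""
    return rsc_fpops_bound / flops


def dcf_after_task(underestimate_factor: float, current_dcf: float) -> float:
    """DCF the host would need to match the actual run time of one task."""
    return underestimate_factor * current_dcf


def risks_minus_177(underestimate_factor: float, current_dcf: float,
                    bound_to_estimate_ratio: float = 10.0) -> bool:
    """True when the actual run time would reach the bound (assumed 10x the estimate)."""
    return dcf_after_task(underestimate_factor, current_dcf) >= bound_to_estimate_ratio


if __name__ == "__main__":
    # Values from the thread: bound 1.8e14, benchmark 2.05214e9 flops.
    print(round(bound_seconds(1.8e14, 2.05214e9)))
    print(risks_minus_177(7, 0.35))  # estimates 7x too small at DCF 0.35
    print(risks_minus_177(7, 1.5))   # same underestimate at DCF 1.5
```

With estimates 7 times too small at DCF 0.35 the implied new DCF is about 2.45, well under 10; the same underestimate at a DCF of 1.5 would cross the line.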

Miep Volunteer moderator Joined: 23 Jul 99 Posts: 2412 Credit: 351,996 RAC: 0	Message 1020046 - Posted: 27 Jul 2010, 10:39:19 UTC - in response to Message 1020042.
Thanks Joe, most appreciated!
Carola
-------
I'm multilingual - I can misunderstand people in several languages!

Sutaru Tsureku Volunteer tester Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5	Message 1020052 - Posted: 27 Jul 2010, 11:11:26 UTC - in response to Message 1020023.
>> Is there an easy how-to somewhere on how I should now calculate the flops entries in my app_info.xml file? Thanks! EDIT: One that will work well with Fred's BOINC Rescheduler tool.
> For those who had achieved a stable DCF with <flops> for the old standard, there's a very easy conversion: divide the old <flops> by the old DCF to produce the new <flops>. Mathematically the new DCF is found the same way, dividing the old DCF by the old DCF to produce 1.0. To work more directly, I suggest first figuring out a reasonably accurate estimate of the ratio of GPU to CPU speed. Checking both VHAR and midrange is sensible. For your E7600 Core 2 duo and GTX 260 I found comparables which indicate the GPU is about 17 times faster than the CPU at VHAR, about 12 times faster at midrange. Splitting the difference, 14.5 should work as an overall estimate. The actual times on that host for a comparable VHAR at AR=1.395 were 1858 seconds for CPU and 107.1 seconds for GPU. That's a 17.3 ratio. If we increase the GPU time and decrease the CPU time proportionally to get the ratio down to 14.5, about 1700 seconds for CPU and 117 for GPU matches well. Since 17.3 is about 19% more than 14.5, I adjusted each time by about 9.5%. ALL VHAR tasks are given the same rsc_fpops_est value of 4.756e13 by the splitter. To calculate <flops>, we simply divide that by those adjusted times and get about 2.798e10 for CPU and 4.065e11 for GPU. Enter those in app_info.xml and set DCF to 1.0 in client_state.xml. For those doing Astropulse on CPU, its <flops> should be about 2.5 times the S@H Enhanced CPU <flops>. Joe
This really is an easy how-to. Thanks a lot! I now use Fred's BOINC Rescheduler and have a min DCF of 0.5 and a max DCF of 1.5 set in the config.xml file. Currently I have a DCF of 1.4x.
What is strange is that I have renamed .vlar WUs, moved from GPU to CPU, which have estimated times of ~25 hours. Normally a VLAR WU needs a little more than 2 hours. I also have a lot of CPU WUs with estimated times of ~6 mins. After searching, I found they are AR 0.31x WUs. I don't know the crunching time for this AR. An AR 0.44x WU would need ~100 mins, a shorty ~30 mins, so I guess an AR 0.31x WU would need ~1 hour.
So I have a lot of CPU WUs in my BOINC with estimated times that are much too high or much too low. I guess these are all renamed GPU WUs. Will they result in the famous -177 error? How could I bypass this? Do I need to add something (in the config.xml file) for Fred's BOINC Rescheduler?
In the past it was much easier to crunch SETI@home. ;-)

©2024 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.