Message boards :
Number crunching :
Shorties estimate up from three minutes to six hours after today's outage!
LadyL Send message Joined: 14 Sep 11 Posts: 1679 Credit: 5,230,097 RAC: 0 |
NB if you run anon, inserting <flops> will circumvent this problem. Actually, as Joe explained in this post calculating flops is fairly easy. Also if you take APR from the app details page and multiply with 10e9, that's also flops. Yes, with the advent of APR to cope with the largely different speeds of applications, that a single client DCF can not handle, we stopped recommending flops entries - mainly because people find them difficult to get right and because they shouldn't be needed. But as Joe keeps reminding us, there are corners where having flops entries will stop you from jumping off the tracks. If everybody on anonymous platform had flops, neither the overblown APR nor the botched attempt at mitigating the effect of early exit tasks would have much effect (I think). We are still trying to work out if there is any way to have the installer calculate rough flops to insert into app_info.xml as part of the installation. We prefer not to have to jump through hoops though, getting David to do stuff 'right' is by far the better option. |
W-K 666 Send message Joined: 18 May 99 Posts: 19015 Credit: 40,757,560 RAC: 67 |
Snipped...... And from a long term sceptic's view, almost b***dy impossible |
Sutaru Tsureku Send message Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
Yes, I use again the <flops> entries in my app_info.xml file. In past I did it how Joe explained it. Then after it was easier (you don't need to calculate ;-) to use the values of the new application details overview of the host. Geek@Play made a manual later: 'How to add flops value'. BTW, <flops> entries could disturb CreditNew? - Best regards! - Sutaru Tsureku, team seti.international founder. - Optimize your PC for higher RAC. - SETI@home needs your help. - |
Floyd Send message Joined: 19 May 11 Posts: 524 Credit: 1,870,625 RAC: 0 |
I Hit the NNT button a couple days ago , when all this started , on my quad core cpu I had nothing But AP's and Vlars , about a weeks worth... AP's about 15 hrs each . NO gpu work at all , SO,,, Freds reschedular put a bunch of Vlars onto GPU , Only it is taking over 4 hrs for each one of them to run... 4 cpu tasks and 3 Gpu tasks at a time ... = a long time to get thru all this mess... Was hoping this would Clear up by the time I finish all the cache I have now. Still have 8 AP's that havn't even started yet... at Least I have work... LOL |
perryjay Send message Joined: 20 Aug 02 Posts: 3377 Credit: 20,676,751 RAC: 0 |
Well, things have improved a bit. The shorties that were showing estimates of over 7 hours yesterday are now showing around 2 hours and 20 minutes on my GPU. They are actually taking about 7 or 8 minutes to complete. They are working their way down though. When I first looked this morning they were at 2 hours 28 minutes guesstimates. My CPU work looks to be just about right. I don't know what will happen with any AP work I might get as I finished all of it last night and haven't got any new. Same thing for any longer midrange GPU work. All I have are shorties on my GPU. According to my details page my DCF is now at .51xxx down from .57xxx when I first looked. What should be interesting though is that I have Fred's tool set to check every 6 hours to adjust my limit RSC_ FPOPS bound. I'm not moving anything around, just making sure I don't get any 177 errors. I wonder what effect that is going to have? For those that don't know, I'm running my E5400 Intel and a GTS 450 GPU with the Lunatic's x39e flavor (Two at a time) and Raistmer's AP on NVIDIA app. Okay, I got a mid-range task for my GPU and it is guesstimating over 7 hours and my shorties are now guesstimating 2 hours and 6 minutes. Still a long way from the 7 or 8 minutes they are taking but improving. PROUD MEMBER OF Team Starfire World BOINC |
LadyL Send message Joined: 14 Sep 11 Posts: 1679 Credit: 5,230,097 RAC: 0 |
As it currently stands, only those using anonymous platform (e.g. optimised apps) without flops in app_info.xml are affected by the botched attempt to avoid overblown APR resulting in -177 errors. Those are experiencing workfetch issues, as tasks complete much faster than the far too large estimates. The problem is compounded by the 5 min minimum intervall between workfetches and having a shortie storm on top isn't helping either. At this point it would probably help inserting flops - until a higher cap goes in or until code to deal with the problem that started all this is in and works. A detailed analysis of the higher capping factor shows that, while being far better than the original one, is still too low - by up to a magnitude. task duration is currently being overestimated in the region of 25x (actually between 4x and 50x depending on card) after the limit increase (provided it stays at the current value of 10) that will drop to about 6x (1.5x - 10x). It will then affect all users... That may or may not be enough to keep the big ones afloat - it's hard to tell even if we can watch for a bit on beta. |
W-K 666 Send message Joined: 18 May 99 Posts: 19015 Credit: 40,757,560 RAC: 67 |
I'm not seeing any problems here, but that might be because I am keeping one cpu core busy with an AP task. As explained here and in "Average processing rate - a little high?", my APR is still a little high and this brings the DCF back to >1.5 at the moment; that could change if the pendings get validated soon. I do feel that I may have been to blame for this event, but do I feel guilty? Not one little bit. |
Fred J. Verster Send message Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0 |
I have a bunch of VLARs, crunched by the ATI 5870 GPUs, also noticed a host with very little memory, O.S. Linux 2.6.27.7-9-pae BOINC version 6.4.5 Memory 182.09 MB Cache 1024 KB. This host?! Also produces errors. Vlar WU. Can't be working right, with so little memory? Or something else is wrong! Should he/she be PMed, cause of so much errors? |
kittyman Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
I'm not seeing any problems here, but that might be because I am keeping one cpu core busy with an AP task and explained here and in "Average processing rate - a little high?" my APR is still a little high and this brings the DCF back to >1.5 at the moment, that could change if the pendings get validated soon. You pointed out the original problem, you did not engineer the 'fix'. Although your complaints may have precipitated it, you don't have to feel guilty that the remedy proved worse for most than the disease. "Freedom is just Chaos, with better lighting." Alan Dean Foster |
kittyman Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
As it currently stands, only those using anonymous platform (e.g. optimised apps) without flops in app_info.xml are affected by the botched attempt to avoid overblown APR resulting in -177 errors. Those are experiencing workfetch issues, as tasks complete much faster than the far too large estimates. The problem is compounded by the 5 min minimum intervall between workfetches and having a shortie storm on top isn't helping either. The kitties would like to thank you for going to bat for the multitude of crunchers who this bit of codeplay has totally messed up. Hope you shall continue to monitor things here whilst the mucked up work makes it's way through the system. Also hoping that you can find a 'fix' for the original problem that does not penalize those who never had it to begin with. Meow. "Freedom is just Chaos, with better lighting." Alan Dean Foster |
Dave Stegner Send message Joined: 20 Oct 04 Posts: 540 Credit: 65,583,328 RAC: 27 |
I second that motion !!!! Dave |
W-K 666 Send message Joined: 18 May 99 Posts: 19015 Credit: 40,757,560 RAC: 67 |
I'm not seeing any problems here, but that might be because I am keeping one cpu core busy with an AP task. As explained here and in "Average processing rate - a little high?", my APR is still a little high and this brings the DCF back to >1.5 at the moment; that could change if the pendings get validated soon. I have no guilt problems at all about this episode; I am not the one responsible for failing to tackle the problem effectively. And if you read all my posts on the subject, I have given a hint on solving the problem. And that is to remove the early finishing tasks, -9's and "too much blanking", from the APR calculation. See my latest in "Average processing rate - a little high?" and you can see quite clearly that a host running the default apps has the problem also. |
Richard Haselgrove Send message Joined: 4 Jul 99 Posts: 14650 Credit: 200,643,578 RAC: 874 |
At this point, let's just accept that a mistake has been made, and, as an unexpected and unplanned side effect, some people will be downloading less work than they would have expected if everything was working fine - splitters working flat out, an unlimited supply of new tapes, a gigabit download pipe. We all know that nirvana doesn't exist. It's just a normal work supply glitch, OK? Cricket is still near maxx, work is being crunched, the project is searching for ET at the normal speed. No-one needs to be blamed, no-one needs to get upset. But: what we do need to do is to negotiate our way safely back to the path that we've stumbled off. If we ask for too rapid a return to how things were before, the recovery is going to be far, far worse the the problem. At the moment you're asking for too little work? That's not the end of the world. On the way back, you'll be asking for too much. Think about it. |
kittyman Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
At this point, let's just accept that a mistake has been made, and, as an unexpected and unplanned side effect, some people will be downloading less work than they would have expected if everything was working fine - splitters working flat out, an unlimited supply of new tapes, a gigabit download pipe. I agree, and now that we've taken a brody off of the path and gone way out into the rough, a return to 'normalcy' should be done a bit at a time lest things really go off into the tall timber. "Freedom is just Chaos, with better lighting." Alan Dean Foster |
Kevin Olley Send message Joined: 3 Aug 99 Posts: 906 Credit: 261,085,289 RAC: 572 |
I got swamped with VLAR's a couple of months ago, it was not pretty, ended up having to process a fair quantity of them on GPU's to prevent them being wasted, no fun at all. By the looks of this machine it may take a while for this to settle down, but the last thing we want is another attempted fix that makes it even worse. Kevin |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Some notes on changesets [trac]changeset:24128[/trac] and [trac]changeset:24217[/trac]: 24128 was deployed at SETI Beta Wednesday Sept. 7 or maybe the day before and caused very little disturbance because S@H v7 is being tested so most participants are running the stock CPU application. Those few running anonymous platform testing Raistmer's openCL ATI applications are advanced users willing and able to edit app_info.xml as needed. 24128 was deployed here about a week later, and that could have happened not as a deliberate test but simply because the server software needed to be rebuilt for other reasons. 24217 is supposed to be deployed at Beta today, here tomorrow (Friday). As Richard has pointed out, the increase to a factor of 10 will not be enough for many crunching with GPUs. But since rsc_fpops_bound is set to 10 times rsc_fpops_est that factor of 10 is about as far as it could be stretched and still make any kind of sense in a server routine which applies to all projects. For those without <flops> in their app_info.xml files and trying to decide whether adding them will be needed, there are checks you can do. I suggest first going to the computer's info page at http://setiathome.berkeley.edu/show_host_detail.php?hostid=xxxxxxx and writing down the value for "Measured floating point speed" except shift the decimal point 3 places to the left to put it into GFLOPS. For instance, 2455.93 million ops/sec becomes 2.45593 GFLOPS. Next, click the "Show" button for Application details and look at the Average processing rate entries for those applications listed as "SETI@home Enhanced (anonymous platform, xxxxx)" or "Astropulse v505 (anonymous platform, xxxxx)" where xxxxx is CPU, NVIDIA GPU, or ATI GPU. If any of those are more than 10 times the GFLOPS from the benchmark, runtime estimates for that application will be affected. If the ratio is not too much more than 10, the core client's DCF may adequately cope. 
If you're running more than one task per GPU, it's even worse. The core client will probably be using a guesstimated <flops> for GPU which is less than the CPU benchmark. Looking in BOINC's client_state.xml file, those <flops> can be seen within the <app_version> sections. Dr. Anderson's intended longer-term modification will allow a project's Validator to indicate when the runtime of a validated task should not be used in the averages, what Claggy asked for. I don't know any way to guess when that might be implemented. Joe |
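Joe's check above can be sketched as a short script. The numbers are examples only (the benchmark figure reuses the one from his post; the APR is hypothetical); the factor of 10 is the threshold he describes:

```python
# Rough check of whether runtime estimates for an anonymous-platform app
# are likely to be affected, following the recipe above: compare the app's
# Average processing rate against the host's benchmark, both in GFLOPS.

benchmark_mflops = 2455.93                    # "Measured floating point speed" (million ops/sec)
benchmark_gflops = benchmark_mflops / 1000.0  # shift the decimal point 3 places

apr_gflops = 95.0   # Average processing rate from Application details (example value)

ratio = apr_gflops / benchmark_gflops
if ratio > 10:
    print(f"APR is {ratio:.1f}x the benchmark - estimates for this app will be affected")
else:
    print(f"APR is {ratio:.1f}x the benchmark - the client's DCF may adequately cope")
```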
kittyman Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
Joe....now you have me a bit confused again. I thought that the current problem was occurring client side, messing up the DCF and other important stats. This happens with or without validation. Sending out work with estimates artificially bloated makes one's cache settings pretty much impossible to reach. Or, are you saying that client side and server side estimates will combine to sort this? And what about the DCF tussle between CPU and GPU work? Meow? "Freedom is just Chaos, with better lighting." Alan Dean Foster |
Josef W. Segur Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
What I'm hoping is that when that longer-term protection against APR going ridiculously high is in place, the current short-term junk will be removed. If so, we'd be back in a mode where the server-side scaling could work to keep estimates reasonable and DCF would be fairly stable. Runtime estimates and their effects on work fetch can't be considered just client side or server side, what the client tells the server affects the estimates and vice versa. Joe |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
...I thought that the current problem was occurring client side, messing up the DCF and other important stats... Hi Mark, in case Joe is off catching some Z's I'll try to answer & clarify a bit. Beat me to it, so I was awake... :D Had posted
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
LadyL Send message Joined: 14 Sep 11 Posts: 1679 Credit: 5,230,097 RAC: 0 |
I think the 'intended longer term modification' being referred to is the future one that Dr.A. implements to fix whatever is happening now... The longterm fix is really about a statistical problem - whether or not you include outlying datapoints into your calculations. The whole CreditNew/APR cluster is a statistical approach. It currently uses all tasks for calculations of both credit and APR. Early exit conditions (-9 on MB, 30/30 on AP, some others) make the server believe, that the task was completed extremely fast. Consequently APR starts to go up - sometimes, as we learned from Winterknight absurdly high, with all the associated problems when that bloated value gets fed back into the runtime estimates. This can then lead to -177 errors. Based on the report on boinc_dev about this problem DA decided to implement two things - one 'short term' - the capping on the processing rate used in estimates (so irrealistic rates would not be used. Sadly he didn't have realistic expectations as to what realistic rates might be) and one 'long term' - removing outlying datapoints from the calculations by allowing the project to mark certain cases as outlying points not to be used in calculations. The project then still needs to define what it treats as outliers. Statistics can be wonderfully manipulated by careful selection of outliers... |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.