Observation of CreditNew Impact (4)
Author | Message |
---|---|
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"It needs to be remembered that this isn't simply a SETI concern. Other projects issue credits too, and draw conclusions from them:" Good to know. Sounds like prudent people. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
ML1 Joined: 25 Nov 01 Posts: 21235 Credit: 7,508,002 RAC: 20 |
"... 'BOINC Whetstone' versus app technology. The CN Author didn't understand that instruction-level parallelism does more operations in the same time, by an average factor of, you guessed it, 3.3." (Also) Are the credits being compared against the credit granted for the hardware performance of the "median computer" that s@h sees? What happens when the "median computer" host becomes a GPU-based system rather than CPU-only? Would we then see a sudden unholy credit rate shift? Keep searchin', Martin (Jason: Good to see you're still optimizing!) See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3) |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"... 'BOINC Whetstone' versus app technology. The CN Author didn't understand that instruction-level parallelism does more operations in the same time, by an average factor of, you guessed it, 3.3." In 'principle' yes, though there is an interesting combination of factors there that leads to a global downscale to one quite specific under-'claiming' app. 1] The GPU [all brands] 'raw peak flop claims' are inflated, of course, having been derived from 'marketing flops' as opposed to BOINC's FPU Whetstone for CPU apps. In 'principle' that would be OK, because of the second point: 2] it isn't 'the median computer' that's used, but in fact the lowest 'claiming' one. In principle that might be OK too, but at least in Multibeam the estimates embedded in the tasks are based on a theoretical minimum number of operations, such as k*n*log(n) for an FFT portion, for example. Since you cannot actually do an FFT in fewer operations than this, any claim below the estimate is actually 'suspicious' rather than, as currently interpreted, 'most efficient'. IOW, AVX does just as many operations as any other app, but with no allowance for parallelism the server code 'believes' in magic, choosing it as 'the most efficient' in number of operations and globally downscaling everyone to below the immutable cobblestone scale. Where the problem exists is that the 'raw claims' for CPU use a knobbled FPU Whetstone for SIMD applications/hosts [vectorised, instruction-level parallelism], SSE+ being by far dominant now and AVX gaining traction. As a consequence, the two distinct 'unholy steps' we all observed match quite well to the introduction of CreditNew itself and SSE+ optimisations into the stock CPU application, followed by the more recent stock AVX CPU application with V7. [A bit later this evening, local Oz time] I'll try to find and post the graphs I have somewhere that visually illustrate the two key issues: improper scaling through an inaccurate/improper choice of Whetstone, and instability characteristics.
There are more minor issues, though at this time it appears most symptoms originate from these two, and are relatively insensitive to the OK workunit estimates that might appear a reasonable first suspect. |
Sirius B Joined: 26 Dec 00 Posts: 24912 Credit: 3,081,182 RAC: 7 |
"200 year old [control theory] technology that works is generally frowned upon in modern academia though, as it doesn't attract funding." Two points here. 200 years ago, technology was mainly mechanical, with electrical technology starting thereabouts. So if that worked for 200 years, why hasn't an updated version been produced for the electronic age (we can safely say that the '50s started the electronic age, so that's 60 years so far)? Secondly, can it be done to meet modern academia's approval? |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"200 year old [control theory] technology that works is generally frowned upon in modern academia though, as it doesn't attract funding." Correct on both points. There are a lot of practical possibilities. One effective choice, and exceedingly simple to code, is the modern engineered version using 'classical' control theory known as a PID controller. It is based on steam engine mechanical governors. See http://en.wikipedia.org/wiki/PID_controller. [OK: 1890s there, so more like 120+ years] At first educated glance, the existing mechanism looks like a PI controller, i.e. missing the 'D' damping term. It isn't quite that though, because it uses sampled averages [a sigma-delta controller with no delta] instead of instantaneous cumulative error values. The weightings make it closer to a 'P' with some fudge factors replacing the 'I' and 'D' terms, which, if present, 'govern' long-term drift and noise immunity. For the second part, there are 'stable' systems and 'unstable' ones, in the formal engineering definitions. CreditNew as currently implemented fits in the second category, with particular observable traits that I'll describe with my graphs a bit later. Fortunately, instability and improper choices aside, CreditNew as a whole is 'relatively' sound. Short-term, very minor bandaids are feasible, and more carefully tuned control for the longer term ends up simpler and more robust by far, potentially with less server/database load and other advantages. |
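For readers unfamiliar with the term, here is a minimal sketch of a textbook discrete PID controller in Python. The class name, the gains and the toy tracking loop are illustrative assumptions only; they are not taken from CreditNew or from the modified client mentioned in this thread.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Textbook discrete PID controller (illustrative gains, not tuned for BOINC)."""
    kp: float  # proportional gain: reacts to the current error
    ki: float  # integral gain: removes long-term drift
    kd: float  # derivative gain: damps overshoot and ringing
    integral: float = 0.0
    prev_error: float = 0.0

    def update(self, setpoint: float, measured: float, dt: float = 1.0) -> float:
        """Return the correction to apply for one time step."""
        error = setpoint - measured
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def track(samples, pid, start=0.0):
    """Drive an estimate towards a stream of samples, one step per task."""
    estimate = start
    history = []
    for sample in samples:
        estimate += pid.update(setpoint=sample, measured=estimate)
        history.append(estimate)
    return history
```

With hypothetical gains such as `PID(kp=0.5, ki=0.1, kd=0.05)` and a constant input, `track` overshoots briefly and then settles on the input value; dropping the `kd` term gives the 'PI' shape discussed above.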
Sirius B Joined: 26 Dec 00 Posts: 24912 Credit: 3,081,182 RAC: 7 |
Thanks Jason. So wouldn't it be more effective for CreditNew to get that fine control now rather than later, so that the projects as a whole can benefit? |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Possibly. My guess is that the mood from the project has been 'understaffed and occupied with other important stuff' like Android and GBT, which I happen to agree should be high priority. In that light, for example, I emailed months ago about moving Linux and Mac CUDA to Beta, as well as querying GPU reliability for factoring into x42's design. Understandably no response to date, so I'm moving forward regardless. As for CN itself, it does intrinsically tie into time estimates. Years ago I modified a 6.10.58 client for my own use so that it uses client-side per-application correction, stabilised with such a PID scheme. That's on my Windows hosts. I still use that today: estimates are generally within a few seconds either way, and it is robust to outliers like overflows etc. So in principle a 'proper fix' is warranted and doable, though I have to factor in that multithreaded and heterogeneous forms of parallelism are very, very near [i.e. planned for phase two x42]. That sounds complex at first, just like the rest, but it does turn out there are easy ways to gauge effective parallelism server side if the original work estimates are 'reasonable'. Considering all that, though, warrants care for future-proofing. |
Sirius B Joined: 26 Dec 00 Posts: 24912 Credit: 3,081,182 RAC: 7 |
Thanks again. I thought time and manpower would enter the equation. However, it would make more sense to get it fine-tuned asap so it can reasonably run under its own steam, as well as reducing the server/database load. That must surely give them more time to spend on Android/GBT and others. Another effect of that would be a hell of a reduction in credit complaints. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
"So in principle a 'proper fix' is warranted and doable, though I have to factor in that multithreaded and heterogeneous forms of parallelism are very, very near [i.e. planned for phase two x42]. That sounds complex at first, just like the rest, but it does turn out there are easy ways to gauge effective parallelism server side if the original work estimates are 'reasonable'. Considering all that, though, warrants care for future-proofing." As I understand it, "phase two x42" is a specific SETI-centric concept. Unfortunately, 'CreditNew' applies BOINC-wide: data is sparse about how many projects are currently using it, but I suspect the nay-sayers are wider of the mark than they realise. |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"So in principle a 'proper fix' is warranted and doable, though I have to factor in that multithreaded and heterogeneous forms of parallelism are very, very near [i.e. planned for phase two x42]. That sounds complex at first, just like the rest, but it does turn out there are easy ways to gauge effective parallelism server side if the original work estimates are 'reasonable'. Considering all that, though, warrants care for future-proofing." Indeed, though by its very nature [evolving] heterogeneous design is somewhat universal, and inevitable now. Debate and new ideas are always welcome, especially in stuff that hasn't really been done before, but 'naysaying' achieves nothing but the destruction of motivation. [Edit:] e.g. 'looks like more trouble than it's worth', or 'What I'm doing isn't working, so it must be someone else's fault'. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874 |
"So in principle a 'proper fix' is warranted and doable, though I have to factor in that multithreaded and heterogeneous forms of parallelism are very, very near [i.e. planned for phase two x42]. That sounds complex at first, just like the rest, but it does turn out there are easy ways to gauge effective parallelism server side if the original work estimates are 'reasonable'. Considering all that, though, warrants care for future-proofing." Sorry, I only meant 'naysaying' in the sense of people who say "very few projects have adopted CreditNew" - that was the version which had reached Eric's ears. Perhaps that is because the people who are most interested in credit, and comment about it on message boards, have migrated to the projects which have moved furthest away from the BOINC norms for credit. |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"So in principle a 'proper fix' is warranted and doable, though I have to factor in that multithreaded and heterogeneous forms of parallelism are very, very near [i.e. planned for phase two x42]. That sounds complex at first, just like the rest, but it does turn out there are easy ways to gauge effective parallelism server side if the original work estimates are 'reasonable'. Considering all that, though, warrants care for future-proofing." Yes, I need to post my graphs in a bit, but careful explanation is warranted. When I talk about practicality and feasibility, I am implying many things, including stabilised functionality in a formal engineering sense, and 'fair credit for work done' in a more abstract sense. Current awards are a fraction of the cobblestone scale, so they are unfair and in contradiction to the intent in CreditNew's documentation. |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Here's the first of two graphs, this one for 'stability'. The blue line is actual credit awards for a CPU SSE3-enabled anonymous machine 'X', on the Beta project: 'shorties' only, in issued/processed sequence. In this one I deliberately downscaled the red graph, zooming in, to allow the graph extents to show the real [blue] credit award instabilities closely. Notable features of the 'real' blue line include: - It 'looks like' it wants to be around a particular value but jumps around it. - When you work out the cobblestone scale against the task estimate, it should be well over 100 credits, as opposed to 30-40. - Possible 'self-similar' looking oscillations. The red line is the same input data [Whetstone, elapsed, cobblestone scale] fed into a very rough PID controller, implemented in a Google spreadsheet. I divided/downscaled its output as mentioned to illustrate what stability is. Notable characteristics include a small 10 percent initial overshoot, acting as if starting as a new host, which is for rapid convergence within 10 tasks. No special setiathome multibeam-specific factors were needed, and it seems more immune to 'noise', such as the periodic heavy machine usage that was indicated in the source data. |
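As a rough illustration of 'working out the cobblestone scale against the task estimate': BOINC defines the Cobblestone so that a reference machine sustaining 1 GFLOPS (Whetstone) earns 200 credits per day. The sketch below simply applies that definition; the function name is made up for illustration, and the operation count in the usage note is a hypothetical figure, not the real estimate of any SETI@home task.

```python
SECONDS_PER_DAY = 86_400
REFERENCE_FLOPS = 1e9            # 1 GFLOPS reference machine
CREDITS_PER_REFERENCE_DAY = 200  # Cobblestones earned by that machine per day


def cobblestones(flops_done: float) -> float:
    """Credit for a given number of floating-point operations on the fixed scale."""
    return flops_done * CREDITS_PER_REFERENCE_DAY / (REFERENCE_FLOPS * SECONDS_PER_DAY)
```

For a hypothetical task estimated at 4.32e13 operations, `cobblestones(4.32e13)` comes to 100 credits, however fast or slow the host that ran it.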
shizaru Joined: 14 Jun 04 Posts: 1130 Credit: 1,967,904 RAC: 0 |
At first glance it appears CN is auto-adjusting every few tasks. Assuming all tasks want to settle around 38, it looks like it's trying to compensate whenever you get low-balled and vice-versa. Which could also be an explanation for repeating patterns? Why it has to jump through these hoops is, of course, a different kettle of fish. But it looks (dare I say) fair? |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"At first glance it appears CN is auto-adjusting every few tasks. Assuming all tasks want to settle around 38, it looks like it's trying to compensate whenever you get low-balled and vice-versa. Which could also be an explanation for repeating patterns?" Exactly, that's called oscillation, just like the ringing of a bell. Here is a link to a famous video depicting 'Galloping Gertie', aka the Tacoma Narrows Bridge that collapsed. It led to a wider understanding of resonance in civil engineering. http://www.youtube.com/watch?v=j-zczJXSxnw [Edit:] "Why it has to jump through these hoops is, of course, a different kettle of fish. But it looks (dare I say) fair?" Oh, it's perfectly fair to get a random [or more precisely chaotic] amount of credit for work done. I will employ you and pay you a random amount determined by magic elves that work faster than you, yet claim less. Sounds fair, right? [wait for the next graph in a bit though before agreeing ;D] |
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
And here is the second graph, depicting 'scaling'. Same input data, now without downscaling the red graph. The blue line is the same 'real' data. The red line is now unscaled and governed by a commercial SSE Whetstone [SiSoft Sandra single-threaded SSE2] instead of BOINC's [knobbled] FPU one. The first important point here is that Whetstone is not a 'peak measure' at all, as implied in the CreditNew documentation, but a worst case. So this represents 'more reasonable' credit for the same work, but in fact the cobblestone scale specifies higher. The second point is that the actual work performed, irrespective of processing device and elapsed time, comes out even higher than this. That is the 'fair' cobblestone scale, and it's the system's inability to cope with parallelism [via SSE and AVX SIMD] that is to blame. For at least multibeam and astropulse here, in between a simple bandaid and a comprehensive forward-looking fix, there exists another option. Any [valid] claim lower than possible by mathematical and physical laws is using some form of effective parallelism [SIMD, multithreading, heterogeneous slave monkeys, etc.]. Use the inverse of that to scale the claim and you have 'fair credit' that compensates for multiple threads and enslaved monkeys in parallel. |
petri33 Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
I) What would be the effect of ... having a powerful host with at least 2 GPUs doing multiple tasks at a time: a) one or two heavily blanked AP work units that take a long time and then -- at the same time -- b) doing non-blanked AP unit(s) that would execute a lot faster than normal, since the GPU is less busy waiting for the CPU to do the blanking for other task(s). -- and on top of that ... at the same time running mixed workloads -- c) running normal MB task(s) and/or VLAR on an NVIDIA GPU -- plus some CPU tasks. An example: a host configured to run 2 MB or 3 AP per GPU, having 2 GPUs, and at the same time doing 6 CPU AP or MB tasks and leaving 6 CPU cores free out of a total of 12 virtual cores (6 real FPUs). Then ... a transition from MB only to AP only, and then after a few days the inevitable running out of AP work and transitioning back to MB only. During the transition there could be a real mixture of different workloads going on, with the most unusual run times. There cannot be an AI system that can figure out the "normal" processing rate. My APR varies from the average by about +-30 for AP and +-20 for MB. The TDCF (task duration correction factor) ranges from 0.8 to 2.22. <-- the number seems random (i.e. varies too fast depending on the (too few) last accepted or last reported WUs). II) When doing the BOINC Whetstone or whatever calculation, I get 28000 when running 50% of the processors and about 12000 when running 100% of them. I know that there are just 6 physical AVX/SSE/math units in my CPU even though there are 12 virtual cores, and when using all of them there is a penalty for switching the tasks (register file backup etc.) and a penalty for overcommitting the CPU cache. -- but -- if that number is used for determining the efficiency/efficacy of the CPU, I'd get two totally different numbers if there was an optimization that favours running the application one or multiple at a time. The Question: How does CreditNew know how efficient a host is?
What would be a host's maximum? How well is it doing right now? To overcome Heisenberg's: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Another discrepancy is that BOINC's "Whetstone" benchmark is double precision while the AMD and NVIDIA peak GFLOPS are rated based on single precision. Since the benchmark is non-SIMD, that doesn't make a huge difference: the canonical optimized Whetstone implementations from Roy Longbottom's PC benchmark collection give about 1092 MWIPS double precision and 1046 MWIPS single precision on my 1.4GHz Pentium-M (BOINC gives about 1292 MWIPS). Even for 64 bit BOINC builds, where the benchmark would be run using SSE scalar operations, there probably would be little difference. One way to sort of level the playing field would be to rate CPU peak GFLOPS using an approach similar to that used for GPUs. A post by an Intel engineer from a few years ago explains some of the considerations which would be needed for a fully detailed version. In practice, Intel has published export compliance metrics which include GFLOPS ratings for many of its CPUs (note those are for the full package, so they need to be divided by the number of CPUs in the package). The practical approach for BOINC would be to simply multiply the CPU clock rate by the number of single precision operations which can be done simultaneously by whatever SIMD capability a processor has. That would also work for PowerPC, ARM, etc. CPUs. Joe |
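The 'clock rate times SIMD width' rule of thumb suggested above might be sketched as follows. The lane counts are the usual single-precision register widths (128-bit SSE holds 4 floats, 256-bit AVX holds 8); the function name and the `ops_per_lane_per_cycle` parameter are hypothetical, and real processors may issue more than one vector operation per cycle, so treat the result as a coarse rating rather than a measured figure.

```python
# Single-precision lanes per vector register (register width / 32 bits).
SP_LANES = {"scalar": 1, "sse": 4, "avx": 8}


def peak_sp_gflops(clock_ghz: float, simd: str, cores: int = 1,
                   ops_per_lane_per_cycle: int = 1) -> float:
    """Coarse peak single-precision rating: clock x SIMD lanes x cores."""
    return clock_ghz * SP_LANES[simd] * cores * ops_per_lane_per_cycle
```

On this rule, a single-core 1.4 GHz CPU with SSE would rate about 5.6 GFLOPS single precision, which shows how far a rating of this kind sits from a scalar Whetstone figure.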
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"The Question: How does the credit new know how efficient a host is? What would be a hosts maximum? How well is it doing right now?" All CreditNew really has at the start is a somewhat reasonable [in our MB case] approximation of the minimum number of calculation operations that the task takes. When the task returns, it now has an elapsed time. That can yield a rate for the task representing throughput, which is averaged into APR. Moving averages from point samples, without proper damping controls, can be pretty volatile, as seen in the prior similar DCF mechanism too. They are susceptible to all sorts of ringing, overshoot, drift and noise from even normal conditions like periodic heavy machine usage. These [the estimates and CreditNew, that is] are loosely linked through a scaling system and a set of averages, with project DCF now disabled if using a relatively recent client. That was to put the estimate scaling server side, in the hope of addressing projects like this one that mix different applications in the same project. It's a design choice I wouldn't have gone with, due to increasing the server workload and slowing client estimate adaptation to new work fetches. The somewhat unstable APR value is probably the closest figure we have at the moment to some sort of reality. For multibeam applications, which still contain the stderr flopcounter value, you can take this and divide it by your choice of elapsed or CPU time to yield a throughput figure. Either way, what's needed is a controlled number that is smooth most of the time, responsive when needed, and relatively noise immune... Here the absolute value isn't all that critical; how it changes over time is. Obviously this throughput figure is going to vary for all sorts of reasons. Nonetheless, for determining overall throughput and estimates etc., properly handled control loops work a lot better than sampled averages.
It's at this point you diverge into control systems engineering theory, but it's perhaps analogous enough to driving a car. Most people don't sit 'feathering' the throttle every millisecond around the speedometer reading at the speed limit. That would be prone to all sorts of noise, overshoot etc. Instead you slip into a groove that's near enough, then make minor smooth adjustments as necessary. That's control. So in a sense, in CreditNew's current form here, it's this aggressive use of statistics, where they are not the most robust or elegant choice, that is the Achilles' heel causing so much consternation. |
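To make the contrast concrete, here is a small sketch comparing a short sampled average with a gently damped estimate of per-task throughput. None of this is BOINC's actual APR code; the function names, the window length and the gain are assumptions chosen only to show the difference in behaviour.

```python
def throughput_gflops(flop_count: float, seconds: float) -> float:
    """Per-task rate: counted operations divided by elapsed (or CPU) time."""
    return flop_count / seconds / 1e9


def window_mean(rates, n=3):
    """Mean of the last n samples at every step (a short sampled average)."""
    out = []
    for i in range(len(rates)):
        window = rates[max(0, i - n + 1): i + 1]
        out.append(sum(window) / len(window))
    return out


def smoothed(rates, gain=0.1):
    """Damped estimate: move a fraction `gain` of the way towards each sample."""
    estimate = rates[0]
    out = []
    for r in rates:
        estimate += gain * (r - estimate)
        out.append(estimate)
    return out
```

Fed the made-up sequence `[10, 10, 10, 40, 10, 10]`, the three-sample mean jumps to 20 on the outlier while the damped estimate only reaches 13. The price is slower response to a genuine change, which is the gap the extra terms of a properly tuned controller are meant to close.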
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
"Another discrepancy is that BOINC's "Whetstone" benchmark is double precision while the AMD and NVIDIA peak GFLOPS are rated based on single precision. Since the benchmark is non-SIMD, that doesn't make a huge difference: the canonical optimized Whetstone implementations from Roy Longbottom's PC benchmark collection give about 1092 MWIPS double precision and 1046 MWIPS single precision on my 1.4GHz Pentium-M (BOINC gives about 1292 MWIPS). Even for 64 bit BOINC builds, where the benchmark would be run using SSE scalar operations, there probably would be little difference." Definitely worth detailed consideration IMO, especially for those, I suppose many, projects that don't have particularly good WU estimates to start with. We certainly don't need a super precise figure here. I'd be happy with plus or minus a few credits out of 100. For MB I'd like to see the 'PI-like' weighted sigma averages replaced with a smoother control to compare, and see how close we get by simply scaling the BOINC Whetstone claim as follows: effective_parallelism = 1; if raw_claim < wu_est then effective_parallelism = wu_est/raw_claim; proper_claim = raw_claim * effective_parallelism <--- yep, that's claiming the estimate [a lower cap for parallel tasks, SIMD etc.]. Something like that might even be 'close enough' for current and future SIMD variants, as well as coping with multithreading and maybe even non-symmetric load balancing. |
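The scaling rule in the post above, written out as runnable Python. The variable names follow the post; the guard against a zero or negative `raw_claim` is an added assumption to avoid division by zero, not something the post specifies.

```python
def effective_parallelism(raw_claim: float, wu_est: float) -> float:
    """Factor by which a claim undershoots the workunit's minimum-operation estimate."""
    if 0 < raw_claim < wu_est:
        return wu_est / raw_claim
    return 1.0


def proper_claim(raw_claim: float, wu_est: float) -> float:
    """Scale an under-claim back up to the estimate; leave other claims untouched."""
    return raw_claim * effective_parallelism(raw_claim, wu_est)
```

A raw claim of 30 against an estimate of 100 gives an effective parallelism of about 3.3 and a proper claim of 100, while a claim of 120 passes through unchanged. In effect the estimate acts as a floor on the claim.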