SETI@home now supports Intel GPUs
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
Since Albert's willing to let us use their beta to test & tune some things (in time), we're hopeful for an Apollo 13 style rescue rather than a Coors-Light party-train disaster. Last I heard, they were wrestling with some infrastructure upgrades. I'll probably consult with the others during the week to try to get a rough idea of timing. The difference here is that we have to do it while the vehicle's in motion and fully loaded with passengers. Sure :) nothing wrong with a little pressure :-X
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
> I'll laugh at the Collatz joke when they start counting operations and using that to grant credit.

The joke is fine as it is ;) I laughed loudly enough. Even if their credits aren't too good, CreditScrew ones are counter-productive: they distract and hurt instead of attracting and stimulating. That's reason enough to dump them.

EDIT: regarding operation counting - are you sure that FLOPS == work done, no matter what algorithm is used? AstroPulse, for example, does merely c=a+b in most of its parts, while some other project could do something like c=exp(a)*sin(b). What would FLOPS counting give once the need for memory accesses is accounted for?
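To make that point concrete, here is a minimal, purely illustrative C++ sketch (not from the thread): two loops with comparable nominal FLOP counts, one memory-bound and one dominated by transcendentals. Their run times differ by a large factor on typical hardware, so a raw FLOP tally is a poor proxy for work done.

#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 1 << 24;
    std::vector<float> a(n, 1.1f), b(n, 2.2f), c(n);

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i)   // 1 add per 12 bytes of traffic: memory-bound
        c[i] = a[i] + b[i];
    auto t1 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i)   // same "1 result per element", but compute-bound
        c[i] += std::exp(a[i]) * std::sin(b[i]);
    auto t2 = std::chrono::steady_clock::now();

    auto ms = [](auto d) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
    };
    std::printf("c = a + b         : %lld ms\n", (long long)ms(t1 - t0));
    std::printf("c = exp(a)*sin(b) : %lld ms\n", (long long)ms(t2 - t1));
    std::printf("checksum: %f\n", (double)c[n - 1]);  // keep the loops live
}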
juan BFP Joined: 16 Mar 07 Posts: 9786 Credit: 572,710,851 RAC: 3,799
> I'll laugh at the Collatz joke when they start counting operations and using that to grant credit.

+ 1000
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
> EDIT: regarding operation counting - are you sure that FLOPS == work done, no matter what algorithm is used? AstroPulse, for example, does merely c=a+b in most of its parts, while some other project could do something like c=exp(a)*sin(b). What would FLOPS counting give once the need for memory accesses is accounted for?

You can of course factor that in. Recent FFT developments reduced (serial) FFT algorithm complexity from k*n*log(n) ( O(n log n) ) to a slightly smaller constant ( still k*n*log(n), O(n log n) ), but optimal compute complexity remains more or less what it was, and it ignores all memory/storage accesses (full latency hiding is assumed, before and now). That's the first major change in Fourier analysis in, I think, 30 years or so.
Eric Korpela Joined: 3 Apr 99 Posts: 1382 Credit: 54,506,847 RAC: 60
Yes, if we went back to FLOP counting we would need to standardize. It makes sense to standardize on the most common algorithms for things like FFT, trig and exp. FFT would be 5*N*log(N); trig functions would be about 11 FLOPS if the result is used as single precision, and I've forgotten the number (17?) for double precision. Granting a standardized value rewards optimization that removes operations (e.g. an optimized sincosf() would get credit for 22 FLOPS even if it actually spends closer to 16). Of course, the project needs to be honest about whether it really needs both values from the sincosf(). That said, I think the current SETI@home FLOP counting grants 1 FLOP for sin() or cos().
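A minimal sketch of that standardized-count idea (the struct and method names are hypothetical; only the weights come from the post):

#include <cmath>
#include <cstddef>

// Charge each primitive its agreed standard cost, regardless of how the
// app actually computes it.
struct FlopCounter {
    double flops = 0;
    // FFT standardized at 5*N*log2(N) operations
    void fft(std::size_t n) { flops += 5.0 * n * std::log2(static_cast<double>(n)); }
    // single-precision trig standardized at ~11 FLOPS per call
    void sinf_call() { flops += 11; }
    void cosf_call() { flops += 11; }
    // sincosf() is credited as sin + cos = 22, so an implementation that
    // computes both in ~16 real operations is rewarded for the optimization
    // (provided the app honestly needs both values).
    void sincosf_call() { flops += 22; }
};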
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
> Yes, if we went back to FLOP counting we would need to standardize. It makes sense to standardize on the most common algorithms for things like FFT, trig and exp. FFT would be 5*N*log(N); trig functions would be about 11 FLOPS if the result is used as single precision, and I've forgotten the number (17?) for double precision.

The 'other' way I came up with, which *should* remove the coarse scaling error, is to accept the initial (unscaled) WU estimate as the minimum operations, or minimum * some constant, to allow for some small overhead plus initial breathing room to prevent aborts ('make sure never to underestimate'). That's likely the main thing I'll be testing at Albert (when the time comes), because it allows the automatic scaling to compensate for SIMD, optimisation and potentially multithreading, while still keeping the automatic scaling for fine-tuning and sanity as intended. That of course would rely on projects setting the minimum estimate (* some constant). The heuristic might go something like this, while saving a lot of the costly sanity checks in place at the moment:

// normal sanity checks here, minus some costly ones that won't be needed
// any more once the system is stable in the engineering sense.
// look for outliers properly
...
// if not an outlier...
credit_multiplier = raw_flop_claim / wu_estimate;
if (credit_multiplier < 1) {
    // ...must be SIMD or optimised; there's an inbuilt underclaim without this
    // ...round this to steps if desired
    // raise some red flags if this is lower than, say, 1/6 or 1/8
    // ...could be missing some outlier, or old/broken clients
    credit_multiplier = 1 / credit_multiplier;
} else {
    // either the estimate was spot on, or the application is multithreaded
    // and sending back the sum of elapsed time per resource
    // (as a good multithreaded app should)
    // ...assume multithreaded for a high credit_multiplier, and allow whatever is
    // ...possible/consistent with the app version & known host resources
    if (app_is_mt && host_has_mt) {
        ... // allow it
    } else {
        ... // probably we have some coarse overestimate
        ... // allow for usage variation
        ... // adjust credit_multiplier and the app_ver wu_scale used by the scheduler
    }
}
...
// A PID controller smooths this; tune for rapid convergence,
// which allows for hardware change (small initial overshoot).
// This is better than a weighted sigma (undamped averages).
host_app_ver_scale = host_app_ver_update_est_scale(app_ver, credit_multiplier);

// PID the global app-version scale too, used for self-tuning initial estimates
// globally and for finding the 'most efficient' app. Tune for slow response.
wu_scale = wu_scale_update(app_ver, host_app_ver_scale);

new_credit_claim = raw_credit_claim * wu_scale * host_app_ver_scale;

Likely logic gremlins aside, cascading two controllers like that is fine, and CreditNew currently does it; the problem is that when both are unstable it leads to the confusing effects we all see user-side. Implementing these scales as PID-controlled outputs allows noise rejection / damping, while potentially removing the need for certain costly sanity checks and database accesses (e.g. no need to look up and adjust a database to average a bunch of values spanning a month). The three knobs, P, I and D (which are 'gains'), can be set to 1,0,0 to emulate the behaviour of the current system (ignoring the logic changes above), set to a 'classic' preset, or manually tuned (one-time). Tuning won't affect the coarse scale, just stability and noise rejection. The invisible internal controls self-adjust, so there's no work there.

If any initial tuning at all is too difficult for a project, and no classic presets seem suitable, then a fuzzy assist is doable (and not as big a deal as it sounds). All that would basically achieve is convergence on the ('fair') COBBLESTONE_SCALE, noise immunity, and better response to hardware or usage-pattern changes. I've been using a modified 6.10.58 client that implements the PID controller to track task estimates for a couple of years now. Client-side, it adapts in near real time to machine usage and hardware change, without intervention.
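As a concrete illustration of the damping idea, here is a minimal C++ sketch (all names and the exact update form are hypothetical, not CreditNew code). Gains of (1, 0, 0) pass the latest multiplier straight through, approximating the current undamped behaviour, and the entire state is six doubles per scale: the three gain knobs plus three internals.

// Hypothetical PID-damped scale, e.g. a host_app_ver scale.
struct PidScale {
    double kp = 1.0, ki = 0.0, kd = 0.0;     // the three one-time tuning knobs
    double integral = 0.0, prev_err = 0.0;   // internal controller state
    double scale = 1.0;                      // current output

    double update(double credit_multiplier) {
        double err = credit_multiplier - scale;  // deviation from current scale
        integral += err;                         // accumulated error (I term)
        double derivative = err - prev_err;      // rate of change (D term)
        prev_err = err;
        scale += kp * err + ki * integral + kd * derivative;
        return scale;
    }
};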
Eric Korpela Joined: 3 Apr 99 Posts: 1382 Credit: 54,506,847 RAC: 60
Another possibility I've considered: I've never confirmed that the credits really are scaled to the least efficient CPU version for a platform. In theory I could create a CPU version of SETI@home with no threading or SIMD, using the Ooura FFT, and release it under the plan class "calibration". After 100 results come back from that version, the credits of everything else should go up. In theory, of course. More work would be required to allow short-running calibration versions. Then, for credit calibration, all a project would need to do is generate a calibration version of every application (including GPU apps), plus some server code to greatly limit the number of calibration tasks that go out.
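A rough sketch of the arithmetic behind that (hypothetical names, not actual scheduler code): if the calibration version's average claimed peak FLOP count defines the reference, every real version's scale comes out at or above 1, and granted credit can never be dragged below the unoptimized baseline.

double version_scale(double avg_pfc_calibration, double avg_pfc_version) {
    // The calibration build is the least efficient, so its average claimed
    // peak flop count (pfc) is the largest of any version; the ratio is
    // therefore >= 1 for every optimized version.
    return avg_pfc_calibration / avg_pfc_version;
}

double granted_credit(double result_pfc, double scale, double cobblestone_scale) {
    // scale the version's raw claim back up to calibration-equivalent work
    return result_pfc * scale * cobblestone_scale;
}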
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
> Another possibility I've considered: I've never confirmed that the credits really are scaled to the least efficient CPU version for a platform. In theory I could create a CPU version of SETI@home with no threading or SIMD, using the Ooura FFT, and release it under the plan class "calibration". After 100 results come back from that version, the credits of everything else should go up. In theory, of course.

From memory (this needs another walk-through when I'm awake), it's scaling to the dodgy average of the lowest effective claim (which will always be overweighted toward AVX populating the last n results in the sample set). Raw claims there (for AVX) are about one fifth of [reality, or] the original estimate, mixed with a mid-to-dominant proportion of SSE-SSE3 by volume. Combined, that brings the claim to about one third of [reality, or] the initial estimate (which I always interpreted as a minimum [based on fundamental compute complexity]).

---> Shorties should be above 100 credits, not ~40 +/- 25%. We added autocorrelation since the time they used to be 90-100 [there was a drop to ~60 in between, before AVX and autocorrelation, attributable to CreditNew's introduction not accounting for the existing SIMD optimisations]. That's reasonably close to the original old multiplier of 2.85, which more or less compensated for a lot of overhead and some flop-count shortfalls (whether that was the intent or not).

A possible middle ground, with fewer logistical issues but slightly less precision, would be to send out an app with just a benchmark, to grab CPU capabilities and several forms of Whetstone (FPU double, FPU single, SSE-SSE3 single/double and AVX... maybe even baseline GPU). That should yield (at least for CPU) a cross-check for BOINC's Whetstone (approximating clock rate for x87 builds), detailed host capabilities, and coarse corrective multipliers for the scaling (given the server already knows about app capabilities somewhere).

Anyway, still looking at the options with the least work involved first. Fingers crossed, with the noise rejection and stability improved and the coarse scaling assumptions repaired, the thing would converge on its own [likely immediately around COBBLESTONE_SCALE when correct].
shizaru Joined: 14 Jun 04 Posts: 1130 Credit: 1,967,904 RAC: 0
> I've never confirmed that the credits really are scaled to the least efficient CPU version for a platform.

Many moons ago I emailed Dr. A with (pretty much)* this exact question, because no one here knew the answer. (I asked all over the place. Many, many, maaany times.) Naively, I thought a simple question would get a simple answer. And I did get one, just not the kind I was hoping for. Instead of a technical answer to a technical question, I managed to get Dr. A to run over and bug you by asking if credits are OK at SETI. Which makes me want to smile sardonically, bang my head on the desk (while shaking it in disbelief), and apologize... all at the same time! :)

*if you replace the word 'least' with 'most' in the quoted text. It doesn't matter which, really; it just appears that whatever v6 was scaling to is missing in v7. I still think it's worth a look to make sure it wasn't the 'illegal' Intel opti version that everything was scaling to.
Eric Korpela Joined: 3 Apr 99 Posts: 1382 Credit: 54,506,847 RAC: 60
That's what I never understood. There's enough information in the PFC values to determine the credit scaling, but a pfc_scale factor is calculated instead. A scale less than 1 should never be possible (for a CPU app) if we're scaled to the least efficient, and a scale more than 1 should never be possible if we're scaled to the most efficient. Yet a quick check shows that our pfc_scales range from 0.51 to 1.30. So I'd say that our credit grants are probably low by at least 1/0.51 = 1.9X. The way the current code seems to work is that the most common CPU app (Windows) sets the scaling. That needs to be fixed.
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
> That's what I never understood. There's enough information in the PFC values to determine the credit scaling, but a pfc_scale factor is calculated instead. A scale less than 1 should never be possible (for a CPU app) if we're scaled to the least efficient, and a scale more than 1 should never be possible if we're scaled to the most efficient.

That's right, and the samples used for the averages get weighted by the most commonly returned results (Windows SSE/AVX-enabled by nature). It's scaling to the 'most efficient', but the method used to determine throughput is faulty too, using BOINC's FPU Whetstone for a vector unit --> an impossibly low pfc_scale.

-> compare SiSoft Sandra single-thread FPU Whetstone to BOINC Whetstone [the same]
-> compare SiSoft Sandra SSE Whetstone to FPU Whetstone [2-3x]

pfc_scale will oscillate from about 0.3 to 2, depending on the population of the last n samples [platform, CPU, app capabilities]. Likewise, without damping those scales, the incorrectly selected 'most efficient' app can swap around too.
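A toy calculation (all figures invented; the 2-3x ratio comes from the post) shows how benchmarking the FPU while the app actually runs vector code drives the apparent scale below 1:

#include <cstdio>

int main() {
    double fpu_whetstone  = 3.0e9;                 // what BOINC measures, flops/s
    double sse_throughput = 3.0 * fpu_whetstone;   // what the vector app really sustains
    double elapsed = 1000.0;                       // seconds for one task

    double pfc_benchmark = elapsed * fpu_whetstone;   // claim built from the FPU benchmark
    double pfc_actual    = elapsed * sse_throughput;  // work the SIMD code actually did

    // normalizing the benchmark-based claim against the real work gives ~0.33,
    // inside the "impossibly low" pfc_scale band described above
    std::printf("apparent pfc_scale ~ %.2f\n", pfc_benchmark / pfc_actual);
}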
Eric Korpela Joined: 3 Apr 99 Posts: 1382 Credit: 54,506,847 RAC: 60
Another question... do multi-threaded apps consistently report CPU time as about n_compute_threads * elapsed_time on all platforms (so we could use CPU time / elapsed time to determine a multiplier)? (I realize that doesn't cover SIMD, but it's a start.)
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
> Another question... do multi-threaded apps consistently report CPU time as about n_compute_threads * elapsed_time on all platforms (so we could use CPU time / elapsed time to determine a multiplier)?

I've been hopeful the answer is yes, for other purposes, but I haven't been able to check the boincapi end completely yet. For compound [asymmetric] apps to work with runtime change, I need the total across resources, and to ride the <flops> rate as well, which I expect would run into all sorts of safeties. [Edit:] The basic plan was to cut the projected GBT AstroPulse run down from 6 months by getting the entire GPU Users Group working on one WU at a time, multithreaded and multi-GPU'd.
Eric Korpela Joined: 3 Apr 99 Posts: 1382 Credit: 54,506,847 RAC: 60
Yes, I can see some of these can be incremental changes, while some will require more work under the hood. Fixing the "multi-threaded apps get the same credit as single-threaded apps" problem and normalizing to the least (rather than most) efficient version are things I can put into the code quickly. Things that require design and database changes to track resources, rather than just code changes, will take more time.
jason_gee Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0
> ...Things that require design and database changes to track resources, rather than just code changes, will take more time.

For the fine-tuning/damping end (well after the coarse scaling issues), I can just pinch the sample arrays already being used for the pfc_scale and host_scale rubbish averages. A full PID implementation would only need 6 sample spaces per scale: 3 for the fixed gain knobs and 3 for the internal variables. I make that a saving of about 2*94 database lookups per host result validation, and so 188*sizeof(double) bytes per pending host result in the working set/cache.
shizaru Joined: 14 Jun 04 Posts: 1130 Credit: 1,967,904 RAC: 0
> ...normalizing to the least (rather than most) efficient [app is] something I can put into the code quickly.

This I'd love to see. Assuming CN doesn't have a built-in failsafe to slap the change down, I'm especially curious about what happens next, because if CN 'rewards' app efficiency, then the change will have the opposite effect and credit will actually drop further. I hope my Cassandra instincts are wrong on both counts.
Richard Haselgrove Joined: 4 Jul 99 Posts: 14679 Credit: 200,643,578 RAC: 874
> Another question... do multi-threaded apps consistently report CPU time as about n_compute_threads * elapsed_time on all platforms (so we could use CPU time / elapsed time to determine a multiplier)?

For Windows, and the limited number of MT applications I've run (AQUA, and some MilkyWay N-Body), yes. There is significant wastage from thread-synchronisation issues (not every thread reaches a waypoint after the same elapsed time), and in MilkyWay's case significant pre- and post-processing in single-threaded mode. So expect the ratio to be consistently below n_threads, but unambiguously greater than 1. Stock MT tasks under BOINC are usually configured to use a thread count of (host CPU count) * (%age of CPUs specified in computing preferences), but that can be overridden under anonymous platform. If MilkyWay still has tasks, I can run some test cases if you want. [There was some concern that Linux MT apps use all available cores even when deployed in single-threaded mode, but IIRC they still report CPU time honestly. That may be down to MilkyWay's deployment quirks - it took me about six months to teach them how to use a plan_class properly.]
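On that assumption, the multiplier Eric proposes could be as simple as the following sketch (a hypothetical helper, not BOINC API code), clamped per Richard's observation that the ratio sits below n_threads but above 1:

// Hypothetical sketch: derive a multithreading multiplier from the
// reported times. Assumes cpu_time is the sum across compute threads.
double mt_multiplier(double cpu_time, double elapsed_time) {
    if (elapsed_time <= 0.0) return 1.0;   // guard against bad reports
    double m = cpu_time / elapsed_time;    // ~n_compute_threads when threads saturate
    return (m < 1.0) ? 1.0 : m;            // single-threaded apps stay at 1
}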
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
> Another possibility I've considered: I've never confirmed that the credits really are scaled to the least efficient CPU version for a platform. In theory I could create a CPU version of SETI@home with no threading or SIMD, using the Ooura FFT, and release it under the plan class "calibration". After 100 results come back from that version, the credits of everything else should go up. In theory, of course.

What about calibrating on the best valid result instead of the worst one? At least that would encourage stock optimization instead of discouraging it...
Eric Korpela Joined: 3 Apr 99 Posts: 1382 Credit: 54,506,847 RAC: 60
You've got it backwards. Calibrating to the best-optimized version reduces the credit grants for all versions and penalizes optimization. Calibrating to the worst increases the credit grants for all versions.
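To put invented numbers on that: suppose the unoptimized build claims 100 units of peak FLOPs per task and an AVX build claims 30 for the same work. Normalizing to the best grants every version 30, so the optimized build earns no more while the plain one earns less; normalizing to the worst grants every version 100, so the optimized build earns the same credit per task in a fraction of the wall time, and optimization pays off.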
BilBg Joined: 27 May 07 Posts: 3720 Credit: 9,385,827 RAC: 0
> Calibrating for the worst increases the credit grants for all versions.

The problem, then, is that projects often remove the slowest plain version once they have a working SSE2/3 one (especially for 64-bit applications, since all such CPUs have SSE3). So if they are on CreditNew, do they need to keep the slowest app for 'calibration'? ;) Otherwise the calibration will be done against the SSE2 app as the 'worst'?

Couldn't your 'calibration' app be the current app (+ the same libfftw), but compiled to use only the FPU? Or simply use the command-line option:

-default_functions    use the safe unoptimized default functions

That would be the fairest calibration app (the same app/library, just done the old-fashioned way), and of course distributed with very small probability (e.g. a host might get a slow calibration app/task once a year), if that's the way CreditNew works in the first place.