SETI@home now supports Intel GPUs



Message boards : News : SETI@home now supports Intel GPUs

Author Message
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 5081
Credit: 74,110,404
RAC: 4,310
Australia
Message 1499798 - Posted: 4 Apr 2014, 18:22:32 UTC - in response to Message 1499735.

Since Albert's willing to let us use their beta to test & tune some things (in time), we're hopeful for an Apollo 13 style rescue, over a Coors-Light party train disaster.

Any idea when the test begins? I already joined Albert to help with that.

Last I heard, they were wrestling with some infrastructure upgrades. I'll probably consult with the others on that during the week to try to get a rough idea of timing.

The difference here is that we have to do it while the vehicle's in motion and fully loaded with passengers.

A nice challenge, no? :)

Sure :) nothing wrong with a little pressure :-X
____________
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin

Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3588
Credit: 48,772,835
RAC: 19,184
Russia
Message 1500111 - Posted: 5 Apr 2014, 10:43:00 UTC - in response to Message 1498945.
Last modified: 5 Apr 2014, 10:46:38 UTC

I'll laugh at the Collatz joke when they start counting operations and using that to grant credit.

The joke is fine as it is ;)
I laughed loudly enough. Even if their credits aren't too good, CreditScrew ones are counter-productive. They distract and hurt instead of attracting and stimulating. That's reason enough to dump them.

EDIT: regarding operations counting - are you sure that FLOPS == work done, no matter what algorithm is used? AstroPulse, for example, does merely c=a+b in most of its parts, while some other project could do something like c=exp(a)*sin(b). What will FLOPS counting give if the need for memory accesses is accounted for?
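To put numbers on the point: here is a hypothetical sketch (my toy cost tables, not any project's real accounting) of how flat FLOP counting treats every arithmetic operation alike, while a weighted table that charges transcendentals and memory accesses separates the two expressions:

```python
# Hypothetical per-operation cost tables: flat FLOP counting charges
# every arithmetic op 1 and memory accesses 0; a weighted table charges
# transcendentals and loads closer to their real cost.
FLAT = {"add": 1, "mul": 1, "exp": 1, "sin": 1, "load": 0}
WEIGHTED = {"add": 1, "mul": 1, "exp": 20, "sin": 15, "load": 4}

def cost(ops, table):
    """Total cost of an operation list under a given cost table."""
    return sum(table[op] for op in ops)

# c = a + b  (AstroPulse-style inner loop): two loads, one add
simple = ["load", "load", "add"]
# c = exp(a) * sin(b): two loads, exp, sin, one multiply
fancy = ["load", "load", "exp", "sin", "mul"]

print(cost(simple, FLAT), cost(fancy, FLAT))          # flat: 1 vs 3
print(cost(simple, WEIGHTED), cost(fancy, WEIGHTED))  # weighted: 9 vs 44
```

The weight values are made up for illustration; the point is only that "FLOPS done" diverges from "work done" as soon as the operation mix differs.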
____________

juan BFBProject donor
Volunteer tester
Joined: 16 Mar 07
Posts: 5472
Credit: 313,440,720
RAC: 95,786
Brazil
Message 1500112 - Posted: 5 Apr 2014, 10:45:37 UTC - in response to Message 1500111.

I'll laugh at the Collatz joke when they start counting operations and using that to grant credit.

The joke is fine as it is ;)
I laughed loudly enough. Even if their credits aren't too good, CreditScrew ones are counter-productive. They distract and hurt instead of attracting and stimulating. That's reason enough to dump them.

+ 1000
____________

Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 5081
Credit: 74,110,404
RAC: 4,310
Australia
Message 1500163 - Posted: 5 Apr 2014, 15:48:48 UTC - in response to Message 1500111.
Last modified: 5 Apr 2014, 15:55:29 UTC

EDIT: regarding operations counting - are you sure that FLOPS == work done, no matter what algorithm is used? AstroPulse, for example, does merely c=a+b in most of its parts, while some other project could do something like c=exp(a)*sin(b). What will FLOPS counting give if the need for memory accesses is accounted for?


You can of course factor in that recent FFT developments reduced (serial) FFT algorithm complexity from k·n·log n ( O(n log n) ) to a slightly smaller constant ( still k·n·log n, O(n log n) ), but the optimal compute complexity remains more or less what it was, and it still ignores all memory/storage accesses (full latency hiding is assumed, before and now). That's the first major change in Fourier analysis in, I think, 30 years or so.
____________
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin

Eric KorpelaProject donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1112
Credit: 10,324,623
RAC: 9,470
United States
Message 1501759 - Posted: 9 Apr 2014, 16:52:50 UTC - in response to Message 1500163.

Yes, if we went back to flop counting we would need to standardize. It makes sense to standardize on the most common algorithms for things like FFT, trig and exp. FFT would be 5*N*log2(N), trig functions would be about 11 FLOPS if the result is used as single precision, and I've forgotten the number (17?) for double precision.

Granting a standardized value rewards optimization that removes operations (i.e. sincosf() would get credit for 22 FLOPS even if an optimized implementation uses only 16). Of course, the project needs to be honest about whether it needs both values from the sincosf().

That said, I think the SETI@home FLOP counting grants 1 FLOP for sin() or cos().
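Eric's standardized counts can be sketched as a small helper (my illustration; the constants are the ones quoted above, with 17 taken as the guessed double-precision figure):

```python
import math

# Standardized operation counts as discussed: FFT at 5*N*log2(N),
# ~11 flops for a single-precision trig call, 17 (the guessed figure)
# for double precision. sincosf() is credited as two trig calls (22)
# even if an implementation shares work between sin and cos.
TRIG_SINGLE = 11
TRIG_DOUBLE = 17

def fft_flops(n):
    """Standard radix-2 FFT operation count, 5*N*log2(N)."""
    return 5 * n * math.log2(n)

def sincosf_flops():
    """Credit both results, rewarding implementations that share work."""
    return 2 * TRIG_SINGLE

print(fft_flops(1024))   # 51200.0
print(sincosf_flops())   # 22
```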
____________

Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 5081
Credit: 74,110,404
RAC: 4,310
Australia
Message 1502006 - Posted: 10 Apr 2014, 5:05:41 UTC - in response to Message 1501759.
Last modified: 10 Apr 2014, 5:31:54 UTC

Yes, if we went back to flop counting we would need to standardize. It makes sense to standardize on the most common algorithms for things like FFT, trig and exp. FFT would be 5*N*log2(N), trig functions would be about 11 FLOPS if the result is used as single precision, and I've forgotten the number (17?) for double precision.

Granting a standardized value rewards optimization that removes operations (i.e. sincosf() would get credit for 22 FLOPS even if an optimized implementation uses only 16). Of course, the project needs to be honest about whether it needs both values from the sincosf().

That said, I think the SETI@home FLOP counting grants 1 FLOP for sin() or cos().


The 'other' way I came up with, which *should* remove the coarse scaling error, is to accept the initial (unscaled) WU estimate as the minimum operations, or minimum × some constant, to allow for some small overhead plus initial breathing room to prevent aborts ('make sure never to underestimate').

That's likely the main initial thing I'll be testing at Albert (when the time comes), because it allows the automatic scaling to compensate for SIMD, optimisation, and potentially multithreading, while still keeping the automatic scaling for fine-tuning and sanity as intended.

That of course would rely on projects setting the minimum estimate (× some constant). The heuristic might go something like this, while saving a lot of the costly sanity checks in place at the moment.

// normal sanity checks here, minus some costly ones that won't be needed
// anymore once the system is stable in the engineering sense.
// look for outliers properly ...
// if not an outlier...
credit_multiplier = raw_flop_claim / wu_estimate;
if credit_multiplier < 1 then
    // ... must be SIMD or optimised; there's an inbuilt underclaim without this
    // ... round this to steps if desired
    // raise some red flags if this is lower than, say, 1/6 or 1/8 ...
    // could be missing some outlier, or old/broken clients ...
    credit_multiplier = 1 / credit_multiplier;
else
    // either the estimate was spot on, or the application is multithreaded
    // and sending back the sum of elapsed time per resource
    // (as a good multithreaded app should)
    // ... assume multithreaded for a high credit_multiplier, and allow whatever is
    // ... possible/consistent with the app version & known host resources
    if app_is_mt && host_has_mt then
        ... // allow it
    else
        ... // probably we have some coarse overestimate
        ... // allow for usage variation
        ... // adjust credit_multiplier and the app_ver_wu_scale used by the scheduler
    end
end
...
// A PID controller smooths this; tune for rapid convergence,
// which allows for hardware change (small initial overshoot).
// This is better than weighted sigma (undamped averages).
host_app_ver_scale = host_app_ver_update_est_scale( app_ver, credit_multiplier );
// PID the global app version scale too, used for self-tuning initial estimates
// globally and for finding the 'most efficient' app.
// Tune for slow response.
wu_scale = wu_scale_update(app_ver, host_app_ver_scale);
new_credit_claim = raw_credit_claim * wu_scale * host_app_ver_scale;


Likely logic gremlins aside, cascading two controllers like that is fine, and CreditNew currently does that. The problem is that when both are unstable it leads to the confusing effects we all see user-side. Implementing these scales as PID-controlled outputs allows noise rejection / damping, while potentially removing the need for certain costly sanity checks and database accesses (e.g. no need to look up a database and average a bunch of values spanning a month).

The three knobs, P, I and D (which are 'gains'), can be set to 1,0,0 to emulate the behaviour of the current system (ignoring the logic changes above), set to a 'classic' preset, or manually tuned (one-time). Tuning won't affect the coarse scale, just stability and noise rejection. The invisible internal controls self-adjust, so no work there.

If any initial tuning at all is too difficult for a project, and no classic preset seems suitable, then a fuzzy assist is doable (and not as big a deal as it sounds).

All that would basically achieve is convergence on the ('fair') COBBLESTONE_SCALE, noise immunity, and better response to hardware or usage-pattern changes. I've been using a modified 6.10.58 client that implements the PID controller to track task estimates for a couple of years now. Client-side it adapts in near real time to machine usage and hardware change, without intervention.
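For anyone unfamiliar with the control-theory jargon, a minimal sketch of the idea (mine, with made-up gains; not the modified-client code): the scale is nudged toward each noisy observed multiplier through three damped terms instead of jumping to an undamped average.

```python
class PIDScale:
    """Damped scale estimator: drives `scale` toward noisy observed
    credit multipliers. State per scale is just 3 gains plus 3 internal
    variables (integral, previous error, current scale)."""
    def __init__(self, kp=0.5, ki=0.1, kd=0.05, initial=1.0):
        self.kp, self.ki, self.kd = kp, ki, kd   # the three knobs
        self.scale = initial
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, observed_multiplier):
        error = observed_multiplier - self.scale
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        self.scale += (self.kp * error
                       + self.ki * self.integral
                       + self.kd * derivative)
        return self.scale

pid = PIDScale()
samples = [1.9, 2.1, 2.0, 1.95, 2.05] + [2.0] * 15
for noisy in samples:
    pid.update(noisy)
print(round(pid.scale, 2))  # settles near 2.0
```

Setting the gains to 1, 0, 0 makes `update()` return the observation unchanged, reproducing the undamped follow-the-latest-value behaviour of a plain average-style scale.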
____________
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin

Eric KorpelaProject donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1112
Credit: 10,324,623
RAC: 9,470
United States
Message 1502273 - Posted: 10 Apr 2014, 18:13:33 UTC

Another possibility I've considered: I've never confirmed that the credits are really scaled to the least efficient CPU version for a platform. In theory, if I were to create a CPU version of SETI@home with no threading or SIMD, using the Ooura FFT, and release it under the plan class "calibration", then after 100 results come back from that version the credits of everything else should go up. In theory, of course.

More work would be required to allow short running calibration versions.

Then for credit calibration, all a project would need to do is generate a calibration version of every application including GPU apps and some server code to greatly limit the number of calibration apps that go out.
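In outline, the calibration idea might look something like this (my sketch; the function and numbers are made up): the plain build's mean per-workunit time becomes the reference, so every faster version earns a multiplier of at least 1.

```python
# Hypothetical calibration scaling: the plain (no SIMD, no threading)
# "calibration" build defines the reference cost of a workunit; every
# other app version's grant is scaled by how it compares.
def credit_scale(calibration_times, version_times):
    """Scale factor for an app version relative to the calibration app.
    Both inputs are per-workunit elapsed times on comparable hosts."""
    ref = sum(calibration_times) / len(calibration_times)
    ver = sum(version_times) / len(version_times)
    return ref / ver   # >= 1 for anything faster than the plain build

# Plain build takes ~100s per WU; a SIMD build ~33s on the same work.
print(round(credit_scale([98.0, 102.0, 100.0], [32.0, 34.0, 33.0]), 2))  # 3.03
```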
____________

Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 5081
Credit: 74,110,404
RAC: 4,310
Australia
Message 1502315 - Posted: 10 Apr 2014, 19:48:25 UTC - in response to Message 1502273.
Last modified: 10 Apr 2014, 20:28:37 UTC

Another possibility I've considered: I've never confirmed that the credits are really scaled to the least efficient CPU version for a platform. In theory, if I were to create a CPU version of SETI@home with no threading or SIMD, using the Ooura FFT, and release it under the plan class "calibration", then after 100 results come back from that version the credits of everything else should go up. In theory, of course.

More work would be required to allow short running calibration versions.

Then for credit calibration, all a project would need to do is generate a calibration version of every application including GPU apps and some server code to greatly limit the number of calibration apps that go out.



From memory (needs another walk-through when awake), it's scaling to the dodgy average of the lowest effective claim (which will always be overweighted toward AVX populating the last n results in the sample set). Raw claims there (for AVX) are about one fifth of [reality or] the original estimate, mixed with a mid-to-dominant proportion of SSE-SSE3 by volume. Combined, that brings the claim to about one third of [reality or] the initial estimate (which I always interpreted as a minimum [based on fundamental compute complexity]). ---> Shorties should be above 100 credits, not ~40 +/- 25%. We added autocorrelation since the time they used to be 90-100. [There was a drop to ~60 in between, before AVX and autocorrelations, attributable to CreditNew's introduction not accounting for the existing SIMD optimisations.]

That's reasonably close to the original old multiplier of 2.85, which more or less compensated for a lot of overhead and some flop count shortfalls (whether that was the intent or not).

A possible middle ground with fewer logistical issues, but slightly less precision, would be to send out an app with just a benchmark, to grab CPU capabilities and several forms of Whetstone (FPU double, FPU single, SSE-SSE3 single/double and AVX... maybe even baseline GPU).

That should yield (at least for CPU) a cross-check for BOINC's Whetstone (approximating clock rate for x87 builds), detailed host capabilities, and coarse corrective multipliers for the scaling (given the server already knows about app capabilities somewhere).

Anyway, still looking for the options with the least work involved first. Fingers crossed, with the noise rejection and stability improved, and the coarse scaling assumptions repaired, the thing would converge on its own [likely immediately around COBBLESTONE_SCALE when correct].
____________
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin

Profile Alex Storey
Volunteer tester
Joined: 14 Jun 04
Posts: 567
Credit: 1,684,362
RAC: 275
Greece
Message 1502352 - Posted: 10 Apr 2014, 20:55:46 UTC - in response to Message 1502273.

I've never confirmed that the credits are really scaled to the least efficient CPU version for a platform.


Many moons ago I emailed Dr. A with (pretty much)* this exact question because no one here knew the answer. (I asked all over the place. Many, many, maaany times.)

Naively, I thought a simple question would get a simple answer. And I did, just not the kind I was hoping for. Instead of a technical answer to a technical question, I managed to get Dr. A to run over and bug you by asking if credits are OK at SETI. Which makes me want to smile sardonically, bang my head on the desk (while shaking it in disbelief), and apologize... all at the same time!:)

*if you replace the word 'least' with 'most' in the quoted text. It doesn't matter which, really; it just appears that whatever v6 was scaling to is missing in v7. I still think it's worth a look to make sure it wasn't the 'illegal' Intel optimised version that everything was scaling to.

Eric KorpelaProject donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1112
Credit: 10,324,623
RAC: 9,470
United States
Message 1502374 - Posted: 10 Apr 2014, 21:34:45 UTC - in response to Message 1502352.
Last modified: 10 Apr 2014, 21:41:11 UTC

That's what I never understood. There's enough information in the PFC values to determine credit scaling, but a pfc_scale factor is calculated instead. A scale less than 1 should never be possible (for a CPU app) if we're scaled to the least efficient. And a scale more than 1 should never be possible if we're scaled to the most efficient.

Yet a quick check shows that our pfc_scales range from 0.51 to 1.30. So I'd say because of that our credit grants are probably low by at least 1/0.51=1.9X.

The way the current code seems to work is that the most common CPU app (windows) sets the scaling. That needs to be fixed.
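The bound Eric describes is easy to see in a toy version (my sketch, not the scheduler code): normalizing the peak FLOP counts to the least efficient version forces every scale to at least 1, so observed scales like 0.51 mean something else is setting the reference.

```python
# Toy peak-FLOP-count (PFC) normalization. pfc[v] is the average
# claimed flops per workunit for app version v; dividing the reference
# version's PFC by each version's PFC gives that version's scale.
def scales(pfc, normalize_to):
    ref = normalize_to(pfc.values())
    return {v: ref / p for v, p in pfc.items()}

pfc = {"x87": 100.0, "sse3": 45.0, "avx": 20.0}

least = scales(pfc, max)   # least efficient (highest PFC) as reference
most = scales(pfc, min)    # most efficient (lowest PFC) as reference

assert all(s >= 1 for s in least.values())  # never below 1
assert all(s <= 1 for s in most.values())   # never above 1
print(least["avx"], most["x87"])  # 5.0 0.2
```

A mixed range like 0.51 to 1.30 satisfies neither invariant, which is consistent with the reference being the most common version rather than either extreme.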
____________

Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 5081
Credit: 74,110,404
RAC: 4,310
Australia
Message 1502389 - Posted: 10 Apr 2014, 21:47:16 UTC - in response to Message 1502374.
Last modified: 10 Apr 2014, 21:59:26 UTC

That's what I never understood. There's enough information in the PFC values to determine credit scaling, but a pfc_scale factor is calculated instead. A scale less than 1 should never be possible (for a CPU app) if we're scaled to the least efficient. And a scale more than 1 should never be possible if we're scaled to the most efficient.

Yet a quick check shows that our pfc_scales range from 0.51 to 1.30. So I'd say because of that our credit grants are probably low by about 1/0.51=1.9X.

The way the current code seems to work is that the most common CPU app (windows) sets the scaling. That needs to be fixed.


That's right, so the samples used for the averages get weighted by the most commonly returned results (Windows SSE/AVX-enabled by nature). It's scaling to the 'most efficient', but the method used to determine throughput is faulty too, using BOINC's FPU Whetstone for a vector unit. --> impossibly low pfc_scale.

-> compare SiSoft Sandra FPU single-thread Whetstone to BOINC Whetstone [same]
-> compare SiSoft Sandra SSE Whetstone to FPU Whetstone [2-3x]

pfc_scale will oscillate from about 0.3 to 2, depending on the population of the last n samples [platform, CPU, app capabilities]. Likewise, without damping those scales, the incorrectly selected 'most efficient' app can swap around too.
____________
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin

Eric KorpelaProject donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1112
Credit: 10,324,623
RAC: 9,470
United States
Message 1502394 - Posted: 10 Apr 2014, 22:04:26 UTC
Last modified: 10 Apr 2014, 22:08:36 UTC

Another question... do multi-threaded apps consistently report CPU time to be about n_compute_threads*elapsed_time on all platforms (so we could use CPU time/elapsed time to determine a multiplier)?

(I realize that doesn't cover SIMD, but it's a start).
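As a sketch of that multiplier (my illustration, not server code): the cpu_time/elapsed_time ratio recovers an effective thread count, clamped to at least 1 since single-threaded apps report roughly equal times.

```python
# Hypothetical multithreading multiplier from reported times:
# a well-behaved MT app reports cpu_time ~ n_threads * elapsed_time,
# so the ratio recovers an effective thread count. Synchronisation
# waste keeps it below the nominal n_threads, but it stays above 1.
def mt_multiplier(cpu_time, elapsed_time):
    return max(1.0, cpu_time / elapsed_time)

# A 4-thread task: 1000s elapsed, 3600s of CPU time across threads
print(mt_multiplier(3600.0, 1000.0))  # 3.6 effective threads
```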
____________

Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 5081
Credit: 74,110,404
RAC: 4,310
Australia
Message 1502397 - Posted: 10 Apr 2014, 22:16:11 UTC - in response to Message 1502394.
Last modified: 10 Apr 2014, 22:27:34 UTC

Another question... do multi-threaded apps consistently report CPU time to be about n_compute_threads*elapsed_time on all platforms (so we could use CPU time/elapsed time to determine a multiplier)?


I've been hopeful yes, for other purposes, but haven't been able to fully check the boincapi end yet. For compound [asymmetric] apps to work with runtime change, I need the total across resources, and to ride the <flops> rate as well, which I expect would run into all sorts of safeties.

[Edit:] The basic plan was to cut down projected GBT Astropulse from 6 months, by getting the entire GPU Users group working on one WU at a time, multithreaded and multiGPU'd.
____________
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin

Eric KorpelaProject donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1112
Credit: 10,324,623
RAC: 9,470
United States
Message 1502410 - Posted: 10 Apr 2014, 22:30:00 UTC - in response to Message 1502397.

Yes, I can see some of these can be incremental changes, and some will require more under the hood. Fixing the "multi-threaded apps get the same credit as single-threaded apps" problem and normalizing to the least (rather than most) efficient are things I can put into the code quickly. Other things that are design and database changes to track resources, rather than code changes, will take more time.
____________

Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 5081
Credit: 74,110,404
RAC: 4,310
Australia
Message 1502414 - Posted: 10 Apr 2014, 22:42:32 UTC - in response to Message 1502410.
Last modified: 10 Apr 2014, 22:43:20 UTC

...Other things that are design and database changes to track resources rather than code changes will take more time.


For the fine-tuning/damping end (well after the coarse scaling issues), I can just pinch the sample arrays already being used for the pfc_scale and host_scale rubbish averages. A full PID implementation would only need 6 sample spaces per scale: 3 for the fixed gain knobs, and 3 for the internal variables. I make that a saving of about 2*94 database lookups per host result validation, and so 188*sizeof(double) bytes per pending host result in the working set/cache.
____________
"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change."
Charles Darwin

Profile Alex Storey
Volunteer tester
Joined: 14 Jun 04
Posts: 567
Credit: 1,684,362
RAC: 275
Greece
Message 1502436 - Posted: 10 Apr 2014, 23:22:56 UTC

...normalizing to the least (rather than most) efficient [app is] something I can put into the code quickly.


This I'd love to see. Assuming CN doesn't have a built-in failsafe to slap the change down, I'm especially curious about what happens next, because if CN 'rewards' app efficiency then the change will have the opposite effect and credit will actually drop further.

I hope my Cassandra instincts are wrong on both counts.

Richard HaselgroveProject donor
Volunteer tester
Joined: 4 Jul 99
Posts: 8764
Credit: 52,716,463
RAC: 17,693
United Kingdom
Message 1502446 - Posted: 10 Apr 2014, 23:35:11 UTC - in response to Message 1502394.

Another question... do multi-threaded apps consistently report CPU time to be about n_compute_threads*elapsed_time on all platforms (so we could use CPU time/elapsed time to determine a multiplier)?

For Windows, and the limited number of MT applications I've run (AQUA, and some MilkyWay N-Body), yes.

There is significant wastage from thread synchronisation issues (not every thread reaches a waypoint after the same elapsed time), and in Milkyway's case significant pre- and post-processing in single threaded mode. So expect the ratio to be consistently below n_threads, but unambiguously greater than 1.

Stock MT tasks under BOINC are usually configured to use a thread count of (host CPU count)*(%age of CPUs specified in computing preferences), but that can be overruled in anonymous platform. If MilkyWay still has tasks, I can run some test cases if you want.

[There was some concern that Linux MT apps use all available cores even when deployed in single-threaded mode, but IIRC still report CPU time honestly. But that may be MilkyWay's deployment quirks - it took me about six months to teach them how to use a plan_class properly]

Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3588
Credit: 48,772,835
RAC: 19,184
Russia
Message 1503115 - Posted: 12 Apr 2014, 11:18:50 UTC - in response to Message 1502273.

Another possibility I've considered: I've never confirmed that the credits are really scaled to the least efficient CPU version for a platform. In theory, if I were to create a CPU version of SETI@home with no threading or SIMD, using the Ooura FFT, and release it under the plan class "calibration", then after 100 results come back from that version the credits of everything else should go up. In theory, of course.

More work would be required to allow short running calibration versions.

Then for credit calibration, all a project would need to do is generate a calibration version of every application including GPU apps and some server code to greatly limit the number of calibration apps that go out.


How about calibrating on the best valid result instead of the worst one?
At least that would encourage stock optimisation instead of discouraging it....
____________

Eric KorpelaProject donor
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 3 Apr 99
Posts: 1112
Credit: 10,324,623
RAC: 9,470
United States
Message 1503216 - Posted: 12 Apr 2014, 16:35:44 UTC - in response to Message 1503115.

You've got it backwards. Calibrating for the best optimized version reduces the credit grants for all versions and penalizes optimizations. Calibrating for the worst increases the credit grants for all versions.
____________

Profile BilBg
Volunteer tester
Joined: 27 May 07
Posts: 2881
Credit: 6,461,450
RAC: 2,706
Bulgaria
Message 1504200 - Posted: 15 Apr 2014, 3:07:21 UTC - in response to Message 1503216.
Last modified: 15 Apr 2014, 3:20:13 UTC

Calibrating for the worst increases the credit grants for all versions.

The problem then is that projects often remove the slowest plain version once they have a working SSE2/3 one (especially for 64-bit applications, as all such CPUs have SSE3).
So if they are on CreditNew, do they need to keep the slowest app for 'calibration'? ;)
Otherwise the calibration will be done against the SSE2 app as 'worst'?

Can't your 'calibration' app be the current app (+ the same libfftw) but compiled to use only the FPU?
Or simply use the cmdline option:
-default_functions       use the safe unoptimized default functions

This would be the most 'fair' calibration app (the same app/library, but done the old-fashioned way),
and of course distributed with very small probability (e.g. a host might get a slow calibration app/task once a year).

(if that's the way CreditNew works in the first place)

____________



- ALF - "Find out what you don't do well ..... then don't do it!" :)


Copyright © 2014 University of California