Posts by jason_gee


1) Message boards : Number crunching : GTX 1060 or RX 480? Best bang for Buck seti-wise? (Message 1804981)
Posted 13 hours ago by Profile jason_gee
Sumpthin' like good 'ol Steve, eh? :-)


Yup, except probably more than an arm, leg and eyeball :)

On the surface I'm seeing that both the RX480 and 1060 seem to have learned from the 750ti successes. I think the next year for both breeds could be interesting.
2) Message boards : Number crunching : GPU Utilization: Thermal Throttling? (Message 1804973)
Posted 14 hours ago by Profile jason_gee
Evidently the Pascal chips have improved context-switching;

Significantly improved.

From a recent article at AnandTech.
Asynchronous Concurrent Compute: Pascal Gets More Flexible and the following page on pre-emption are worth a read.


What I'm reading there accurately reflects the behaviour I see on GTX 980 when using multiple streams on Windows. Driving the device(s) to high load without bogging down the machine is a fiddly balancing act.

That hardware scheduling refinement in Pascal is potentially a very big thing for us. --- > looks like I need to try to get hold of one (preferably a 1060/1070)

The missing part, raised by the question in the article ('if dynamic scheduling is so great, why didn’t NVIDIA do this sooner?'), is that the transistor budget for it is high, and it wasn't really necessary until commercial VR. The latency levels needed for VR not to make you motion sick are much lower than for conventional displays.

For our purposes it just means we should see better loading without choking the host system as much.
3) Message boards : Number crunching : GTX 1060 or RX 480? Best bang for Buck seti-wise? (Message 1804971)
Posted 14 hours ago by Profile jason_gee
Yeah, for the Windows side, experiments went very well for the performance side of things, but quickly fell flat when considering general release needs.

Because of the dated code structure and build system, some line has been crossed where implementation, testing and debugging become far more difficult than they should be (on all platforms). That's not helped by a massive influx of new cards, and by multiple deprecations of compilers, libraries and techniques that differ by platform.

So while others are able to use the alpha code submission on small scales, my own energies are directed to new infrastructure.

That does mean X-branch temporarily drops out of the race with new applications, but also that so much has been learned from Petri's and my own work that the picture of what the final result should look like is pretty clear among us.

'we can rebuild [it], faster, stronger, better...'
4) Message boards : Number crunching : Gigabyte GA-EP45-UD3P Ver 1.6 MOBO with EVGA GTX-750TI SC... (Message 1804835)
Posted 1 day ago by Profile jason_gee
My only comment is that this demonstrates a lot of things we do need to care about. Accuracy & precision are important (though different). We learned that to have an idea of whether something is working right or not, you need a reference. For the general case, inconclusives/pendings works, simply because pendings average out according to project health. So, for example, if your inconclusives-to-pendings ratio is under 5%, your results deviate from the project average by less than about 5%.
5) Message boards : Number crunching : Gigabyte GA-EP45-UD3P Ver 1.6 MOBO with EVGA GTX-750TI SC... (Message 1804830)
Posted 1 day ago by Profile jason_gee
Take your current steady state inconclusives and divide by pendings. Inconclusives / Pendings < 5% = good. >= 5% = questionable. >10% = there is a problem somewhere.

Well, looking yesterday afternoon, Out of 200 Units in queue there were 28 Pendings, (now 33), 8 Inconclusives, (now 7), and the one Error mentioned...

Yesterday, then, was 3.5% and today is 4.71%. This seems to be the new "Normal" for my system since switching to CUDA from OpenCL.


TL



Nice. You're golden dude :-D ------ > Thread closed ?
6) Message boards : Number crunching : Gigabyte GA-EP45-UD3P Ver 1.6 MOBO with EVGA GTX-750TI SC... (Message 1804827)
Posted 1 day ago by Profile jason_gee
Take your current steady state inconclusives and divide by pendings. Inconclusives / Pendings < 5% = good. >= 5% = questionable. >10% = there is a problem somewhere.
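That rule of thumb is simple enough to sketch in a few lines. A minimal Python illustration (the function name and return labels are mine, not anything the project publishes):

```python
# A sketch of the rule above: inconclusives / pendings < 5% = good,
# 5-10% = questionable, > 10% = a problem somewhere.
def host_health(inconclusives: int, pendings: int) -> str:
    """Classify a host's steady-state result health."""
    if pendings == 0:
        return "no data"
    ratio = inconclusives / pendings
    if ratio < 0.05:
        return "good"
    if ratio <= 0.10:
        return "questionable"
    return "problem somewhere"
```

For example, 2 inconclusives against 100 pendings (2%) comes out 'good', while 15 against 100 flags a problem.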
7) Message boards : Number crunching : GPU Utilization: Thermal Throttling? (Message 1804824)
Posted 1 day ago by Profile jason_gee
Wasn't all this discussed in some (credible) detail, a few weeks ago?


Yes, however as the parties involved (that I recall) well know, the issues of concurrency versus computational efficiency versus communications (memory access) complexity become exceedingly complex; especially when throwing in some 4 or 5 generations of disparate hardware, underlying OS differences, and new generations of GPU rolling out faster than drivers can be made for them.

IMO what's happened is we caught up to game development, with respect to current and predicted future needs. So expect the landscape to be no less volatile and confusing for most people than what's going on with VR, DX12 and Vulkan.
8) Message boards : Number crunching : Average Credit Decreasing? (Message 1804682)
Posted 2 days ago by Profile jason_gee
Thankfully my driver version means that I don't get those OpenCL MB tasks. :-)

It's the feedback about them as well as the wingmen aborts/timeouts on those particular tasks (and usually around about their last contact with this project after some years).

I'm looking forward to the new CUDA app/s Jason that you won't rush them out the door as those others were. ;-)

Cheers.


I agree. The temptation arose while probing/experimenting with some of the contributed performance code over the last week. Things looked very good in some respects, but were very situation-specific (so complex and fragile): something that showed the possibilities, but also exposed all the design weaknesses we largely inherited from however many years of bandaid patching.

Time to rip the bandaids off, apply medical maggots to remove the dead flesh, and build something new from whatever remains.
9) Message boards : Number crunching : GPU Utilization: Thermal Throttling? (Message 1804669)
Posted 2 days ago by Profile jason_gee
Yes. In effect any clockrate above the base clock is GPU Boost doing its job of overclocking according to a large array of driving parameters.

[Sorry for length of mental dump]

The most likely interpretation, more accurate than 'thermal throttling', would be 'reduced dynamic overclock', which could be temperature related, but also power, voltage etc. (something like 23 parameters IIRC, most hidden and stability related)

If you see things drop below the specified base clock, then you could consider that as a problem somewhere (in system or external).

I heard somewhere recently (forget where) that some people under water cooling flash their devices to disable GPU boost, though in general setting the clock lower/deeper into the stability region than GPU boost would be advised anyway. My understanding is one reason they do this is to present a more or less constant load, which simplifies fan/pump configuration, and stabilises the acoustics.

In general, the 'baseline' CUDA applications should reach 90-100% with 2-3 tasks (GPU and system dependent). Any more than that will increase switching overheads and CPU cost.

I keep my very much CPU bottlenecked Windows 7, Core2Duo, system with GTX980, in that state so as to represent something close to worst case imbalance.

Saturating the CPU threads/cores such that the GPU drivers choke in the DPC queue would most likely be visible using LatencyMon or a similar tool while running. Tuning the system, chipset drivers, BIOS, amount of OC, and possibly other system components like RAM latency, will change that balance, and potentially move any bottleneck around. In the case of baseline CUDA 8 apps on Windows Vista onwards, the latencies involved are higher than under the old XP SP2 regime, but lower than on modern Mac OS X. Multiple instances do hide those latencies somewhat, but the best case down the road will be single-instance binaries that use all the allocated hardware effectively, so as to reduce switching overhead to an absolute minimum while also hiding the latencies.

Newer techniques have been explored, pushing slowly (about as quickly as the gaming industry) to a better state. Most of the same considerations are relevant to the motivations behind DirectX 12 and Vulkan as well (with the same amount of complicated work to get to the end goal of properly heterogeneous systems + applications)
10) Message boards : Number crunching : Average Credit Decreasing? (Message 1804655)
Posted 2 days ago by Profile jason_gee
Quick side observations only.
... After 6 months of MB V8 I'm now totally convinced that Credit New is fatally screwed beyond hope (and only a mental defective could think otherwise).

That's pretty much what last year's engineering-focussed walkthroughs revealed, though in more technical words, so no argument there.

The release of an immature stock Nvidia OpenCL MB app...


Don't know much/anything about the situation there, though I can certainly say things are rough right now on the CUDA side as well (IMO). There is light at the end of the (very long) tunnel after probing with Petri's pretty device/situation-specific code, and discussing a lot of issues there. We seem to have reached agreement that the codebase has reached an impasse, with broken cross-platform support, and that supporting the wide general range of hardware+OSes is going to need new 'proper' approaches. Lots of ground-up software engineering and refinement ahead, along with integrating new tools and techniques.

In that light, I suspect the sudden v8 changes had similar effects on the described OpenCL application, probably better in some areas (like VLAR performance) and worse in others (self-scaling/adaptiveness capability and 'nerdy option entropy')

So my guess is that things probably will remain a bit of a cluster all around for some time, though at least learning the hard lessons the hard way, tends to make them stick... necessity being the mother of invention and all that.
11) Message boards : Number crunching : I've Built a Couple OSX CUDA Apps... (Message 1803750)
Posted 6 days ago by Profile jason_gee
True that when Fermi Class was the thing, I didn't pull any punches, but now that v8 and Kepler-Maxwell-Pascal is a thing, it makes sense to me to open the floodgates.

With the newer code, I regard the precision and compatibility issues as par for the course. The current volatility in the OSes (all of them) is complicating matters. Just something we have to ride through I think.
12) Message boards : Number crunching : GPU Water Cooling GTX750TI Cards (Message 1803255)
Posted 8 days ago by Profile jason_gee
The cards are very capable of overclocking easily and the fans do an OK job at cooling them, but longevity of the cards is my ultimate goal. Also it will give me a good grounding for when I build my QUAD 1080 Goliath.


My suggestion, given the rumours that GP102 based Titan might include HBM2, would be to hold the phone, unless you have so much money that it doesn't matter :D
13) Message boards : Number crunching : I've Built a Couple OSX CUDA Apps... (Message 1803248)
Posted 8 days ago by Profile jason_gee
In general (but not always), the more mature the applications, the less feedback I tend to receive. I'd attribute that to there being fewer problems, and increased user familiarity. Special exceptions do occur from time to time; for example I receive occasional emails or PMs from people that have managed to build the codebase for an unusual platform/situation (usually out of politeness, and rarely raising questions or problems), and similar from other platform test builds.

In the case of Cuda 'baseline', that familiarity + just working is just boring.

Pushing the envelope with Petri's modifications/updates will be the next task IMO, which I'm sure will generate more excitement, questions, problems, and things not yet considered. Fortunately for me I learned near infinite patience along the way from hacking on Lunatics and AK code from 2007 onwards.

With the OSes, Devices/Drivers, Languages/Apis, and project in a confused state of flux, I predict that many users will just stick with whatever the project issues. IMO probably won't start to settle down until end of year.
14) Message boards : Number crunching : GPU Water Cooling GTX750TI Cards (Message 1803111)
Posted 9 days ago by Profile jason_gee
For performance/custom work like you seem to be looking for, EK seems to me the best option. But consider that things are changing, weigh the cost over the whole lifespan, and, maybe or maybe not, you might find the best option is going to a higher model on air cooling. I don't know the answer; just saying the options are probably rough right now if you're aiming for efficiency.
15) Message boards : Number crunching : looking for accurate and simple metric for daily output (Message 1803097)
Posted 9 days ago by Profile jason_gee
{I'm starting to feel like I'm about to reinvent/demystify CreditNew! lol}


Basically yes, and feels like progress to me :D

CreditNew is a bunch of numbers. A simpler and more functional (i.e. aesthetically tolerable and actually useful) metric, would consider form and function, so your quest seems like a reasonable one to me, even if a more complex engineering challenge than you might have bargained for :)

So could an equation such as this work: 100x + 50y + 1z ?


If 3 bins proves useful/simple enough perhaps. It'll just place a given host/device/app as a 3 dimensional point (or a blob or smear if needing variance), and the overall and individual work mixes as volumes. If the axes were time, then best in class performance of host, applications and devices should drift toward the origin.

Starts to sound complicated again, but since you seem to be looking for a useful and easily interpreted representation, the balance is probably somewhere in visualisation of it (even if it ends up completely different from what I would picture/describe at the moment)

Something like that, more as a developer than an end-user, would tell me things like 'this application needs more attention on shorties' or 'this group of devices are lemons'
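The weighted sum proposed in the quote can be sketched directly. A minimal Python illustration, where the 100/50/1 weights are the ones suggested above and the bin names are mine:

```python
# Sketch of the proposed weighted three-bin score. Each bin's 'points'
# could be tasks/day or GFlops per the discussion; weights 100/50/1 are
# the ones floated in the quote, bin names are illustrative only.
WEIGHTS = {"long": 100, "mid": 50, "shorty": 1}

def bin_score(points: dict) -> float:
    """Collapse per-bin 'points' into a single score (100x + 50y + 1z)."""
    return sum(weight * points.get(name, 0.0) for name, weight in WEIGHTS.items())
```

So a host scoring 1/2/3 points in the long/mid/shorty bins would come out at 100 + 100 + 3 = 203.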
16) Message boards : Number crunching : looking for accurate and simple metric for daily output (Message 1803077)
Posted 9 days ago by Profile jason_gee
You did specify 'Simple' in the title :) (as well as accurate)

For the simplicity part:
As angle range and telescope are complex functions affecting the number of operations, that will rule out a single score (the different devices+apps perform differently depending on AR).

How about reducing to say 3 'scores':
'Long task' (e.g. VLAR and Guppies) performance: xxxx 'points'
'mid task' performance: yyyy 'points'
'shorty performance': zzzz 'points'

where 'points' can come from whichever 'most accurate' metrics you find, absolute or relative (most likely GFlops, but tasks/day might work with only 3 boxes to worry about)

For the accuracy part:
Well to me that's the functional part that makes it useful for predicting some new device/app that hasn't accumulated enough data yet. Depends on what you want to use the figures for (e.g. comparing devices or applications).

In that case the 3 bins approach, taken for simplicity, might or might not be stable/accurate enough. The graphs in the other thread showing tasks/day with variance seem pretty intuitive to me. Perhaps 3 bins per device each with variance ?

If using averages, you might need several weeks to see a given device's data stabilise. Try the median instead. If the median comes out very different from the average, then you automatically know the data is skewed (e.g. the device is being used by the user sporadically, or by other projects/apps).

Total performance score then could be the sum +/- the sum of the variances.

The problem here for me (over a long time) has been that the work keeps changing, so a synthetic bench might be needed to reflect the kinds of work, and changes in the work mix.
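The median-vs-average check above is easy to demonstrate. This is an illustrative sketch (the function name and the 10% tolerance are arbitrary choices of mine):

```python
# When the mean and median of a device's daily output disagree noticeably,
# the sample is skewed (sporadic use, competing apps) and plain averages
# will mislead; the median stays robust to the outliers.
from statistics import mean, median

def is_skewed(samples: list, tolerance: float = 0.10) -> bool:
    """True when mean and median differ by more than `tolerance`,
    relative to the median."""
    m, md = mean(samples), median(samples)
    return abs(m - md) > tolerance * max(abs(md), 1e-9)
```

A host that crunched 10 tasks/day all week but 100 on one day trips the check; steady output does not.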
17) Message boards : Number crunching : GPU FLOPS: Theory vs Reality (Message 1802968)
Posted 10 days ago by Profile jason_gee
I'm surprised but happy to see Fermi class (4x0/5x0) hanging in there, considering NVIDIA may be deprecating support for them after CUDA 8. It would seem to confirm my suspicion that it may be too early for us to consider leaving them behind, so some inventive means might have to be adopted with integration of the new code, so as to avoid losing them.
18) Message boards : Number crunching : GPU FLOPS: Theory vs Reality (Message 1802930)
Posted 10 days ago by Profile jason_gee

a) The guppi WUs do not have more work in them; they just happen to have a low AR that makes the current software not parallelize the pulse find calculations. I've fixed that.
b) Cuda streams can be used to utilize the GPU more efficiently
c) Memory access pattern and cache utilization can be improved
d) Instruction level parallelism can be increased
e) The autocorrelation can use the nvidia R2C fft implementation more efficiently than the current C2C fft.


That's pretty impressive -- and looking at your dump your 1080 isn't quite saturated yet. It would be fantastic to get some of those optimizations integrated back into the main release.


Stock integration will happen. More slowly than the 3rd-party test and final variants, because stock distribution has quite a few other considerations (like the small example of cooking poorly maintained systems, among other issues). From what I can see we're pretty close to 'advanced user' wide testing, depending on how much trouble the Windows + Mac builds give over the next few days (presuming Linux is fairly straightforward). There are other general issues to solve not specifically related to Petri's massive contribution, but those will likely have to come out of the woodwork on their own, since reliability is up (at least on the Linux variant so far).
19) Message boards : Number crunching : move daily stats updates to before Tuesday's maintenance? (Message 1802907)
Posted 10 days ago by Profile jason_gee
Not sure, but think scanning/exporting is part of the maintenance cycle.
20) Message boards : Number crunching : looking for accurate and simple metric for daily output (Message 1802902)
Posted 10 days ago by Profile jason_gee
Stepped on an ant nest :D Skip to end for <short_version>. (Sorry for the length)

There has been discussion/research with respect to the shortcomings of the scheduling/estimate/credit mechanisms for direct comparison/control. For multibeam specifically, the needed source data are the elapsed times, theoretical peak flops, the unscaled fpops estimate already in the tasks (which can be derived from AR + task type anyway), and optionally CPU fraction and number of instances (which is less available, but could be inferred with better prediction)

The closest 'useful' metric is the APR (once settled), which connects to the majority of those parameters.
There are some problems with APR, because averages are known to be sensitive to disturbance and slow to respond to change (specifically for estimates/control)

Using the existing crude APR, and ensuring things have 'settled', taking a median value over time will be more stable/predictable. A better implementation option would be a running median of the source data, and a Kalman filter (linear or extended) would be provably optimal instead, tunable to the desired response time (and giving a useful covariance matrix for enhancing predictions or estimates for new platforms/applications/hosts/hardware.)

Let's call the choice of APR, median-filtered APR, running median, or tuned Kalman estimate just 'PR' for 'processing rate', which gives the estimated GFlops. The original unscaled #ops for multibeam is roughly +/- 10% from the actual compute operations, which IMO is useful enough for estimation/scheduling and comparison, and a lot more stable than Credit/RAC.

APR/theoretical_peak_flops gives 'compute efficiency', which is more useful on the development/optimisation side, and is mislabelled 'pfc_scale' in its current unstable implementation (unstable in engineering and mathematical terms).

It's on this total 'compute efficiency' that the current CUDA and OpenCL implementations trade blows between lower efficiency with more instances, vs fewer or single instance, in terms of total raw compute throughput ---> mostly complicated by CPU demand, limits on applicable hardware, and new breeds of hardware and application techniques rolling onstage.

Accumulated feedback from the mass scale then can be fed back to refine estimate quality (which GPS localisation does, via sensor fusion, in mobile devices, so nothing new/special)

<short_version>
APR refined, i.e. [Filter PR] GFlops, would be the most useful, provided all the caveats with GFlops are considered, along with well chosen indicators of quality of that estimate. (e.g. variance)
</short_version>
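For the curious, the two estimators named above can be sketched in a few lines. This is a toy illustration of a running median and a scalar (random-walk model) Kalman filter over processing-rate samples, not SETI@home's actual CreditNew internals; function names and tuning constants are mine:

```python
# Toy estimators for a stream of per-task processing-rate samples (GFlops).
from statistics import median

def running_median(samples, window=11):
    """Median of the last `window` samples: robust to outliers,
    slower to respond than a raw average of everything so far."""
    return [median(samples[max(0, i - window + 1):i + 1])
            for i in range(len(samples))]

def kalman_1d(samples, q=0.01, r=1.0, x0=0.0, p0=100.0):
    """Scalar Kalman filter with a random-walk state model. Tunable via
    q (process noise) and r (measurement noise); also yields the variance
    p as a quality indicator for each estimate."""
    x, p, out = x0, p0, []
    for z in samples:
        p += q                # predict: the true rate drifts like a random walk
        k = p / (p + r)       # Kalman gain
        x += k * (z - x)      # update the estimate with measurement z
        p *= (1 - k)          # posterior variance shrinks with each update
        out.append((x, p))
    return out
```

Fed a steady stream of samples, the Kalman estimate converges on the true rate while its variance drops, which is exactly the (estimate, quality-of-estimate) pair suggested in the short version above.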



Copyright © 2016 University of California