GPU FLOPS: Theory vs Reality

Profile M_M
Joined: 20 May 04
Posts: 76
Credit: 45,752,966
RAC: 8
Serbia
Message 1812956 - Posted: 27 Aug 2016, 13:11:39 UTC - in response to Message 1812955.  

I can only hope that this means that the modern GPUs are merely under-utilized.


From the discussion in this and other threads, I would say you are right... It seems that the current applications cannot fully utilize modern GPUs. We have seen that Petri33's custom Linux binary is about 2.5-3x more efficient than SoG, so room for improvement obviously exists (especially for new and high-end GPUs).

Some patience is needed, but I have no doubt that it is just a matter of time before new optimized applications become available...
ID: 1812956
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1812961 - Posted: 27 Aug 2016, 13:29:54 UTC - in response to Message 1812955.  

I can only hope that this means that the modern GPUs are merely under-utilized.


For the current type of algorithms we use, assuming you're working from the fpops estimates and the marketing (idealised) peak FLOPS figures, you can pretty much assume an upper bound of ~30% compute efficiency might be attainable (if power/heat and other factors were managed, and precisely controlled conditions were contrived).

The missing (theory - reality)/theory percentage of >95% is largely 'communications complexity', or 'shovelling data around', for which no cobblestones are awarded.

Unfortunately the computational density of our processing (mostly Fourier analysis) isn't super high relative to the dataset size, though some interesting possibilities to improve that significantly may come in a year or so, with more compute horsepower on tap than memory bandwidth.
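
As a back-of-the-envelope illustration of that gap (a sketch in Python with made-up numbers; real values vary by card, application and workunit type):

# Illustrative only: how the efficiency and 'missing' fractions relate.
peak_gflops = 5000.0      # marketing (idealised) peak for some hypothetical card
measured_gflops = 200.0   # sustained useful fpops on this workload

efficiency = measured_gflops / peak_gflops                 # 0.04 -> 4%
missing = (peak_gflops - measured_gflops) / peak_gflops    # (theory - reality)/theory

print(f"compute efficiency: {efficiency:.1%}")   # 4.0%
print(f"missing fraction:   {missing:.1%}")      # 96.0%, mostly data movement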
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1812961
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1812962 - Posted: 27 Aug 2016, 13:42:12 UTC - in response to Message 1812955.  


I can only hope that this means that the modern GPUs are merely under-utilized.

And they are, because your data is perhaps based on stock-running hosts, and the defaults are mostly oriented towards being as lag-free as possible on low-end ones.
Some scaling is in place, but it's obviously not aggressive enough to meet high-end GPU demands.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1812962
Profile Shaggie76
Joined: 9 Oct 09
Posts: 282
Credit: 271,858,118
RAC: 196
Canada
Message 1812986 - Posted: 27 Aug 2016, 15:04:45 UTC - in response to Message 1812962.  


I can only hope that this means that the modern GPUs are merely under-utilized.

And they are, because your data is perhaps based on stock-running hosts, and the defaults are mostly oriented towards being as lag-free as possible on low-end ones.
Some scaling is in place, but it's obviously not aggressive enough to meet high-end GPU demands.


I was thinking about that, and based on my own experience tuning the command line for stock, I suspect that the way I winsorize the mean might be picking up more people who have tuned their command lines than people running just the defaults.

My 980 Ti was running below average until I fiddled with the command lines; now, with aggressive settings and a core reserved, I'm closer to the mark.

After last weekend's Arecibo party I noticed my throughput go up about 20% -- so I also wonder if winsorizing the upper 25% is selecting for people who, randomly or intentionally, aren't getting many guppies.

So I guess what I'm saying is that the way I aggregate the data probably gives an overly optimistic measurement. It's also consistent with my own optimization efforts, so I'm not sure that it's just because people run stock with default settings.
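
For anyone curious about the difference between winsorizing and plain trimming, here is a minimal sketch using SciPy's helpers (hypothetical samples, not my actual script):

import numpy as np
from scipy import stats
from scipy.stats import mstats

# Hypothetical credit/hour samples for one card, with an outlier at each end.
samples = np.array([12.0, 95, 100, 102, 104, 107, 110, 115, 480])

print(samples.mean())                 # plain mean, dragged around by the outliers
print(stats.trim_mean(samples, 0.2))  # trimmed: drop top/bottom 20%, average the middle 60%
print(mstats.winsorize(samples, limits=(0.2, 0.2)).mean())  # winsorized: clamp the extremes instead of dropping them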
ID: 1812986
Profile -= Vyper =-
Volunteer tester
Joined: 5 Sep 99
Posts: 1652
Credit: 1,065,191,981
RAC: 2,537
Sweden
Message 1813392 - Posted: 29 Aug 2016, 9:26:22 UTC

Regarding the 750 Ti, this is a snippet of nvidia-smi output on my quad 750 Ti host (they're four Gigabyte Black Edition cards, factory-tested for a week in a 24/7 environment):

nvidia-smi
Mon Aug 29 11:25:14 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.35                 Driver Version: 367.35                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 750 Ti  Off  | 0000:01:00.0    Off  |                  N/A |
| 39%   57C    P0    23W /  46W |  1016MiB / 1998MiB   |     100%     Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 750 Ti  Off  | 0000:02:00.0    Off  |                  N/A |
| 40%   58C    P0    27W /  46W |  1016MiB / 2000MiB   |      99%     Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 750 Ti  Off  | 0000:04:00.0    Off  |                  N/A |
| 38%   53C    P0    25W /  46W |  1016MiB / 2000MiB   |     100%     Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 750 Ti  Off  | 0000:05:00.0    Off  |                  N/A |
| 37%   51C    P0    24W /  46W |  1016MiB / 2000MiB   |      98%     Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     31331     C  ...tiathome_x41zc_x86_64-pc-linux-gnu_cuda65  1014MiB |
|    1     31337     C  ...tiathome_x41zc_x86_64-pc-linux-gnu_cuda65  1014MiB |
|    2     31322     C  ...tiathome_x41zc_x86_64-pc-linux-gnu_cuda65  1014MiB |
|    3     31342     C  ...tiathome_x41zc_x86_64-pc-linux-gnu_cuda65  1014MiB |
+-----------------------------------------------------------------------------+

_________________________________________________________________________
Addicted to SETI crunching!
Founder of GPU Users Group
ID: 1813392
Profile Shaggie76
Joined: 9 Oct 09
Posts: 282
Credit: 271,858,118
RAC: 196
Canada
Message 1816238 - Posted: 10 Sep 2016, 23:31:24 UTC
Last modified: 11 Sep 2016, 0:03:14 UTC

I ran another scan this morning and was farting around trying to better visualize the data, and I think I'm about ready to admit defeat; what I want is a scatter plot like this, except I can't figure out how to get the y-axis labels to be the names of the cards (I'm using Office 365).



I've read that you can somehow do a bar graph with no bars, but I can't quite figure that out either (plus whenever I try, I end up with a bunch of tiny bars for each card all stacked together).

One thing this does seem to show pretty well is that the outliers are quite unusual -- the data is well clustered, and I'm guessing it spans the gulf between regular Arecibo work and Guppy VLARs. It's been about 20 years since I took a stats class, but I'll see if I can find a better way to analyze it so I can hand the simplified results to Excel without a struggle.
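
(For comparison, outside Excel this kind of chart is straightforward in matplotlib; a sketch with made-up numbers, putting the card names on the y-axis:)

import matplotlib.pyplot as plt

# Hypothetical per-card credit/hour samples.
data = {
    "GTX 750 Ti": [55, 60, 62, 70],
    "GTX 980 Ti": [400, 450, 520, 575],
    "GTX 1070":   [350, 420, 450, 500],
}

fig, ax = plt.subplots()
for row, (card, samples) in enumerate(data.items()):
    ax.scatter(samples, [row] * len(samples))   # one horizontal strip per card
ax.set_yticks(range(len(data)))
ax.set_yticklabels(list(data.keys()))           # card names as the y-axis labels
ax.set_xlabel("credit/hour")
plt.tight_layout()
plt.show()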

Also exciting: someone's been using a Pascal Titan but there isn't enough data yet to qualify.
ID: 1816238
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1816240 - Posted: 10 Sep 2016, 23:41:18 UTC - in response to Message 1816238.  

You can legitimately filter off the top 5% and lower 5%, leaving the middle 90%.

If you normalise each line to its release MSRP, and it then becomes more or less a vertical bar, you might have found their pricing strategy.
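
A minimal sketch of that normalisation, assuming median credit/hour per card has already been computed (the MSRPs below are illustrative placeholders, not checked launch prices):

# Credit/hour per launch dollar; all figures hypothetical.
median_cr_per_hr = {"GTX 750 Ti": 60, "GTX 1060": 330, "GTX 1080": 420}
msrp_usd = {"GTX 750 Ti": 149, "GTX 1060": 249, "GTX 1080": 599}

for card, cr in median_cr_per_hr.items():
    print(f"{card}: {cr / msrp_usd[card]:.2f} cr/hr per USD")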
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1816240
Profile Shaggie76
Joined: 9 Oct 09
Posts: 282
Credit: 271,858,118
RAC: 196
Canada
Message 1816250 - Posted: 11 Sep 2016, 0:27:47 UTC - in response to Message 1816240.  

You can legitimately filter off the top 5% and lower 5%, leaving the middle 90%.

The middle 90% had a lot of noise, even the middle 80% -- you only start to get clarity with the middle 60%. I find the min/max of the cluster really interesting -- I'll go back down the Excel hole to see if I can format it nicely once I massage the data into min/max/median first.
ID: 1816250
Profile Shaggie76
Joined: 9 Oct 09
Posts: 282
Credit: 271,858,118
RAC: 196
Canada
Message 1816265 - Posted: 11 Sep 2016, 1:15:39 UTC

Winsorized middle 60%:



I really like how clearly this shows the spread between work-unit types -- I think it'll be much more useful for comparing with local experiments.
ID: 1816265
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1816267 - Posted: 11 Sep 2016, 1:24:31 UTC - in response to Message 1816265.  

For overall credit/hr the GTX 1060 is doing better than I thought it would: on par with (even slightly above) the old Titan X, which puts it on par with the GTX 750 Ti for work done per watt-hour. Very impressive result IMHO.
Grant
Darwin NT
ID: 1816267
Profile Shaggie76
Joined: 9 Oct 09
Posts: 282
Credit: 271,858,118
RAC: 196
Canada
Message 1816272 - Posted: 11 Sep 2016, 2:01:00 UTC - in response to Message 1816267.  

For overall credit/hr the GTX 1060 is doing better than I thought it would: on par with (even slightly above) the old Titan X, which puts it on par with the GTX 750 Ti for work done per watt-hour. Very impressive result IMHO.

I have some reservations about extrapolating these results too far: I have a box with 3 1070's and a box with 2 980 Ti's. For the last few weeks I've been running both boxes with 2 work-units per GPU, and the 980 Ti's are running considerably faster than the 1070's (575 cr/hr for the 980 Ti's vs. 450 for the 1070's); both are getting better throughput than the average for single tasks, but I'm surprised to see the 980 Ti's that far ahead (*). I suspect that if I could get a proper set of data for two tasks/GPU the picture would be disproportionately improved for the high-end cards with more CUs.

(*) It's not a perfectly controlled experiment, I'll admit, because the CPUs aren't the same and the PCIe bus won't be running as fast in the triple setup, but I doubt that matters for SoG. They are running with the same command line, and both have a core reserved per GPU, so I wouldn't expect the results to be that far skewed. (I'm now testing a full core per task on the triple-1070 machine to see if that helps, but I'm skeptical.)

It's also tricky because on the high-end cards I've tested, running one WU only uses a fraction of the TDP; so conversely a 1070 might get a single WU done faster than a 1060 for the same average power, but it's hard to say.

As much as I enjoy the fast Arecibo credit, I'd much prefer it if we could focus on GBT data, because it would make analyzing configurations and performance so much easier.
ID: 1816272
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 1816274 - Posted: 11 Sep 2016, 2:05:52 UTC - in response to Message 1816272.  

I have some reservations about extrapolating these results too far


"Interpolation good, extrapolation BAD" -- I can still remember that quote from my statistics professor.
ID: 1816274
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1816302 - Posted: 11 Sep 2016, 5:34:59 UTC - in response to Message 1816272.  
Last modified: 11 Sep 2016, 5:38:00 UTC

(*) It's not a perfectly controlled experiment, I'll admit, because the CPUs aren't the same and the PCIe bus won't be running as fast in the triple setup, but I doubt that matters for SoG.

Actually it may, if there is insufficient bandwidth available.
When running CUDA the highest Bus Load I saw was about 2% peak. Most of the time it was 0, whether running 1 WU at a time or 3.
With SoG, even running 1 WU at a time with the default settings, the Bus Load was significant (around 10-13% from memory). With command-line settings that boost performance further, the minimum Bus Load goes way up.
For Arecibo work on my i7 system with a PCIe 2.0 x8 link, the load is generally around 20%, with spikes to 25%. That's with 1 CPU core per WU.
On my C2D with PCIe 1.1 x4, the Bus Load is generally around 33% (a lower average with Guppies).


They are running with the same command line, and both have a core reserved per GPU, so I wouldn't expect the results to be that far skewed. (I'm now testing a full core per task on the triple-1070 machine to see if that helps, but I'm skeptical.)

Generally I've found 1 core per WU gives the best results, particularly with aggressive command-line settings. The more work the GPU does, the more calls it makes on the CPU application in a shorter period of time. Reduce the GPU's wait on the CPU and its performance goes up accordingly.


It's also tricky because on the high-end cards I've tested, running one WU only uses a fraction of the TDP; so conversely a 1070 might get a single WU done faster than a 1060 for the same average power, but it's hard to say.

On my cards, running 1 WU at a time, the GPU Load (a poor indicator, I know), Memory Controller Load, Bus Load and Power Consumption levels are all way up.
On Arecibo WUs my i7's GTX 750 Ti can hit peaks of 90% Power Consumption; generally it's around 80% (monitor connected).
On my C2D it's around 70% power load (no monitor connected).
EDIT: and those cards have different core voltages & clock speeds.
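
For reference, the utilisation and power figures above can be logged on NVIDIA cards with standard nvidia-smi query options (Bus Load itself is a GPU-Z counter on Windows; on Linux, nvidia-smi dmon can report PCIe throughput), e.g.:

nvidia-smi --query-gpu=utilization.gpu,utilization.memory,power.draw,power.limit --format=csv -l 5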
Grant
Darwin NT
ID: 1816302
Profile M_M
Joined: 20 May 04
Posts: 76
Credit: 45,752,966
RAC: 8
Serbia
Message 1816323 - Posted: 11 Sep 2016, 7:12:37 UTC
Last modified: 11 Sep 2016, 7:42:04 UTC

Another observation: in raw processing power (cores, but also NVIDIA's declared TFLOPS) the GTX 1060 is basically half of a GTX 1080, yet here it achieves around 80% of the 1080's processing speed. In games it achieves on average just 60-65% at most, meaning games find it easier to take advantage of high-end GPUs.

Also, Cr/Wh as calculated and presented here is a rough picture, since we have seen that actual TDP usage differs from card to card. The GTX 750 Ti often averages above 80% TDP usage, while, for example, the GTX 1080 with the current application stays below 65% of its TDP, regardless of CPU and the number of GPU instances.
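
A quick sketch of that correction: scale the TDP label by the measured average draw before computing Cr/Wh (the 60 W and 180 W TDPs are the official figures, the fractions are the ones mentioned above, and the credit rates are made up):

def cr_per_wh(credit_per_hour, tdp_watts, avg_power_fraction=1.0):
    # Energy per hour = TDP scaled by the fraction of TDP actually drawn.
    return credit_per_hour / (tdp_watts * avg_power_fraction)

print(cr_per_wh(60, 60))           # GTX 750 Ti, naively assuming full TDP
print(cr_per_wh(60, 60, 0.80))     # GTX 750 Ti at its ~80% average draw
print(cr_per_wh(420, 180))         # GTX 1080, naively assuming full TDP
print(cr_per_wh(420, 180, 0.65))   # GTX 1080 at its ~65% average draw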
ID: 1816323
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1816335 - Posted: 11 Sep 2016, 8:41:28 UTC - in response to Message 1816323.  
Last modified: 11 Sep 2016, 8:41:45 UTC

Games usually render the same rectangular (and quite big) area, so there is plenty of work for the GPU in each kernel.
With GPGPU in general, and SETI in particular, the situation is much harder. We don't need to transform each pixel into another pixel; we need to transform an array (or matrix) into a single number or a few numbers (the search for a signal can be represented in such terms). That is, a reduction operation. And in a reduction, otherwise-separate threads/workitems have to interact with each other.
In the current SoG, some of these reductions are implemented as a task enqueue (vs. a kernel enqueue), that is, a single workitem. Others require a single block (usually no more than 256 workitems, and it can't be bigger than the maximum allowed threads per CU). All of this definitely carries a performance hit for multi-CU devices.
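
A conceptual sketch of that two-stage reduction, written in Python for readability (the real SoG code is OpenCL; the "workgroups" here are just array slices):

import numpy as np

def workgroup_reduce(data, group_size=256):
    # Stage 1: each "workgroup" of up to 256 workitems produces one
    # partial result -- this stage parallelises well across CUs.
    partials = [data[i:i + group_size].max()
                for i in range(0, len(data), group_size)]
    # Stage 2: combine the partials -- a tiny amount of work that would
    # leave most of a multi-CU device idle.
    return max(partials)

powers = np.random.rand(1024 * 1024)        # e.g. spectral powers from an FFT
assert workgroup_reduce(powers) == powers.max()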
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1816335
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 1816336 - Posted: 11 Sep 2016, 8:41:57 UTC - in response to Message 1816323.  

... Also, Cr/Wh as calculated and presented here is a rough picture, since we have seen that actual TDP usage differs from card to card. The GTX 750 Ti often averages above 80% TDP usage, while, for example, the GTX 1080 with the current application stays below 65% of its TDP, regardless of CPU and the number of GPU instances.

I can confirm high TDP usage on 750 Ti's; I've seen mine over 100% at times, mid-90s often. My 980s seem to hover around 75% max. If it matters, that's with 3 WUs per card.
ID: 1816336
Profile Shaggie76
Joined: 9 Oct 09
Posts: 282
Credit: 271,858,118
RAC: 196
Canada
Message 1818083 - Posted: 19 Sep 2016, 0:15:47 UTC

I've been rewriting my analysis scripts this weekend, and one of the things I now have data for is the difference between Arecibo and Green Bank data processing rates on GPUs.

As people have long complained, credit/hour for GUPPIs is substantially lower for GPUs across the board:

ID: 1818083
Profile Shaggie76
Joined: 9 Oct 09
Posts: 282
Credit: 271,858,118
RAC: 196
Canada
Message 1818100 - Posted: 19 Sep 2016, 1:14:53 UTC

Another thing the new scripts allow me to do is get some data that Jason asked for: approximately how much CPU usage there is for tasks by API:

ID: 1818100
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 1818102 - Posted: 19 Sep 2016, 1:29:20 UTC - in response to Message 1818100.  

Another thing the new scripts allow me to do is get some data that Jason asked for: approximately how much CPU usage there is for tasks by API:

Some curious gaps in there. For example, no CUDA data for GTX 980s? My three are running nothing else ... or am I misunderstanding?
ID: 1818102
Profile Shaggie76
Joined: 9 Oct 09
Posts: 282
Credit: 271,858,118
RAC: 196
Canada
Message 1818105 - Posted: 19 Sep 2016, 1:33:49 UTC - in response to Message 1818102.  

Another thing the new scripts allow me to do is get some data that Jason asked for: approximately how much CPU usage there is for tasks by API:

Some curious gaps in there. For example, no CUDA data for GTX 980s? My three are running nothing else ... or am I misunderstanding?

This is only stock apps, which seem to ask for CUDA only on rare occasions; there weren't enough hosts with enough CUDA tasks for those cards in my scan. One of the nice things about the new stuff I'm working on is that I'll be able to incrementally build my performance database from week to week, so I'll get a more complete picture later.
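
A minimal sketch of that kind of week-to-week accumulation, assuming a flat CSV with hypothetical host_id/task_id columns (not Shaggie76's actual schema):

import pandas as pd

def merge_scan(db_path, scan_path):
    try:
        db = pd.read_csv(db_path)
    except FileNotFoundError:
        db = pd.DataFrame()   # first week: start with an empty database
    scan = pd.read_csv(scan_path)
    merged = pd.concat([db, scan], ignore_index=True)
    # Keep one row per (host, task) so re-scanned tasks aren't double-counted.
    return merged.drop_duplicates(subset=["host_id", "task_id"])

merge_scan("gpu_perf.csv", "scan_latest.csv").to_csv("gpu_perf.csv", index=False)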
ID: 1818105