GPU FLOPS: Theory vs Reality

Profile M_M
Joined: 20 May 04
Posts: 76
Credit: 45,752,966
RAC: 8
Serbia
Message 1812956 - Posted: 27 Aug 2016, 13:11:39 UTC - in response to Message 1812955.  

I can only hope that this means that the modern GPUs are merely under-utilized.


From the discussion in this and other threads, I would say you are right... It seems that the current applications cannot fully utilize modern GPUs. We have seen that Petri33's custom Linux binary is about 2.5-3x more efficient than SoG, so room for improvement obviously exists (especially for new and high-end GPUs).

Some patience is needed, but I have no doubt that it is just a matter of time before new optimized applications become available...
ID: 1812956
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1812961 - Posted: 27 Aug 2016, 13:29:54 UTC - in response to Message 1812955.  

I can only hope that this means that the modern GPUs are merely under-utilized.


For the current type of algorithms we use, assuming you're working from the fpops estimates and the marketing (idealised) peak FLOPS figures, you can pretty much assume an upper bound of ~30% compute efficiency might be attainable (if power/heat and other factors were managed, and precisely controlled conditions were contrived).

The missing (theory - reality)/theory percentage of >95% is largely 'communications complexity', or 'shovelling data around', for which no cobblestones are awarded.

Unfortunately the computational density of our processing (mostly Fourier analysis) isn't super high relative to the dataset size, though some interesting possibilities to improve that significantly may come in a year or so, with more compute horsepower on tap than memory bandwidth.
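
As a back-of-the-envelope illustration of that gap (a sketch in Python with made-up numbers; real values vary by card, application and workunit type):

# Illustrative only: how the efficiency and 'missing' fractions relate.
peak_gflops = 5000.0      # marketing (idealised) peak for some hypothetical card
measured_gflops = 200.0   # sustained useful fpops on this workload

efficiency = measured_gflops / peak_gflops                 # 0.04 -> 4%
missing = (peak_gflops - measured_gflops) / peak_gflops    # (theory - reality)/theory

print(f"compute efficiency: {efficiency:.1%}")   # 4.0%
print(f"missing fraction:   {missing:.1%}")      # 96.0%, mostly data movement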
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1812961
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1812962 - Posted: 27 Aug 2016, 13:42:12 UTC - in response to Message 1812955.  


I can only hope that this means that the modern GPUs are merely under-utilized.

And they are, because your data is perhaps based on stock-running hosts, and the defaults are mostly oriented towards being as lag-free as possible on low-end ones.
Some scaling is in place, but it's obviously not aggressive enough to meet high-end GPU demands.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1812962
Profile Shaggie76
Joined: 9 Oct 09
Posts: 282
Credit: 271,858,118
RAC: 196
Canada
Message 1812986 - Posted: 27 Aug 2016, 15:04:45 UTC - in response to Message 1812962.  


I can only hope that this means that the modern GPUs are merely under-utilized.

And they are, because your data is perhaps based on stock-running hosts, and the defaults are mostly oriented towards being as lag-free as possible on low-end ones.
Some scaling is in place, but it's obviously not aggressive enough to meet high-end GPU demands.


I was thinking about that, and based on my own experience tuning the command line for stock, I suspect that the way I winsorize the mean might be picking up more people who have tuned their command lines than people running just the defaults.

My 980 Ti was running below average until I fiddled with the command lines; now, with aggressive settings and a core reserved, I'm closer to the mark.

After last weekend's Arecibo party I noticed my throughput go up about 20% -- so I also wonder if winsorizing the upper 25% is selecting for people who, randomly or intentionally, aren't getting many guppies.

So I guess what I'm saying is that the way I aggregate the data probably gives an overly optimistic measurement. It's also consistent with my own optimization efforts, so I'm not sure that it's just because people run stock with default settings.
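
For anyone curious about the difference between winsorizing and plain trimming, here is a minimal sketch using SciPy's helpers (hypothetical samples, not my actual script):

import numpy as np
from scipy import stats
from scipy.stats import mstats

# Hypothetical credit/hour samples for one card, with an outlier at each end.
samples = np.array([12.0, 95, 100, 102, 104, 107, 110, 115, 480])

print(samples.mean())                 # plain mean, dragged around by the outliers
print(stats.trim_mean(samples, 0.2))  # trimmed: drop top/bottom 20%, average the middle 60%
print(mstats.winsorize(samples, limits=(0.2, 0.2)).mean())  # winsorized: clamp the extremes instead of dropping them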
ID: 1812986
Profile -= Vyper =-
Volunteer tester
Joined: 5 Sep 99
Posts: 1652
Credit: 1,065,191,981
RAC: 2,537
Sweden
Message 1813392 - Posted: 29 Aug 2016, 9:26:22 UTC

Regarding the 750 Ti, this is a snippet of nvidia-smi output on my quad 750 Ti host (they're four Gigabyte Black Edition cards, factory-tested for a week in a 24/7 environment):

nvidia-smi
Mon Aug 29 11:25:14 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.35                 Driver Version: 367.35                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 750 Ti  Off  | 0000:01:00.0    Off  |                  N/A |
| 39%   57C    P0    23W /  46W |  1016MiB / 1998MiB   |     100%     Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 750 Ti  Off  | 0000:02:00.0    Off  |                  N/A |
| 40%   58C    P0    27W /  46W |  1016MiB / 2000MiB   |      99%     Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 750 Ti  Off  | 0000:04:00.0    Off  |                  N/A |
| 38%   53C    P0    25W /  46W |  1016MiB / 2000MiB   |     100%     Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 750 Ti  Off  | 0000:05:00.0    Off  |                  N/A |
| 37%   51C    P0    24W /  46W |  1016MiB / 2000MiB   |      98%     Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     31331     C  ...tiathome_x41zc_x86_64-pc-linux-gnu_cuda65  1014MiB |
|    1     31337     C  ...tiathome_x41zc_x86_64-pc-linux-gnu_cuda65  1014MiB |
|    2     31322     C  ...tiathome_x41zc_x86_64-pc-linux-gnu_cuda65  1014MiB |
|    3     31342     C  ...tiathome_x41zc_x86_64-pc-linux-gnu_cuda65  1014MiB |
+-----------------------------------------------------------------------------+

_________________________________________________________________________
Addicted to SETI crunching!
Founder of GPU Users Group
ID: 1813392
Profile Shaggie76
Joined: 9 Oct 09
Posts: 282
Credit: 271,858,118
RAC: 196
Canada
Message 1816238 - Posted: 10 Sep 2016, 23:31:24 UTC
Last modified: 11 Sep 2016, 0:03:14 UTC

I ran another scan this morning and was farting around trying to better visualize the data, and I think I'm about ready to admit defeat; what I want is a scatter plot like this, except I can't figure out how to get the y-axis labels to be the names of the cards (I'm using Office 365).



I've read that you can somehow do a bar graph with no bars, but I can't quite figure that out either (plus whenever I try, I end up with a bunch of tiny bars for each card all stacked together).

One thing this does seem to show pretty well is that the outliers are quite unusual -- the data is well clustered, and I'm guessing it spans the gulf between regular Arecibo work and Guppy VLARs. It's been about 20 years since I took a stats class, but I'll see if I can find a better way to analyze it so I can hand the simplified results to Excel without a struggle.
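
(For comparison, outside Excel this kind of chart is straightforward in matplotlib; a sketch with made-up numbers, putting the card names on the y-axis:)

import matplotlib.pyplot as plt

# Hypothetical per-card credit/hour samples.
data = {
    "GTX 750 Ti": [55, 60, 62, 70],
    "GTX 980 Ti": [400, 450, 520, 575],
    "GTX 1070":   [350, 420, 450, 500],
}

fig, ax = plt.subplots()
for row, (card, samples) in enumerate(data.items()):
    ax.scatter(samples, [row] * len(samples))   # one horizontal strip per card
ax.set_yticks(range(len(data)))
ax.set_yticklabels(list(data.keys()))           # card names as the y-axis labels
ax.set_xlabel("credit/hour")
plt.tight_layout()
plt.show()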

Also exciting: someone's been using a Pascal Titan but there isn't enough data yet to qualify.
ID: 1816238
Profile jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 1816240 - Posted: 10 Sep 2016, 23:41:18 UTC - in response to Message 1816238.  

You can legitimately filter off the top 5% and lower 5%, leaving the middle 90%.

If you normalise each line to its release MSRP, and it then becomes more or less a vertical bar, you might have found their pricing strategy.
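
A minimal sketch of that normalisation, assuming median credit/hour per card has already been computed (the MSRPs below are illustrative placeholders, not checked launch prices):

# Credit/hour per launch dollar; all figures hypothetical.
median_cr_per_hr = {"GTX 750 Ti": 60, "GTX 1060": 330, "GTX 1080": 420}
msrp_usd = {"GTX 750 Ti": 149, "GTX 1060": 249, "GTX 1080": 599}

for card, cr in median_cr_per_hr.items():
    print(f"{card}: {cr / msrp_usd[card]:.2f} cr/hr per USD")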
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 1816240
Profile Shaggie76
Joined: 9 Oct 09
Posts: 282
Credit: 271,858,118
RAC: 196
Canada
Message 1816250 - Posted: 11 Sep 2016, 0:27:47 UTC - in response to Message 1816240.  

You can legitimately filter off the top 5% and lower 5%, leaving the middle 90%.

The middle 90% had a lot of noise, even the middle 80% -- you only start to get clarity with the middle 60%. I find the min/max of the cluster really interesting -- I'll go back down the Excel hole to see if I can format it nicely once I massage the data into min/max/median first.
ID: 1816250
Profile Shaggie76
Joined: 9 Oct 09
Posts: 282
Credit: 271,858,118
RAC: 196
Canada
Message 1816265 - Posted: 11 Sep 2016, 1:15:39 UTC

Winsorized middle 60%:



I really like how clearly this shows the spread between work-unit types -- I think it'll be much more useful for comparing with local experiments.
ID: 1816265
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1816267 - Posted: 11 Sep 2016, 1:24:31 UTC - in response to Message 1816265.  

For overall credit/hr the GTX 1060 is doing better than I thought it would: on par with (even slightly above) the old Titan X, which puts it on par with the GTX 750 Ti for work done per watt-hour. Very impressive result IMHO.
Grant
Darwin NT
ID: 1816267
Profile Shaggie76
Joined: 9 Oct 09
Posts: 282
Credit: 271,858,118
RAC: 196
Canada
Message 1816272 - Posted: 11 Sep 2016, 2:01:00 UTC - in response to Message 1816267.  

For overall credit/hr the GTX 1060 is doing better than I thought it would: on par with (even slightly above) the old Titan X, which puts it on par with the GTX 750 Ti for work done per watt-hour. Very impressive result IMHO.

I have some reservations about extrapolating these results too far: I have a box with 3 1070's and a box with 2 980 Ti's. For the last few weeks I've been running both boxes with 2 work-units per GPU, and the 980 Ti's are running considerably faster than the 1070's (575 cr/hr for the 980 Ti's vs. 450 for the 1070's); both are getting better throughput than the average for single tasks, but I'm surprised to see the 980 Ti's that far ahead (*). I suspect that if I could get a proper set of data for two tasks/GPU the picture would be disproportionately improved for the high-end cards with more CUs.

(*) It's not a perfectly controlled experiment, I'll admit, because the CPUs aren't the same and the PCIe bus won't be running as fast in the triple setup, but I doubt that matters for SoG. They are running with the same command line, and both have a core reserved per GPU, so I wouldn't expect the results to be that far skewed. (I'm now testing a full core per task on the triple-1070 machine to see if that helps, but I'm skeptical.)

It's also tricky because on the high-end cards I've tested, running one WU only uses a fraction of the TDP; so conversely a 1070 might get a single WU done faster than a 1060 for the same average power, but it's hard to say.

As much as I enjoy the fast Arecibo credit, I'd much prefer it if we could focus on GBT data, because it would make analyzing configurations and performance so much easier.
ID: 1816272
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 1816274 - Posted: 11 Sep 2016, 2:05:52 UTC - in response to Message 1816272.  

I have some reservations about extrapolating these results too far


"Interpolation good, extrapolation BAD" -- I can still remember that quote from my statistics professor.
ID: 1816274
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1816302 - Posted: 11 Sep 2016, 5:34:59 UTC - in response to Message 1816272.  
Last modified: 11 Sep 2016, 5:38:00 UTC

(*) It's not a perfectly controlled experiment, I'll admit, because the CPUs aren't the same and the PCIe bus won't be running as fast in the triple setup, but I doubt that matters for SoG.

Actually it may, if there is insufficient bandwidth available.
When running CUDA the highest Bus Load I saw was about 2% peak. Most of the time it was 0, whether running 1 WU at a time or 3.
With SoG, even running 1 WU at a time with the default settings, the Bus Load was significant (around 10-13% from memory). With command-line settings that boost performance further, the minimum Bus Load goes way up.
For Arecibo work on my i7 system with a PCIe 2.0 x8 link, the load is generally around 20%, with spikes to 25%. That's with 1 CPU core per WU.
On my C2D with PCIe 1.1 x4, the Bus Load is generally around 33% (a lower average with Guppies).


They are running with the same command line, and both have a core reserved per GPU, so I wouldn't expect the results to be that far skewed. (I'm now testing a full core per task on the triple-1070 machine to see if that helps, but I'm skeptical.)

Generally I've found 1 core per WU gives the best results, particularly with aggressive command-line settings. The more work the GPU does, the more calls it makes on the CPU application in a shorter period of time. Reduce the GPU's wait on the CPU and its performance goes up accordingly.


It's also tricky because on the high-end cards I've tested, running one WU only uses a fraction of the TDP; so conversely a 1070 might get a single WU done faster than a 1060 for the same average power, but it's hard to say.

On my cards, running 1 WU at a time, the GPU Load (a poor indicator, I know), Memory Controller Load, Bus Load and Power Consumption levels are all way up.
On Arecibo WUs my i7's GTX 750 Ti can hit peaks of 90% Power Consumption; generally it's around 80% (monitor connected).
On my C2D it's around 70% power load (no monitor connected).
EDIT: and those cards have different core voltages & clock speeds.
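
For reference, the utilisation and power figures above can be logged on NVIDIA cards with standard nvidia-smi query options (Bus Load itself is a GPU-Z counter on Windows; on Linux, nvidia-smi dmon can report PCIe throughput), e.g.:

nvidia-smi --query-gpu=utilization.gpu,utilization.memory,power.draw,power.limit --format=csv -l 5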
Grant
Darwin NT
ID: 1816302
Profile M_M
Joined: 20 May 04
Posts: 76
Credit: 45,752,966
RAC: 8
Serbia
Message 1816323 - Posted: 11 Sep 2016, 7:12:37 UTC
Last modified: 11 Sep 2016, 7:42:04 UTC

Another observation: in raw processing power (cores, but also NVIDIA's declared TFLOPS) the GTX 1060 is basically half of a GTX 1080, yet here it achieves around 80% of the 1080's processing speed. In games it achieves on average just 60-65% at most, meaning games find it easier to take advantage of high-end GPUs.

Also, Cr/Wh as calculated and presented here is a rough picture, since we have seen that actual TDP usage differs from card to card. The GTX 750 Ti often averages above 80% TDP usage, while, for example, the GTX 1080 with the current application stays below 65% of its TDP, regardless of CPU and the number of GPU instances.
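
A quick sketch of that correction: scale the TDP label by the measured average draw before computing Cr/Wh (the 60 W and 180 W TDPs are the official figures, the fractions are the ones mentioned above, and the credit rates are made up):

def cr_per_wh(credit_per_hour, tdp_watts, avg_power_fraction=1.0):
    # Energy per hour = TDP scaled by the fraction of TDP actually drawn.
    return credit_per_hour / (tdp_watts * avg_power_fraction)

print(cr_per_wh(60, 60))           # GTX 750 Ti, naively assuming full TDP
print(cr_per_wh(60, 60, 0.80))     # GTX 750 Ti at its ~80% average draw
print(cr_per_wh(420, 180))         # GTX 1080, naively assuming full TDP
print(cr_per_wh(420, 180, 0.65))   # GTX 1080 at its ~65% average draw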
ID: 1816323
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1816335 - Posted: 11 Sep 2016, 8:41:28 UTC - in response to Message 1816323.  
Last modified: 11 Sep 2016, 8:41:45 UTC

Games usually render the same rectangular (and quite big) area, so there is plenty of work for the GPU in each kernel.
With GPGPU in general, and SETI in particular, the situation is much harder. We don't need to transform each pixel into another pixel; we need to transform an array (or matrix) into a single number or a few numbers (the search for a signal can be represented in such terms). That is, a reduction operation. And in a reduction, otherwise-separate threads/workitems have to interact with each other.
In the current SoG, some of these reductions are implemented as a task enqueue (vs. a kernel enqueue), that is, a single workitem. Others require a single block (usually no more than 256 workitems, and it can't be bigger than the maximum allowed threads per CU). All of this definitely carries a performance hit for multi-CU devices.
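
A conceptual sketch of that two-stage reduction, written in Python for readability (the real SoG code is OpenCL; the "workgroups" here are just array slices):

import numpy as np

def workgroup_reduce(data, group_size=256):
    # Stage 1: each "workgroup" of up to 256 workitems produces one
    # partial result -- this stage parallelises well across CUs.
    partials = [data[i:i + group_size].max()
                for i in range(0, len(data), group_size)]
    # Stage 2: combine the partials -- a tiny amount of work that would
    # leave most of a multi-CU device idle.
    return max(partials)

powers = np.random.rand(1024 * 1024)        # e.g. spectral powers from an FFT
assert workgroup_reduce(powers) == powers.max()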
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1816335
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 1816336 - Posted: 11 Sep 2016, 8:41:57 UTC - in response to Message 1816323.  

... Also, Cr/Wh as calculated and presented here is a rough picture, since we have seen that actual TDP usage differs from card to card. The GTX 750 Ti often averages above 80% TDP usage, while, for example, the GTX 1080 with the current application stays below 65% of its TDP, regardless of CPU and the number of GPU instances.

I can confirm high TDP usage on 750 Ti's; I've seen mine over 100% at times, mid-90s often. My 980s seem to hover around 75% max. If it matters, that's with 3 WUs per card.
ID: 1816336
Profile Shaggie76
Joined: 9 Oct 09
Posts: 282
Credit: 271,858,118
RAC: 196
Canada
Message 1818083 - Posted: 19 Sep 2016, 0:15:47 UTC

I've been rewriting my analysis scripts this weekend, and one of the things I now have data for is the difference between Arecibo and Green Bank data processing rates on GPUs.

As people have long complained, credit/hour for GUPPIs is substantially lower for GPUs across the board:

ID: 1818083
Profile Shaggie76
Joined: 9 Oct 09
Posts: 282
Credit: 271,858,118
RAC: 196
Canada
Message 1818100 - Posted: 19 Sep 2016, 1:14:53 UTC

Another thing the new scripts allow me to do is get some data that Jason asked for: approximately how much CPU usage there is for tasks by API:

ID: 1818100
Profile Jimbocous Project Donor
Volunteer tester
Joined: 1 Apr 13
Posts: 1853
Credit: 268,616,081
RAC: 1,349
United States
Message 1818102 - Posted: 19 Sep 2016, 1:29:20 UTC - in response to Message 1818100.  

Another thing the new scripts allow me to do is get some data that Jason asked for: approximately how much CPU usage there is for tasks by API:

Some curious gaps in there. For example, no CUDA data for GTX 980s? My three are running nothing else ... or am I misunderstanding?
ID: 1818102
Profile Shaggie76
Joined: 9 Oct 09
Posts: 282
Credit: 271,858,118
RAC: 196
Canada
Message 1818105 - Posted: 19 Sep 2016, 1:33:49 UTC - in response to Message 1818102.  

Another thing the new scripts allow me to do is get some data that Jason asked for: approximately how much CPU usage there is for tasks by API:

Some curious gaps in there. For example, no CUDA data for GTX 980s? My three are running nothing else ... or am I misunderstanding?

This is only stock apps, which seem to ask for CUDA only on rare occasions; there weren't enough hosts with enough CUDA tasks for those cards in my scan. One of the nice things about the new stuff I'm working on is that I'll be able to incrementally build my performance database from week to week, so I'll get a more complete picture later.
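
A minimal sketch of that kind of week-to-week accumulation, assuming a flat CSV with hypothetical host_id/task_id columns (not Shaggie76's actual schema):

import pandas as pd

def merge_scan(db_path, scan_path):
    try:
        db = pd.read_csv(db_path)
    except FileNotFoundError:
        db = pd.DataFrame()   # first week: start with an empty database
    scan = pd.read_csv(scan_path)
    merged = pd.concat([db, scan], ignore_index=True)
    # Keep one row per (host, task) so re-scanned tasks aren't double-counted.
    return merged.drop_duplicates(subset=["host_id", "task_id"])

merge_scan("gpu_perf.csv", "scan_latest.csv").to_csv("gpu_perf.csv", index=False)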
ID: 1818105