MB v8: CPU vs GPU (in terms of efficiency)

qbit
Volunteer tester
Joined: 19 Sep 04
Posts: 630
Credit: 6,868,528
RAC: 0
Austria
Message 1815733 - Posted: 8 Sep 2016, 14:41:41 UTC
Last modified: 8 Sep 2016, 14:43:58 UTC

The question about the efficiency of current GPU apps has been raised in another thread, but since it's quite off-topic there, I thought I'd make a new thread for it.

I run a (dedicated) main cruncher with a GTX 750, known to be one of the most efficient cards out there; its TDP is 55 watts. The CPU on this machine is not used for crunching; it's reserved for feeding the GPU.
http://setiathome.berkeley.edu/show_host_detail.php?hostid=7563243

From time to time I crunch a bit with one of my laptops, which has an Intel N3520, a CPU that is also known for a very good performance-per-watt ratio. Its TDP is 7.5 watts.
http://setiathome.berkeley.edu/show_host_detail.php?hostid=7433880

I run the apps from the latest Lunatics installer on the CPU and Raistmer's latest OpenCL build (r3525) on the GPU.

Seeing the task times on both machines, I was wondering which one is more efficient, so it's time to do the math. (I know that guppies are currently not handled very well by GPUs, so I'm talking about Arecibo tasks only.)

Here are 2 examples of tasks with a very similar angle range, the first one crunched on my lappy, the second one on my main machine:

http://setiathome.berkeley.edu/result.php?resultid=5143371920
WU true angle range is : 0.423457
5 hours 5 min 18 sec

http://setiathome.berkeley.edu/result.php?resultid=5141706081
WU true angle range is : 0.423120
42 min 12 sec

Now we need to take into account that my lappy runs 4 tasks at a time, while I only do 2 at a time on my GTX 750. That means we have to double the time from the GPU. So it takes my main cruncher ~84 min to do the same work my lappy does in ~305 min, which means the GPU crunches faster by a factor of 3.63. BUT the GPU uses much more power to do this; the factor here is 7.33 (55/7.5). So in the end, the CPU is more than twice (2.02x) as efficient as the GPU!
That's quite a surprise for me; AFAIR it was quite the opposite for v7. So at the moment, if efficiency is your main goal, you might think about using your CPU instead of your GPU, or building a CPU monster cruncher instead of a multi-GPU machine. (Of course, it all depends on the type of CPU/GPU.)
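For anyone who wants to redo this with their own numbers, here is a quick Python sketch of the calculation above. The figures are the ones from this post, and it simply assumes the '2 tasks vs 4 tasks' scaling I describe, i.e. that run times scale linearly with the number of concurrent tasks:

# Back-of-envelope CPU vs GPU efficiency comparison using the sample tasks above.
# Assumes run time scales linearly with the number of concurrent tasks.
cpu_time_min, cpu_tasks, cpu_tdp_w = 305, 4, 7.5   # N3520 lappy, 4 CPU tasks at once
gpu_time_min, gpu_tasks, gpu_tdp_w = 42, 2, 55.0   # GTX 750, 2 GPU tasks at once

gpu_time_scaled = gpu_time_min * (cpu_tasks / gpu_tasks)  # ~84 min for 4 tasks

speed_factor  = cpu_time_min / gpu_time_scaled  # ~3.63: GPU is this much faster
power_factor  = gpu_tdp_w / cpu_tdp_w           # ~7.33: GPU draws this much more power
cpu_advantage = power_factor / speed_factor     # ~2.02: CPU energy advantage per task

print(f"GPU {speed_factor:.2f}x faster, {power_factor:.2f}x the power, "
      f"CPU ~{cpu_advantage:.2f}x as efficient per task")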

PS: I will try to check other ARs soon.
ID: 1815733
AMDave
Volunteer tester

Joined: 9 Mar 01
Posts: 234
Credit: 11,671,730
RAC: 0
United States
Message 1815753 - Posted: 8 Sep 2016, 16:03:40 UTC - in response to Message 1815733.  

Now we need to take into account that my lappy runs 4 tasks at a time, while I only do 2 at a time on my GTX 750. That means we have to double the time from the GPU. So it takes my main cruncher ~84 min to do the same work my lappy does in ~305 min, which means the GPU crunches faster by a factor of 3.63. BUT the GPU uses much more power to do this; the factor here is 7.33 (55/7.5). So in the end, the CPU is more than twice (2.02x) as efficient as the GPU!

Your math is inaccurate. WU completion times are not linear when concurrency is increased. Assuming both machines run only GPU WUs with similar ARs (i.e. VLAR to VLAR, VHAR to VHAR, mid to mid), you can't state that doubling the completion time of 2 concurrent WUs run on machine A should equal the completion time of 4 concurrent WUs on machine B. This would be true even if both machines were identical, and here they are not.

More often than not, the gain from increasing concurrency is less than linear (< 1:1). For example, a single WU may complete in 20 min, 2 concurrent WUs may complete in 37 min, and 3 concurrent WUs may complete in 52 min. This holds up to a point, where the law of diminishing returns takes over. Because of the multitude of hardware configurations, where that point lies is different for every machine. If you read other threads, you'll see that for some with a GTX 750 Ti it is 3 WUs, while for some with a GTX 1060 it is 2 WUs.
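To put numbers on that, here is a small Python sketch using the illustrative 20/37/52 minute figures above (examples, not measurements). Throughput still grows with concurrency, but by less and less per added WU:

# Throughput vs. concurrency, using the illustrative batch times above.
examples = {1: 20, 2: 37, 3: 52}  # concurrent WUs -> minutes for the whole batch

for n, batch_min in examples.items():
    print(f"{n} concurrent: {n * 60 / batch_min:.2f} WUs/hour")

# Prints 3.00, 3.24 and 3.46 WUs/hour: each added WU buys less extra throughput,
# which is why you can't simply multiply or divide completion times by the task count.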
ID: 1815753
qbit
Volunteer tester
Joined: 19 Sep 04
Posts: 630
Credit: 6,868,528
RAC: 0
Austria
Message 1815768 - Posted: 8 Sep 2016, 18:10:04 UTC
Last modified: 8 Sep 2016, 18:12:10 UTC

Dave, I think you got me wrong there. Maybe that's my fault; I can't explain things in English as well as I can in German ;-)

I suppose with "concurrency" you mean running 4 tasks at a time on GPU instead of just 2, right? But that's not what I'm talking about.
The CPU in my lappy has 4 cores. So if I wanna use its full capacity, I can only run 4 tasks at a time. AFAIK it's not possible to use more than one core per task.
So I have 4 tasks on CPU and 2 tasks on GPU. If I wanna compare those, I have to scale. I can do this by either multiplying the times from the GPU by 2 or dividing the times from the lappy by 2.
Or, to put it another way, imagine I have two identical main crunchers with the same GTX 750. If I ran the same two tasks on each of those machines, it should take each of them exactly the same amount of time to finish (at least in theory; in practice they may differ by a few seconds).
ID: 1815768
AMDave
Volunteer tester

Joined: 9 Mar 01
Posts: 234
Credit: 11,671,730
RAC: 0
United States
Message 1815797 - Posted: 8 Sep 2016, 20:48:21 UTC - in response to Message 1815768.  

Dave, I think you got me wrong there. Maybe that's my fault; I can't explain things in English as well as I can in German ;-)

I suppose with "concurrency" you mean running 4 tasks at a time on GPU instead of just 2, right? But that's not what I'm talking about.

Yes, that's how I understood it.  Thanks for the clarification.
ID: 1815797
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1815815 - Posted: 8 Sep 2016, 23:42:33 UTC - in response to Message 1815733.  
Last modified: 9 Sep 2016, 0:03:17 UTC

I normally calculate the watt-hours (Wh) per task for my devices to rank them by their efficiency. I have done a few posts about the method I use here & also here.
Using the TDP, run time, & # of tasks per device for your CPU & GPU, I get:
Note: I rounded the task times to the nearest minute.
Also, this calculation expects the run times to be from when the specified number of tasks were running.
Device     Watts   # Tasks   Run Time (min)   Task/hr     Task/day    Wh/Day   Wh/Task
GTX 750    55      2         42               2.857142    68.571428   1320     19.25
N3520      7.5     4         305              0.786885    18.885245    180      9.53

Which would make the N3520 ~2.02 times as efficient as the GTX 750, but it completes less than a third as much work in a given day.

However, actual GPU power consumption is typically lower than the rated TDP value when processing tasks. A good rule of thumb is to use ~80%.
Many have found that their GTX 750s run in the 40-45 W range. If we figure 80% of 55 W, that does happen to be 44 W, which would give the GTX 750 a Wh/Task of 15.4 instead of 19.25 & make the N3520 only ~1.62 times as efficient.
I should also add that my GTX 750 Ti FTW would complete two 0.42 AR tasks at once in ~25 min while using ~45 W, giving it a Wh/Task of 9.38.

This type of calculation does not take into account the whole power usage of the system the device is in. To do that you would really need to use a power meter, or UPS with a power usage display, and measure each system at idle and then when processing tasks. The delta could then be used to calculate Wh/Task.

Task/hr = (60/Run Time) * # Tasks
Task/day = Task/hr * 24
Wh/Day = Watts * 24
Wh/Task = (Wh/Day)/(Task/day)
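The same formulas as a small Python helper, fed with the numbers from the table and paragraphs above (the 80% figure is the rule of thumb mentioned earlier, not a measurement):

# Wh/Task from average watts, tasks run concurrently, and the batch run time in minutes.
def wh_per_task(watts, tasks_at_once, run_time_min):
    tasks_per_hour = (60 / run_time_min) * tasks_at_once
    tasks_per_day = tasks_per_hour * 24
    wh_per_day = watts * 24
    return wh_per_day / tasks_per_day

print(wh_per_task(55, 2, 42))        # GTX 750 at full TDP       -> 19.25
print(wh_per_task(55 * 0.8, 2, 42))  # GTX 750 at ~80% TDP       -> 15.4
print(wh_per_task(7.5, 4, 305))      # N3520                     -> ~9.53
print(wh_per_task(45, 2, 25))        # GTX 750 Ti FTW, measured  -> ~9.38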
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
ID: 1815815
Kiska
Volunteer tester

Joined: 31 Mar 12
Posts: 302
Credit: 3,067,762
RAC: 0
Australia
Message 1815847 - Posted: 9 Sep 2016, 4:48:50 UTC
Last modified: 9 Sep 2016, 5:01:40 UTC

Let's take your N3520: it can do ~2.75 GFLOPS on average per core, and each core takes 1.875 W to operate, which gives us ~1.47 GFLOPS/W. The 750 does ~3 GFLOPS on average, and a 55 W TDP gives us 0.0545 GFLOPS/W...
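As a quick check of that arithmetic (assuming the 2.75 GFLOPS is per core and the 3 GFLOPS is the measured average for the whole card):

# GFLOPS per watt from the averaged (not peak) figures above.
cpu_gflops_per_core = 2.75
watts_per_core = 7.5 / 4           # 1.875 W per N3520 core
gpu_gflops, gpu_tdp_w = 3.0, 55.0  # measured average vs. rated TDP for the GTX 750

print(cpu_gflops_per_core / watts_per_core)  # ~1.47 GFLOPS/W
print(gpu_gflops / gpu_tdp_w)                # ~0.0545 GFLOPS/W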


I will link you to this message on the forum (link) and quote it here:

Actually I think it's lower. If we take a sample of 2 tasks from my GT 840M, I can see that the average is 11.26 GFLOPS or 13.35 GFLOPS, taken from the flop counter. Let's take the higher value: BOINC tells me that the device can peak at 863 GFLOPS, yet averaged over time my dGPU outputs 13.35 GFLOPS, which is about 1.6%.
5/09/2016 21:22:02 PM | | CUDA: NVIDIA GPU 0: GeForce 840M (driver version 362.00, CUDA version 8.0, compute capability 5.0, 2048MB, 1679MB available, 863 GFLOPS peak)



Task 1
Task 2


This 'test' is running 1 task at a time, according to the internal flop counter. And from the subsequent reply (link):

Last time I checked it was using as much as the TDP demanded. The GT 840M is a 35 W part (depending on which brand of laptop). Last I measured by hooking up probes (not by me, by my prof) it was hitting somewhere between 30 W and 34.8 W, so you can't get any more out of it. GPU-Z measures ~95% avg usage. Not sure how accurate that is, as I believe it measures only the first CU.

Yay for being a student at WSU
ID: 1815847
Profile M_M
Joined: 20 May 04
Posts: 76
Credit: 45,752,966
RAC: 8
Serbia
Message 1815848 - Posted: 9 Sep 2016, 4:50:56 UTC - in response to Message 1815815.  
Last modified: 9 Sep 2016, 5:16:27 UTC

Just to mention: if efficiency is a primary concern, undervolting and underclocking your GPU can significantly boost its power efficiency. For example, if you underclock your GPU by just 10% (and undervolt by another 10-15%, actually by as much as you can while still keeping it 100% stable), your GPU power usage will go down by 25-30%. This is essentially how mobile GPUs are selected: they are tested at slightly lower clocks and much lower voltages.
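A rough back-of-envelope for that claim, assuming dynamic power scales roughly with frequency times voltage squared (a common first-order approximation that ignores static/leakage power):

# First-order power estimate: dynamic power ~ frequency * voltage^2.
f_scale = 0.90                 # 10% underclock
for v_scale in (0.90, 0.85):   # 10% and 15% undervolt
    p_scale = f_scale * v_scale ** 2
    print(f"{(1 - p_scale) * 100:.0f}% lower power")  # ~27% and ~35%

# Roughly in line with the 25-30% figure above; actual savings depend on how much
# of the card's draw is leakage, memory, fans etc.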

On the other hand, this means that overclocking (especially with overvolting) significantly decreases power efficiency, which is nothing new, but people usually overlook it.

Also worth mentioning is that GPU apps are still far from their optimal efficiency, which is not so much the case for CPU apps. For example, Petri33's custom-optimized NV GPU application is 2-2.5x more efficient (and 2.5-3x faster) than the standard app, and he is convinced there is still room for further improvement.

The reason is that it is much harder to properly optimize GPU applications, due to GPUs' heavy parallelism and varied architectures.
ID: 1815848
Profile George 254
Volunteer tester

Joined: 25 Jul 99
Posts: 155
Credit: 16,507,264
RAC: 19
United Kingdom
Message 1815852 - Posted: 9 Sep 2016, 5:26:32 UTC - in response to Message 1815733.  

qbit
Thanks for your post.
Clicked on the task links and got:
No such task: 5141706081
No such task: 5143371920 ????
ID: 1815852
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13727
Credit: 208,696,464
RAC: 304
Australia
Message 1815859 - Posted: 9 Sep 2016, 6:58:14 UTC - in response to Message 1815733.  

http://setiathome.berkeley.edu/result.php?resultid=5141706081
WU true angle range is : 0.423120
42 min 12 sec


At present I'm running my GTX 750 Tis using the SoG application, with a modified version of one of the suggested command lines.
Running 1 WU at a time, they're crunching most Arecibo WUs in 13-14 min. The highest peak I've seen for power consumption is around 85% of TDP; generally it's around 70-75% (42-45 W).
Grant
Darwin NT
ID: 1815859
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1815870 - Posted: 9 Sep 2016, 7:44:18 UTC - in response to Message 1815815.  
Last modified: 9 Sep 2016, 7:45:09 UTC


This type of calculation does not take into account the whole power usage of the system the device is in. To do that you would really need to use a power meter, or UPS with a power usage display, and measure each system at idle and then when processing tasks. The delta could then be used to calculate Wh/Task.

And that's quite an important part. If a device completes a task much faster, it needs the whole system powered on for a correspondingly shorter time. If a very low-power device takes much longer to complete the same task, it requires full system support (with all its energy-consuming overhead) for that whole, much longer time.

Hence, without accounting for whole-system energy consumption overhead, such CPU vs GPU efficiency comparisons are quite biased IMO and don't show the real benefits of fast GPU computing.
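To illustrate, here is the earlier Wh/Task calculation repeated with a purely hypothetical 30 W of 'rest of system' overhead (board, RAM, PSU losses and so on) added to both hosts; the real overhead will differ per machine:

# Wh/Task including an assumed whole-system overhead (the 30 W is hypothetical).
def wh_per_task(total_watts, tasks_at_once, run_time_min):
    return total_watts * (run_time_min / 60) / tasks_at_once

overhead_w = 30.0

gpu_host = wh_per_task(55 + overhead_w, 2, 42)     # ~29.8 Wh/task
cpu_host = wh_per_task(7.5 + overhead_w, 4, 305)   # ~47.7 Wh/task

print(gpu_host, cpu_host)  # with overhead included, the fast GPU host comes out ahead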
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1815870
Profile -= Vyper =-
Volunteer tester
Joined: 5 Sep 99
Posts: 1652
Credit: 1,065,191,981
RAC: 2,537
Sweden
Message 1815876 - Posted: 9 Sep 2016, 8:44:50 UTC

In another thread I posted this, which reflects how much juice is required for the work:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.35 Driver Version: 367.35 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 750 Ti Off | 0000:01:00.0 Off | N/A |
| 39% 57C P0 23W / 46W | 1016MiB / 1998MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 750 Ti Off | 0000:02:00.0 Off | N/A |
| 40% 58C P0 27W / 46W | 1016MiB / 2000MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 750 Ti Off | 0000:04:00.0 Off | N/A |
| 38% 53C P0 25W / 46W | 1016MiB / 2000MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce GTX 750 Ti Off | 0000:05:00.0 Off | N/A |
| 37% 51C P0 24W / 46W | 1016MiB / 2000MiB | 98% Default |
+-------------------------------+----------------------+----------------------+

It seems like each card consumes about 25 W when crunching on my quad GTX 750 Ti host.
http://setiathome.berkeley.edu/show_host_detail.php?hostid=8053171

But then again, you need to take into account that you need a computer "around the cards" to drive them. I presume that computer consumes around 200 W at the wall, but I can't confirm it.

_________________________________________________________________________
Addicted to SETI crunching!
Founder of GPU Users Group
ID: 1815876
Profile HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1816031 - Posted: 10 Sep 2016, 0:00:10 UTC - in response to Message 1815870.  


This type of calculation does not take into account the whole power usage of the system the device is in. To do that you would really need to use a power meter, or UPS with a power usage display, and measure each system at idle and then when processing tasks. The delta could then be used to calculate Wh/Task.

And that's quite an important part. If a device completes a task much faster, it needs the whole system powered on for a correspondingly shorter time. If a very low-power device takes much longer to complete the same task, it requires full system support (with all its energy-consuming overhead) for that whole, much longer time.

Hence, without accounting for whole-system energy consumption overhead, such CPU vs GPU efficiency comparisons are quite biased IMO and don't show the real benefits of fast GPU computing.

I find it most useful to calculate the CPU & GPU in the same system for each app or type of work from a project. Then I can use the information to determine which device in the system is most suited to running a given type of work.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
ID: 1816031
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1816133 - Posted: 10 Sep 2016, 11:22:14 UTC - in response to Message 1816031.  


This type of calculation does not take into account the whole power usage of the system the device is in. To do that you would really need to use a power meter, or UPS with a power usage display, and measure each system at idle and then when processing tasks. The delta could then be used to calculate Wh/Task.

And that's quite an important part. If a device completes a task much faster, it needs the whole system powered on for a correspondingly shorter time. If a very low-power device takes much longer to complete the same task, it requires full system support (with all its energy-consuming overhead) for that whole, much longer time.

Hence, without accounting for whole-system energy consumption overhead, such CPU vs GPU efficiency comparisons are quite biased IMO and don't show the real benefits of fast GPU computing.

I find it most useful to calculate the CPU & GPU in the same system for each app or type of work from a project. Then I can use the information to determine which device in the system is most suited to running a given type of work.

Yes, if the system overhead power remains the same, the needed corrections can be derived from pure device power data. But they are still needed, especially if the throughput of the devices being compared differs by an order of magnitude or more.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1816133
Profile ML1
Volunteer moderator
Volunteer tester

Joined: 25 Nov 01
Posts: 20258
Credit: 7,508,002
RAC: 20
United Kingdom
Message 1816386 - Posted: 11 Sep 2016, 13:18:05 UTC

Very good theorizing...

However, the best test is to measure reality directly:

Measure the power consumed at your mains wall socket over, for example, 48 hours and divide by the total WUs completed or by RAC.

You should get some interesting numbers, especially comparing the WU and RAC values.



Happy efficient crunchin!
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 1816386
qbit
Volunteer tester
Joined: 19 Sep 04
Posts: 630
Credit: 6,868,528
RAC: 0
Austria
Message 1816401 - Posted: 11 Sep 2016, 14:11:08 UTC

Just for clarification: I can't provide "real" numbers for the whole systems because I lack the tools to measure real power usage. And I have no plans to get some because the cheap ones are pretty inaccurate and the professional ones are too expensive to buy just for fun.

But if anybody has such tools, feel free to share your findings here.
ID: 1816401
Profile M_M
Joined: 20 May 04
Posts: 76
Credit: 45,752,966
RAC: 8
Serbia
Message 1816463 - Posted: 11 Sep 2016, 17:37:35 UTC - in response to Message 1816401.  
Last modified: 11 Sep 2016, 18:36:42 UTC

I think even the cheap power meters ($15-20) should be accurate enough to measure average power consumption, so why not try? My measurements at the wall socket are below (I also have an APC SmartUPS, which itself draws some 5% on top of the figures shown below).

My PC at idle (i.e. ordinary desktop work, web surfing etc.) with a 24" LCD is around 170 W (100 W at real idle with the monitor sleeping).
With S@H running just on the CPU, power draw is around 275 W (i7-2600K, overclocked to 4.5 GHz).
With S@H running on the CPU + GTX 1080, power draw is around 390 W. So the GTX 1080 is responsible for around 115 W of draw, which is around 64% of its TDP, close to the average power consumption that GPU-Z reports.
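For completeness, the deltas from those wall readings (the 180 W TDP for the GTX 1080 is the published reference figure; the ~5% UPS overhead is ignored here):

# Deltas from the wall-socket readings above (UPS overhead not subtracted).
idle_w, cpu_only_w, cpu_plus_gpu_w = 170, 275, 390
gtx1080_tdp_w = 180  # published reference TDP

cpu_crunch_delta = cpu_only_w - idle_w          # ~105 W for CPU crunching
gpu_crunch_delta = cpu_plus_gpu_w - cpu_only_w  # ~115 W for the GTX 1080

print(cpu_crunch_delta, gpu_crunch_delta, gpu_crunch_delta / gtx1080_tdp_w)  # 105 115 ~0.64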
ID: 1816463
