A journey: iGPU slowing CPU processing


HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6530
Credit: 189,815,445
RAC: 15,508
United States
Message 1543505 - Posted: 17 Jul 2014, 22:11:52 UTC
Last modified: 17 Jul 2014, 22:14:45 UTC

Initially I didn't think anything of the AP times on my i5-4670K, as they were faster than on any other system I had run previously. After finding older-generation i5s, with lower clock speeds, running AP tasks faster than my two i5-4670K systems, I made some observations.

Apps used:
ap6_win_x64_avx_cpu_r2163
ap6_win_x64_sse3_cpu_r2163
mb7_win_x86_sse_opencl_intel_r2170

Times are based on:
MB tasks with an AR near 0.42
AP tasks with blanking between 0-5%

Configuration: 4 CPU MB + 1 iGPU MB
CPU times ~2 hr, iGPU times ~1 hr

Configuration: 4 CPU MB, No iGPU running
CPU times ~2 hr

Configuration: 4 CPU AP + 1 iGPU MB
CPU times 7.5-8 hr, iGPU times ~1 hr

Configuration: 4 CPU AP, No iGPU running
CPU times 4-4.5 hr

Currently I am testing
Configuration: 3 CPU AP + 1 iGPU MB

If that proves fruitful, I am going to test:
Configuration: 3 CPU AP + 1 CPU MB + 1 iGPU MB

From these observations it looks like running 4 AP tasks with MB on the iGPU might be hitting the memory bandwidth limit. Given that the iGPU was not slowed by the 4 AP tasks, it looks like the iGPU might be given priority access to memory.

I am also testing this on my Silvermont-based system. It does not show a change in speed when not using the iGPU. Perhaps 4 AP tasks are maxing it out alone.
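
As a rough check on the figures above, the slowdown can be expressed as a ratio of range midpoints. This is only a sketch: the helper names are made up for illustration, and the inputs are the approximate ranges quoted in this post, not measured per-task values.

```python
def midpoint(low_hr, high_hr):
    """Middle of a reported run-time range, in hours."""
    return (low_hr + high_hr) / 2.0


def slowdown(alone_range_hr, with_igpu_range_hr):
    """Ratio of run time with the iGPU busy to run time with it idle."""
    return midpoint(*with_igpu_range_hr) / midpoint(*alone_range_hr)


# Approximate ranges from the post (hours per CPU task).
ap_slowdown = slowdown((4.0, 4.5), (7.5, 8.0))  # 4 CPU AP: iGPU idle vs iGPU MB
mb_slowdown = slowdown((2.0, 2.0), (2.0, 2.0))  # 4 CPU MB: iGPU idle vs iGPU MB

print(f"AP slowdown: {ap_slowdown:.2f}x")
print(f"MB slowdown: {mb_slowdown:.2f}x")
```

On these numbers the AP tasks take about 1.8 times as long with the iGPU busy, while the MB figures as reported here show no change.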
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group today!

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6530
Credit: 189,815,445
RAC: 15,508
United States
Message 1543734 - Posted: 18 Jul 2014, 7:45:14 UTC
Last modified: 18 Jul 2014, 7:57:15 UTC

Forgot to list the MB CPU apps previously:
akv8c_r2202_x64_avx
akv8c_r2202_x64_ssse3

I ran several more tests.
Configuration: 3 CPU AP + 1 iGPU MB
Configuration: 2 CPU AP + 1 iGPU MB
Configuration: 1 CPU AP + 1 iGPU MB

In each test the results were:
CPU times 7.5-8 hr, iGPU times ~1 hr

Now it seems less like a memory bottleneck to me.
The iGPU seems to be doing something that causes the AP tasks to take nearly twice as long.

I am giving -cpu_lock a test to see what happens. If the AP CPU times are still slowed with that setting, I will probably just configure the iGPU to run only when MB is running on all 4 CPU cores.
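
The reasoning behind "less like a memory bottleneck" can be sketched with a toy model. It assumes that memory bandwidth is the only limit, that the iGPU is served first, and that the CPU tasks share whatever is left. All bandwidth numbers below are hypothetical and chosen only so that 4 tasks come out at a 2x slowdown; nothing here was measured on these systems.

```python
def predicted_slowdown(n_cpu_tasks, cpu_demand_gbs, igpu_demand_gbs, bandwidth_gbs):
    """Slowdown of each CPU task if bandwidth were the only bottleneck.

    The iGPU is assumed to take its full demand first; the CPU tasks
    then split the remainder evenly.
    """
    left_for_cpu = bandwidth_gbs - igpu_demand_gbs
    if left_for_cpu <= 0:
        raise ValueError("iGPU demand alone exceeds the available bandwidth")
    total_cpu_demand = n_cpu_tasks * cpu_demand_gbs
    # No slowdown while the CPU tasks' combined demand still fits.
    return max(1.0, total_cpu_demand / left_for_cpu)


# Hypothetical figures: 20 GB/s total, iGPU wants 12 GB/s, each CPU task 4 GB/s.
for n in (4, 3, 2, 1):
    print(n, "CPU tasks ->", predicted_slowdown(n, 4.0, 12.0, 20.0))
```

Under this model the slowdown falls from 2.0 to 1.5 to 1.0 as CPU tasks are removed. The tests above instead showed the same 7.5-8 hr times with 3, 2 and 1 CPU tasks, which is why a simple bandwidth limit is a poor fit.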

Darrell Wilcox
Volunteer tester
Joined: 11 Nov 99
Posts: 211
Credit: 124,619,723
RAC: 100,548
Vietnam
Message 1543790 - Posted: 18 Jul 2014, 11:09:52 UTC - in response to Message 1543734.  

I am not sure this is relevant to what you are seeing, but on my i7-4770k with 4 GTX750Ti cards, using the iGPU significantly slowed all the cards. I didn't bother to quantify it, though. It was just very obvious by looking at the CPU-Z numbers.

Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 1543956 - Posted: 18 Jul 2014, 18:15:19 UTC

Hal, I suspect the Haswell "smart cache" implementation may be at least part of the problem. SaH v7 tasks use a lot of cache when doing various operations, and having that single 6 MiB cache shared between all processing units may thrash it.

It might also be interesting to see how AP on the iGPU interacts with CPU processing. There's an open Beta at Crunchers Anonymous for AP6_win_x86_SSE2_OpenCL_Intel_r2180.
                                                                   Joe
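
To illustrate what thrashing a shared cache means, here is a toy simulation. It is a sketch under loud assumptions: a single plain LRU cache, a "CPU" task that loops over a fixed working set, and an "iGPU" stream that touches lines it never reuses. The sizes are arbitrary line counts, not Haswell figures, and real caches use more elaborate replacement policies.

```python
from collections import OrderedDict


def hit_rate(capacity, working_set, stream_per_access, passes=50):
    """Hit rate of a looping CPU working set in a shared LRU cache.

    stream_per_access is how many never-reused iGPU lines are touched
    between consecutive CPU accesses (0 means the CPU task runs alone).
    """
    cache = OrderedDict()  # least recently used entry first
    next_stream_line = 0
    hits = 0
    accesses = 0

    def touch(line):
        hit = line in cache
        if hit:
            cache.move_to_end(line)        # mark as most recently used
        else:
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
        return hit

    for _pass in range(passes):
        for i in range(working_set):
            accesses += 1
            hits += touch(("cpu", i))
            for _extra in range(stream_per_access):
                touch(("gpu", next_stream_line))
                next_stream_line += 1
    return hits / accesses


print("CPU task alone:      ", hit_rate(96, 64, 0))
print("with streaming iGPU: ", hit_rate(96, 64, 1))
```

In this toy case the working set fits when the CPU task runs alone (every access after the first pass hits), but a single streaming line per CPU access is enough to push each CPU line out before it is reused. Such a mechanism would affect even one CPU task sharing the cache with the iGPU.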

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6530
Credit: 189,815,445
RAC: 15,508
United States
Message 1543959 - Posted: 18 Jul 2014, 18:43:04 UTC - in response to Message 1543956.  
Last modified: 18 Jul 2014, 18:45:24 UTC

I was just wondering if the cache might be the issue & perhaps that could be why I am not seeing the same behavior on Silvermont/Bay Trail-D. It uses a separate 1MB cache for each pair of cores. I am not sure how the iGPU is allocated cache in that configuration. The block diagram here may be helpful in some way.

I did try the iGPU AP app a while back. I even pumped out 12 tasks on this machine. It was dreadfully slow. Around 12-13 hours per AP IIRC. However I was running 4 MB tasks on the CPU. At the time I just figured the iGPU must not be good at doing AP & moved on.

The next step is obviously to run AP on the iGPU alone & see how it responds.

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6530
Credit: 189,815,445
RAC: 15,508
United States
Message 1543968 - Posted: 18 Jul 2014, 19:17:41 UTC
Last modified: 18 Jul 2014, 19:21:12 UTC

Running on this machine.
Configuration: No CPU + 1 iGPU AP

About 10 minutes into the AP task, the iGPU clock is all over the place (MB keeps it at a solid 1.2GHz), but the task is at about 5%. So maybe around 3 hours, depending on what the task turns out to have in it.
I am also running it with an empty ap_cmdline.txt, so some tweaks might even out the clock rate.

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 12190
Credit: 124,899,392
RAC: 40,565
United Kingdom
Message 1543971 - Posted: 18 Jul 2014, 19:21:17 UTC

Raistmer did some tuning work on his iGPU applications with Oliver Bock of Einstein, round about October 2013 (see Support for (integrated) Intel GPUs (Ivy Bridge and later)). I think you'll find the SETI builds later than that will run much better when the CPU is fully loaded.

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6530
Credit: 189,815,445
RAC: 15,508
United States
Message 1543976 - Posted: 18 Jul 2014, 19:38:23 UTC - in response to Message 1543971.  

I'll have to read through that.
It looks like all of the OpenCL builds I'm running are dated March 2014. I grabbed them from Mike's site, in early June judging by the compressed file date stamps.

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 12190
Credit: 124,899,392
RAC: 40,565
United Kingdom
Message 1544027 - Posted: 18 Jul 2014, 22:02:46 UTC - in response to Message 1543976.  

Look for revision numbers. This is one which runs well with a fully loaded CPU:

OpenCL version by Raistmer, r2061

Anything after that should be OK as well.

Mike
Volunteer tester
Joined: 17 Feb 01
Posts: 31324
Credit: 66,259,506
RAC: 25,083
Germany
Message 1544031 - Posted: 18 Jul 2014, 22:14:33 UTC

I'm hosting r_2180, which should be O.K.
With each crime and every kindness we birth our future.

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6530
Credit: 189,815,445
RAC: 15,508
United States
Message 1544040 - Posted: 18 Jul 2014, 22:29:06 UTC
Last modified: 18 Jul 2014, 22:30:16 UTC

The apps I have been using are the latest.

mb7_win_x86_sse_opencl_intel_r2170
ap6_win_x86_sse2_opencl_intel_r2180

In that thread on Einstein I believe Raistmer mentioned that this could be driver dependent, unless that was referring to something else.
I am using the most recent driver from Intel, which they released in April, after these apps. So perhaps that might be part of it.

Running just the iGPU for AP looks alright
Configuration: No CPU + 1 iGPU AP
iGPU time ~3 hr

So I will give this a go and see what happens.
Configuration: 4 CPU AP + 1 iGPU AP

With the CPU usage of the iGPU AP application I may want to pull that down to 3 CPU AP, but I want to see what happens this way first.

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6530
Credit: 189,815,445
RAC: 15,508
United States
Message 1544048 - Posted: 18 Jul 2014, 22:49:54 UTC
Last modified: 18 Jul 2014, 22:52:56 UTC

Configuration: 4 CPU AP + 1 iGPU AP
At 24 minutes run time the 4 CPU AP tasks were 5% complete.
On track for ~8 hr run time
At 36 minutes run time the 1 iGPU AP task was 20% complete.
On track for ~3 hr run time
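
The "on track for" figures above follow from a simple linear projection. A minimal sketch, assuming progress stays linear over the whole task (which AP tasks may not do exactly); the function name is made up for illustration.

```python
def projected_hours(elapsed_minutes, percent_done):
    """Project total run time in hours, assuming linear progress."""
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in (0, 100]")
    return elapsed_minutes * 100 / percent_done / 60


print(projected_hours(24, 5))   # CPU AP tasks: 24 minutes at 5%
print(projected_hours(36, 20))  # iGPU AP task: 36 minutes at 20%
```

This gives 8.0 hours for the CPU tasks and 3.0 hours for the iGPU task, matching the estimates in the post.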

I suspended those 4 CPU AP and changed to this with 3 new tasks to see their times.
Configuration: 3 CPU AP + 1 iGPU AP

I have a feeling the CPU will come up with ~8hr run times.

I feel I may need to retest. I may have had those numbers with a different Intel driver or something else.
Configuration: 4 CPU MB + 1 iGPU MB
Configuration: 4 CPU MB, No iGPU running

Richard Haselgrove
Volunteer tester
Joined: 4 Jul 99
Posts: 12190
Credit: 124,899,392
RAC: 40,565
United Kingdom
Message 1544051 - Posted: 18 Jul 2014, 23:04:03 UTC - in response to Message 1544048.  

Yes, I'd guess it matters *what* you're running on the CPU cores, not just 'how many'.

I normally run a couple of light-weight BOINC projects:
NumberFields@Home (integer only)
SIMAP (floating point, but no SIMD variants)

SIMAP is dark between batches at the moment, so I'm picking up LHCclassic, which has SSE3 apps and - unusually - a glut of work at the moment. The iGPU has slowed down.

A few weeks ago, I ran an intensive batch of Einstein CasA. That has probably inherited some derivative of Akos Fekete's hand-whittled machine code in its hot loops: it slowed the iGPU by something approaching 50%. These things are interconnected and influence each other.

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6530
Credit: 189,815,445
RAC: 15,508
United States
Message 1544064 - Posted: 18 Jul 2014, 23:44:37 UTC - in response to Message 1544051.  

It certainly does seem to be that way.
I was going to run CPU & GPU MB again to make sure it wasn't an Intel driver issue. Then I remembered that I only just switched my 2nd i5-4670K over to running AP from 4 CPU MB + 1 iGPU MB. It still has pending MB tasks from last week with normal CPU times, & I last updated the Intel driver in May.

From the tests I have done these configurations are true for my Haswell systems. I speculate that Ivy Bridge would have similar results.
With red indicating extended, nearly double, run times for the app(s).
Configuration:
4 CPU MB + No iGPU running
4 CPU MB + 1 iGPU MB
4 CPU AP + No iGPU running
4 CPU AP + 1 iGPU MB
4 CPU AP + 1 iGPU AP
1 CPU AP + 1 iGPU AP

There are still configurations I haven't tested yet to determine if there is a slowdown in any of the apps.
4 CPU MB + 1 iGPU AP
3 CPU MB + 1 iGPU AP (I should also check this since the iGPU AP app looks to use about 10-12% CPU.)

I still have several other tests to do on my Silvermont/Bay Trail system, as it does not seem to incur any extended run time issues when using its iGPU.

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6530
Credit: 189,815,445
RAC: 15,508
United States
Message 1545023 - Posted: 20 Jul 2014, 19:59:10 UTC
Last modified: 20 Jul 2014, 20:02:00 UTC

Further testing on my Haswell systems has shown that I was mistaken on the MB CPU times when not running the iGPU. It could be that the MB CPU tasks were also VLARs and I just didn't notice at the time.

So I will revise my data from before.
Configuration: 4 CPU MB + 1 iGPU MB
CPU times ~2 hr, iGPU times ~1 hr

Configuration: 4 CPU MB, No iGPU running
CPU times ~1 hr

Further testing with the Silvermont/Bay Trail system is still ongoing. That system is much slower, so testing takes much, much longer.

I suppose I should also test with my ATI GPUs to see if there is a similar reaction, but that shall wait for cooler weather.
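
Taking the revised MB figures at face value, total output can be compared as tasks completed per hour. This is a back-of-the-envelope sketch only: it assumes a CPU MB task and an iGPU MB task are the same amount of work, and it uses the rounded times from this post.

```python
def tasks_per_hour(n_concurrent, hours_per_task):
    """Throughput of n tasks running side by side."""
    return n_concurrent / hours_per_task


# 4 CPU MB (~2 hr each) + 1 iGPU MB (~1 hr)
with_igpu = tasks_per_hour(4, 2.0) + tasks_per_hour(1, 1.0)
# 4 CPU MB (~1 hr each), iGPU idle
cpu_only = tasks_per_hour(4, 1.0)

print(f"with iGPU: {with_igpu:.1f} tasks/hr")
print(f"CPU only:  {cpu_only:.1f} tasks/hr")
```

On these rounded numbers the CPU-only configuration finishes about 4 MB tasks per hour against about 3 with the iGPU running, so the iGPU's extra task would not make up for the CPU slowdown. More precise timings could easily shift that balance.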

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6014
Credit: 84,593,374
RAC: 35,659
Russia
Message 1545041 - Posted: 20 Jul 2014, 20:50:26 UTC - in response to Message 1545023.  

Check thoroughly that all frequency-reducing CPU/motherboard features are disabled.
It seems these chips can't work at full load: power or temperature constraints lead to reduced frequencies for either the CPU or the GPU part. Cache contention effects can be big too, but if frequencies go down the effect will be much bigger.

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6530
Credit: 189,815,445
RAC: 15,508
United States
Message 1545066 - Posted: 20 Jul 2014, 22:25:54 UTC - in response to Message 1545041.  
Last modified: 20 Jul 2014, 22:28:57 UTC

On my i5-4670K gaming box. Host 5837483
I have all power saving features disabled.
OS settings are 100% minimum/100% maximum.
Cooler is Noctua NH-D14. It keeps temps down very well.
Max temp on hottest core right now reads 52ºC with 28ºC ambient.

On my i5-4670K HTPC. Host 5255585
I have speed stepping enabled.
OS settings are 50% minimum/100% maximum.
Cooler is NZXT Respire T20. Not as good as the Noctua, but still gets the job done.
Max temp on hottest core right now reads 62ºC with 28ºC ambient.

I have not seen the CPU downclock on either system with BOINC running.
With MB on the iGPU the clock only drops from 1.2GHz when loading a new task. With AP on the iGPU the clock rate varies; the GPU-Z graph of the clock rate was very spiky, probably as portions switched to the CPU? I can get a screenshot when we have AP again if needed.
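
One way to put a number on "steady" versus "spiky" is to summarize a series of clock readings. The sketch below works on a plain list of MHz values; the sample lists are invented for illustration and are not taken from these hosts, and the 2% tolerance is an arbitrary choice.

```python
from statistics import mean


def summarize_clock(samples_mhz, nominal_mhz, tolerance=0.02):
    """Summarize clock samples against a nominal frequency.

    fraction_below is the share of samples more than `tolerance`
    below nominal.
    """
    if not samples_mhz:
        raise ValueError("need at least one sample")
    floor = nominal_mhz * (1.0 - tolerance)
    below = sum(1 for s in samples_mhz if s < floor)
    return {
        "min_mhz": min(samples_mhz),
        "mean_mhz": mean(samples_mhz),
        "fraction_below": below / len(samples_mhz),
    }


# Hypothetical readings for an iGPU with a 1200 MHz nominal clock.
steady = [1200, 1200, 1200, 1200, 1150, 1200, 1200, 1200]
spiky = [1200, 600, 1200, 350, 900, 1200, 400, 1200]

print(summarize_clock(steady, 1200))
print(summarize_clock(spiky, 1200))
```

With these made-up samples, the steady trace spends one reading in eight below nominal, while the spiky one spends half of its readings there.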

I think perhaps it could be the CPU & iGPU fighting over cache, even when running 1 CPU + 1 iGPU, with the iGPU always being the winner, as the iGPU does not get slowed.
My Bay Trail system, host 7324426, does not seem to show this same kind of CPU slowdown. This may also point to the cache, since Bay Trail has a separate cache for each pair of CPU cores.
I still have much more testing to do with that system to determine if there is any slowdown at all. Maybe there is a small % slowdown, but so little that the iGPU more than makes up for it, much like HT slows times but increases output.

Also if it is cache issue I wonder if Intel GT3+e GPU will not see the same slowdown with the extra cache it adds. Maybe 8th generation iGPU will have a completely separate cache.

Even with slowdown I think performance of CPU+iGPU is good. I ran for months & I only noticed the slowdown after comparing times with Sandy/Ivy Bridge CPUs. Highest efficiency might be to run different CPU apps with SETI@home iGPU app. Only with more testing will the answer be known.

EDIT:
I am also speculating whether the slowdown could be related to those machines having additional GPUs, in the form of ATI cards. The ATI cards are the primary video, with the iGPU only being used for crunching. The Bay Trail system only has the iGPU for display & crunching.
I do not think this could be the issue, but plan to give it a try at some point.

FalconFly
Joined: 5 Oct 99
Posts: 394
Credit: 18,053,892
RAC: 0
Germany
Message 1546022 - Posted: 22 Jul 2014, 21:12:32 UTC - in response to Message 1545066.  
Last modified: 22 Jul 2014, 21:40:36 UTC

As far as I have observed, the massive slowdown of other CPU tasks caused by iGPUs is simply down to them crippling the RAM bandwidth.

iGPUs that use System RAM easily peak-load the entire RAM bandwidth all on their own.
Adding CPU tasks that both occupy cache and then also compete for the little RAM bandwidth just slows down everything.

iGPUs are usually better left running all on their own (no CPU tasks at all, except the one keeping the GPU loaded).
That typically yields the best overall performance - albeit it does leave some CPU cores unused.

Although that feels like a waste, shared memory GPUs (i.e. AMD APUs or CPU-integrated intel GPUs) force the user to make that decision or reserve the CPUs for another non-CPU intensive BOINC project.

On the positive side, having the CPU cores run idle reduces power consumption while the iGPU yields optimum total performance in terms of output.
Also, once the iGPU is on its own, its output performance on the project will also maximize.

The only exception to that rule would be tasks that fit nearly entirely into the CPU caches and do not compete with RAM bandwidth to any significant extent.

Quick example :
Dual-channel DDR3-1866 yields roughly 14-16GB/sec realistic max. sustained performance under typical conditions and a standard system overhead (*).
That bandwidth is easily consumed completely by a modern iGPU all on its own.
Adding any other RAM-intensive tasks will divide that figure by the total number of cores competing for the bandwidth.
In the end, all of them would operate under extremely RAM bandwidth-starved conditions, yielding comparatively poor performance in both CPU and iGPU runtimes.

(*)
Note that Dual Channel does not double the RAM bandwidth in any way, although marketing never stopped suggesting that.
Its maximum sustained advantage over single channel is approx. 15-20%.
This won't change until someone develops and integrates a fully functional RAID 0 RAM-controller (RAM striping), which doesn't exist so far.
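
For comparison with the sustained estimate above, the theoretical peak of a DDR3 configuration can be worked out from the transfer rate and bus width. The sketch assumes the standard 64-bit data bus per channel. These are interface upper bounds in which a second channel does double the peak; the post's 14-16GB/sec figure is an estimate of sustained real-world throughput, which is lower.

```python
def peak_bandwidth_gbs(transfers_mt_s, channels, bus_width_bits=64):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return transfers_mt_s * 1e6 * bytes_per_transfer * channels / 1e9


for label, rate in (("DDR3-1600", 1600), ("DDR3-1866", 1866)):
    single = peak_bandwidth_gbs(rate, 1)
    dual = peak_bandwidth_gbs(rate, 2)
    print(f"{label}: single {single:.1f} GB/s, dual {dual:.1f} GB/s")
```

For DDR3-1866 this gives about 14.9 GB/s per channel and 29.9 GB/s for two channels, so the sustained figure quoted in the post is roughly half of the dual-channel theoretical peak.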

PS.
I've just recently put my notebook to use for SETI.
As it holds both an Intel HD4000 and an NVidia GT720M 1GB, I've set the Core i3-3110M 2.4GHz CPU to run only the two GPU tasks.
The HD4000 is given the max. available RAM bandwidth of dual channel DDR3-1600, while the GT720M is working off its own VRAM.
The four logical CPU cores (2 physical + 2 HT) keep those loaded to the optimum extent while keeping their resulting RAM bandwidth competition to a minimum, and having the caches all to themselves to feed the GPUs.
The remaining CPU idle time is reducing the thermal stress on the notebook's cooling system, which is highly desirable.

HAL9000
Volunteer tester
Joined: 11 Sep 99
Posts: 6530
Credit: 189,815,445
RAC: 15,508
United States
Message 1546066 - Posted: 22 Jul 2014, 22:31:21 UTC

Running tests with a reduced number of CPU tasks did not show an increase in CPU performance. I would expect some significant change if the system memory bandwidth was being maxed out, causing a bottleneck. I don't know enough about how the applications use memory to say if my expectations are accurate.

I did test these scenarios:
4 CPU + 1 iGPU
3 CPU + 1 iGPU
2 CPU + 1 iGPU
1 CPU + 1 iGPU
With CPU times showing about double vs running CPU alone.

With my Bay Trail-D system apparently not showing a slowing of CPU processing times while running the iGPU, cache thrashing seems the most likely cause to me right now. I am still doing tests on that system, so results may change as more data points are collected.

FalconFly
Joined: 5 Oct 99
Posts: 394
Credit: 18,053,892
RAC: 0
Germany
Message 1546694 - Posted: 24 Jul 2014, 3:40:05 UTC - in response to Message 1546066.  

I wasn't referring to CPU performance.

The impact of CPU performance when the iGPU is already bottlenecking the RAM bandwidth is not very important.

The overall performance loss caused by the iGPU not getting max. RAM bandwidth is the key. The CPU cores anyway hardly contribute much to the total output, unless it is a very potent CPU coupled with a rather slow iGPU (although that can depend on the project).

The potential output gain from running CPU cores is typically more than offset by the significantly reduced iGPU performance under these conditions.
 
©2018 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.