A journey: iGPU slowing CPU processing
Author | Message |
---|---|
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
Initially I didn't think anything of the AP times on my i5-4670K, as they were faster than on any other system I have run previously. After finding older-generation i5s, with lower clock speeds, running AP tasks faster than my two i5-4670K systems, I made some observations. Apps used: ap6_win_x64_avx_cpu_r2163, ap6_win_x64_sse3_cpu_r2163, mb7_win_x86_sse_opencl_intel_r2170. Times are based on MB tasks with an AR near 0.42 and AP tasks with blanking between 0-5%. Configuration: 4 CPU MB + 1 iGPU MB: CPU times ~2 hr, iGPU times ~1 hr. Configuration: 4 CPU MB, no iGPU running: CPU times ~2 hr. Configuration: 4 CPU AP + 1 iGPU MB: CPU times 7.5-8 hr, iGPU times ~1 hr. Configuration: 4 CPU AP, no iGPU running: CPU times 4-4.5 hr. Currently I am testing Configuration: 3 CPU AP + 1 iGPU MB. If that proves fruitful I am going to test Configuration: 3 CPU AP + 1 CPU MB + 1 iGPU MB. From these observations it looks like running 4 AP tasks with MB on the iGPU might be hitting the memory bandwidth limit. Given that the iGPU was not slowed by the 4 AP tasks, it looks like the iGPU might be given priority access to memory. I am also testing this on my Silvermont-based system. It does not show a change in speed when not using the iGPU. Perhaps 4 AP tasks are maxing it out alone. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the [url=http://tinyurl.com/8y46zvu]BP6/VP6 User Group[ |
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
Forgot to list the MB CPU apps previously: akv8c_r2202_x64_avx, akv8c_r2202_x64_ssse3. I ran several more tests. Configuration: 3 CPU AP + 1 iGPU MB. Configuration: 2 CPU AP + 1 iGPU MB. Configuration: 1 CPU AP + 1 iGPU MB. In each test the results were: CPU times 7.5-8 hr, iGPU times ~1 hr. Now it seems less like a memory bottleneck to me. The iGPU seems to be doing something that causes the AP tasks to run nearly twice as slowly. I am giving -cpu_lock a test to see what happens. If the AP CPU times are still slowed with that setting, I will probably just configure the iGPU to run only when MB is running on all 4 CPU cores. |
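Editor's sketch: the "nearly twice as slow" figure can be checked against the run-time ranges quoted in the first post (4-4.5 hr for CPU AP alone, 7.5-8 hr with MB on the iGPU). The function name and the choice to pair opposite ends of each range are assumptions made for illustration, not anything used in the thread.

```python
# Run-time ranges in hours, as quoted in the thread.
ap_cpu_alone = (4.0, 4.5)       # 4 CPU AP, no iGPU running
ap_cpu_with_igpu = (7.5, 8.0)   # CPU AP while the iGPU runs MB


def slowdown_range(baseline, loaded):
    """Return the (smallest, largest) slowdown factor for two (low, high) ranges."""
    smallest = loaded[0] / baseline[1]   # fastest loaded run vs slowest baseline run
    largest = loaded[1] / baseline[0]    # slowest loaded run vs fastest baseline run
    return smallest, largest


low, high = slowdown_range(ap_cpu_alone, ap_cpu_with_igpu)
print(f"slowdown: {low:.2f}x to {high:.2f}x")   # slowdown: 1.67x to 2.00x
```

On these rough numbers the slowdown sits between about 1.7x and 2x, whichever way the ranges are paired.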
Darrell Wilcox Joined: 11 Nov 99 Posts: 303 Credit: 180,954,940 RAC: 118 |
I am not sure this is relevant to what you are seeing, but on my i7-4770k with 4 GTX750Ti cards, using the iGPU significantly slowed all the cards. I didn't bother to quantify it, though. It was just very obvious by looking at the CPU-Z numbers. |
Josef W. Segur Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0 |
Hal, I suspect the Haswell "smart cache" implementation may be at least part of the problem. SaH v7 tasks use a lot of cache when doing various operations, and having that single 6 MiB cache shared between all processing units may thrash it. It might also be interesting to see how AP on the iGPU interacts with CPU processing. There's an open Beta at Crunchers Anonymous for AP6_win_x86_SSE2_OpenCL_Intel_r2180. Joe |
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
"Hal, I suspect the Haswell 'smart cache' implementation may be at least part of the problem. SaH v7 tasks use a lot of cache when doing various operations, and having that single 6 MiB cache shared between all processing units may thrash it." I was just wondering if the cache might be the issue, and perhaps that could be why I am not seeing the same behavior on Silvermont/Bay Trail-D. It uses a separate 1MB cache for each pair of cores. I am not sure how the iGPU is allocated cache in that configuration. The block diagram here may be helpful in some way. I did try the iGPU AP app a while back. I even pumped out 12 tasks on this machine. It was dreadfully slow, around 12-13 hours per AP IIRC. However, I was running 4 MB tasks on the CPU. At the time I just figured the iGPU must not be good at doing AP and moved on. The next step is obviously to run AP on the iGPU alone and see how it responds. |
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
Running on this machine. Configuration: No CPU + 1 iGPU AP. About 10 minutes into the AP task, the iGPU clock is all over the place (MB keeps it at a solid 1.2GHz), but the task is at about 5%. So maybe around 3 hours, depending on what the task turns out to have in it. I am also running it with an empty ap_cmdline.txt, so some tweaks might even out the clock rate. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14649 Credit: 200,643,578 RAC: 874 |
Raistmer did some tuning work on his iGPU applications with Oliver Bock of Einstein, round about October 2013 (see Support for (integrated) Intel GPUs (Ivy Bridge and later)). I think you'll find the SETI builds later than that will run much better when the CPU is fully loaded. |
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
"Raistmer did some tuning work on his iGPU applications with Oliver Bock of Einstein, round about October 2013 (see Support for (integrated) Intel GPUs (Ivy Bridge and later)). I think you'll find the SETI builds later than that will run much better when the CPU is fully loaded." I'll have to read through that. It looks like all of the OpenCL builds I'm running are dated March 2014. I grabbed them from Mike's site, in early June judging by the date stamps on the compressed files. |
Richard Haselgrove Joined: 4 Jul 99 Posts: 14649 Credit: 200,643,578 RAC: 874 |
"Raistmer did some tuning work on his iGPU applications with Oliver Bock of Einstein, round about October 2013 (see Support for (integrated) Intel GPUs (Ivy Bridge and later)). I think you'll find the SETI builds later than that will run much better when the CPU is fully loaded." Look for revision numbers. This is one which runs well with a fully loaded CPU: OpenCL version by Raistmer, r2061. Anything after that should be OK as well. |
Mike Joined: 17 Feb 01 Posts: 34253 Credit: 79,922,639 RAC: 80 |
I'm hosting r_2180, which should be O.K. With each crime and every kindness we birth our future. |
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
The apps I have been using are the latest: mb7_win_x86_sse_opencl_intel_r2170 and ap6_win_x86_sse2_opencl_intel_r2180. In that thread on Einstein I believe Raistmer mentioned that this could be driver dependent, unless that was referring to something else. I am using the most recent driver from Intel, which they released in April, after these apps. So perhaps that might be part of it. Running just the iGPU for AP looks alright. Configuration: No CPU + 1 iGPU AP: iGPU time ~3 hr. So I will give this a go and see what happens. Configuration: 4 CPU AP + 1 iGPU AP. Given the CPU usage of the iGPU AP application I may want to pull that down to 3 CPU AP, but I want to see what happens this way first. |
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
Configuration: 4 CPU AP + 1 iGPU AP. At 24 minutes of run time the 4 CPU AP tasks were 5% complete, on track for ~8 hr run time. At 36 minutes of run time the iGPU AP task was 20% complete, on track for ~3 hr run time. I suspended those 4 CPU AP tasks and changed to this, with 3 new tasks, to see their times. Configuration: 3 CPU AP + 1 iGPU AP. I have a feeling the CPU will come up with ~8 hr run times. I feel I may need to retest the following, as I may have had those numbers with a different Intel driver or something else. Configuration: 4 CPU MB + 1 iGPU MB. Configuration: 4 CPU MB, no iGPU running. |
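Editor's sketch: the "on track for" estimates above follow from a straight-line extrapolation of elapsed time over percent complete. The function name is hypothetical, and the estimate assumes progress is linear over the whole task, which AP tasks with heavy blanking may not satisfy.

```python
def projected_total_hours(elapsed_minutes, percent_done):
    """Extrapolate total run time in hours, assuming progress is linear."""
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in (0, 100]")
    return elapsed_minutes * 100 / percent_done / 60


# Figures quoted in the post above.
cpu_ap = projected_total_hours(24, 5)     # 4 CPU AP tasks: 5% after 24 minutes
igpu_ap = projected_total_hours(36, 20)   # iGPU AP task: 20% after 36 minutes
print(f"CPU AP: ~{cpu_ap:.0f} hr, iGPU AP: ~{igpu_ap:.0f} hr")   # CPU AP: ~8 hr, iGPU AP: ~3 hr
```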
Richard Haselgrove Joined: 4 Jul 99 Posts: 14649 Credit: 200,643,578 RAC: 874 |
Yes, I'd guess it matters *what* you're running on the CPU cores, not just 'how many'. I normally run a couple of light-weight BOINC projects: NumberFields@Home (integer only) and SIMAP (floating point, but no SIMD variants). SIMAP is dark between batches at the moment, so I'm picking up LHCclassic, which has SSE3 apps and - unusually - a glut of work at the moment. The iGPU has slowed down. A few weeks ago, I ran an intensive batch of Einstein CasA. That has probably inherited some derivative of Akos Fekete's hand-whittled machine code in its hot loops: it slowed the iGPU by something approaching 50%. These things are interconnected and influence each other. |
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
"Yes, I'd guess it matters *what* you're running on the CPU cores, not just 'how many'." It certainly does seem to be that way. I was going to run CPU & GPU MB again to make sure it wasn't an Intel driver issue. Then I remembered that I only just switched my 2nd i5-4670K over to running AP from 4 CPU MB + 1 iGPU MB. It still has pending MB tasks from last week with normal CPU times, and I last updated the Intel driver in May. From the tests I have done, these configurations hold for my Haswell systems. I speculate that Ivy Bridge would have similar results. With red indicating extended, nearly double, run times for the app(s). Configuration: 4 CPU MB + No iGPU running; 4 CPU MB + 1 iGPU MB; 4 CPU AP + No iGPU running; 4 CPU AP + 1 iGPU MB; 4 CPU AP + 1 iGPU AP; 1 CPU AP + 1 iGPU AP. There are still configurations I haven't tested to determine if there is a slowdown in any of the apps: 4 CPU MB + 1 iGPU AP; 3 CPU MB + 1 iGPU AP. I should also check the latter, since the iGPU AP app looks to use about 10-12% CPU. I still have several other tests to do on my Silvermont/Bay Trail system, as it does not seem to incur any extended run time issues when using its iGPU. |
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
Further testing on my Haswell systems has shown that I was mistaken about the MB CPU times when not running the iGPU. It could be that the MB CPU tasks were also VLARs and I just didn't notice at the time. So I will revise my data from before. Configuration: 4 CPU MB + 1 iGPU MB: CPU times ~2 hr, iGPU times ~1 hr. Configuration: 4 CPU MB, no iGPU running: CPU times ~1 hr. Further testing with the Silvermont/Bay Trail system is still ongoing. That system is much slower, so testing takes much, much longer. I suppose I should also test with my ATI GPUs to see if there is a similar reaction, but that shall wait for cooler weather. |
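Editor's sketch: taking the revised MB times above at face value, total output can be compared as tasks finished per hour. This assumes CPU and iGPU tasks are of comparable size (same app, AR near 0.42) and that the rough "~" times are representative; the helper name is hypothetical.

```python
def tasks_per_hour(workers):
    """Sum throughput over (task_count, hours_per_task) pairs."""
    return sum(count / hours for count, hours in workers)


# Revised MB figures from the post above.
with_igpu = tasks_per_hour([(4, 2.0), (1, 1.0)])   # 4 CPU MB at ~2 hr + 1 iGPU MB at ~1 hr
cpu_only = tasks_per_hour([(4, 1.0)])              # 4 CPU MB at ~1 hr, no iGPU

print(f"with iGPU: {with_igpu:.1f} tasks/hr, CPU only: {cpu_only:.1f} tasks/hr")
# with iGPU: 3.0 tasks/hr, CPU only: 4.0 tasks/hr
```

On these approximate numbers, the iGPU's extra task per hour does not make up for the doubled CPU run times, though a few VLARs in either sample could shift the result.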
Raistmer Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Check thoroughly that all frequency-reducing CPU/mobo features are disabled. These chips, it seems, can't work at full load: power or temperature constraints lead to reduced frequencies, for either the CPU or the GPU part. Cache contention effects can be big too, but if frequencies go down the effect will be much bigger. |
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
"check thoroughly that all freq reducing CPU/mobo features are disabled." On my i5-4670K gaming box, host 5837483, I have all power saving features disabled. OS settings are 100% minimum/100% maximum. The cooler is a Noctua D-14. It keeps temps down very well. Max temp on the hottest core right now reads 52°C with 28°C ambient. On my i5-4670K HTPC, host 5255585, I have speed stepping enabled. OS settings are 50% minimum/100% maximum. The cooler is an NZXT Respire T20. Not as good as the Noctua, but it still gets the job done. Max temp on the hottest core right now reads 62°C with 28°C ambient. I have not seen the CPU downclock on either system with BOINC running. With MB on the iGPU the clock only drops from 1.2GHz when loading a new task. With AP on the iGPU the clock rate varies. The GPU-Z graph of the clock rate was very spiky. Probably as portions switched to the CPU? I can get a screenshot when we have AP again if needed. I think perhaps it could be the CPU & iGPU fighting over cache, even when running 1 CPU + 1 iGPU, with the iGPU always being the winner, as the iGPU does not get slowed. My Bay Trail system, host 7324426, does not seem to show this same kind of CPU slowdown. This may also point to the cache, since Bay Trail has a separate cache for each pair of CPU cores. I still have much more testing to do with that system to determine if there is any slowdown at all. Maybe there is a small % slowdown, but so little that the iGPU more than makes up for it, much like HT slows times but increases output. Also, if it is a cache issue, I wonder whether the Intel GT3+e GPU will avoid the same slowdown thanks to the extra cache it adds. Maybe the 8th generation iGPU will have a completely separate cache. Even with the slowdown I think the performance of CPU+iGPU is good. I ran for months and only noticed the slowdown after comparing times with Sandy/Ivy Bridge CPUs. The highest efficiency might come from running different CPU apps alongside the SETI@home iGPU app. Only with more testing will the answer be known. EDIT: I am also speculating whether the slowdown could be related to those machines having additional GPUs, in the form of ATI cards. The ATI cards are the primary video, with the iGPU only being used for crunching. The Bay Trail system has only the iGPU for display & crunching. I do not think this could be the issue, but I plan to give it a try at some point. |
FalconFly Joined: 5 Oct 99 Posts: 394 Credit: 18,053,892 RAC: 0 |
As far as I have observed, an iGPU causing massive slowdown of other CPU tasks is simply caused by it downright crippling the RAM bandwidth. iGPUs that use system RAM easily peak-load the entire RAM bandwidth all on their own. Adding CPU tasks that both occupy cache and then also compete for the little RAM bandwidth just slows down everything. iGPUs are usually better left running all on their own (no CPU tasks at all, except the one keeping the GPU loaded). That typically yields the best overall performance, albeit it does leave some CPU cores unused. Although that feels like a waste, shared-memory GPUs (i.e. AMD APUs or CPU-integrated Intel GPUs) force the user to make that decision or reserve the CPUs for another, non-CPU-intensive BOINC project. On the positive side, having the CPU cores run idle reduces power consumption while the iGPU yields optimum total performance in terms of output. Also, once the iGPU is on its own, its output performance on the project will also maximize. The only exception to that rule would be tasks that fit nearly entirely into the CPU caches and do not compete for RAM bandwidth to any significant extent. Quick example: dual-channel DDR3-1866 yields about 14-16GB/sec realistic max. sustained performance under typical conditions and a standard system overhead (*). That bandwidth is easily consumed completely by a modern iGPU all on its own. Adding any other RAM-intensive tasks will divide that figure by the total number of cores competing for its bandwidth. In the end, all of them would operate under extremely RAM bandwidth-starved conditions, yielding comparably poor performance on both CPU and iGPU runtimes. (*) Note that dual channel does not double the RAM bandwidth in any way, although marketing never stopped suggesting that. Its maximum sustained advantage over single channel is approx. 15-20%. This won't change until someone develops and integrates a fully functional RAID 0 RAM controller (RAM striping), which doesn't exist so far. PS. I've just recently put my notebook to use for SETI. As it holds both an Intel HD4000 and an NVidia GT720M 1GB, I've set the Core i3-3110M 2.4GHz CPU to run only the two GPU tasks. The HD4000 is given the max. available RAM bandwidth of dual-channel DDR3-1600, while the GT720M is working off its own VRAM. The four logical CPU cores (2 physical + 2 HT) keep those loaded to the optimum extent while keeping their resulting RAM bandwidth competition to a minimum, and have the caches all to themselves to feed the GPUs. The remaining CPU idle time reduces the thermal stress on the notebook's cooling system, which is highly desirable. |
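Editor's sketch: for reference next to the sustained figures quoted above, the theoretical peak transfer rate of a DDR3 configuration can be computed from the module rating, assuming the standard 64-bit (8-byte) data bus per channel. This is only the upper bound on paper; it says nothing about what a given CPU, iGPU, or benchmark actually sustains. The function name is hypothetical.

```python
BYTES_PER_TRANSFER = 8   # one 64-bit DDR3 channel moves 8 bytes per transfer


def peak_bandwidth_gb_s(mega_transfers_per_s, channels):
    """Theoretical peak bandwidth in GB/s (decimal) for a DDR3 configuration."""
    return mega_transfers_per_s * BYTES_PER_TRANSFER * channels / 1000


# Module ratings mentioned in the post above.
for name, rate in (("DDR3-1600", 1600), ("DDR3-1866", 1866)):
    single = peak_bandwidth_gb_s(rate, 1)
    dual = peak_bandwidth_gb_s(rate, 2)
    print(f"{name}: {single:.1f} GB/s single channel, {dual:.1f} GB/s dual channel (peak)")
```

Measured sustained bandwidth is always below these peaks, and how far below depends on the memory controller and the access pattern.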
HAL9000 Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
Running tests with a reduced number of CPU tasks did not show an increase in CPU performance. I would expect some significant change if the system memory bandwidth was being maxed out and causing a bottleneck. I don't know enough about how the applications use memory to say if my expectations are accurate. I did test these scenarios: 4 CPU + 1 iGPU; 3 CPU + 1 iGPU; 2 CPU + 1 iGPU; 1 CPU + 1 iGPU. In each, CPU times were about double those of running the CPU alone. Meanwhile, my Bay Trail-D system does not appear to show any slowing of CPU processing times while running the iGPU. So cache thrashing seems the most likely explanation to me right now. I am still doing tests on that system, so results may change as more data points are collected. |
FalconFly Joined: 5 Oct 99 Posts: 394 Credit: 18,053,892 RAC: 0 |
I wasn't referring to CPU performance. The impact on CPU performance when the iGPU is already bottlenecking the RAM bandwidth is not very important. The overall performance loss caused by the iGPU not getting max. RAM bandwidth is the key. The CPU cores hardly contribute much to the total output anyway, unless it is a very potent CPU coupled with a rather slow iGPU (although that can depend on the project). The potential output gain from running CPU cores is typically more than offset by the significantly reduced iGPU performance under these conditions. |