Message boards :
Number crunching :
APU load influence on total device throughput, MultiBeam
Author | Message |
---|---|
Marco Franceschini Send message Joined: 4 Jul 01 Posts: 54 Credit: 69,877,354 RAC: 135 |
Sure... here is the link to my own build of the FFTW 3.3.5 64-bit single-precision (float) libraries. It is renamed to 3.3.4 only for quick deployment under SETI@home. https://drive.google.com/file/d/0B9iU4E_jpim0MEtZWDQtM2xmcGc/view?usp=sharing Marco. |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
It's just amazing how precisely the total device performance (blue dots) stays the same regardless of which part of the Trinity device is loaded: http://lunatics.kwsn.info/index.php?action=dlattach;topic=1735.0;attach=11487;image 2 CPU + GPU or 3 CPU - it doesn't matter. Only the total number of running computational processes matters. I consider this very strong evidence that the computational performance of this device is completely crippled by its inadequately small cache memory. Maybe it has lower AVX throughput due to the shared FP modules... but the GPU is a separate part! Still, it behaves just like a 5th core. The bottleneck is definitely not in the FPU but in data-transfer stalls. The next experiment will be to improve CPU performance by precise pinning of the CPU apps, similar to the GPU ones, after re-checking the 5th dot for the fully utilized device (it is placed lower than the underloaded 4 CPU + idle GPU or 3 CPU + GPU). SETI apps news We're not gonna fight them. We're gonna transcend them. |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Sure...this is link at my own build fftw 3.3.5 64 bit float libraries. Renamed to 3.3.4 only for quickly employ under Seti@home. Thanks. Checked once more on the Trinity host - the "stock" x64 DLL doesn't generate AVX codelets; your AVX x64 3.3.5 does. But there is an issue that prevents distributing your DLL in place of the current one: when I tried the AVX2 version it raised an exception, and the SSSE3 one again didn't generate AVX codelets. That is, the builds are strongly tied to a particular SIMD version, which prevents generic distribution. Is it possible to get a more neutral build that detects the available SIMD set and uses it if it's safe (just as the "stock" 3.3.4 x86 DLL does, for example)? |
Marco Franceschini Send message Joined: 4 Jul 01 Posts: 54 Credit: 69,877,354 RAC: 135 |
Hi Raistmer, indeed I build the FFTW 3.3.5 libraries with incremental SIMD instruction-set support (i.e. they are not neutral). The SSSE3 version is built for my CPUs that have that SIMD set (Q6600, E6600, etc.) with the enable-sse2 switch on; the AVX2 version is for Haswell CPUs, and generally for all CPUs that have that SIMD set, with the enable-avx2 switch on, which generates codelets for register->sse2->avx->avx2; AVX is for Ivy Bridge cores; SSE4.1 for CPUs like my Q9450; SSE4.2 for Ivy Bridge without AVX support. I'll build a more neutral version of FFTW 3.3.5. Marco. |
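Since each of Marco's builds is fixed to one SIMD level, a "neutral" distribution would have to pick a variant at load time. Below is a hedged sketch of such selection logic; the DLL file names and the feature-priority list are illustrative assumptions, not the actual build names from the thread:

```python
# Sketch: choose which FFTW build variant to load based on the SIMD
# features the CPU reports. DLL names here are hypothetical; the real
# builds in the thread are all renamed to the stock 3.3.4 DLL name.

# Variants ordered from most to least capable SIMD level.
FFTW_VARIANTS = [
    ("avx2",   "libfftw3f-avx2.dll"),
    ("avx",    "libfftw3f-avx.dll"),
    ("sse4_2", "libfftw3f-sse42.dll"),
    ("sse4_1", "libfftw3f-sse41.dll"),
    ("ssse3",  "libfftw3f-ssse3.dll"),
    ("sse2",   "libfftw3f-sse2.dll"),
]

def pick_fftw_dll(cpu_flags):
    """Return the most capable variant the CPU supports."""
    for feature, dll in FFTW_VARIANTS:
        if feature in cpu_flags:
            return dll
    return "libfftw3f-generic.dll"   # scalar fallback

# Example: an Ivy Bridge-class CPU (AVX but no AVX2):
print(pick_fftw_dll({"sse2", "ssse3", "sse4_1", "sse4_2", "avx"}))
```

This is essentially what a dispatcher DLL would do once at startup, instead of baking the SIMD choice into the build as the per-CPU variants above do.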
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Methodology for affinity-related benchmarking: http://lunatics.kwsn.info/index.php/topic,1735.msg61155.html#msg61155 |
EdwardPF Send message Joined: 26 Jul 99 Posts: 389 Credit: 236,772,605 RAC: 374 |
Next experiment will be to improve CPU performance by precise pinning of CPU apps similarly to GPU ones. My experience with the AMD FX-8350, running 4 S@H tasks concurrently with each locked to a single CPU (0, 2, 4 & 6), has consistently been (in terms of RAC) that CPU 4 is highest, CPU 0 next, CPU 2 next, and CPU 6 slowest (if this is of any help). This is observation only, not rigorous testing. Not staggering the CPUs only slows down the paired CPUs (i.e. 0-1, 2-3, 4-5, and 6-7). I will be interested to see your rigorous results from a modern CPU. Ed F P.S. The results are similar with an Intel Core i7 (gen 1); RAC order: CPU 2, 4, 6, 0. |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Test in progress. Since I also try to get some averaging, each run takes quite a lot of time (>2 h per run). So far, data for affinities 0x01 + 0x02 have been collected but not processed; 0x01 + 0x04 (different modules) is in progress. There are also some updates to the older tests with GPU + CPU load in the background. I hope to post an updated picture for IvyBridge soon. |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Update for the IvyBridge picture: http://lunatics.kwsn.info/index.php?action=dlattach;topic=1735.0;attach=11489;image Unlike Trinity, the computational parts of IvyBridge are well decoupled. GPU processing with MultiBeam adds some overhead to the CPU part, but not a big one. With the GPU busy, the almost linear scaling of the CPU part remains (red/black dots). But the iGPU is weak enough on this device that it can't compensate for the loss of a CPU core, so 3 CPU + GPU performs worse than 4 CPU cores with an idle GPU. The same holds for the other configs. |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Final update of the Trinity APU results: http://lunatics.kwsn.info/index.php/topic,1735.msg61161.html#msg61161 From the SETI point of view, it's a 2-core CPU with a kind of hyperthreading. EDIT: maybe one more test is worth running: pin one process to the first 2 CPUs and the second one to the last 2 CPUs (BTW, my test clearly showed that CPUs 0+1 are the first module and 2+3 the second; that's how the CPU numbers map to hardware). This would allow the service thread not to pre-empt the computing one and could result in slightly better performance (to what degree is the aim of the test). |
petri33 Send message Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156 |
Next experiment will be to improve CPU performance by precise pinning of CPU apps similarly to GPU ones. Sounds like CPU 0 on your system serves a lot of interrupts (by default). Those odd-numbered CPUs (1, 3, 5, 7) are HT cores and share the FPU with their 'real core' pair. If I remember correctly, my Linux box has cores 0-5 as real cores and 6-11 as their corresponding HT pairs on an i7-3930K. To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Allowing each app to use both CPUs from the same module doesn't change performance: http://lunatics.kwsn.info/index.php?action=dlattach;topic=1735.0;attach=11497;image (the x3xc dot). But allowing each app 2 CPUs from different modules (x5xa) makes performance even worse than the xfxf case. So, on underloaded Bulldozer-based devices, direct affinity management binding each app instance to a different module is strongly recommended. Just for completeness, the next test will be a fully loaded CPU part with each process pinned to a separate CPU, and after that, 2 CPU apps bound to different modules + the GPU app. Is it possible to achieve something better than the 4 CPU + 0 GPU case through direct affinity management? |
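The one-instance-per-module recommendation reduces to simple mask arithmetic. A minimal sketch, assuming the mapping observed earlier in the thread (module i owns CPUs 2i and 2i+1); the function names are illustrative:

```python
def module_mask(module):
    """Affinity mask covering both CPUs of one Bulldozer module.
    Assumes module i maps to CPUs 2i and 2i+1, as observed in the thread."""
    return 0b11 << (2 * module)

def masks_for_instances(n_instances):
    """One mask per app instance, each bound to a different module."""
    return [module_mask(i) for i in range(n_instances)]

# Two CPU app instances on a 2-module Trinity APU:
print([hex(m) for m in masks_for_instances(2)])  # ['0x3', '0xc']
```

The output matches the "x3xc" labeling of the corresponding data point in the graphs.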
EdwardPF Send message Joined: 26 Jul 99 Posts: 389 Credit: 236,772,605 RAC: 374 |
Petri33: my experience indicates that on the AMD FX-8350, CPUs 0 & 1 share FPU 0, CPUs 2 & 3 share FPU 1, CPUs 4 & 5 share FPU 2, and CPUs 6 & 7 share FPU 3. Also, Windows seems to run on the highest-numbered CPU. On my Core i7 (gen 1), CPUs 0 & 1 are HT twins, CPUs 2 & 3 are HT twins, CPUs 4 & 5 are HT twins, and CPUs 6 & 7 are HT twins. In both cases I try to keep CPU 7 as lightly loaded as possible to let the executive run without interfering with the running BOINC WUs. This seems to work out well in my fun playing around... no serious rigor here, just trying to understand the best hardware/work-unit combination. Ed F |
EdwardPF Send message Joined: 26 Jul 99 Posts: 389 Credit: 236,772,605 RAC: 374 |
Raistmer: I don't understand the nomenclature you are using on your graphs... therefore I don't understand what they represent... sigh... How are you designating CPUs and CPU combinations, etc.? Ed F |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Raistmer: xIxJ are the affinity settings (in hex) used for a particular test, one mask per app instance. An affinity mask is a number where each enabled bit means that CPU is in use. So, for example, 0x2 means only CPU1 is used; 0x1 means CPU0 is used (and only it); 0x3 = 11b means both CPU0 and CPU1 are used, and so on. 0xF enables all 4 bits of this 4-CPU device (the xfxf case), i.e. no restrictions at all. |
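The notation above maps mechanically to CPU sets. A minimal sketch of the conversion (standard affinity-mask semantics, the same masks Windows' SetProcessAffinityMask accepts):

```python
def mask_to_cpus(mask):
    """Return the list of CPU indices whose bits are set in an affinity mask."""
    cpus = []
    bit = 0
    while mask:
        if mask & 1:       # low bit set -> this CPU is allowed
            cpus.append(bit)
        mask >>= 1
        bit += 1
    return cpus

# Examples matching the notation in the thread:
print(mask_to_cpus(0x2))   # [1]           -> only CPU1
print(mask_to_cpus(0x3))   # [0, 1]        -> CPU0 and CPU1
print(mask_to_cpus(0xF))   # [0, 1, 2, 3]  -> all 4 CPUs, no restriction
```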
HAL9000 Send message Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57 |
Hopefully I'll get some time over the weekend to run the iGPU on my Haswell DT and new Skylake laptop for more data points. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the BP6/VP6 User Group (http://tinyurl.com/8y46zvu) |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Returning to this topic: I tried 4 CPUs with each app instance pinned to a single CPU. Device performance did not improve. |
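For reference, pinning a process to a single CPU is a one-call operation. A minimal sketch using the Linux-only os.sched_setaffinity; the Windows tests in this thread would use SetProcessAffinityMask with the hex masks described above instead:

```python
import os

def pin_to_cpu(pid, cpu):
    """Pin process `pid` to a single CPU and return its resulting CPU set.
    Linux-only API; pid 0 means the calling process."""
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)

# Pin this process to the lowest CPU it is currently allowed to use:
first_cpu = min(os.sched_getaffinity(0))
print(pin_to_cpu(0, first_cpu))
```

BOINC itself leaves scheduling to the OS, so this kind of per-instance pinning has to be applied externally to each running app process.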
Mike Send message Joined: 17 Feb 01 Posts: 34255 Credit: 79,922,639 RAC: 80 |
Returning to this topic: I tried 4 CPUs with each app instance pinned to a single CPU. Device performance did not improve. Running on 3 CPU cores whilst using the GPU should give the best throughput. With each crime and every kindness we birth our future. |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
Returning to this topic: I tried 4 CPUs with each app instance pinned to a single CPU. Device performance did not improve. So far it gives the same performance as 4 CPU only, without the GPU part. With the pinned variant I got slightly better throughput with all 5 parts of the device enabled, but I need to repeat the run to be sure. After that I'll try 3 pinned CPU threads + GPU (the GPU app is pinned by default). |
rob smith Send message Joined: 7 Mar 03 Posts: 22190 Credit: 416,307,556 RAC: 380 |
The result may well be different if one used an AMD FX series processor as they have shared FPU units. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? |
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
The result may well be different if one used an AMD FX series processor as they have shared FPU units. That's why I am exploring specifically such a CPU. |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.