APU load influence on total device throughput, MultiBeam

Author	Message
Marco Franceschini Volunteer tester Joined: 4 Jul 01 Posts: 54 Credit: 69,877,354 RAC: 135	Message 1826288 - Posted: 23 Oct 2016, 12:18:24 UTC - in response to Message 1826273. Sure... this is the link to my own build of the fftw 3.3.5 64-bit float libraries, renamed to 3.3.4 only so they can be dropped in quickly under Seti@home. https://drive.google.com/file/d/0B9iU4E_jpim0MEtZWDQtM2xmcGc/view?usp=sharing Marco. ID: 1826288 ·

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1826328 - Posted: 23 Oct 2016, 16:25:10 UTC Last modified: 23 Oct 2016, 16:33:18 UTC It's just amazing how precisely total device performance (blue dots) stays the same regardless of which part of the device is loaded on Trinity: http://lunatics.kwsn.info/index.php?action=dlattach;topic=1735.0;attach=11487;image 2CPU+GPU or 3CPU, it doesn't matter. Only the total number of running computational processes matters. I consider this very strong evidence that the computational performance of this device is completely ruined by its inadequately small cache memory. Maybe it has lower AVX throughput due to the shared FP modules... but the GPU is a separate part! Still, it behaves just like a 5th core. The bottleneck is definitely not in the FPU but in data transfer stalls. The next experiment will be to improve CPU performance by precisely pinning the CPU apps, similarly to the GPU ones. That will come after re-checking the 5th dot for the fully utilized device (it sits lower than the underloaded 4 CPU + idle GPU or 3 CPU + GPU). SETI apps news We're not gonna fight them. We're gonna transcend them. ID: 1826328 ·

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1826462 - Posted: 24 Oct 2016, 12:03:44 UTC - in response to Message 1826288. Sure... this is the link to my own build of the fftw 3.3.5 64-bit float libraries, renamed to 3.3.4 only so they can be dropped in quickly under Seti@home. https://drive.google.com/file/d/0B9iU4E_jpim0MEtZWDQtM2xmcGc/view?usp=sharing Marco. Thanks. Checked once more on the Trinity host: the "stock" x64 DLL doesn't generate AVX codelets. Your AVX x64 3.3.5 does. But there is an issue that prevents distributing your DLL in place of the current one: when I tried the AVX2 version it threw an exception, and the SSSE3 one again didn't generate AVX codelets. That is, the builds are tightly tied to a SIMD version, which prevents generic distribution. Is it possible to get a more neutral build that detects the available SIMD set and uses it if it's safe (just as the "stock" 3.3.4 x86 DLL does, for example)? SETI apps news We're not gonna fight them. We're gonna transcend them. ID: 1826462 ·

Marco Franceschini Volunteer tester Joined: 4 Jul 01 Posts: 54 Credit: 69,877,354 RAC: 135	Message 1826466 - Posted: 24 Oct 2016, 12:26:33 UTC Hi Raistmer, in fact I built the fftw 3.3.5 libraries with incremental support for SIMD instruction sets (i.e. they are not neutral). The SSSE3 version is built for my CPUs that have this SIMD set (Q6600, E6600 etc.) with the enable-sse2 switch on; the AVX2 version is for Haswell CPUs, and generally for all CPUs that have this SIMD set, with the enable-avx2 switch on, which generates codelets for register->sse2->avx->avx2. AVX is for Ivy Bridge cores. SSE4.1 is for CPUs like my Q9450, SSE4.2 for Ivy Bridge without AVX support. I'll build a more neutral version of fftw 3.3.5. Marco. ID: 1826466 ·

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1826492 - Posted: 24 Oct 2016, 14:22:23 UTC Last modified: 24 Oct 2016, 14:22:38 UTC Methodology for affinity-related benchmarking: http://lunatics.kwsn.info/index.php/topic,1735.msg61155.html#msg61155 SETI apps news We're not gonna fight them. We're gonna transcend them. ID: 1826492 ·

EdwardPF Volunteer tester Joined: 26 Jul 99 Posts: 389 Credit: 236,772,605 RAC: 374	Message 1826496 - Posted: 24 Oct 2016, 14:51:36 UTC - in response to Message 1826328. The next experiment will be to improve CPU performance by precisely pinning the CPU apps, similarly to the GPU ones. My experience with the AMD FX-8350, running 4 S@H tasks concurrently with each locked to a single CPU (0, 2, 4 and 6), has consistently been (in terms of RAC) that cpu4 has the highest RAC, cpu0 next, cpu2 next, and cpu6 is the slowest (if this is of any help). This is observation only, not rigorous testing. Not staggering the CPUs only slows down the paired CPUs (i.e. 0-1, 2-3, 4-5 and 6-7). I will be interested to see your rigorous results from a modern CPU. Ed F P.S. The results are similar with an Intel core-7 (gen-1) CPU: RAC order CPU 2, 4, 6, 0 ID: 1826496 ·
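The staggered pinning described in the post above (one task each on CPUs 0, 2, 4 and 6) can be written as affinity bitmasks. A minimal sketch, assuming the usual convention that bit n of the mask stands for logical CPU n; the helper name `mask_from_cpus` is hypothetical and not taken from any SETI@home tool:

```python
def mask_from_cpus(cpus):
    """Build an affinity bitmask: bit n set means logical CPU n is allowed."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


# One task per CPU pair (0-1, 2-3, 4-5, 6-7), each on the even-numbered CPU.
staggered = [mask_from_cpus([cpu]) for cpu in (0, 2, 4, 6)]
print([hex(m) for m in staggered])  # ['0x1', '0x4', '0x10', '0x40']
```

How such a mask is actually applied (task manager, a launcher option, or an OS call) depends on the platform and is left out here.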

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1826516 - Posted: 24 Oct 2016, 16:27:05 UTC - in response to Message 1826496. Test in progress. Because I'm also trying to get some averaging, each run takes quite a lot of time (>2h per run). So far the data for affinity 0x01 + 0x02 is collected but not processed, and 0x01 + 0x04 (different modules) is in progress. Some updates to the older tests with GPU + CPU load are also coming in the background. I hope to post an updated picture for IvyBridge soon. SETI apps news We're not gonna fight them. We're gonna transcend them. ID: 1826516 ·

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1826540 - Posted: 24 Oct 2016, 17:52:52 UTC Last modified: 24 Oct 2016, 17:55:47 UTC Update for the IvyBridge picture: http://lunatics.kwsn.info/index.php?action=dlattach;topic=1735.0;attach=11489;image Unlike Trinity, the computational parts of IvyBridge are well decoupled. GPU processing with MultiBeam adds some overhead for the CPU part, but not a big one. With the GPU busy, the almost linear scaling of the CPU part remains (red/black dots). But the iGPU on this device is weak and can't compensate for the loss of a CPU core. So 3CPU+GPU performs worse than 4 CPU cores with an idle GPU. The same holds for the other configs. SETI apps news We're not gonna fight them. We're gonna transcend them. ID: 1826540 ·

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1828038 - Posted: 2 Nov 2016, 23:16:02 UTC Last modified: 2 Nov 2016, 23:19:57 UTC Final update of the Trinity APU results. http://lunatics.kwsn.info/index.php/topic,1735.msg61161.html#msg61161 From the SETI point of view it's a 2-core CPU with a kind of hyperthreading. EDIT: maybe one more test is worth running. Pin one process to the first 2 CPUs and the second one to the last 2 CPUs (BTW, my test clearly showed that 0+1 is the first module and 2+3 the second module; that's how CPU# is mapped to hardware). This would keep the service thread from pre-empting the computing one and could result in slightly better performance (to what degree is the aim of the test). SETI apps news We're not gonna fight them. We're gonna transcend them. ID: 1828038 ·

petri33 Volunteer tester Joined: 6 Jun 02 Posts: 1668 Credit: 623,086,772 RAC: 156	Message 1828147 - Posted: 3 Nov 2016, 14:00:26 UTC - in response to Message 1826496. The next experiment will be to improve CPU performance by precisely pinning the CPU apps, similarly to the GPU ones. My experience with the AMD FX-8350, running 4 S@H tasks concurrently with each locked to a single CPU (0, 2, 4 and 6), has consistently been (in terms of RAC) that cpu4 has the highest RAC, cpu0 next, cpu2 next, and cpu6 is the slowest (if this is of any help). This is observation only, not rigorous testing. Not staggering the CPUs only slows down the paired CPUs (i.e. 0-1, 2-3, 4-5 and 6-7). I will be interested to see your rigorous results from a modern CPU. Ed F P.S. The results are similar with an Intel core-7 (gen-1) CPU: RAC order CPU 2, 4, 6, 0 Sounds like CPU 0 on your system serves a lot of interrupts (by default). Those odd-numbered CPUs (1, 3, 5, 7) are HT cores and share the FPU of their 'real core' pair. If I remember correctly, my Linux has cores 0-5 as real cores and 6-11 as their corresponding HT pairs on an i7-3930K. To overcome Heisenbergs: "You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones ID: 1828147 ·

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1828298 - Posted: 4 Nov 2016, 14:21:41 UTC Allowing each app to use both CPUs from the same module doesn't change performance: http://lunatics.kwsn.info/index.php?action=dlattach;topic=1735.0;attach=11497;image (x3xc dot) But allowing each app 2 different CPUs from different modules (x5xa) makes performance even worse than the xfxf case. So, on underloaded Bulldozer-based devices, direct affinity management with each app instance bound to a different module is strongly recommended. Just for completeness, the next test will be a fully loaded CPU part with each process pinned to a separate CPU. And then 2 CPU apps bound to different modules + a GPU app. Is it possible to achieve something better than the 4CPU+0GPU case through direct affinity management?... SETI apps news We're not gonna fight them. We're gonna transcend them. ID: 1828298 ·
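To see why x3xc and x5xa differ, it helps to map each mask onto modules. A rough sketch, assuming the mapping reported earlier in the thread for this Trinity host (CPUs 0+1 form the first module, 2+3 the second) and assuming each label is a pair of hex masks, one per app instance; the function name `modules_for_mask` is hypothetical:

```python
def modules_for_mask(mask, cpus_per_module=2):
    """Return the sorted module indices touched by an affinity mask.

    Assumes logical CPU n belongs to module n // cpus_per_module.
    """
    modules = set()
    cpu = 0
    while mask >> cpu:  # stop once no set bits remain above this position
        if (mask >> cpu) & 1:
            modules.add(cpu // cpus_per_module)
        cpu += 1
    return sorted(modules)


configs = {"x3xc": (0x3, 0xC), "x5xa": (0x5, 0xA), "xfxf": (0xF, 0xF)}
for label, masks in configs.items():
    print(label, [modules_for_mask(m) for m in masks])
```

Under these assumptions x3xc keeps each app inside one module, while in x5xa and xfxf each app can run on either module.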

EdwardPF Volunteer tester Joined: 26 Jul 99 Posts: 389 Credit: 236,772,605 RAC: 374	Message 1829757 - Posted: 11 Nov 2016, 3:04:30 UTC - in response to Message 1828147. Petri33: My experience indicates that on the AMD FX-8350 cpu0&1 share fpu0, cpu2&3 share fpu1, cpu4&5 share fpu2, and cpu6&7 share fpu3. Also, Windows seems to run on the highest-numbered CPU. On my core-7 (gen 1) cpu0&1 are HT twins, cpu2&3 are HT twins, cpu4&5 are HT twins, and cpu6&7 are HT twins. In both cases I try to keep cpu7 as lightly loaded as possible to allow the exec to run without interfering with the running BOINC WUs. This seems to work out well in my fun playing around ... no serious rigor here, just trying to understand the best hardware/work-unit combination. Ed F ID: 1829757 ·

EdwardPF Volunteer tester Joined: 26 Jul 99 Posts: 389 Credit: 236,772,605 RAC: 374	Message 1829762 - Posted: 11 Nov 2016, 3:08:10 UTC - in response to Message 1828298. Raistmer: I don't understand the nomenclature you are using on your graphs ... therefore I don't understand what they represent ... sigh ... how are you designating CPUs and CPU combinations etc ... Ed F ID: 1829762 ·

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1829788 - Posted: 11 Nov 2016, 6:54:51 UTC - in response to Message 1829762. Raistmer: I don't understand the nomenclature you are using on your graphs ... therefore I don't understand what they represent ... sigh ... how are you designating CPUs and CPU combinations etc ... Ed F xIxJ gives the affinity settings (in hex) used for a particular test. Affinity is a number in which each enabled bit means a CPU in use. So, for example, 0x2 means only CPU1 is used, 0x1 means CPU0 (and only it) is used, 0x3 = 11b means both CPU0 and CPU1 are used, and so on. 0xFF has its low 8 bits enabled (CPU0 through CPU7), so on a device with 8 or fewer CPUs it means no restrictions at all. SETI apps news We're not gonna fight them. We're gonna transcend them. ID: 1829788 ·
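The bit-per-CPU rule explained in the post above can be checked with a few lines of code. A minimal sketch; both helper names are hypothetical, and reading a graph label such as x3xc as two hex masks (one per app instance) is an assumption based on how the labels are used in this thread:

```python
def cpus_from_mask(mask):
    """List the logical CPU numbers whose bits are set in an affinity mask."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def parse_label(label):
    """Split a label like 'x3xc' into its hex masks (assumed format)."""
    return [int(part, 16) for part in label.split("x") if part]


print(cpus_from_mask(0x2))  # [1]
print(cpus_from_mask(0x3))  # [0, 1]
print([cpus_from_mask(m) for m in parse_label("x5xa")])  # [[0, 2], [1, 3]]
```

For 0xFF the same function returns CPUs 0 through 7, which covers every CPU on the 4- and 8-CPU devices discussed here.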

HAL9000 Volunteer tester Joined: 11 Sep 99 Posts: 6534 Credit: 196,805,888 RAC: 57	Message 1829804 - Posted: 11 Nov 2016, 12:01:27 UTC Hopefully I'll get some time over the weekend to run the iGPU on my Haswell DT and new Skylake laptop for more data points. SETI@home classic workunits: 93,865 CPU time: 863,447 hours Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu ID: 1829804 ·

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1834005 - Posted: 4 Dec 2016, 12:44:23 UTC Returning to this topic: I tried 4 CPUs with each app instance pinned to a single CPU. Device performance did not improve. SETI apps news We're not gonna fight them. We're gonna transcend them. ID: 1834005 ·

Mike Volunteer tester Joined: 17 Feb 01 Posts: 34255 Credit: 79,922,639 RAC: 80	Message 1834088 - Posted: 4 Dec 2016, 18:49:12 UTC - in response to Message 1834005. Last modified: 4 Dec 2016, 18:49:31 UTC Returning to this topic: I tried 4 CPUs with each app instance pinned to a single CPU. Device performance did not improve. Running on 3 CPU cores whilst using the GPU should give the best throughput. With each crime and every kindness we birth our future. ID: 1834088 ·

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1834115 - Posted: 4 Dec 2016, 20:53:48 UTC - in response to Message 1834088. Returning to this topic: I tried 4 CPUs with each app instance pinned to a single CPU. Device performance did not improve. Running on 3 CPU cores whilst using the GPU should give the best throughput. So far it returns the same performance as 4 CPUs only, without the GPU part. With the pinned variant I got slightly better throughput with all 5 parts of the device enabled, but I need to repeat it to be sure. After that I'll try 3 pinned CPU threads + GPU (GPU pinned by default). SETI apps news We're not gonna fight them. We're gonna transcend them. ID: 1834115 ·

rob smith Volunteer moderator Volunteer tester Joined: 7 Mar 03 Posts: 22182 Credit: 416,307,556 RAC: 380	Message 1834120 - Posted: 4 Dec 2016, 21:13:52 UTC The result may well be different if one used an AMD FX series processor as they have shared FPU units. Bob Smith Member of Seti PIPPS (Pluto is a Planet Protest Society) Somewhere in the (un)known Universe? ID: 1834120 ·

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 1834128 - Posted: 4 Dec 2016, 21:31:06 UTC - in response to Message 1834120. The result may well be different if one used an AMD FX series processor as they have shared FPU units. That's why I'm exploring specifically this kind of CPU. SETI apps news We're not gonna fight them. We're gonna transcend them. ID: 1834128 ·

©2024 University of California

SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.