APU load influence on total device throughput, MultiBeam

Marco Franceschini
Volunteer tester
Avatar

Send message
Joined: 4 Jul 01
Posts: 54
Credit: 69,877,354
RAC: 135
Italy
Message 1826288 - Posted: 23 Oct 2016, 12:18:24 UTC - in response to Message 1826273.  

Sure, here is a link to my own build of the FFTW 3.3.5 64-bit float libraries. I renamed them to 3.3.4 only so they can be used quickly under SETI@home.
https://drive.google.com/file/d/0B9iU4E_jpim0MEtZWDQtM2xmcGc/view?usp=sharing

Marco.
ID: 1826288 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1826328 - Posted: 23 Oct 2016, 16:25:10 UTC
Last modified: 23 Oct 2016, 16:33:18 UTC

It's just amazing how precisely total device performance (blue dots) stays the same regardless of which part of the device is loaded on Trinity:
http://lunatics.kwsn.info/index.php?action=dlattach;topic=1735.0;attach=11487;image
2CPU+GPU or 3CPU, it doesn't matter. Only the total number of running computational processes matters.
I consider this very strong evidence that the computational performance of this device is completely ruined by its inadequately small cache.
Maybe it has lower AVX throughput due to the shared FP modules... but the GPU is a separate part! Still, it behaves just like a 5th core. The bottleneck is definitely not in the FPU but in data transfer stalls.
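
A minimal sketch of how a "total device throughput" figure can be computed, assuming it is defined as tasks completed per hour, summed over all concurrently running processes. The actual methodology behind the plot is described in the linked Lunatics thread; the function name and the per-process task times are hypothetical inputs used only for illustration.

```python
def total_throughput(elapsed_seconds):
    """Sum tasks/hour over concurrently running processes.

    elapsed_seconds: one average task time (in seconds) per running
    process, whether it is a CPU app or the GPU app.
    """
    return sum(3600.0 / t for t in elapsed_seconds)
```

Comparing this sum for, say, a 2CPU+GPU run and a 3CPU run puts both configurations on the same footing.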

The next experiment will be to improve CPU performance by precisely pinning the CPU apps, as is already done for the GPU ones.
That will come after re-checking the 5th dot for the fully utilized device (it sits lower than the underloaded 4 CPU + idle GPU or 3 CPU + GPU configurations).
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1826328 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1826462 - Posted: 24 Oct 2016, 12:03:44 UTC - in response to Message 1826288.  

Sure, here is a link to my own build of the FFTW 3.3.5 64-bit float libraries. I renamed them to 3.3.4 only so they can be used quickly under SETI@home.
https://drive.google.com/file/d/0B9iU4E_jpim0MEtZWDQtM2xmcGc/view?usp=sharing

Marco.

Thanks.
I checked once more on the Trinity host: the "stock" x64 DLL doesn't generate AVX codelets.
Your AVX x64 3.3.5 build does.
But there is an issue that prevents distributing your DLL in place of the current one:
when I tried the AVX2 version it raised an exception, and the SSSE3 one again didn't generate AVX codelets. That is, the builds are tightly bound to a specific SIMD version, which prevents generic distribution. Is it possible to get a more neutral build that detects the available SIMD set and uses it if it's safe (just as the "stock" 3.3.4 x86 DLL does, for example)?
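
To illustrate what a "neutral" build would have to do, here is a rough sketch of runtime selection: given the SIMD features the host reports, pick the most capable build that is still safe to load. This is not how FFTW itself dispatches codelets; the feature names, the preference order and the pick_build helper are assumptions made up for the example.

```python
# Preference order, most capable first (an assumed ranking for illustration).
PREFERENCE = ["avx2", "avx", "sse4_2", "sse4_1", "ssse3", "sse2"]


def pick_build(supported, available):
    """Return the best available build whose SIMD requirement the host meets.

    supported: set of SIMD feature names reported by the host.
    available: iterable of build names, each named after the SIMD set it needs.
    Returns None when no build is safe to load.
    """
    available = set(available)
    for feature in PREFERENCE:
        if feature in available and feature in supported:
            return feature
    return None


# A host with AVX but no AVX2 gets the AVX build, not the AVX2 one.
print(pick_build({"sse2", "ssse3", "sse4_1", "sse4_2", "avx"},
                 ["avx2", "avx", "ssse3"]))
```

A real check would also have to confirm that the OS saves the AVX registers, which is part of what "safe" means here.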
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1826462 · Report as offensive
Marco Franceschini
Volunteer tester
Avatar

Send message
Joined: 4 Jul 01
Posts: 54
Credit: 69,877,354
RAC: 135
Italy
Message 1826466 - Posted: 24 Oct 2016, 12:26:33 UTC

Hi Raistmer, I did in fact build the FFTW 3.3.5 libraries with incremental support for SIMD instruction sets (i.e. they are not neutral):
SSSE3 version: built for my CPUs that have this SIMD set (Q6600, E6600, etc.), with the enable-sse2 switch on;
AVX2 version: for Haswell CPUs and, in general, all CPUs that have this SIMD set, with the enable-avx2 switch on, which generates codelets for register->sse2->avx->avx2;
AVX version: for Ivy Bridge cores;
SSE4.1 version: for CPUs like my Q9450; SSE4.2 version: for Ivy Bridge without AVX support.
I'll build a more neutral version of FFTW 3.3.5.

Marco.
ID: 1826466 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1826492 - Posted: 24 Oct 2016, 14:22:23 UTC
Last modified: 24 Oct 2016, 14:22:38 UTC

Methodology for affinity-related benchmarking: http://lunatics.kwsn.info/index.php/topic,1735.msg61155.html#msg61155
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1826492 · Report as offensive
EdwardPF
Volunteer tester

Send message
Joined: 26 Jul 99
Posts: 389
Credit: 236,772,605
RAC: 374
United States
Message 1826496 - Posted: 24 Oct 2016, 14:51:36 UTC - in response to Message 1826328.  

The next experiment will be to improve CPU performance by precisely pinning the CPU apps, as is already done for the GPU ones.


My experience with the AMD FX-8350, running 4 S@H tasks concurrently with each locked to a single CPU (0, 2, 4 and 6), has consistently been (in terms of RAC) that cpu4 is highest, cpu0 next, cpu2 next, and cpu6 slowest (if this is of any help). This is observation only, not rigorous testing. Not staggering the CPUs only slows down the paired CPUs (i.e. 0-1, 2-3, 4-5 and 6-7).

I will be interested to see your rigorous results from a modern cpu.

Ed F

P.S. The results are similar with an Intel Core i7 (gen 1).

CPU RAC order: CPU 2, 4, 6, 0
ID: 1826496 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1826516 - Posted: 24 Oct 2016, 16:27:05 UTC - in response to Message 1826496.  

Test in progress. Because I am also trying to get some averaging, each run takes quite a lot of time (>2h per run). So far the data for affinity 0x01 + 0x02 has been collected but not processed; 0x01 + 0x04 (different modules) is in progress.
Some updates to the older tests with GPU + CPU load are coming in the background.
I hope to post an updated picture for IvyBridge soon.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1826516 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1826540 - Posted: 24 Oct 2016, 17:52:52 UTC
Last modified: 24 Oct 2016, 17:55:47 UTC

Update for IvyBridge picture: http://lunatics.kwsn.info/index.php?action=dlattach;topic=1735.0;attach=11489;image
Unlike Trinity, IvyBridge's computational parts are well decoupled. GPU processing with MultiBeam adds some overhead for the CPU part, but not a big one. With the GPU busy, the CPU part still scales almost linearly (red/black dots). But the iGPU on this device is weak and can't compensate for the loss of a CPU core. So 3CPU+GPU performs worse than 4 CPU cores with an idle GPU. The same holds for the other configs.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1826540 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1828038 - Posted: 2 Nov 2016, 23:16:02 UTC
Last modified: 2 Nov 2016, 23:19:57 UTC

Final update of Trinity APU results.
http://lunatics.kwsn.info/index.php/topic,1735.msg61161.html#msg61161
From the point of view of SETI processing, it is a 2-core CPU with a kind of hyperthreading.

EDIT: maybe one more test is worth running. Pin one process to the first 2 CPUs and the second one to the last 2 CPUs (BTW, my test clearly showed that 0+1 is the first module and 2+3 is the second; that's how CPU numbers map to hardware).
This would keep the service thread from pre-empting the computing one and could result in slightly better performance (to what degree is the aim of the test).
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1828038 · Report as offensive
Profile petri33
Volunteer tester

Send message
Joined: 6 Jun 02
Posts: 1668
Credit: 623,086,772
RAC: 156
Finland
Message 1828147 - Posted: 3 Nov 2016, 14:00:26 UTC - in response to Message 1826496.  

The next experiment will be to improve CPU performance by precisely pinning the CPU apps, as is already done for the GPU ones.


My experience with the AMD FX-8350, running 4 S@H tasks concurrently with each locked to a single CPU (0, 2, 4 and 6), has consistently been (in terms of RAC) that cpu4 is highest, cpu0 next, cpu2 next, and cpu6 slowest (if this is of any help). This is observation only, not rigorous testing. Not staggering the CPUs only slows down the paired CPUs (i.e. 0-1, 2-3, 4-5 and 6-7).

I will be interested to see your rigorous results from a modern cpu.

Ed F

P.S. The results are similar with an Intel Core i7 (gen 1).

CPU RAC order: CPU 2, 4, 6, 0


Sounds like CPU 0 on your system serves a lot of interrupts (by default).
The odd-numbered CPUs (1, 3, 5, 7) are HT cores and share the FPU of their 'real core' pair.

If I remember correctly, on my i7-3930K Linux numbers the real cores 0-5 and their corresponding HT pairs 6-11.
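
On Linux the pairing does not have to be remembered: the kernel exposes it in sysfs. A small sketch, assuming a Linux host where /sys/devices/system/cpu/cpuN/topology/thread_siblings_list is present (the helper names are made up for this example; on other systems thread_siblings simply returns None):

```python
from pathlib import Path


def parse_cpu_list(text):
    """Parse a kernel CPU list such as '0,6' or '0-1,4' into sorted CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def thread_siblings(cpu):
    """Return the logical CPUs sharing a core with `cpu`, or None without sysfs."""
    path = Path("/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list" % cpu)
    if not path.exists():
        return None
    return parse_cpu_list(path.read_text())


if __name__ == "__main__":
    print(thread_siblings(0))
```

A numbering like the one described above would show up as CPU 0 being listed together with CPU 6.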
To overcome Heisenbergs:
"You can't always get what you want / but if you try sometimes you just might find / you get what you need." -- Rolling Stones
ID: 1828147 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1828298 - Posted: 4 Nov 2016, 14:21:41 UTC

Allowing each app to use both CPUs of the same module doesn't change performance:
http://lunatics.kwsn.info/index.php?action=dlattach;topic=1735.0;attach=11497;image (x3xc dot)
But allowing each app 2 different CPUs from different modules (x5xa) makes performance even worse than the xfxf case.

So, on underloaded Bulldozer-based devices, direct affinity management that binds each app instance to a different module is strongly recommended.

Just for completeness, the next test will be a fully loaded CPU part with each process pinned to a separate CPU.

And then: 2 CPU apps bound to different modules + a GPU app.
Is it possible to achieve something better than the 4CPU+0GPU case through direct affinity management?...
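
The difference between the x3xc and x5xa cases can be checked mechanically. The sketch below assumes that each hex value after an "x" in a label is the affinity mask of one app instance, and that CPUs 0+1 form the first module and 2+3 the second, as found earlier in this thread; the helper names are hypothetical.

```python
def parse_label(label):
    """Split a label such as 'x3xc' into one affinity mask per app instance."""
    return [int(part, 16) for part in label.split("x") if part]


def modules_for_mask(mask, cpus_per_module=2):
    """Return the sorted module numbers touched by the CPUs set in `mask`."""
    modules = set()
    cpu = 0
    while mask:
        if mask & 1:
            modules.add(cpu // cpus_per_module)
        mask >>= 1
        cpu += 1
    return sorted(modules)


for label in ("x1x4", "x3xc", "x5xa", "xfxf"):
    print(label, [modules_for_mask(m) for m in parse_label(label)])
```

With x3xc each instance stays inside its own module, while with x5xa both instances are allowed onto both modules.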
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1828298 · Report as offensive
EdwardPF
Volunteer tester

Send message
Joined: 26 Jul 99
Posts: 389
Credit: 236,772,605
RAC: 374
United States
Message 1829757 - Posted: 11 Nov 2016, 3:04:30 UTC - in response to Message 1828147.  

Petri33:

My experience indicates that on the AMD FX-8350 cpu0&1 share fpu0, cpu2&3 share fpu1, cpu4&5 share fpu2, and cpu6&7 share fpu3.

Also, Windows seems to run on the highest-numbered CPU.

On my Core i7 (gen 1), cpu0&1 are HT twins, cpu2&3 are HT twins, cpu4&5 are HT twins, and cpu6&7 are HT twins.

In both cases I try to keep cpu7 as lightly loaded as possible so the exec can run without interfering with the running BOINC WUs.

This seems to work out well in my fun playing around... no serious rigor here, just trying to understand the best hardware/work-unit combination.

Ed F
ID: 1829757 · Report as offensive
EdwardPF
Volunteer tester

Send message
Joined: 26 Jul 99
Posts: 389
Credit: 236,772,605
RAC: 374
United States
Message 1829762 - Posted: 11 Nov 2016, 3:08:10 UTC - in response to Message 1828298.  

Raistmer:

I don't understand the nomenclature you are using on your graphs ... therefore I don't understand what they represent ... sigh ...

How are you designating CPUs, CPU combinations, etc.?

Ed F
ID: 1829762 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1829788 - Posted: 11 Nov 2016, 6:54:51 UTC - in response to Message 1829762.  

Raistmer:

I don't understand the nomenclature you are using on your graphs ... therefore I don't understand what they represent ... sigh ...

How are you designating CPUs, CPU combinations, etc.?

Ed F

xIxJ denotes the affinity settings (in hex) used for a particular test.
An affinity value is a number in which each set bit means that CPU is in use.
So, for example, 0x2 means only CPU1 is used; 0x1 means CPU0 is used (and only it).
0x3 = 11b => both CPU0 and CPU1 are used, and so on.
0xFF enables the low 8 bits (all 32 bits would be 0xFFFFFFFF), which covers every CPU on a host with up to 8 logical CPUs, so there are no restrictions at all.
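
For anyone who wants to decode a mask without doing the binary in their head, here is a short sketch (the function names are made up for this example):

```python
def cpus_from_mask(mask):
    """Return the CPU numbers whose bits are set in an affinity mask."""
    cpus = []
    bit = 0
    while mask:
        if mask & 1:
            cpus.append(bit)
        mask >>= 1
        bit += 1
    return cpus


def mask_from_cpus(cpus):
    """Build an affinity mask with one bit set per listed CPU number."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


for text in ("0x1", "0x2", "0x3", "0xFF"):
    print(text, "->", cpus_from_mask(int(text, 16)))
```

Going the other way, mask_from_cpus([0, 2]) gives 0x5, i.e. CPU0 and CPU2.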
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1829788 · Report as offensive
Profile HAL9000
Volunteer tester
Avatar

Send message
Joined: 11 Sep 99
Posts: 6534
Credit: 196,805,888
RAC: 57
United States
Message 1829804 - Posted: 11 Nov 2016, 12:01:27 UTC

Hopefully I'll get some time over the weekend to run the iGPU on my Haswell DT and new Skylake laptop for more data points.
SETI@home classic workunits: 93,865 CPU time: 863,447 hours
Join the BP6/VP6 User Group: http://tinyurl.com/8y46zvu
ID: 1829804 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1834005 - Posted: 4 Dec 2016, 12:44:23 UTC

Returning to this topic: I tried 4 CPUs with each app instance pinned to a single CPU. Device performance did not improve.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1834005 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Avatar

Send message
Joined: 17 Feb 01
Posts: 34255
Credit: 79,922,639
RAC: 80
Germany
Message 1834088 - Posted: 4 Dec 2016, 18:49:12 UTC - in response to Message 1834005.  
Last modified: 4 Dec 2016, 18:49:31 UTC

Returning to this topic: I tried 4 CPUs with each app instance pinned to a single CPU. Device performance did not improve.


Running on 3 CPU cores whilst using GPU should give best throughput.


With each crime and every kindness we birth our future.
ID: 1834088 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1834115 - Posted: 4 Dec 2016, 20:53:48 UTC - in response to Message 1834088.  

Returning to this topic: I tried 4 CPUs with each app instance pinned to a single CPU. Device performance did not improve.


Running on 3 CPU cores whilst using GPU should give best throughput.


So far it gives the same performance as 4 CPUs only, without the GPU part.
With the pinned variant I got slightly better throughput with all 5 parts of the device enabled, but I need to repeat the test to be sure.
After that I'll try 3 pinned CPU threads + GPU (the GPU app is pinned by default).
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1834115 · Report as offensive
rob smith Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer moderator
Volunteer tester

Send message
Joined: 7 Mar 03
Posts: 22182
Credit: 416,307,556
RAC: 380
United Kingdom
Message 1834120 - Posted: 4 Dec 2016, 21:13:52 UTC

The result may well be different on an AMD FX series processor, as they have shared FPUs.
Bob Smith
Member of Seti PIPPS (Pluto is a Planet Protest Society)
Somewhere in the (un)known Universe?
ID: 1834120 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1834128 - Posted: 4 Dec 2016, 21:31:06 UTC - in response to Message 1834120.  

The result may well be different on an AMD FX series processor, as they have shared FPUs.

That is exactly why I am exploring this kind of CPU.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1834128 · Report as offensive