Ryzen 16T / 8C vs. 8T / 8C. How much better?

Gene Project Donor

Joined: 26 Apr 99
Posts: 150
Credit: 48,393,279
RAC: 118
United States
Message 1936966 - Posted: 24 May 2018, 17:57:55 UTC

The answer is 19%.

So, what was the question? When using a Ryzen 1700, with 8 cores capable of running 16 concurrent threads, how much improvement can one expect in Seti processing with 16 threads compared to just 8 threads?

In the Ryzen architecture, each of the 8 cores has multiple floating-point and integer logic units. Code optimization by the compiler, in addition to hardware architecture features, aims to utilize all of a core's resources. But some logic resources are inevitably left idle, especially when the local cache cannot supply the required data. Having a second process (thread) executing in a core therefore increases the utilization of the hardware resources, but not without the adverse effect of two processes contending for the same fixed resources.

Here's the system being tested:
Ryzen 7 1700 at 3.2 GHz. DDR4 memory at 2667 MHz. Linux 4.15.11 kernel.
The CPU application is: MBv8_8.22r3711_sse41.
The GPU application is: setiathome_x41p_zi3v (cuda90)
The test work units are all "blc05_2bit_guppi...vlar" with estimated computation sizes of 20384 to 21162 GFLOPs.

Here are the measurements. For full details, see the "Method" paragraph below.
[list]
In an 8 thread (in 8 cores) configuration:
  Average estimated computation size: 20703 GFLOPs
  Average CPU execution time: 2609.6 seconds
  Standard deviation: +/- 40.88 seconds (1.6%)**
  Work units per day (per core): 33.1
  x8 cores: 264.8

In a 16 thread (in 8 cores) configuration:
  Average estimated computation size: 20714 GFLOPs
  Average CPU execution time: 4379.9 seconds
  Standard deviation: +/- 274.96 seconds (6.3%)**
  Work units per day (per thread): 19.7
  x16 threads: 315.2
[/list]
And 315.2 is 119% of 264.8. QED
**Note that multi-threading introduces additional variation in execution time as a result of (more or less) random contention between concurrent threads.
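
For anyone who wants to check the arithmetic, here is a minimal Python sketch (values copied from the table above; the small differences from the posted totals are just rounding of the per-thread rates):

[code]
# Reproduce the work-units-per-day arithmetic from the measurements above.
SECONDS_PER_DAY = 86400.0

def wu_per_day(avg_seconds, n_workers):
    """Work units per day for n_workers each taking avg_seconds per task."""
    return SECONDS_PER_DAY / avg_seconds * n_workers

t8  = wu_per_day(2609.6, 8)     # 8T/8C  -> ~264.9 WU/day
t16 = wu_per_day(4379.9, 16)    # 16T/8C -> ~315.6 WU/day
print(f"8T/8C:  {t8:.1f} WU/day")
print(f"16T/8C: {t16:.1f} WU/day")
print(f"gain:   {(t16 / t8 - 1) * 100:.0f}%")   # ~19%
[/code]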

"method" i.e. Here's what I did.
The aim was to configure a "typical" Seti host that runs Seti tasks at essentially full capacity while assuming some resources are being used for other user activities and/or other BOINC projects.
(1) Set NNT (no new tasks) with a full task cache. Choose, and suspend, 40 nearly identical work units to be used as samples.
(2) Run the 8T/8C case.
(In Linux, one uses "chcpu" to take logical CPUs offline so the scheduler cannot assign 2 threads to any core; see the sketch after step (4).)
(2a) Establish 6 concurrent Seti CPU tasks + 1 Seti GPU task + 1 other BOINC (NFS@home) to fully load all 8 cores;
(2b) "resume" 20 sample work units. Do not run any other user apps.
(2c) monitor until all sample work units have completed. (~2.5 hours)

(3) Run the 16T/8C case.
(3a) Establish 13 concurrent Seti CPU tasks + 1 Seti GPU task + 2 other BOINC (NFS@home and asteroids@home) to fully load all 16 threads;
(3b) "resume" 20 sample work units. Do not run any other user apps.
(3c) monitor until all sample work units have completed. (~2 hours)

(4) Check Seti task status periodically until all sample tasks have validated. (Record the credit awarded. Of some interest for credit vs. cpu time statistics, but not directly relevant to this multi-thread issue.)
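
A side note on step (2): before running "chcpu" one has to know which logical CPUs are siblings on the same core. Here is a minimal Python sketch that reads the kernel's topology files and prints the chcpu commands (it assumes the usual Linux sysfs layout, and only prints the commands; run them as root):

[code]
# Print the "chcpu -d" commands that would take the second SMT sibling of
# each core offline, so the scheduler can never co-schedule two threads
# on one core. Linux only; prints the commands instead of running them.
import glob

seen = set()
for path in sorted(glob.glob(
        "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list")):
    with open(path) as f:
        pair = f.read().strip()        # e.g. "0,8" or "0-1"
    if pair in seen:                   # each core's list appears once per sibling
        continue
    seen.add(pair)
    cpus = pair.replace("-", ",").split(",")
    if len(cpus) > 1:                  # keep the first thread, offline the rest
        print("sudo chcpu -d " + ",".join(cpus[1:]))
[/code]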

Final thoughts - I have no control over how the OS schedules apps onto cores and threads, and just hope that an average over 20 tasks is realistic. The benefit of multi-threading could very well be greater with a different mix of applications. Different results, no doubt, on other CPU architectures and on Windows.
ID: 1936966
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1936968 - Posted: 24 May 2018, 18:20:42 UTC - in response to Message 1936966.  

**Note that multi-threading introduces additional variation in execution time as a result of (more or less) random contention between concurrent threads.

Some people here with Ryzen systems make use of locking the running application to a particular thread to help improve performance.
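
On Linux that sort of pinning can be done with "taskset" or, for example, from Python; a minimal sketch (the PID and CPU number are placeholders, not values from this thread):

[code]
# Pin an already-running process to one logical CPU (Linux only).
# Equivalent to: taskset -cp 3 12345
import os

pid = 12345                        # hypothetical PID of a running SETI task
os.sched_setaffinity(pid, {3})     # restrict it to logical CPU 3
print(os.sched_getaffinity(pid))   # confirm -> {3}
[/code]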
Grant
Darwin NT
ID: 1936968
Cosmic_Ocean
Joined: 23 Dec 00
Posts: 3027
Credit: 13,516,867
RAC: 13
United States
Message 1936972 - Posted: 24 May 2018, 18:39:31 UTC

Your results seem to be about what I observed when I put my 2700X build together this past weekend. I know Cinebench 15 isn't fully exhaustive as a definitive answer, but I ran the single-thread test and then the full gamut with all 16 threads, and it came out at 10.24x. Quick math: 10.24 / 8 = 1.28, so about a 28% increase over what you'd expect out of 8 real cores.

I suspect it depends entirely on the type of processing being run, resource sharing, and so forth. In this case with Cinebench, all of the cores were effectively working on the same task, so they had some shared memory/resources. Running independent single-threaded tasks would be less efficient, since less of the cached data is shared.

These aren't definitive statistics, but the ballpark for those extra 8 threads seems to be about +20-30%.
Linux laptop:
record uptime: 1511d 20h 19m (ended due to the power brick giving-up)
ID: 1936972
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1937029 - Posted: 25 May 2018, 7:52:44 UTC

Very informative post. Glad I found it. I still cling to the axiom that run_time should equal cpu_time. Since I am primarily a GPU-focused cruncher, I am mostly concerned with keeping the GPUs well fed; any ancillary CPU work is just gravy. I have found that 8 CPU tasks locked onto the 8 physical cores do respectable work.

Thanks for the well designed experiment and for posting the results.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1937029
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1937067 - Posted: 25 May 2018, 10:58:00 UTC - in response to Message 1936966.  

Thanks for the study.


(2a) Establish 6 concurrent Seti CPU tasks + 1 Seti GPU task + 1 other BOINC (NFS@home) to fully load all 8 cores;

(3a) Establish 13 concurrent Seti CPU tasks + 1 Seti GPU task + 2 other BOINC (NFS@home and asteroids@home) to fully load all 16 threads;


If you have the time and inclination, could you please repeat a similar experiment on the same hardware with all cores kept busy by SETI CPU tasks?

That is:
(2a_mod) Establish 8 concurrent Seti CPU tasks + 0 Seti GPU tasks + 0 other BOINC tasks to fully load all 8 cores;

(3a_mod) Establish 16 concurrent Seti CPU tasks + 0 Seti GPU tasks + 0 other BOINC tasks to fully load all 16 threads;

It would be quite interesting to see whether that 19% improvement still applies.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1937067
Gene Project Donor

Joined: 26 Apr 99
Posts: 150
Credit: 48,393,279
RAC: 118
United States
Message 1937165 - Posted: 25 May 2018, 23:47:35 UTC

@Grant
With N threads greater than M cores, some cores will have to run 2 threads. I'm trusting the Linux scheduler to do its best in thread allocation. I believe it will try to keep "compute bound" tasks (such as Seti) in the same core. I'm not sure I can intervene in any intelligent way to make better cpu/core affinity decisions. My observation (via "top") of the task/processor assignments shows Seti tasks stick with the same core(s) for very long times - many minutes. However, I have also seen situations in which one, or more, core(s) are essentially idle, with %CPU less than 5%, while some other core has two Seti tasks/threads competing head-to-head at 100%. Perhaps there is a kernel parameter, or setting, that would "encourage" the scheduler not to assign two Seti tasks, or any long-running tasks for that matter, to the same core until all 8 cores are in use. That, of course, runs counter to the strategy of not moving a running task to a different core. Not a straightforward decision for the kernel to make. More research needed...
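
For the record, the per-task core assignment that "top" shows can also be logged from a script; a minimal sketch with a placeholder PID (field 39 of /proc/<pid>/stat is the CPU the task last ran on):

[code]
# Log which CPU a task last ran on, every 5 seconds (Linux only).
import time

pid = 12345                                  # placeholder PID of a Seti task
for _ in range(10):
    with open(f"/proc/{pid}/stat") as f:
        # split after the "(comm)" field so spaces in the name are safe;
        # "processor" is stat field 39, i.e. index 36 after the split
        fields = f.read().rsplit(")", 1)[1].split()
    print("last ran on CPU", fields[36])
    time.sleep(5)
[/code]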

@Raistmer
Will do... At the moment all tasks in my cache are about 16K GFLOPs (est. computation size), compared to the 20K of the previous samples. I would like to do your suggested experiment with work units of approximately the same computation size, just to keep things comparable. I run through the 100-task cache fairly quickly, so it is likely I'll get more work units of the right size in a day or two. I won't wait "forever", but your experiment is on my "to do" list. (I'll have to sacrifice several hours of Seti GPU production... but it's all in a good cause!)
ID: 1937165
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13736
Credit: 208,696,464
RAC: 304
Australia
Message 1937168 - Posted: 25 May 2018, 23:56:05 UTC - in response to Message 1937165.  
Last modified: 25 May 2018, 23:56:52 UTC

My observation (via "top") of the task/processor assignments shows Seti tasks stick with the same core(s) for very long times - many minutes. However, I have also seen situations in which one, or more, core(s) are essentially idle, with %CPU less than 5%, while some other core has two Seti tasks/threads competing head-to-head at 100%.

Very different from Windows on my i7s.
On the occasions when I've been at the system while it's low on work and there are several unused cores, the Task Manager graphs show all cores being used, with peaks & troughs in per-core usage.
I think it's only when half or fewer of the total cores are in use that the odd core shows zero usage, but it doesn't last long, as the work tends to move around the cores.
The only cores that don't vary are the ones supporting my GPUs as I've used the -cpu_lock command to hold them to a specific core.
Grant
Darwin NT
ID: 1937168
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1937189 - Posted: 26 May 2018, 1:00:44 UTC - in response to Message 1937165.  

My observation (via "top") of the task/processor assignments shows Seti tasks stick with the same core(s) for very long times - many minutes. However, I have also seen situations in which one, or more, core(s) are essentially idle, with %CPU less than 5%, while some other core has two Seti tasks/threads competing head-to-head at 100%.

@Gene
Not what I am observing with my Ryzens. Even though I have CPU tasks assigned to even-numbered cores, I see those CPU tasks move around a bit to other even-numbered cores. It seems to be the ondemand governor moving tasks around, in conjunction with the XFR2 core load balancing inherent to Ryzen. I usually see a GPU task move to a different odd-numbered core only when the task finishes, mainly because they only run for about two minutes.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1937189
Shaggie76
Joined: 9 Oct 09
Posts: 282
Credit: 271,858,118
RAC: 196
Canada
Message 1937249 - Posted: 26 May 2018, 12:54:49 UTC

That's an interesting experiment, and curiously it arrives at about the same figure as benchmarks I did years ago on some of my own software: about 20%.

I haven't tested it, but from what I know of the architecture I would expect more of a difference on systems with slow memory (i.e. laptops), where more time is spent waiting for caches to fill.

The real question in my mind, though, is what the power draw is in each scenario: 20% more throughput for 20% more power consumption, or is it cheaper than that?
ID: 1937249
Gene Project Donor

Joined: 26 Apr 99
Posts: 150
Credit: 48,393,279
RAC: 118
United States
Message 1937825 - Posted: 31 May 2018, 17:04:02 UTC

@Raistmer
Here are the results with the experiment adjusted to load the cores exclusively with Seti CPU tasks. Either 8x or 16x, respectively, for the two test cases. As before, I chose work units as nearly identical as possible. For this test they were all of the "blc12_...vlar" form.

    In an 8 thread (in 8 cores) configuration:
    Average estimated computation size: 19739 GFLOPs (just slightly smaller than before)
    Average CPU execution time: 2728.8 seconds
    Standard deviation: +/- 40.23 seconds
    Work units per day (per core): 31.7
    x8 cores: 253.3 (~6% less than before)

    In a 16 thread (in 8 cores) configuration:
    Average estimated computation size: 19770 GFLOPs
    Average CPU execution time: 4941.0 seconds
    Standard deviation: +/- 140.0 seconds
    Work units per day (per thread): 17.5
    x16 threads: 279.8

So, a net gain of ~11% in 16T vs. 8T.

@Shaggie
For this experiment I looked at power consumption at a time when all cores/threads were under a steady load, using one of those "Kill A Watt" AC line power monitors. For 8T/8C it was 165 watts; for 16T/8C it was 171 watts. Roughly +4% power for an 11% production gain makes it a net positive in that respect. The "real" CPU power likely increases more substantially, but it is swamped by the overhead of motherboard, RAM, disk, and video power consumption.
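
Folding those wall-power readings into per-task energy makes the net win explicit; a quick check of the arithmetic:

[code]
# Energy per work unit from the Kill A Watt readings above.
for label, watts, wu_per_day in (("8T/8C", 165, 253.3),
                                 ("16T/8C", 171, 279.8)):
    wh_per_wu = watts * 24 / wu_per_day      # watt-hours per work unit
    print(f"{label}: {wh_per_wu:.1f} Wh per work unit")
# -> ~15.6 Wh/WU at 8T vs ~14.7 Wh/WU at 16T: about 6% less energy per task
[/code]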

The thoughtful reader (all of you!) will be wondering: "What is the impact on GPU productivity when additional CPU threads are active?" I did another test with 7 Seti CPU + 1 Seti GPU and compared (just the GPU production) with 15 Seti CPU + 1 Seti GPU. The loss was about 2.5%: the average elapsed time rose from 206.8 seconds to 212.4 seconds, based on a sample of 20 GPU tasks in each test.

In the original post I didn't state that the GPU is an EVGA GTX 1060 SC (6GB) since the objective was to examine strictly the Ryzen core/thread context. But it is relevant here if one is interested in GPU times. Refer to the original post for the GPU application that was used.

ID: 1937825
Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1937842 - Posted: 31 May 2018, 20:59:17 UTC

Great post, Gene. And now Shaggie76 can put some numbers to the guesswork of how the CUDA special app and Ryzen CPU core loading behave under the Anonymous platform.

I would be curious how much a multiple-GPU system impacts the run_time and production per day of CPU tasks, like my 4-card system. It is my weakest CPU system, with only a 6c/12t CPU, and its CPU task production per day is about half that of the Ryzens, based on the BoincStats daily and weekly production numbers. But it is also my best performing GPU task system, with 60% more GPU production per day with only 1 card more than all my 3-card systems.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1937842
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1937942 - Posted: 1 Jun 2018, 21:03:24 UTC

Thanks
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1937942
