My 2990WX

Message boards : Number crunching : My 2990WX
Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1970200 - Posted: 13 Dec 2018, 22:38:17 UTC

I'm in the middle of testing another approach to the memory bandwidth issue. I just switched from UMA mode to NUMA mode on my 2920X, and the biggest difference is in the memory latencies. I went from 99 ns down to 60 ns, which is what my memory latencies are with the Ryzen 2700X. The memory bandwidth halved as expected, from 96 GB/s to 48 GB/s, but the peak memory transfers went from 86 GB/s to 99 GB/s.

Since I rarely use more than about 3-4 GB of memory for running the desktop, background apps and BOINC, I shouldn't need much memory bandwidth. I am more interested in how fast the CPU tasks can access memory for their computations. After I have finished my memory stability testing I will put the host back online and see if my CPU task completion times come down and match my Ryzen 2700X times. Before this change, I was running CPU tasks in around 48 minutes. I'm hoping to get the CPU tasks down to the same 30-minute times as the 2700X. The difference in memory clocks is only 400 MHz and probably does not matter that much. I guess I will find out.
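
For anyone who wants to confirm from Linux which mode the board actually ended up in, a couple of standard commands will show it (a sketch, assuming the numactl package is installed):

# Show how many NUMA nodes the kernel sees and which CPUs/memory belong to each.
# In UMA (distributed) mode this reports a single node; in NUMA (Channel) mode
# each die with its own DIMMs shows up as a separate node.
lscpu | grep -i numa
numactl --hardware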
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1970200 · Report as offensive
Profile ML1
Volunteer moderator
Volunteer tester

Send message
Joined: 25 Nov 01
Posts: 20147
Credit: 7,508,002
RAC: 20
United Kingdom
Message 1970210 - Posted: 13 Dec 2018, 23:32:07 UTC - in response to Message 1970200.  
Last modified: 13 Dec 2018, 23:34:24 UTC

Please excuse me if I'm reading you wrong:

Don't get too hung up on the completion times for individual WUs? The significant factor is what total throughput is achieved for an aggregate of WUs...


Typically, using HT/SMT will slow down individual tasks, but can give a small overall speedup for multiple tasks interleaving their execution across the CPU resources...

Personal experience suggests a maximum speedup with Intel HT of about +30% across multiple tasks (130% throughput) compared with running non-HT (a nominal 100% throughput). However, if the CPU cache becomes overloaded/poisoned, or if there is contention for a CPU resource, then you can see some drastic slowdowns in trying to use HT!
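
To put rough numbers on that (illustrative only, not measurements): if a core finishes one task in 30 minutes without HT, it delivers 2.0 tasks/hour; with HT, two tasks share the core and might each take about 46 minutes, so the pair delivers roughly 2.6 tasks/hour, about 130% of the non-HT throughput, even though every individual task finishes more slowly.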


I've not acquired any recent hardware to test comparable effects for AMD.


Good sleuthing!

Happy cool crunchin',
Martin

"HT" (Intel "Hyper Threading") is Intel's Marketing jargon for "SMT" (Simultaneous Multi-Threading) as is now utilised by AMD for their Zen CPUs: where a single physical CPU core is shared (multiplexed) between two CPU tasks.
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 1970210 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1970213 - Posted: 13 Dec 2018, 23:49:33 UTC

I still use SMT. No changes there. Still running the same number of tasks and threads. Just running with local NUMA mode memory instead of UMA mode memory access. If I complete two tasks per hour per thread at 30 minutes each instead of two tasks per 90 minutes at 45 minutes each, my aggregate for the day goes up.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1970213 · Report as offensive
Profile Tom M
Volunteer tester

Send message
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1970268 - Posted: 14 Dec 2018, 6:52:13 UTC - in response to Message 1970210.  

Please excuse me if I'm reading you wrong:

Don't get too hung up on the completion times for individual WUs? The significant factor is what total throughput is achieved for an aggregate of WUs...


The Threadripper 2 (and 1, I guess) suffers from a memory model that leaves half of the CPU dies without direct access to the RAM. The article I posted above seems to do a good job of explaining the issue. As such, it is a hardware limitation that can only be changed by a redesigned CPU/CPU socket. Since AMD promised not to change the TR2 socket for 2 years, I don't expect to get out of this particular design limitation soon.

The times I quote are the "average" estimated completion times. I have had averages north of 2 hours for 56-60 cores down to someplace near 47 minutes for 30 cores.

After taking another short look at 40 cores and not seeing much improvement in a short run with the faster RAM, I am back at 30 cores (25 CPU, 3 GPU).

I am still looking for the "best" sweet spot but it appears to probably be near 30 cores.

Tom
A proud member of the OFA (Old Farts Association).
ID: 1970268 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1970270 - Posted: 14 Dec 2018, 7:48:06 UTC

No, only your chip, the 2990WX, and the 2970WX suffer from half the dies not having direct memory access. My chip, the 2920X, and the 2950X have both dies with direct access to memory. Only the 4-die Threadrippers have a pair of dies with no direct memory access. Look at the second diagram in your link showing the 2950X TR.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1970270 · Report as offensive
Profile ML1
Volunteer moderator
Volunteer tester

Send message
Joined: 25 Nov 01
Posts: 20147
Credit: 7,508,002
RAC: 20
United Kingdom
Message 1970291 - Posted: 14 Dec 2018, 12:18:35 UTC - in response to Message 1970268.  
Last modified: 14 Dec 2018, 12:19:03 UTC

I am still looking for the "best" sweet spot but it appears to probably be near 30 cores.

Unless the NUMA scheduling works to give some advantage, you might see the best results with SMT disabled, running natively one thread per core with minimum overhead/latency...?

I'm guessing that running non-SMT will also make the best use of all of the CPU cache for a speedup.
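
If you wanted to test the NUMA-scheduling side without touching the BIOS, one untested sketch would be to pin the BOINC client to the memory-attached nodes with numactl (the node numbers and client path below are assumptions, for a 2990WX in Channel mode on a Debian-style install):

# Hypothetical test: keep the client and all of its worker tasks on NUMA nodes 0 and 2,
# the dies that have local DIMMs, so nothing lands on the memory-less dies.
# Node numbers, service name and paths are assumptions for this sketch.
sudo systemctl stop boinc-client
sudo numactl --cpunodebind=0,2 --membind=0,2 /usr/bin/boinc --dir /var/lib/boinc-client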


Good sleuthing!

Happy cool fast crunchin',
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 1970291 · Report as offensive
Profile Tom M
Volunteer tester

Send message
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1970306 - Posted: 14 Dec 2018, 15:01:12 UTC - in response to Message 1970291.  

I am still looking for the "best" sweet spot but it appears to probably be near 30 cores.

Unless the NUMA scheduling works to give some advantage, you might see the best results with SMT disabled, running natively one thread per core with minimum overhead/latency...?

I'm guessing that running non-SMT will also make the best use of all of the CPU cache for a speedup.


Good sleuthing!

Happy cool fast crunchin',
Martin


For my bios, NUMA appears to be turned on by default. I think I tried a short run with UMA on CL16 ram that seemed to show it was slower. But I also think I was running 40 cores.

So another experiment could be SMT turned off (so the max is now 32 threads/cores) and UMA turned on. In theory I would always assume UMA would be slower. If the test load is still 26 threads, then a whole lot more of the processing is being done on CPUs without direct memory access, which should slow things down. Add UMA, which means even the direct-access CPUs are accessing memory the slower way, and it "oughta" be slower.

Sigh, yet another test :)

Tom
A proud member of the OFA (Old Farts Association).
ID: 1970306 · Report as offensive
Profile ML1
Volunteer moderator
Volunteer tester

Send message
Joined: 25 Nov 01
Posts: 20147
Credit: 7,508,002
RAC: 20
United Kingdom
Message 1970315 - Posted: 14 Dec 2018, 16:33:23 UTC - in response to Message 1970306.  

I am still looking for the "best" sweet spot but it appears to probably be near 30 cores.

Unless the NUMA scheduling works to give some advantage, you might see the best results with SMT disabled, running natively one thread per core with minimum overhead/latency...?

I'm guessing that running non-SMT will also make the best use of all of the CPU cache for a speedup.


Good sleuthing!

Happy cool fast crunchin',
Martin


For my bios, NUMA appears to be turned on by default. I think I tried a short run with UMA on CL16 ram that seemed to show it was slower. But I also think I was running 40 cores.

So another experiment could be SMT turned off (so the max is now 32 threads/cores) and UMA turned on. In theory I would always assume UMA would be slower. If the test load is still 26 threads, then a whole lot more of the processing is being done on CPUs without direct memory access, which should slow things down. Add UMA, which means even the direct-access CPUs are accessing memory the slower way, and it "oughta" be slower.

Sigh, yet another test :)

Hey! One change at a time! ;-)

So... What is actually being done to what effect with the BIOS NUMA/UMA control?

If your system really is NUMA-aware, then my guess is that the combination of no-SMT + NUMA should give the lowest latencies and make the best use of the CPU cache...

But that's just a guess upon reality... ;-)


Happy cool crunchin',
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 1970315 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1970334 - Posted: 14 Dec 2018, 17:39:37 UTC - in response to Message 1970315.  


Hey! One change at a time! ;-)

So... What is actually being done to what effect with the BIOS NUMA/UMA control?

If your system really is NUMA-aware, then my guess is that the combination of no-SMT + NUMA should give the lowest latencies and make the best use of the CPU cache...

But that's just a guess upon reality... ;-)


Happy cool crunchin',
Martin


My mobo defaults to UMA. I had to change off Auto to Channel mode to enable NUMA.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1970334 · Report as offensive
Profile Tom M
Volunteer tester

Send message
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1970347 - Posted: 14 Dec 2018, 19:39:04 UTC - in response to Message 1970334.  


My mobo defaults to UMA. I had to change off Auto to Channel mode to enable NUMA.


I thought that too. I first read that "channel" mode enabled NUMA, then I read some place else that channel mode disabled NUMA :(

So I am basing my claim that I have NUMA turned on, on that. Hmmmmm..... I remember where I got the "channel" but I don't remember where I got the contradiction.

Tom
A proud member of the OFA (Old Farts Association).
ID: 1970347 · Report as offensive
Profile Tom M
Volunteer tester

Send message
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1970354 - Posted: 14 Dec 2018, 20:03:27 UTC - in response to Message 1970347.  


My mobo defaults to UMA. I had to change off Auto to Channel mode to enable NUMA.


I thought that too. I first read that "channel" mode enabled NUMA, then I read some place else that channel mode disabled NUMA :(

So I am basing my claim that I have NUMA turned on, on that. Hmmmmm..... I remember where I got the "channel" but I don't remember where I got the contradiction.

Tom


Just found directions for turning NUMA on in the MSI forum; they said "Channel", so I have turned it on.

Now to run it long enough to see what changes.

Since we are now getting tasks with more varied run times, it may take a while to see how this shakes out.

Tom
A proud member of the OFA (Old Farts Association).
ID: 1970354 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1970357 - Posted: 14 Dec 2018, 20:10:49 UTC - in response to Message 1970347.  

A real easy check is to run the Intel Memory Latency Checker or MLC. First set the BIOS to run UMA mode and notice the MLC output describes the run as UMA with distributed memory nodes. Then change to Channel and run MLC again and you will see the output described as NUMA and the nodes are local to each die. You should see a big change in the latencies between the two modes.

Intel Memory Latency Checker

Installation
Intel® MLC supports both Linux and Windows.

Linux
Copy the mlc binary to any directory on your system
Intel® MLC dynamically links to the GNU C library (glibc/lpthread), and this library must be present on the system
Root privileges are required to run this tool as the tool modifies the H/W prefetch control MSR to enable/disable prefetchers for latency and b/w measurements. Refer to the readme documentation on running without root privileges.
The MSR driver (not part of the install package) should be loaded. This can typically be done with the 'modprobe msr' command if it is not already loaded.

Install the msr-tools package and then modprobe msr. Then go to the directory with the unpacked MLC and open a root terminal and start the app with ./mlc.
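
On a Debian/Ubuntu-type install those steps boil down to roughly this (the package name and download path are assumptions; other distros differ):

# Install the MSR tools and load the msr kernel module that MLC needs
sudo apt install msr-tools
sudo modprobe msr
# Change to wherever the MLC download was unpacked and run the Linux binary as root
cd ~/Downloads/mlc/Linux
sudo ./mlc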
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1970357 · Report as offensive
Profile Tom M
Volunteer tester

Send message
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1970511 - Posted: 15 Dec 2018, 19:30:12 UTC - in response to Message 1970357.  

A real easy check is to run the Intel Memory Latency Checker or MLC.


Thank you.

Sorry for the delay in replying. I am already in NUMA mode and haven't decided if I want to switch back yet. I finally got all the directions read, everything installed, and got a test run without SETI processing. I have already run it once using sudo, so this run is also as root.
tom@EJS-GIFT:~/Downloads$ cd ml_test
tom@EJS-GIFT:~/Downloads/ml_test$ cd Linux
tom@EJS-GIFT:~/Downloads/ml_test/Linux$ ./mlc
Intel(R) Memory Latency Checker - v3.6
Measuring idle latencies (in ns)...
		Numa node
Numa node	     0	     1	     2	     3	
       0	  67.0	-	 106.0	-	
       1	 108.7	-	 105.1	-	
       2	 105.9	-	  66.9	-	
       3	 105.2	-	 108.7	-	

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :	^C
Exiting...


If I am inferring correctly, the intersections at 0,0 and 2,2 are the two CPU dies with direct memory access and their latencies. So pretty much any access other than a direct one shows nearly the exact same slowdown.
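
A quick cross-check of that reading is to ask the kernel how much RAM it attributes to each node (standard sysfs paths, nothing extra to install):

# The dies without their own DIMMs should report (close to) 0 kB here
grep MemTotal /sys/devices/system/node/node*/meminfo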

I have a little time right now. I will re-boot and reset the memory channel mode to interleave.

Tom
A proud member of the OFA (Old Farts Association).
ID: 1970511 · Report as offensive
Profile Tom M
Volunteer tester

Send message
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1970516 - Posted: 15 Dec 2018, 19:41:39 UTC - in response to Message 1970511.  


I have a little time right now. I will re-boot and reset the memory channel mode to interleave.

Tom


Actually, I reset it to "auto" and got this result.

tom@EJS-GIFT:~/Downloads/ml_test$ cd Linux
tom@EJS-GIFT:~/Downloads/ml_test/Linux$ ls
mlc  mlc_avx512  redist.txt
tom@EJS-GIFT:~/Downloads/ml_test/Linux$ sudo ./mlc
[sudo] password for tom: 
Intel(R) Memory Latency Checker - v3.6
Measuring idle latencies (in ns)...
		Numa node
Numa node	     0	     1	     2	     3	
       0	  67.7	-	 105.8	-	
       1	 108.8	-	 105.2	-	
       2	 105.9	-	  67.4	-	
       3	 105.1	-	 108.6	-	

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :	^C
Exiting...


tom@EJS-GIFT:~/Downloads/ml_test/Linux$ ^C
tom@EJS-GIFT:~/Downloads/ml_test/Linux$ 


Which implies that the NUMA mode is the "default". I think?

It has about 4 choices including auto, channel and something about chip etc.

Tom
A proud member of the OFA (Old Farts Association).
ID: 1970516 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Avatar

Send message
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1970531 - Posted: 15 Dec 2018, 20:59:31 UTC
Last modified: 15 Dec 2018, 21:43:29 UTC

This is what I get in Auto or UMA mode.

root@Numbskull:/home/keith/Downloads/Utils/MLC# ./mlc
Intel(R) Memory Latency Checker - v3.5
Measuring idle latencies (in ns)...
            Memory node
Socket           0       1
     0        98.9    98.8
     1        98.9    98.8

This is what I get in Channel or NUMA mode.

root@Numbskull:/home/keith/Downloads/Utils/MLC# ./mlc
Intel(R) Memory Latency Checker - v3.5
Measuring idle latencies (in ns)...
            Numa node
Numa node        0       1
        0     60.6    99.8
        1    102.3    61.1

I ran for half a day yesterday in NUMA mode with its better memory latencies and higher transaction speeds . . . but I failed to notice a single improvement in any of my CPU or GPU task times.

Plus, the system locked up at 3:30 AM this morning and was down until I got up and noticed it. So I reverted to my previous, stable Auto memory configuration in UMA mode, since it apparently has no positive or negative effect on task times.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1970531 · Report as offensive
Profile Tom M
Volunteer tester

Send message
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1970623 - Posted: 16 Dec 2018, 6:42:35 UTC - in response to Message 1970531.  

This is what I get in Auto or UMA mode.

root@Numbskull:/home/keith/Downloads/Utils/MLC# ./mlc
Intel(R) Memory Latency Checker - v3.5
Measuring idle latencies (in ns)...
            Memory node
Socket           0       1
     0        98.9    98.8
     1        98.9    98.8


Unless I run into some spare time earlier, I will try out each of the options in the BIOS menu that includes "Channel" on Monday and see if any of them cause MLC to not display the "Numa node" heading.
Otherwise I may be looking in the wrong location in the BIOS, but that is the only "Channel" setting I have found.

Tom
A proud member of the OFA (Old Farts Association).
ID: 1970623 · Report as offensive
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1970633 - Posted: 16 Dec 2018, 8:07:39 UTC - in response to Message 1970623.  

Numa node	     0	     1	     2	     3	
       0	  67.7	-	 105.8	-	
       1	 108.8	-	 105.2	-	
       2	 105.9	-	  67.4	-	
       3	 105.1	-	 108.6	-	
With only 2 sticks of CL14 RAM with the 2990WX, you could very well be better off with your 4 sticks of CL16 in it.
Go back to your Diagram and see that only 1/4 of the cores have direct memory access now.
ID: 1970633 · Report as offensive
Profile Tom M
Volunteer tester

Send message
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1970675 - Posted: 16 Dec 2018, 14:50:43 UTC - in response to Message 1970633.  
Last modified: 16 Dec 2018, 14:55:27 UTC

Numa node	     0	     1	     2	     3	
       0	  67.7	-	 105.8	-	
       1	 108.8	-	 105.2	-	
       2	 105.9	-	  67.4	-	
       3	 105.1	-	 108.6	-	
With only 2 sticks of CL14 RAM with the 2990WX, you could very well be better off with your 4 sticks of CL16 in it.
Go back to your Diagram and see that only 1/4 of the cores have direct memory access now.


That's odd. I have 4 sticks of CL14 RAM. I haven't read the docs, so I hadn't realized that something about this info seemed to indicate that I only had 2 sticks.

Both the BIOS and the task manager agree that I have 32 GB of RAM, and the BIOS is recognizing 4 sticks.

Tom
A proud member of the OFA (Old Farts Association).
ID: 1970675 · Report as offensive
Profile Brent Norman Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester

Send message
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1970685 - Posted: 16 Dec 2018, 16:29:33 UTC - in response to Message 1970675.  

OK, sorry. I misread the chart in my early-morning, blurry-eyed state as channels 0,1,2,3 not being fully occupied.
ID: 1970685 · Report as offensive
Profile Tom M
Volunteer tester

Send message
Joined: 28 Nov 02
Posts: 5124
Credit: 276,046,078
RAC: 462
Message 1970758 - Posted: 17 Dec 2018, 5:21:29 UTC - in response to Message 1970685.  

OK, sorry. I misread the chart in my early-morning, blurry-eyed state as channels 0,1,2,3 not being fully occupied.


I have looked at the directions/manual and, based on the examples, I would come to the same conclusion as you. The columns across the top appear to be NUMA clusters, which I would have assumed meant banks of memory. The MB manual claims I have a quad-channel MB.

So I am confused.

It seems like each pair of memory channels is being treated as a single NUMA resource. Since each direct-access CPU die has 2 channel lines coming out of it in the block diagram, I can see how it might come up that way.
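
If you want to see how the four DIMMs map onto the dies, something like this should show it (a sketch, assuming the dmidecode and hwloc packages are installed):

# List the populated DIMM slots and their sizes
sudo dmidecode -t memory | grep -E 'Size|Locator'
# Show the package/die/NUMA-node layout as the kernel sees it
lstopo-no-graphics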

I have even loaded the BIOS defaults and got the same format (it probably was a bit slower, though :)

There are about 3 other settings I haven't tried yet, and I want to see if they generate different node results in the MLC test.

Back in a bit.
Tom
A proud member of the OFA (Old Farts Association).
ID: 1970758 · Report as offensive