Message boards :
Number crunching :
CES 2017 -- AMD RYZEN CPU
Author | Message |
---|---|
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Problems with Windows and Ryzen were posted today explaining what is going wrong with SMT and single-thread performance. The good news is that Linux with the latest kernel has Ryzen working correctly. smt_configuration_error_in_windows_found_to_be Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304 |
Problems with Windows and Ryzen were posted today explaining what is going wrong with SMT and single-thread performance. The good news is that Linux with the latest kernel has Ryzen working correctly. It will be good if they can get that issue sorted early on. I doubt single-threaded performance will improve much (if at all: in Linux tests its single-threaded performance was still nowhere near Kaby Lake, though it is much, much improved over everything AMD has done before and brings it up to around the previous Intel generation); however it will give a good boost to multicore performance, particularly with games. At present Ryzen needs SMT off to perform well in games that can make use of multiple cores (which is pretty much all of them these days, although some make better use of extra CPU cores than others). I suspect that once software developers start optimising for Ryzen we should see some significant boosts to single-threaded performance. Even with its relatively weak single-threaded performance, if you have an application that suits the Ryzen architecture and also supports multiple threads, you do get excellent performance. Grant Darwin NT |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
I'm awaiting delivery of a 1700, an Asus X370 Pro motherboard and 8GB of RAM. Replying to myself, basically, with my thoughts on why the 1700X and 1800X cost so much more than the 1700. An article over at PC Perspective shot down my argument that the X processors had XFR and the 1700 did not. Not true, evidently: the 1700 has XFR also. The only difference is that it can boost only two 50 MHz steps, compared to two 100 MHz steps for the 1700X and 1800X. It is looking more and more as though the smart selection among the 8C/16T chips is the 1700. It overclocks to within 95% of the X chips at $70-170 less cost. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
From the PCPer architectural details given on their last podcast, the half-clocked AVX256 might be something to consider for many. For our purposes, mostly feeding GPUs, I think/hope that'll be much of a muchness. It had to be a compromise. Pretty sure I could live with half-speed AVX256 given the price and power advantages over the equivalent Intels. We'll see. . . That is my way of thinking. From what I have heard there is no performance loss in their SSE3 implementation, so maybe use that for crunching. I intend to run some comparisons in that hope. While their 128-bit AVX can only be half as fast as 256-bit AVX would have been, their SSE3 might bridge the gap and make them very useful even as CPU crunchers. And they would allow me to squeeze that little extra out of my 2 x 1060s. Stephen :) |
Cosmic_Ocean Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13 |
I've done a bit of looking around and found Cinebench results for the plain 1700 (and the 1700X and 1800X, for that matter) that don't look entirely promising for SMT/HT, if I'm honest. For example, the 1700 got 149 single, 1410 multi (~9.5x speedup), which I assume is on all 16 threads. The 1700X seems to be about 9.75x and the 1800X is pretty much 10.0x. What I haven't seen, though, is anywhere that has done Cinebench with SMT/HT turned off. That's the one I'm really interested in seeing, and I would expect it to be above 7.00x. But what the speed-up with 16 threads versus a single thread seems to suggest to me is that SMT/HT looks practically useless. Sure, it's better than 8x, but it seems to suffer pretty hard and doesn't result in much of a gain. Based on the scores for the plain 1700 that I'm seeing, 16 threads only having the effectiveness of 9.5 threads is actually worse than Bulldozer's loss from using all the available cores (my FX-6100 ends up at 4.1x when using all 6 cores, but gives me 2.95x when I specify that I only want it to use 3). So it almost seems like it would be a detriment to run with SMT/HT turned on, even if, overall, it yields more throughput--any single-threaded task will end up suffering when more than 8 threads are in use, just like Bulldozer suffers when there are more threads in progress than there are pairs of cores. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving up) |
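The speed-up arithmetic above can be sanity-checked with a quick back-of-the-envelope calculation. A minimal sketch (the helper name is mine; the scores are the Cinebench R15 numbers quoted in the post):

```python
def smt_scaling(single_score, multi_score, threads):
    """Return (overall speed-up, per-logical-thread efficiency) for a multi-threaded run."""
    speedup = multi_score / single_score
    return speedup, speedup / threads

# Cinebench R15 scores quoted above for the Ryzen 7 1700, 16 logical threads
speedup, efficiency = smt_scaling(149, 1410, 16)
print(round(speedup, 2))     # ~9.46x over a single thread
print(round(efficiency, 2))  # ~0.59 effective work per logical thread
```

On the quoted numbers, each of the 16 logical threads is doing roughly 59% of a lone thread's work, which is where the "effectiveness of 9.5 threads" figure comes from.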
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
I've done a little bit of looking around and research and found Cinebench results for the plain 1700 (and the X and 1800X, for that matter) that don't look entirely promising for SMT/HT, if I'm honest. I'll have to spend some time looking. But with all the latest discoveries regarding thread scheduling, how apps handle the L3 cache and such, I'm almost positive I've run across Cinebench benchmarks done with SMT off for both the S15 and R15 tests in the last day or so. Now I just have to find you the links again. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Grant (SSSF) Send message Joined: 19 Aug 99 Posts: 13736 Credit: 208,696,464 RAC: 304 |
So it almost seems like it would be a detriment to run with SMT/HT turned on, even if overall, it yields more throughput--any single-threaded task will end up suffering when more than 8 threads are being used, just like how Bulldozer suffers when there are more threads in progress than there are pairs of cores. It has always been the case with HyperThreading that the performance of a single thread is lower than with HyperThreading off. But, as with SETI crunching on the CPU (and multiple WUs on the GPU, depending on the application), the longer run time per thread is offset by more threads being run. The end result is more work done, which is what's important. Grant Darwin NT |
Stephen "Heretic" Send message Joined: 20 Sep 12 Posts: 5557 Credit: 192,787,363 RAC: 628 |
So it almost seems like it would be a detriment to run with SMT/HT turned on, even if overall, it yields more throughput--any single-threaded task will end up suffering when more than 8 threads are being used, just like how Bulldozer suffers when there are more threads in progress than there are pairs of cores. . . The captions do not specify, but I get the impression those graphs are both from tests with HyperThreading turned ON. I would like to see graphs of the same tests with HT off. Granted, with only a single task running on one thread it should in theory be much the same as running without HT, but I would like to see it put to the test anyway. It's all well and good to say the output will be higher with it on, but that presumes that runtimes with HT on and multiple threads running will be significantly LESS than double the runtimes with HT off, so I would like to see those figures. That would tell the story. Stephen ? |
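Stephen's break-even condition can be put into numbers. A minimal sketch (the runtimes below are made-up placeholders, not measured figures): HT only pays off if the per-task runtime with all 16 threads loaded stays under double the 8-thread runtime.

```python
def throughput(tasks_in_flight, runtime_hours):
    """Tasks completed per hour when `tasks_in_flight` tasks run concurrently,
    each finishing in `runtime_hours`."""
    return tasks_in_flight / runtime_hours

# Hypothetical runtimes: 8 threads (SMT off) vs 16 threads (SMT on).
smt_off = throughput(8, 1.0)   # 8.0 tasks/hour
smt_on = throughput(16, 1.9)   # ~8.42 tasks/hour -- a win, since 1.9 < 2 x 1.0
print(smt_on > smt_off)
```

If the 16-thread runtime came out at 2.1 hours instead, SMT on would actually lose throughput, which is exactly the figure Stephen wants to see measured.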
Paul Send message Joined: 17 May 99 Posts: 72 Credit: 42,977,964 RAC: 43 |
Hi All, Got my Ryzen system built and now I'm playing with S@H, as that is my most important benchmark. First things first: I see some people reporting a BOINC benchmark of 4500 MIPS for the Ryzen 1800, but I got 5000 right out of the box, so I just want to put that on the record. Anyway, here's my question: what is the optimal thread count for Ryzen? There must be a justifiable theoretical answer. 8 or 16? (Here's why I'm confused. The FX line had one FPU per core pair, so it only made sense that simultaneous-thread efficiency would peak around half the number of cores. But what I read about Ryzen is that it has *two* FP units per core. Now this is very confusing. Let's put aside the practical issues of feeding cache from main memory. Does this mean we should expect to see throughput increasing beyond half the CPU count? Up to roughly two simultaneous threads per core, even? I'm confused because the Ryzen BOINC benchmark is twice that of the FX, and if I ran twice as many threads per core *and* twice as many cores, that makes 8 times the throughput, which is crazy. After just three days, I'm pretty sure it's not even going to reach 4 times, judging by the new curve in RAC; I've watched it recover many times after outages or problems in the past and it has a characteristic shape. I still have BOINC set to 50% of CPUs as I'm very suspicious. I would just try running more, but it's really hard to measure BOINC performance. I've been running BOINC for a long time and I've never seen a better measure of performance than RAC, but that takes about 30 days to stabilize. Credit/sec fluctuates wildly between tasks, and besides, running 16 threads for a few hours would produce a lot of data to gather just to compute the average. It's straightforward, I'm sure; I could do that.) |
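The averaging Paul describes is indeed straightforward. A minimal sketch (the helper name and the sample numbers are hypothetical, just to show the shape of the calculation): weight by runtime rather than averaging per-task rates, so long tasks aren't undercounted.

```python
def mean_credit_rate(tasks):
    """Runtime-weighted average credit per CPU-second across completed tasks.

    `tasks` is a list of (credit, cpu_seconds) pairs as read off a host's
    completed-task list."""
    total_credit = sum(credit for credit, _ in tasks)
    total_seconds = sum(seconds for _, seconds in tasks)
    return total_credit / total_seconds

# Hypothetical (credit, cpu_seconds) pairs for three completed tasks.
tasks = [(80.0, 4000), (95.0, 5200), (70.0, 3600)]
print(round(mean_credit_rate(tasks), 4))  # credit per CPU-second
```

Running this once for a batch crunched at 8 threads and once for a batch at 16 would give the throughput comparison without waiting 30 days for RAC to settle.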
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Hi Paul, and welcome to the thread. All interesting points. I too was questioning just how many "real" cores to put into play, since the Ryzen architecture is not FPU-hamstrung like its predecessor, the FX. For the moment, since putting my Ryzen system online I have limited the number of SETI tasks to 12, which means the CPU is mainly running at 75% utilization. I typically have 3 to 4 GPU tasks running, since I also crunch on the GPU for Einstein and MilkyWay; that means I can have 8-9 CPU tasks running at the same time. I limit the CPU tasks to physical cores through affinity in Process Lasso. That also means I am not running all of my CPU tasks on physical cores at all times, since there are only 8 physical cores. I am still feeling the system out and taking baby steps before I let it fully loose on BOINC. I have already seen a large impact on RAC since it went online, most evident in the large decrease in processing time for CPU tasks over my FX processors. Also interesting is the 1700X's preference for BLC CPU tasks, which run a half hour quicker than any normal-range Arecibo task. I have no idea why, except to guess that the AVX processing speed of the 1700X is multiple times faster than on my FX processors. I haven't yet tried to track a specific CPU task running on one of the virtual cores through to completion to compare it against one run on a physical core. But I really haven't seen any visible outlier in running time for same-AR-range CPU tasks that I can pinpoint to a virtual core. My suspicion is that, with the core leveling going on in the chip, it is probably irrelevant whether a task runs on a "real" or "virtual" core with regard to completion time. I am going to stay at 12 concurrent tasks for a while longer before I begin to load the chip to 100%. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Cosmic_Ocean Send message Joined: 23 Dec 00 Posts: 3027 Credit: 13,516,867 RAC: 13 |
*blows dust off thread* I wasn't entirely sure where to put this--make a new thread, add to this one, or put it in the Panic Mode thread? I chose to put it here. 16-core Ryzen "Whitehaven" details leaked AMD's upcoming 16-core enthusiast Ryzen "Whitehaven" CPUs have been spotted. The new processors will come in variants of up to 16 cores and 32 threads and will support quad-channel DDR4 memory. Linux laptop: record uptime: 1511d 20h 19m (ended due to the power brick giving up) |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
*blows dust off thread* Thanks for the link. I have been offline for a day and had missed the news. I was aware of the server chip Naples architecture but had heard nothing about the "Whitehaven" chip. Interesting that it will be a LGA socket. Should give Intel a good run for the money. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Keith White Send message Joined: 29 May 99 Posts: 392 Credit: 13,035,233 RAC: 22 |
Hi All, Okay, the confusion about Bulldozer stems from exuberant marketing: because of its internal design, AMD decided to count the maximum number of simultaneous threads and market that as the core count, rather than the number of Bulldozer "modules". The "module" is more analogous to Intel's Hyper-Threading-enabled core, or today's Ryzen Simultaneous Multithreading core, in terms of functionality. There is no functional difference between them: each can execute two threads at the same time, and each thread shares use of a single FPU. The physical difference is that Bulldozer fixed-partitioned the ALU (Arithmetic Logic Unit) into two "integer cores", one for each thread that could run, whereas Intel and Ryzen use a much larger ALU that can achieve higher overall throughput when two unrelated threads are scheduled at the same time, through dynamic sharing of resources. The i7-7700K, the Ryzen 1500X and the FX-8350 each have four FPUs and each can run 8 threads. "Life is just nature's way of keeping meat fresh." - The Doctor |
Filipe Send message Joined: 12 Aug 00 Posts: 218 Credit: 21,281,677 RAC: 20 |
AMD Ryzen R7 RAC seems impressively good |
Paul Send message Joined: 17 May 99 Posts: 72 Credit: 42,977,964 RAC: 43 |
Thanks for your help. I think I understand what you are saying, but I still have questions. I understand the distinction between threads and cores. As you say, HT has been around for a long time and we learned the difference very well when Intel did it. Wikipedia explicitly states that there are "two floating-point units per core". Now, I checked the cited reference and I think I see why they say that: there is *one* FP rename unit, but two sets of compute units behind it. The 1500 has 8 cores, but the FX-8350 had 4. 1) Are you saying that they started counting cores differently between FX and Ryzen? 2) I think my question stands: can one core tackle two FP threads simultaneously? |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
1) Yes, for Ryzen cores are counted the same as Intel counts them. The "core" count for FX was really "modules", in a 4 x 2 configuration. I believe most monitoring programs now interrogate FX as 4 cores with HT; I know the author of SIV decided to change to that definition in past versions after consulting with me and running tests. Ryzen is now considered to follow the true modern definition of core counts: each core is capable of HT/SMT thread scheduling, and the IOMMU and NUMA nodes both identify it as having 8 cores. 2) I've always liked the "deep dives" Anandtech does on CPU architectures. Each core has its own FPU capable of two FPU threads, one listed as "schedulable" and the other as "non-schedulable". So, yes, to answer your question: each core can process two FPU threads, with each taking a turn in clock scheduling. But, just as with any HT core, regardless of Intel or AMD architecture, the "real" or physical core always gets prioritized over a virtual core in practice. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
jason_gee Send message Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0 |
Following up with some reading of Agner Fog's latest optimisation guides, it looks as though IPC is higher than any Intel processor, so the memory latency + frequency issues will potentially dominate for some time. Having watched through portions of an AMD livestream while looking for info on the IOMMU updates due in the AGESA 1006 update to fix groupings, I noted they mentioned they've covered 'standard' JEDEC compatibility and are moving on to the custom XMP2-style support, with Samsung B-die memory having been the easy one. Most likely FFTW tweaks will end up being incorporated at some point; then, as things settle, MT apps will need to be produced to make better use of these, in addition to figuring out some additional optimisations. Apparently the AVX2 implementation, despite being effectively half clocked, is faster than separate faster-clocked SSEx, because it preserves entries/fetches/decodes in the instruction pipeline. "Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions. |
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
All I know is that my Ryzen 1700X simply burns through the BLC CPU tasks using the r3330 AVX version. I don't know why that app finds BLC tasks so easy to process compared to Arecibo CPU tasks; I guess something about the tasks' data structure is especially amenable to the AVX pathway. I know I don't get anywhere near the same process time using that app on my old FX processors. Ryzen really seems to like AVX code. Interesting to hear that it may run especially well on AVX2 code paths too. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
Keith White Send message Joined: 29 May 99 Posts: 392 Credit: 13,035,233 RAC: 22 |
To be picky, the OS only sees "logical cores": all it knows is to assign one thread to each even/odd pair (0,1 - 2,3 - etc.) before assigning a second thread to a pair, and it doesn't matter which core of the pair gets it, the even- or odd-numbered one. Ryzen adds a complication for the scheduler: the current R5 and R7 versions are really two core complexes (CCXs), each with 4 cores and 8MB of L3 cache (six-core versions are 3+3 and quad-cores are 2+2). If threads need to communicate with each other or access data in the other CCX's cache over the chip's "Infinity Fabric" (I hate AMD marketing), it takes 2.5 times longer than if it's on the same CCX. This is why benchmarks that test cache performance give odd results on Ryzen compared to Intel. The Windows scheduler does appear to assign single threads to all the cores on one CCX before it starts assigning them to the other, and then doubles threads up per core as usual. Now, since the CPU version of the S@H cruncher, at least, is essentially running on a single thread, it shouldn't really be impacted. However, threads can move from one core to another while the OS suspends one thread to allow others to run, and if a thread moves from one CCX to the other there is going to be a performance penalty if data it needs resides in the L3 cache of the first CCX--but it's probably not much. Plus it can sort of be fixed in the cruncher's software. Also, at the OS level you can tell an application to run only on the core it starts on (or a group of cores), which is how SIV does it. "Life is just nature's way of keeping meat fresh." - The Doctor |
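The OS-level pinning Keith describes (keeping a process on one group of cores, as SIV and Process Lasso do) can be sketched on Linux with the standard library. A minimal sketch, with a big assumption flagged: that logical CPUs 0-7 map to the first CCX, which you should verify on your own box with `lscpu --extended`, since the numbering is platform-dependent.

```python
import os

# ASSUMPTION: logical CPUs 0-7 are the first CCX on this host.
# Verify the real mapping with `lscpu --extended`; Linux-only API.
FIRST_CCX = {0, 1, 2, 3, 4, 5, 6, 7}

def pin_to_first_ccx(pid=0):
    """Restrict a process (0 = the current one) to the first CCX's cores,
    ignoring any of those core numbers the host doesn't actually have."""
    usable = FIRST_CCX & os.sched_getaffinity(pid)
    os.sched_setaffinity(pid, usable)
    return os.sched_getaffinity(pid)

print(pin_to_first_ccx())
```

Pinning this way avoids the cross-CCX L3 penalty Keith mentions, at the cost of leaving the other CCX for the OS, GPU-support threads, and the desktop, much like the even-core arrangement described in the next post.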
Keith Myers Send message Joined: 29 Apr 01 Posts: 13164 Credit: 1,160,866,277 RAC: 1,873 |
Yes, the system will use whatever it wants, unless you provide guidance on exactly how you want it to work. I still crunch CPU tasks only on cores #0,2,4,6,8,10,12,14; I haven't yet gotten around to trying an experiment running on the odd-numbered cores. I set the CPU app affinity to the even cores in Process Lasso. The odd-numbered cores get to do support duty for the GPU tasks and run the desktop. I have tried to minimize the time required to traverse the Data Fabric by running my memory at 3200 MHz. I have hopes that late next month, when my motherboard gets the AGESA 1.0.0.6 firmware update in a new BIOS, I might be able to increase the memory clock a bit more and further reduce the CCX communication penalty. Seti@Home classic workunits:20,676 CPU time:74,226 hours A proud member of the OFA (Old Farts Association) |
©2024 University of California
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.