Lunatics optimization for Ryzen, any plans?

Profile M_M
Joined: 20 May 04
Posts: 58
Credit: 30,268,693
RAC: 3,283
Serbia
Message 1901927 - Posted: 19 Nov 2017, 20:32:29 UTC

As the title says, are there any plans for this?

As I understand it, Ryzen is a rather different architecture than Intel's, so it would make sense to have an optimized code path for it, especially since there are more and more Ryzen and Threadripper systems out there...
ID: 1901927 · Report as offensive     Reply Quote
Profile Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 3131
Credit: 203,752,016
RAC: 288,781
United States
Message 1901931 - Posted: 19 Nov 2017, 20:45:35 UTC - in response to Message 1901927.  

I sorta doubt it. Our developers are stretched thin as it is. The legacy Intel code path of the current apps is competent enough to keep Ryzen and Threadripper in the conversation. I too wish there were more choices for CPU apps in the Windows environment. There are lots more CPU app choices in the Linux environment. The SSE4.1 app for Linux is head and shoulders faster than the fastest AVX Windows CPU app. I wish I had an SSE4.x app in Windows.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1901931 · Report as offensive     Reply Quote
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 5871
Credit: 77,975,869
RAC: 21,492
Russia
Message 1902044 - Posted: 20 Nov 2017, 11:16:14 UTC - in response to Message 1901931.  
Last modified: 20 Nov 2017, 11:19:00 UTC

I sorta doubt it. Our developers are stretched thin as it is. The legacy Intel code path of the current apps is competent enough to keep Ryzen and Threadripper in the conversation. I too wish there were more choices for CPU apps in the Windows environment. There are lots more CPU app choices in the Linux environment. The SSE4.1 app for Linux is head and shoulders faster than the fastest AVX Windows CPU app. I wish I had an SSE4.x app in Windows.

It's kind of "all who drank vodka in 1850 are now dead, hence vodka was poisoned in 1850" reasoning.
The Linux SSE4.1 app is faster (I'll take that for granted) not because it's SSE4.1 rather than AVX but because different compilers were used to build the binaries.
There is no SSE4.1 path in the opt CPU app at all. I rechecked that recently while reconstructing the build environment on a new device.
And because the VC++ compiler has only SSE2, AVX and AVX2 code generation options, there will be no SSE4.1 binary again. [There will be an SSE3 one, though, because there is a hand-coded SSE3 path in the source code that will emit SSE3 machine ops in the binary.]
One can attempt to grab an "SSE4.1" FFTW library DLL (if there is such a thing) and use it with the app for a speedup.
Or one can use a completely different toolchain to build a real SSE4.1 binary, in the hope that the toolchain's compiler has a good enough auto-optimizer. But again, there is no separate SSE4.1 path in the opt app (and, as I recall, none in stock either).
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1902044 · Report as offensive     Reply Quote
Profile Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 3131
Credit: 203,752,016
RAC: 288,781
United States
Message 1902078 - Posted: 20 Nov 2017, 17:41:48 UTC - in response to Message 1902044.  
Last modified: 20 Nov 2017, 17:45:13 UTC

I sorta doubt it. Our developers are stretched thin as it is. The legacy Intel code path of the current apps is competent enough to keep Ryzen and Threadripper in the conversation. I too wish there were more choices for CPU apps in the Windows environment. There are lots more CPU app choices in the Linux environment. The SSE4.1 app for Linux is head and shoulders faster than the fastest AVX Windows CPU app. I wish I had an SSE4.x app in Windows.

It's kind of "all who drank vodka in 1850 are now dead, hence vodka was poisoned in 1850" reasoning.
The Linux SSE4.1 app is faster (I'll take that for granted) not because it's SSE4.1 rather than AVX but because different compilers were used to build the binaries.
There is no SSE4.1 path in the opt CPU app at all. I rechecked that recently while reconstructing the build environment on a new device.
And because the VC++ compiler has only SSE2, AVX and AVX2 code generation options, there will be no SSE4.1 binary again. [There will be an SSE3 one, though, because there is a hand-coded SSE3 path in the source code that will emit SSE3 machine ops in the binary.]
One can attempt to grab an "SSE4.1" FFTW library DLL (if there is such a thing) and use it with the app for a speedup.
Or one can use a completely different toolchain to build a real SSE4.1 binary, in the hope that the toolchain's compiler has a good enough auto-optimizer. But again, there is no separate SSE4.1 path in the opt app (and, as I recall, none in stock either).

OK, I guess my request should be directed at Urs, asking him either to port the AVX, AVX2, SSE41 and SSE42 code to Windows or to loan his machine and whatever compilers it has on it to somebody with the Windows code. There is a complete stock of every CPU variation at Lunatics.

The fftwfloat335.7z file is available, and it has all the libfftw library variations to match up with the compiled CPU app.

So, why does the Lunatics installer have a specific radio button to choose the SSE4.1 app for AMD systems, and even suggest running it for better performance over the stock SSE3 app?

I have two basically identical Ryzen systems running at the same frequency or close to it. One is running Linux and the other Windows 10. The Linux system runs the SSE41 app and the Windows 10 system runs the AVX app. BLC25 tasks with standard AR run for 45-48 minutes on the Linux system, and the same kind of tasks run for 55-62 minutes on the Windows 10 system.

[Addendum] I forgot to mention that I was previously using the AVX app on the Linux system after it was built, until someone in the forums suggested the SSE41 app as faster. I tried it, switched to the SSE41 app, and have validated that claim. So I have baselines for each app in Linux.
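
To put those run times in perspective, here is a rough sketch in C that compares the midpoints of the two reported ranges. The function names are made up for illustration, and taking midpoints is an assumption, since the post gives only ranges rather than per-task times.

```c
/* Midpoint of a reported run-time range, in minutes. */
double midpoint(double lo, double hi)
{
    return (lo + hi) / 2.0;
}

/* Percentage by which the slower run time exceeds the faster one. */
double percent_longer(double fast, double slow)
{
    return (slow - fast) / fast * 100.0;
}
```

Calling `percent_longer(midpoint(45, 48), midpoint(55, 62))` compares 46.5 minutes with 58.5 minutes and gives roughly 25.8, i.e. the Windows AVX runs take about a quarter longer than the Linux SSE41 runs on these figures.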
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1902078 · Report as offensive     Reply Quote
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 5871
Credit: 77,975,869
RAC: 21,492
Russia
Message 1902190 - Posted: 21 Nov 2017, 6:43:43 UTC - in response to Message 1902078.  
Last modified: 21 Nov 2017, 6:46:05 UTC


OK, I guess my request should be directed at Urs, asking him either to port the AVX, AVX2, SSE41 and SSE42 code to Windows or to loan his machine and whatever compilers it has on it to somebody with the Windows code. There is a complete stock of every CPU variation at Lunatics.

The fftwfloat335.7z file is available, and it has all the libfftw library variations to match up with the compiled CPU app.

So, why does the Lunatics installer have a specific radio button to choose the SSE4.1 app for AMD systems, and even suggest running it for better performance over the stock SSE3 app?

I have two basically identical Ryzen systems running at the same frequency or close to it. One is running Linux and the other Windows 10. The Linux system runs the SSE41 app and the Windows 10 system runs the AVX app. BLC25 tasks with standard AR run for 45-48 minutes on the Linux system, and the same kind of tasks run for 55-62 minutes on the Windows 10 system.

[Addendum] I forgot to mention that I was previously using the AVX app on the Linux system after it was built, until someone in the forums suggested the SSE41 app as faster. I tried it, switched to the SSE41 app, and have validated that claim. So I have baselines for each app in Linux.


Hm. Well, today is a day of repetition, it seems.
OK, I'll repeat: there is no SSE4.1 code path in the opt app.
New statement: the processing code for Linux and Windows is the same. Code paths differ only in OS-specific details.
Repeat: the difference comes from the toolchains.
New statement: on Linux Urs used GCC, AFAIK; on Windows I used VC++.

So, try to find a volunteer to cross-compile with GCC under Windows. Or try to find a volunteer to restore the MinGW config Joe Segur used for the previous family of opt apps. They definitely were faster (at that time a direct comparison was possible), so it is something worth doing.

Regarding the Lunatics installer: most probably it's just a leftover from those times. There was an SSE4.1 SETI7 app.

BTW, while you are in a testing mood, try running the SSE3 app under Linux. How big is the difference in speed, if any?


To be precise, this one http://lunatics.kwsn.info/index.php?action=downloads;sa=view;down=481 over this one http://lunatics.kwsn.info/index.php?action=downloads;sa=view;down=482
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1902190 · Report as offensive     Reply Quote
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 5871
Credit: 77,975,869
RAC: 21,492
Russia
Message 1902191 - Posted: 21 Nov 2017, 6:50:08 UTC

And regarding the thread title: what makes Ryzen different?
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1902191 · Report as offensive     Reply Quote
Profile Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 3131
Credit: 203,752,016
RAC: 288,781
United States
Message 1902201 - Posted: 21 Nov 2017, 8:27:11 UTC - in response to Message 1902190.  
Last modified: 21 Nov 2017, 8:57:09 UTC


Hm. Well, today is a day of repetition, it seems.
OK, I'll repeat: there is no SSE4.1 code path in the opt app.
New statement: the processing code for Linux and Windows is the same. Code paths differ only in OS-specific details.
Repeat: the difference comes from the toolchains.
New statement: on Linux Urs used GCC, AFAIK; on Windows I used VC++.

So, try to find a volunteer to cross-compile with GCC under Windows. Or try to find a volunteer to restore the MinGW config Joe Segur used for the previous family of opt apps. They definitely were faster (at that time a direct comparison was possible), so it is something worth doing.

Regarding the Lunatics installer: most probably it's just a leftover from those times. There was an SSE4.1 SETI7 app.

BTW, while you are in a testing mood, try running the SSE3 app under Linux. How big is the difference in speed, if any?


To be precise, this one http://lunatics.kwsn.info/index.php?action=downloads;sa=view;down=481 over this one http://lunatics.kwsn.info/index.php?action=downloads;sa=view;down=482


OK, let me be clear, I am not a developer and do not have the skills or software necessary to compile apps. I still do not understand your statement that all the apps use the same code.

Did Urs simply create one file, the optimized SSE3 app and then make four more exact copies of the file and simply give them different names that indicate they use a different SSE code path?

What would the purpose be? Why do the SSE2, SSSE3, SSE41 and SSE42 r3306 apps exist at all then?

I do see that all the r3306 apps have the same file size. Is there truth in your statement that all the apps are the same?

I do see that the r3345 AVX app has a different file size, so I would conclude it DOES have different code in it compared to the r3306 app. This is the original app I was running after I built the Linux system back in July. I ran it exclusively up till the third week of October when I switched to the SSE41 app.

On my system, the SSE41 app is indisputably faster than the AVX app.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1902201 · Report as offensive     Reply Quote
Profile Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 3131
Credit: 203,752,016
RAC: 288,781
United States
Message 1902202 - Posted: 21 Nov 2017, 8:34:57 UTC - in response to Message 1902191.  

And regarding the thread title: what makes Ryzen different?


It is a completely new and different architecture from the previous FX processors and also differs in many respects from the current Intel architecture. The entire gaming industry is having to learn how to code for Ryzen now. They have learned that the Intel code pathways they have been using for the past ten years are sub-optimal for the Ryzen architecture, and that if no changes are made to their games, a huge amount of performance is left on the table.

Read the very good synopsis on the Ryzen architecture at PC Perspective AMD-Ryzen-7-1800X-Review-Now-and-Zen
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1902202 · Report as offensive     Reply Quote
Profile Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 3131
Credit: 203,752,016
RAC: 288,781
United States
Message 1902204 - Posted: 21 Nov 2017, 8:54:59 UTC

OK, I'll repeat: there is no SSE4.1 code path in the opt app.
New statement: the processing code for Linux and Windows is the same. Code paths differ only in OS-specific details.
Repeat: the difference comes from the toolchains.
New statement: on Linux Urs used GCC, AFAIK; on Windows I used VC++.


I'm sorry, Raistmer, you are going to have to dumb it way down for me. I just do not comprehend how an app that has SSE41 function calls can work on a CPU that only has SSE3 capabilities. My Ryzens have all the MMX and SSE capabilities up to AVX2, excluding only AVX512, which is a recent development in the latest Intel generation. At Lunatics, you are warned to choose the right app for your operating system (32-bit or 64-bit) and also to verify exactly which SSE functions your CPU supports before downloading the app.

I am speaking only about the Linux environment, since that is the only one that has all possible SSE-type apps available. I am only comparing the different Linux CPU apps. Referring back to my original post in this thread, I was only wishing for an SSE41 Windows app, since I have seen such a large improvement over the AVX app. I was just stating that I would like to see the same choice of SSE-function apps on Windows.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1902204 · Report as offensive     Reply Quote
Profile Karsten Vinding
Volunteer tester

Joined: 18 May 99
Posts: 207
Credit: 20,211,612
RAC: 4,568
Denmark
Message 1902217 - Posted: 21 Nov 2017, 10:34:28 UTC - in response to Message 1902204.  
Last modified: 21 Nov 2017, 10:35:13 UTC

As I understand it (and Raistmer is free to correct me), the basic code is the same for many of the apps.

When you compile the code, you ask the compiler to try to optimize for specific instruction sets, and it will try to change some of the generated code so that it uses newer instructions whenever possible and beneficial.
The end result of this kind of optimization depends heavily on the compiler's ability to do this well, and probably to some extent on the person running the compiles and his ability to choose the correct parameters. That last part probably comes down to a lot of trial and error.

In the code itself, there is some handwritten, very well optimized code that the compiler is told to keep its hands off. That is probably the SSE3 code base?

It would be nice to have a more Ryzen-specific code path, but it's probably not a small job, and there would have to be a lot of experimentation. I once tested for such a project on older AMD hardware at Lunatics, where the resulting AMD-specific code was 5-10% faster.
It can probably be done, and there could be large benefits, but it takes a lot of time.
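
As a small illustration of "asking the compiler to optimize for specific instruction sets": the same C source can report which SIMD level it was built for, because the compiler predefines macros according to its target options. This is only a sketch; the function name is hypothetical and nothing here is taken from the SETI@home sources. GCC and Clang define these macros for options such as `-msse3`, `-msse4.1`, `-mavx` and `-mavx2`, and MSVC defines `__AVX__` and `__AVX2__` for `/arch:AVX` and `/arch:AVX2`.

```c
/* Returns a label for the highest SIMD level this translation unit
 * was compiled for, based on compiler-predefined macros.
 * The source is identical in every build; only the flags differ. */
const char *compiled_simd_level(void)
{
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE4_1__)
    return "SSE4.1";
#elif defined(__SSE3__)
    return "SSE3";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "baseline";
#endif
}
```

Building this one file several times with different target options yields binaries that report different levels, which is roughly how one code base can end up as several differently named apps.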
ID: 1902217 · Report as offensive     Reply Quote
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 5871
Credit: 77,975,869
RAC: 21,492
Russia
Message 1902222 - Posted: 21 Nov 2017, 10:55:08 UTC - in response to Message 1902201.  


Did Urs simply create one file, the optimized SSE3 app and then make four more exact copies of the file and simply give them different names that indicate they use a different SSE code path?

No.
There are high-level language (C/C++) sources (the program) and machine-code binary instructions (the binary executable).
The compiler translates one into the other.
What I'm saying is that high-level code specific to SSE4.1 is absent.
There are such code paths for SSE and SSE3, for example. So the compiler takes different source to build the SSE and SSE3 versions.
In the case of SSE41, the difference in speed comes not from any SSE4.1-specific optimizations done by programmers but just from differences between compilers: how they translate high-level code into machine code.
GCC can emit SSE4.1 instructions on its own. VC++ doesn't. So, if the programmer didn't directly tell it to use such an SSE4.1 instruction (by using so-called intrinsics, for example), the resulting binary will not use SSE4.1 machine instructions at all.

So, Urs' binaries are different, but only because of the compiler/toolchain used.
If one finds a correspondingly good compiler for Windows (previously we used ICC, for example), the Windows binaries will become faster too. And it doesn't require any additional development (that is, source code writing), just adaptation of the build process to the new toolset. In other words, the codebase should be ported to the new toolset.
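
A minimal sketch of the distinction described above, in C. The function `dot4` is a made-up example, not code from the opt app. The first branch is a hand-written path that explicitly requests an SSE4.1 instruction through the `_mm_dp_ps` intrinsic; the second is plain C, where the choice of machine instructions is left entirely to the compiler. The `__SSE4_1__` macro is what GCC and Clang define when SSE4.1 code generation is enabled.

```c
#include <stddef.h>
#if defined(__SSE4_1__)
#include <smmintrin.h>  /* SSE4.1 intrinsics, including _mm_dp_ps */
#endif

/* Dot product of two 4-element float vectors. */
float dot4(const float *a, const float *b)
{
#if defined(__SSE4_1__)
    /* Hand-written path: the programmer explicitly asks for DPPS. */
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    /* Mask 0xF1: multiply all four lanes, write the sum to lane 0. */
    return _mm_cvtss_f32(_mm_dp_ps(va, vb, 0xF1));
#else
    /* Plain C path: the instructions used here are the compiler's choice. */
    float acc = 0.0f;
    for (size_t i = 0; i < 4; ++i)
        acc += a[i] * b[i];
    return acc;
#endif
}
```

Built with `gcc -O2 -msse4.1`, the intrinsic branch is compiled; without that flag the plain loop is used, and whether any vector instructions appear depends on the compiler's auto-vectorizer.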
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1902222 · Report as offensive     Reply Quote
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 5871
Credit: 77,975,869
RAC: 21,492
Russia
Message 1902223 - Posted: 21 Nov 2017, 10:56:26 UTC - in response to Message 1902202.  


Read the very good synopsis on the Ryzen architecture at PC Perspective AMD-Ryzen-7-1800X-Review-Now-and-Zen

I'll try, thanks.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1902223 · Report as offensive     Reply Quote
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 5871
Credit: 77,975,869
RAC: 21,492
Russia
Message 1902225 - Posted: 21 Nov 2017, 11:05:19 UTC - in response to Message 1902204.  


I'm sorry, Raistmer, you are going to have to dumb it way down for me. I just do not comprehend how an app that has SSE41 function calls can work on a CPU that only has SSE3 capabilities.

It will not. It will call the SSE3-based ones instead (that's how stock works), or one should use another binary (that's how the opt app works).
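
The "call the SSE3-based ones instead" behaviour is usually done with run-time dispatch. Below is a rough sketch in C under stated assumptions: all names are hypothetical, the three "kernels" are stand-ins that share one scalar loop, and the SSE4.1 entry is purely illustrative, since no such path exists in the opt app according to this thread. The host check relies on the GCC/Clang built-in `__builtin_cpu_supports`; other compilers fall back to the generic entry.

```c
#include <stddef.h>

/* Hypothetical kernel signature: sum of n floats. */
typedef float (*kernel_fn)(const float *data, size_t n);

typedef struct {
    const char *name;  /* label for logging */
    kernel_fn   run;   /* routine to call */
} kernel;

static float sum_scalar(const float *data, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += data[i];
    return acc;
}

/* Stand-ins: a real app would point these at separately written
 * SSE3 / SSE4.1 routines instead of the same scalar loop. */
static const kernel KERNEL_GENERIC = { "generic", sum_scalar };
static const kernel KERNEL_SSE3    = { "sse3",    sum_scalar };
static const kernel KERNEL_SSE41   = { "sse4.1",  sum_scalar };

/* Pick the best kernel the CPU can actually execute. */
kernel select_kernel(int has_sse3, int has_sse41)
{
    if (has_sse41) return KERNEL_SSE41;
    if (has_sse3)  return KERNEL_SSE3;
    return KERNEL_GENERIC;
}

/* Query the host CPU where the compiler offers a built-in for it. */
kernel select_kernel_for_host(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return select_kernel(__builtin_cpu_supports("sse3"),
                         __builtin_cpu_supports("sse4.1"));
#else
    return select_kernel(0, 0);
#endif
}
```

On a CPU that reports SSE3 but not SSE4.1, `select_kernel(1, 0)` returns the SSE3 entry, so the SSE4.1 routine is never called. A binary compiled entirely for SSE4.1 has no such fallback, which is why a different binary has to be chosen for that CPU.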

At Lunatics, you are warned to choose the right app for your operating system (32-bit or 64-bit) and also to verify exactly which SSE functions your CPU supports before downloading the app.

exactly.


I am speaking only about the Linux environment, since that is the only one that has all possible SSE-type apps available. I am only comparing the different Linux CPU apps. Referring back to my original post in this thread, I was only wishing for an SSE41 Windows app, since I have seen such a large improvement over the AVX app. I was just stating that I would like to see the same choice of SSE-function apps on Windows.


And I am trying to explain that the reason for the speed difference lies not in the SSE level of the app per se.
BTW, the speed improvement on your host (AMD?) can come just from a poor implementation of the AVX instruction set.
We had that before, when SSE2 binaries were faster than SSE3 ones on early AMD processors.

Still, the proposed experiment will be quite interesting: try to replace your current SSE4.1 Linux binary with the SSE3 one (links posted above).
And run it for a few days: will it be slower, faster, or approximately the same on Ryzen?
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1902225 · Report as offensive     Reply Quote
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 5871
Credit: 77,975,869
RAC: 21,492
Russia
Message 1902226 - Posted: 21 Nov 2017, 11:08:44 UTC - in response to Message 1902217.  

As I understand it (and Raistmer is free to correct me), the basic code is the same for many of the apps.

When you compile the code, you ask the compiler to try to optimize for specific instruction sets, and it will try to change some of the generated code so that it uses newer instructions whenever possible and beneficial.
The end result of this kind of optimization depends heavily on the compiler's ability to do this well, and probably to some extent on the person running the compiles and his ability to choose the correct parameters. That last part probably comes down to a lot of trial and error.

In the code itself, there is some handwritten, very well optimized code that the compiler is told to keep its hands off. That is probably the SSE3 code base?

It would be nice to have a more Ryzen-specific code path, but it's probably not a small job, and there would have to be a lot of experimentation. I once tested for such a project on older AMD hardware at Lunatics, where the resulting AMD-specific code was 5-10% faster.
It can probably be done, and there could be large benefits, but it takes a lot of time.

Yep. I hope your English will beat mine in explanatory strength :)
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1902226 · Report as offensive     Reply Quote
Profile Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 3131
Credit: 203,752,016
RAC: 288,781
United States
Message 1902263 - Posted: 22 Nov 2017, 2:16:57 UTC - in response to Message 1902225.  

OK, I can do that. I ran exactly one task using the r3305 SSSE3 Linux app that TBar included in his zi3v Linux package when I first built the system and installed Linux. I then updated to the r3345 AVX Linux app simply because the Windows AVX app was faster than the optimized SSE3 app on my first Windows Ryzen system, and I expected the same from the Ryzen Linux system.

So you want the basic SSE3 app run for a few days for comparison, correct?
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1902263 · Report as offensive     Reply Quote
Profile Keith Myers
Volunteer tester
Joined: 29 Apr 01
Posts: 3131
Credit: 203,752,016
RAC: 288,781
United States
Message 1902269 - Posted: 22 Nov 2017, 3:06:04 UTC

OK, so there are a lot of benchmark tools at Lunatics. I even see a lot of files attributed to you, Raistmer. I thought you only worked in the Windows environment. Should I run some of the benchmark tools and test tasks with the suggested SSE41 and SSE3 apps using the Lunatics test tools? Or do you want just the SETI Main tasks that get sent to my host? Which method would provide the best comparisons and information for you? I don't have any experience with the Lunatics test tools. It looks like I need to MAKE some of the files myself. I have never compiled anything in Linux so far. I would probably need some handholding.
Seti@Home classic workunits:20,676 CPU time:74,226 hours
ID: 1902269 · Report as offensive     Reply Quote
Profile Mike
Volunteer tester
Joined: 17 Feb 01
Posts: 30799
Credit: 59,254,080
RAC: 24,975
Germany
Message 1902309 - Posted: 22 Nov 2017, 8:53:45 UTC
Last modified: 22 Nov 2017, 8:54:06 UTC

I did some benches on my Ryzen under Linux after I built the system.
SSE4.1 was fastest, followed by AVX, so I don't think your bench would be much different.
With each crime and every kindness we birth our future.
ID: 1902309 · Report as offensive     Reply Quote
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 5871
Credit: 77,975,869
RAC: 21,492
Russia
Message 1902316 - Posted: 22 Nov 2017, 9:25:51 UTC - in response to Message 1902269.  

OK, so there are a lot of benchmark tools at Lunatics. I even see a lot of files attributed to you, Raistmer. I thought you only worked in the Windows environment. Should I run some of the benchmark tools and test tasks with the suggested SSE41 and SSE3 apps using the Lunatics test tools? Or do you want just the SETI Main tasks that get sent to my host? Which method would provide the best comparisons and information for you? I don't have any experience with the Lunatics test tools. It looks like I need to MAKE some of the files myself. I have never compiled anything in Linux so far. I would probably need some handholding.

A benchmark would be more precise, at least until one collects big statistics from online runs.
And no, it doesn't need any compilation. Just put the needed files in the needed places and run the script.
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1902316 · Report as offensive     Reply Quote
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 5871
Credit: 77,975,869
RAC: 21,492
Russia
Message 1902317 - Posted: 22 Nov 2017, 9:26:21 UTC - in response to Message 1902309.  

I did some benches on my Ryzen under Linux after I built the system.
SSE4.1 was fastest, followed by AVX, so I don't think your bench would be much different.

To what degree? What % difference?
SETI apps news
We're not gonna fight them. We're gonna transcend them.
ID: 1902317 · Report as offensive     Reply Quote
Profile Mike
Volunteer tester
Joined: 17 Feb 01
Posts: 30799
Credit: 59,254,080
RAC: 24,975
Germany
Message 1902318 - Posted: 22 Nov 2017, 9:51:12 UTC - in response to Message 1902317.  
Last modified: 22 Nov 2017, 9:52:23 UTC

I did some benches on my Ryzen under Linux after I built the system.
SSE4.1 was fastest, followed by AVX, so I don't think your bench would be much different.

To what degree? What % difference?


Just 1% - 2%.
On some tasks even SSSE3 was faster than AVX.
You can view the results on Lunatics; I posted them in my Mint thread.
With each crime and every kindness we birth our future.
ID: 1902318 · Report as offensive     Reply Quote