Developing a Multi-Threaded Benchmarking App for Linux

Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1967242 - Posted: 26 Nov 2018, 11:20:21 UTC - in response to Message 1967236.  

I'm currently running the r3711 SSE41 app against the AVX2 app, since that one wasn't included in Rick's set of default apps for some reason. I'm also seeing some anomalous behavior in the number of gpu instances that can be invoked.

benchMT currently only allows 1 task per GPU. Number of GPUs is determined by
lshw -short | grep display

Does the log file indicate the correct number of GPUs?
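
For anyone who wants to check the GPU count by hand, the detection command above can be mimicked in a few lines of Python. This is only a sketch and not benchMT's actual code: the function names and the sample text are made up for illustration, and the count is simply the number of lines containing the word display, as `grep display` would report.

```python
import subprocess


def count_display_devices(lshw_short_output: str) -> int:
    """Count lines containing 'display', the same lines `grep display` keeps."""
    return sum(1 for line in lshw_short_output.splitlines() if "display" in line)


def detect_gpu_count() -> int:
    """Run `lshw -short` and count display lines (requires lshw to be installed)."""
    result = subprocess.run(
        ["lshw", "-short"], capture_output=True, text=True, check=False
    )
    return count_display_devices(result.stdout)


# Hypothetical sample in the column layout that `lshw -short` prints.
SAMPLE = """\
H/W path        Device   Class     Description
/0/100/1/0               display   GP104 [GeForce GTX 1080]
/0/100/3/0               display   GP104 [GeForce GTX 1080]
/0/100/1c/0     enp5s0   network   I211 Gigabit Network Connection
"""

if __name__ == "__main__":
    print(count_display_devices(SAMPLE))
```

If the number printed for your own host differs from the number of cards installed, that would be worth mentioning along with the log file.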

I'm only running one task per gpu since I am running the CUDA92 app. But if I only invoke 3 instances of the application, it only runs two tasks on two gpus and has the third instance pending until the first two complete, then the pending task runs on the first gpu. I see all 3 gpus always.

This is the entry in benchCFG

setiathome_x41p_V0.97b2_Linux-Pascal+_cuda92
setiathome_x41p_V0.97b2_Linux-Pascal+_cuda92
setiathome_x41p_V0.97b2_Linux-Pascal+_cuda92
#setiathome_x41p_V0.97b2_Linux-Pascal+_cuda92

This is what the benchmark is going to execute

Only 0 CPU jobs and 3 GPU jobs. Max Threads reduced to 3
List of Initialized Slots
SlotNum | platform | device | state | job | SlotDir
-0------| GPU | 0 | EMPTY | None| /home/keith/Downloads/Utils/benchMT/Slots/0
-1------| GPU | 1 | EMPTY | None| /home/keith/Downloads/Utils/benchMT/Slots/1
-2------| CPU | NA | EMPTY | None| /home/keith/Downloads/Utils/benchMT/Slots/2
##### 3 total slots
Pending jobs (CPU/GPU): 0 / 3
Pending reference jobs: 0
Execute listed jobs? [y/N]

With this benchCFG file entry

setiathome_x41p_V0.97b2_Linux-Pascal+_cuda92
setiathome_x41p_V0.97b2_Linux-Pascal+_cuda92
setiathome_x41p_V0.97b2_Linux-Pascal+_cuda92
setiathome_x41p_V0.97b2_Linux-Pascal+_cuda92

This is what the benchmark is going to execute

Only 0 CPU jobs and 4 GPU jobs. Max Threads reduced to 4
List of Initialized Slots
SlotNum | platform | device | state | job | SlotDir
-0------| GPU | 0 | EMPTY | None| /home/keith/Downloads/Utils/benchMT/Slots/0
-1------| GPU | 1 | EMPTY | None| /home/keith/Downloads/Utils/benchMT/Slots/1
-2------| GPU | 2 | EMPTY | None| /home/keith/Downloads/Utils/benchMT/Slots/2
-3------| CPU | NA | EMPTY | None| /home/keith/Downloads/Utils/benchMT/Slots/3
##### 4 total slots
Pending jobs (CPU/GPU): 0 / 4
Pending reference jobs: 0
Execute listed jobs? [y/N]

Looks like a bug. Let me try to reproduce it on my system this evening.


When I run this on my system, it all appears normal. Can you post or send me your complete benchCFG file and the hostname*.txt file in the run subdir of testData? Also, running it with the --debug option might give more insight. Does this app require a .cl file? If so, it needs to be in the APPS_GPU directory. Thanks!


I was just able to reproduce the problem. It happens when there are fewer GPU jobs than GPUs, or the same number. Working on it...
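
One way to picture the expected behaviour is a slot plan that hands out one GPU slot per GPU, capped by the number of GPU jobs, and creates CPU slots only when CPU jobs exist. The sketch below is an assumption about the intended result, not benchMT's real allocation code; `plan_slots` and its parameters are hypothetical names.

```python
from typing import List, Optional, Tuple

Slot = Tuple[str, Optional[int]]  # (platform, device index or None for CPU)


def plan_slots(num_gpus: int, gpu_jobs: int, cpu_jobs: int, max_threads: int) -> List[Slot]:
    """Return the slots one would expect for the given job mix."""
    # One task per GPU, never more GPU slots than GPU jobs or threads.
    gpu_slots = min(num_gpus, gpu_jobs, max_threads)
    # Remaining threads go to CPU slots, but only if CPU jobs are pending.
    cpu_slots = min(cpu_jobs, max_threads - gpu_slots)
    slots: List[Slot] = [("GPU", device) for device in range(gpu_slots)]
    slots.extend(("CPU", None) for _ in range(cpu_slots))
    return slots


if __name__ == "__main__":
    for platform, device in plan_slots(num_gpus=3, gpu_jobs=3, cpu_jobs=0, max_threads=3):
        print(platform, device)
```

With 3 GPUs and 3 GPU jobs this yields three GPU slots, rather than the two GPU slots plus one CPU slot shown in the first listing.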
ID: 1967242 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1967253 - Posted: 26 Nov 2018, 16:27:23 UTC - in response to Message 1967242.  

Sent you a PM with your requested file contents. Will have to try the new beta now that I understand how to use it better.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1967253 · Report as offensive
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1967507 - Posted: 28 Nov 2018, 11:23:01 UTC

Here is my comparison of all of the Linux CPU apps that I have. I ran 7 Arecibo and 8 GBT WUs twice each on each system using only 30 threads on each system. The 2990WX had SMT disabled and the 1950X had it enabled. The MB, cooling solution, memory, BIOS, OS are all the same between the 2 systems. BIOS settings are also the same with the exception of manual Vcore and CPU Core ratio. LLC is -L2 on 2990WX and -L1 on the 1950X.


Based on these results, the r3711_sse41 app is fastest, though the 2 newer apps have a noticeable reduction in Similarity. Not sure if that difference is significant though.
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1967507 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1967565 - Posted: 28 Nov 2018, 20:22:44 UTC

Thanks for the updated benchmark runs with the standard All-in-One package apps, Rick. That is what I too have always found: the r3711 SSE41 app has always been the fastest on my Ryzens. I even found it faster on my i7-6850K system though Juan found the Intel AVX2 app the fastest on his i7-6850K host.

I'll give the new benchmark executable a run again. I would like to get some reference runs on my Ryzen 2700X host so that I will have it later for comparison against the upcoming Threadripper 2920X system.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1967565 · Report as offensive
juan BFP Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 16 Mar 07
Posts: 9786
Credit: 572,710,851
RAC: 3,799
Panama
Message 1967718 - Posted: 29 Nov 2018, 14:48:08 UTC - in response to Message 1967565.  
Last modified: 29 Nov 2018, 14:51:58 UTC

I even found it faster on my i7-6850K system though Juan found the Intel AVX2 app the fastest on his i7-6850K host.

I redid the test now that I'm running without hyperthreading (4 GPU + 2 CPU only and NO -nobs), and both apps give me very similar numbers: 33-36 min to crunch a BLC11 WU. I will leave it running with SSE41 for some more time, waiting for new types of WU, to be sure.
As expected, with this configuration my CPU temps are down to <43C and CPU usage is in the 50-60% range. That is good for keeping the host cool even with an outside temperature of +36C. I use a TT Water 3.0 cooler.
ID: 1967718 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1967776 - Posted: 29 Nov 2018, 18:44:02 UTC - in response to Message 1967718.  

Theoretically any non-AVX app should be fastest on Intel cpus, since most motherboards impose an AVX penalty: the BIOS applies an AVX offset that reduces the clock frequency because of the high loading and heat production when Intel cpus run AVX instructions. Unless you set a very low offset (and I found it can't be set to zero nor completely disabled in the BIOS), the cpu will run an AVX instruction or app at a much lower clock frequency than the set cpu clock frequency.

So if you choose an app that doesn't run AVX or AVX2 instructions, you will keep your cpu clocks always up at their max set value. The AVX app only comes out ahead if the performance benefit of running the AVX instructions at reduced clocks is still better than the non-AVX app.
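
The trade-off can be put in numbers. The sketch below assumes the AVX clock is simply (core ratio minus AVX offset) times the base clock, and that an AVX app does some fixed factor more work per cycle than the SSE app. The ratio, offset, and function names are illustrative assumptions, not measurements from any host in this thread.

```python
def avx_clock_ghz(core_ratio: int, avx_offset: int, bclk_mhz: float = 100.0) -> float:
    """Clock in GHz while AVX code runs, with the offset given in ratio steps."""
    return (core_ratio - avx_offset) * bclk_mhz / 1000.0


def breakeven_speedup(core_ratio: int, avx_offset: int) -> float:
    """Per-cycle speedup the AVX app needs just to match the non-AVX app."""
    if avx_offset >= core_ratio:
        raise ValueError("offset must be smaller than the core ratio")
    return core_ratio / (core_ratio - avx_offset)


def avx_app_wins(core_ratio: int, avx_offset: int, per_cycle_speedup: float) -> bool:
    """True if the AVX app is faster overall despite the reduced clock."""
    return per_cycle_speedup > breakeven_speedup(core_ratio, avx_offset)


if __name__ == "__main__":
    # Hypothetical host: 40x ratio (4.0 GHz) with an AVX offset of 4.
    print(avx_clock_ghz(40, 4))
    print(breakeven_speedup(40, 4))
```

With a 40x ratio and an offset of 4, the AVX app has to do about 11% more work per cycle than the SSE app just to break even.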
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1967776 · Report as offensive
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 1967778 - Posted: 29 Nov 2018, 18:49:11 UTC - in response to Message 1967776.  
Last modified: 29 Nov 2018, 18:53:11 UTC

Theoretically any non-AVX app should be fastest on Intel cpus, since most motherboards impose an AVX penalty: the BIOS applies an AVX offset that reduces the clock frequency because of the high loading and heat production when Intel cpus run AVX instructions. Unless you set a very low offset (and I found it can't be set to zero nor completely disabled in the BIOS), the cpu will run an AVX instruction or app at a much lower clock frequency than the set cpu clock frequency.

So if you choose an app that doesn't run AVX or AVX2 instructions, you will keep your cpu clocks always up at their max set value. The AVX app only comes out ahead if the performance benefit of running the AVX instructions at reduced clocks is still better than the non-AVX app.


not all intel CPUs act this way. the AVX turbo penalty only showed up with the Haswell line and beyond (socket 2011-v3, E5 v3+ Xeons). earlier socket 2011 Sandy Bridge-EP and Ivy Bridge-EP chips (think v1 and v2 E5 Xeons) didn't see this behavior.

theoretically AVX and AVX2 should be faster, as they are capable of running larger data chunks per cycle, but i guess it's up to the app to actually do that?
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 1967778 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1967781 - Posted: 29 Nov 2018, 18:56:54 UTC - in response to Message 1967778.  

theoretically AVX and AVX2 should be faster, as they are capable of running larger data chunks per cycle, but i guess it's up to the app to actually do that?

Yes, though modern Intel cpus can run AVX-512 instructions, very few apps are actually able to use them.

Our AVX apps are old enough to not understand the AVX-512 instructions, so they can't use them. Our code branch is ancient, so even with our developers using modern compilers with modern cpu flags, the code base can't really utilize the newer AVX instructions to their maximum potential.
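
Whether a given host can even run an SSE4.1, AVX, AVX2, or AVX-512 build can be checked from the flags line of /proc/cpuinfo on x86 Linux. A minimal sketch, assuming the usual kernel flag names (sse4_1, avx, avx2, fma, avx512f); the helper names are hypothetical.

```python
from typing import Dict, Set

FLAGS_OF_INTEREST = ("sse4_1", "avx", "avx2", "fma", "avx512f")


def parse_cpu_flags(cpuinfo_text: str) -> Set[str]:
    """Return the flag names from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            _, _, value = line.partition(":")
            return set(value.split())
    return set()


def supported_extensions(cpuinfo_text: str) -> Dict[str, bool]:
    """Map each instruction-set flag of interest to whether the CPU reports it."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: name in flags for name in FLAGS_OF_INTEREST}


def read_cpuinfo(path: str = "/proc/cpuinfo") -> str:
    """Read the kernel's CPU description (Linux only)."""
    with open(path) as handle:
        return handle.read()


# Hypothetical, shortened flags line for illustration.
SAMPLE_CPUINFO = "model name\t: Example CPU\nflags\t\t: fpu sse sse2 sse4_1 sse4_2 avx avx2 fma\n"

if __name__ == "__main__":
    print(supported_extensions(SAMPLE_CPUINFO))
```

On a real host, pass `read_cpuinfo()` instead of the sample string.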
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1967781 · Report as offensive
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 1967782 - Posted: 29 Nov 2018, 19:00:52 UTC - in response to Message 1967781.  
Last modified: 29 Nov 2018, 19:06:27 UTC

theoretically AVX and AVX2 should be faster, as they are capable of running larger data chunks per cycle, but i guess it's up to the app to actually do that?

Yes, though modern Intel cpus can run AVX-512 instructions, very few apps are actually able to use them.

Our AVX apps are old enough to not understand the AVX-512 instructions, so they can't use them. Our code branch is ancient, so even with our developers using modern compilers with modern cpu flags, the code base can't really utilize the newer AVX instructions to their maximum potential.


Keith, you're confusing AVX/AVX2 and AVX-512.

AVX and AVX2 only have 256-bit registers. AVX2 extended integer operations to 256 bits, and FMA arrived alongside it on Haswell as a separate extension (FMA3).

AVX-512 expands that to 512 bits. AVX-512 is relatively new and first showed up, i think, in the Xeon Phi co-processor cards. it's on some of the more recent HEDT chips (Skylake-X and beyond) and the big Xeon server chips.

SSE4 i believe is 128-bit.
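
To make the register widths concrete, the number of single-precision floats handled per instruction is just the register width divided by 32. A small sketch using the widths named above; the dictionary and function names are illustrative.

```python
# Register widths in bits for the instruction sets discussed in this thread.
REGISTER_BITS = {"SSE4": 128, "AVX": 256, "AVX2": 256, "AVX-512": 512}


def lanes(extension: str, element_bits: int = 32) -> int:
    """Number of elements of the given size that fit in one vector register."""
    return REGISTER_BITS[extension] // element_bits


if __name__ == "__main__":
    for name in REGISTER_BITS:
        print(name, lanes(name))
```

So an AVX instruction covers 8 floats where SSE4 covers 4, which is the theoretical 2x that only materializes if the app's inner loops are written to use it.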
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 1967782 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1967796 - Posted: 29 Nov 2018, 19:40:31 UTC - in response to Message 1967782.  

theoretically AVX and AVX2 should be faster, as they are capable of running larger data chunks per cycle, but i guess it's up to the app to actually do that?

Yes, though modern Intel cpus can run AVX-512 instructions, very few apps are actually able to use them.

Our AVX apps are old enough to not understand the AVX-512 instructions, so they can't use them. Our code branch is ancient, so even with our developers using modern compilers with modern cpu flags, the code base can't really utilize the newer AVX instructions to their maximum potential.


Keith, you're confusing AVX/AVX2 and AVX-512.

AVX and AVX2 only have 256-bit registers. AVX2 extended integer operations to 256 bits, and FMA arrived alongside it on Haswell as a separate extension (FMA3).

AVX-512 expands that to 512 bits. AVX-512 is relatively new and first showed up, i think, in the Xeon Phi co-processor cards. it's on some of the more recent HEDT chips (Skylake-X and beyond) and the big Xeon server chips.

SSE4 i believe is 128-bit.

No, I am not confusing the different AVX instructions nor the register widths. A point I was trying to make is that AVX apps can only be used efficiently on Intel hardware, since Intel has had full-width 256-bit AVX execution for a long time. AMD cpus have been stuck with 128-bit execution units so far and have to split each 256-bit AVX instruction into two 128-bit operations, which is inefficient.

That is only going to change with the new Zen 2 based Ryzen cpus next year, which will get full 256-bit execution for AVX.

So SSE41, with its 128-bit instructions, falls right into the current AMD cpu wheelhouse.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1967796 · Report as offensive
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 1967800 - Posted: 29 Nov 2018, 19:48:14 UTC - in response to Message 1967796.  

yeah, i guess i didn't see the relevance of mentioning avx-512 since there isn't an app for it and very very few people here even have hardware capable of running it anyway. i was strictly comparing SSE4 to AVX with the 128 vs 256-bit register differences.

but my initial comment about AVX was in reference to the intel chips specifically. in theory they should be faster, but i guess the app either isn't coded in a way to utilize it to its potential, or the type of computations we do on the SETI WUs aren't able to use the larger registers.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 1967800 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1967804 - Posted: 29 Nov 2018, 19:57:20 UTC - in response to Message 1967800.  

but my initial comment about AVX was in reference to the intel chips specifically. in theory they should be faster, but i guess the app either isn't coded in a way to utilize it to its potential, or the type of computations we do on the SETI WUs aren't able to use the larger registers.

Which goes back to my original comment about the app codebase which Joe Segur et al worked on back in the 2000's. I'm sure it was cutting edge at the time but hardware has moved on. And the codebase is still from that period. No developers have added anything new to it since then. So we are stuck with rudimentary AVX code for all our AVX apps. Nothing you can do with a compiler can change that. The base code has to be updated to make better and more efficient use of AVX2 and AVX-512. That ain't happening until we get some new code writers.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1967804 · Report as offensive
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 1967810 - Posted: 29 Nov 2018, 20:20:40 UTC - in response to Message 1967804.  

well, if i understand correctly, unless SETI can use FMA (multiply accumulate), there probably won't be much improvement between AVX and AVX2. i thought it was all FP code, in which case AVX vs AVX2 would have minimal difference.

but i get your point that it's all old code. sounds like even the AVX app was more of a straight port from the SSE4 code
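
For reference, a fused multiply-add computes a*b + c with a single rounding at the end, where separate multiply and add instructions round twice. The sketch below emulates the fused result with exact rational arithmetic; it only illustrates the arithmetic and says nothing about what the SETI apps actually do. The function names are hypothetical.

```python
from fractions import Fraction


def unfused_multiply_add(a: float, b: float, c: float) -> float:
    """Multiply, round to double, then add and round again."""
    return a * b + c


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate FMA: compute a*b + c exactly, then round to double once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Inputs chosen so that the product 1 - 2**-60 cannot be held in a double.
A = 1.0 + 2.0 ** -30
B = 1.0 - 2.0 ** -30
C = -1.0

if __name__ == "__main__":
    print(unfused_multiply_add(A, B, C))
    print(fused_multiply_add(A, B, C))
```

The two results differ because the intermediate product is rounded to 1.0 in the unfused case. Besides accuracy, hardware FMA also does the multiply and add in one instruction, which is where a speedup would come from if the code is dominated by such operations.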
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 1967810 · Report as offensive
Profile ML1
Volunteer moderator
Volunteer tester

Joined: 25 Nov 01
Posts: 20289
Credit: 7,508,002
RAC: 20
United Kingdom
Message 1967826 - Posted: 29 Nov 2018, 21:18:55 UTC

Good work there, looks interesting...

Can this sort of benchmarking also be used to test for host system bottlenecks such as CPU resource units contention, CPU cache exhaustion/poisoning, memory bandwidth limits, whatever other bottlenecks?


Happy cool crunchin',
Martin
See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 1967826 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34258
Credit: 79,922,639
RAC: 80
Germany
Message 1967835 - Posted: 29 Nov 2018, 22:14:17 UTC - in response to Message 1967804.  

but my initial comment about AVX was in reference to the intel chips specifically. in theory they should be faster, but i guess the app either isn't coded in a way to utilize it to its potential, or the type of computations we do on the SETI WUs aren't able to use the larger registers.

Which goes back to my original comment about the app codebase which Joe Segur et al worked on back in the 2000's. I'm sure it was cutting edge at the time but hardware has moved on. And the codebase is still from that period. No developers have added anything new to it since then. So we are stuck with rudimentary AVX code for all our AVX apps. Nothing you can do with a compiler can change that. The base code has to be updated to make better and more efficient use of AVX2 and AVX-512. That ain't happening until we get some new code writers.


No, no, no.
Last code change by Joe Segur was in 2015.
That's just 3 years ago.
Each windows application has been hand optimized.
Also don't forget these changes have been made for windows and not all of them have the same effect on Linux.
Different compilers are just one reason.
For AVX2 there simply is no data which would benefit from it.


With each crime and every kindness we birth our future.
ID: 1967835 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1967836 - Posted: 29 Nov 2018, 22:15:41 UTC - in response to Message 1967826.  

Sure, Rick has been using it extensively to determine the optimal number of cpu tasks to run on his mega-cpu cruncher with 64 threads. Any cpu benchmark can be used to determine bottlenecks when changing a single variable. But other common benchmarks only run a single cpu core or all cpu cores, with no choice in between. His benchmark will be able to closely mimic our actual cpu and gpu loading with our specific cpu and gpu apps to replicate actual crunching conditions on the host.
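
The kind of sweep described here, changing only the number of concurrent cpu tasks, boils down to comparing throughput rather than per-task time. A minimal sketch, assuming you have recorded the average elapsed seconds per task at each thread count; the numbers below are invented placeholders, not benchMT results.

```python
from typing import Dict


def tasks_per_hour(threads: int, avg_seconds_per_task: float) -> float:
    """Throughput when `threads` tasks run concurrently at the given average time."""
    return threads * 3600.0 / avg_seconds_per_task


def best_thread_count(timings: Dict[int, float]) -> int:
    """Pick the thread count with the highest throughput."""
    return max(timings, key=lambda threads: tasks_per_hour(threads, timings[threads]))


# Placeholder data: thread count -> average seconds per task (made up).
EXAMPLE_TIMINGS = {16: 1800.0, 24: 2000.0, 32: 3200.0}

if __name__ == "__main__":
    for threads, seconds in EXAMPLE_TIMINGS.items():
        print(threads, round(tasks_per_hour(threads, seconds), 1))
    print("best:", best_thread_count(EXAMPLE_TIMINGS))
```

In this made-up data each task gets slower as threads are added, yet total throughput peaks in the middle, which is the sort of bottleneck signature such a sweep is meant to expose.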
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1967836 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1967837 - Posted: 29 Nov 2018, 22:17:56 UTC - in response to Message 1967835.  

No, no, no.
Last code change by Joe Segur was in 2015.
That's just 3 years ago.
Each windows application has been hand optimized.

Ok, so where do we go to look up the app commit changes for all the apps, for those of us that don't have perfect mimetic memory?
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1967837 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Joined: 17 Feb 01
Posts: 34258
Credit: 79,922,639
RAC: 80
Germany
Message 1967839 - Posted: 29 Nov 2018, 22:21:51 UTC - in response to Message 1967837.  

No, no, no.
Last code change by Joe Segur was in 2015.
That's just 3 years ago.
Each windows application has been hand optimized.

Ok, so where do we go to look up the app commit changes for all the apps, for those of us that don't have perfect mimetic memory?


That's in the development section of Lunatics.
Only a few of us Lunatics have access to it.


With each crime and every kindness we birth our future.
ID: 1967839 · Report as offensive
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13164
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1967843 - Posted: 29 Nov 2018, 22:41:41 UTC - in response to Message 1967839.  
Last modified: 29 Nov 2018, 22:42:09 UTC

No, no, no.
Last code change by Joe Segur was in 2015.
That's just 3 years ago.
Each windows application has been hand optimized.

Ok, so where do we go to look up the app commit changes for all the apps, for those of us that don't have perfect mimetic memory?


That's in the development section of Lunatics.
Only a few of us Lunatics have access to it.

OK, but without access to a hidden developer area that most of us can't get into, all anyone can do about the history of the development of the apps is to make guesses from the release date and any docs accompanying the app.

My guesstimate was off by a decade.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1967843 · Report as offensive
Profile Zalster Special Project $250 donor
Volunteer tester
Joined: 27 May 99
Posts: 5517
Credit: 528,817,460
RAC: 242
United States
Message 1967846 - Posted: 29 Nov 2018, 22:52:12 UTC

Way off topic but does anyone know if Joe was from NY?
ID: 1967846 · Report as offensive