Developing a Multi-Threaded Benchmarking App for Linux

Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1966945 - Posted: 25 Nov 2018, 3:29:48 UTC

I have run my first benchmark of CPU app performance. It uses 30 cores and does 30 repetitions for each app.


The high variability is likely an effect of the 2990WX having direct memory access on only half of its cores, and of the last batch of jobs running under low load. I should re-run this on my 1950X system in the future. Also, probably would be better to use 30 different WUs rather than 30 repetitions of the same WU.
GitHub: Ricks-Lab
Instagram: ricks_labs
ID: 1966945
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1966946 - Posted: 25 Nov 2018, 3:46:10 UTC

The AVX2 app is looking good. I don't think I ever retested that app on Ryzen+.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1966946
Grant (SSSF)
Volunteer tester
Joined: 19 Aug 99
Posts: 13720
Credit: 208,696,464
RAC: 304
Australia
Message 1966948 - Posted: 25 Nov 2018, 4:25:14 UTC - in response to Message 1966945.  

Also, probably would be better to use 30 different WUs rather than 30 repetitions of the same WU.

A mix of Arecibo & GBT tasks would be interesting.
One of the issues with running 2 GPU WUs at a time under CUDA50 was that when an Arecibo & a GBT WU were running on the same GPU, the runtime for the Arecibo task would generally triple.
I don't see that happening on the CPU, but I wouldn't be surprised if there were some performance impact there.
Grant
Darwin NT
ID: 1966948
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1966971 - Posted: 25 Nov 2018, 8:02:01 UTC - in response to Message 1966948.  

Good idea. I will set up a new benchmark run to execute during this week's outage.
ID: 1966971
Profile Brent Norman Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Volunteer tester
Joined: 1 Dec 99
Posts: 2786
Credit: 685,657,289
RAC: 835
Canada
Message 1966972 - Posted: 25 Nov 2018, 8:28:32 UTC

Curious: did you run each app individually for a day, then the next, and so on?
Or did you run a mixture of apps all at once?
ID: 1966972
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1966974 - Posted: 25 Nov 2018, 8:43:21 UTC - in response to Message 1966972.  

I ran 30 iterations of all apps with a single WU, which is 180 tasks. These 180 tasks were loaded across 30 cores until complete.
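
As a rough illustration of how 180 tasks can be spread across 30 workers, here is a minimal Python sketch. It is not benchMT's actual code: the app names, the `run_task` placeholder and the WU file name are all hypothetical, and the count of 6 apps is only inferred from 180 tasks / 30 iterations.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical app names; 6 apps x 30 iterations = 180 tasks.
APPS = [f"app_{i}" for i in range(6)]
ITERATIONS = 30
MAX_WORKERS = 30  # one worker per core used in the benchmark


def run_task(app, iteration, wu="reference.wu"):
    """Placeholder for launching one app on one WU; returns a result record."""
    # A real runner would start the app as a subprocess and time it.
    return {"app": app, "iteration": iteration, "wu": wu}


def run_benchmark():
    """Build every (app, iteration) job and drain the list with a fixed pool."""
    jobs = list(product(APPS, range(ITERATIONS)))
    # The pool keeps at most MAX_WORKERS tasks in flight; a finished
    # worker picks up the next pending job until the list is drained.
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return list(pool.map(lambda job: run_task(*job), jobs))


results = run_benchmark()
print(len(results))
```

With this kind of scheduling, fewer than 30 tasks remain near the end of the run, so the last tasks execute on a lightly loaded machine. That is the low-load effect mentioned for the first results.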
ID: 1966974
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1967167 - Posted: 26 Nov 2018, 0:14:32 UTC

Here are the 1950X results for a benchmark run identical to the one I did on the 2990WX. The 1950X has SMT enabled, while the 2990WX has it disabled, so both runs used 30 of a total of 32 available threads.
ID: 1967167
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1967169 - Posted: 26 Nov 2018, 0:39:01 UTC

Interesting. It looks like the Gen. 1 TR likes the standard AVX application, while the Gen. 2 TR likes the AVX2 application.
ID: 1967169
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1967189 - Posted: 26 Nov 2018, 2:39:17 UTC

I'm currently running the r3711 SSE41 app against the AVX2 app, since the SSE41 one wasn't included in Rick's set of default apps. I am also seeing some anomalous behavior in the number of GPU instances that can be invoked.
ID: 1967189
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1967191 - Posted: 26 Nov 2018, 2:42:26 UTC - in response to Message 1967189.  

Can you provide a link to where a set of r3711 apps can be found? I plan to include them in the benchmark run scheduled for the outage.
ID: 1967191
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1967193 - Posted: 26 Nov 2018, 2:51:27 UTC - in response to Message 1967189.  

benchMT currently allows only 1 task per GPU. The number of GPUs is determined by:
lshw -short | grep display

Does the log file indicate the correct number of GPUs?
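
For anyone who wants to check the count by hand, here is a minimal Python sketch of the same idea. It is an illustration only, not the code benchMT uses: the function names and the sample text are made up, and the parsing simply mimics what `grep display` does.

```python
import subprocess


def count_display_lines(lshw_short_output):
    """Count lines containing 'display', as `grep display` would."""
    return sum(1 for line in lshw_short_output.splitlines() if "display" in line)


def detect_gpu_count():
    """Run `lshw -short` and count display-class lines (requires lshw)."""
    out = subprocess.run(
        ["lshw", "-short"], capture_output=True, text=True, check=True
    ).stdout
    return count_display_lines(out)


# Illustrative sample only, not output from a real system.
SAMPLE = """\
H/W path        Device   Class      Description
/0/100/1/0               display    GPU A
/0/100/2/0               display    GPU B
/0/100/3/0               display    GPU C
/0/100/4                 network    NIC
"""
print(count_display_lines(SAMPLE))
```

Like the shell pipeline, this matches any line containing the word "display", so a device whose description happens to include that word would also be counted.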
ID: 1967193
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1967195 - Posted: 26 Nov 2018, 2:57:52 UTC - in response to Message 1967191.  

The r3711 SSE41 app is the default app installed by the TBar BOINC All-in-One packages.
http://www.arkayn.us/lunatics/BOINC-7.8.3.7z
ID: 1967195
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1967199 - Posted: 26 Nov 2018, 3:11:24 UTC - in response to Message 1967193.  

I'm only running one task per GPU since I am running the CUDA92 app. But if I invoke only 3 instances of the application, it runs just two tasks on two GPUs and holds the third instance pending until the first two complete; the pending task then runs on the first GPU. I always see all 3 GPUs.

This is the entry in benchCFG

setiathome_x41p_V0.97b2_Linux-Pascal+_cuda92
setiathome_x41p_V0.97b2_Linux-Pascal+_cuda92
setiathome_x41p_V0.97b2_Linux-Pascal+_cuda92
#setiathome_x41p_V0.97b2_Linux-Pascal+_cuda92

This is what the benchmark is going to execute

Only 0 CPU jobs and 3 GPU jobs. Max Threads reduced to 3
List of Initialized Slots
SlotNum | platform | device | state | job | SlotDir
-0------| GPU | 0 | EMPTY | None| /home/keith/Downloads/Utils/benchMT/Slots/0
-1------| GPU | 1 | EMPTY | None| /home/keith/Downloads/Utils/benchMT/Slots/1
-2------| CPU | NA | EMPTY | None| /home/keith/Downloads/Utils/benchMT/Slots/2
##### 3 total slots
Pending jobs (CPU/GPU): 0 / 3
Pending reference jobs: 0
Execute listed jobs? [y/N]

With this benchCFG file entry

setiathome_x41p_V0.97b2_Linux-Pascal+_cuda92
setiathome_x41p_V0.97b2_Linux-Pascal+_cuda92
setiathome_x41p_V0.97b2_Linux-Pascal+_cuda92
setiathome_x41p_V0.97b2_Linux-Pascal+_cuda92

This is what the benchmark is going to execute

Only 0 CPU jobs and 4 GPU jobs. Max Threads reduced to 4
List of Initialized Slots
SlotNum | platform | device | state | job | SlotDir
-0------| GPU | 0 | EMPTY | None| /home/keith/Downloads/Utils/benchMT/Slots/0
-1------| GPU | 1 | EMPTY | None| /home/keith/Downloads/Utils/benchMT/Slots/1
-2------| GPU | 2 | EMPTY | None| /home/keith/Downloads/Utils/benchMT/Slots/2
-3------| CPU | NA | EMPTY | None| /home/keith/Downloads/Utils/benchMT/Slots/3
##### 4 total slots
Pending jobs (CPU/GPU): 0 / 4
Pending reference jobs: 0
Execute listed jobs? [y/N]
ID: 1967199
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1967202 - Posted: 26 Nov 2018, 3:19:05 UTC
Last modified: 26 Nov 2018, 3:22:58 UTC

This is the output of the r3711 SSE41 app versus the r3712 AVX2 app. I ran 4 instances of each app.
https://www.dropbox.com/s/wjgz56tqmrn1zi1/Screenshot%20from%202018-11-25%2018-52-54.png?dl=0

The SSE41 app is up to 10% faster than the AVX2 app. That is what I found on my old Gen. 1 1700X and 1800X CPUs, so I am not seeing any improvement on the Ryzen+ 2700X CPUs. It might be different on Threadrippers.

I will be able to test on TR once I get my TR platform built.
ID: 1967202
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1967204 - Posted: 26 Nov 2018, 3:25:59 UTC - in response to Message 1967199.  

Looks like a bug. Let me try to reproduce it on my system this evening.
ID: 1967204
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1967205 - Posted: 26 Nov 2018, 3:28:03 UTC - in response to Message 1967202.  

For analysis, I suggest using the .psv file in the testData directory. It is pipe delimited, so it is easy to import into Excel and summarize with a pivot table.
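
If a spreadsheet is not convenient, a pipe-delimited file can also be summarized with a few lines of Python. This is a minimal sketch under stated assumptions: the column names `app` and `elapsed` and the sample rows are invented for illustration, so check the header line of the real .psv file and adjust the names.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_by_app(psv_file, app_col="app", time_col="elapsed"):
    """Average a numeric column per app from a pipe-delimited file object."""
    times = defaultdict(list)
    for row in csv.DictReader(psv_file, delimiter="|"):
        times[row[app_col].strip()].append(float(row[time_col]))
    return {app: mean(vals) for app, vals in times.items()}


# Made-up rows for illustration; real column names may differ.
SAMPLE = "app|wu|elapsed\nsse41|wu1|100.0\nsse41|wu1|110.0\navx2|wu1|120.0\n"
summary = mean_by_app(io.StringIO(SAMPLE))
print(summary)
```

For a real file, open it with `open(path, newline="")` and pass the file object to `mean_by_app`, overriding `app_col` and `time_col` as needed.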
ID: 1967205
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1967210 - Posted: 26 Nov 2018, 4:18:16 UTC - in response to Message 1967205.  

I'll have to pass since I flunked OpenCalc and Excel.
ID: 1967210
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1967236 - Posted: 26 Nov 2018, 10:21:50 UTC - in response to Message 1967204.  

When I run this on my system, it all appears normal. Can you post or send me your complete BenchCFG file and the hostname*.txt file in the run subdir of testData? Running it with the --debug option might also give more insight. Also, does this app require a .cl file? If so, it needs to be in the APPS_GPU directory. Thanks!
ID: 1967236
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1967237 - Posted: 26 Nov 2018, 10:30:20 UTC - in response to Message 1967195.  

I just downloaded and extracted it, but did not find the r3711 SSE41 app.
ID: 1967237
Profile RueiKe Special Project $250 donor
Volunteer tester
Joined: 14 Feb 16
Posts: 492
Credit: 378,512,430
RAC: 785
Taiwan
Message 1967241 - Posted: 26 Nov 2018, 11:10:49 UTC - in response to Message 1967237.  

I found it in a different download on the site. I will include it in my next run.
ID: 1967241