Proclamation - memory speed is more important than shader count or gpu core clocks

Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1949119 - Posted: 12 Aug 2018, 2:30:15 UTC

I am going to make a statement. I have two almost identical systems. Both run the same motherboard, the same CPU, and the same type and quantity of system memory. Both run at the same CPU clocks and the same system memory clocks.

One system has two identical GTX 1070 cards and one GTX 1080 card.

The other system has two identical GTX 1070 cards and one GTX 1070Ti card. The 1070Ti card has one fewer shader unit (SM) than the 1080 card.
All cards have 8 GB of GPU memory. All cards run at approximately the same GPU core clock of around 2 GHz.

The 1070 cards have GDDR5 memory running at an 8 GHz effective memory clock. The 1080 card has GDDR5X memory, which runs at 11 GHz.

The system with the GTX 1080 in it has a 5K RAC advantage over the system with the 1070Ti in it. Both systems run the Linux CUDA 9.0 special app.

GPU memory speed and memory bandwidth are more important to task completion time than GPU core clock. This is my assertion.
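
As a rough back-of-envelope check (assuming the stock 256-bit memory bus on all three cards and the effective memory clocks quoted above), the raw bandwidth gap works out to roughly 256 GB/s versus 352 GB/s:

# Peak memory bandwidth = per-pin data rate * bus width / 8 (bits to bytes).
# Assumes the stock 256-bit bus of the GTX 1070/1070Ti/1080 and the effective
# memory clocks quoted above.
BUS_WIDTH_BITS = 256

def peak_bandwidth_gb_s(gbps_per_pin):
    return gbps_per_pin * BUS_WIDTH_BITS / 8

for card, rate in [("GTX 1070 / 1070Ti (GDDR5, 8 Gbps)", 8.0),
                   ("GTX 1080 (GDDR5X, 11 Gbps)", 11.0)]:
    print(f"{card}: ~{peak_bandwidth_gb_s(rate):.0f} GB/s")
# ~256 GB/s vs ~352 GB/s, roughly a 37% bandwidth advantage for the 1080.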
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1949119
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13722
Credit: 208,696,464
RAC: 304
Australia
Message 1949120 - Posted: 12 Aug 2018, 2:37:16 UTC - in response to Message 1949119.  

GPU memory speed and memory bandwidth are more important to task completion time than GPU core clock. This is my assertion.

Pretty sure Petri recently posted that one of his biggest GPU application speed ups was done by reducing the amount of GPU memory access required.
Grant
Darwin NT
ID: 1949120
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1949121 - Posted: 12 Aug 2018, 3:08:26 UTC - in response to Message 1949120.  

GPU memory speed and memory bandwidth are more important to task completion time than GPU core clock. This is my assertion.

Pretty sure Petri recently posted that one of his biggest GPU application speed ups was done by reducing the amount of GPU memory access required.

But I'm pretty sure he was referring to his latest statically linked releases. Not the older zi3v apps.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1949121
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13722
Credit: 208,696,464
RAC: 304
Australia
Message 1949132 - Posted: 12 Aug 2018, 4:37:19 UTC - in response to Message 1949121.  

But I'm pretty sure he was referring to his latest statically linked releases. Not the older zi3v apps.

But it does show that memory access is a significant factor in computation times.
The less memory access required, the faster the WU is processed. And by the same reasoning, the faster any memory accesses are, the faster computations will be.
Grant
Darwin NT
ID: 1949132
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1949136 - Posted: 12 Aug 2018, 5:26:30 UTC - in response to Message 1949132.  

Yes, that is exactly the point I was trying to make. Don't worry about the GPU core clock. Let the Nvidia GPU Boost 3.0 mechanism in the card's firmware take care of the core clock. You can just run it stock and the card will boost the clock to whatever the thermal and power limits allow. But you are always going to get penalized by the Nvidia driver for running a compute load, with a severe drop in memory clocks. That parameter is NOT boosted by GPU Boost 3.0. So whatever you can do to get the memory clock back to what it should be running at in the P0 state is the best thing for reducing task completion times.
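
If you want to see the downclock for yourself, a quick way is to poll nvidia-smi while a task is running. A minimal sketch (it assumes the standard nvidia-smi utility is installed and on the PATH):

# Minimal sketch: report the current performance state and clocks while a
# compute task is running. Assumes the standard nvidia-smi utility is on PATH.
import subprocess

result = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,name,pstate,clocks.mem,clocks.gr",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True)
print(result.stdout)
# Under a CUDA load, a Pascal card will typically report pstate P2 and a
# memory clock well below its advertised P0 figure.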
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1949136
Ghia
Joined: 7 Feb 17
Posts: 238
Credit: 28,911,438
RAC: 50
Norway
Message 1949141 - Posted: 12 Aug 2018, 7:52:15 UTC

I thought the shaders were 2432 (1070Ti) vs. 2560 (1080)?
Humans may rule the world...but bacteria run it...
ID: 1949141
Stephen "Heretic" Crowdfunding Project Donor*Special Project $75 donorSpecial Project $250 donor
Volunteer tester
Joined: 20 Sep 12
Posts: 5557
Credit: 192,787,363
RAC: 628
Australia
Message 1949159 - Posted: 12 Aug 2018, 10:28:40 UTC - in response to Message 1949141.  

I thought the shaders were 2432 (1070Ti) vs. 2560 (1080)?


. . I think you are talking about CUDA cores, while Keith was talking about CUs (the compute units that the CUDA cores are grouped into).

Stephen

?
ID: 1949159
mmonnin
Volunteer tester

Joined: 8 Jun 17
Posts: 58
Credit: 10,176,849
RAC: 0
United States
Message 1949167 - Posted: 12 Aug 2018, 13:04:49 UTC

Did you set the memory speeds in Linux? By default they do not run at 8 Gbps but only at 7.6 Gbps for me, on a 1070 and a 1070Ti in Linux.
ID: 1949167
Al Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor
Joined: 3 Apr 99
Posts: 1682
Credit: 477,343,364
RAC: 482
United States
Message 1949168 - Posted: 12 Aug 2018, 13:13:51 UTC - in response to Message 1949136.  

Yes, that is exactly the point I was trying to make. Don't worry about the GPU core clock. Let the Nvidia GPU Boost 3.0 mechanism in the card's firmware take care of the core clock. You can just run it stock and the card will boost the clock to whatever the thermal and power limits allow. But you are always going to get penalized by the Nvidia driver for running a compute load, with a severe drop in memory clocks. That parameter is NOT boosted by GPU Boost 3.0. So whatever you can do to get the memory clock back to what it should be running at in the P0 state is the best thing for reducing task completion times.
Keith, does this just apply to the 10x0 series cards, that's where the driver got borked?

ID: 1949168
mmonnin
Volunteer tester

Joined: 8 Jun 17
Posts: 58
Credit: 10,176,849
RAC: 0
United States
Message 1949175 - Posted: 12 Aug 2018, 13:33:13 UTC - in response to Message 1949168.  

Yes, that is exactly the point I was trying to make. Don't worry about the GPU core clock. Let the Nvidia GPU Boost 3.0 mechanism in the card's firmware take care of the core clock. You can just run it stock and the card will boost the clock to whatever the thermal and power limits allow. But you are always going to get penalized by the Nvidia driver for running a compute load, with a severe drop in memory clocks. That parameter is NOT boosted by GPU Boost 3.0. So whatever you can do to get the memory clock back to what it should be running at in the P0 state is the best thing for reducing task completion times.
Keith, does this just apply to the 10x0 series cards, that's where the driver got borked?


GPU Boost works pretty similarly on 9xx series Maxwell cards. A memory OC in the P2 state is not allowed by most apps; I think NV Inspector can do it in Windows. Those cards can be flashed to whatever memory clock you want, though, unlike Pascal.
ID: 1949175
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1949208 - Posted: 12 Aug 2018, 16:46:15 UTC - in response to Message 1949167.  

Did you set the memory speeds in Linux? By default they do not run at 8 Gbps but only at 7.6 Gbps for me, on a 1070 and a 1070Ti in Linux.

Yes, I add some overclock back into the P2 state so that the cards run close to what they would run at in the P0 state if Nvidia didn't penalize us for compute loads. I add 600 MHz to the memory clock on the 1070, so the effective clock is 8200 MHz. That is only 200 MHz past the stock P0 speed.

You can use Nvidia Profile Inspector in Windows to turn off the CUDA P2 downclock, but there is no such utility or ability to do that in Linux, so you have to overclock the P2 state a bit.
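
If you want to script the Linux-side overclock, here is a minimal sketch of the nvidia-settings call. It assumes an X session with the Coolbits option enabled so clock offsets are writable, and that performance level 3 is the highest level on the card; the index varies by card, so check the PowerMizer page or "nvidia-settings -q GPUPerfModes" first.

# Minimal sketch: apply a memory clock offset with nvidia-settings.
# Assumes Coolbits is enabled in xorg.conf and that performance level 3 is the
# right index for this card (it varies; verify with the PowerMizer page or
# "nvidia-settings -q GPUPerfModes" before using it).
import subprocess

GPU_INDEX = 0
MEM_OFFSET_MHZ = 600   # the +600 example above; start lower and work up carefully

subprocess.run(
    ["nvidia-settings",
     "-a", f"[gpu:{GPU_INDEX}]/GPUMemoryTransferRateOffset[3]={MEM_OFFSET_MHZ}"],
    check=True)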
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1949208
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1949209 - Posted: 12 Aug 2018, 16:51:41 UTC - in response to Message 1949168.  

Yes, that is exactly the point I was trying to make. Don't worry about the GPU core clock. Let the Nvidia GPU Boost 3.0 mechanism in the card's firmware take care of the core clock. You can just run it stock and the card will boost the clock to whatever the thermal and power limits allow. But you are always going to get penalized by the Nvidia driver for running a compute load, with a severe drop in memory clocks. That parameter is NOT boosted by GPU Boost 3.0. So whatever you can do to get the memory clock back to what it should be running at in the P0 state is the best thing for reducing task completion times.
Keith, does this just apply to the 10x0 series cards, that's where the driver got borked?

No, the P2 compute-load penalty is applied to all Nvidia cards except for some 1050 cards and similar cards in previous generations. As soon as the shader unit (SM) count gets above 6 or so, they get penalized. So Kepler, Maxwell and Pascal all suffer the P2 compute-load penalty, simply because the video driver enforces it. Nvidia could change that if they wanted, but I think they want to keep pushing anyone running compute loads toward their Tesla and Quadro products.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1949209
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1949210 - Posted: 12 Aug 2018, 17:02:01 UTC

I'll copy my message from our GPUUG forum about how to turn off the CUDA P2 compute-load penalty for Windows users.

I'm pretty sure this isn't known, or I have been VERY asleep at the wheel, but I have discovered how to get Nvidia cards INCLUDING PASCAL into P0 state in Windows. You can even overclock both the memory and graphics clock to your heart's content.

You need to get the current Nvidia Inspector utility PLUS the Nvidia Profile Inspector utility. The key piece is the Profile Inspector.
Nvidia Profile Inspector download

"CUDA - Force P2" state listed under section "5 - common"

By just going into Profile Inspector, to Section 5, and toggling the "CUDA - Force P2" setting to OFF, you will be able to run your Nvidia cards in the P0 state while crunching Seti or whatever project you desire.

You can then use the normal Nvidia Inspector to add a clock offset to both the graphics cores and the memory, or just leave the cards at the stock P0 clocks. The cards will remain in the P0 state forever.

Hope everybody reads this post and tries out the fix and posts their feedback to this thread. Have a wonderful day!

Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1949210
mmonnin
Volunteer tester

Joined: 8 Jun 17
Posts: 58
Credit: 10,176,849
RAC: 0
United States
Message 1949344 - Posted: 13 Aug 2018, 12:19:50 UTC - in response to Message 1949208.  
Last modified: 13 Aug 2018, 12:20:11 UTC

Did you set the memory speeds in Linux? By default they do not run at 8 Gbps but only at 7.6 Gbps for me, on a 1070 and a 1070Ti in Linux.

Yes, I add some overclock back into the P2 state so that the cards run close to what they would run at in the P0 state if Nvidia didn't penalize us for compute loads. I add 600 MHz to the memory clock on the 1070, so the effective clock is 8200 MHz. That is only 200 MHz past the stock P0 speed.

You can use Nvidia Profile Inspector in Windows to turn off the CUDA P2 downclock, but there is no such utility or ability to do that in Linux, so you have to overclock the P2 state a bit.


Depending on the memory OEM on the card, it might be able to OC quite high. I've had my 1070 in Windows at +900 before.

I currently have them both at the stock 8 GHz in Linux in P2, but there's only about 5 seconds difference between a 1070 and a 1070Ti since the 1070 can OC the GPU higher. About 2:20 to 2:25 or so per task, so there isn't a lot of time for a memory clock change to make a noticeable difference. I'll have to experiment some more.
ID: 1949344
Ian&Steve C.
Joined: 28 Sep 99
Posts: 4267
Credit: 1,282,604,591
RAC: 6,640
United States
Message 1949357 - Posted: 13 Aug 2018, 13:57:46 UTC - in response to Message 1949209.  
Last modified: 13 Aug 2018, 14:04:10 UTC

Yes, that is exactly the point I was trying to make. Don't worry about the GPU core clock. Let the Nvidia GPU Boost 3.0 mechanism in the card's firmware take care of the core clock. You can just run it stock and the card will boost the clock to whatever the thermal and power limits allow. But you are always going to get penalized by the Nvidia driver for running a compute load, with a severe drop in memory clocks. That parameter is NOT boosted by GPU Boost 3.0. So whatever you can do to get the memory clock back to what it should be running at in the P0 state is the best thing for reducing task completion times.
Keith, does this just apply to the 10x0 series cards, that's where the driver got borked?

No, the P2 compute-load penalty is applied to all Nvidia cards except for some 1050 cards and similar cards in previous generations. As soon as the shader unit (SM) count gets above 6 or so, they get penalized. So Kepler, Maxwell and Pascal all suffer the P2 compute-load penalty, simply because the video driver enforces it. Nvidia could change that if they wanted, but I think they want to keep pushing anyone running compute loads toward their Tesla and Quadro products.


I can only confirm that my 750Ti (Maxwell) and 1050Ti (Pascal) cards ran at P0 by default. My 1060s all ran P2 by default, and my 1080Tis run P2 by default. I never checked what they were doing when I was running 760s.



As for memory speed, I guess it might only show big improvements on the less optimized apps, SoG and zi3v.

On my systems running Petri's latest iteration, changing memory speed does not have much effect: MAYBE 1-2 seconds faster on WUs that take about 60-70 seconds on average (1080Ti).

My theory on why I'm not seeing the same improvements with an increased memory clock is that the latest app just doesn't rely on the memory as much.
Seti@Home classic workunits: 29,492 CPU time: 134,419 hours

ID: 1949357
Cruncher-American Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor

Joined: 25 Mar 02
Posts: 1513
Credit: 370,893,186
RAC: 340
United States
Message 1949377 - Posted: 13 Aug 2018, 16:07:56 UTC - in response to Message 1949210.  

I just d/l Profile Inspector per your instructions, and got 2.13, NOT 2.1.3.9.
It does NOT have the "CUDA - Force P2" line item under part 5.

How can I get the version you used? Or another, that supports the change?
ID: 1949377
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1949385 - Posted: 13 Aug 2018, 16:59:02 UTC - in response to Message 1949377.  

I just d/l Profile Inspector per your instructions, and got 2.13, NOT 2.1.3.9.
It does NOT have the "CUDA - Force P2" line item under part 5.

How can I get the version you used? Or another, that supports the change?

Looks like the old link does not point at the latest version. You can always get the latest at the developer's GitHub repository.

2.1.3.19
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1949385
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1949387 - Posted: 13 Aug 2018, 17:06:58 UTC - in response to Message 1949344.  


Depending on the memory OEM on the card, it might be able to OC quite high. I've had my 1070 in Windows at +900 before.

I currently have them both at the stock 8 GHz in Linux in P2, but there's only about 5 seconds difference between a 1070 and a 1070Ti since the 1070 can OC the GPU higher. About 2:20 to 2:25 or so per task, so there isn't a lot of time for a memory clock change to make a noticeable difference. I'll have to experiment some more.

You have to be careful about adding too much memory overclock, or you can crash your system and trash all your work like I just did a couple of days ago. My 1070's wouldn't take a +1000 MHz memory overclock boost. The danger is that as each task unloads from the card, the card transitions from P2 back into P0 (what the Linux driver shows as its highest performance level). The overclock offset is applied to all performance states, including P0, so the 1070 card tries to run at 9000 MHz, which it can't do, and crashes.

The solution is either to overclock more mildly, like my 600 MHz boost, or to use Petri's newest KeepP2 application, which runs a small compute load in the background on the card at all times and prevents it from returning to the P0 state.
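
For anyone curious, the idea behind a keep-P2 helper is just a tiny compute load that never finishes, so the driver never drops the card back to P0 between tasks. This is not Petri's KeepP2 code, just a minimal sketch of the idea, assuming PyCUDA and numpy are installed:

# Minimal sketch of the keep-a-small-compute-load-running idea (NOT Petri's
# KeepP2 app). Holding a CUDA context and launching a trivial kernel every few
# seconds keeps the card in its compute (P2) state between real tasks.
import time
import numpy as np
import pycuda.autoinit          # creates and holds a CUDA context on GPU 0
import pycuda.gpuarray as gpuarray

a = gpuarray.to_gpu(np.ones(1024, dtype=np.float32))

while True:
    (a * a).get()               # tiny elementwise kernel plus a copy back
    time.sleep(5)               # keep the actual GPU load negligible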
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1949387
mmonnin
Volunteer tester

Joined: 8 Jun 17
Posts: 58
Credit: 10,176,849
RAC: 0
United States
Message 1949854 - Posted: 15 Aug 2018, 11:41:46 UTC

I moved the 1070 from 8k back to the Linux P2 default of 7.6k and saw about a 4-5 second increase in run times. I bumped it back up to 8.1k this morning. Slow increments atm. The 1070 is just a couple of seconds behind the 1070Ti since it boosts higher.
ID: 1949854
Profile Keith Myers Special Project $250 donor
Volunteer tester
Joined: 29 Apr 01
Posts: 13161
Credit: 1,160,866,277
RAC: 1,873
United States
Message 1949897 - Posted: 15 Aug 2018, 16:28:57 UTC - in response to Message 1949854.  

I've found that a 600 MHz boost of the memory clock in P2 is entirely safe, for 8200 MHz effective. Currently running 8400 MHz on my 1070's with keepP2.
Seti@Home classic workunits:20,676 CPU time:74,226 hours

A proud member of the OFA (Old Farts Association)
ID: 1949897