PCIe speed and CUDA performance

Profile -= Vyper =-
Volunteer tester
Joined: 5 Sep 99
Posts: 1652
Credit: 1,065,191,981
RAC: 2,537
Sweden
Message 1064934 - Posted: 9 Jan 2011, 12:17:45 UTC
Last modified: 9 Jan 2011, 12:18:42 UTC

Man!

I should've been a novelist instead:

http://www.imdb.com/title/tt0060196/
=
The CPU, The GPU & The coder

:)

Think twice before you click and send the driver on its way!

Regards Vyper

_________________________________________________________________________
Addicted to SETI crunching!
Founder of GPU Users Group
ID: 1064934 · Report as offensive
Profile Helli_retiered
Volunteer tester
Joined: 15 Dec 99
Posts: 707
Credit: 108,785,585
RAC: 0
Germany
Message 1064937 - Posted: 9 Jan 2011, 12:47:36 UTC - in response to Message 1064934.  
Last modified: 9 Jan 2011, 12:51:04 UTC

Man!

I should've been a novelist instead:

http://www.imdb.com/title/tt0060196/
=
The CPU, The GPU & The coder

:)

Think twice before you click and send the driver on its way!

Regards Vyper



Best Western ever! :-)

Perhaps - 2096 Words - how long did it take? :D

Helli
A loooong time ago: First Credits after SETI@home Restart
ID: 1064937 · Report as offensive
Profile -= Vyper =-
Volunteer tester
Joined: 5 Sep 99
Posts: 1652
Credit: 1,065,191,981
RAC: 2,537
Sweden
Message 1064944 - Posted: 9 Jan 2011, 13:49:04 UTC - in response to Message 1064937.  
Last modified: 9 Jan 2011, 13:50:07 UTC

Don't know man..

Perhaps around half an hour or so. I had it all in my mind.
No. I have not taken drugs :P

Kind regards Vyper

_________________________________________________________________________
Addicted to SETI crunching!
Founder of GPU Users Group
ID: 1064944 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1064945 - Posted: 9 Jan 2011, 13:51:03 UTC

BTW, data transfer speed can be affected by communication retries; PCIe is able to retry a transfer in case of failure.
Does anyone know of a tool that can show the number of these retries, if any?
ID: 1064945 · Report as offensive
Profile Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,903,643
RAC: 0
Netherlands
Message 1064949 - Posted: 9 Jan 2011, 14:14:28 UTC - in response to Message 1064945.  
Last modified: 9 Jan 2011, 14:42:22 UTC

BTW, data transfer speed can be affected by communication retries; PCIe is able to retry a transfer in case of failure.
Does anyone know of a tool that can show the number of these retries, if any?


Well, that is a difficult one, and I haven't seen/found such a tool yet.
Maybe here?
Or here?
Another piece of useful information, but I still do not understand why a mobo with 2 PCI-E x16 slots runs 2 NVIDIA cards in x1 and x2 mode, whereas 2 ATI cards run at 2x x16.
And, equally important, no real difference in crunching speed is noticeable.
Computing times to complete two 0.4 AR MB WUs appear similar on the 480, whether running at x1 (!) or x16 (according to GPU-Z 0.50).

The difference quickly gets bigger when running 3 or 4 at once; at 4 per GPU the times double. That appears to be the tipping point when running in x1 mode, and CPU time increases too!

Oh, getting way off topic again, sorry.
ID: 1064949 · Report as offensive
Dave

Joined: 29 Mar 02
Posts: 778
Credit: 25,001,396
RAC: 0
United Kingdom
Message 1064983 - Posted: 9 Jan 2011, 17:18:10 UTC

Nice story!

Well, I think I'm going to go for all-x16 just to be on the safe side ;).
ID: 1064983 · Report as offensive
Highlander
Joined: 5 Oct 99
Posts: 167
Credit: 37,987,668
RAC: 16
Germany
Message 1064987 - Posted: 9 Jan 2011, 17:31:28 UTC - in response to Message 1064945.  

BTW, data transfer speed can be affected by communication retries; PCIe is able to retry a transfer in case of failure.
Does anyone know of a tool that can show the number of these retries, if any?


Not quite the right thing, but something similar:

http://www.thesycon.de/deu/latency_check.shtml
- Performance is not a simple linear function of the number of CPUs you throw at the problem. -
ID: 1064987 · Report as offensive
-BeNt-
Joined: 17 Oct 99
Posts: 1234
Credit: 10,116,112
RAC: 0
United States
Message 1064990 - Posted: 9 Jan 2011, 17:36:05 UTC
Last modified: 9 Jan 2011, 17:38:09 UTC

@ -= Vyper =- Wow, great post! You just wrote a full short story explaining the difference between speed and bandwidth when you aren't saturating the lanes, along with lane latency and interrupts, lol. At least someone gets it. I'm sure someone will be along later to 'not insult you, just correct you, so to say', merely because they don't grasp or agree with what you're saying. Beautifully done.
Traveling through space at ~67,000mph!
ID: 1064990 · Report as offensive
Profile -= Vyper =-
Volunteer tester
Joined: 5 Sep 99
Posts: 1652
Credit: 1,065,191,981
RAC: 2,537
Sweden
Message 1064992 - Posted: 9 Jan 2011, 17:57:23 UTC - in response to Message 1064990.  

Thank you, thank you.

I'm not sure that I'm 100% correct in what I describe, but at least it gives a rough picture of what is going on when the different parts of your system get involved.
Anything that can be precalculated, or expanded into an easy-to-follow grid/pointer layout, so that the least possible amount of data needs to travel across the slow PCI-E bus, is almost certainly a win-win situation.
The CPU can then do other work, with the GPU not needing to be fed "what now, then?" parameters; and where that is not avoidable, it simply isn't.

Simply put, I presume the system does its best when as much preparation of data and code as possible happens before the transfers to the GPU occur.

I just couldn't stop myself from making something humanly relatable out of what happens inside the computer system, as I was writing.
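To make that concrete, here is a minimal CUDA sketch of the idea (purely illustrative, not taken from any SETI@home application; the block counts, sizes and names are made up): prepare many small parameter blocks on the CPU first, then cross the PCIe bus once with a single copy from pinned memory, instead of paying bus latency and a driver round trip for every little transfer.

```cpp
// Minimal CUDA sketch of "prepare everything first, then transfer once".
// Not taken from any SETI@home app; sizes and names are illustrative only.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>

int main() {
    const int kChunks = 64;          // pretend we have 64 small parameter blocks
    const size_t kChunkBytes = 4096; // 4 KB each
    const size_t kTotal = kChunks * kChunkBytes;

    // Pinned (page-locked) host memory lets the DMA engine copy at full PCIe speed.
    char* h_staging = nullptr;
    cudaHostAlloc((void**)&h_staging, kTotal, cudaHostAllocDefault);

    char* d_buf = nullptr;
    cudaMalloc((void**)&d_buf, kTotal);

    // "Precalculate/expand" everything on the CPU side first...
    for (int i = 0; i < kChunks; ++i)
        memset(h_staging + i * kChunkBytes, i, kChunkBytes);

    // ...then cross the PCIe bus once, instead of 64 tiny cudaMemcpy calls,
    // each of which pays bus latency and a driver round trip.
    cudaMemcpy(d_buf, h_staging, kTotal, cudaMemcpyHostToDevice);

    cudaDeviceSynchronize();
    printf("copied %zu bytes in one transfer\n", kTotal);

    cudaFree(d_buf);
    cudaFreeHost(h_staging);
    return 0;
}
```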

Kind regards Vyper

_________________________________________________________________________
Addicted to SETI crunching!
Founder of GPU Users Group
ID: 1064992 · Report as offensive
Profile zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65709
Credit: 55,293,173
RAC: 49
United States
Message 1065008 - Posted: 9 Jan 2011, 18:49:37 UTC

As long as it works I'm happy. The technical stuff is nice, but not all that important to me.
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 1065008 · Report as offensive
Profile Helli_retiered
Volunteer tester
Joined: 15 Dec 99
Posts: 707
Credit: 108,785,585
RAC: 0
Germany
Message 1065011 - Posted: 9 Jan 2011, 18:56:28 UTC - in response to Message 1065008.  
Last modified: 9 Jan 2011, 18:56:45 UTC

As long as it works I'm happy. The technical stuff is nice, but not all that important to me.


Ditto! LOL

Helli
A loooong time ago: First Credits after SETI@home Restart
ID: 1065011 · Report as offensive
.clair.

Joined: 4 Nov 04
Posts: 1300
Credit: 55,390,408
RAC: 69
United Kingdom
Message 1065130 - Posted: 10 Jan 2011, 0:55:23 UTC

Ah, the simplicity of having only one AGP 8x slot to bother with :)
Just an appinfo.xml away from a big increase in crunching ability (ASUS AH4650 1GB).
The last upgrade to keep my Athlon XP 3000 rig out of landfill.
ID: 1065130 · Report as offensive
Profile Sutaru Tsureku
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1066321 - Posted: 13 Jan 2011, 21:42:45 UTC

Thanks to all!


Maybe someone who has two of the same graphics cards running at different PCIe speeds would like to run a benchmark?
Tools are available on the Lunatics site.


What performance loss is to be expected from a GTX 4xx-5xx at PCIe 1.0 x8 speed with 3 WUs/GPU?
Is it possible to run a benchmark for that here as well?

ID: 1066321 · Report as offensive
Profile Tim Norton
Volunteer tester
Joined: 2 Jun 99
Posts: 835
Credit: 33,540,164
RAC: 0
United Kingdom
Message 1066331 - Posted: 13 Jan 2011, 22:30:05 UTC - in response to Message 1066321.  

I have three machines with identical MBs and paired identical GPUs; two have the same CPU as well. All are set up the same, with 3 WUs per GPU, so 6 GPU "threads" at once.

There is no measurable difference between PCIe slots of different speeds.

On each MB, one slot runs at x8 and one slot at x4.

Read these tests; they have also been reproduced at other sites and are easy to find on Google.

First is PCIe 2.0 x16/x16 vs. x16/x8:

http://www.hardocp.com/article/2010/08/16/sli_cfx_pcie_bandwidth_perf_x16x16_vs_x16x8/1

Second is PCIe 2.0 x16/x16 vs. x8/x8:

http://www.hardocp.com/article/2010/08/23/gtx_480_sli_pcie_bandwidth_perf_x16x16_vs_x8x8/1

Third is PCIe 2.0 x16/x16 vs. x4/x4 (equivalent to x8/x8 on PCIe 1.0):

http://www.hardocp.com/article/2010/08/25/gtx_480_sli_pcie_bandwidth_perf_x16x16_vs_x4x4/

Admittedly these tests are done in SLI/CFX mode with various games, but I think the principle still applies to our various SETI rigs with 2 or more GPU cards, as the games give the card shaders and memory a good workout. They tested at high resolutions, so the amount of data being passed back and forth should, I believe, be significant enough to be comparable with SETI crunching, or more likely exceed it.

Basically the conclusion they came to is that none of the setups, even x4/x4, had any significant effect compared with the x16/x16 settings - i.e. the bus is not getting saturated.

This mirrors my own experience: having 4 cards (x8/x8/x8/x8) in my i7 vs., say, 2 cards (x16/x16) did not show any obvious difference in SETI crunching times for WUs of comparable AR. Similarly, my dual-460 host and my single-460 host post comparable times. Also, if the PCIe bus were a factor in crunching time, I would have thought that at some point overclocking the GPU card would reach a plateau beyond which times did not decrease, due to the limit of the bus bandwidth - again, something I have not experienced.

It may be, however, that if you have an older motherboard with more than two PCIe 1.0 slots you could see an effect, but I do not have any of those, as my motherboards with more than one PCIe slot are all PCIe 2.0.

I was also going to research the effect of overclocking the PCIe bus, but if the bus is not near saturation, as the tests and my own experience suggest, I wonder if it has any noticeable effect.

The biggest thing that will affect your crunching times is the availability of free CPU threads to feed the cards: fully load your CPU with SETI and your GPU times will lengthen considerably; free up a thread or two, depending on the number of GPUs and WUs they are running, and times shorten.
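On that last point, here is a minimal CUDA sketch (generic runtime calls, not the SETI app's actual code) of one reason a CUDA app wants a free CPU core: by default the host thread can spin-wait while the GPU works, so a fully loaded CPU starves it, whereas a blocking sync frees the core at the cost of a little wake-up latency.

```cpp
// Sketch only: trading CPU spin-waiting for a blocking wait in a CUDA host
// process. Generic CUDA runtime usage, not SETI@home code.
#include <cuda_runtime.h>

__global__ void busyKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 0.5f + 1.0f;
}

int main() {
    // Must be set before the CUDA context is created. The default is a
    // spin/yield heuristic; BlockingSync puts the host thread to sleep during
    // cudaDeviceSynchronize(), freeing that core for CPU work units.
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

    const int n = 1 << 20;
    float* d = nullptr;
    cudaMalloc((void**)&d, n * sizeof(float));

    busyKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();   // host thread sleeps here instead of spinning

    cudaFree(d);
    return 0;
}
```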


Tim

ID: 1066331 · Report as offensive
-BeNt-
Joined: 17 Oct 99
Posts: 1234
Credit: 10,116,112
RAC: 0
United States
Message 1066335 - Posted: 13 Jan 2011, 22:39:51 UTC - in response to Message 1066331.  

Basically the conclusion they came to is that none of the setups, even x4/x4, had any significant effect compared with the x16/x16 settings - i.e. the bus is not getting saturated. [...]



Yeah, that's what I assumed from the beginning about the bus not being saturated. Nice to see tests that back up what I was thinking. Thanks for the links!

Traveling through space at ~67,000mph!
ID: 1066335 · Report as offensive
Profile Sutaru Tsureku
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1066461 - Posted: 14 Jan 2011, 3:48:18 UTC - in response to Message 1066331.  
Last modified: 14 Jan 2011, 4:01:46 UTC

Thanks!


I don't know whether the SLI environment is comparable to CUDA crunching.
The cards are connected with SLI cables, aren't they? That is what they communicate over. For CUDA it's not recommended to use these cables (at least not with the old nVIDIA 190.38 driver).


But.. from my experience, the same AR can vary by ~5% in calculation time.
So a benchmark would be helpful.. ;-)


Your PCIe 2.0 x16 slots run at PCIe 2.0 x8 and x4 speed?
Normally (okay, it depends on the chipset) they could run x16/x16.
On my AMD Phenom II X4 940 BE with MSI K9A2 Platinum mobo, the 4 PCIe 2.0 x16 slots run at x16/x16 or x8/x8/x8/x8.


My problem: my old Intel Core2 Extreme QX6700 with Intel D975XBX2 mobo has 3 PCIe 1.0 x16 slots.
If two graphics cards are inserted, PCIe slots #1 and #2 run at PCIe 1.0 x8 speed (like PCIe 2.0 x4); PCIe slot #3 always runs at PCIe 1.0 x4 speed.

If I insert two GTX 2xx cards, only one CUDA app communicates over each PCIe slot.
If I insert two GTX 4xx-5xx cards, (currently) 3 CUDA apps communicate over each PCIe slot.

For example, one GTX 285 has a S@h RAC of ~16,000 (nVIDIA driver 190.38 + stock MB_6.09_cuda23 app).
I worry that a GTX 470-570 would otherwise have a S@h RAC of ~19,000 (maybe with a CUDA x32f app, 3 WUs/GPU), but because of the very slow PCIe speed (3 CUDA apps sharing one PCIe 1.0 x16 slot at x8 speed) would lose ~10% performance, ending up at ~17,000 S@h RAC (or less).


BTW, have a quick look at my profile under 'quick instruction'. I use Fred's nice tool eFMer Priority. It can increase the priority of the CUDA app, so there is no need to leave part of the CPU idle.
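For anyone wondering what a tool like eFMer Priority does under the hood, the Windows call involved is small. A hypothetical sketch follows (this is not eFMer's actual source; the PID is simply taken from the command line) that bumps a process to above-normal priority:

```cpp
// Hypothetical illustration only -- not eFMer Priority's actual source.
// Raises the priority class of a process, given its PID, on Windows.
#include <windows.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }
    DWORD pid = (DWORD)atoi(argv[1]);

    HANDLE h = OpenProcess(PROCESS_SET_INFORMATION, FALSE, pid);
    if (!h) { fprintf(stderr, "OpenProcess failed: %lu\n", GetLastError()); return 1; }

    // ABOVE_NORMAL keeps the GPU feeder ahead of normal-priority CPU tasks
    // without starving the rest of the system the way REALTIME would.
    if (!SetPriorityClass(h, ABOVE_NORMAL_PRIORITY_CLASS))
        fprintf(stderr, "SetPriorityClass failed: %lu\n", GetLastError());

    CloseHandle(h);
    return 0;
}
```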
ID: 1066461 · Report as offensive
Profile Tim Norton
Volunteer tester
Joined: 2 Jun 99
Posts: 835
Credit: 33,540,164
RAC: 0
United Kingdom
Message 1066509 - Posted: 14 Jan 2011, 7:18:02 UTC - in response to Message 1066461.  

My MBs run at x8 and x4 because I have two cards in.

Are you running any CPU apps at the same time as CUDA where you are seeing a difference in crunching speed?

If so, try without CPU tasks for a bit and it may improve the "speeds".

If not, it may be that at PCIe 1.0 the bus can be a factor.

It looks like a 570 is nearer 25k RAC - mine have still to top out, but looking at the credit increase per day (for 2x 570) on one host it's 55k+, and that is on PCIe 2.0.
Tim

ID: 1066509 · Report as offensive
-BeNt-
Joined: 17 Oct 99
Posts: 1234
Credit: 10,116,112
RAC: 0
United States
Message 1066514 - Posted: 14 Jan 2011, 7:54:54 UTC - in response to Message 1066509.  
Last modified: 14 Jan 2011, 7:58:16 UTC

If not, it may be that at PCIe 1.0 the bus can be a factor. [...]


I still doubt that even PCIe 1.0 x4 would cause a significant slowdown, because that is still 250 MB/s per lane across 4 symmetrical lanes, so 1 GB/s in each direction. Of course that isn't accounting for resends and overhead on the bus, but either way I don't think SETI is transferring that amount of data. PCIe 2.0 upped the speed to 500 MB/s per lane, meaning x4 does 2 GB/s in each direction, and obviously double that at x8 and quadruple at x16. I believe it would take a fair amount of data to saturate that, for sure. ;) Of course there are other things to consider, such as the speed of the chipset, the processor, the FSB of the machine, how overloaded the entire system is, etc. At a certain point you need to find a balance of everything to have a nicely tuned system for optimal performance. In flight simulation we call it unification. In SETI I think I have to agree with Tim: at a certain point you simply have to start looking at the CPU for slowdowns in calculation.
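For anyone who wants to put numbers on their own slot, here is a rough sketch in the spirit of NVIDIA's bandwidthTest sample (illustrative only; a real tool averages many runs, tests both directions, and sweeps block sizes): time a large pinned-memory copy with CUDA events and divide bytes by seconds.

```cpp
// Rough host->device bandwidth probe, in the spirit of NVIDIA's bandwidthTest.
// Sketch only; the 64 MB block and repeat count are arbitrary choices.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 64u * 1024 * 1024;   // 64 MB test block

    void *h = nullptr, *d = nullptr;
    cudaHostAlloc(&h, bytes, cudaHostAllocDefault);  // pinned for full-speed DMA
    cudaMalloc(&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice); // warm-up

    const int reps = 20;
    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gbps = (double)bytes * reps / (ms / 1000.0) / 1e9;
    printf("host->device: %.2f GB/s\n", gbps);
    // PCIe 1.0 x4 tops out around 1 GB/s per direction and PCIe 2.0 x4 around
    // 2 GB/s, minus packet/protocol overhead -- which is the point above.

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d); cudaFreeHost(h);
    return 0;
}
```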
Traveling through space at ~67,000mph!
ID: 1066514 · Report as offensive
Profile Lint trap

Joined: 30 May 03
Posts: 871
Credit: 28,092,319
RAC: 0
United States
Message 1066657 - Posted: 14 Jan 2011, 19:24:10 UTC

From sourceforge.net you can download CUDA-Z, a CPU-Z/GPU-Z-style program that presents some details of any CUDA-enabled cards it finds. It has a performance tab.
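The basic facts CUDA-Z reports come straight from the CUDA runtime; here is a minimal sketch of that query (the fields printed are just a small sample of what the full tool shows):

```cpp
// Minimal device query, printing a few of the fields tools like CUDA-Z show.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, i);
        printf("GPU %d: %s\n", i, p.name);
        printf("  compute capability : %d.%d\n", p.major, p.minor);
        printf("  multiprocessors    : %d\n", p.multiProcessorCount);
        printf("  global memory      : %zu MB\n", p.totalGlobalMem >> 20);
        printf("  core / mem clock   : %d MHz / %d MHz\n",
               p.clockRate / 1000, p.memoryClockRate / 1000);
    }
    return 0;
}
```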

Martin
ID: 1066657 · Report as offensive
kittyman · Crowdfunding Project Donor · Special Project $75 donor · Special Project $250 donor
Volunteer tester
Joined: 9 Jul 00
Posts: 51468
Credit: 1,018,363,574
RAC: 1,004
United States
Message 1066789 - Posted: 15 Jan 2011, 2:15:43 UTC

I hope this is not too far off topic, but has anybody been able to verify a performance difference from changing the PCIe bus clock in the BIOS?
I have always locked mine at the standard 100 MHz.
Is there a performance gain from clocking the bus to 105 or 110, should the system handle it?
"Freedom is just Chaos, with better lighting." Alan Dean Foster

ID: 1066789 · Report as offensive