PCIe speed and CUDA performance


-= Vyper =- (Project donor)
Volunteer tester
Joined: 5 Sep 99
Posts: 1091
Credit: 326,022,536
RAC: 83,574
Sweden
Message 1064934 - Posted: 9 Jan 2011, 12:17:45 UTC
Last modified: 9 Jan 2011, 12:18:42 UTC

Man!

I should've been a novelist instead:

http://www.imdb.com/title/tt0060196/
=
The CPU, The GPU & The coder

:)

Think twice before you click and send the driver on its way!

Regards Vyper
____________

_________________________________________________________________________
Addicted to SETI crunching!
Founder of GPU Users Group

Helli (Project donor)
Volunteer tester
Joined: 15 Dec 99
Posts: 704
Credit: 91,903,170
RAC: 30,574
Germany
Message 1064937 - Posted: 9 Jan 2011, 12:47:36 UTC - in response to Message 1064934.
Last modified: 9 Jan 2011, 12:51:04 UTC

Man!

I should've been a novelist instead:

http://www.imdb.com/title/tt0060196/
=
The CPU, The GPU & The coder

:)

Think twice before you click and send the driver on its way!

Regards Vyper



Best Western ever! :-)

Perhaps 2,096 words - how long did it take? :D

Helli
____________
A loooong time ago: My first Credits

-= Vyper =- (Project donor)
Volunteer tester
Joined: 5 Sep 99
Posts: 1091
Credit: 326,022,536
RAC: 83,574
Sweden
Message 1064944 - Posted: 9 Jan 2011, 13:49:04 UTC - in response to Message 1064937.
Last modified: 9 Jan 2011, 13:50:07 UTC

Don't know, man...

Perhaps half an hour or so - I had it all in my mind.
No, I have not taken drugs :P

Kind regards Vyper
____________

_________________________________________________________________________
Addicted to SETI crunching!
Founder of GPU Users Group

Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 3588
Credit: 48,762,665
RAC: 20,864
Russia
Message 1064945 - Posted: 9 Jan 2011, 13:51:03 UTC

BTW, data transfer speed can be affected by communication retries; PCIe is able to retry a transfer in case of failure.
Does anyone know of a tool that can show these retries (their number), if any?

Fred J. Verster
Volunteer tester
Joined: 21 Apr 04
Posts: 3252
Credit: 31,902,797
RAC: 257
Netherlands
Message 1064949 - Posted: 9 Jan 2011, 14:14:28 UTC - in response to Message 1064945.
Last modified: 9 Jan 2011, 14:42:22 UTC

BTW, data transfer speed can be affected by communication retries; PCIe is able to retry a transfer in case of failure.
Does anyone know of a tool that can show these retries (their number), if any?


Well, that is a difficult one, and I haven't seen/found such a tool yet.
Maybe here?
Or here?
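One thing that comes close, at least on Linux: newer kernels expose the PCIe AER (Advanced Error Reporting) correctable-error counters per device in sysfs, and the bad-TLP/bad-DLLP counts there are what force link-level replays. A rough C++ sketch, assuming your kernel actually provides the aer_dev_* attributes:

    // dump_aer.cpp - print PCIe AER correctable-error counters for every device.
    // Assumes a Linux kernel new enough to expose
    // /sys/bus/pci/devices/<BDF>/aer_dev_correctable (and that AER is enabled).
    #include <filesystem>
    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        namespace fs = std::filesystem;
        for (const auto& dev : fs::directory_iterator("/sys/bus/pci/devices")) {
            fs::path counters = dev.path() / "aer_dev_correctable";
            if (!fs::exists(counters)) continue;      // no AER info for this device
            std::cout << dev.path().filename().string() << ":\n";
            std::ifstream in(counters);
            std::string line;
            while (std::getline(in, line))
                std::cout << "  " << line << "\n";    // e.g. "BadTLP 0", "BadDLLP 0"
        }
        return 0;
    }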
Another piece of useful information, but I still do not understand why a mobo with 2 PCI-E x16 slots runs 2 NVIDIA cards in x1 and x2 mode, whereas 2 ATI cards run at x16/x16?
And, equally important, no real difference in crunching speed is noticeable.
Computing times to complete two 0.4 AR MB WUs appear similar on the 480, whether running at x1 (!) or x16 (according to GPU-Z 0.50).

The difference quickly gets greater when running 3 or 4 at once; at 4 per GPU, times double - that appears to be the tipping point when running in x1 mode, and CPU time increases, too!

Oh, getting way off topic again, sorry.
____________

Dave
Joined: 29 Mar 02
Posts: 774
Credit: 23,193,139
RAC: 0
United Kingdom
Message 1064983 - Posted: 9 Jan 2011, 17:18:10 UTC

Nice story!

Well, I think I'm going to go for all-x16 just to be on the safe side ;).

Highlander
Joined: 5 Oct 99
Posts: 154
Credit: 31,789,156
RAC: 6,026
Germany
Message 1064987 - Posted: 9 Jan 2011, 17:31:28 UTC - in response to Message 1064945.

BTW, data transfer speed can be affected by communication retries; PCIe is able to retry a transfer in case of failure.
Does anyone know of a tool that can show these retries (their number), if any?


Not quite the right thing, but something similar:

http://www.thesycon.de/deu/latency_check.shtml
____________

-BeNt-
Joined: 17 Oct 99
Posts: 1234
Credit: 10,116,112
RAC: 0
United States
Message 1064990 - Posted: 9 Jan 2011, 17:36:05 UTC
Last modified: 9 Jan 2011, 17:38:09 UTC

@ -= Vyper =- wow, great post! You just wrote a full short story explaining the difference between speed and bandwidth when you aren't saturating the lanes, along with lane latency and interrupts, lol. At least someone gets it. I'm sure someone will be along later to 'not insult you, just correct you, so to say', merely because they don't grasp or agree with what you're saying. Beautifully done.
____________
Traveling through space at ~67,000mph!

-= Vyper =- (Project donor)
Volunteer tester
Joined: 5 Sep 99
Posts: 1091
Credit: 326,022,536
RAC: 83,574
Sweden
Message 1064992 - Posted: 9 Jan 2011, 17:57:23 UTC - in response to Message 1064990.

Thank you, thank you.

I'm not sure I'm 100% correct in what I describe, but at least it gives a rough idea of what is going on in terms of what happens when the different parts of your system get involved.
Anything that can be precalculated, or expanded into an easy-to-follow grid/pointer layout, so that the least possible amount of data has to travel over the slow PCI-E bus, is almost certainly a win-win.
The CPU can do other work, and the GPU doesn't need to be fed "what now then?" parameters - and where that isn't avoidable, it simply isn't.

Simply put, I presume the system does its best when as much preparation of the data and code as possible is done before the transfers to the GPU occur.
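Roughly what I mean, sketched in CUDA (this isn't the actual S@H code, just an illustration of keeping the data resident on the GPU so the PCI-E bus is only crossed once per chunk):

    // keep_on_gpu.cu - illustration only: cross the PCI-E bus once per chunk,
    // then let the GPU iterate on data that stays resident in its own memory.
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <vector>

    __global__ void crunch(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = data[i] * 1.000001f + 0.5f;   // stand-in for the real work
    }

    int main() {
        const int n = 1 << 20;
        std::vector<float> host(n, 1.0f);
        float* dev = nullptr;
        cudaMalloc(&dev, n * sizeof(float));

        // One host->device transfer for the whole chunk...
        cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

        // ...then many kernel launches that never touch the bus. The CPU is free
        // to do other work here; it is not feeding "what now then?" parameters.
        for (int pass = 0; pass < 1000; ++pass)
            crunch<<<(n + 255) / 256, 256>>>(dev, n);

        // One device->host transfer for the result.
        cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dev);
        std::printf("first element after 1000 passes: %f\n", host[0]);
        return 0;
    }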

I just couldn't stop myself from making something humanly relatable out of what happens inside the computer system at the time of writing.

Kind regards Vyper
____________

_________________________________________________________________________
Addicted to SETI crunching!
Founder of GPU Users Group

zoom314 (Project donor)
Joined: 30 Nov 03
Posts: 46800
Credit: 37,000,373
RAC: 2,765
United States
Message 1065008 - Posted: 9 Jan 2011, 18:49:37 UTC

As long as it works I'm happy. The technical stuff is nice, but not all that important to me.
____________
My Facebook, War Commander, 2015

Helli (Project donor)
Volunteer tester
Joined: 15 Dec 99
Posts: 704
Credit: 91,903,170
RAC: 30,574
Germany
Message 1065011 - Posted: 9 Jan 2011, 18:56:28 UTC - in response to Message 1065008.
Last modified: 9 Jan 2011, 18:56:45 UTC

As long as it works I'm happy. The technical stuff is nice, but not all that important to me.


Ditto! LOL

Helli
____________
A loooong time ago: My first Credits

.clair.
Volunteer moderator
Joined: 4 Nov 04
Posts: 1300
Credit: 23,080,640
RAC: 552
United Kingdom
Message 1065130 - Posted: 10 Jan 2011, 0:55:23 UTC

Ah, the simplicity of having only one AGP 8x slot to bother with :)
Just an app_info.xml away from a big increase in crunching ability (Asus AH 4650 1GB).
The last upgrade to keep my Athlon XP 3000 rig out of the landfill.

[seti.international] Dirk Sadowski
Volunteer tester
Joined: 6 Apr 07
Posts: 7115
Credit: 61,260,653
RAC: 5,504
Germany
Message 1066321 - Posted: 13 Jan 2011, 21:42:45 UTC

Thanks to all!


Maybe someone who has two identical graphics cards at different PCIe speeds would like to run a bench test?
Tools are available on the Lunatics site.


What performance loss is to be expected with a GTX 4xx-5xx at PCIe 1.0 x8 speed and 3 WUs/GPU?
Would it be possible to run a bench test for that as well?

____________
BR

SETI@home Needs your Help ... $10 & U get a Star!

Team seti.international

Das Deutsche Cafe. The German Cafe.

Tim Norton
Volunteer tester
Joined: 2 Jun 99
Posts: 835
Credit: 33,540,164
RAC: 0
United Kingdom
Message 1066331 - Posted: 13 Jan 2011, 22:30:05 UTC - in response to Message 1066321.

I have three machines with identical motherboards and paired identical GPUs - two have the same CPU as well - all set up the same with 3 WUs per GPU, so 6 GPU "threads" at once.

There is no measurable difference between PCIe slots of different speeds.

On each motherboard one slot runs at x8 and the other at x4.

Read these tests; they have also been reproduced at other sites - easy to find on Google.

First is PCIe 2.0 x16/x16 vs x16/x8

http://www.hardocp.com/article/2010/08/16/sli_cfx_pcie_bandwidth_perf_x16x16_vs_x16x8/1

Second is PCIe 2.0 x16/x16 vs. x8/x8

http://www.hardocp.com/article/2010/08/23/gtx_480_sli_pcie_bandwidth_perf_x16x16_vs_x8x8/1

Third is PCIe 2.0 x16/x16 vs. x4/x4 (equivalent to x8/x8 on PCIe 1.0)

http://www.hardocp.com/article/2010/08/25/gtx_480_sli_pcie_bandwidth_perf_x16x16_vs_x4x4/

Admittedly these tests are done in SLI/CFX mode with various games, but I think the principle still applies to our various SETI rigs with 2 or more GPU cards, as the games give the card shaders and memory a good workout. They tested at high resolutions, so the amount of data being passed back and forth should, I believe, be comparable to SETI crunching or more likely exceed it.

Basically the conclusion they came to is that none of the setups, even x4/x4, showed any significant effect compared with the x16/x16 settings - i.e. the bus is not getting saturated.
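If anyone wants to sanity-check this on their own box, NVIDIA ship a bandwidthTest sample with the CUDA SDK, or a rough sketch along these lines will print the host-to-device rate you actually get - compare it with the theoretical figure for your slot:

    // pcie_bw.cu - rough host->device bandwidth check (pinned memory, 64 MB blocks).
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        const size_t bytes = 64ull << 20;            // 64 MB per transfer
        void *host = nullptr, *dev = nullptr;
        cudaMallocHost(&host, bytes);                // pinned, so DMA runs at full speed
        cudaMalloc(&dev, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        const int reps = 20;
        cudaEventRecord(start);
        for (int i = 0; i < reps; ++i)
            cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        double mb = (double)bytes * reps / (1024.0 * 1024.0);
        std::printf("host->device: %.0f MB/s\n", mb / (ms / 1000.0));

        cudaEventDestroy(start); cudaEventDestroy(stop);
        cudaFree(dev); cudaFreeHost(host);
        return 0;
    }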

This mirrors my own experience: having 4 cards (x8/x8/x8/x8) in my i7 vs., say, 2 cards (x16/x16) did not show any obvious difference in SETI crunching times for comparable WU ARs. Similarly, times on my host with dual 460s are comparable to the one with a single 460. Also, if the PCIe bus were a factor in crunching time, I would have expected that at some point overclocking the GPU would hit a plateau beyond which times no longer decrease, due to the limit of bus bandwidth - again, something I have not experienced.

It may be, however, that with an older motherboard with more than two PCIe 1.0 slots you could see an effect, but I don't have any of those; my motherboards with more than one PCIe slot are all PCIe 2.0.

I was also going to look into the effects of overclocking the PCIe bus, but if the bus is nowhere near saturation, as the tests and my experience suggest, I wonder whether it would have any noticeable effect.

The biggest thing that will affect your crunching times is the availability of free CPU threads to feed the cards - fully load your CPU with SETI and your GPU times will lengthen considerably; free up a thread or two (depending on the number of GPUs and the WUs they are running) and times shorten.


____________
Tim

-BeNt-
Joined: 17 Oct 99
Posts: 1234
Credit: 10,116,112
RAC: 0
United States
Message 1066335 - Posted: 13 Jan 2011, 22:39:51 UTC - in response to Message 1066331.

Basically the conclusion they came to is that none of the setups, even x4/x4, showed any significant effect compared with the x16/x16 settings - i.e. the bus is not getting saturated.

Yeah, that's what I assumed from the beginning about the bus not being saturated. Nice to see tests that back up what I was thinking. Thanks for the links!

____________
Traveling through space at ~67,000mph!

[seti.international] Dirk Sadowski
Volunteer tester
Joined: 6 Apr 07
Posts: 7115
Credit: 61,260,653
RAC: 5,504
Germany
Message 1066461 - Posted: 14 Jan 2011, 3:48:18 UTC - in response to Message 1066331.
Last modified: 14 Jan 2011, 4:01:46 UTC

Thanks!


I don't know whether the SLI environment is comparable to CUDA crunching.
The cards are connected with SLI bridges, aren't they? The cards communicate over those. For CUDA it's not recommended to use these bridges (at least not with the old nVIDIA 190.38 driver).


But... in my experience, the same AR can vary by ~5% in calculation time.
So a bench test would be helpful... ;-)


Your PCIe 2.0 x16 slots run @ PCIe 2.0 x8 and x4 speed?
Normally (okay, it depends on the chipset) they could run x16/x16.
On my AMD Phenom II X4 940 BE with MSI K9A2 Platinum mobo, the 4 PCIe 2.0 x16 slots run x16/x16 or x8/x8/x8/x8.


My problem: my old Intel Core2 Extreme QX6700 with Intel D975XBX2 mobo has 3 PCIe 1.0 x16 slots.
With two graphics cards inserted, PCIe slots #1 and #2 run @ PCIe 1.0 x8 speed (like PCIe 2.0 x4). PCIe slot #3 always runs @ PCIe 1.0 x4 speed.

If I insert two GTX 2xx cards, only one CUDA app communicates over each PCIe slot.
If I insert two GTX 4xx-5xx cards, (currently) 3 CUDA apps communicate over each PCIe slot.

For example, one GTX 285 has a S@h RAC of ~16,000 (nVIDIA driver 190.38 + stock MB_6.09_cuda23 app).
I worry that a GTX 470-570 would have a S@h RAC of ~19,000 (maybe with the CUDA x32f app, 3 WUs/GPU), but because of the very slow PCIe speed (3 CUDA apps sharing one PCIe 1.0 x16 slot at x8 speed) there would be ~10% performance loss, so ~17,000 S@h RAC (or less).


BTW, have a quick look at my profile under 'quick instruction'. I use Fred's nice tool eFMer Priority. It can increase the priority of the CUDA app, so there's no need to leave part of the CPU idle.
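For the curious: raising a process's priority on Windows only takes a couple of Win32 calls. A rough sketch of the idea (not eFMer's actual code):

    // boost_priority.cpp - rough idea only, not eFMer Priority's actual code.
    // Raises the priority class of the process whose PID is given on the command line.
    #include <windows.h>
    #include <cstdio>
    #include <cstdlib>

    int main(int argc, char** argv) {
        if (argc < 2) { std::fprintf(stderr, "usage: boost_priority <pid>\n"); return 1; }
        DWORD pid = static_cast<DWORD>(std::strtoul(argv[1], nullptr, 10));
        HANDLE h = OpenProcess(PROCESS_SET_INFORMATION, FALSE, pid);
        if (!h) { std::fprintf(stderr, "OpenProcess failed (%lu)\n", GetLastError()); return 1; }
        // ABOVE_NORMAL keeps the CUDA feeder ahead of the CPU crunchers without starving them.
        if (!SetPriorityClass(h, ABOVE_NORMAL_PRIORITY_CLASS))
            std::fprintf(stderr, "SetPriorityClass failed (%lu)\n", GetLastError());
        CloseHandle(h);
        return 0;
    }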
____________
BR

SETI@home Needs your Help ... $10 & U get a Star!

Team seti.international

Das Deutsche Cafe. The German Cafe.

Tim Norton
Volunteer tester
Joined: 2 Jun 99
Posts: 835
Credit: 33,540,164
RAC: 0
United Kingdom
Message 1066509 - Posted: 14 Jan 2011, 7:18:02 UTC - in response to Message 1066461.

My motherboard runs at x8 and x4 because I have two cards in.

Are you running any CPU apps at the same time as CUDA where you are seeing a difference in crunching speed?

If so, try without CPU tasks for a bit and it may improve the "speeds".

If not, it may be that at PCIe 1.0 the bus can be a factor.

Looks like a 570 is nearer 25k RAC - mine have still to top out, but looking at the credit increase per day (for 2x 570) on one host it's 55k+, though that is on PCIe 2.0.
____________
Tim

-BeNt-
Joined: 17 Oct 99
Posts: 1234
Credit: 10,116,112
RAC: 0
United States
Message 1066514 - Posted: 14 Jan 2011, 7:54:54 UTC - in response to Message 1066509.
Last modified: 14 Jan 2011, 7:58:16 UTC

My motherboard runs at x8 and x4 because I have two cards in.

Are you running any CPU apps at the same time as CUDA where you are seeing a difference in crunching speed?

If so, try without CPU tasks for a bit and it may improve the "speeds".

If not, it may be that at PCIe 1.0 the bus can be a factor.

Looks like a 570 is nearer 25k RAC - mine have still to top out, but looking at the credit increase per day (for 2x 570) on one host it's 55k+, though that is on PCIe 2.0.


I still doubt that even PCIe 1.0 x4 would cause a significant slowdown, because that's still 250 MB/s per lane with 4 symmetrical lanes, so 1 GB/s. Of course that isn't accounting for resends and overhead on the bus; either way I don't think SETI is transferring that amount of data. PCIe 2.0 upped the speed to 500 MB/s per lane, meaning x4 does 2 GB/s in each direction, double that at x8 and quadruple at x16. It would take a fair amount of data to saturate that, for sure. ;) Of course there are other things to consider, such as the speed of the chipset, the processor, the FSB of the machine, how overloaded the entire system is, and so on. At a certain point you need to find a balance of everything to have a nicely tuned system for optimal performance. In flight simulation we call it unification. In SETI I think I have to agree with Tim: at a certain point you simply have to start looking at the CPU for slowdowns in calculation.
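Back-of-the-envelope, if anyone wants to play with the numbers (per-lane figures are the usual effective rates after 8b/10b encoding, per direction):

    // pcie_math.cpp - theoretical per-direction PCIe bandwidth by generation and lane count.
    #include <cstdio>

    int main() {
        const double per_lane_mb[] = {250.0, 500.0};   // PCIe 1.x, PCIe 2.0 (MB/s per lane)
        const int lanes[] = {1, 4, 8, 16};
        for (int g = 0; g < 2; ++g)
            for (int l : lanes)
                std::printf("PCIe %d.0 x%-2d : %6.1f MB/s per direction\n",
                            g + 1, l, per_lane_mb[g] * l);
        return 0;
    }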
____________
Traveling through space at ~67,000mph!

Lint trap (Project donor)
Joined: 30 May 03
Posts: 871
Credit: 27,840,272
RAC: 8,392
United States
Message 1066657 - Posted: 14 Jan 2011, 19:24:10 UTC

From sourceforge.net you can download "CUDA-Z", a CPU-Z/GPU-Z type program which presents some details of any CUDA-enabled cards it finds. It has a performance tab.
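If you'd rather query the same basics yourself, the CUDA runtime exposes them directly; a minimal sketch (assuming the CUDA toolkit is installed):

    // list_cuda_devices.cu - print a few of the properties tools like CUDA-Z display.
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        int count = 0;
        if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
            std::printf("no CUDA devices found\n");
            return 0;
        }
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp p;
            cudaGetDeviceProperties(&p, i);
            std::printf("GPU %d: %s, %d multiprocessors, %.0f MB memory, "
                        "compute capability %d.%d\n",
                        i, p.name, p.multiProcessorCount,
                        p.totalGlobalMem / (1024.0 * 1024.0), p.major, p.minor);
        }
        return 0;
    }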

Martin
