Message boards : Number crunching : PCIe speed and CUDA performance
-= Vyper =- Send message Joined: 5 Sep 99 Posts: 1652 Credit: 1,065,191,981 RAC: 2,537 |
Man! I should've been a novelist instead: http://www.imdb.com/title/tt0060196/ = The CPU, The GPU & The Coder :) Think twice before you click and send the driver on its way! Regards Vyper _________________________________________________________________________ Addicted to SETI crunching! Founder of GPU Users Group |
Helli_retiered Send message Joined: 15 Dec 99 Posts: 707 Credit: 108,785,585 RAC: 0 |
Man! Best Western ever! :-) Perhaps - 2096 Words - how long did it take? :D Helli A loooong time ago: First Credits after SETI@home Restart |
-= Vyper =- Send message Joined: 5 Sep 99 Posts: 1652 Credit: 1,065,191,981 RAC: 2,537 |
|
Raistmer Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121 |
BTW, data transfer speed can be affected by communication retries; PCIe is able to retry a transaction in case of failure. Does anyone know of a tool that can show the number of these retries, if any? |
Fred J. Verster Send message Joined: 21 Apr 04 Posts: 3252 Credit: 31,903,643 RAC: 0 |
BTW, data transfer speed can be affected by communication retries; PCIe is able to retry a transaction in case of failure. Well, that is a difficult one, and I haven't seen/found such a tool yet. Maybe here? Or here? Another piece of useful information, but I still do not understand why a mobo with 2 PCIe x16 slots runs 2 NVIDIA cards in x1 and x2 mode, whereas 2 ATI cards run at x16/x16. And, equally important, no real difference in crunching speed is noticeable. Computing times to complete two 0.4 AR MB WUs appear similar on the 480 when running at x1 (!) or x16 (according to GPU-Z 0.50). The difference gets bigger quickly when running 3 or 4 at once; at 4 per GPU the times double, which appears to be the tipping point when running in x1 mode, and CPU time increases too! Oh, getting way off topic again, sorry. |
Dave Send message Joined: 29 Mar 02 Posts: 778 Credit: 25,001,396 RAC: 0 |
Nice story! Well, I think I'm going to go for all-x16 just to be on the safe side ;). |
Highlander Send message Joined: 5 Oct 99 Posts: 167 Credit: 37,987,668 RAC: 16 |
BTW, data transfer speed can be affected by communication retries; PCIe is able to retry a transaction in case of failure. Not quite the right thing, but something similar: http://www.thesycon.de/deu/latency_check.shtml - Performance is not a simple linear function of the number of CPUs you throw at the problem. - |
-BeNt- Send message Joined: 17 Oct 99 Posts: 1234 Credit: 10,116,112 RAC: 0 |
@ -= Vyper =- Wow, great post! You just wrote a full short story explaining the difference between speed and bandwidth when you aren't saturating the lanes, along with the latency of the lanes and interrupts, lol. At least someone gets it. I'm sure someone will be along later to 'not insult you, just correct you, so to say' merely because they don't grasp or agree with what you're saying. Beautifully done. Traveling through space at ~67,000mph! |
-= Vyper =- Send message Joined: 5 Sep 99 Posts: 1652 Credit: 1,065,191,981 RAC: 2,537 |
Thank you, thank you. I'm not sure that I'm 100% correct in what I describe, but at least it gives a small hint of what is going on in terms of what happens when the different parts of your system get involved. Anything that can be precalculated, or expanded into an easy-to-follow grid/pointer layout, so that the least amount of data possible has to travel over the slow PCIe bus, is almost certainly a win-win situation. The CPU can do other work without the GPU needing to be fed with "what now then?" parameters, and if that is not avoidable, it simply isn't. Simply said, I presume the system does its best if as much preparation of data and code as possible is done before the transfers to the GPU occur. I just couldn't stop myself from making something humanly relatable to what happens inside the computer system at that moment. Kind regards Vyper _________________________________________________________________________ Addicted to SETI crunching! Founder of GPU Users Group |
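To make Vyper's point about minimising traffic over the PCIe bus concrete, here is a small CUDA sketch (illustrative only, not part of the original post) contrasting many tiny host-to-device copies with a single batched copy made after all the data has been prepared on the host. The buffer sizes and names are arbitrary assumptions.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// One large batched copy usually beats many tiny copies, because every
// cudaMemcpy call pays a fixed driver/bus overhead on top of the payload.
int main() {
    const int chunks = 1024;            // illustrative numbers only
    const int chunkFloats = 256;
    const size_t total = (size_t)chunks * chunkFloats * sizeof(float);

    float *h = (float *)malloc(total);
    float *d = nullptr;
    cudaMalloc((void **)&d, total);

    // Naive: one small transfer per chunk (many bus round trips).
    for (int i = 0; i < chunks; ++i)
        cudaMemcpy(d + i * chunkFloats, h + i * chunkFloats,
                   chunkFloats * sizeof(float), cudaMemcpyHostToDevice);

    // Batched: prepare everything host-side first, then one transfer.
    cudaMemcpy(d, h, total, cudaMemcpyHostToDevice);

    cudaFree(d);
    free(h);
    return 0;
}
```

The batched variant issues one transfer instead of 1024, which is roughly what "prepare the data before it crosses the bus" means in practice.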
zoom3+1=4 Send message Joined: 30 Nov 03 Posts: 65709 Credit: 55,293,173 RAC: 49 |
As long as it works I'm happy. The technical stuff is nice, but not all that important to me. The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's |
Helli_retiered Send message Joined: 15 Dec 99 Posts: 707 Credit: 108,785,585 RAC: 0 |
As long as it works I'm happy. The technical stuff is nice, but not all that important to me. Ditto! LOL Helli A loooong time ago: First Credits after SETI@home Restart |
.clair. Send message Joined: 4 Nov 04 Posts: 1300 Credit: 55,390,408 RAC: 69 |
Ah, the simplicity of having only one AGP 8x slot to bother with :) Just an app_info.xml away from a big increase in crunching ability (ASUS AH4650 1GB). The last upgrade to keep my Athlon XP 3000+ rig out of landfill. |
Sutaru Tsureku Send message Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
Thanks to all! Maybe someone who has two of the same graphics cards at different PCIe speeds would like to run a bench-test? Tools are available on the Lunatics site. What performance loss is to be expected with a GTX 4xx-5xx at PCIe 1.0 x8 speed and 3 WUs/GPU? Would it be possible to run a bench-test for that here as well? |
Tim Norton Send message Joined: 2 Jun 99 Posts: 835 Credit: 33,540,164 RAC: 0 |
I have three machines with identical MBs and paired identical GPUs - two have the same CPU as well - all set up the same with 3 WUs per GPU, so 6 GPU "threads" at once. There is no measurable difference between PCIe slots of different speeds; on each MB one slot runs at x8 and one slot at x4. Read these tests - they have also been reproduced at other sites, easy to find on Google. First is PCIe 2.0 x16/x16 vs x16/x8: http://www.hardocp.com/article/2010/08/16/sli_cfx_pcie_bandwidth_perf_x16x16_vs_x16x8/1 Second is PCIe 2.0 x16/x16 vs. x8/x8: http://www.hardocp.com/article/2010/08/23/gtx_480_sli_pcie_bandwidth_perf_x16x16_vs_x8x8/1 Third is PCIe 2.0 x16/x16 vs. x4/x4 (equivalent to x8/x8 on PCIe 1.0): http://www.hardocp.com/article/2010/08/25/gtx_480_sli_pcie_bandwidth_perf_x16x16_vs_x4x4/ Admittedly these tests are done in SLI/CFX mode and tested with various games, but I think the principle still applies to our various SETI rigs with 2 or more GPU cards, as the games give the card shaders and memory a good workout. They tested at high resolutions, so the amount of data being passed back and forth should, I believe, be significant enough to be comparable with SETI crunching, or more likely exceed it. Basically the conclusion they came to is that none of the setups, even x4/x4, had any significant effect compared with the x16/x16 settings - i.e. the bus is not getting saturated. This mirrors my own experience, as having 4 cards (x8/x8/x8/x8) in my i7 vs., say, 2 cards (x16/x16) did not show any obvious difference in SETI crunching times for comparable WU AR. Similarly, where I have dual 460s vs a single 460, host times are comparable as well. Also, if the PCIe bus were a factor in crunching time, I would have thought that at some point overclocking the GPU card would reach a plateau beyond which times did not decrease, due to the limitation of the bus bandwidth - again, something I have not experienced. It may be, however, that if you have an older motherboard with more than two PCIe 1.0 slots you could see an effect, but I do not have any of those, as my motherboards with more than one PCIe slot are all PCIe 2.0. I was also going to research the effect of overclocking the PCIe bus, but if the bus is not near saturation, as the tests and my experience suggest, I wonder if it has any noticeable effect. The biggest thing that will affect your crunching times is the availability of free CPU threads to feed the cards - fully load your CPU with SETI and your GPU times will lengthen considerably; free up a thread or two, depending on the number of GPUs or WUs they are running, and times shorten. Tim |
-BeNt- Send message Joined: 17 Oct 99 Posts: 1234 Credit: 10,116,112 RAC: 0 |
I have three machines with identical MBs and paired identical GPUs - two have the same CPU as well - all set up the same with 3 WUs per GPU, so 6 GPU "threads" at once. Yeah, that's what I assumed from the beginning about the bus not being saturated. Nice to see tests that back up what I was thinking. Thanks for the links! Traveling through space at ~67,000mph! |
Sutaru Tsureku Send message Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5 |
Thanks! I don't know whether the SLI environment is comparable with CUDA crunching. The cards are connected with SLI cables, aren't they? The cards communicate over those. For CUDA it's not recommended to use these cables (at least not with the old nVIDIA 190.38 driver). But.. from my experience the same AR can vary ~5% in calculation time. So a bench-test would be helpful.. ;-) Your PCIe 2.0 x16 slots run @ PCIe 2.0 x8 and x4 speed? Normally (okay, it depends on the chipset) they could run x16/x16. On my AMD Phenom II X4 940 BE with MSI K9A2 Platinum mobo, the 4 PCIe 2.0 x16 slots run x16/x16 or x8/x8/x8/x8. My problem: my old Intel Core2 Extreme QX6700 with Intel D975XBX2 mobo has 3 PCIe 1.0 x16 slots. If two graphics cards are inserted, PCIe slots #1 and #2 run @ PCIe 1.0 x8 speed (like PCIe 2.0 x4). PCIe slot #3 is always @ PCIe 1.0 x4 speed. If I insert two GTX 2xx cards, only one CUDA app communicates over each PCIe slot. If I insert two GTX 4xx-5xx cards, (currently) 3 CUDA apps communicate over each PCIe slot. For example, one GTX 285 has a S@h RAC of ~16,000 (nVIDIA driver 190.38 + stock MB_6.09_cuda23 app). I worry that a GTX 470-570 would have a S@h RAC of ~19,000 (maybe with the CUDA x32f app, 3 WUs/GPU), but because of the very slow PCIe speed (3 CUDA apps sharing one PCIe 1.0 x16 slot at x8 speed) there would be ~10% performance loss, so ~17,000 S@h RAC (or less). BTW, have a small look at my profile under 'quick instruction'. I use Fred's nice tool eFMer Priority. It can increase the priority of the CUDA app, so there is no need to leave part of the CPU idle. |
Tim Norton Send message Joined: 2 Jun 99 Posts: 835 Credit: 33,540,164 RAC: 0 |
My MB slots run at x8 and x4 because I have two cards in. Are you running any CPU apps at the same time as CUDA where you are seeing a difference in crunching speed? If so, try without the CPU apps for a bit and it may improve the "speeds"; if not, it may be that at PCIe 1.0 the bus can be a factor. Looks like a 570 is nearer 25k RAC - mine are still to top out, but looking at credit increase per day (for 2x 570) on one host it's 55k+, though that is on PCIe 2.0. Tim |
-BeNt- Send message Joined: 17 Oct 99 Posts: 1234 Credit: 10,116,112 RAC: 0 |
My MB slots run at x8 and x4 because I have two cards in. I still doubt that even PCIe 1.0 x4 would cause a significant slowdown, because it would still be 250 MB/s per lane over 4 lanes in each direction, so 1 GB/s. Of course that isn't accounting for resends and overhead on the bus; either way, I don't think SETI is transferring that amount of data. PCIe 2.0 upped the speed to 500 MB/s per lane, meaning an x4 link does 2 GB/s in each direction, and obviously double that at x8 and quadruple at x16. I believe it would take a fair amount of data to saturate that, for sure. ;) Of course there are other things to consider, such as the speed of the chipset, processor, FSB of the machine, how overloaded the entire system is, etc. etc. etc. At a certain point you need to find a balance of everything to have a nicely tuned system for optimal performance. In flight simulation we call it unification. In SETI I think I have to agree with Tim: at a certain point you simply have to start looking at the CPU for slowdowns in calculation. Traveling through space at ~67,000mph! |
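For anyone who wants to compare these theoretical figures with what their slot actually delivers, a few lines of CUDA will time a large pinned host-to-device copy. This is only a rough sketch under assumed conditions (the 64 MB transfer size and the variable names are arbitrary), not something from the original thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Rough effective PCIe bandwidth check: time one large pinned host->device copy.
int main() {
    const size_t bytes = 64 * 1024 * 1024;        // 64 MB, illustrative size

    float *hPinned = nullptr, *dBuf = nullptr;
    cudaMallocHost((void **)&hPinned, bytes);     // pinned memory gives peak transfer rates
    cudaMalloc((void **)&dBuf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dBuf, hPinned, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host->Device: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dBuf);
    cudaFreeHost(hPinned);
    return 0;
}
```

If the measured figure sits well below the theoretical x4/x8/x16 numbers quoted above, retries or chipset overhead are eating into the link; if crunching times still don't change between slots, the bus simply isn't the bottleneck.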
Lint trap Send message Joined: 30 May 03 Posts: 871 Credit: 28,092,319 RAC: 0 |
From sourceforge.net you can download "CUDA-Z", a CPU-Z/GPU-Z type program which presents some details of any CUDA-enabled cards it finds. It has a performance tab. Martin |
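In the same spirit as CUDA-Z, the CUDA runtime API itself can list the installed cards and the PCIe location each one sits at. The short sketch below is illustrative only and not part of the original post.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// List each CUDA device with its PCIe bus location, similar in spirit to CUDA-Z.
int main() {
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int i = 0; i < count; ++i) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, i);
        printf("Device %d: %s  (PCIe bus %02x, device %02x)  %zu MB VRAM\n",
               i, p.name, p.pciBusID, p.pciDeviceID, p.totalGlobalMem >> 20);
    }
    return 0;
}
```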
kittyman Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004 |
I hope this is not too far off topic, but has anybody been able to verify a performance difference by changing the PCIe bus clock in the BIOS? I have always locked mine at the standard 100 MHz. Is there a performance gain from clocking the bus to 105 or 110, should the system handle it? "Freedom is just Chaos, with better lighting." Alan Dean Foster |