PCIe speed and CUDA performance

Message boards : Number crunching : PCIe speed and CUDA performance
zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65752
Credit: 55,293,173
RAC: 49
United States
Message 1067446 - Posted: 16 Jan 2011, 22:06:28 UTC - in response to Message 1067429.  

The lack of difference in calculations in the first three results shows that the two units are identical, especially for the sake of testing purposes.
The second result shows that even the presence of another PCIe device affects crunching times at the fastest possible speed (related to PCIe speeds or not, it must be noted). The third one shows that desktop handling does affect performance, and by how much.

And, as I said, these two cards are far from saturating the bus, but there is already a difference. It doesn't work alone. If you neglect several factors, each giving you a small performance increase, you get a really significant overall performance loss. If you consider these percentages insignificant, and I'm writing this a second time, don't assume others think the same way. My tests are for those interested in having every percent under their belt, as Sutaru is.
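
As a rough illustration of how several small losses add up, here is a minimal sketch. The individual percentages are made up for the example, not measurements from this thread, and it assumes the factors act independently (multiplicatively).

```python
# Hypothetical per-factor slowdowns as fractions; none of these are measured values.
losses = [0.01, 0.02, 0.03, 0.02]

def combined_loss(fractions):
    """Overall fractional slowdown when each factor scales throughput independently."""
    remaining = 1.0
    for f in fractions:
        remaining *= (1.0 - f)
    return 1.0 - remaining

total = combined_loss(losses)
print(f"combined loss: {total:.1%}")
```

With these made-up inputs, four losses of 1-3% each combine to a bit under 8%.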

You persist in giving irrelevant examples (the way an SSD works, and how the driver stack, cache managers, I/O queue etc. handle it, is different from GPUs, way different).


Ah yeah, I see what you are saying with the workunits; I failed to recognize that until you mentioned it. However, a 1-3% difference is made up of different components, namely the CPU, or maybe the bus from the CPU, etc.

Irrelevant examples? WTF dude, these are SSDs that connect INTO the PCIe bus. Hmm, transferring data over the same bus is irrelevant? You do realize that the stuff that operates an SSD's memory is the same technology that operates all the memory on a video card. The background particulars of how it works, why it works, etc. don't matter. What does matter was showing saturation on the bus, and two of the fastest transfer speeds around can't do it. That's the point that was being made.

And speaking of perfect-world performance: what I've measured with CUDA-Z shows that only about 60% of the bandwidth is actually available. If I were you, I wouldn't play with numbers that easily.
We haven't seen any real numbers from you, IIRC.


Really... insults? I have yet to see any real numbers from you either, for all I know. CUDA-Z shows 1073.64 MB/s just sitting on the desktop. So oh no, I guess I'm only getting 1/8th the speed of my x16! Of course this is while running 2 WUs plus 4 CPUs, while watching an HD movie, with two monitors, the second one running BoincTasks, and I have 10 fingers that typed this... I show the exact same differences work unit to work unit on a non-changing bus speed or CPU speed. You want numbers? Like always, check my stats, they are open. So if I were you, I would make sure your numbers show something more significant than the 1-3% loss that your CPU and chipset are having when feeding your GPU.
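
To put the CUDA-Z figure in context, here is a minimal sketch of the theoretical per-direction link bandwidth for PCIe 1.x and 2.0 (250 and 500 MB/s per lane after 8b/10b encoding). It assumes the card sits in a PCIe 2.0 x16 slot, which the post does not state; the function name and table are illustrative only.

```python
# Usable MB/s per lane, per direction, after 8b/10b encoding overhead.
# PCIe 1.x: 2.5 GT/s -> 250 MB/s; PCIe 2.0: 5 GT/s -> 500 MB/s.
PER_LANE_MB_S = {1: 250.0, 2: 500.0}

def link_bandwidth_mb_s(generation, lanes):
    """Theoretical peak for one direction of a PCIe link."""
    return PER_LANE_MB_S[generation] * lanes

measured = 1073.64  # MB/s, the CUDA-Z reading quoted in the post

for gen in (1, 2):
    for lanes in (1, 4, 8, 16):
        peak = link_bandwidth_mb_s(gen, lanes)
        print(f"gen {gen} x{lanes}: {peak:.0f} MB/s peak, measured/peak = {measured / peak:.2f}")
```

Against a PCIe 2.0 x16 peak of 8000 MB/s, 1073.64 MB/s works out to about 0.13, which is in line with the "1/8th" remark.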

I hate getting personal, but damn dude, I quoted and linked actual facts, as have other people, and used practical examples and logical figures, even using YOUR numbers. And you still fail to see what is being talked about. I'm done because obviously you don't get it. But then again, I suppose everyone who has posted their numbers and experience is simply dumb and ignorant, and you simply think you know it all. Either way, if this can't stay a discussion and be less of an attack, I'm done with it, as what I've been talking about has already been backed up by links and experience from others. You are the only one reporting what you are talking about... but anyways.

Even better, to quote Todd Hebert:

I stand corrected - I didn't think that you could install a 295 in a x4 connected slot. However our application is compact and would run within the gpu/frame buffer and would not be the same as say a game with a high transfer of textures - you would see a performance hit there for sure.

Todd


From this thread, but I guess the guy who holds the record doesn't know what he's talking about either. And Todd does know his stuff. He gets it.

Thank you. I was only correcting an error on Todd's part. But then a 4x slot is not a problem for a GTX295 doing Seti@Home, as long as the 4x slot (electrically) is long enough for a 16x card to fit in it.
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 1067446 · Report as offensive
hbomber
Volunteer tester

Joined: 2 May 01
Posts: 437
Credit: 50,852,854
RAC: 0
Bulgaria
Message 1067450 - Posted: 16 Jan 2011, 22:23:05 UTC
Last modified: 16 Jan 2011, 22:37:14 UTC

And what if you don't get corrected in cases where you wrote incorrect statements?
The cases you are missing are about the GT240's power consumption and the GTX 260's power consumption, and the last is where you also, riding the popular wave, gave advice to run several units simultaneously. In those I corrected you, and ran actual tests to prove the last case. What if I hadn't? People would think a 240 consumes 100+ watts and a 260 consumes 270 watts alone, effectively missing the two most efficient cruncher GPUs, and would lose time and resources running more than one unit on a 470, where it is not needed.
I appreciate your acknowledgment of mistakes, but better not to let them float around at all, because not every time will someone correct you.
Yet another example of a long-floating uncorrected mistake is whether or not to use hyper-threading. And still no one has raised the question of how 4 cores perform depending on whether HT is on or off during the period (yeah, there is a huge difference if you crunch with 4 cores but leave HT on).

I'll say it for the last time: let people assess the numbers according to their environment and desires.

As for 295s, I would not quote Todd in the same way, for two reasons. I haven't observed such a card (and in multi-GPU environments it is very hard to keep track of how a particular card/GPU performs), and I have a suspicion that the 295 is not saturating the bus that hard. It has internal synchronization mechanisms (with NF200) and may get handled by the driver differently, e.g. using packed commands for both GPUs that get dispatched internally, thus utilizing the bus no more than a regular high-end card, or at comparable levels.

World shattering, you say. Well, a few damaged tiles, another "world shattering" number, brought "Columbia" down. For some they are few; for others they were a matter of life. It's a harsh example, but you cannot assess how world shattering something is for other people, even those few dozen seconds.

Your irony about my attitude won't help you prove your statements in any topic, a friendly reminder. My attitude towards you does not affect other people directly, while the stuff about the GTX 260 using 270 watts does.
As long as you continue to teach people what is much and what is little, speaking GENERALLY and posting incorrect information, in several places already, we cannot agree on any topic.
ID: 1067450 · Report as offensive
ML1
Volunteer moderator
Volunteer tester

Joined: 25 Nov 01
Posts: 20291
Credit: 7,508,002
RAC: 20
United Kingdom
Message 1067453 - Posted: 16 Jan 2011, 22:48:23 UTC - in response to Message 1067446.  

... But then a 4x slot is not a problem for a GTX295 doing Seti@Home, as long as the 4x slot (electrically) is long enough for a 16x card to fit in it.

I've seen examples where the endstop on a x4 PCIe connector has been cut out so that you can slot in a x16 card or whatever.

Does that still work ok even for a x1 slot?

(Who needs super expensive motherboards if you can make up an evil GPU cruncher out of x1 slots?!...)

Happy fast crunchin',
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 1067453 · Report as offensive
zoom3+1=4
Volunteer tester
Joined: 30 Nov 03
Posts: 65752
Credit: 55,293,173
RAC: 49
United States
Message 1067455 - Posted: 16 Jan 2011, 22:51:24 UTC - in response to Message 1067453.  

... But then a 4x slot is not a problem for a GTX295 doing Seti@Home, as long as the 4x slot (electrically) is long enough for a 16x card to fit in it.

I've seen examples where the end stop on a x4 PCIe connector has been cut out so that you can slot in a x16 card or whatever.

Does that still work ok even for a x1 slot?

(Who needs super expensive motherboards if you can make up an evil GPU cruncher out of x1 slots?!...)

Happy fast crunchin',
Martin

I have no idea. It should work in theory, but I've never tried that. As long as there isn't anything physically in the way of the 16x card edge, it might work, but I don't know that for sure.
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 1067455 · Report as offensive
-BeNt-
Joined: 17 Oct 99
Posts: 1234
Credit: 10,116,112
RAC: 0
United States
Message 1067462 - Posted: 16 Jan 2011, 23:01:59 UTC - in response to Message 1067450.  

And what if you don't get corrected in cases where you wrote incorrect statements?
The cases you are missing are about the GT240's power consumption and the GTX 260's power consumption, and the last is where you also, riding the popular wave, gave advice to run several units simultaneously. In those I corrected you, and ran actual tests to prove the last case. What if I hadn't? People would think a 240 consumes 100+ watts and a 260 consumes 270 watts alone, effectively missing the two most efficient cruncher GPUs, and would lose time and resources running more than one unit on a 470, where it is not needed.
I appreciate your acknowledgment of mistakes, but better not to let them float around at all, because not every time will someone correct you.


And each time I was quoting wall values instead of card values. Which one is more important, for total power consumption or for picking a power supply, depends on the conversation. And while I was wrong on the topic at hand, I wasn't incorrect. Two different things.


Yet another example of a long-floating uncorrected mistake is whether or not to use hyper-threading. And still no one has raised the question of how 4 cores perform depending on whether HT is on or off during the period (yeah, there is a huge difference if you crunch with 4 cores but leave HT on).


So... like I've always qualified my statement on that: there have been people having issues with HT on, but it affects everyone differently depending on the computer setup. Steve is one person I know who can give insight on that, and I always direct them to him, since he has spent an outstanding amount of time troubleshooting it and figuring it out.


I'll say it for the last time: let people assess the numbers according to their environment and desires.

As for 295s, I would not quote Todd in the same way, for two reasons. I haven't observed such a card (and in multi-GPU environments it is very hard to keep track of how a particular card/GPU performs), and I have a suspicion that the 295 is not saturating the bus that hard. It has internal synchronization mechanisms (with NF200) and may get handled by the driver differently, e.g. using packed commands for both GPUs that get dispatched internally, thus utilizing the bus no more than a regular high-end card, or at comparable levels.


The #1 and #3 computers in the world for Seti run 4x 295s, one on a quad-core machine, the other on a six-core. The crunching times are approximately the same. Both are being run on the 1366 socket (i7, Xeon). The one thing we don't know is what motherboards they are using (Helli is #3, and I know he posts here), so counting on NF200 is a bit of a stretch without knowing, even if your assumption is correct. The one thing that is certain is that the faster cruncher is on XP x64 with 2 more cores and a clock speed faster by about 600-odd MHz.

To bring Todd back into it, his machine ranks #10 as the fastest 480 cruncher, with 3 of them. His times on the 480 are about half the time a 295 takes, but they are pulling the same numbers as his single-480 machine at #11.


World shattering, you say. Well, a few damaged tiles, another "world shattering" number, brought "Columbia" down. For some they are few; for others they were a matter of life. It's a harsh example, but you cannot assess how world shattering something is for other people, even those few dozen seconds.

Your irony about my attitude won't help you prove your statements in any topic, a friendly reminder. My attitude towards you does not affect other people directly, while the stuff about the GTX 260 using 270 watts does.
As long as you continue to teach people what is much and what is little, speaking GENERALLY and posting incorrect information, in several places already, we cannot agree on any topic.


You talk about me bringing up topics that don't apply here, then have the audacity to try to compare crunching numbers with the death of an entire space shuttle crew due to a limited number of tiles!?! We can agree on one topic, and already did earlier in another thread. Your manner of discourse is that you don't want to agree with me (or anyone?) despite the facts you don't agree with.

Traveling through space at ~67,000mph!
ID: 1067462 · Report as offensive
hbomber
Volunteer tester

Joined: 2 May 01
Posts: 437
Credit: 50,852,854
RAC: 0
Bulgaria
Message 1067468 - Posted: 16 Jan 2011, 23:15:00 UTC
Last modified: 16 Jan 2011, 23:21:10 UTC

And you "forgot" to mention that these are wall values, and especially in the case of the 260 it was written "card alone" or something with the same meaning :)

You again wrote incorrect stuff about cards in SLI mode. Just now!
I wasn't about to bring this last thing up here, but you really need harsh examples.
What if no one corrects you, to ask again?
Man, you really need to stop writing on topics where you are not certain of what you know.
ID: 1067468 · Report as offensive
-BeNt-
Joined: 17 Oct 99
Posts: 1234
Credit: 10,116,112
RAC: 0
United States
Message 1067489 - Posted: 17 Jan 2011, 0:15:05 UTC - in response to Message 1067468.  
Last modified: 17 Jan 2011, 0:21:28 UTC

And you "forgot" to mention that these are wall values, and especially in the case of the 260 it was written "card alone" or something with the same meaning :)

You again wrote incorrect stuff about cards in SLI mode. Just now!
I wasn't about to bring this last thing up here, but you really need harsh examples.
What if no one corrects you, to ask again?
Man, you really need to stop writing on topics where you are not certain of what you know.


Wow dude, keep digging, maybe you'll find some test questions I answered wrong in the 3rd grade too! I didn't forget to say wall values; I understood them as card values and didn't read into the article to see the finer details. But it's called conversation, discussion, a forum, not an encyclopedia. I don't need harsh examples. That was my understanding of how SLI works, or used to, until the drivers were updated; that's why I said "my understanding", not "it works this way". Get over yourself. You also fail to mention how I was right directly after that fact.

The bottom line is you want to personally attack me because I don't agree with your statistics or theories, because the facts and numbers point elsewhere on more than just your rig. Hate to tell ya, brother, but your rig isn't the be-all of facts about Seti.
Traveling through space at ~67,000mph!
ID: 1067489 · Report as offensive
j tramer

Joined: 6 Oct 03
Posts: 242
Credit: 5,412,368
RAC: 0
Canada
Message 1067492 - Posted: 17 Jan 2011, 0:32:38 UTC

NVIDIA card facts... look at the pipelines and the fill rates... look at the newest cards at the bottom of the list... no wonder they cost so much and outperform the older cards.

http://www.hardwaresecrets.com/article/NVIDIA-Chips-Comparison-Table/132

:)
ID: 1067492 · Report as offensive
-BeNt-
Joined: 17 Oct 99
Posts: 1234
Credit: 10,116,112
RAC: 0
United States
Message 1067495 - Posted: 17 Jan 2011, 0:35:01 UTC

Yeah that makes me feel really bad about my GTS 250, but makes me feel better about the price I paid for the 480! Nice link indeed.
Traveling through space at ~67,000mph!
ID: 1067495 · Report as offensive
Sutaru Tsureku
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1067503 - Posted: 17 Jan 2011, 0:48:02 UTC

Thanks to all again!


Please use kind language here in my/this thread.
I wouldn't like to see a forum mod close this thread because he needs to act under the forum rules.
Thanks!

ID: 1067503 · Report as offensive
-BeNt-
Joined: 17 Oct 99
Posts: 1234
Credit: 10,116,112
RAC: 0
United States
Message 1067504 - Posted: 17 Jan 2011, 0:54:02 UTC

Indeed, sorry for the bit of offroading that has taken place in your thread Sutaru, I'm done with that segment of the conversation as it has deteriorated into interpersonal attacks, and won't be posting in response unless it feeds the topic of PCIe and CUDA performance. Thanks.
Traveling through space at ~67,000mph!
ID: 1067504 · Report as offensive
j tramer

Joined: 6 Oct 03
Posts: 242
Credit: 5,412,368
RAC: 0
Canada
Message 1067515 - Posted: 17 Jan 2011, 1:33:22 UTC

A buddy of mine sent me to this site... lots of info about everything... but really interesting info about video cards, speeds, bits, fill rates, DirectX... nice comparison rates... a good way to judge price, bang for your buck!!!

:)
ID: 1067515 · Report as offensive
Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 1067529 - Posted: 17 Jan 2011, 2:42:20 UTC - in response to Message 1067111.  

The earlier discussion of speed and bandwidth never went on to discuss latency and other effects which are also important to actual throughput. In the thread http://www.xtremesystems.org/forums/showthread.php?t=225823 there are quite a few screenshots of PCIe peak throughput vs. size of transfer for ATI cards, I'd presume nVidia cards would show similar effects.
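
The throughput-versus-transfer-size effect can be sketched with a very simple model in which every transfer pays a fixed latency plus size divided by peak bandwidth. Both constants below are assumptions picked for illustration (a PCIe 2.0 x16 theoretical peak and a 10 microsecond overhead), not measured values, and real hardware will behave less neatly.

```python
def effective_throughput(size_bytes, peak_bytes_per_s, latency_s):
    """Bytes per second achieved when each transfer costs latency + size/peak."""
    return size_bytes / (latency_s + size_bytes / peak_bytes_per_s)

PEAK = 8e9        # assumed peak bandwidth in bytes/s
LATENCY = 10e-6   # assumed fixed per-transfer overhead in seconds

# Transfer sizes chosen only to span small to large.
for size in (32, 4 * 1024, 512 * 1024, 8 * 1024 * 1024):
    rate = effective_throughput(size, PEAK, LATENCY)
    print(f"{size:>8} bytes: {rate / 1e6:8.1f} MB/s ({rate / PEAK:.1%} of peak)")
```

In this model small transfers are dominated by latency and get only a tiny fraction of peak, while multi-megabyte transfers approach it.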

For MB CUDA work the largest transfer from CPU to GPU is 8 MiB (the baseline smoothed data) followed by a threshold array for pulse finding at nearly 2 MiB IIRC. Both of those are done just at initialization, later there's mainly only parameters passed with calls to kernels involving maybe 16 or 32 bytes. Transfers from GPU to CPU are more numerous. The largest is probably a Power array for spike finding at 128K FFT length, the array is 512 KiB and there might be a few hundred transferred. Basically, mostly what comes back from the GPU is data which may be a candidate for best_spike, best_pulse, or best_gaussian.
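
For a sense of scale, here is a rough tally of the bulk transfers described above. The sizes come from the paragraph; the count of 300 Power arrays is a stand-in for "a few hundred", and the bandwidths are PCIe 2.0 theoretical peaks per direction, so treat the result as an order-of-magnitude sketch that ignores latency and the many small parameter transfers.

```python
KIB = 1024
MIB = 1024 * KIB

# CPU -> GPU at initialization: baseline smoothed data plus pulse threshold array.
to_gpu = 8 * MIB + 2 * MIB

# GPU -> CPU: 512 KiB Power arrays; the count is an assumption.
POWER_ARRAY_BYTES = 512 * KIB
ASSUMED_ARRAY_COUNT = 300
from_gpu = POWER_ARRAY_BYTES * ASSUMED_ARRAY_COUNT

total = to_gpu + from_gpu

def transfer_seconds(num_bytes, bandwidth_bytes_per_s):
    """Time to move num_bytes at a constant bandwidth, ignoring latency."""
    return num_bytes / bandwidth_bytes_per_s

# Assumed link bandwidths in bytes/s (PCIe 2.0 theoretical peaks).
for name, bandwidth in (("x16", 8e9), ("x4", 2e9), ("x1", 0.5e9)):
    print(f"{name}: {total / MIB:.0f} MiB in {transfer_seconds(total, bandwidth):.3f} s")
```

Under these assumptions the bulk data for one task totals about 160 MiB and takes well under a second to move even on an x1 link.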

The PCIe Speed Test v0.1 discussed in that thread was replaced by v0.2, available from http://developer.amd.com/GPU/ATISTREAMPOWERTOY/Pages/default.aspx. Perhaps that cures the tendency to crash at the largest transfer sizes. The test is of course only for ATI GPUs, but obviously something similar could be done in OpenCL or CUDA and may already be available someplace which my brief search didn't find.
                                                               Joe


There is quite a big transfer for Gaussian search (GPU->CPU) in case a Gaussian is found on the GPU: 1M points, each of float4 type.
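
A quick size check for that transfer, as a sketch: it assumes "1M" means 2**20 points and that float4 is the CUDA vector type of four 32-bit floats. The link bandwidths are PCIe 2.0 theoretical peaks, not measurements.

```python
FLOAT_BYTES = 4
FLOAT4_BYTES = 4 * FLOAT_BYTES   # four 32-bit floats per element
POINTS = 1024 * 1024             # assumption: "1M" read as 2**20

gaussian_bytes = POINTS * FLOAT4_BYTES
print(f"{gaussian_bytes / (1024 * 1024):.0f} MiB per Gaussian transfer")

# Assumed link bandwidths in bytes/s (PCIe 2.0 theoretical peaks per direction).
for name, bandwidth in (("x16", 8e9), ("x4", 2e9), ("x1", 0.5e9)):
    print(f"{name}: {gaussian_bytes / bandwidth * 1000:.1f} ms")
```

That makes it 16 MiB per transfer, twice the largest CPU to GPU transfer mentioned in the quoted post.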
ID: 1067529 · Report as offensive
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.