PCIe speed and CUDA performance

Profile Sutaru Tsureku
Volunteer tester

Send message
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1064418 - Posted: 7 Jan 2011, 21:25:35 UTC
Last modified: 7 Jan 2011, 21:28:01 UTC

Hello community!


About two years ago, when SETI@home published the nVIDIA CUDA app, one member ran a test on his PC.
He ran his GPU (IIRC a 9xxx-series card) at PCIe 1.0 x16, x8 and x4 speed (which corresponds to PCIe 2.0 x8, x4 and x2).
At PCIe 1.0 x8 he saw a 3 % performance loss, and at PCIe 1.0 x4 a 10 % loss.

I can't find the thread, because the forum search has a one-year limit, and I can't remember that member's nick.


The WU isn't only sent to the GPU over the PCIe slot for calculation; during the whole calculation time the WU/app gets support from the rest of the system - over the PCIe slot.

In the past I changed the system RAM of my 940 BE from DDR2 800/5-5-5-18 to DDR2 1066/5-5-5-18 and saw a ~ 2 % performance gain. With an OCed GTX260-216, which has ~ 15,000 S@h-RAC, that means about + 300 S@h-RAC per GPU. My whole system has 4 GPUs -> about + 1,200 S@h-RAC.
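
As a rough sketch of that arithmetic - assuming RAC scales linearly with GPU throughput, and using only the figures quoted above - in Python:

# Back-of-envelope RAC arithmetic, assuming RAC scales linearly with GPU throughput.
# The numbers are the ones quoted in this post, not new measurements.
rac_per_gpu = 15_000      # S@h-RAC of one OCed GTX260-216
speedup     = 0.02        # ~ 2 % gain seen after the DDR2 800 -> 1066 swap
gpus        = 4           # GPUs in the host

gain_per_gpu = rac_per_gpu * speedup
print(f"per GPU : +{gain_per_gpu:.0f} RAC")          # ~ +300
print(f"system  : +{gain_per_gpu * gpus:.0f} RAC")   # ~ +1200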


What would happen if I took out two GPUs, so that the two remaining GPUs ran at PCIe 2.0 x16? (Currently they all run at PCIe 2.0 x8.)
Would they gain maybe ~ 3 % performance?


And what about members with GTX4xx/5xx cards who run 2+ WUs per GPU simultaneously (is the PCIe slot overloaded / a bottleneck)?
If the GPU is at PCIe 2.0 x8 speed with one WU per GPU - fine.
With 2 WUs per GPU, would it behave as if the GPU were at PCIe 2.0 x4 - roughly a 3 % loss?
With 3 WUs per GPU, as if at PCIe 2.0 x2 - roughly a 10 % loss?
So maybe it would be best to run GTX4xx/5xx cards at full PCIe 2.0 x16 speed when running 2+ WUs per GPU?
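
That question assumes the link is shared evenly between concurrent WUs, so each task effectively sees width/N. A tiny sketch of that assumption (a simplification, not a measurement):

# Sketch of the assumption above: N concurrent WUs share the link evenly,
# so each one effectively sees width/N lanes. A simplification, not a measurement.
def effective_width(link_width_lanes, concurrent_wus):
    return link_width_lanes / concurrent_wus

for wus in (1, 2, 3):
    print(f"{wus} WU(s) on a PCIe 2.0 x8 link -> ~x{effective_width(8, wus):.1f} per WU")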


BTW, are there mobos out there with 4x PCIe 2.0 x16 slots that still run at full speed with 4 GPUs inserted?

BTW, are there graphics cards/mobos out there with PCIe 3.0 ports/slots?


Thanks!
ID: 1064418 · Report as offensive
_heinz
Volunteer tester

Send message
Joined: 25 Feb 05
Posts: 744
Credit: 5,539,270
RAC: 0
France
Message 1064424 - Posted: 7 Jan 2011, 21:32:36 UTC

Hi Sutaru,
BTW, are there mobos out there with 4x PCIe 2.0 x16 slots that still run at full speed with 4 GPUs inserted?

Intel D5400XS, 4x PCIe 2.0 x16, all four slots are X16

but there are others too.

heinz
ID: 1064424 · Report as offensive
-BeNt-
Avatar

Send message
Joined: 17 Oct 99
Posts: 1234
Credit: 10,116,112
RAC: 0
United States
Message 1064426 - Posted: 7 Jan 2011, 21:35:23 UTC - in response to Message 1064424.  
Last modified: 7 Jan 2011, 21:43:50 UTC

Hi Sutaru,
BTW, are there mobos out there with 4x PCIe 2.0 x16 slots that still run at full speed with 4 GPUs inserted?

Intel D5400XS, 4x PCIe 2.0 x16, all four slots are X16

but there are others too.

heinz


Yes, there are a number of motherboards that support 4x PCIe x16 now - the Asus Rampage Formula, eVGA Classified, and the Asus P6T6 Supercomputer boards, to name a few. As far as crunching differences between x8 and x16, I don't see where it would make a difference in isolation. The bandwidth of a single PCIe lane is much larger than the size of a WU. However, when you speed up the bus of your system the entire system is faster, so it will, in the end, make a difference.

As far as PCIe 3.0 goes, the last I heard the standard had been delayed and the specs wouldn't be available until 2011, and I have yet to see any motherboards with that specification on board.

PCIe 1.0 - 250 MB/s per lane; x16 ~ 4 GB/s per direction
PCIe 2.0 - 500 MB/s per lane; x16 ~ 8 GB/s per direction (~16 GB/s both directions)
PCIe 3.0 - ~1 GB/s per lane; x16 ~ 16 GB/s per direction (~32 GB/s both directions)
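
Multiplying those per-lane figures out for the common slot widths (a quick sketch in Python, per direction, decimal units):

# Per-direction bandwidth for common slot widths, from the per-lane figures above.
per_lane_mb_s = {"PCIe 1.0": 250, "PCIe 2.0": 500, "PCIe 3.0": 1000}

for gen, mb in per_lane_mb_s.items():
    for width in (4, 8, 16):
        print(f"{gen} x{width:<2}: ~{mb * width / 1000:.1f} GB/s per direction")
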
Traveling through space at ~67,000mph!
ID: 1064426 · Report as offensive
Profile Will Malven
Avatar

Send message
Joined: 2 Jun 99
Posts: 52
Credit: 4,441,977
RAC: 0
United States
Message 1064427 - Posted: 7 Jan 2011, 21:38:29 UTC

Yes, there are 4-way SLI boards out there, but if you're looking for an AMD-compatible one, I think you are out of luck.

There are a number of them for Intel processors. EVGA has the EVGA X58 Classified (model no. 170-BL-E762-A1).


Man's future lies in the stars, not on Earth. It is each successive generation's responsibility to humanity to expand the knowledge and understanding of our Universe so that we may one day venture forth to meet our neighbors.

Houston, Texas
ID: 1064427 · Report as offensive
Profile Sutaru Tsureku
Volunteer tester

Send message
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1064428 - Posted: 7 Jan 2011, 21:38:38 UTC - in response to Message 1064424.  
Last modified: 7 Jan 2011, 21:39:32 UTC

Hi Sutaru,
BTW, are there mobos out there with 4x PCIe 2.0 x16 slots that still run at full speed with 4 GPUs inserted?

Intel D5400XS, 4x PCIe 2.0 x16, all four slots are X16

but there are others too.

heinz


Hi Heinz,

but if 4 GPUs are inserted, do they all run at full x16 speed?


E.g. I have the MSI K9A2 Platinum (the 940 BE machine), and if all slots are used, all 4 PCIe 2.0 x16 slots run only at x8 (x8/x8/x8/x8).
ID: 1064428 · Report as offensive
-BeNt-
Avatar

Send message
Joined: 17 Oct 99
Posts: 1234
Credit: 10,116,112
RAC: 0
United States
Message 1064431 - Posted: 7 Jan 2011, 21:45:26 UTC - in response to Message 1064428.  
Last modified: 7 Jan 2011, 21:48:03 UTC


Hi Heinz,

but if 4 GPUs are inserted, do they all run at full x16 speed?


E.g. I have the MSI K9A2 Platinum (the 940 BE machine), and if all slots are used, all 4 PCIe 2.0 x16 slots run only at x8 (x8/x8/x8/x8).


Yes, they will - hence:

Yes, there are a number of motherboards that support 4x PCIe x16 now - the Asus Rampage Formula, eVGA Classified, and the Asus P6T6 Supercomputer boards, to name a few.


But there isn't any quad-x16 board for AMD. The fastest AMD board is the Asus Crosshair, and when placed into quad SLI the ports run at x16, x16, x8, x8.
Traveling through space at ~67,000mph!
ID: 1064431 · Report as offensive
Profile Sutaru Tsureku
Volunteer tester

Send message
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1064433 - Posted: 7 Jan 2011, 21:45:39 UTC - in response to Message 1064426.  
Last modified: 7 Jan 2011, 21:46:41 UTC

Hi Sutaru,
BTW, are there mobos out there with 4x PCIe 2.0 x16 slots that still run at full speed with 4 GPUs inserted?

Intel D5400XS, 4x PCIe 2.0 x16, all four slots are X16

but there are others too.

heinz


Yes, there are a number of motherboards that support 4x PCIe x16 now - the Asus Rampage Formula, eVGA Classified, and the Asus P6T6 Supercomputer boards, to name a few. As far as crunching differences between x8 and x16, I don't see where it would make a difference in isolation. The bandwidth of a single PCIe lane is much larger than the size of a WU. However, when you speed up the bus of your system the entire system is faster, so it will, in the end, make a difference.

As far as PCIe 3.0 goes, the last I heard the standard had been delayed and the specs wouldn't be available until 2011, and I have yet to see any motherboards with that specification on board.


But, like I said, it's not only about sending the WU to the GPU. During the whole calculation time of the WU, the CUDA application gets support from the whole PC system - over the PCIe slot.

As you can see from the result after I changed the system RAM in my system.
ID: 1064433 · Report as offensive
-BeNt-
Avatar

Send message
Joined: 17 Oct 99
Posts: 1234
Credit: 10,116,112
RAC: 0
United States
Message 1064435 - Posted: 7 Jan 2011, 21:50:04 UTC - in response to Message 1064433.  
Last modified: 7 Jan 2011, 21:54:46 UTC


But, like I said, it's not only about sending the WU to the GPU. During the whole calculation time of the WU, the CUDA application gets support from the whole PC system - over the PCIe slot.

As you can see from the result after I changed the system RAM in my system.


You're right, however the data sent across the PCIe link will not saturate it. It would have to push more than 250 MB/s over the smallest PCIe configuration (a 1.0 x1 link) to saturate it and slow anything down. However, when you run things at a higher speed the data flows faster, speeding up the entire system, as I stated.

The same reason your system is faster with 1066 RAM versus 800. I guess the best way to explain it is that the PCIe bus is a highway with lanes. The speed limit on that highway is the speed of the bus, 800 or 1066; the bandwidth is the number of lanes. You will get more information through the highway with more lanes, but at a certain point you can match or beat it with fewer lanes and more speed. For a given number of lanes, x8 or x16, you speed up the traffic by speeding up the bus. The same argument applies if you overclock your PCIe bus.

(I think we are saying the same thing in two different ways.)
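
To put rough numbers on the highway picture, here is how long a small host-to-GPU copy takes at different widths, ignoring setup latency. The payload size is an arbitrary assumption, not a measured SETI figure:

# Pure bandwidth transfer time for a small host -> GPU copy at different link widths.
# The 512 KiB payload is a placeholder, not a measured SETI@home transfer size.
payload_bytes    = 512 * 1024      # hypothetical small copy
per_lane_bytes_s = 500e6           # PCIe 2.0: ~500 MB/s per lane, per direction

for lanes in (4, 8, 16):
    t = payload_bytes / (per_lane_bytes_s * lanes)
    print(f"x{lanes:<2}: ~{t * 1e6:.0f} microseconds")
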
Traveling through space at ~67,000mph!
ID: 1064435 · Report as offensive
Profile Sutaru Tsureku
Volunteer tester

Send message
Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 1064739 - Posted: 8 Jan 2011, 20:39:44 UTC - in response to Message 1064435.  
Last modified: 8 Jan 2011, 20:43:47 UTC

Hmm... OK, but...

One member tested at PCIe 1.0 x16, x8 and x4 speeds and saw the performance losses I described.

During the whole calculation of a (CUDA) WU the whole system (CPU, system RAM, PCIe slot and so on) is in use.

Sure, the CUDA application doesn't send large amounts of data over PCIe, but maybe a higher PCIe speed reduces the delay... or whatever.

It's a pity that I can't find the old thread.
I don't know of a way to find a roughly two-year-old thread.


Maybe someone has identical GPU cards in one machine, running at different PCIe speeds?
Then we could compare the results.
Or, if the owner has experience with bench tests, he could run a small test.
Tools are available on the Lunatics site.


I thought about reviving my old Intel Core2 Extreme QX6700 with the Intel D975XBX2 mobo.
That mobo has 3x PCIe 1.0 x16 slots.
But PCIe slot #1 runs at x16 only with one GPU card inserted; with two inserted, slots #1 and #2 run only at x8. The 3rd slot always runs at x4.

I thought about inserting 2x GTX470 or 570 cards.
They would then run only at PCIe 1.0 x8.
And then 3 WUs/GPU?
I don't know whether that would work,
how big the performance loss would be,
or whether it would just be a waste of electricity (RAC/W ratio).
ID: 1064739 · Report as offensive
hbomber
Volunteer tester

Send message
Joined: 2 May 01
Posts: 437
Credit: 50,852,854
RAC: 0
Bulgaria
Message 1064769 - Posted: 8 Jan 2011, 21:46:22 UTC
Last modified: 8 Jan 2011, 21:49:48 UTC

The problem here is that you need to run the tests on a motherboard that can change the speed of a single slot, without needing another card inserted to make the board switch PCIe speeds. If any other device is present, it may affect the real results when comparing speeds.
One approach is to cover some of the contacts on the card's PCIe connector with tape, I think.
Another approach is to measure two cards on a board/boards in both cases - running 2x x16 or 2x x8 simultaneously - the same way they would run in their final setup. Then you get real numbers.
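
For the comparison itself, averaging the task times from each configuration is enough; the sample values below are placeholders, not real measurements:

# Compare mean task times at two link widths, as suggested above.
# The sample run times are placeholders, not real measurements.
def mean(xs):
    return sum(xs) / len(xs)

times_x16 = [1490, 1503, 1488, 1497]   # seconds per WU at x16 (placeholder data)
times_x8  = [1512, 1525, 1508, 1519]   # seconds per WU at x8  (placeholder data)

diff = (mean(times_x8) - mean(times_x16)) / mean(times_x16)
print(f"x8 is ~{100 * diff:.1f} % slower than x16 on this data")
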
ID: 1064769 · Report as offensive
Profile Wiggo
Avatar

Send message
Joined: 24 Jan 00
Posts: 34744
Credit: 261,360,520
RAC: 489
Australia
Message 1064778 - Posted: 8 Jan 2011, 22:04:31 UTC - in response to Message 1064739.  

Well, I have 2 systems here that have 2x 9800GTs in them (one Q6600 and one Athlon II X4 630). Both have one card in an x16 slot and the other in an x4 slot, but the cards in both systems complete tasks in the same amount of time no matter which slot or machine they're in (25 min average).

Cheers.
ID: 1064778 · Report as offensive
Profile zoom3+1=4
Volunteer tester
Avatar

Send message
Joined: 30 Nov 03
Posts: 65745
Credit: 55,293,173
RAC: 49
United States
Message 1064802 - Posted: 8 Jan 2011, 23:37:22 UTC
Last modified: 8 Jan 2011, 23:48:08 UTC

Then there's the eVGA P55/P67 Classified 200 motherboard, which has one x1, one x4, two x8 and three x16/x8 PCI-E slots.

Of course the x1 slot is not open at the end, so it's only useful for small cards.


eVGA P67 Classified 200

eVGA P55 Classified 200
The T1 Trust, PRR T1 Class 4-4-4-4 #5550, 1 of America's First HST's
ID: 1064802 · Report as offensive
-BeNt-
Avatar

Send message
Joined: 17 Oct 99
Posts: 1234
Credit: 10,116,112
RAC: 0
United States
Message 1064828 - Posted: 9 Jan 2011, 2:07:04 UTC

People always confuse this: bandwidth is not speed. The PCIe bus runs at a specified MHz regardless; that's what determines the speed of the link. The only difference between x1, x4, x8 and x16 is the bandwidth of the link. Bandwidth is a measure of how much data you can move in a certain period of time. If you take 25 MB and transfer it over an x4 PCIe link, it will take the same amount of time as over the x16 link. The only time you will see a true speed difference is when transferring files large enough to saturate the link, or if you have overclocked your PCIe bus above default values. Also, in a lot of cases, upgrading your RAM to improve latency or bandwidth through your front-side bus lets everything talk to everything else faster and/or increases the bandwidth between those parts, causing a speed-up because your machine had a bottleneck to begin with.

So until SETI@home starts communicating at 250 MB/s+ through the PCIe lanes I don't really see where you would notice a significant difference. Hope this makes sense.
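
As a quick sanity check of that point: even a deliberately generous guess for how much data an app shuffles per task gives an average rate nowhere near one lane's capacity (the 500 MB figure is an assumption; the ~25 min run time is the figure Wiggo quotes above):

# Average PCIe traffic per task versus one lane's capacity.
# bytes_moved_per_task is a generous guess, not a measured value.
bytes_moved_per_task = 500e6        # assume ~500 MB shuffled per WU in total (guess)
task_seconds         = 25 * 60      # ~25 min GPU run time (figure quoted earlier)
lane_capacity        = 250e6        # PCIe 1.0 x1, per direction

avg_rate = bytes_moved_per_task / task_seconds
print(f"average traffic : {avg_rate / 1e6:.2f} MB/s")
print(f"x1 lane budget  : {lane_capacity / 1e6:.0f} MB/s")
print(f"utilisation     : {100 * avg_rate / lane_capacity:.2f} %")
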
Traveling through space at ~67,000mph!
ID: 1064828 · Report as offensive
Profile Mike Special Project $75 donor
Volunteer tester
Avatar

Send message
Joined: 17 Feb 01
Posts: 34258
Credit: 79,922,639
RAC: 80
Germany
Message 1064877 - Posted: 9 Jan 2011, 7:42:09 UTC

Exactly, that's it.

You only need the higher lane speed for high-end gaming, not for crunching.



With each crime and every kindness we birth our future.
ID: 1064877 · Report as offensive
hbomber
Volunteer tester

Send message
Joined: 2 May 01
Posts: 437
Credit: 50,852,854
RAC: 0
Bulgaria
Message 1064891 - Posted: 9 Jan 2011, 8:31:02 UTC - in response to Message 1064828.  
Last modified: 9 Jan 2011, 8:36:07 UTC

People always confuse this: bandwidth is not speed. The PCIe bus runs at a specified MHz regardless; that's what determines the speed of the link. The only difference between x1, x4, x8 and x16 is the bandwidth of the link. Bandwidth is a measure of how much data you can move in a certain period of time. If you take 25 MB and transfer it over an x4 PCIe link, it will take the same amount of time as over the x16 link. The only time you will see a true speed difference is when transferring files large enough to saturate the link, or if you have overclocked your PCIe bus above default values. Also, in a lot of cases, upgrading your RAM to improve latency or bandwidth through your front-side bus lets everything talk to everything else faster and/or increases the bandwidth between those parts, causing a speed-up because your machine had a bottleneck to begin with.

So until SETI@home starts communicating at 250 MB/s+ through the PCIe lanes I don't really see where you would notice a significant difference. Hope this makes sense.

Yet another misleading post.
If you have to transfer a certain amount of data that exceeds what fits in a single clock cycle of the bus (and those 25 MB obviously do), transferring it in fewer cycles (x16 - more data moved per cycle through twice as many lanes) will be faster than transferring it in more cycles (x8), keeping in mind that a cycle is the same length whether the link is x16 or x8.
This is very true for the CPU->PCIe side, where PCIe lanes are a limited, fixed resource - 40 lanes for X48/X58 and merely 24 for the P55/P67 chipsets (8 of which are used for communicating with the ICH/PCH).
That is why multiplexer chips (PLX/Lucid/NF200) are used. They widen the bandwidth between themselves and the graphics cards (but not to the CPU/northbridge), using the cycles left free once the downstream bandwidth is doubled. In this way the limited, fixed CPU/chipset PCIe lanes are kept better filled with flowing data.
Especially with many cards that do not saturate the bus themselves and that transfer small amounts of data (which is the SETI case) but do so frequently, a multiplexer does add performance. In games, where the data flow is far larger, multiplexers may add extra latency and in fact kill performance.
Your description is true (but not exactly true) if the bus is not flooded with data, e.g. when a single GPU is used. And even then x16 gives better performance, again because fewer cycles are used, and each cycle takes a certain amount of time for synchronisation and handling. A percent, two, three - it does make a difference.
ID: 1064891 · Report as offensive
-BeNt-
Avatar

Send message
Joined: 17 Oct 99
Posts: 1234
Credit: 10,116,112
RAC: 0
United States
Message 1064914 - Posted: 9 Jan 2011, 10:06:26 UTC - in response to Message 1064891.  
Last modified: 9 Jan 2011, 10:08:40 UTC


Yet another misleading post.
If you have to transfer a certain amount of data that exceeds what fits in a single clock cycle of the bus (and those 25 MB obviously do), transferring it in fewer cycles (x16 - more data moved per cycle through twice as many lanes) will be faster than transferring it in more cycles (x8), keeping in mind that a cycle is the same length whether the link is x16 or x8.
This is very true for the CPU->PCIe side, where PCIe lanes are a limited, fixed resource - 40 lanes for X48/X58 and merely 24 for the P55/P67 chipsets (8 of which are used for communicating with the ICH/PCH).
That is why multiplexer chips (PLX/Lucid/NF200) are used. They widen the bandwidth between themselves and the graphics cards (but not to the CPU/northbridge), using the cycles left free once the downstream bandwidth is doubled. In this way the limited, fixed CPU/chipset PCIe lanes are kept better filled with flowing data.
Especially with many cards that do not saturate the bus themselves and that transfer small amounts of data (which is the SETI case) but do so frequently, a multiplexer does add performance. In games, where the data flow is far larger, multiplexers may add extra latency and in fact kill performance.
Your description is true (but not exactly true) if the bus is not flooded with data, e.g. when a single GPU is used. And even then x16 gives better performance, again because fewer cycles are used, and each cycle takes a certain amount of time for synchronisation and handling. A percent, two, three - it does make a difference.


You really like to make things personal, huh?

I think you should read up on how the PCIe bus works. The lanes convey a certain amount of data per second. The clock cycle of the bus you are referring to is the MHz of the PCIe bus, or bus speed. For instance, on my motherboard the PCIe bus speed is 100 MHz by default. This is the speed of the lanes.

An x1 PCIe link consists of 2 pairs of wires, one to send and one to receive. During a clock cycle you can transmit 1 bit of data. The speed at which those bits arrive is determined by the bus speed of the PCIe interface, in this example 100 MHz. MHz is the unit used to express the speed of a microprocessor, and 100 MHz on my board means 100 million clock cycles per second.

Moving to an x2 link you have two sets of wire pairs that send and receive. This bus still operates at 100 MHz, but having two sets of wire pairs allows you to move 2 bits of data during one clock cycle.

To put it in simple terms, the MHz of the PCIe bus - the bus speed - is the speed limit on a road. The link width is how many lanes that road has, and the data going through those lanes are the cars. While all your cars travel at the same speed, you will get more cars through a two-lane road than through a one-lane road.

Essentially all the data is moving at the same speed; you are just getting more bulk through the x8 and x16 links respectively. It isn't getting there any faster, as the speed is still 100 MHz; you merely have more bandwidth. Water flowing through a hose at 4 mph, whether it's coming from a fire hose or a garden hose, is still moving at 4 mph, but you get a lot more water from a fire hose.

Bandwidth does not equal speed; it equals throughput. Higher throughput means you will in the end get things done faster, because you will fill a pool faster with a fire hose than with a garden hose, but the water is still moving at the same speed. And as long as you aren't saturating the lane - a garden-hose amount going through the fire hose - you will see no difference in the speed of your endeavour. So in other words, if SETI were pushing more than what a PCIe x1 lane could handle, then yes, you would see a speed increase moving to a different specification. But it's not, and it can't be - the work units are only ~367 KB in size!

And the claim that 25 MB exceeds the limitations of the PCIe bus? You are dead wrong: PCIe 1.0a transfers at 250 MB/s, increasing to 500 MB/s in 2.0, and that's PER LANE. An x16 link can transfer up to 16 GB/s of data (both directions combined), but it's still operating at 100 MHz on my board. 25 MB wouldn't even bother anything. A 3 GHz processor with a 1066 FSB will be 'slower' than one with a 1333 FSB, but they are both still 3 GHz in speed. Bandwidth does not equal speed; it is throughput.

Now, as far as your point about the CPU-to-PCIe bus and the lane counts: an x16 port only needs 32 lanes; x16 is only supported on X38/X48/X58 and boards that run the same chips - your 40 lanes. Anything older runs 24 lanes but only supports x8 and lower, which only requires 16 lanes or fewer, so your point is moot?

So if you don't mind, quit following me around insulting me; if you feel I've missed something or have details that need to be added, please help out and reply. I replied with a simple answer before, without going into detail, because I figured everyone here who knew the finer details didn't need them, and there was no need to confuse the people who simply don't care. I'm not here trying to spew false information or mislead anyone (as you are implying with the whole 'Yet another misleading post'), but the personal attacks are not needed, wanted, or appreciated. Thanks.
Traveling through space at ~67,000mph!
ID: 1064914 · Report as offensive
hbomber
Volunteer tester

Send message
Joined: 2 May 01
Posts: 437
Credit: 50,852,854
RAC: 0
Bulgaria
Message 1064917 - Posted: 9 Jan 2011, 10:11:53 UTC
Last modified: 9 Jan 2011, 10:43:42 UTC

If you fail to understand how more lanes help move a certain amount of data faster - using fewer cycles but more data per cycle, because there are more lanes - then I'm not going to discuss this any more, sorry. And it's true, because one cycle cannot transfer 25 MB, which is considered a small transfer (of course, 25 MB does not exceed what PCIe can transfer per second, but where did I say the opposite!?). The fixed quantity here is not the time but the data itself. The time to transfer a fixed chunk of data, far smaller than what PCIe can handle, is what varies and what concerns us. That's the SETI case.
Speaking in road terms: if you need to move 4 cars (a fixed number) further, which is faster - one per cycle on one lane, or 4 per cycle on 4 lanes, hm? It doesn't matter how many cars you can push through that one lane per hour (which is the PCIe maximum throughput per lane), but how long it takes those 4 cars to move through one, or respectively four, lanes. Got it?
I'm not insulting you, I'm correcting you, so to say. Three times/posts in the last 24 hours you have written incorrect stuff.
Want more corrections? Why do you think that 367 KB of data is EVERYTHING that has to be moved across the PCIe bus? Then why, the hell, does it use 332 MB of video memory and 70+ MB of RAM (and it's far from being that simple, but let's not go into the details)? Hilarious.
Now, as far as your point about the CPU-to-PCIe bus and the lane counts: an x16 port only needs 32 lanes; x16 is only supported on X38/X48/X58 and boards that run the same chips - your 40 lanes. Anything older runs 24 lanes but only supports x8 and lower, which only requires 16 lanes or fewer, so your point is moot?

You REALLY need to buy a P35 motherboard and see how x16 (1.0) works there. Or a P55, and see how a single card runs at x16 while two run at x8. And you really need to know that x16 needs 16 lanes. Or are you talking about two slots?
On X48, 16 lanes go to the first PCIe slot, 16 go to the second, and the last eight go to the ICH9/ICH9R and are used for the third slot (x1 or x4, depending on the particular board's implementation) and various other devices. Same for X58.
ID: 1064917 · Report as offensive
-BeNt-
Avatar

Send message
Joined: 17 Oct 99
Posts: 1234
Credit: 10,116,112
RAC: 0
United States
Message 1064927 - Posted: 9 Jan 2011, 10:49:42 UTC - in response to Message 1064917.  
Last modified: 9 Jan 2011, 10:56:29 UTC

If you fail to understand how more lanes help move a certain amount of data faster, using fewer cycles but more bandwidth, I'm not going to discuss this any more, sorry.


I understand what you are saying. But I think I may have gotten wound up on the wrong topic of bus speed versus work speed. The next bit is a rough estimate, so don't hold me too closely to my math, as I'm doing this quickly. This only covers the bare minimum transfer to the GPU.

Bus speed = 100 MHz (100,000,000 clock cycles per second)
Work unit size = 367 KB (~2,936,000 bits)

PCIe x8
-------
Number of lanes = 8
Bits per cycle = 8
Cycles required to transfer the WU = ~367,000
Time to accomplish = ~0.0037 seconds

PCIe x16
--------
Number of lanes = 16
Bits per cycle = 16
Cycles required to transfer the WU = ~183,500
Time to accomplish = ~0.0018 seconds

Those are really rough times for transferring the work unit, meaning the latency between the CPU and GPU would be lower in an ideal world. And in the end it would be faster, just as the 3 GHz processor with the faster front-side bus is faster. However, it's still operating at 100 MHz; you are just freeing up clock cycles to push more data down the pipe. Where I think the performance difference would come in is when a larger file needs to be moved. Say 500 MB - that would start stressing the lanes a bit.

PCIe x8
-------
Bits per cycle = 8
Cycles required to transfer = 500,000,000
Time to accomplish = ~5 seconds

PCIe x16
--------
Bits per cycle = 16
Cycles required to transfer = 250,000,000
Time to accomplish = ~2.5 seconds
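
For what it's worth, here is that same back-of-envelope model (1 bit per lane per cycle at a 100 MHz reference clock - a simplification, since real PCIe signals at 2.5/5 GT/s per lane) written out so the arithmetic stays consistent:

# Simplified model made consistent: 1 bit per lane per cycle at 100 MHz.
# Real PCIe signals at 2.5/5 GT/s per lane, so these are illustrative numbers only.
CLOCK_HZ = 100e6

def transfer(bits, lanes):
    cycles = bits / lanes
    return cycles, cycles / CLOCK_HZ

for label, size_bytes in (("~367 KB work unit", 367_000), ("500 MB payload", 500_000_000)):
    for lanes in (8, 16):
        cycles, seconds = transfer(size_bytes * 8, lanes)
        print(f"{label:>18} over x{lanes:<2}: {cycles:,.0f} cycles, {seconds:.4f} s")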

I'm not sure at what rate these WUs are being processed in terms of bits/s, but I don't think, in my opinion, that at this time we are coming even close to this being a factor in faster crunching times. At least I didn't see a difference moving the same graphics card from a 939 N2 board at x8, to a C2D P35 board at x16, to my C2Q X48 board at x16. Talking purely GPU - I saw big gains on the CPU, lol.

I understand what you are saying and agree that it could be faster if conditions allow; I just hope you understand, or at least concede, my point in the discussion.

Thanks for not attacking me this time.

*Edit*
I take that back - you have now attacked me again. I thought discussions on here could be held in an adult manner. Unfortunately you cannot hold to that.

Your point about how much RAM is used on the card comes down to the calculations the WU goes through on the card. Once the transfer of the WU has taken place, the bus is finished with its main job, as the video card holds all the results during calculation until it has processed them and transfers a finished file, WU or whatever it's called, back for reporting. It's the same reason a video game can be only 650 MB yet use 1000 MB of video memory. You're getting petty in your argument, and I will not drop to that level. Take your insults and move along. Thanks for turning this into a rebuttal argument instead of a discussion; this is going nowhere. I'm off to bed to find better conversation tomorrow.
Traveling through space at ~67,000mph!
ID: 1064927 · Report as offensive
hbomber
Volunteer tester

Send message
Joined: 2 May 01
Posts: 437
Credit: 50,852,854
RAC: 0
Bulgaria
Message 1064929 - Posted: 9 Jan 2011, 11:17:03 UTC
Last modified: 9 Jan 2011, 11:39:23 UTC

I see you got it right as far as the PCIe lanes go.
To take it further: the performance gain from x16 would be insignificant (what counts as insignificant depends on your personal point of view - for Sutaru a few percent is a lot; for me too) only in the case where the bus is not saturated. That's not Sutaru's case, and it's not the case for serious crunchers. He/they run several PCIe devices (graphics cards) and the bus does get saturated.
Also, the serving logic (which may be only the CPU's on-die PCIe logic, or the chipset AND the CPU, depending on... you know what) needs more time to process data arriving over more cycles than the same data arriving in a single cycle. Each cycle involves synchronisation and waiting for one or another condition to be fulfilled, which extends the time needed to process the whole chunk of data. Also, some part of this logic may not be able to serve when asked, because it is busy at that moment, extending the wait even longer. That is a simple picture of the mechanism by which more lanes, served in parallel and simultaneously, are a better and faster solution than fewer lanes served over several cycles. Another, similar example is how the Athlon 64 handles SSE3 instructions: not like Intel, 128 bits per cycle, but 64 bits in two cycles. And the Athlons suck at this.

As for transfer sizes: if you knew what a kernel-mode transition is, how much time it takes to invoke the driver and to issue an I/O (which is what any driver call in fact is), etc., you wouldn't be so sure in giving me the same "petty" examples of how much data is moved and how often transfers occur. Yes, the data may sit in video RAM, but it takes many transfers across the bus to tell the GPU what it needs to do with it. Jason and Raimster can explain it better than I can, I guess. Jason can also explain how faster system RAM (or FSB, because on certain platforms that results in faster transfers along the CPU->memory->chipset highway) may help - and it does. He has done some tests.

Your point about a 3 GHz CPU with a different bus is wrong too. A faster FSB may give you an insignificant speed-up when the FSB is not the bottleneck, and it can give you a huge performance increase when the CPU is fully loaded, resulting in more calculations being done (because a single data transfer to/from system RAM is, from the CPU's point of view, a whole eternity).
It's entirely the same for PCIe. It's kind of "another FSB", an interface between two communication points.

I wouldn't whine like this and comment on your attitude towards me when we are discussing serious stuff and I get, hm, "attacked". Nor would I comment on the nature of the arguments you use outside the context (and be sure, I can comment on them pretty harshly). You are not being insulted directly, so cut it out and leave it at that.
ID: 1064929 · Report as offensive
Profile -= Vyper =-
Volunteer tester
Avatar

Send message
Joined: 5 Sep 99
Posts: 1652
Credit: 1,065,191,981
RAC: 2,537
Sweden
Message 1064931 - Posted: 9 Jan 2011, 12:06:03 UTC
Last modified: 9 Jan 2011, 12:12:58 UTC

I haven't read through the entire thread, but I've read the latest posts.

From what I can see, the discussion is almost always about the bandwidth issue.
The only place where bandwidth is an issue is when you start a new SETI@home WU and the CPU needs to prepare the data. I don't really know what it does, but it seems to expand it (CPU RAM usage grows steadily); when that has levelled off at around 70-100 MB it seems to move the data over to GPU RAM. That requires bandwidth, because it's a lot of data.
Once the data has been uploaded to GPU RAM it starts to crunch numbers, and then the S@h executable needs to feed it memory pointers and small "when this block is done, move to the next block of data" parameters.

Those parameters need to be communicated over the "slow" PCI-E bus; bandwidth is not an issue there, but the overall speed of the highway is.
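
Putting a rough number on that one big upload (the ~100 MB figure is the RAM-growth observation above; the link rate is the nominal PCIe 2.0 figure):

# Time for the one-off ~100 MB host -> GPU upload at different PCIe 2.0 widths,
# ignoring protocol overhead. The payload size is the rough figure mentioned above.
payload  = 100e6     # ~100 MB prepared by the CPU
per_lane = 500e6     # PCIe 2.0: ~500 MB/s per lane, per direction

for lanes in (4, 8, 16):
    print(f"x{lanes:<2}: ~{payload / (per_lane * lanes) * 1000:.0f} ms")
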
--------------------
Think of a highway with almost no stops that leads from Washington D.C. to Chicago. It's 16 lanes in each direction.
When you have a lot of shipments you send 10,000 trucks from Washington to Chicago; more trucks fit on the highway, and so everything arrives at its destination faster and almost all at once.
But unfortunately the telephone and email have not been invented yet, so to find out what to do with the shipment you send a driver back from Chicago to Washington with a letter that says "Thanks for the goods, where should we send this?".
The driver travels back on the highway; he has no friends, and it's all silent and dull on the highway.
A couple of hours later the driver gets back to the office in Washington. The manager (CPU) opens the letter and reads "Thanks for the goods, where should we send this?". The manager thinks "OMG, why didn't I think of that" and quickly writes a letter which says "All goods from the 10,000 lorries should be sent to factory xxx at xxx road in Calgary".

All is good; the driver heads back to Chicago. As usual he sees hardly any traffic at all and wonders why nobody has figured out a way to send messages quickly between different locations, because he has a lot of time on the way to think about other things.
He starts to get annoyed that he has to drive back and forth to deliver messages, but he knows the delivery needs to be confirmed at its destination (TCP), and if something happens to the information on the way, they need to send another guy to do the work all over again (TCP resend). He starts to sigh, wondering whether there is a good way to make foolproof deliveries with pigeons (UDP), but he quickly finds out: crap, that isn't doable.
He's finally back in Chicago and heads to the office (GPU), where the boss quickly opens the letter and smiles at first... then... he takes paper and pen and starts to write "By which means - by truck, by rail, by flight?", takes an envelope, folds the paper, puts it into the envelope and says "Sorry, you need to go to the main office ASAP".

The truck driver starts to hate his life as he chugs the looong way back from Chicago to Washington, cursing why the hell the trucks have a speed limiter set at 100 mph; if the truck had a speed limiter set at at least 110 mph it would save him at least 45 minutes getting there. The traffic (bandwidth) is not an issue; instead it's the speed of the highway that annoys him greatly, as he curses why the hell someone hasn't invented faster trucks, and why those stupid guys at the American road agency (don't know the name) haven't raised the speed limit on the highways to at least 110 - it shouldn't do any harm!

Well, time flies by (processing time); the load is still sitting in Chicago and hasn't moved anywhere because of the lack of detailed information at the start.
The driver gets back to the office in Washington (CPU) and the manager opens up and reads.
Here we go again: he takes up the paper, writes "By flight", folds the paper, puts it into a new envelope and says "Well, by this time I think you know what to do, huh?!" (cache). Then he smiles back at you, and you give out a long *sigh* and head back to the truck and up to Chicago once again.

During that trip the driver starts to think: hey, have I seen this road before? He hates once again that they haven't invented foolproof deliveries by pigeon (UDP resend... hmm, hey, can't compute... error) and quickly bins that idea. He utterly hates the speed limiter on trucks, along with the speed limit on the roads, and thinks of ways to speed things up a bit (overclocking).

Time flies by faster, and he starts to think that soon this shipment could be off to its real destination. For once he starts to grin as he arrives in Chicago.
When he arrives he rushes up to the office thinking "Yes, finally we can be off with the goods", knocks on the door of the boss (GPU) - no answer?
... He knocks again, but more firmly this time. The boss coughs and finally says "come in"; apparently he fell asleep (GPU idle clock-down), but hey, he is only human after all. The driver hands the letter to the boss with a smile and thinks "Yes, Calgary". But... WTF... the boss grabs another stupid piece of paper and writes "Departure?"... OMG, he can't be serious - surely it should be the nearest departure, it should be.
Buuut in this company it's all about confirmation. So, with his head hanging low, the driver says to the boss, "Yeah yeah, I know the drill - back to the main office once again" (cache).

The driver's mind was blank; he could now hardly breathe. If only this long trip could be more fun and speedier (overclocking).
Time flies by, and during that time the driver thinks of ways to improve things, but with no one else around he has trouble venting his ideas, so he starts to withdraw into himself.

The truck is now back in Washington; he drags himself up the stairs and thinks "this is it" - he's exhausted.
Knocking on the door, he almost immediately hears the voice say "come in". Man, those managers are fast - he had barely touched the door for a split microsecond before he said come in.
The driver hands over the letter once again and hopes that this is the last time he needs to go back to Chicago, but he's not overly convinced.
The manager writes four letters, sighs "Why can't anyone think for themselves?", puts the note into a fresh new envelope, seals it and gives it to the driver.

With his head already hanging low, the driver takes the letter, mumbling and cursing his own miserable life; instead of a lorry driver he should have been a cook (software coder), or perhaps a braindead thing that makes all the food edible, like a stove or something (compiler).

Naah, off we go...

During that looong, dreaded trip back to Chicago he comes up with an idea.
Instead of a huge lorry carrying a single small envelope all the time, the highway should have two express lanes in each direction with a speed limit of 800 mph. If that were possible he could take another vehicle for those small packages, which add hardly any weight, and travel between Washington and Chicago in only one eighth of the time.
He brightens up and thinks, hey, this is not a bad idea: slow, bulky lorries in the wide lanes and a parallel high-speed vehicle lane.
This is brilliant! But who the f*** listens to a small worker like me?
Everything is about money and cutting costs these days, and he quickly realises that no company in the world would be willing to invest the money and time to build it, and he quickly finds himself stuck in time again!
He cries out, "Why can't anyone in a high enough position come up with this? We really need to think outside the box!!" He's frustrated.
The cargo in Chicago has been stopped for days now - and for what?
No detailed information at the start, so that we could do it right from the beginning, which in turn would lead to fewer delays (software engineering and optimisation).

Time flies by, albeit slowly of course.
The driver is back in Chicago, and with heavy steps he walks to the boss's office and knocks. No answer! Exactly like last time, he thinks. He waits, knocks on the door again. No answer?!! "What?!" He knocks on the door and waits. Still no answer!! Bah. He puts his hand on the knob, carefully opens the door - and what he sees he wishes he had never seen at all!
There he was, the boss, hanging from a beam in the ceiling, and the driver was scared stiff (driver hang... driver/boss! :) ). Holy crap! How the f* did this happen? He quickly backs out, closes the door and wishes with all his heart that this is a dream.
He sets off to get answers! A few steps away he notices that he has dropped the envelope, so he turns around and walks through the door into the office - and there the boss is again, sitting in the chair as if nothing had ever happened, with a cut noose around his neck, staring at him as he enters and shouting, "When you work at this company, you ALWAYS knock before entering a room!"
The driver thought he was dreaming (driver restart) but quickly responds, "I'm sorry, sir! Here are the latest orders from Washington."

The boss yanks the letter from the driver and frowns; he opens it up and reads.
His face slowly starts to smile and he says, "Well, let's get to it. Go to the other lorries and wait for departure to the airport."

The driver brightens up, starts to smile and rushes out the door, running towards the other trucks and thinking "At last!!" - but then he halts, thinking, what was that?
He heads back into the room, asking for permission to speak since he had behaved so badly.
Driver: I'm just curious - what was it in that letter that could confirm this? I was starting to get used to the highway instead of doing what I was employed for.
Boss: Well, with this magic word I could make the decision much more easily, so for what it's worth to you, that magic word was: ASAP.
The driver stared at the boss and gasped, "So with these magic letters, or word, my trips back and forth could have been saved if it had been there right from the beginning?" (software optimisation, rescheduling)
Boss: Yes

The driver was furious; he wanted to kick the software developers in the a**, but at the same time he was glad he had come up with that idea of a high-speed communication lane parallel to the highway, and hopefully someone will make that idea come to life in the future, if he ever dares to speak of it.

Driver: "I'm on my way, sir. Thank you for not firing me for not being polite enough to knock on that door."
Boss: "Anytime, but always wait for an answer before entering my office in the future, got it?" (interrupt management)
Driver: "Most certainly, boss."

The driver smiled and turned again, walking towards his waiting friends.

-------------------------

Well, what can we make of this, then?
Sorry for my little story, but it illustrates that it doesn't matter how wide the PCI-E bus is if only a small payload has to arrive at its destination.
If you increase the speed of the PCI-E bus you lower the overall computing time, because until the GPU knows what to do next it stalls and waits for further instructions.

And as I told in my little story, until someone makes a parallel, small-payload, high-speed communication lane, PCI-E will have some impact, to varying degrees.

I presume that VHARs benefit more from a faster PCI-E link than, for instance, a regular MAR WU, because with a regular WU the GPU is more occupied working on its own data, whereas in the VHAR case there are constant "what to do next" messages on the PCI-E bus.
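
A toy model of that "many small round trips" picture - every number below is an illustrative guess, not a measurement - shows why per-transfer latency, rather than link width, tends to dominate:

# Toy model: many tiny command/ack round trips, each paying a fixed latency plus
# a transfer time that depends on link width. All figures are illustrative guesses.
def task_bus_time(round_trips, bytes_per_trip, latency_s, lane_mb_s, lanes):
    bandwidth = lane_mb_s * 1e6 * lanes           # bytes per second
    return round_trips * (latency_s + bytes_per_trip / bandwidth)

trips, payload, latency = 200_000, 4 * 1024, 10e-6   # hypothetical per-WU figures

for lanes in (8, 16):
    t = task_bus_time(trips, payload, latency, 500, lanes)
    print(f"x{lanes:<2}: ~{t:.1f} s spent on the bus per WU")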


Sorry once again for making such a huge post, but I wanted to make a small story out of this discussion and couldn't stop myself :)

Kind regards, Vyper

_________________________________________________________________________
Addicted to SETI crunching!
Founder of GPU Users Group
ID: 1064931 · Report as offensive