Random Musings About the Value of CPUs vs CUDA


AuthorMessage
OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 848728 - Posted: 3 Jan 2009, 16:06:00 UTC - in response to Message 848650.  

What "toys" and what achievements are you talking about?

I suspect a rather oblique reference to Larrabee.


Hm, it's Intel's achievement. Is this person == Intel? If so, well, I will look for benchmarks of this new CPU :) And again, even this new CPU can benefit from a co-processor ;)


Just a small note: Larrabee is rumored to be Intel's new high performance GPU, so it will compete with nVidia and ATi's higher end offerings.
ID: 848728 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 848744 - Posted: 3 Jan 2009, 16:51:31 UTC - in response to Message 848728.  

What "toys" and what achievements are you talking about?

I suspect a rather oblique reference to Larrabee.


Hm, it's Intel's achievement. Is this person == Intel? If so, well, I will look for benchmarks of this new CPU :) And again, even this new CPU can benefit from a co-processor ;)


Just a small note: Larrabee is rumored to be Intel's new high performance GPU, so it will compete with nVidia and ATi's higher end offerings.

From the AnandTech article:
"Well, it is important to keep in mind that this is first and foremost NOT a GPU. It's a CPU. A many-core CPU that is optimized for data-parallel processing."
But it should be used as a replacement for current GPUs (as far as I understand from that article). But maybe even this hybrid can co-exist with nVidia GPUs in a single case? ;) If yes, all that I said about the GPU as a co-processor remains valid for this new chip too.
ID: 848744 · Report as offensive
Profile Dirk Sadowski
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 848748 - Posted: 3 Jan 2009, 17:13:48 UTC - in response to Message 848370.  


I don't know/understand why you are so negative about CUDA?



I have promised to keep quiet about that............
It's not the reality of it...it's how it came about........

'Nough said.....'

Done discussing it.......end of line.


My post was not only for you..

..it was for everyone on the board who doesn't like CUDA.. [said in kind words..]

:-)

ID: 848748 · Report as offensive
Profile Dirk Sadowski
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 848751 - Posted: 3 Jan 2009, 17:21:36 UTC
Last modified: 3 Jan 2009, 17:24:03 UTC

How would the performance compare?

A PCIe 2.0 GPU in a PCIe 1.0 slot.
A PCIe 2.0 GPU in a PCIe 2.0 slot.

How big would the slowdown be with a PCIe 1.0 slot?

---------------------------------------

In future, will the SETI@home CUDA app always need a CPU core for crunching?
..and the PCIe slot for communication/crunching?

---------------------------------------

What about the architecture?

Maybe it would be better to combine an AMD CPU with an nVIDIA GPU?
More performance?



Thanks!
ID: 848751 · Report as offensive
Profile Gecko
Volunteer tester
Joined: 17 Nov 99
Posts: 454
Credit: 6,946,910
RAC: 47
United States
Message 848795 - Posted: 3 Jan 2009, 18:51:37 UTC - in response to Message 848634.  


Look at the date, dude ... At that time, SIMD was not so common.
I never heard anybody challenging my ASM capabilities ...


Actually, there was much work w/ SIMD and ASM that occurred well before the Seti-Enhanced transition. The initial Seti-BOINC 4.x application was well optimized by the time the Enhanced transition took place in May '06. Tetsuji Maverick Rai did significant SIMD hand coding/optimizing in 2005, as did Harold Naparst, Hans Dorn and, of course, Crunch3r. On the PPC front, Alex Kan basically hand-wrote vectorized code for almost the entire PPC application in 2005 to maximize VMX (AltiVec) instruction usage, minimize L1 & L2 thrashing and make the most efficient use of the larger L2 caches in G4 & G5 PPCs. This is why a 1.25 GHz PPC7455 G4 would produce similar RAC to a P4 running at 2x the clock w/ the then-available optimized -doze apps. The Top Comp list in late '05 & early '06 was mostly PPC Macs until Core2 arrived.

ID: 848795 · Report as offensive
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 51477
Credit: 1,018,363,574
RAC: 1,004
United States
Message 848798 - Posted: 3 Jan 2009, 18:53:46 UTC - in response to Message 848795.  


Look at the date, dude ... At that time, SIMD was not so common.
I never heard anybody challenging my ASM capabilities ...


Actually, there was much work w/ SIMD and ASM that occurred well before the Seti-Enhanced transition. The initial Seti-BOINC 4.x application was well optimized by the time the Enhanced transition took place in May '06. Tetsuji Maverick Rai did significant SIMD hand coding/optimizing in 2005, as did Harold Naparst, Hans Dorn and, of course, Crunch3r. On the PPC front, Alex Kan basically hand-wrote vectorized code for almost the entire PPC application in 2005 to maximize VMX (AltiVec) instruction usage, minimize L1 & L2 thrashing and make the most efficient use of the larger L2 caches in G4 & G5 PPCs. This is why a 1.25 GHz PPC7455 G4 would produce similar RAC to a P4 running at 2x the clock w/ the then-available optimized -doze apps. The Top Comp list in late '05 & early '06 was mostly PPC Macs until Core2 arrived.

(Gives a big kitty grin).....The Core 2s are what weaned me away from my OC'd Semprons.....what a relief.
"Time is simply the mechanism that keeps everything from happening all at once."

ID: 848798 · Report as offensive
Profile Voyager
Volunteer tester
Joined: 2 Nov 99
Posts: 602
Credit: 3,264,813
RAC: 0
United States
Message 848800 - Posted: 3 Jan 2009, 19:01:01 UTC - in response to Message 848599.  

Is it dumb not to use CUDA? I see GPUs for less money than a memory upgrade, $59, with what sounds like more improvement.

Don't know where you are, but here in Australia it's not possible to get a reasonably fast CUDA capable card for any less than about $200.

I'm in Puget Sound, WA.
What I found was on Newegg, low-enders: 9400 $59, 9500 $69, 9600 $100.
Just thinking maybe more bang for the buck than a memory upgrade, which may add ~5% to my crunching.
ID: 848800 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 848818 - Posted: 3 Jan 2009, 19:36:36 UTC - in response to Message 848751.  

In future, will the SETI@home CUDA app always need a CPU core for crunching?

It already doesn't need a full core for crunching. See the other threads.
CPU load ~3-5%
ID: 848818 · Report as offensive
Profile Dirk Sadowski
Volunteer tester

Joined: 6 Apr 07
Posts: 7105
Credit: 147,663,825
RAC: 5
Germany
Message 848841 - Posted: 3 Jan 2009, 20:05:12 UTC - in response to Message 848818.  
Last modified: 3 Jan 2009, 20:07:26 UTC

In future, will the SETI@home CUDA app always need a CPU core for crunching?

It already doesn't need a full core for crunching. See the other threads.
CPU load ~3-5%


Yes.. so the GPU communicates with the CPU.. over the PCIe slot.. while crunching..

In the future, might the S@H GPU CUDA app no longer need to communicate with the CPU while crunching WUs?
Then a 'high' slot (PCIe 2.0) would no longer be needed.
Only GPU crunching, without support from the CPU.
Only the 'upload' and 'download' of the WU over the PCIe slot to the GPU.
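
As a rough way to reason about the slot question, here is a minimal sketch (not from the thread) that estimates the ideal one-way transfer time from the nominal per-lane rates of PCIe 1.x (250 MB/s) and PCIe 2.0 (500 MB/s). The payload size is a made-up illustrative figure, not a measured SETI@home value, and `transfer_seconds` is a hypothetical helper.

```python
# Nominal per-lane, per-direction bandwidth after 8b/10b encoding (bytes/s).
PCIE_LANE_BYTES_PER_S = {"PCIe 1.x": 250e6, "PCIe 2.0": 500e6}


def transfer_seconds(payload_bytes, generation, lanes=16):
    """Ideal one-way transfer time; ignores latency and protocol overhead."""
    return payload_bytes / (PCIE_LANE_BYTES_PER_S[generation] * lanes)


# Hypothetical payload, chosen only for illustration (assumption).
payload = 1_000_000  # 1 MB

for gen in PCIE_LANE_BYTES_PER_S:
    t = transfer_seconds(payload, gen)
    print(f"{gen} x16: {t * 1e3:.3f} ms for {payload} bytes")
```

Under this idealized model the older slot only doubles the transfer time, so if the data really moves in a few big batches the slot generation should matter little. Real transfers add latency and overhead that the sketch ignores.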
ID: 848841 · Report as offensive
Iona
Joined: 12 Jul 07
Posts: 790
Credit: 22,438,118
RAC: 0
United Kingdom
Message 848865 - Posted: 3 Jan 2009, 20:28:00 UTC - in response to Message 848841.  

Maybe, in the future, but you'd still need some sort of external controller to do all the 'house-work' - never mind the PSU, case, OS, LAN/net connection, data storage etc...... Better to get rid of the bugs, first. Probably a mistake, for me to make comments on this subject, surrounded by lots of 'big hitters'! lol



Don't take life too seriously, as you'll never come out of it alive!
ID: 848865 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 848871 - Posted: 3 Jan 2009, 20:31:49 UTC - in response to Message 848841.  

Communication is very limited. See the previous posts in this thread. With heavy communication over PCI-E the computing speed would be too low. Data is loaded onto the GPU in big batches, then the results are retrieved.
Full GPU processing could be even less effective - the GPU is very poor in places where a lot of branching takes place (the GPU needs to go in both branch directions; it can't just take one branch as a CPU does). So the GPU is best for stream computations, while the CPU is best at handling program logic.
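
The branching point can be illustrated with a toy cost model. This is a sketch under simplifying assumptions, not a description of any particular GPU: threads run in lock-step groups (NVIDIA calls them warps, 32 threads each), and the group pays for every branch path that at least one of its threads takes. `warp_cycles` and the cycle costs are hypothetical.

```python
def warp_cycles(conditions, cost_if, cost_else):
    """Cycles one lock-step group spends on an if/else under a toy SIMT model.

    conditions holds the branch condition of each thread. Every path taken
    by at least one thread is executed by the whole group, one path after
    the other.
    """
    cycles = 0
    if any(conditions):        # someone takes the "if" side
        cycles += cost_if
    if not all(conditions):    # someone takes the "else" side
        cycles += cost_else
    return cycles


uniform = [True] * 32                          # all threads agree
divergent = [i % 2 == 0 for i in range(32)]    # threads alternate

print(warp_cycles(uniform, 100, 100))    # 100
print(warp_cycles(divergent, 100, 100))  # 200
```

In this model the penalty appears only when threads in the same group disagree, which is why regular stream-style work suits the GPU while branch-heavy logic is better left to the CPU.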
ID: 848871 · Report as offensive
Profile Francois Piednoel
Joined: 14 Jun 00
Posts: 898
Credit: 5,969,361
RAC: 0
United States
Message 848875 - Posted: 3 Jan 2009, 20:39:54 UTC - in response to Message 848644.  

On 2 ... do you think your GPU will do any better ...

I think it will do more processing per unit of time spent by the CPU sending data to the GPU. And great data access locality helps the GPU too - it can keep the needed data in GPU memory and not use PCI-E heavily.
Why do you refuse to look at the GPU not as another CPU, better or worse than yours, but as a co-processor? It's possible to do almost the whole WU inside the GPU. You only need to pass the initial data array there and retrieve the results from it. Look at the task size - in the ideal case, not SO big a data array needs to be fed in. How many data transfers there are in the current CUDA MB is a question of optimisation of this app, not of CUDA technology itself.
And it seems there are not many PCI-E transfers in the current CUDA MB either - CPU load is really low.


If you look at findpulse, or AlexK's version of it, it is still using data locality a lot, more than you think for sure. Most of it is cached. Same for the FFT. You need an increase in mem traffic, due to the increase of compute power. Nehalem does very well at FFT; it does them faster than the GPU, 8 by 8. Your GPU without cache will struggle in findpulse, and the FFTs in parallel will not use the max bandwidth of the load ports.

Again... the point is that the GPU can do an FFT (for example) at the same time as the CPU is doing ANOTHER FFT. If the CPU does 10 FFTs while the GPU finishes one FFT (it's not the case, it's just an example) - well, FINE, you will do almost 11 FFTs instead of just 10 FFTs. Almost - because of some CPU share needed to feed the GPU. Are you claiming this share is so big that the CPU could do 11 FFTs in the same time period if it did not feed the GPU??

Addon: If that's so for high-end CPUs - please provide benchmark data.
For non-high-end CPUs I know it's not true. The CPU can't do 11 FFTs per time period w/o the GPU (in my example).



today, (time taken to send through PCI express + Time doing the FFT + Time sending it back ) > (doing FFT on Core i7 )

End of story.
ID: 848875 · Report as offensive
Profile Francois Piednoel
Joined: 14 Jun 00
Posts: 898
Credit: 5,969,361
RAC: 0
United States
Message 848878 - Posted: 3 Jan 2009, 20:44:37 UTC - in response to Message 848818.  

In future, will the SETI@home CUDA app always need a CPU core for crunching?

It already doesn't need a full core for crunching. See the other threads.
CPU load ~3-5%



Hahaha ... ok, so let's see what a G92 does on a Celeron ... lol, try to get this to Top 1.

NV claims that the GPU is the center of everything; that is totally inaccurate and wrong. On top of this, they already even claim victory on their website ... what a joke. It is the same for most of the CUDA claims.

In French, we call this farting in the wind and saying that you caused the storm ... lol

ID: 848878 · Report as offensive
Profile Francois Piednoel
Joined: 14 Jun 00
Posts: 898
Credit: 5,969,361
RAC: 0
United States
Message 848899 - Posted: 3 Jan 2009, 21:17:42 UTC - in response to Message 848644.  
Last modified: 3 Jan 2009, 21:19:11 UTC

On 2 ... do you think your GPU will do any better ...

I think it will do more processing per unit of time spent by the CPU sending data to the GPU. And great data access locality helps the GPU too - it can keep the needed data in GPU memory and not use PCI-E heavily.
Why do you refuse to look at the GPU not as another CPU, better or worse than yours, but as a co-processor? It's possible to do almost the whole WU inside the GPU. You only need to pass the initial data array there and retrieve the results from it. Look at the task size - in the ideal case, not SO big a data array needs to be fed in. How many data transfers there are in the current CUDA MB is a question of optimisation of this app, not of CUDA technology itself.
And it seems there are not many PCI-E transfers in the current CUDA MB either - CPU load is really low.


If you look at findpulse, or AlexK's version of it, it is still using data locality a lot, more than you think for sure. Most of it is cached. Same for the FFT. You need an increase in mem traffic, due to the increase of compute power. Nehalem does very well at FFT; it does them faster than the GPU, 8 by 8. Your GPU without cache will struggle in findpulse, and the FFTs in parallel will not use the max bandwidth of the load ports.

Again... the point is that the GPU can do an FFT (for example) at the same time as the CPU is doing ANOTHER FFT. If the CPU does 10 FFTs while the GPU finishes one FFT (it's not the case, it's just an example) - well, FINE, you will do almost 11 FFTs instead of just 10 FFTs. Almost - because of some CPU share needed to feed the GPU. Are you claiming this share is so big that the CPU could do 11 FFTs in the same time period if it did not feed the GPU??

The story changed dramatically, didn't it, from "the GPU is going to be 4x faster than Phenom" to "the GPU could do some FFTs" ...

If your GPU does some FFTs, you are farting in the wind, with jeans on (the PCI Express bus).

And for the moment, you are the one who needs to provide data showing that a GPU can accelerate a Core i7 by more than 1% ... because that would be a very expensive 1%.

I am not even talking about the dual Nehalem that is coming ... your GPU will be a drop of water in the sea. lol
ID: 848899 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 848903 - Posted: 3 Jan 2009, 21:20:43 UTC - in response to Message 848875.  
Last modified: 3 Jan 2009, 21:31:19 UTC


today, (time taken to send through PCI express + Time doing the FFT + Time sending it back ) > (doing FFT on Core i7 )

End of story.


No, only the beginning of the story.
You just can't realize (are you sure you know about parallel computations? ;) ) that the CPU is NOT SITTING IDLE while the GPU does the FFT.
So you posted the WRONG expression. You need to compare
(time taken to send through PCI Express + time sending it back) ? (doing FFT on Core i7)
You should not include the time for the GPU FFT here.
Moreover, PCI-E transfers can be asynchronous with respect to the CPU, so the CPU need not wait for PCI-E in this case either. So what?
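
The co-processor argument can be written as a tiny throughput model. This is only a sketch of the reasoning in this thread, using the "10 FFTs vs 1 FFT" example figures from the earlier post and an assumed 4% feeding overhead (picked from the ~3-5% CPU load reported above); `ffts_per_period` is a hypothetical helper.

```python
def ffts_per_period(cpu_ffts, gpu_ffts, feed_fraction):
    """FFTs finished in one period when CPU and GPU work concurrently.

    cpu_ffts      - FFTs the CPU alone completes in the period
    gpu_ffts      - FFTs the GPU completes in the same period
    feed_fraction - share of CPU time spent feeding the GPU (0..1)
    """
    return cpu_ffts * (1.0 - feed_fraction) + gpu_ffts


cpu_only = ffts_per_period(10, 0, 0.0)
with_gpu = ffts_per_period(10, 1, 0.04)  # 4% overhead is an assumption

print(f"CPU only: {cpu_only:.1f} FFTs, CPU + GPU: {with_gpu:.1f} FFTs")

# The GPU stops paying off only when feeding it costs more CPU time than
# the GPU gives back: feed_fraction > gpu_ffts / cpu_ffts.
break_even = 1 / 10
print(f"break-even feed fraction: {break_even:.0%}")
```

With these example numbers the GPU is a net gain as long as feeding it takes less than 10% of the CPU, whatever the GPU's own FFT time is.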
ID: 848903 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 848905 - Posted: 3 Jan 2009, 21:25:54 UTC - in response to Message 848899.  


If your GPU does some FFTs, you are farting in the wind, with jeans on (the PCI Express bus).

And for the moment, you are the one who needs to provide data showing that a GPU can accelerate a Core i7 by more than 1% ... because that would be a very expensive 1%.

I am not even talking about the dual Nehalem that is coming ... your GPU will be a drop of water in the sea. lol

OMG.... Any NUMBERS, please?
These are only your claims, no more. I see that my 9600GSO does a short task in <7 min while a Q9450 at 2.66 GHz takes ~11-12 min for the same task. So, more than 1 additional core!
If you want more precise numbers I will post all the benchmarks (already posted) in this thread too. I'm really tired of your unproven claims. NUMBERS????
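
For reference, the "more than 1 additional core" figure follows directly from the two timings quoted above. A minimal sketch of the arithmetic, where `core_equivalents` is a hypothetical helper and 7 min is used as an upper bound for the GPU time:

```python
def core_equivalents(cpu_minutes_per_task, gpu_minutes_per_task):
    """How many CPU cores match the GPU's task throughput."""
    return cpu_minutes_per_task / gpu_minutes_per_task


low = core_equivalents(11, 7)   # Q9450 core at 11 min, GPU at 7 min
high = core_equivalents(12, 7)  # Q9450 core at 12 min, GPU at 7 min

print(f"GPU is worth about {low:.2f} to {high:.2f} cores")  # 1.57 to 1.71
```

Since the GPU time was reported as under 7 minutes, these are lower bounds for this one short task type.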
ID: 848905 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester

Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 848907 - Posted: 3 Jan 2009, 21:32:52 UTC - in response to Message 848875.  

today, (time taken to send through PCI express + Time doing the FFT + Time sending it back ) > (doing FFT on Core i7 )

True. (at least for FFTs shorter than about 64K)

End of story.

False. The task is not to return FFT data, it is to FFT and process the data, then return extracted meta information. 6.06 may not yet do as much as it should on the GPU, but the small amount of CPU time needed indicates it's doing fairly well in that regard. If it were able to do all setiathome_enhanced WUs without crashing or finding false signals it would be a worthy addition to our crunching capabilities.
                                                             Joe
ID: 848907 · Report as offensive
Profile Raistmer
Volunteer developer
Volunteer tester
Joined: 16 Jun 01
Posts: 6325
Credit: 106,370,077
RAC: 121
Russia
Message 848910 - Posted: 3 Jan 2009, 21:41:09 UTC - in response to Message 848907.  

today, (time taken to send through PCI express + Time doing the FFT + Time sending it back ) > (doing FFT on Core i7 )

True. (at least for FFTs shorter than about 64K)

                                                             Joe


Joe, why should we compare these numbers? Let's compare the time to process a full task on the GPU with one FFT on the CPU ... What is the sense in this comparison at all?
ID: 848910 · Report as offensive
Profile Francois Piednoel
Joined: 14 Jun 00
Posts: 898
Credit: 5,969,361
RAC: 0
United States
Message 848915 - Posted: 3 Jan 2009, 21:49:30 UTC - in response to Message 848905.  


OMG.... Any NUMBERS, please?


You are the one claiming a performance improvement; where are your numbers?????
How can we verify them????

From now on, I know you are a fanboy. I will keep adjusting your claims. Right now, the public code has accelerated nothing; it is too buggy. This is a fact, and you can't change it.
ID: 848915 · Report as offensive
KWSN Sir Clark
Volunteer tester

Joined: 17 Aug 02
Posts: 139
Credit: 1,002,493
RAC: 8
United Kingdom
Message 848917 - Posted: 3 Jan 2009, 21:51:22 UTC - in response to Message 846910.  
Last modified: 3 Jan 2009, 21:55:13 UTC

As with Vyper, BOINC Manager showed the % rapidly counting down in chunks, but only after several hours of crunching, so I don't know what's going on.

Well, giving it another go now.

I've decided to give CUDA another go, this time without any other projects vying for CPU time, but I've got to get through two Astropulses and several WCG tasks.


Well, I've ended up with over 20 CUDA WUs all of a sudden

So far only one failure which reset the video driver.

So far so good. Going up in chunks of between 0.04% to 0.12% a second.

It's taken 12 minutes for a 14 credit WU, longer WUs seem to be taking approx 30 minutes.

Compared to the roughly 4 hours pre-CUDA I'm seeing approx 8x speed-up so far for the long WUs.

I'm not sure what the difference is compared to my initial attempt with CUDA which had slow WUs
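
A quick sanity check of the figures in this post, as a minimal sketch. Both helpers are hypothetical, and the second one assumes the progress rate stays constant for the whole WU, which the post does not claim.

```python
def speedup(before_minutes, after_minutes):
    """Ratio of old run time to new run time."""
    return before_minutes / after_minutes


def minutes_to_finish(percent_per_second):
    """Run time of a WU that advances at a constant percent-per-second rate."""
    return 100.0 / percent_per_second / 60.0


# Roughly 4 hours pre-CUDA versus about 30 minutes for a long WU.
print(f"speed-up: {speedup(4 * 60, 30):.0f}x")  # 8x

# Progress of 0.04% to 0.12% per second implies these run times.
slow = minutes_to_finish(0.04)
fast = minutes_to_finish(0.12)
print(f"implied run time: {fast:.1f} to {slow:.1f} min")  # 13.9 to 41.7 min
```

The implied range of about 14 to 42 minutes roughly brackets the 12 and 30 minute run times reported, so the progress rates and the 8x estimate are at least consistent with each other.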
ID: 848917 · Report as offensive


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.