Random Musings About the Value of CPUs vs CUDA

Author	Message
OzzFan Volunteer tester Send message Joined: 9 Apr 02 Posts: 15691 Credit: 84,761,841 RAC: 28	Message 848728 - Posted: 3 Jan 2009, 16:06:00 UTC - in response to Message 848650. About what "toys" and what achievements you talk ?? I suspect a rather oblique reference to Larrabee. Hm, it's Intel's achievement. Is this person == Intel ?? If so, well, will look benchmarks for this new CPU :) And again, even this new CPU can benefit from co-processor ;) Just a small note: Larrabee is rumored to be Intel's new high performance GPU, so it will compete with nVidia and ATi's higher end offerings. ID: 848728 ·

Raistmer Volunteer developer Volunteer tester Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 848744 - Posted: 3 Jan 2009, 16:51:31 UTC - in response to Message 848728. About what "toys" and what achievements you talk ?? I suspect a rather oblique reference to Larrabee. Hm, it's Intel's achievement. Is this person == Intel ?? If so, well, will look benchmarks for this new CPU :) And again, even this new CPU can benefit from co-processor ;) Just a small note: Larrabee is rumored to be Intel's new high performance GPU, so it will compete with nVidia and ATi's higher end offerings. From AnandTech article: " Well, it is important to keep in mind that this is first and foremost NOT a GPU. It's a CPU. A many-core CPU that is optimized for data-parallel processing. " But it should be used as replacement to current GPUs (as far as I understand from that article). But maybe even this hybrid can co-exists with nVidia GPUs in single case? ;) If yes, all that I said about GPU as co-processor remains valid for this new chip too. ID: 848744 ·

Sutaru Tsureku Volunteer tester Send message Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5	Message 848748 - Posted: 3 Jan 2009, 17:13:48 UTC - in response to Message 848370. I don't know/understand why you are so negative about CUDA? I have promised to keep quiet about that............ It's not the reality of it...it's how it came about........ 'Nough said.....' Done discussing it.......end of line. My post was not only for you.. ..it was for all at the board which don't like CUDA.. [said in kind words..] :-) ID: 848748 ·

Sutaru Tsureku Volunteer tester Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5	Message 848751 - Posted: 3 Jan 2009, 17:21:36 UTC Last modified: 3 Jan 2009, 17:24:03 UTC How would the performance compare: a PCIe 2.0 GPU in a PCIe 1.0 slot versus a PCIe 2.0 GPU in a PCIe 2.0 slot? How big would the slowdown be with a PCIe 1.0 slot? --------------------------------------- In future the SETI@home-CUDA-app will always need the CPU/Core for crunching? ..and the PCIe-slot for communication/crunching? --------------------------------------- What about the architecture? Maybe it would be better to combine an AMD CPU with an nVIDIA GPU? More performance? Thanks! ID: 848751 ·
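For a rough sense of scale on the PCIe question, the nominal per-lane rates can be turned into a transfer-time estimate. This is a minimal sketch, assuming an x16 slot and a hypothetical 1 MB payload (the thread does not state the real data size). The 250 MB/s and 500 MB/s figures are the nominal per-lane, per-direction rates for PCIe 1.x and 2.0; real-world throughput is lower.

```python
# Nominal per-lane bandwidth in MB/s, per direction, after 8b/10b encoding.
LANE_MB_PER_S = {"PCIe 1.x": 250.0, "PCIe 2.0": 500.0}


def transfer_ms(payload_mb, generation, lanes=16):
    """Idealized time in milliseconds to move payload_mb across the slot."""
    bandwidth_mb_per_s = LANE_MB_PER_S[generation] * lanes
    return payload_mb / bandwidth_mb_per_s * 1000.0


# PAYLOAD_MB is a hypothetical size chosen for illustration only.
PAYLOAD_MB = 1.0

for generation in LANE_MB_PER_S:
    print(f"{generation} x16: {transfer_ms(PAYLOAD_MB, generation):.3f} ms")
```

Under these assumptions both slots move the payload in a fraction of a millisecond, so the slot generation would only matter if the application transferred data very frequently.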

Gecko Volunteer tester Send message Joined: 17 Nov 99 Posts: 454 Credit: 6,946,910 RAC: 47	Message 848795 - Posted: 3 Jan 2009, 18:51:37 UTC - in response to Message 848634. look at the date dude ... At this time, SIMD was not so common. I never heard somebody chalenging my ASM capabilities ... Actually, there was much work w/ SIMD and ASM that occurred well before the Seti-Enhanced transition. The initial Seti-BOINC 4.x application was well optimized by the time the Enhanced transition took place in May 06'. Tetsuji Maverick Rai did significant SIMD hand coding/optimizing in 2005 as did Harold Naparst, Hans Dorn and of course, Crunch3r. On PPC front, Alex Kan basically hand-wrote vectorized code for almost the entire PPC application in 2005 to maximize VMX (Altivec) instruction usage, minimize L1 & L2 thrashing and make most efficient use of larger L2 caches in G4 & G5 PPCs. This is why a 1.25 Ghz PPC7455 G4 would produce similar RAC to a P4 running 2x the clock w/ the then-available optimized -doze aps. The Top Comp list in late 05' & early 06' was mostly PPC Macs until Core2 arrived. ID: 848795 ·

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51468 Credit: 1,018,363,574 RAC: 1,004	Message 848798 - Posted: 3 Jan 2009, 18:53:46 UTC - in response to Message 848795. look at the date dude ... At this time, SIMD was not so common. I never heard somebody chalenging my ASM capabilities ... Actually, there was much work w/ SIMD and ASM that occurred well before the Seti-Enhanced transition. The initial Seti-BOINC 4.x application was well optimized by the time the Enhanced transition took place in May 06'. Tetsuji Maverick Rai did significant SIMD hand coding/optimizing in 2005 as did Harold Naparst, Hans Dorn and of course, Crunch3r. On PPC front, Alex Kan basically hand-wrote vectorized code for almost the entire PPC application in 2005 to maximize VMX (Altivec) instruction usage, minimize L1 & L2 thrashing and make most efficient use of larger L2 caches in G4 & G5 PPCs. This is why a 1.25 Ghz PPC7455 G4 would produce similar RAC to a P4 running 2x the clock w/ the then-available optimized -doze aps. The Top Comp list in late 05' & early 06' was mostly PPC Macs until Core2 arrived. (Gives a big kitty grin).....The Core 2's are what weaned me away from my OCd Semprons.....what a relief. "Freedom is just Chaos, with better lighting." Alan Dean Foster ID: 848798 ·

Voyager Volunteer tester Joined: 2 Nov 99 Posts: 602 Credit: 3,264,813 RAC: 0	Message 848800 - Posted: 3 Jan 2009, 19:01:01 UTC - in response to Message 848599. Is it dumb not to use CUDA? I see GPUs for less money than a memory upgrade, $59, with what sounds like more improvement. Don't know where you are, but here in Australia it's not possible to get a reasonably fast CUDA capable card for any less than about $200. I'm in Puget Sound, WA. What I found was on Newegg, low-enders: 9400 $59, 9500 $69, 9600 $100. Just thinking maybe more bang for the buck than a memory upgrade, which may add ~5% to my crunching. ID: 848800 ·

Raistmer Volunteer developer Volunteer tester Send message Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 848818 - Posted: 3 Jan 2009, 19:36:36 UTC - in response to Message 848751. In future the SETI@home-CUDA-app will always need the CPU/Core for crunching? It already doesn't need full core for crunching. Look other threads. CPU load ~3-5% ID: 848818 ·

Sutaru Tsureku Volunteer tester Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5	Message 848841 - Posted: 3 Jan 2009, 20:05:12 UTC - in response to Message 848818. Last modified: 3 Jan 2009, 20:07:26 UTC In future the SETI@home-CUDA-app will always need the CPU/Core for crunching? It already doesn't need full core for crunching. Look other threads. CPU load ~3-5% Yes.. so the GPU communicates with the CPU over the PCIe slot while crunching. Maybe in the future the S@H-GPU-CUDA-app won't need to communicate with the CPU while crunching WUs? Then a 'high' slot (PCIe 2.0) would no longer be needed: only GPU crunching without support from the CPU, and only the 'upload' and 'download' of the WU over the PCIe slot to the GPU. ID: 848841 ·

Iona Send message Joined: 12 Jul 07 Posts: 790 Credit: 22,438,118 RAC: 0	Message 848865 - Posted: 3 Jan 2009, 20:28:00 UTC - in response to Message 848841. Maybe, in the future, but you'd still need some sort of external controller to do all the 'house-work' - never mind the PSU, case, OS, LAN/net connection, data storage etc...... Better to get rid of the bugs, first. Probably a mistake, for me to make comments on this subject, surrounded by lots of 'big hitters'! lol Don't take life too seriously, as you'll never come out of it alive! ID: 848865 ·

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 848871 - Posted: 3 Jan 2009, 20:31:49 UTC - in response to Message 848841. Communication is very limited; look at the previous posts in this thread. With heavy communication over PCI-E, computing speed would be too low. Data is loaded onto the GPU in big batches, then the results are retrieved. Full GPU processing could be even less effective: the GPU is very poor in places where a lot of branching takes place (the GPU needs to go in both branch directions; it can't take just one branch as a CPU does). So the GPU is best for stream computations, while the CPU is best at handling program logic. ID: 848871 ·
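The branching point above can be illustrated with a toy model. This is only a sketch: the warp size of 32 and the cycle costs are illustrative assumptions, and the function name is hypothetical. It models a group of GPU lanes that share one instruction stream, so any path needed by at least one lane is executed by the whole group with the other lanes masked off.

```python
# Toy model of SIMT execution; numbers are illustrative, not measurements.
WARP_SIZE = 32


def warp_cycles(conditions, cost_then, cost_else):
    """Cycles a warp spends on an if/else, given each lane's condition."""
    any_then = any(conditions)
    any_else = not all(conditions)
    # Every path that at least one lane takes is run by the whole warp.
    cycles = 0
    if any_then:
        cycles += cost_then
    if any_else:
        cycles += cost_else
    return cycles


uniform = [True] * WARP_SIZE                        # all lanes agree
divergent = [i % 2 == 0 for i in range(WARP_SIZE)]  # lanes disagree

print(warp_cycles(uniform, 10, 10))    # 10: only one path runs
print(warp_cycles(divergent, 10, 10))  # 20: both paths run back to back
```

A CPU core running the same code for one data element would pay for only one of the two paths, which is why branch-heavy logic tends to suit the CPU.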

Francois Piednoel Send message Joined: 14 Jun 00 Posts: 898 Credit: 5,969,361 RAC: 0	Message 848875 - Posted: 3 Jan 2009, 20:39:54 UTC - in response to Message 848644. On 2 ... do you think your GPU will do any better ... I think it will do more processing for time spent by CPU to send data to GPU. And great data access locality just helps GPU too - it can keep needed data in GPU memory and not use PCI-E heavely. Why you refuse to look at GPU not as another CPU better or worse then your but as co-processor? IT's possible to do almost whole WU inside GPU. YOu need only pass inital data array there and retrieve results from it. Look at task size - not SO big data array need to be feeded in ideal case. How many data transfers in current CUDA MB - it's question of optimisation of this app, not CUDA technology itself. And it seems there is no many PCI-E transfers in current CUDA MB too - CPU load is really low. If you look at findpulse, or AlexK version of it, it is still using data locality a lot, more than you think for sure. Most of it is cached. Same for the FFT. You need an increase in mem traffic, due to the increase of compute power. Nehalem does very well at FFT, it does them faster than the GPU, 8 by 8, your GPU without cache will struggle in findpulse, and the FFTs in parallel will not use the max bandwidth of the load ports. Again... the point is GPU can do FFT (for example) in the same time while CPU doing ANOTHER FFT. If CPU does 10 FFTs while GPU finished one FFT (it's not the case, it's just example) - well, FINE, you will do almost 11 FFT instead of just 10 FFTs. Almost - because of some CPU share neded to feed GPU. Are you claim this share so big that CPU could make 11 FFTs per same time period if it would not feed GPU ?? Addon: If it's so for high end CPUs - please, provide benchmark data. For not high end CPUs I know it's not true. CPU can't do 11 FFT per time period w/o GPU (in my example). 
today, (time taken to send through PCI express + Time doing the FFT + Time sending it back ) > (doing FFT on Core i7 ) End of story. ID: 848875 ·

Francois Piednoel Joined: 14 Jun 00 Posts: 898 Credit: 5,969,361 RAC: 0	Message 848878 - Posted: 3 Jan 2009, 20:44:37 UTC - in response to Message 848818. In future the SETI@home-CUDA-app will always need the CPU/Core for crunching? It already doesn't need full core for crunching. Look other threads. CPU load ~3-5% Hahaha ... ok, so let's see what a G92 does on a Celeron ... lol, try to get this to Top 1. NV claims that the GPU is the center of everything; that is totally inaccurate and wrong. On top of this, they even already claim victory on their website ... what a joke. It is the same for most of the CUDA claims. In French, we call this farting in the wind and saying that you caused the storm ... lol ID: 848878 ·

Francois Piednoel Joined: 14 Jun 00 Posts: 898 Credit: 5,969,361 RAC: 0	Message 848899 - Posted: 3 Jan 2009, 21:17:42 UTC - in response to Message 848644. Last modified: 3 Jan 2009, 21:19:11 UTC On 2 ... do you think your GPU will do any better ... I think it will do more processing for time spent by CPU to send data to GPU. And great data access locality just helps GPU too - it can keep needed data in GPU memory and not use PCI-E heavely. Why you refuse to look at GPU not as another CPU better or worse then your but as co-processor? IT's possible to do almost whole WU inside GPU. YOu need only pass inital data array there and retrieve results from it. Look at task size - not SO big data array need to be feeded in ideal case. How many data transfers in current CUDA MB - it's question of optimisation of this app, not CUDA technology itself. And it seems there is no many PCI-E transfers in current CUDA MB too - CPU load is really low. If you look at findpulse, or AlexK version of it, it is still using data locality a lot, more than you think for sure. Most of it is cached. Same for the FFT. You need an increase in mem traffic, due to the increase of compute power. Nehalem does very well at FFT, it does them faster than the GPU, 8 by 8, your GPU without cache will struggle in findpulse, and the FFTs in parallel will not use the max bandwidth of the load ports. Again... the point is GPU can do FFT (for example) in the same time while CPU doing ANOTHER FFT. If CPU does 10 FFTs while GPU finished one FFT (it's not the case, it's just example) - well, FINE, you will do almost 11 FFT instead of just 10 FFTs. Almost - because of some CPU share neded to feed GPU. Are you claim this share so big that CPU could make 11 FFTs per same time period if it would not feed GPU ?? The story has changed dramatically, hasn't it: from "the GPU is going to be 4x faster than Phenom" to "the GPU could do some FFTs" ...
If your GPU does some FFTs, you are farting in the wind, with jeans on (the PCI Express bus). And for the moment, you are the one who needs to provide data showing that a GPU can accelerate a Core i7 by more than 1% ... because that would be a very expensive 1%. I am not even talking about the dual Nehalem that is coming ... your GPU will be a drop of water in the sea. lol ID: 848899 ·

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 848903 - Posted: 3 Jan 2009, 21:20:43 UTC - in response to Message 848875. Last modified: 3 Jan 2009, 21:31:19 UTC today, (time taken to send through PCI express + Time doing the FFT + Time sending it back ) > (doing FFT on Core i7 ) End of story. No, only the beginning of the story. You just don't realize (are you sure you know about parallel computations? ;) ) that the CPU is NOT SITTING IDLE while the GPU does the FFT. So you posted the WRONG expression. You need to compare (time taken to send through PCI express + time sending it back) ? (doing FFT on Core i7). You should not include the time for the GPU FFT here. Moreover, PCI-E transfers can be asynchronous with respect to the CPU, so the CPU does not have to wait for PCI-E in that case either. So what? ID: 848903 ·
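The co-processor argument can be written as a small throughput model. This is a sketch under stated assumptions, not a measurement: all timings are made-up units chosen to match the "10 FFTs on the CPU while the GPU finishes one" example quoted earlier in the thread, and the function name is hypothetical. The CPU loses time only while it is blocked on transfers; with asynchronous transfers it loses none.

```python
def ffts_in_window(window, t_cpu_fft, t_gpu_fft, t_upload, t_download,
                   async_transfers=False):
    """Return (cpu_ffts, gpu_ffts) completed in `window` time units
    when the CPU and GPU work concurrently."""
    # One GPU round trip: upload, compute, download.
    gpu_round = t_upload + t_gpu_fft + t_download
    gpu_ffts = window / gpu_round
    # The CPU is blocked only during synchronous transfers,
    # never while the GPU is computing.
    blocked_per_round = 0.0 if async_transfers else t_upload + t_download
    cpu_ffts = (window - gpu_ffts * blocked_per_round) / t_cpu_fft
    return cpu_ffts, gpu_ffts


# Illustrative numbers: the CPU alone would finish 10 FFTs in this window.
args = dict(window=10.0, t_cpu_fft=1.0, t_gpu_fft=9.5,
            t_upload=0.25, t_download=0.25)

cpu, gpu = ffts_in_window(**args)
print(cpu + gpu)  # 10.5: "almost 11" with blocking transfers

cpu, gpu = ffts_in_window(**args, async_transfers=True)
print(cpu + gpu)  # 11.0 with asynchronous transfers
```

In this model the GPU only stops paying off when the CPU time lost to feeding it exceeds the work the GPU returns, which is the comparison Raistmer asks for.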

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 848905 - Posted: 3 Jan 2009, 21:25:54 UTC - in response to Message 848899. If your GPU does some FFTs, you are farting in the wind, with jeans on (the PCI Express bus). And for the moment, you are the one who needs to provide data showing that a GPU can accelerate a Core i7 by more than 1% ... because that would be a very expensive 1%. I am not even talking about the dual Nehalem that is coming ... your GPU will be a drop of water in the sea. lol OMG.... Any NUMBER please ? These are only your claims, no more. I see that my 9600GSO does a short task in <7 min while a Q9450 at 2.66 GHz takes ~11-12 min for the same task. So, more than 1 additional core! If you want more precise numbers, I will post all the benchmarks (already posted) in this thread too. I'm really tired of your unproven claims. NUMBERS ???? ID: 848905 ·
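The "more than 1 additional core" claim follows directly from the two timings in the post. A minimal sketch of the arithmetic, assuming the 11-12 minute figure is for a single Q9450 core and taking 7 minutes as the GPU time (the post says "<7", so the result is a lower bound); the function name is hypothetical.

```python
def core_equivalents(cpu_minutes_per_task, gpu_minutes_per_task):
    """How many CPU cores' worth of throughput the GPU delivers."""
    return cpu_minutes_per_task / gpu_minutes_per_task


GPU_MINUTES = 7.0  # upper bound from the post ("<7 mins")

for cpu_minutes in (11.0, 12.0):
    print(round(core_equivalents(cpu_minutes, GPU_MINUTES), 2))
# prints 1.57, then 1.71
```

So on these figures the card is worth roughly 1.6 to 1.7 extra cores on short tasks, before subtracting the few percent of CPU time used to feed it.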

Josef W. Segur Volunteer developer Volunteer tester Send message Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0	Message 848907 - Posted: 3 Jan 2009, 21:32:52 UTC - in response to Message 848875. today, (time taken to send through PCI express + Time doing the FFT + Time sending it back ) > (doing FFT on Core i7 ) True. (at least for FFTs shorter than about 64K) End of story. False. The task is not to return FFT data, it is to FFT and process the data, then return extracted meta information. 6.06 may not yet do as much as it should on the GPU, but the small amount of CPU time needed indicates it's doing fairly well in that regard. If it were able to do all setiathome_enhanced WUs without crashing or finding false signals it would be a worthy addition to our crunching capabilities. Joe ID: 848907 ·

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121	Message 848910 - Posted: 3 Jan 2009, 21:41:09 UTC - in response to Message 848907. today, (time taken to send through PCI express + Time doing the FFT + Time sending it back ) > (doing FFT on Core i7 ) True. (at least for FFTs shorter than about 64K) Joe Joe, why should we compare these numbers? We might as well compare the time to process a full task on the GPU with one FFT on the CPU ... What sense is there in this comparison at all? ID: 848910 ·

Francois Piednoel Joined: 14 Jun 00 Posts: 898 Credit: 5,969,361 RAC: 0	Message 848915 - Posted: 3 Jan 2009, 21:49:30 UTC - in response to Message 848905. OMG.... Any NUMBER please ? You are the one claiming a performance improvement; where are your numbers????? How can we verify them???? From now on, I know you are a fanboy. I will keep adjusting your claims. Right now, the public code accelerates nothing; it is too buggy. This is a fact, you can't change it. ID: 848915 ·

KWSN Sir Clark Volunteer tester Joined: 17 Aug 02 Posts: 139 Credit: 1,002,493 RAC: 8	Message 848917 - Posted: 3 Jan 2009, 21:51:22 UTC - in response to Message 846910. Last modified: 3 Jan 2009, 21:55:13 UTC Like Vyper, Boinc Manager showed the % rapidly counting down in chunks, but only after several hours of crunching, so I don't know what's going on. Well, giving it another go now. I've decided to give CUDA another go, this time without any other projects vying for CPU time, but I've got to get through two Astropulses and several WCG. Well, I've ended up with over 20 CUDA WUs all of a sudden. So far only one failure, which reset the video driver. So far so good. Going up in chunks of between 0.04% and 0.12% a second. It's taken 12 minutes for a 14 credit WU; longer WUs seem to be taking approx 30 minutes. Compared to the roughly 4 hours pre-CUDA, I'm seeing approx 8x speed-up so far for the long WUs. I'm not sure what the difference is compared to my initial attempt with CUDA, which had slow WUs. ID: 848917 ·
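The figures in this report can be cross-checked with a little arithmetic. A minimal sketch using only the numbers quoted in the post (about 4 hours before CUDA, about 30 minutes with it, and progress of 0.04% to 0.12% per second); the function names are hypothetical, and steady progress over the whole task is an assumption.

```python
def speedup(old_minutes, new_minutes):
    """Ratio of old run time to new run time."""
    return old_minutes / new_minutes


def minutes_to_finish(percent_per_second):
    """Run time implied by a steady progress rate, in minutes."""
    return 100.0 / percent_per_second / 60.0


print(speedup(4 * 60, 30))                  # 8.0
print(round(minutes_to_finish(0.12), 1))    # fastest observed rate
print(round(minutes_to_finish(0.04), 1))    # slowest observed rate
```

The 8x figure matches the post, and the progress rates imply run times of roughly 14 to 42 minutes, which is in the same range as the reported 12 and 30 minute tasks.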
