Random Musings About the Value of CPUs vs CUDA

Raistmer, Volunteer developer, Volunteer tester. Joined: 16 Jun 01. Posts: 6325. Credit: 106,370,077. RAC: 121.
Message 848622 - Posted: 3 Jan 2009, 10:41:18 UTC - in response to Message 848613.

Quote:
  http://setiathome.berkeley.edu/top_hosts.php
  3666921 | tasks 1 awit 11,210.50 1,370,822 GenuineIntel Intel(R) Xeon(R) CPU X5482 @ 3.20GHz [x86 Family 6 Model 23 Stepping 6] Darwin 9.5.0
  What about your claims? Show me an NV setup in the top 20?

OMG... well, I'll talk in your language: bring me one of these hosts, I'll insert my poor 9600GSO there, and then I'll show you that this top host + my GSO will outperform this top host w/o a GPU. You just demonstrated that none of the current top 100 (or fewer) is using a GPU. And so what?? Just wait, and one of these top hosts will use a GPU and become faster than it was before.

Francois Piednoel. Joined: 14 Jun 00. Posts: 898. Credit: 5,969,361. RAC: 0.
Message 848624 - Posted: 3 Jan 2009, 10:43:11 UTC - in response to Message 848622. Last modified: 3 Jan 2009, 10:44:15 UTC

Quote:
  http://setiathome.berkeley.edu/top_hosts.php
  3666921 | tasks 1 awit 11,210.50 1,370,822 GenuineIntel Intel(R) Xeon(R) CPU X5482 @ 3.20GHz [x86 Family 6 Model 23 Stepping 6] Darwin 9.5.0
  What about your claims? Show me an NV setup in the top 20?
  OMG... well, I'll talk in your language: bring me one of these hosts, I'll insert my poor 9600GSO there, and then I'll show you that this top host + my GSO will outperform this top host w/o a GPU. You just demonstrated that none of the current top 100 (or fewer) is using a GPU. And so what?? Just wait, and one of these top hosts will use a GPU and become faster than it was before.

In your dreams. The PCI Express bus will slow you down enough that it will never get there on a very fast CPU, sorry!!!! You are dreaming. SETI is not like Folding@home. And I forgot: power-wise, it is an NV disaster! lol

Raistmer, Volunteer developer, Volunteer tester. Joined: 16 Jun 01. Posts: 6325. Credit: 106,370,077. RAC: 121.
Message 848627 - Posted: 3 Jan 2009, 10:47:11 UTC - in response to Message 848617.

Quote:
  Really? You did not see the code I posted here? Are you blind?
  No, I'm not blind, and yes, I didn't see the code you posted here. I also didn't see any applications built on this code. Links please?
  HAHAHHHAHAHA http://setiathome.berkeley.edu/forum_thread.php?id=38030&nowrap=true#525611 ... see. code ... haaaaaaaaaaa I did post code too, the one with the red tongue in the alien face :-P I guess you did not pay attention one more time. LoL :)

And what does this code do? What function in MB or AP can it speed up? It's just one single SIMD loop, not an app, not a function. Are you joking when you say you posted code?? The current AK8 and even the opt AP already have many such SIMD loops. What do you want to demonstrate with your own one?

Raistmer, Volunteer developer, Volunteer tester. Joined: 16 Jun 01. Posts: 6325. Credit: 106,370,077. RAC: 121.
Message 848633 - Posted: 3 Jan 2009, 10:50:35 UTC - in response to Message 848624. Last modified: 3 Jan 2009, 10:53:21 UTC

Quote:
  In your dreams. The PCI Express bus will slow you down enough that it will never get there on a very fast CPU, sorry!!!! You are dreaming.

Well, computing over the PCI-E bus is just nonsense. Data is fed to the GPU, then the GPU processes the data (in its own memory space, no PCI-E bus communication involved), then the results go back. As long as data feeding/retrieving makes up a small fraction of the total data processing time, this solution will free the main CPU for additional work.

Francois Piednoel. Joined: 14 Jun 00. Posts: 898. Credit: 5,969,361. RAC: 0.
Message 848634 - Posted: 3 Jan 2009, 10:51:17 UTC - in response to Message 848627.

Quote:
  Really? You did not see the code I posted here? Are you blind?
  No, I'm not blind, and yes, I didn't see the code you posted here. I also didn't see any applications built on this code. Links please?
  HAHAHHHAHAHA http://setiathome.berkeley.edu/forum_thread.php?id=38030&nowrap=true#525611 ... see. code ... haaaaaaaaaaa I did post code too, the one with the red tongue in the alien face :-P I guess you did not pay attention one more time. LoL :)
  And what does this code do? What function in MB or AP can it speed up? It's just one single SIMD loop, not an app, not a function. Are you joking when you say you posted code?? The current AK8 and even the opt AP already have many such SIMD loops. What do you want to demonstrate with your own one?

Look at the date, dude... At that time, SIMD was not so common. I have never heard anybody challenging my ASM capabilities... hahaha, this is the best. I think you need to start doing your homework. lol. I designed many of the instructions you use every day; want to teach me how to use them? lol. Discussion is over... use Google, dude!

Francois Piednoel. Joined: 14 Jun 00. Posts: 898. Credit: 5,969,361. RAC: 0.
Message 848637 - Posted: 3 Jan 2009, 10:59:47 UTC - in response to Message 848633.

Quote:
  In your dreams. The PCI Express bus will slow you down enough that it will never get there on a very fast CPU, sorry!!!! You are dreaming.
  Well, computing over the PCI-E bus is just nonsense. Data is fed to the GPU, then the GPU processes the data (in its own memory space, no PCI-E bus communication involved), then the results go back. As long as data feeding/retrieving makes up a small fraction of the total data processing time, this solution will free the main CPU for additional work.

I SEE, YOU ARE A MAGICIAN... YOU SEND DATA FROM CPU TO GPU, AND IT DOES NOT TAKE PCI EXPRESS TIME... INTERESTING VOODOO, YOU'VE GOT TO GIVE US THE "RECIPE" LOL. LOOK AT THE CODE OF SETI FOR CUDA, VTUNE IT, AND STOP MAKING INACCURATE STATEMENTS.

Raistmer, Volunteer developer, Volunteer tester. Joined: 16 Jun 01. Posts: 6325. Credit: 106,370,077. RAC: 121.
Message 848638 - Posted: 3 Jan 2009, 11:02:13 UTC - in response to Message 848634. Last modified: 3 Jan 2009, 11:06:52 UTC

Quote:
  Look at the date, dude... At that time, SIMD was not so common. I have never heard anybody challenging my ASM capabilities... hahaha, this is the best. I think you need to start doing your homework. lol. I designed many of the instructions you use every day; want to teach me how to use them? lol. Discussion is over... use Google, dude!

1) Yes, I looked at the date; it's about the time when Lunatics did the AK8 port. And the highly SIMDified KWSN V2.4 was surely online well before your post anyway :P So what did you demonstrate with that single loop? What part of the code??
2) Well, I have been reading your posts for a pretty long time already, and yes, your ASM capabilities are doubtful to me, sorry. I have a habit of trusting benchmarks, not loud words. When I build an app with your code in it, benchmark it against the code that did the same task before, and see some speedup, then I will trust the words more. So far there is nothing from you to benchmark ;)
3) I don't want to teach you assembler, or even the intrinsics that you used in your post (that's not assembler, for your info ;) ). Your claims are just not founded on any checkable basis. You claim you develop new asm instructions. Fine, great claim. And so? How does this claim apply to the current topic?

Raistmer, Volunteer developer, Volunteer tester. Joined: 16 Jun 01. Posts: 6325. Credit: 106,370,077. RAC: 121.
Message 848639 - Posted: 3 Jan 2009, 11:05:01 UTC - in response to Message 848637.

Quote:
  I SEE, YOU ARE A MAGICIAN... YOU SEND DATA FROM CPU TO GPU, AND IT DOES NOT TAKE PCI EXPRESS TIME...

Sure it takes time. Moreover, it takes CPU time too. That is what this statement was about: "As long as data feeding/retrieving makes up a small fraction of the total data processing time, this solution will free the main CPU for additional work." And yes, this feeding/retrieving takes a small fraction, ~3-4% of Q9450 time.
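The trade-off in this post can be written down as simple arithmetic. The sketch below is only an illustration: the function name `net_gain` and the idea of measuring GPU output in "CPU-core equivalents" are assumptions made here, and the only figure taken from the thread is the ~3-4% feeding share.

```python
def net_gain(feed_fraction: float, gpu_equiv: float) -> float:
    """Net change in throughput, in units of one CPU core's output.

    feed_fraction: share of one core's time spent feeding the GPU (0..1).
    gpu_equiv:     GPU output as a multiple of one core's output.
                   Hypothetical value; it has to be measured per host.
    """
    if not 0.0 <= feed_fraction <= 1.0:
        raise ValueError("feed_fraction must be between 0 and 1")
    # The GPU adds gpu_equiv cores' worth of work; feeding it costs feed_fraction.
    return gpu_equiv - feed_fraction


# With the 3-4% share quoted in the post, the GPU pays off as soon as
# it delivers more than 3-4% of what one core delivers.
for share in (0.03, 0.04):
    print(f"feed share {share:.0%}: break-even GPU output = {share:.0%} of one core")
```

Under this model the dispute reduces to two measurable numbers per host, which is why both sides keep asking for benchmarks.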

Francois Piednoel. Joined: 14 Jun 00. Posts: 898. Credit: 5,969,361. RAC: 0.
Message 848640 - Posted: 3 Jan 2009, 11:06:10 UTC

Again, the claims NV made were on a Phenom running non-optimized code... If you use a Core i7 with optimized code, you punish the G92 big time, and if you use a Skulltrail, the punishment is even bigger... and the Mac... lol! The top! On highly performing processors, the time to send through PCI Express is heavy compared to the fast DDR3 RAM for Nehalem. Sorry, it is all dreamland... I'll believe it when I see it at TOP 1. VTune tells me it is not going to work: SETI uses the same pulse and compare too many times ( FindPulse() ).

Raistmer, Volunteer developer, Volunteer tester. Joined: 16 Jun 01. Posts: 6325. Credit: 106,370,077. RAC: 121.
Message 848641 - Posted: 3 Jan 2009, 11:11:28 UTC - in response to Message 848640. Last modified: 3 Jan 2009, 11:12:49 UTC

Quote:
  Again, the claims NV made were on a Phenom running non-optimized code... If you use a Core i7 with optimized code, you punish the G92 big time, and if you use a Skulltrail, the punishment is even bigger... and the Mac... lol! The top! On highly performing processors, the time to send through PCI Express is heavy compared to the fast DDR3 RAM for Nehalem. Sorry, it is all dreamland... I'll believe it when I see it at TOP 1. VTune tells me it is not going to work: SETI uses the same pulse and compare too many times ( FindPulse() ).

1) I know nothing about NVidia's claims about some Phenom; I have no Phenom available. I have a Yorkfield Q9450 and a GeForce 9600GSO, and I do benchmarks on this host.
2) PCI-E will be used only at the beginning of computations and after computations; as I already said, it's nonsense to do computations over the PCI-E bus. But memory access is used constantly during processing, because current SETI datasets don't fit in the L1 and L2 caches (the situation will be the same with L3 cache too: too many cores).

Francois Piednoel. Joined: 14 Jun 00. Posts: 898. Credit: 5,969,361. RAC: 0.
Message 848643 - Posted: 3 Jan 2009, 11:18:07 UTC - in response to Message 848641.

Quote:
  Again, the claims NV made were on a Phenom running non-optimized code... If you use a Core i7 with optimized code, you punish the G92 big time, and if you use a Skulltrail, the punishment is even bigger... and the Mac... lol! The top! On highly performing processors, the time to send through PCI Express is heavy compared to the fast DDR3 RAM for Nehalem. Sorry, it is all dreamland... I'll believe it when I see it at TOP 1. VTune tells me it is not going to work: SETI uses the same pulse and compare too many times ( FindPulse() ).
  1) I know nothing about NVidia's claims about some Phenom; I have no Phenom available. I have a Yorkfield Q9450 and a GeForce 9600GSO, and I do benchmarks on this host.
  2) PCI-E will be used only at the beginning of computations and after computations; as I already said, it's nonsense to do computations over the PCI-E bus. But memory access is used constantly during processing, because current SETI datasets don't fit in the L1 and L2 caches (the situation will be the same with L3 cache too: too many cores).

On 2... do you think your GPU will do any better? If you look at findpulse, or AlexK's version of it, it still uses data locality a lot, more than you think for sure. Most of it is cached. Same for the FFT. You need an increase in memory traffic, due to the increase in compute power. Nehalem does very well at FFT; it does them faster than the GPU, 8 by 8. Your GPU without cache will struggle in findpulse, and the FFTs in parallel will not use the max bandwidth of the load ports. I did my homework, I know what I have to do :) The rest is in your imagination. Good luck with this; I am done for the day.

Raistmer, Volunteer developer, Volunteer tester. Joined: 16 Jun 01. Posts: 6325. Credit: 106,370,077. RAC: 121.
Message 848644 - Posted: 3 Jan 2009, 11:29:25 UTC - in response to Message 848643. Last modified: 3 Jan 2009, 11:33:08 UTC

Quote:
  On 2... do you think your GPU will do any better?

I think it will do more processing for the time the CPU spends sending data to the GPU. And great data access locality helps the GPU too: it can keep the needed data in GPU memory and not use PCI-E heavily. Why do you refuse to look at the GPU not as another CPU, better or worse than yours, but as a co-processor? It's possible to do almost a whole WU inside the GPU. You only need to pass the initial data array there and retrieve the results from it. Look at the task size: not SO big a data array needs to be fed in the ideal case. How many data transfers there are in the current CUDA MB is a question of optimisation of this app, not of CUDA technology itself. And it seems there are not many PCI-E transfers in the current CUDA MB either: CPU load is really low.

Quote:
  If you look at findpulse, or AlexK's version of it, it still uses data locality a lot, more than you think for sure. Most of it is cached. Same for the FFT. You need an increase in memory traffic, due to the increase in compute power. Nehalem does very well at FFT; it does them faster than the GPU, 8 by 8. Your GPU without cache will struggle in findpulse, and the FFTs in parallel will not use the max bandwidth of the load ports.

Again... the point is that the GPU can do an FFT (for example) at the same time as the CPU is doing ANOTHER FFT. If the CPU does 10 FFTs while the GPU finishes one FFT (that's not the case, it's just an example), well, FINE, you will do almost 11 FFTs instead of just 10. Almost, because some CPU share is needed to feed the GPU. Are you claiming this share is so big that the CPU could do 11 FFTs in the same time period if it did not feed the GPU??

Addon: If that's so for high-end CPUs, please provide benchmark data. For CPUs that are not high-end, I know it's not true. The CPU can't do 11 FFTs per time period w/o the GPU (in my example).
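The "almost 11 FFTs" example can be checked with a couple of lines of arithmetic. This is a minimal sketch of the post's own illustrative numbers (10 CPU FFTs, 1 GPU FFT per period); the function names and the linear cost model for feeding the GPU are assumptions made here, not anything measured.

```python
def combined_ffts(cpu_ffts: float, gpu_ffts: float, feed_share: float) -> float:
    """FFTs finished per period when the CPU also has to feed the GPU.

    feed_share is the fraction of CPU time lost to feeding (assumed to
    reduce CPU output linearly, which is a simplification).
    """
    return cpu_ffts * (1.0 - feed_share) + gpu_ffts


def break_even_share(cpu_ffts: float, gpu_ffts: float) -> float:
    """Feeding share at which adding the GPU neither helps nor hurts."""
    return gpu_ffts / cpu_ffts


# Post's example: CPU alone does 10 FFTs per period, GPU does 1.
print(combined_ffts(10, 1, 0.04))   # feeding share of 4%, as quoted earlier in the thread
print(break_even_share(10, 1))      # share above which the GPU would be a net loss
```

In this example the GPU only becomes a net loss if feeding it costs more than a tenth of the CPU's time, which is well above the 3-4% figure reported earlier in the thread.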

Grant (SSSF), Volunteer tester. Joined: 19 Aug 99. Posts: 13797. Credit: 208,696,464. RAC: 304.
Message 848645 - Posted: 3 Jan 2009, 11:33:44 UTC - in response to Message 848609.

Quote:
  What "toys" and what achievements are you talking about??

I suspect a rather oblique reference to Larrabee.

As to the cache references, I've no idea. The memory bandwidth on any mid-range to high-end video card is much more than that of a CPU, particularly the latest models.

And as for the PCIe references? I'm guessing the work is processed by the CPU to make it suitable for the GPU, then it goes through the PCIe bus to the GPU, and the GPU processes it & returns the result. Not much data would be sent to the GPU, and bugger all would come back in the way of a result.

And when you consider the speed of what is the 1st generation of the GPU application, compared to what would be the 6th (or more) of the CPU application, it shows just how fast the GPU is, and how much faster it can be with time & application development.

Grant
Darwin NT

Raistmer, Volunteer developer, Volunteer tester. Joined: 16 Jun 01. Posts: 6325. Credit: 106,370,077. RAC: 121.
Message 848650 - Posted: 3 Jan 2009, 11:44:03 UTC - in response to Message 848645. Last modified: 3 Jan 2009, 11:46:25 UTC

Quote:
  What "toys" and what achievements are you talking about??
  I suspect a rather oblique reference to Larrabee.

Hm, that's Intel's achievement. Is this person == Intel?? If so, well, we will look at the benchmarks for this new CPU :) And again, even this new CPU can benefit from a co-processor ;)

Quote:
  The memory bandwidth on any mid-range to high-end video card is much more than that of a CPU, particularly the latest models.

With a highly optimized app you should compare with L2 bandwidth, not with memory (if data locality is so high that almost all processing is performed inside the L2 cache w/o main memory accesses).

Quote:
  And as for the PCIe references?

The PCI-E bus is surely slower than the memory bus, but it is used much, much, much less often. It's roughly like comparing HDD speed with memory speed: sure, the HDD is much, much, much slower, but it's used only for checkpointing!

Quote:
  I'm guessing the work is processed by the CPU to make it suitable for the GPU, then it goes through the PCIe bus to the GPU, and the GPU processes it & returns the result. Not much data would be sent to the GPU, and bugger all would come back in the way of a result.

Yes. And the less CPU pre-processing is needed, the more viable the CUDA solution will be.

Quote:
  And when you consider the speed of what is the 1st generation of the GPU application, compared to what would be the 6th (or more) of the CPU application, it shows just how fast the GPU is, and how much faster it can be with time & application development.

Sure.
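The "slower bus, but used far less often" argument is about total time on each link, not raw speed. The sketch below illustrates that with made-up numbers: every bandwidth and traffic figure in it is hypothetical and chosen only to show the shape of the comparison, and `link_time` is a name introduced here.

```python
def link_time(bytes_moved: float, bandwidth_bytes_per_s: float) -> float:
    """Seconds spent moving bytes_moved over a link of the given bandwidth."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return bytes_moved / bandwidth_bytes_per_s


# Hypothetical figures, NOT measurements of any real host:
fast_link = link_time(bytes_moved=1e12, bandwidth_bytes_per_s=10e9)  # busy, fast bus
slow_link = link_time(bytes_moved=1e6, bandwidth_bytes_per_s=1e9)    # rarely used, slow bus

print(f"fast link, heavy traffic: {fast_link:.3f} s")
print(f"slow link, light traffic: {slow_link:.3f} s")
```

A link that is ten times slower but carries a millionth of the traffic still accounts for far less time, which is the same reasoning as the HDD checkpointing analogy in the post.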

gomeyer, Volunteer tester. Joined: 21 May 99. Posts: 488. Credit: 50,370,425. RAC: 0.
Message 848655 - Posted: 3 Jan 2009, 12:10:45 UTC

Now y'all did it: made me remove my filter on Who? just to see what the brouhaha was all about. . .

yawn

The filter is back on.

Ehran. Joined: 21 Dec 03. Posts: 4. Credit: 894,870. RAC: 0.
Message 848660 - Posted: 3 Jan 2009, 12:26:47 UTC - in response to Message 848655.

I just got the new client for BOINC and updated the drivers to use CUDA. What I'm seeing is that units run through considerably faster than before. My problem is that it seems to corrupt about 2/3 of the work units instead of doing them properly. It also seems to cause my video drivers to fail and recover every time it pooches a work unit, which does nothing good for my temper.

kittyman, Volunteer tester. Joined: 9 Jul 00. Posts: 51470. Credit: 1,018,363,574. RAC: 1,004.
Message 848661 - Posted: 3 Jan 2009, 12:28:25 UTC - in response to Message 848660.

Quote:
  I just got the new client for BOINC and updated the drivers to use CUDA. What I'm seeing is that units run through considerably faster than before. My problem is that it seems to corrupt about 2/3 of the work units instead of doing them properly. It also seems to cause my video drivers to fail and recover every time it pooches a work unit, which does nothing good for my temper.

LOL...ya mean I am not the only cruncher here with a temper???

"Time is simply the mechanism that keeps everything from happening all at once."

Raistmer, Volunteer developer, Volunteer tester. Joined: 16 Jun 01. Posts: 6325. Credit: 106,370,077. RAC: 121.
Message 848665 - Posted: 3 Jan 2009, 12:38:17 UTC - in response to Message 848660. Last modified: 3 Jan 2009, 12:39:14 UTC

Quote:
  I just got the new client for BOINC and updated the drivers to use CUDA. What I'm seeing is that units run through considerably faster than before. My problem is that it seems to corrupt about 2/3 of the work units instead of doing them properly. It also seems to cause my video drivers to fail and recover every time it pooches a work unit, which does nothing good for my temper.

Please look at the other threads about the current CUDA errors (maybe even better, the threads on the beta site; there is too much noise here). There are scripts (2 of them already) and a modified app build that could help you diminish the effects of the bugs in the current app version (we all hope for a more stable one soon).

Summary of current bugs: driver crashing/freezing/overflows on VLARs (tasks with AR < 0.1); crashing/overflows on VLARs with AR ~0.13; overflows on some VHARs (AR > ~2.7).

I recommend aborting all VLARs with AR < 0.1 and keeping an eye on, or also aborting, all other tasks from the "risk group". That way you will get much more stable and productive work with the current CUDA MB versions. I hope the new version will be easier to use :)
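The triage rule in this post can be expressed as a small classifier over a task's angle range (AR). This is only a sketch of the rule as stated: the thresholds 0.1, ~0.13 and ~2.7 come from the post, while the function name `cuda_risk`, the labels, and the tolerance `tol` around 0.13 are hypothetical choices made here. It is not one of the scripts mentioned in the post.

```python
def cuda_risk(ar: float, tol: float = 0.01) -> str:
    """Classify a task by angle range, following the bug summary above.

    tol is an assumed width for the "AR ~0.13" group; the post gives
    no exact bounds, so adjust it to taste.
    """
    if ar < 0.1:
        return "abort"   # VLAR: driver crash/freeze/overflow reported
    if abs(ar - 0.13) <= tol:
        return "watch"   # crashes/overflows reported near AR ~0.13
    if ar > 2.7:
        return "watch"   # some VHARs reported to overflow
    return "ok"          # not in any reported risk group


for ar in (0.05, 0.13, 0.42, 3.0):
    print(ar, cuda_risk(ar))
```

An "ok" result only means the task is outside the groups listed in the post; it is not a guarantee that the task will validate.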

-= Vyper =-, Volunteer tester. Joined: 5 Sep 99. Posts: 1652. Credit: 1,065,191,981. RAC: 2,537.
Message 848704 - Posted: 3 Jan 2009, 14:23:56 UTC - in response to Message 848660.

Quote:
  I just got the new client for BOINC and updated the drivers to use CUDA. What I'm seeing is that units run through considerably faster than before. My problem is that it seems to corrupt about 2/3 of the work units instead of doing them properly. It also seems to cause my video drivers to fail and recover every time it pooches a work unit, which does nothing good for my temper.

Remember that this CUDA .exe should be considered beta even though it is on main. Bugs will be ironed out later, I assure you.

It's a pity that this thread has become a pissing contest over something that I would still call a "beta" phase. Only time will tell if CUDA is here to stay or if it gets replaced by other hardware.

I urge everyone to tone it down a bit. To avoid the frustration, there is always the option to disable CUDA in the s@h preferences until these "beta" bugs are ironed out.

My 2 cents only..

Kind regards to all
Vyper
_________________________________________________________________________
Addicted to SETI crunching! Founder of GPU Users Group

kittyman, Volunteer tester. Joined: 9 Jul 00. Posts: 51470. Credit: 1,018,363,574. RAC: 1,004.
Message 848707 - Posted: 3 Jan 2009, 14:28:53 UTC - in response to Message 848704.

Quote:
  I just got the new client for BOINC and updated the drivers to use CUDA. What I'm seeing is that units run through considerably faster than before. My problem is that it seems to corrupt about 2/3 of the work units instead of doing them properly. It also seems to cause my video drivers to fail and recover every time it pooches a work unit, which does nothing good for my temper.
  Remember that this CUDA .exe should be considered beta even though it is on main. Bugs will be ironed out later, I assure you. It's a pity that this thread has become a pissing contest over something that I would still call a "beta" phase. Only time will tell if CUDA is here to stay or if it gets replaced by other hardware. I urge everyone to tone it down a bit. To avoid the frustration, there is always the option to disable CUDA in the s@h preferences until these "beta" bugs are ironed out. My 2 cents only.. Kind regards to all Vyper

Well, you just reaffirmed my contention that this should have been left in Beta testing until it was truly ready.......
Oh well.......a small step for Seti.......a little step back for the whole project.......
Another step forward when it gets ironed out.....

"Time is simply the mechanism that keeps everything from happening all at once."

