Random Musings About the Value of CPUs vs CUDA

Francois Piednoel Joined: 14 Jun 00 Posts: 898 Credit: 5,969,361 RAC: 0
Message 848919 - Posted: 3 Jan 2009, 21:53:29 UTC - in response to Message 848907.
today, (time taken to send through PCI Express + time doing the FFT + time sending it back) > (doing the FFT on Core i7)
True. (at least for FFTs shorter than about 64K)
End of story.
False. The task is not to return FFT data, it is to FFT and process the data, then return extracted meta information. 6.06 may not yet do as much as it should on the GPU, but the small amount of CPU time needed indicates it's doing fairly well in that regard. If it were able to do all setiathome_enhanced WUs without crashing or finding false signals it would be a worthy addition to our crunching capabilities. Joe
Well, after the FFT you process findpulse(), if I am right ... and the data is in the cache, usually in the L2, and for sure in the L3 with Core i7. There is no time to do extra FFTs, and that is why it will not help. It may help on a very low-end CPU, but then findpulse will be slow too (low-end CPUs usually have less cache, and will cache miss). Why do you think NV took a Phenom to compare to? They knew exactly that they could not accelerate Core i7 who?
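The disagreement in this post comes down to a simple timing inequality, which can be sketched as below. This is only a rough model: every number in it (buffer size, bus throughput, FFT and search times) is a hypothetical placeholder chosen for illustration, not a measurement from this thread, and the function names are invented for the sketch. It contrasts the "FFT only" round trip with the case Joe describes, where the data stays on the GPU and only small results come back.

```python
def transfer_s(n_bytes, bus_bytes_per_s):
    """Time to move n_bytes across the bus in one direction."""
    return n_bytes / bus_bytes_per_s


def fft_only_offload_s(n_bytes, bus_bytes_per_s, gpu_fft_s):
    """Send the data, do the FFT on the GPU, send the full result back."""
    return 2 * transfer_s(n_bytes, bus_bytes_per_s) + gpu_fft_s


def full_pipeline_offload_s(in_bytes, result_bytes, bus_bytes_per_s, gpu_work_s):
    """Send the data once, do FFT and signal search on the GPU,
    and return only the small extracted results."""
    return (transfer_s(in_bytes, bus_bytes_per_s)
            + gpu_work_s
            + transfer_s(result_bytes, bus_bytes_per_s))


# All values below are hypothetical placeholders, not benchmark data.
FFT_BYTES = 65536 * 8        # 64K complex single-precision points
BUS_BYTES_PER_S = 4e9        # assumed usable bus throughput
GPU_FFT_S = 50e-6            # assumed GPU FFT time
GPU_SEARCH_S = 60e-6         # assumed GPU signal-search time
CPU_FFT_S = 200e-6           # assumed CPU FFT time
CPU_SEARCH_S = 150e-6        # assumed CPU signal-search time
RESULT_BYTES = 1024          # assumed size of the extracted results

fft_only = fft_only_offload_s(FFT_BYTES, BUS_BYTES_PER_S, GPU_FFT_S)
full_gpu = full_pipeline_offload_s(FFT_BYTES, RESULT_BYTES, BUS_BYTES_PER_S,
                                   GPU_FFT_S + GPU_SEARCH_S)
cpu_total = CPU_FFT_S + CPU_SEARCH_S

print(f"FFT-only offload: {fft_only * 1e6:.1f} us vs CPU FFT {CPU_FFT_S * 1e6:.1f} us")
print(f"Full GPU pipeline: {full_gpu * 1e6:.1f} us vs CPU total {cpu_total * 1e6:.1f} us")
```

With these made-up numbers the FFT-only round trip loses to the CPU while the full pipeline wins, so both posters can be right about their own case; which side of the inequality a real system lands on has to be measured.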

Francois Piednoel Joined: 14 Jun 00 Posts: 898 Credit: 5,969,361 RAC: 0
Message 848920 - Posted: 3 Jan 2009, 21:54:25 UTC - in response to Message 848917.
Like Vyper, Boinc Manager showed the % rapidly counting down in chunks, but only after several hours of crunching, so I don't know what's going on. Well, giving it another go now. I've decided to give CUDA another go, this time without any other projects vying for CPU time, but I've got to get through two Astropulses and several WCG. Well, I've ended up with over 20 CUDA WUs all of a sudden. So far only one failure, which reset the video driver. So far so good. Going up in chunks of 0.04% a second. It's taken 12 minutes for a 14 credit WU, longer WUs seem to be taking approx 30 minutes. Compared to the roughly 4 hours pre-CUDA I'm seeing approx 8x speed-up so far for the long WUs. I'm not sure what the difference is compared to the initial attempt.
Can you point out the units?

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Message 848921 - Posted: 3 Jan 2009, 21:58:28 UTC - in response to Message 848915. Last modified: 3 Jan 2009, 22:04:31 UTC
OMG.... Any NUMBER please? You are the one claiming performance improvement, where are your numbers????? How can we verify them???? From now on, I know you are a fanboy. I will keep adjusting your claims; right now, the public code accelerated nothing, too buggy. This is a fact, you can't change it.
1) OK, my numbers (you should read that thread first if you are so interested in GPU/CPU performance comparisons ;) ): http://setiathome.berkeley.edu/beta/forum_thread.php?id=1440 The thread is called "CUDA MB benchmarking", a pretty straightforward name, isn't it? It's not very good manners to answer a question with a question, right? So, your benchmarks?
2) I don't know how you can verify them - at the least you need that so-hated GPU with CUDA support :P But others can do it with ease - SETI CUDA (like all SETI CPU versions) can be run in standalone mode, and there is a very handy benchmarking tool from Lunatics that automates the testing process. If you ever did some measurements with the SETI app, and did not just say loud words about future "neha", "lara", "aha" ;) and so on and so forth, you should know how to use it.
ADDON: 3) And, BTW, your RAC is 252 now, still dropping... waiting for Monday, right? ;)

Francois Piednoel Joined: 14 Jun 00 Posts: 898 Credit: 5,969,361 RAC: 0
Message 848926 - Posted: 3 Jan 2009, 22:03:57 UTC - in response to Message 848921. Last modified: 3 Jan 2009, 22:06:45 UTC
Those are just a few units ... we want the average RAC with and without; this is what matters, the rest is farting in the wind. You found some units that do show some gain; from my long experience in SETI, that does not mean the other 99% of the units will not decelerate by 5X ...

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Message 848930 - Posted: 3 Jan 2009, 22:10:55 UTC - in response to Message 848926. Last modified: 3 Jan 2009, 22:12:05 UTC
Those are just a few units ... we want the average RAC with and without; this is what matters, the rest is farting in the wind.
LoL, those are very revealing words indeed! :) Do you know that it's the same set that was used for PGOing the AK8 opt app? Different ARs are represented here; the total execution time reflects the performance of the app being tested on the whole SETI@home data set. So it's not "just a few units" at all for anyone who has done any benchmarking for SETI before... And don't talk about RAC with me, your RAC is still dropping, waiting for Monday ... Your numbers? Your benchmarking tools (apparently you never used the Lunatics toolset)? Anything from you that could be reproduced?? Loud words again?

Francois Piednoel Joined: 14 Jun 00 Posts: 898 Credit: 5,969,361 RAC: 0
Message 848932 - Posted: 3 Jan 2009, 22:18:33 UTC - in response to Message 848930. Last modified: 3 Jan 2009, 22:19:12 UTC
Just show an impressive RAC or just close your mouth. You can say whatever you want, but you can't show a good RAC ... that should make a point! lol ....

Francois Piednoel Joined: 14 Jun 00 Posts: 898 Credit: 5,969,361 RAC: 0
Message 848935 - Posted: 3 Jan 2009, 22:22:44 UTC - in response to Message 848930.
This is my benchmark: WHERE is yours? I can get to 18 000 RAC ... that is what matters; then I move on to another similar project, get good at it, and move on again ... My RAC was 18 000 on the 11th of Nov 2008. Yours was still around 7000 and did not progress ... how come? Your NV stuff should make it go to 25 000 if we listen to your claims ... duh!

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Message 848936 - Posted: 3 Jan 2009, 22:27:44 UTC - in response to Message 848932.
[offtopic] This picture just shows that your RAC is dropping and will probably continue dropping. This project is not about the highest RAC at all (btw, a high RAC can be produced artificially with ease, I think you know the methods :) ), it's about computations done. If you can sustain a RAC around your maximum - well, fine, no questions in this area (although this has no connection to the CUDA question at all). Can you? Your graph demonstrates that you can't. Your total is too low to say anything good about the systems you use. So don't knock your past RAC. RAC is valuable only if you can keep it for a long time. Everything else is just a blown record. [/offtopic]
And returning to the CUDA benchmarking question: ANY reproducible data from you? Sorry, you can't shut my mouth with this graph that says nothing.

Francois Piednoel Joined: 14 Jun 00 Posts: 898 Credit: 5,969,361 RAC: 0
Message 848938 - Posted: 3 Jan 2009, 22:36:33 UTC Last modified: 3 Jan 2009, 22:44:16 UTC
Show something that can make your curve go like this; then you can say that your technology will improve SETI. If you can't show something like this, well, you have no impact on your average RANKING over the other users, and you are farting in the wind. See the gain of Skulltrail on my curve, from the Skulltrail proto in December 2007 to May 2008, when I moved on to another project (see, it gets steeper). Those are real benchmarks over time; if you can't show this, and only a few units, it is misleading at best. (Trust me, I learned this from working on my own code: it can look very good on the Lunatics bench tool and be only as good as the code from Alex in reality ... I learned how to shut up after this.) You can see that when I added Nehalem, my work ranking immediately started to gain compared to other users, showing that I was crunching faster than the average user; this is how you see on SETI if you have a technology that will win. Your curve has looked nothing like this recently. There is no breakthrough in your curve, you did not get any faster recently, so stop telling us the opposite. (My guess is that you bought a Core 2 in August 2008 ... hehehe) Best regards ....
PS: You keep dancing around, changing from subject to subject. Show us RAC with and without NV; then it may be true ... otherwise, you are in lalalala land.

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Message 848941 - Posted: 3 Jan 2009, 22:43:08 UTC - in response to Message 848935.
This is my benchmark:
It's NOT a benchmark result. You really don't know that?
WHERE is yours? I can get to 18 000 RAC ... that is what matters; then I move on to another similar project, get good at it, and move on again ... My RAC was 18 000 on the 11th of Nov 2008. Yours was still around 7000 and did not progress ... how come? Your NV stuff should make it go to 25 000 if we listen to your claims ... duh!
You can look up my own hosts' statistics yourself on any statistics server - I don't hide my hosts :P And I didn't make any claims. I provide facts. Claims are your prerogative ;) You made many claims about CUDA performance and spilled many liters of dirt on CUDA. So I want to see the data that justifies such behavior from you.
Now about the RAC of my quad: yes, it's something that should increase if CUDA can speed things up; indeed, you are right about that. Before the CUDA MB release it did mostly SETI with the AK6 app. After the CUDA MB release it did AP + CUDA for a few days; now it does Einstein@home on its CPU cores and the CUDA MB app on the GPU. Moreover, I do regular standalone testing on this host too, because I'm interested in debugging and speeding up the CUDA app, not just in blaming CUDA. These reasons led to the RAC drop (at least the RAC for SETI). When I finish the standalone testing, let's look at the sustained RAC of this host (the total one, not just for SETI - SETI now runs only on the GPU, the CPU does Einstein). I hope this answers the question about the RAC of this host (my total RAC consists of a few hosts - some of them are not always available, some have no connection with CUDA at all - so total RAC can't be used as an indicator at all).

Francois Piednoel Joined: 14 Jun 00 Posts: 898 Credit: 5,969,361 RAC: 0
Message 848944 - Posted: 3 Jan 2009, 22:47:14 UTC - in response to Message 848941.
Really, the RAC is not the ultimate SETI benchmark???????????? Whatever, dude ... Classified: FANBOY.

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Message 848946 - Posted: 3 Jan 2009, 22:49:58 UTC - in response to Message 848938.
PS: You keep dancing around, changing from subject to subject. Show us RAC with and without NV; then it may be true ... otherwise, you are in lalalala land.
No, we are surely in different lands, you and I ;) I'm not dancing around, that's not my style. I still want to see benchmark results from you: timings in seconds for a workunit completed on your CPU and completed on your GPU, with a loaded CPU, with an idle CPU, and so on. Any real data that can be reproduced, discussed and so on. You wanted to talk about RAC; OK, I can talk about RAC too, but it's not my question. You want RAC (but you should speak about sustained RAC, we all know RAC is a very variable thing) with GPU and w/o GPU - OK, you will receive it too, just later - right now I have no production host with CUDA and a sustained RAC (see earlier post). But where is your RAC with and w/o CUDA? What do you show us? Some fast CPU? Fine, and so what? Where is the comparison with CUDA?

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Message 848949 - Posted: 3 Jan 2009, 22:53:31 UTC - in response to Message 848944.
Really, the RAC is not the ultimate SETI benchmark???????????? Whatever, dude ... Classified: FANBOY.
Yes, REALLY. Sustained RAC - maybe, but RAC itself is too variable a thing - is this info really new to you? And better to avoid classifying my person - you really would not like it if I started to classify you, right? ;) And who is dancing now? Just for the record - I renew my question - where are your timings?
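Why RAC is "too variable a thing" to read as a benchmark can be illustrated with a small sketch. It assumes RAC behaves like an exponentially weighted average of granted credit with a half-life of about one week; the real BOINC server updates the average on each credit grant rather than once per day, so the daily stepping here is a simplification, and the function names are invented for this example.

```python
HALF_LIFE_DAYS = 7.0  # assumed half-life of the recent-average-credit estimate


def step_rac(rac, credit_today, half_life_days=HALF_LIFE_DAYS):
    """Advance the average by one day: decay the old value, blend in today's credit."""
    weight = 0.5 ** (1.0 / half_life_days)
    return rac * weight + (1.0 - weight) * credit_today


def simulate(daily_credits, start=0.0):
    """Return the RAC after each day for a sequence of daily granted credit."""
    rac = start
    history = []
    for credit in daily_credits:
        rac = step_rac(rac, credit)
        history.append(rac)
    return history


# A host that steadily earns 1000 credits/day is still well short of 1000 after a week.
ramp_up = simulate([1000.0] * 7)

# A host at 18 000 RAC that stops crunching keeps showing a large RAC for a while.
decay = simulate([0.0] * 7, start=18000.0)

print(f"RAC after 7 days at 1000/day: {ramp_up[-1]:.0f}")
print(f"RAC 7 days after stopping at 18000: {decay[-1]:.0f}")
```

Under this model a peak RAC mostly says what a host did over the previous weeks, and switching a host between projects or taking it offline for standalone testing shows up as a slow decline. That is consistent with the request above for per-workunit timings in seconds rather than a RAC curve.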

KWSN Sir Clark Volunteer tester Joined: 17 Aug 02 Posts: 139 Credit: 1,002,493 RAC: 8
Message 848951 - Posted: 3 Jan 2009, 22:57:29 UTC - in response to Message 848920.
Can you point out the units?
One from the first batch: Task 1. One from the recent batch: Task 2. I can't see any difference in the flop count or anything else reported in the Task information; Task 1 took hours rather than the minutes Task 2 did.
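The timings reported earlier in this exchange (roughly 4 hours before CUDA, about 30 minutes for long WUs with CUDA, and progress of 0.04% per second) can be cross-checked with a few lines of arithmetic. This is only a sanity check on the numbers as posted; the helper names are made up for the sketch.

```python
def speedup(before_s, after_s):
    """How many times faster the new run is compared with the old one."""
    return before_s / after_s


def eta_seconds(percent_per_second):
    """Total run time implied by a constant progress rate in percent per second."""
    return 100.0 / percent_per_second


# Figures as posted: about 4 hours pre-CUDA vs about 30 minutes with CUDA.
long_wu_speedup = speedup(4 * 3600, 30 * 60)

# A steady 0.04% per second would imply this total run time.
implied_total_s = eta_seconds(0.04)

print(f"Long WU speed-up: {long_wu_speedup:.1f}x")
print(f"Run time implied by 0.04%/s: {implied_total_s / 60:.1f} minutes")
```

The 8x figure follows directly from the posted times. The progress rate would imply a somewhat longer run than the reported 30 minutes if it were constant, which suggests the rate is a rough observation or that progress is not linear; two tasks are in any case too few to settle the question.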

Josef W. Segur Volunteer developer Volunteer tester Joined: 30 Oct 99 Posts: 4504 Credit: 1,414,761 RAC: 0
Message 848954 - Posted: 3 Jan 2009, 22:58:56 UTC - in response to Message 848919.
today, (time taken to send through PCI Express + time doing the FFT + time sending it back) > (doing the FFT on Core i7)
True. (at least for FFTs shorter than about 64K)
End of story.
False. The task is not to return FFT data, it is to FFT and process the data, then return extracted meta information. 6.06 may not yet do as much as it should on the GPU, but the small amount of CPU time needed indicates it's doing fairly well in that regard. If it were able to do all setiathome_enhanced WUs without crashing or finding false signals it would be a worthy addition to our crunching capabilities. Joe
Well, after the FFT you process findpulse(), if I am right ... and the data is in the cache, usually in the L2, and for sure in the L3 with Core i7. There is no time to do extra FFTs, and that is why it will not help.
The output from the FFT is converted to a Power Spectrum, and Spike finding is always done. For some chirp/fft pairs the full array of Power Spectrums is transposed and Pulse, Triplet, or Gaussian searches are done.
It may help on a very low-end CPU, but then findpulse will be slow too (low-end CPUs usually have less cache, and will cache miss). Why do you think NV took a Phenom to compare to? They knew exactly that they could not accelerate Core i7 who?
Perhaps you missed this part of the NVIDIA Press Release?
ii Based on a consistent and reproducible SETI@home workload. Time-to-compute is measured and lower time is better. NVIDIA® GeForce® GTX 280-based system processes workload on the NVIDIA GPU and is based on an NVIDIA nForce® 780i SLI™-based motherboard, NVIDIA GTX 280 GPU, Intel Core i7 965 CPU, 2GB DDR2 DRAM and processes the workload in 391 seconds. “Fastest consumer multicore CPU-based system” processes the entire workload on CPU and is based on an ATI Radeon HD4870 GPU, Intel x58-based motherboard, Intel Core i7 965, 3GB DDR3 DRAM and processes the workload in 670 seconds.
I suspect the "consistent and reproducible SETI@home workload" was poorly chosen, otherwise the tendency to crash and/or produce false positives would have been caught, and they don't say if HT was in use on the 965. Still, if the GTX 280 can do say four tasks in 391 seconds while 3/4 or 7/8 of the 965 is still available for other work, that's a productivity increase. I look forward to seeing what a Larrabee based GPU card can do on a similar test. Joe
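The two times in the press-release footnote, and the "productivity increase" point made after it, can be put into numbers with a short sketch. The 391 s and 670 s figures are from the footnote; everything else is an assumption for illustration: the workload is treated as one unit of work, and the leftover CPU capacity is assumed to scale linearly with the fraction of the CPU still free (the footnote does not say whether HT was in use, so this is a rough guess at best).

```python
GPU_SYSTEM_S = 391.0  # footnote: GTX 280 system, workload processed on the GPU
CPU_SYSTEM_S = 670.0  # footnote: Core i7 965 system, workload processed on the CPU


def speedup(cpu_s, gpu_s):
    """Ratio of CPU-only time to GPU time for the same workload."""
    return cpu_s / gpu_s


def relative_throughput(cpu_s, gpu_s, cpu_fraction_free):
    """Throughput of GPU plus leftover CPU, relative to the CPU-only system.

    Assumes the free CPU fraction keeps crunching at a proportional rate.
    """
    gpu_rate = 1.0 / gpu_s
    cpu_rate = cpu_fraction_free / cpu_s
    return (gpu_rate + cpu_rate) / (1.0 / cpu_s)


plain_speedup = speedup(CPU_SYSTEM_S, GPU_SYSTEM_S)
with_three_quarters = relative_throughput(CPU_SYSTEM_S, GPU_SYSTEM_S, 3 / 4)
with_seven_eighths = relative_throughput(CPU_SYSTEM_S, GPU_SYSTEM_S, 7 / 8)

print(f"GPU vs CPU time-to-compute: {plain_speedup:.2f}x")
print(f"GPU + 3/4 of the CPU: {with_three_quarters:.2f}x the CPU-only throughput")
print(f"GPU + 7/8 of the CPU: {with_seven_eighths:.2f}x the CPU-only throughput")
```

The footnote alone gives a speed-up of roughly 1.7x on time-to-compute; the larger combined figures depend entirely on the linear-scaling assumption and on the CUDA app actually completing its tasks correctly.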

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Message 848958 - Posted: 3 Jan 2009, 23:00:32 UTC - in response to Message 848949. Last modified: 3 Jan 2009, 23:02:49 UTC
Ah, I just realized, you posted my rank in BOINC? LoL, do you really think it says something about CUDA performance?? Is this the real "analytical" approach of your firm? ROFL. Just for your information: BOINC rank depends on the performance of my own hosts (all of them, not just the CUDA-enabled one), on their uptime, on the work available from the projects I participate in, and finally on the performance of all the other hosts involved in the comparison! Do you really think this value can say anything about CUDA performance on my host? It's just not true. But I provided values that illustrate CUDA performance on my host, exactly CUDA performance, not something else. Can you provide the same data about your host or not? Pretty simple question.

Raistmer Volunteer developer Volunteer tester Joined: 16 Jun 01 Posts: 6325 Credit: 106,370,077 RAC: 121
Message 848960 - Posted: 3 Jan 2009, 23:07:15 UTC - in response to Message 848938.
There is no breakthrough in your curve, you did not get any faster recently, so stop telling us the opposite. (My guess is that you bought a Core 2 in August 2008 ... hehehe)
1) I described the reasons why you don't see any speedup on this graph - too many factors are in play - that's why such graphs CAN'T be used as benchmarks.
2) You are right, I installed BOINC on the quad in August :) And it gave a great performance boost; it's now the fastest of my hosts and will soon have completed more work than all my other hosts did, no doubt about that.

Voyager Volunteer tester Joined: 2 Nov 99 Posts: 602 Credit: 3,264,813 RAC: 0
Message 848961 - Posted: 3 Jan 2009, 23:11:25 UTC
I would hope this thread would not just be arguing, but would provide simple info on CUDA. It looks like it's here for now, so I hope I can learn from the boards. I don't want to be granted a PhD in theory; please, something understandable, so it's helpful.

Dirk Sadowski Volunteer tester Joined: 6 Apr 07 Posts: 7105 Credit: 147,663,825 RAC: 5
Message 849002 - Posted: 4 Jan 2009, 0:54:31 UTC - in response to Message 848751.
Here are Intel- and nVIDIA-fans.. ;-D I'm a SETI@home-fan! :-D So I would like to have the best hardware to crunch in less time! Sorry for my ignorance.. I think my post is going under in this thread.. So please, back to topic..
How would the performance be? A PCIe 2.0 GPU in a PCIe 1.0 slot. A PCIe 2.0 GPU in a PCIe 2.0 slot. How big would the slowdown be with a PCIe 1.0 slot?
---------------------------------------
In future, will the SETI@home-CUDA-app always need the CPU/core for crunching? ..and the PCIe slot for communication/crunching?
---------------------------------------
What about the architecture? Maybe it would be better to combine an AMD CPU with an nVIDIA GPU? More performance? Thanks!
Larrabee? When it's available to buy? :-) AND, the S@H-CUDA-app can run on this GPU?
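The PCIe 1.0 versus 2.0 question can at least be bounded with a back-of-the-envelope sketch. The per-lane figures used here are the commonly quoted theoretical peaks (about 250 MB/s per lane per direction for PCIe 1.x and about 500 MB/s for PCIe 2.0); real transfers run below these peaks. The amount of data a task moves across the bus is not stated anywhere in this thread, so the 1 GiB used below is purely a hypothetical placeholder, and the 12-minute task time is borrowed from the timings posted earlier.

```python
# Theoretical peak per lane, per direction, in bytes per second.
PER_LANE_BYTES_PER_S = {"1.x": 250e6, "2.0": 500e6}


def link_bytes_per_s(generation, lanes=16):
    """Peak one-way throughput of a PCIe link with the given lane count."""
    return PER_LANE_BYTES_PER_S[generation] * lanes


def transfer_s(n_bytes, generation, lanes=16):
    """Best-case time to move n_bytes across the link in one direction."""
    return n_bytes / link_bytes_per_s(generation, lanes)


def extra_time_fraction(total_bytes, task_s, lanes=16):
    """Extra transfer time on PCIe 1.x vs 2.0, as a fraction of the task time."""
    extra = (transfer_s(total_bytes, "1.x", lanes)
             - transfer_s(total_bytes, "2.0", lanes))
    return extra / task_s


HYPOTHETICAL_TRAFFIC_BYTES = 2 ** 30   # placeholder: 1 GiB moved per task
TASK_S = 12 * 60                       # a 12-minute task, as reported in the thread

fraction = extra_time_fraction(HYPOTHETICAL_TRAFFIC_BYTES, TASK_S)
print(f"x16 peak: {link_bytes_per_s('1.x') / 1e9:.0f} GB/s vs "
      f"{link_bytes_per_s('2.0') / 1e9:.0f} GB/s")
print(f"Extra time on PCIe 1.x: {fraction * 100:.3f}% of the task")
```

If the app keeps most of its working data on the card, the slot generation would matter very little under these assumptions; if it shuttled data back and forth for every FFT, the picture could be quite different. Only a measurement on the same GPU in both slot types would give a real answer.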

Grant (SSSF) Volunteer tester Joined: 19 Aug 99 Posts: 13960 Credit: 208,696,464 RAC: 304
Message 849021 - Posted: 4 Jan 2009, 1:43:13 UTC - in response to Message 849002. Last modified: 4 Jan 2009, 1:46:20 UTC
Larrabee? When it's available to buy? :-)
Rumours are, sometime later this year. Like anything new, expect prices to be excessive, and performance to be OK at best. V2/2nd revision is expected mid 2010 & will more likely give a better idea of just what it is capable of.
AND, the S@H-CUDA-app can run on this GPU?
Given that a Larrabee video card will essentially be a whole bunch of modified x86 CPUs on the same silicon, theoretically it should be possible to run Seti on it with minimal changes.
From my previous link:
Programming for Larrabee
The Larrabee programming model is what sets it apart from the competition. While competing GPU architectures have become increasingly programmable over the years, Larrabee starts from a position of being fully programmable. To the developer, it appears as exactly what it is - an arrangement of fully cache coherent x86 microprocessors. The first iteration of Larrabee will hide this fact from the OS through its graphics driver, but future versions of the chip could conceivably populate task manager just like your desktop x86 cores do today.
Given that the initial card will probably have 24 or so cores, and the later one (2010) 64 or more, that will mean the possibility of processing that many Work Units at a time (or close to it- depending on what mode your video card is running in for your desktop & applications).
Wild personal speculation- like this first Seti CUDA effort, I expect the initial Larrabee to be pretty underwhelming- lots of potential, reasonable performance, but plenty of work still to be done. However, as Intel revise the microcode, work on the caches, chip to chip communications, data transfer etc, etc, and if developers get on the bandwagon, late 2010/early 2011 could be a very interesting period for distributed computing.
Grant Darwin NT
