Are you ready for the next generation CPU?

Author	Message
Francois Piednoel Send message Joined: 14 Jun 00 Posts: 898 Credit: 5,969,361 RAC: 0	Message 397881 - Posted: 15 Aug 2006, 4:38:06 UTC demo: http://setiathome.berkeley.edu/top_hosts.php going up ... going up ... FrancoisP ID: 397881 ·

sterling0466 Send message Joined: 5 Oct 00 Posts: 204 Credit: 742,621 RAC: 0	Message 397889 - Posted: 15 Aug 2006, 5:11:47 UTC Yes, these new Apple machines are nice, and very fast...but that is partially due to having FOUR processors. Take out three of those four processors and put it up against my AMD Athlon 64 FX 51 and let's see what happens when one processor is tested against one processor. Not trying to brag, start anything, or spread any 'flame' postings...just simply stating the facts. One processor -vs- one processor, fair is fair. As I have stated in the past, Apple has had some really cool and great ideas, both hardware and software related...it is a shame that Apple and some IBM/PC Clone company cannot share ideas and see what happens...the entire computer science field would be generations ahead of where we are now. ID: 397889 ·

Francois Piednoel Send message Joined: 14 Jun 00 Posts: 898 Credit: 5,969,361 RAC: 0	Message 397893 - Posted: 15 Aug 2006, 5:18:34 UTC - in response to Message 397889. Last modified: 15 Aug 2006, 5:21:38 UTC Not trying to brag, start anything, or spread any 'flame' postings...just simply stating the facts. One processor -vs- one processor, fair is fair. I am actually running on a 975XBX with one processor package only :) you have one CPU, and I have one too... I just have 4 cores. Francois who? Skulltrail D5400XS ID: 397893 ·

KWSN - Chicken of Angnor Volunteer developer Volunteer tester Send message Joined: 9 Jul 99 Posts: 1199 Credit: 6,615,780 RAC: 0	Message 397896 - Posted: 15 Aug 2006, 5:21:43 UTC Francois, I saw you seem to be using my code base for your apps :o) Now, what's interesting me is the 4 cores part - you wouldn't perchance be running an ES quad-core, would you? To my knowledge, no Core 2 CPU has HT enabled, do they? Impressive performance numbers there, to be sure. Regards, Simon. Donate to SETI@Home via PayPal! Optimized SETI@Home apps + Information ID: 397896 ·

Francois Piednoel Send message Joined: 14 Jun 00 Posts: 898 Credit: 5,969,361 RAC: 0	Message 397900 - Posted: 15 Aug 2006, 5:29:02 UTC - in response to Message 397896. Last modified: 15 Aug 2006, 5:29:22 UTC Yes, I took your nice code and recompiled it for the new Merom instructions, added some more hand-coded FFT work, and yes, you are seeing an ES quad-core Core 2 at work, and yes, it is insanely fast. I do this as my hobby, and I am lucky enough to have some nice hardware. Thanks for making the recompile easy. As soon as we are done with the tuning, I'll be happy to give you back the entire set of code modifications. I am actually looking at doing 4 FFTs in parallel using SIMD. In the case of the SETI FFT, we can probably achieve 99% SIMD efficiency. Do you know if anybody has tried this before? Francois ID: 397900 ·

KWSN - Chicken of Angnor Volunteer developer Volunteer tester Send message Joined: 9 Jul 99 Posts: 1199 Credit: 6,615,780 RAC: 0	Message 397903 - Posted: 15 Aug 2006, 5:35:52 UTC Last modified: 15 Aug 2006, 6:14:00 UTC No, not with 4 in parallel. There have been a couple of people who have done some inline assembly for doing two operations at the same time to feed the execution units on pre-Core 2 CPU models, but so far, no X86 CPU has been able to execute as many ops in one cycle so there was no need :o) I'm very interested in your code changes - what I'm trying to do with my site and the apps and How-Tos is to gather together as many capable people working on optimizations as possible. Your input is most welcome. Regards, Simon. Donate to SETI@Home via PayPal! Optimized SETI@Home apps + Information ID: 397903 ·

Francois Piednoel Send message Joined: 14 Jun 00 Posts: 898 Credit: 5,969,361 RAC: 0	Message 397908 - Posted: 15 Aug 2006, 5:54:37 UTC - in response to Message 397903. I'm very interested in your code changes - what I'm trying to do with my site and the apps and How-Tos is to gather together as many capable people working on optimizations as possible. Your input is most welcome. I found that SmartHeap 8.0 gives a nice percentage gain on the SETI code: heap and stack allocation are pretty intensive, and SmartHeap gave a nice boost. Intel compiler 9.1 provides support for MNI (Merom New Instructions), and the new versions of MKL and IPP support it too. 1 cycle per SSEx instruction is awesome: you can transform most of the algorithm and make it much more efficient. I am still exploring it, but the scaling looks WOW... Note that I am also using the machine that is crunching SETI for compiling, web browsing, etc. The machine's average just passed 2800 :) SETI is running in the background and I don't really feel it. Very exciting times! FrancoisP ID: 397908 ·

1mp0£173 Volunteer tester Send message Joined: 3 Apr 99 Posts: 8423 Credit: 356,897 RAC: 0	Message 397912 - Posted: 15 Aug 2006, 6:04:22 UTC http://setiathome.berkeley.edu/show_host_detail.php?hostid=2302665 It may be a single die, but four cores equals four cpus. ID: 397912 ·

Francois Piednoel Send message Joined: 14 Jun 00 Posts: 898 Credit: 5,969,361 RAC: 0	Message 397914 - Posted: 15 Aug 2006, 6:11:02 UTC - in response to Message 397912. http://setiathome.berkeley.edu/show_host_detail.php?hostid=2302665 It may be a single die, but four cores equals four cpus. ok :) if you want to count like this, it is ok. Francois ID: 397914 ·

Alex Kan Volunteer developer Send message Joined: 4 Dec 03 Posts: 127 Credit: 29,269 RAC: 0	Message 397927 - Posted: 15 Aug 2006, 7:11:29 UTC - in response to Message 397889. Yes, these new Apple machines are nice, and very fast...but that is partially due to having FOUR processors. Take out three of those four processors and put it up against my AMD Athlon 64 FX 51 and let's see what happens when one processor is tested against one processor. Not trying to brag, start anything, or spread any 'flame' postings...just simply stating the facts. One processor -vs- one processor, fair is fair. If you're going to compare your FX-51 to the G5 Quads, at least put an optimized client on it first! Prior to anything on the Core microarchitecture, I haven't seen anything close to the Quads running v6, in terms of work unit times. (It's also worth noting that each of those four cores is running its own separate SETI process.) Also, I haven't seen anyone on SETI running a Mac Pro yet. I've been looking to test my Intel Mac clients on them for quite some time, but people I've talked to seem to be waiting to get their money's worth from their current machines before they take the plunge. Francois, I've noticed that your connection to Intel is more than just being on their SETI team and linking to their website, so I'm curious as to what your findings are about SETI performance. You mentioned SSE4--I was of the impression that most of the new instructions are for integer arithmetic, so which of these have actually been useful? Also, you mentioned the idea of hand-coding a replacement FFT and using SIMD to do four FFTs in parallel--is that actually faster than using SIMD for a single FFT at the lengths that SETI uses? And for my final question...who is "we?" :) ID: 397927 ·

Francois Piednoel Send message Joined: 14 Jun 00 Posts: 898 Credit: 5,969,361 RAC: 0	Message 397935 - Posted: 15 Aug 2006, 7:52:23 UTC - in response to Message 397927. Francois, I've noticed that your connection to Intel is more than just being on their SETI team and linking to their website, so I'm curious as to what your findings are about SETI performance. You mentioned SSE4--I was of the impression that most of the new instructions are for integer arithmetic, so which of these have actually been useful? Also, you mentioned the idea of hand-coding a replacement FFT and using SIMD to do four FFTs in parallel--is that actually faster than using SIMD for a single FFT at the lengths that SETI uses? And for my final question...who is "we?" :) As you probably noticed, I have been playing with SETI since 2000. SETI is always an interesting distributed-computing problem, and the FFT is a challenge for my little brain all by itself. If you look at the FFT using 4 vectors in parallel, you have to code your FFT in a way that minimizes the penalties: branching, memory footprint, and, in the case of Core, you want to use as many 128-bit SSEx instructions as you can. To use SIMD efficiently, you want to move your data from Array of Structure to Structure of Structure. For example, in 3D, it is very common to store X,Y,Z,W in memory like this: X,Y,Z,W,X,Y,Z,W,X,Y,Z,W,X,Y,Z,W,X,Y,Z,W... (Array of Structure) The natural way to store your SIMD data is XXXXXXXXXXX.... YYYYYYYYYYYY... ZZZZZZZZ..... WWWWWWWWWWW... (Structure of Array) But this has the bad side effect of opening more memory streams, and most modern processors allow only 4 or 8 streams open at the same time. One of my co-workers, AlexK, came up with this data structure in 1998, called Structure of Structure: XXXX,YYYY,ZZZZ,WWWW,XXXX,YYYY,ZZZZ,WWWW,XXXX,YYYY,ZZZZ,WWWW... Like this, you access memory with only one or two streams, your data locality is tight, and your cache lines get used really efficiently. 
What I am doing today in the SETI code is simply trying to apply Alex's idea to the FFT. I'll need a few more weeks to get it done; it is a nice mind game, but it should dramatically increase the instructions per clock on the FFT side. Let's be clear, I am doing SETI for fun. I am a very happy/lucky man: my hobby and my job are very interlaced, I rarely have the feeling of working, and Intel did not ask me to do anything on SETI. Intel gives me access to the best toys I can dream of. Performance in general is a very interesting problem, and not only in computers; I work on it with cars as well. Anybody who wants to help with the SIMDization of SETI is welcome :) FrancoisP ID: 397935 ·
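
The three layouts described in the post above can be sketched as index calculations into one flat buffer. This is only a minimal illustration in Python, not code from the SETI@home client or from Intel: the function names, the `LANES` constant, and the "contiguous runs" measure (a rough stand-in for the memory streams the post mentions) are all assumptions made for the example, and the point count is assumed to be a multiple of four.

```python
# Illustrative only: flat-buffer index calculations for the three layouts.
# Function and constant names are hypothetical, not taken from SETI@home code.

COMPONENTS = "XYZW"
LANES = 4  # four 32-bit floats fit in one 128-bit SSE register


def aos_index(i, c, n):
    """Array of Structure: X,Y,Z,W,X,Y,Z,W,..."""
    return i * len(COMPONENTS) + c


def soa_index(i, c, n):
    """Structure of Array: XXXX..., YYYY..., ZZZZ..., WWWW..."""
    return c * n + i


def hybrid_index(i, c, n):
    """Blocked layout from the post: XXXX,YYYY,ZZZZ,WWWW,XXXX,..."""
    block, lane = divmod(i, LANES)
    return block * LANES * len(COMPONENTS) + c * LANES + lane


def layout(index_fn, n):
    """Return the component letters in memory order for n points.

    n is assumed to be a multiple of LANES so the blocked layout has no
    partial block.
    """
    flat = [""] * (n * len(COMPONENTS))
    for i in range(n):
        for c, name in enumerate(COMPONENTS):
            flat[index_fn(i, c, n)] = name
    return "".join(flat)


def block_indices(index_fn, block, n):
    """Addresses touched when processing one block of LANES points."""
    first = block * LANES
    return [index_fn(first + lane, c, n)
            for c in range(len(COMPONENTS))
            for lane in range(LANES)]


def contiguous_runs(indices):
    """Count runs of consecutive addresses, a rough stand-in for streams."""
    ordered = sorted(indices)
    if not ordered:
        return 0
    return 1 + sum(1 for a, b in zip(ordered, ordered[1:]) if b != a + 1)


if __name__ == "__main__":
    for fn in (aos_index, soa_index, hybrid_index):
        print(f"{fn.__name__:13s}", layout(fn, 8),
              "runs per block:", contiguous_runs(block_indices(fn, 0, 8)))
```

With eight points, the Structure of Array layout touches four separate runs of memory per block of four points, while the blocked layout touches one. Array of Structure also touches one run, but a four-wide load there picks up the X, Y, Z and W of a single point rather than four X values, so it usually needs shuffling before any SIMD arithmetic.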

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51561 Credit: 1,018,363,574 RAC: 1,004	Message 397939 - Posted: 15 Aug 2006, 8:05:53 UTC Very interesting stuff, although I must admit it is far beyond my level of knowledge. Hopefully Simon can make use of some of these ideas or coding schemes in some of his upcoming releases. I very much appreciate the fact that Simon's approach is to elicit input from other programmers, and work together with them for a common cause. He has already done some great work, but who knows what working with other like minded people could come up with? Thank You to Simon and all you others who are willing to share your expertise and work along with him!! "Time is simply the mechanism that keeps everything from happening all at once." ID: 397939 ·

sterling0466 Send message Joined: 5 Oct 00 Posts: 204 Credit: 742,621 RAC: 0	Message 398050 - Posted: 15 Aug 2006, 12:37:24 UTC - in response to Message 397912. http://setiathome.berkeley.edu/show_host_detail.php?hostid=2302665 It may be a single die, but four cores equals four cpus. Thank you, even people with an Intel or an AMD Dual Core or Core 2 need to realize this fact...you really are running multiple CPUs, they are just packaged into one piece of hardware. (I must admit, my next home unit may just have one of those Dual Core or Core 2 AMD Chips...I just have to mow a heck of a lot of yards over the summer to afford the hardware!!!) ID: 398050 ·

Saimek Send message Joined: 25 Jan 00 Posts: 121 Credit: 454,423 RAC: 0	Message 398068 - Posted: 15 Aug 2006, 13:25:41 UTC Wow.. Kentsfield onboard =) I'm impressed =] I just sold my X2 3800+ and I'm getting a 6400 + Gigabyte DS3 =] hoping to get a 3.6 GHz 24/7 stable overclock... =] ID: 398068 ·

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51561 Credit: 1,018,363,574 RAC: 1,004	Message 398105 - Posted: 15 Aug 2006, 14:48:17 UTC - in response to Message 398050. Last modified: 15 Aug 2006, 14:53:56 UTC http://setiathome.berkeley.edu/show_host_detail.php?hostid=2302665 It may be a single die, but four cores equals four cpus. Thank you, even people with an Intel or an AMD Dual Core or Core 2 need to realize this fact...you really are running multiple CPUs, they are just packaged into one piece of hardware. (I must admit, my next home unit may just have one of those Dual Core or Core 2 AMD Chips...I just have to mow a heck of a lot of yards over the summer to afford the hardware!!!) This is not a mystery, Seti knows how many cpus you have. Just click on the computer id and look at the computer summary screen. It reports my Conroe as having 2 cpus, which it does, and Francois' cpu reports 4 cpus, which it has. It just shows up under one computer (host) id. But it is just like having multiple computers installed in one piece of hardware. Kind of like having a small Seti crunching team in one computer. "Time is simply the mechanism that keeps everything from happening all at once." ID: 398105 ·

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51561 Credit: 1,018,363,574 RAC: 1,004	Message 398109 - Posted: 15 Aug 2006, 14:53:23 UTC BTW, would Francois' processor be considered a core 2 quad? I didn't think they had been released yet. Heck, you can't hardly even buy an E6600 or E6700 off the shelf yet. Or do his connections to Intel get him an engineering sample or such? "Time is simply the mechanism that keeps everything from happening all at once." ID: 398109 ·

Paydirt Send message Joined: 17 Sep 00 Posts: 53 Credit: 37,938 RAC: 0	Message 398124 - Posted: 15 Aug 2006, 15:39:30 UTC I think he works for Intel? I'm surprised by some of the responses I've seen to this thread. People are so stuck in needing to be right or having to see things one specific way that they cannot get excited about something that is new and cool in computing. So what if it is 4 CPUs? Who cares? Are we trying to prove a point, because Francois isn't trying to prove one. It's great for SETI and awesome for the volunteer grid computing community! WE ARE ALL IN THIS TOGETHER! Apple, AMD, Intel, IBM, etc. Whatever it takes! ID: 398124 ·

kittyman Volunteer tester Send message Joined: 9 Jul 00 Posts: 51561 Credit: 1,018,363,574 RAC: 1,004	Message 398187 - Posted: 15 Aug 2006, 16:50:09 UTC - in response to Message 398124. Last modified: 15 Aug 2006, 16:53:09 UTC I think he works for Intel? I'm surprised by some of the responses I've seen to this thread. People are so stuck in needing to be right or having to see things one specific way that they cannot get excited about something that is new and cool in computing. So what if it is 4 CPUs? Who cares? Are we trying to prove a point, because Francois isn't trying to prove one. It's great for SETI and awesome for the volunteer grid computing community! WE ARE ALL IN THIS TOGETHER! Apple, AMD, Intel, IBM, etc. Whatever it takes! Sure Francois is trying to prove a point! He's trying to prove that Intel finally has a butt-kicking architecture available with the new Core 2 cpus. After all, he works for Intel, and I am sure he is excited about what is new and cool in computing, 'cuz Intel is it. I'm sure excited about it, my new X6800 is doing things my AMD FX60 can't even touch! I think what Francois is doing is absolutely fantastic!! Even if the rest of us cannot afford some of the grand toys that he has access to directly from Intel, what he is doing scales down to the processors that are coming on the market in a price range most of us can afford. He has not tried to hide the fact that he works for Intel, and he has already said that Intel did not instruct him to work on Seti; I truly believe he is doing this as a very excited hobbyist. And the manner in which he is doing it is beyond reproach...being openly willing to share optimized code. As far as I am concerned, Francois can beat the Intel drum all he wants. What could be better than an Intel insider who is willing to work with Simon on his optimized apps? This is win-win for everybody! "Time is simply the mechanism that keeps everything from happening all at once." ID: 398187 ·

Bart Barenbrug Send message Joined: 7 Jul 04 Posts: 52 Credit: 337,401 RAC: 0	Message 398766 - Posted: 15 Aug 2006, 19:28:54 UTC Indeed. Parallel is the way to go (us BOINC users should know a thing or two about that), and working towards using this kind of parallelism effectively is a great step forward. One day, when we're all using dual-processor machines, with each processor being quad-core, and each of those cores hyperthreaded, we'll still be benefitting from this work (I just don't want to be the one to write the task balancing and task migration code for such a beast, with all the different penalties of migrating a task between hyperthreads on the same core, between cores on the same processor, or between processors etc. *g*). ID: 398766 ·

KWSN - Chicken of Angnor Volunteer developer Volunteer tester Send message Joined: 9 Jul 99 Posts: 1199 Credit: 6,615,780 RAC: 0	Message 398788 - Posted: 15 Aug 2006, 19:46:10 UTC - in response to Message 397935. [...] If you look at the FFT using 4 vectors in parallel, you have to code your FFT in a way that minimizes the penalties: branching, memory footprint, and, in the case of Core, you want to use as many 128-bit SSEx instructions as you can. To use SIMD efficiently, you want to move your data from Array of Structure to Structure of Structure. For example, in 3D, it is very common to store X,Y,Z,W in memory like this: X,Y,Z,W,X,Y,Z,W,X,Y,Z,W,X,Y,Z,W,X,Y,Z,W... (Array of Structure) The natural way to store your SIMD data is XXXXXXXXXXX.... YYYYYYYYYYYY... ZZZZZZZZ..... WWWWWWWWWWW... (Structure of Array) But this has the bad side effect of opening more memory streams, and most modern processors allow only 4 or 8 streams open at the same time. One of my co-workers, AlexK, came up with this data structure in 1998, called Structure of Structure: XXXX,YYYY,ZZZZ,WWWW,XXXX,YYYY,ZZZZ,WWWW,XXXX,YYYY,ZZZZ,WWWW... Like this, you access memory with only one or two streams, your data locality is tight, and your cache lines get used really efficiently. [...] Anybody who wants to help with the SIMDization of SETI is welcome :) FrancoisP Salut Francois, do you believe this could also be adapted for pre-Core 2 CPUs, with 2 FFTs in parallel instead of 4? I'm pretty sure that current code does not specifically do this, as Ben Herndon pointed out to me - you may be interested in his (and Dr. Korpela's) Sourceforge project (regrettably, it's not current code, but has lots of inline assembly as well as some specific code to feed execution units in parallel with minimal penalties). As others have posted here, I'm very much in favour of getting all people working on optimizations in contact with each other (and hence, pooling resources towards a common goal). 
Regrettably, I still cannot code C/C++ or indeed know assembly, though those will be skills to acquire in the future. If you would like, you could head over to my Seti@Home site and register - I would be glad to give you access to the pre-release application board and have your input. There is already another Intel employee registered - Intel being quite a large company though, you probably don't know each other - his name is Greg Eckert, and he works as Instructor training manager in the Intel Software College. The more, the merrier ;o) Regards, Simon. Donate to SETI@Home via PayPal! Optimized SETI@Home apps + Information ID: 398788 ·
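
The idea of running several FFTs side by side, with either 4 or 2 lanes, can be sketched in pure Python by letting a short list stand in for one SIMD register. This is a hedged illustration of the general technique (a textbook radix-2 FFT applied to interleaved data), not the code Francois or Simon describe: `fft_batched`, `naive_dft` and `bit_reverse` are hypothetical names, the signal length is assumed to be a power of two, and nothing here says anything about real performance.

```python
import cmath


def bit_reverse(k, bits):
    """Reverse the lowest `bits` bits of k."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (k & 1)
        k >>= 1
    return r


def fft_batched(signals):
    """Radix-2 FFT over several equal-length signals at once.

    data[k] holds sample k of every signal, like one SIMD register with
    one lane per FFT. Every lane follows the same control flow and shares
    the same twiddle factor, so each butterfly is one "vector" operation.
    """
    if not signals:
        raise ValueError("need at least one signal")
    lanes = len(signals)
    n = len(signals[0])
    if n == 0 or n & (n - 1) or any(len(s) != n for s in signals):
        raise ValueError("signals must share one power-of-two length")

    # Interleave the inputs and apply the bit-reversal permutation.
    bits = n.bit_length() - 1
    data = [None] * n
    for k in range(n):
        data[bit_reverse(k, bits)] = [complex(s[k]) for s in signals]

    size = 2
    while size <= n:
        half = size // 2
        for start in range(0, n, size):
            for j in range(half):
                w = cmath.exp(-2j * cmath.pi * j / size)  # shared by all lanes
                a = data[start + j]
                t = [w * x for x in data[start + j + half]]
                data[start + j] = [x + y for x, y in zip(a, t)]
                data[start + j + half] = [x - y for x, y in zip(a, t)]
        size *= 2

    # De-interleave: one spectrum per input signal.
    return [[data[k][lane] for k in range(n)] for lane in range(lanes)]


def naive_dft(signal):
    """Direct O(n^2) DFT, used only to check the batched result."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]
```

Passing four signals mimics four lanes of a 128-bit register holding single-precision values; passing two mimics the narrower case raised in the question above. The structure of the loops does not change with the lane count, only the width of each emulated register.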
