Are you ready for the next generation CPU? |
Message boards : Number crunching : Are you ready for the next generation CPU?
| Author | Message |
|---|---|
|
demo: | |
| ID: 397881 | | |
|
Yes, these new Apple machines are nice, and very fast...but that is partially due to having FOUR processors. Take out three of those four processors and put it up against my AMD Athlon 64 FX 51 and let's see what happens when one processor is tested against one processor. Not trying to brag, start anything, or spread any 'flame' postings...just simply stating the facts. One processor -vs- one processor, fair is fair. | |
| ID: 397889 | | |
Not trying to brag, start anything, or spread any 'flame' postings...just simply stating the facts. One processor -vs- one processor, fair is fair. I am actually running on a 975XBX with one processor package only :) you have one CPU, and I have one too... I just have 4 cores. Francois ____________ who? Skulltrail D5400XS | |
| ID: 397893 | | |
|
Francois, | |
| ID: 397896 | | |
|
Yes, I took your nice code and re-compiled it for the new Merom instructions, added some more FFT hand coding, and yes, you are seeing an ES Quad core 2 at work, and yes, it is insanely fast. I do that as my hobby, and I am lucky enough to have some nice hardware. | |
| ID: 397900 | | |
|
No, not with 4 in parallel. | |
| ID: 397903 | | |
I figured out that SmartHeap 8.0 gives some nice % on the seti code, the heap allocation and stack allocation are pretty intensive, and smartheap gave a nice boost. Intel compiler 9.1 provides the support for MNI. The new versions of MKL and IPP support it too. 1 cycle per SSEx instruction is awesome, you can transform most of the algorithm and make it much more efficient. I am still exploring it, but the scaling looks WOW... Notice that I am using the machine that is crunching seti for compiling, web browsing, etc... the average of the machine just passed 2800 :) Seti is running in the background and I don't really feel it. Very exciting times! FrancoisP | |
| ID: 397908 | | |
|
http://setiathome.berkeley.edu/show_host_detail.php?hostid=2302665 | |
| ID: 397912 | | |
http://setiathome.berkeley.edu/show_host_detail.php?hostid=2302665 ok :) if you want to count like this, it is ok. Francois | |
| ID: 397914 | | |
Yes, these new Apple machines are nice, and very fast...but that is partially due to having FOUR processors. Take out three of those four processors and put it up against my AMD Athlon 64 FX 51 and let's see what happens when one processor is tested against one processor. Not trying to brag, start anything, or spread any 'flame' postings...just simply stating the facts. One processor -vs- one processor, fair is fair. If you're going to compare your FX-51 to the G5 Quads, at least put an optimized client on it first! Prior to anything on the Core microarchitecture, I haven't seen anything close to the Quads running v6, in terms of work unit times. (It's also worth noting that each of those four cores is running its own separate SETI process.) Also, I haven't seen anyone on SETI running a Mac Pro yet. I've been looking to test my Intel Mac clients on them for quite some time, but people I've talked to seem to be waiting to get their money's worth from their current machines before they take the plunge. Francois, I've noticed that your connection to Intel is more than just being on their SETI team and linking to their website, so I'm curious as to what your findings are about SETI performance. You mentioned SSE4--I was of the impression that most of the new instructions are for integer arithmetic, so which of these have actually been useful? Also, you mentioned the idea of hand-coding a replacement FFT and using SIMD to do four FFTs in parallel--is that actually faster than using SIMD for a single FFT at the lengths that SETI uses? And for my final question...who is "we?" :) | |
| ID: 397927 | | |
As you probably noticed, I have been playing with Seti since 2000. Seti is always an interesting problem of distributed computing, and the FFT is a challenge for my little brain by itself. If you look at the FFT using 4 vectors in parallel, you have to try to code your FFT in a way that minimizes the penalties: branching, memory footprint, and in the case of Core, you want to use as many 128-bit SSEx instructions as you can. To use SIMD efficiently, you want to move your data from Array of Structure to Structure of Structure. For example, in 3D, it is very common to store X,Y,Z,W in memory like this: X,Y,Z,W,X,Y,Z,W,X,Y,Z,W,X,Y,Z,W,X,Y,Z,W... (Array of Structure) The natural way to store your SIMD data is XXXXXXXXXXX.... YYYYYYYYYYYY... ZZZZZZZZ..... WWWWWWWWWWW... (Structure of Array) But this has the bad side effect of opening more memory streams, and most modern processors allow only 4 or 8 streams open at the same time. One of my co-workers, AlexK, came up with this data structure in 1998, called Structure of Structure: XXXX,YYYY,ZZZZ,WWWW,XXXX,YYYY,ZZZZ,WWWW,XXXX,YYYY,ZZZZ,WWWW... Like this, you access memory with only one or 2 streams, your data locality is tight, and your cache lines get really efficient. What I am doing today in the SETI code is simply trying to apply Alex's idea to the FFT. I'll need a few more weeks to get it done, it is a nice mind game, but it should dramatically increase the instructions per clock on the FFT side. Let's be clear, I am doing SETI for fun, I am a very happy/lucky man, my hobby and my job are very interlaced, I rarely have the feeling of working, Intel did not ask me to do anything on seti. Intel gives me access to the best toys I can dream of. Performance in general is a very interesting problem, and not only about computers, I do it as well on cars. Anybody who wants to help on the SIMDization of SETI is welcome :) FrancoisP | |
| ID: 397935 | | |
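A minimal sketch of the three layouts described in the post above, written as flat-index helpers. It assumes four components (X, Y, Z, W) and four 32-bit floats per 128-bit SSE register; the function names are hypothetical and none of this is taken from the actual SETI code. The blocked layout only works as written when the item count is a multiple of four.

```python
# Flat-index helpers for three ways of storing n four-component items.
# Assumption: LANES = 4, i.e. four 32-bit floats fill one 128-bit register.
COMPONENTS = 4  # X, Y, Z, W
LANES = 4       # floats per 128-bit register (assumed)


def aos_index(i, c):
    """Array of Structure: X,Y,Z,W,X,Y,Z,W,..."""
    return i * COMPONENTS + c


def soa_index(i, c, n):
    """Structure of Array: all X values, then all Y, then Z, then W."""
    return c * n + i


def sos_index(i, c):
    """Blocked 'Structure of Structure': XXXX,YYYY,ZZZZ,WWWW,XXXX,..."""
    block, lane = divmod(i, LANES)
    return block * COMPONENTS * LANES + c * LANES + lane


def layout(n, index_fn):
    """Return the component letters in memory order for n items."""
    out = [None] * (n * COMPONENTS)
    for i in range(n):
        for c in range(COMPONENTS):
            out[index_fn(i, c)] = "XYZW"[c]
    return "".join(out)


if __name__ == "__main__":
    n = 8  # must be a multiple of LANES for the blocked layout
    print("AoS:", layout(n, aos_index))
    print("SoA:", layout(n, lambda i, c: soa_index(i, c, n)))
    print("SoS:", layout(n, sos_index))
```

In the blocked layout each group of four X values sits next to its four Y, Z and W values, so one 64-byte run of memory feeds four full registers, which is the locality argument made in the post.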
|
Very interesting stuff, although I must admit it is far beyond my level of knowledge. Hopefully Simon can make use of some of these ideas or coding schemes in some of his upcoming releases. I very much appreciate the fact that Simon's approach is to elicit input from other programmers, and work together with them for a common cause. He has already done some great work, but who knows what working with other like minded people could come up with? Thank You to Simon and all you others who are willing to share your expertise and work along with him!! | |
| ID: 397939 | | |
http://setiathome.berkeley.edu/show_host_detail.php?hostid=2302665 Thank you, even people with an Intel or an AMD Dual Core or Core 2 need to realize this fact...you really are running multiple CPUs, they are just packaged into one piece of hardware. (I must admit, my next home unit may just have one of those Dual Core or Core 2 AMD Chips...I just have to mow a heck of a lot of yards over the summer to afford the hardware!!!) ____________ | |
| ID: 398050 | | |
|
wow.. Kentsfield onboard =) i'm impressed =] i just sold my X2 3800+ and i'm getting an 6400 + gigabyte DS3 =] hoping to get an 3,6 Ghz 24/7 stable overclock... =] | |
| ID: 398068 | | |
http://setiathome.berkeley.edu/show_host_detail.php?hostid=2302665 This is not a mystery, Seti knows how many cpus you have. Just click on the computer id and look at the computer summary screen. It reports my Conroe as having 2 cpus, which it does, and Francois' cpu reports 4 cpus, which it has. It just shows up under one computer (host) id. But it is just like having multiple computers installed in one piece of hardware. Kind of like having a small Seti crunching team in one computer. ____________ 4 kitties on a Seti mission...Meeeeeeooowwwrrrrr!!! The Genuine Kittyman..........accept no substitutes. ![]() | |
| ID: 398105 | | |
|
BTW, would Francois' processor be considered a core 2 quad? I didn't think they had been released yet. Heck, you can't hardly even buy an E6600 or E6700 off the shelf yet. Or do his connections to Intel get him an engineering sample or such? | |
| ID: 398109 | | |
|
I think he works for Intel? | |
| ID: 398124 | | |
I think he works for Intel? Sure Francois is trying to prove a point! He's trying to prove that Intel finally has a butt-kicking architecture available with the new Core 2 cpus. After all, he works for Intel, and I am sure he is excited about what is new and cool in computing, 'cuz Intel is it. I'm sure excited about it, my new X6800 is doing things my AMD FX60 can't even touch! I think what Francois is doing is absolutely fantastic!! Even if the rest of us cannot afford some of the grand toys that he has access to directly from Intel, what he is doing scales down to the processors that are coming on the market in a price range most of us can afford. He has not tried to hide the fact that he works for Intel, and he has already said that Intel did not instruct him to work on Seti, I truly believe he is doing this as a very excited hobbyist. And the manner in which he is doing it is beyond reproach...being openly willing to share optimized code. As far as I am concerned, Francois can beat the Intel drum all he wants. What could be better than an Intel insider who is willing to work with Simon on his optimized apps? This is win-win for everybody! ____________ 4 kitties on a Seti mission...Meeeeeeooowwwrrrrr!!! The Genuine Kittyman..........accept no substitutes. ![]() | |
| ID: 398187 | | |
|
Indeed. Parallel is the way to go (us boinc users should know a thing or two about that), and working towards using this kind of parallelism effectively is a great step forward. One day, when we're all using dual-processor machines, with each processor being quad-core, and each of those cores hyperthreaded, we'll still be benefitting from this work (I just don't want to be the one to write the task balancing and task migration code for such a beast, with all the different penalties of migrating a task between hyperthreads on the same core, between cores on the same processor, or between processors etc. *g*). | |
| ID: 398766 | | |
[...] Salut Francois, do you believe this could also be adapted for pre-Core 2 CPUs, with 2 FFTs in parallel instead of 4? I'm pretty sure that current code does not specifically do this, as Ben Herndon pointed out to me - you may be interested in his (and Dr. Korpela's) Sourceforge project (regrettably, it's not current code, but has lots of inline assembly as well as some specific code to feed execution units in parallel with minimal penalties). As others have posted here, I'm very much in favour of getting all people working on optimizations in contact with each other (and hence, pooling resources towards a common goal). Regrettably, I still cannot code C/C++ or indeed know assembly, though those will be skills to acquire in the future. If you would like, you could head over to my Seti@Home site and register - I would be glad to give you access to the pre-release application board and have your input. There is already another Intel employee registered - Intel being quite a large company though, you probably don't know each other - his name is Greg Eckert, and he works as Instructor training manager in the Intel Software College. The more, the merrier ;o) Regards, Simon. ____________ Donate to SETI@Home via PayPal! Optimized SETI@Home apps + Information | |
| ID: 398788 | | |
And the manner in which he is doing it is beyond reproach...being openly willing to share optimized code. I don't mean to be the resident nitpicker, but as I was reading this, I started to get the wrong impression. "Beyond reproach" is a bad thing, as defined by a dictionary: Noun: reproach That isn't what you meant, is it? ____________ BOINC FAQ Service BOINC & Optimized SETI download repository | |
| ID: 398868 | | |
And the manner in which he is doing it is beyond reproach...being openly willing to share optimized code. The word "beyond" modifies the meaning. Now he cannot be reproached. Now it's a compliment. ____________ Boinc....Boinc....Boinc....Boinc | |
| ID: 398872 | | |
The word "beyond" modifies the meaning. Now he cannot be reproached. Now it's a compliment. Interesting. I took "beyond" to modify it differently, such as "beyond disgrace" or "beyond criticism", like going "beyond the depths of hell". Like a criticism worse than disgrace or contempt. ____________ BOINC FAQ Service BOINC & Optimized SETI download repository | |
| ID: 398952 | | |
The word "beyond" modifies the meaning. Now he cannot be reproached. Now it's a compliment. Your logic is good, your awareness of common usage is less good. It is a common enough idiom to make it into the American Heritage Dictionary, thusly: IDIOM: beyond reproach So good as to preclude any possibility of criticism. ____________ | |
| ID: 399023 | | |
|
Please, let's not digress ;o) | |
| ID: 399033 | | |
If you look at the FFT using 4 vectors in parallel, you have to try to code your FFT in a way that minimizes the penalties: branching, memory footprint, and in the case of Core, you want to use as many 128-bit SSEx instructions as you can. It sounds like you're planning to write your own FFT implementation. Are you using some other implementation as a base, or are you starting from first principles? I'd always assumed that IPP's FFT performance was already pretty good, and keeping your data in split-complex format already eliminates most of your data shuffling, while still only using two memory streams. Also, what effects does this have with regard to the amount of memory touched per FFT (or group of FFTs)? If you're doing four 128K complex-to-complex in-place FFTs simultaneously, you're touching 4 MB per FFT per core, which is already pushing the limits of L2 cache. | |
| ID: 399104 | | |
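The 4 MB figure in the post above can be checked with a little arithmetic. This is a rough sketch that assumes single-precision complex samples (two 32-bit floats, 8 bytes per point) and counts only the in-place data, not twiddle tables or scratch buffers; the 4 MiB L2 size used for comparison is likewise an assumption for illustration, not a measured value.

```python
# Assumption: one complex sample = two 32-bit floats = 8 bytes.
BYTES_PER_COMPLEX = 8


def fft_working_set_bytes(points, batch=1, bytes_per_point=BYTES_PER_COMPLEX):
    """Bytes of sample data touched by `batch` in-place complex FFTs.

    Counts only the input/output buffers; twiddle factors and any
    scratch space would add to this.
    """
    return points * bytes_per_point * batch


if __name__ == "__main__":
    points = 128 * 1024              # a 128K-point FFT
    assumed_l2 = 4 * 1024 * 1024     # hypothetical L2 size in bytes
    for batch in (1, 4):
        size = fft_working_set_bytes(points, batch)
        print(f"{batch} x 128K FFT: {size / 2**20:.0f} MiB, "
              f"{size / assumed_l2:.0%} of the assumed L2")
```

Under these assumptions one 128K FFT touches 1 MiB and a group of four touches 4 MiB, so the batch alone would fill a cache of that size before any tables are counted.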
|
I run 4 cpu's myself and 16 Gig of ram. So far it's been really good. I run seti 24/7 and none of my other programs have slowed down while seti is running. | |
| ID: 399110 | | |
I run 4 cpu's myself and 16 Gig of ram. So far it's been really good. I run seti 24/7 and none of my other programs have slowed down while seti is running. Based on your rendering time and your log, you are not using optimized code. Please go to Simon's web site, get the SSE2 or SSE3 binaries, and install them. You have to drop the XML and his faster binary in the Seti project directory. Good luck ;) FrancoisP | |
| ID: 399160 | | |
And the manner in which he is doing it is beyond reproach...being openly willing to share optimized code. Oh, lord no! I posted that shortly before I left for work today, just got home and started catching up on the forums. What I meant to say, I think, was "above reproach", meaning that I do not believe his actions can, could, or should be criticized. I hope that the overall tone of my post conveyed the proper sentiment. And no, I am not at all offended by you questioning my wording. I am certainly not an English major. EDIT....and after reading the rest of the posts on the subject, maybe my usage was OK after all. I think in the context of the whole post at least, my intended meaning came through. But I would rather this post continue with the open discussions of Alex, Simon, and Francois. Or any other posts concerning the number crunching optimization they are working on. ____________ 4 kitties on a Seti mission...Meeeeeeooowwwrrrrr!!! The Genuine Kittyman..........accept no substitutes. ![]() | |
| ID: 399178 | | |
http://setiathome.berkeley.edu/show_host_detail.php?hostid=2302665 Back in the day, a single CPU was hundreds of vacuum tubes or a few thousand transistors. Then we got integrated circuits, and a single (more capable) CPU was thousands of chips spread across dozens of circuit boards.... ... then we could squeeze all of that onto one board. Then the 4004 came out, and we had all of that on one chip. Then dual-core chips which behave exactly like you had two chips on the same motherboard. Quad-core chips are no different. It seems intuitively obvious that the CPU is not packaging, and I see no valid engineering reason to call a chip with four CPUs anything other than four CPUs. ____________ | |
| ID: 399398 | | |
http://setiathome.berkeley.edu/show_host_detail.php?hostid=2302665 Somewhat later, many boards had a 386 for integer calculations and a 387 FPU for floating point. Then the FPU was integrated, then a second FPU was integrated. How do we count these? | |
| ID: 399409 | | |
http://setiathome.berkeley.edu/show_host_detail.php?hostid=2302665 As one processor. An 80387 by itself is practically useless. Packaging does not count (unless you are in marketing, then it's the only thing). ____________ | |
| ID: 399443 | | |
EDIT....and after reading the rest of the posts on the subject, maybe my usage was OK after all. I think in the context of the whole post at least, my intended meaning came through. But I would rather this post continue with the open discussions of Alex, Simon, and Francois. Or any other posts concerning the number crunching optimization they are working on. Fair enough. Apparently I was wrong, as I was not aware of this "common idiom" (though, you'd think if it were so common, I would have heard of it, but I digress). My apologies for hijacking the thread on this matter, 'twas not my intention. This will be the last I speak of on this matter. Please continue with the original topic. ____________ BOINC FAQ Service BOINC & Optimized SETI download repository | |
| ID: 399550 | | |
Somewhat later, many boards had a 386 for integer calculations and a 387 FPU for floating point. I'd have to agree with Ned here. An 80387 is actually a co-processor, not a central processing unit, meaning it requires a main processor to operate. Thusly, on multicore processors, they have multiple CPUs (all being main processors capable of individual calculations without requiring a host processor) in one packaging. Essentially a dual core processor has two CPUs in one package. You could disable one through software (theoretically) and still be able to operate with the other CPU. ____________ BOINC FAQ Service BOINC & Optimized SETI download repository | |
| ID: 399561 | | |
It sounds like you're planning to write your own FFT implementation. Are you using some other implementation as a base, or are you starting from first principles? I'd always assumed that IPP's FFT performance was already pretty good, and keeping your data in split-complex format already eliminates most of your data shuffling, while still only using two memory streams. Hi Alex (and Francois), Back in the day...pre FFTW3...I converted ooura's FFT to simd. (SSE only). It's on the sourceforge pages. Didn't get around to benchmarking it against FFTW3 but I think the ooura SIMD was about the same speed as intel's (at that time). The benchmark pages on www.fftw.org showed that FFTW beat everyone's FFTs except intel's IPP in speed. The main problems with FFT at the larger sizes (32K, 64K, 128K...) were memory and cache access times...although with Hypertransport and DDR2 that may no longer be the case. Reorganizing in Francois' way avoids a lot of twiddling...and the SSE3 opcodes for sideways adds and subs should also speed things up. But the biggest boost I believe, would be some method to compute passes over L1 or L2 cache sized blocks of data. These would have to include all memory used for the computation, and somehow localizing it in blocks. Just my 2c P.S.: Hey Francois...you work at Intel...obviously you are a coder...probably a coder at intel also. Maybe you can get them to change the IPP Libraries CPU identification code to remove that check for "GenuineIntel" and just check the flags for SSE, SSE2, and SSE3 on any CPU brand. ;) | |
| ID: 399616 | | |
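The P.S. above contrasts choosing a code path by vendor string with choosing it by feature flags. The sketch below illustrates that difference only; it is not IPP's dispatch code, and the function names and the `require_intel` switch are hypothetical. The bit positions are the standard CPUID leaf 1 ones (EDX bit 25 for SSE, EDX bit 26 for SSE2, ECX bit 0 for SSE3). The register values are passed in as plain integers because reading CPUID itself needs native code.

```python
# CPUID leaf 1 feature bits (standard x86 positions).
SSE_BIT_EDX = 25
SSE2_BIT_EDX = 26
SSE3_BIT_ECX = 0


def simd_features(edx, ecx):
    """Decode SSE/SSE2/SSE3 support from CPUID leaf 1 EDX and ECX values."""
    return {
        "sse": bool((edx >> SSE_BIT_EDX) & 1),
        "sse2": bool((edx >> SSE2_BIT_EDX) & 1),
        "sse3": bool((ecx >> SSE3_BIT_ECX) & 1),
    }


def pick_code_path(vendor, edx, ecx, require_intel=False):
    """Pick the best code path (hypothetical dispatcher).

    With require_intel=True a non-Intel vendor string forces the generic
    path; with the default, only the feature flags matter.
    """
    if require_intel and vendor != "GenuineIntel":
        return "generic"
    features = simd_features(edx, ecx)
    for name in ("sse3", "sse2", "sse"):  # best first
        if features[name]:
            return name
    return "generic"


if __name__ == "__main__":
    # Hypothetical register values for a CPU with SSE, SSE2 and SSE3.
    edx = (1 << SSE_BIT_EDX) | (1 << SSE2_BIT_EDX)
    ecx = 1 << SSE3_BIT_ECX
    print(pick_code_path("AuthenticAMD", edx, ecx))
    print(pick_code_path("AuthenticAMD", edx, ecx, require_intel=True))
```

With the flag-only check, any CPU that reports the three bits gets the SSE3 path regardless of brand, which is the behaviour the P.S. asks for.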
| Copyright © 2009 University of California |