Are you ready for the next generation CPU?

Benher
Volunteer developer
Volunteer tester
Joined: 25 Jul 99
Posts: 517
Credit: 465,152
RAC: 0
United States
Message 399616 - Posted: 16 Aug 2006, 21:36:42 UTC - in response to Message 399104.  
Last modified: 16 Aug 2006, 21:39:20 UTC

It sounds like you're planning to write your own FFT implementation. Are you using some other implementation as a base, or are you starting from first principles? I'd always assumed that IPP's FFT performance was already pretty good, and keeping your data in split-complex format already eliminates most of your data shuffling, while still only using two memory streams.

Also, what effects does this have with regard to the amount of memory touched per FFT (or group of FFTs)? If you're doing four 128K complex-to-complex in-place FFTs simultaneously, you're touching 4 MB per FFT per core, which is already pushing the limits of L2 cache.


Hi Alex (and Francois),

Back in the day...pre FFTW3...I converted Ooura's FFT to SIMD (SSE only). It's on the SourceForge pages.

Didn't get around to benchmarking it against FFTW3, but I think the Ooura SIMD version was about the same speed as Intel's (at that time).

The benchmark pages on www.fftw.org showed that FFTW beat everyone's FFTs except Intel's IPP in speed.

The main problems with FFT at the larger sizes (32K, 64K, 128K...) were memory and cache access times...although with HyperTransport and DDR2 that may no longer be the case. Reorganizing the data Francois's way avoids a lot of twiddling...and the SSE3 opcodes for sideways (horizontal) adds and subs should also speed things up.

But the biggest boost, I believe, would be some method to compute passes over L1- or L2-cache-sized blocks of data. Each block would have to include all the memory used for the computation, somehow localized within it.
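
To make the blocking idea concrete, here is a minimal sketch in Python. It is an illustration only, not code from Ooura's FFT, FFTW, IPP, or the SETI sources. It assumes a plain radix-2 decimation-in-time FFT, and the names `blocked_fft` and `block` are made up for this example. After the bit-reversal shuffle, every butterfly stage whose span is no larger than `block` stays inside one aligned block, so all of those stages can be finished on one block before the next block is touched.

```python
import cmath


def _bit_reverse_permute(a):
    """Reorder a (length a power of two) into bit-reversed index order."""
    n = len(a)
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]


def _stage(a, start, stop, size):
    """Run one radix-2 butterfly stage of span `size` over a[start:stop]."""
    half = size // 2
    for base in range(start, stop, size):
        for k in range(half):
            w = cmath.exp(-2j * cmath.pi * k / size)
            u = a[base + k]
            v = a[base + k + half] * w
            a[base + k] = u + v
            a[base + k + half] = u - v


def blocked_fft(x, block=1024):
    """Forward FFT of x; len(x) and block must both be powers of two."""
    a = list(x)
    n = len(a)
    if n == 0 or n & (n - 1) or block < 2 or block & (block - 1):
        raise ValueError("length and block must be powers of two")
    block = min(block, n)
    _bit_reverse_permute(a)
    # Phase 1: finish every stage that fits inside a block, one block at a time.
    for start in range(0, n, block):
        size = 2
        while size <= block:
            _stage(a, start, start + block, size)
            size *= 2
    # Phase 2: the remaining stages span several blocks and sweep the whole array.
    size = 2 * block
    while size <= n:
        _stage(a, 0, n, size)
        size *= 2
    return a
```

In this sketch only phase 1 is cache-friendly. The bit-reversal pass and the phase 2 stages still walk the whole array, so a real implementation would need a more elaborate decomposition to keep those localized too.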

Just my 2c

P.S.: Hey Francois...you work at Intel...obviously you are a coder...probably a coder at Intel also. Maybe you can get them to change the IPP libraries' CPU identification code to remove that check for "GenuineIntel" and just check the flags for SSE, SSE2, and SSE3 on any CPU brand. ;)
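
As a rough sketch of what "just check the flags" could look like, here is a small Python example. It says nothing about how IPP actually dispatches; the function names are hypothetical. It assumes the feature list comes from Linux's /proc/cpuinfo, where SSE3 is reported under the name "pni". The point is only that the code path is chosen from the feature flags and the vendor string is never consulted.

```python
def flags_from_cpuinfo(text):
    """Extract the first 'flags' line from /proc/cpuinfo-style text."""
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return value.split()
    return []


def simd_levels(flags):
    """Report which of SSE/SSE2/SSE3 are advertised, ignoring the vendor."""
    present = set(flags)
    return {
        "sse": "sse" in present,
        "sse2": "sse2" in present,
        "sse3": "pni" in present,  # Linux lists SSE3 as "pni"
    }


def pick_code_path(flags):
    """Choose the most capable code path from the feature flags alone."""
    levels = simd_levels(flags)
    for name in ("sse3", "sse2", "sse"):
        if levels[name]:
            return name
    return "generic"
```

On a Linux box, `pick_code_path(flags_from_cpuinfo(open("/proc/cpuinfo").read()))` would return the same answer for any CPU brand that advertises the same flags.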
ID: 399616 · Report as offensive
OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 15687
Credit: 84,761,841
RAC: 62
United States
Message 399561 - Posted: 16 Aug 2006, 20:26:10 UTC - in response to Message 399409.  

Somewhat later, many boards had a 386 for integer calculations and a 387 FPU for floating point.

Then the FPU was integrated, then a second FPU was integrated.

How do we count these?


I'd have to agree with Ned here. An 80387 is actually a co-processor, not a central processing unit, meaning it requires a main processor to operate.

Thusly, multicore processors have multiple CPUs (all being main processors capable of individual calculations without requiring a host processor) in one package. Essentially a dual core processor has two CPUs in one package. You could disable one through software (theoretically) and still be able to operate with the other CPU.
ID: 399561 · Report as offensive
OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 15687
Credit: 84,761,841
RAC: 62
United States
Message 399550 - Posted: 16 Aug 2006, 20:21:07 UTC - in response to Message 399178.  

EDIT....and after reading the rest of the posts on the subject, maybe my usage was OK after all. I think in the context of the whole post at least, my intended meaning came through. But I would rather this post continue with the open discussions of Alex, Simon, and Francois. Or any other posts concerning the number crunching optimization they are working on.


Fair enough. Apparently I was wrong, as I was not aware of this "common idiom" (though, you'd think if it were so common, I would have heard of it, but I digress). My apologies for hijacking the thread on this matter, 'twas not my intention. This will be the last I speak on this matter.

Please continue with the original topic.
ID: 399550 · Report as offensive
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 399443 - Posted: 16 Aug 2006, 17:20:59 UTC - in response to Message 399409.  

http://setiathome.berkeley.edu/show_host_detail.php?hostid=2302665

It may be a single die, but four cores equals four CPUs.


ok :) if you want to count like this, it is ok.

Francois

Back in the day, a single CPU was hundreds of vacuum tubes or a few thousand transistors.

Then we got integrated circuits, and a single (more capable) CPU was thousands of chips spread across dozens of circuit boards....

... then we could squeeze all of that onto one board.

Then the 4004 came out, and we had all of that on one chip.


Somewhat later, many boards had a 386 for integer calculations and a 387 FPU for floating point.

Then the FPU was integrated, then a second FPU was integrated.

How do we count these?


As one processor. An 80387 by itself is practically useless.

Packaging does not count (unless you are in marketing, then it's the only thing).
ID: 399443 · Report as offensive
Josef W. Segur
Volunteer developer
Volunteer tester
Joined: 30 Oct 99
Posts: 4504
Credit: 1,414,761
RAC: 0
United States
Message 399409 - Posted: 16 Aug 2006, 16:45:17 UTC - in response to Message 399398.  

http://setiathome.berkeley.edu/show_host_detail.php?hostid=2302665

It may be a single die, but four cores equals four CPUs.


ok :) if you want to count like this, it is ok.

Francois

Back in the day, a single CPU was hundreds of vacuum tubes or a few thousand transistors.

Then we got integrated circuits, and a single (more capable) CPU was thousands of chips spread across dozens of circuit boards....

... then we could squeeze all of that onto one board.

Then the 4004 came out, and we had all of that on one chip.


Somewhat later, many boards had a 386 for integer calculations and a 387 FPU for floating point.

Then the FPU was integrated, then a second FPU was integrated.

How do we count these?
ID: 399409 · Report as offensive
1mp0£173
Volunteer tester
Joined: 3 Apr 99
Posts: 8423
Credit: 356,897
RAC: 0
United States
Message 399398 - Posted: 16 Aug 2006, 16:35:54 UTC - in response to Message 397914.  
Last modified: 16 Aug 2006, 16:36:15 UTC

http://setiathome.berkeley.edu/show_host_detail.php?hostid=2302665

It may be a single die, but four cores equals four CPUs.


ok :) if you want to count like this, it is ok.

Francois

Back in the day, a single CPU was hundreds of vacuum tubes or a few thousand transistors.

Then we got integrated circuits, and a single (more capable) CPU was thousands of chips spread across dozens of circuit boards....

... then we could squeeze all of that onto one board.

Then the 4004 came out, and we had all of that on one chip.

Then came dual-core chips, which behave exactly as if you had two chips on the same motherboard.

Quad-core chips are no different.

It seems intuitively obvious that the CPU is not packaging, and I see no valid engineering reason to call a chip with four CPUs anything other than four CPUs.
ID: 399398 · Report as offensive
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 50494
Credit: 1,018,363,574
RAC: 2,276
United States
Message 399178 - Posted: 16 Aug 2006, 7:09:31 UTC - in response to Message 398868.  
Last modified: 16 Aug 2006, 7:17:04 UTC

And the manner in which he is doing it is beyond reproach...being openly willing to share optimized code.


I don't mean to be the resident nitpicker, but as I was reading this, I started to get the wrong impression. "Beyond reproach" is a bad thing, as defined by a dictionary:

Noun: reproach
1. A mild rebuke or criticism; "Words of reproach"
2. Disgrace or shame; "He brought reproach upon his family"

Verb: reproach
1. Express criticism towards; "The President reproached the General for his irresponsible behavior"


That isn't what you meant, is it?


Oh, lord no! I posted that shortly before I left for work today, just got home and started catching up on the forums.
What I meant to say, I think, was "above reproach", meaning that I do not believe his actions can, could, or should be criticized. I hope that the overall tone of my post conveyed the proper sentiment.
And no, I am not at all offended by you questioning my wording. I am certainly not an English major.
EDIT....and after reading the rest of the posts on the subject, maybe my usage was OK after all. I think in the context of the whole post at least, my intended meaning came through. But I would rather this post continue with the open discussions of Alex, Simon, and Francois. Or any other posts concerning the number crunching optimization they are working on.
"Learn from yesterday. Live for today. Hope for tomorrow." Albert Einstein
"With cats." kittyman

ID: 399178 · Report as offensive
Francois Piednoel
Joined: 14 Jun 00
Posts: 898
Credit: 5,969,361
RAC: 0
United States
Message 399160 - Posted: 16 Aug 2006, 4:47:03 UTC - in response to Message 399110.  
Last modified: 16 Aug 2006, 4:47:56 UTC

I run 4 CPUs myself and 16 GB of RAM. So far it's been really good: I run SETI 24/7 and none of my other programs have slowed down while SETI is running.


Based on your rendering time and your log, you are not using optimized code. Please go to Simon's web site, get the SSE2 or SSE3 binaries, and install them.
You have to drop the XML file and his faster binary into the SETI project directory.

good luck ;)
FrancoisP
ID: 399160 · Report as offensive
Randy Hancock
Joined: 10 Aug 06
Posts: 169
Credit: 220,579
RAC: 0
United States
Message 399110 - Posted: 16 Aug 2006, 3:13:00 UTC

I run 4 CPUs myself and 16 GB of RAM. So far it's been really good: I run SETI 24/7 and none of my other programs have slowed down while SETI is running.
ID: 399110 · Report as offensive
Alex Kan
Volunteer developer
Joined: 4 Dec 03
Posts: 127
Credit: 29,269
RAC: 0
United States
Message 399104 - Posted: 16 Aug 2006, 3:04:45 UTC - in response to Message 397935.  
Last modified: 16 Aug 2006, 3:10:14 UTC

If you look at the FFT using 4 vectors in parallel, you have to try to code your FFT in a way that minimizes the penalties: branching and memory footprint. And in the case of Core, you want to use as many 128-bit SSEx instructions as you can.

<snip>

One of my co-workers, AlexK, came up with this data structure in 1998, called Structure of Structure:

XXXX,YYYY,ZZZZ,WWWW,XXXX,YYYY,ZZZZ,WWWW,XXXX,YYYY,ZZZZ,WWWW...

Like this, you access memory with only one or 2 streams, your data locality is tight, and your cache lines get really efficient.

What I am doing today in the SETI code is simply trying to apply Alex's idea to the FFT.
I'll need a few more weeks to get it done. It is a nice mind game, but it should dramatically increase the instructions per clock on the FFT side.

It sounds like you're planning to write your own FFT implementation. Are you using some other implementation as a base, or are you starting from first principles? I'd always assumed that IPP's FFT performance was already pretty good, and keeping your data in split-complex format already eliminates most of your data shuffling, while still only using two memory streams.

Also, what effects does this have with regard to the amount of memory touched per FFT (or group of FFTs)? If you're doing four 128K complex-to-complex in-place FFTs simultaneously, you're touching 4 MB per FFT per core, which is already pushing the limits of L2 cache.
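
For reference, the arithmetic behind that footprint can be written out. This is a back-of-the-envelope sketch that assumes single-precision data (4 bytes per float, two floats per complex point) and counts only the sample data, not twiddle tables or scratch space. The helper name `fft_bytes` is invented for the example.

```python
def fft_bytes(points, batch=1, bytes_per_float=4):
    """Bytes of sample data touched by `batch` in-place complex FFTs.

    Each complex point is two floats (real and imaginary parts).
    """
    return points * 2 * bytes_per_float * batch


MIB = 1024 * 1024
one_fft = fft_bytes(128 * 1024)             # a single 128K-point FFT
four_ffts = fft_bytes(128 * 1024, batch=4)  # four of them side by side
print(one_fft / MIB, four_ffts / MIB)       # prints 1.0 4.0
```

Under those assumptions each 128K FFT touches 1 MB, and it is the group of four running together that reaches 4 MB.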
ID: 399104 · Report as offensive
KWSN - Chicken of Angnor
Volunteer developer
Volunteer tester
Joined: 9 Jul 99
Posts: 1199
Credit: 6,615,780
RAC: 0
Austria
Message 399033 - Posted: 16 Aug 2006, 2:15:00 UTC

Please, let's not digress ;o)

Constructive intellectual exchange is never a bad thing, though grasp of language or lack thereof kind of wasn't the original topic.

Anyway, I'd be interested to know whether I'm correct in the assumption that fundamentally, quad-core Core2 chips are feature-identical to current dual-core models.

I'm not sure whether this is information that is still under NDA or not, so of course I understand if you cannot answer, Francois ;o)

Regards,
Simon.
Donate to SETI@Home via PayPal!

Optimized SETI@Home apps + Information
ID: 399033 · Report as offensive
archae86
Joined: 31 Aug 99
Posts: 909
Credit: 1,582,816
RAC: 0
United States
Message 399023 - Posted: 16 Aug 2006, 2:06:01 UTC - in response to Message 398952.  
Last modified: 16 Aug 2006, 2:06:47 UTC

The word "beyond" modifies the meaning. Now he cannot be reproached. Now it's a compliment.


Interesting. I took "beyond" to modify it differently, such as "beyond disgrace" or "beyond criticism", like going "beyond the depths of hell". Like a criticism worse than disgrace or contempt.

Your logic is good, your awareness of common usage is less good.

It is a common enough idiom to make it into the American Heritage Dictionary, thusly:

IDIOM: beyond reproach So good as to preclude any possibility of criticism.

ID: 399023 · Report as offensive
OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 15687
Credit: 84,761,841
RAC: 62
United States
Message 398952 - Posted: 15 Aug 2006, 23:35:47 UTC - in response to Message 398872.  

The word "beyond" modifies the meaning. Now he cannot be reproached. Now it's a compliment.


Interesting. I took "beyond" to modify it differently, such as "beyond disgrace" or "beyond criticism", like going "beyond the depths of hell". Like a criticism worse than disgrace or contempt.
ID: 398952 · Report as offensive
Geek@Play
Volunteer tester
Joined: 31 Jul 01
Posts: 2467
Credit: 86,146,931
RAC: 0
United States
Message 398872 - Posted: 15 Aug 2006, 22:09:04 UTC - in response to Message 398868.  
Last modified: 15 Aug 2006, 22:09:56 UTC

And the manner in which he is doing it is beyond reproach...being openly willing to share optimized code.


I don't mean to be the resident nitpicker, but as I was reading this, I started to get the wrong impression. "Beyond reproach" is a bad thing, as defined by a dictionary:

Noun: reproach
1. A mild rebuke or criticism; "Words of reproach"
2. Disgrace or shame; "He brought reproach upon his family"

Verb: reproach
1. Express criticism towards; "The President reproached the General for his irresponsible behavior"


That isn't what you meant, is it?


The word "beyond" modifies the meaning. Now he cannot be reproached. Now it's a compliment.



Boinc....Boinc....Boinc....Boinc....
ID: 398872 · Report as offensive
OzzFan
Volunteer tester
Joined: 9 Apr 02
Posts: 15687
Credit: 84,761,841
RAC: 62
United States
Message 398868 - Posted: 15 Aug 2006, 22:03:26 UTC - in response to Message 398187.  

And the manner in which he is doing it is beyond reproach...being openly willing to share optimized code.


I don't mean to be the resident nitpicker, but as I was reading this, I started to get the wrong impression. "Beyond reproach" is a bad thing, as defined by a dictionary:

Noun: reproach
1. A mild rebuke or criticism; "Words of reproach"
2. Disgrace or shame; "He brought reproach upon his family"

Verb: reproach
1. Express criticism towards; "The President reproached the General for his irresponsible behavior"


That isn't what you meant, is it?
ID: 398868 · Report as offensive
KWSN - Chicken of Angnor
Volunteer developer
Volunteer tester
Joined: 9 Jul 99
Posts: 1199
Credit: 6,615,780
RAC: 0
Austria
Message 398788 - Posted: 15 Aug 2006, 19:46:10 UTC - in response to Message 397935.  

[...]
If you look at the FFT using 4 vectors in parallel, you have to try to code your FFT in a way that minimizes the penalties: branching and memory footprint. And in the case of Core, you want to use as many 128-bit SSEx instructions as you can.
To use SIMD efficiently, you want to move your data from Array of Structure to Structure of Structure.

For example, in 3D, it is very common to store X,Y,Z,W in memory like this:
X,Y,Z,W,X,Y,Z,W,X,Y,Z,W,X,Y,Z,W,X,Y,Z,W... (Array of Structure)

The natural way to store your SIMD data is
XXXXXXXXXXX....
YYYYYYYYYYYY...
ZZZZZZZZ.....
WWWWWWWWWWW... (Structure of Array)

But this has the bad side effect of opening more memory streams, and most modern processors allow only 4 or 8 streams open at the same time.
One of my co-workers, AlexK, came up with this data structure in 1998, called Structure of Structure:

XXXX,YYYY,ZZZZ,WWWW,XXXX,YYYY,ZZZZ,WWWW,XXXX,YYYY,ZZZZ,WWWW...

Like this, you access memory with only one or 2 streams, your data locality is tight, and your cache lines get really efficient.
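
The three layouts can be sketched in a few lines of Python, with flat lists standing in for memory. This is only an illustration of the index arithmetic, not the SETI code. The function names are invented here, and the block width of 4 assumes one 128-bit SSE register holding four single-precision floats. Elsewhere the blocked layout is sometimes called "array of structures of arrays".

```python
LANES = 4  # assumption: four single-precision floats per 128-bit register


def to_aos(points):
    """Array of Structure: X,Y,Z,W,X,Y,Z,W,... in one stream."""
    return [c for p in points for c in p]


def to_soa(points):
    """Structure of Array: four separate streams, one per component."""
    return [[p[k] for p in points] for k in range(4)]


def to_blocked(points, lanes=LANES):
    """Blocked layout: XXXX,YYYY,ZZZZ,WWWW,XXXX,... in one stream.

    Assumes len(points) is a multiple of `lanes`.
    """
    if len(points) % lanes:
        raise ValueError("point count must be a multiple of lanes")
    out = []
    for base in range(0, len(points), lanes):
        group = points[base:base + lanes]
        for k in range(4):
            out.extend(p[k] for p in group)
    return out


def blocked_index(i, k, lanes=LANES):
    """Flat offset of component k (0..3) of point i in the blocked layout."""
    return (i // lanes) * (4 * lanes) + k * lanes + (i % lanes)
```

With this arrangement each run of four equal components can be loaded as one vector, while everything still sits in a single contiguous stream.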
[...]
Anybody who wants to help with the SIMDization of SETI is welcome :)

FrancoisP


Salut Francois,

do you believe this could also be adapted for pre-Core 2 CPUs, with 2 FFTs in parallel instead of 4? I'm pretty sure that current code does not specifically do this, as Ben Herndon pointed out to me - you may be interested in his (and Dr. Korpela's) Sourceforge project (regrettably, it's not current code, but has lots of inline assembly as well as some specific code to feed execution units in parallel with minimal penalties).

As others have posted here, I'm very much in favour of getting all people working on optimizations in contact with each other (and hence, pooling resources towards a common goal). Regrettably, I still cannot code C/C++ or indeed know assembly, though those will be skills to acquire in the future.

If you would like, you could head over to my Seti@Home site and register - I would be glad to give you access to the pre-release application board and have your input.

There is already another Intel employee registered - Intel being quite a large company though, you probably don't know each other - his name is Greg Eckert, and he works as Instructor training manager in the Intel Software College.

The more, the merrier ;o)

Regards,
Simon.
Donate to SETI@Home via PayPal!

Optimized SETI@Home apps + Information
ID: 398788 · Report as offensive
Bart Barenbrug
Joined: 7 Jul 04
Posts: 52
Credit: 337,401
RAC: 0
Netherlands
Message 398766 - Posted: 15 Aug 2006, 19:28:54 UTC

Indeed. Parallel is the way to go (we BOINC users should know a thing or two about that), and working towards using this kind of parallelism effectively is a great step forward. One day, when we're all using dual-processor machines, with each processor being quad-core, and each of those cores hyperthreaded, we'll still be benefiting from this work (I just don't want to be the one to write the task balancing and task migration code for such a beast, with all the different penalties of migrating a task between hyperthreads on the same core, between cores on the same processor, or between processors etc. *g*).
ID: 398766 · Report as offensive
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 50494
Credit: 1,018,363,574
RAC: 2,276
United States
Message 398187 - Posted: 15 Aug 2006, 16:50:09 UTC - in response to Message 398124.  
Last modified: 15 Aug 2006, 16:53:09 UTC

I think he works for Intel?

I'm surprised by some of the responses I've seen to this thread. People are so stuck in needing to be right or having to see things one specific way that they cannot get excited about something that is new and cool in computing. So what if it is 4 CPUs? Who cares? Are we trying to prove a point? Because Francois isn't trying to prove one.

It's great for SETI and awesome for the volunteer grid computing community! WE ARE ALL IN THIS TOGETHER! Apple, AMD, Intel, IBM, etc. Whatever it takes!


Sure Francois is trying to prove a point! He's trying to prove that Intel finally has a butt-kicking architecture available with the new Core 2 CPUs. After all, he works for Intel, and I am sure he is excited about what is new and cool in computing, 'cuz Intel is it. I'm sure excited about it, my new X6800 is doing things my AMD FX60 can't even touch!
I think what Francois is doing is absolutely fantastic!! Even if the rest of us cannot afford some of the grand toys that he has access to directly from Intel, what he is doing scales down to the processors that are coming on the market in a price range most of us can afford.
He has not tried to hide the fact that he works for Intel, and he has already said that Intel did not instruct him to work on SETI. I truly believe he is doing this as a very excited hobbyist. And the manner in which he is doing it is beyond reproach...being openly willing to share optimized code.
As far as I am concerned, Francois can beat the Intel drum all he wants. What could be better than an Intel insider who is willing to work with Simon on his optimized apps? This is win-win for everybody!
"Learn from yesterday. Live for today. Hope for tomorrow." Albert Einstein
"With cats." kittyman

ID: 398187 · Report as offensive
Paydirt
Joined: 17 Sep 00
Posts: 53
Credit: 37,938
RAC: 0
United States
Message 398124 - Posted: 15 Aug 2006, 15:39:30 UTC

I think he works for Intel?

I'm surprised by some of the responses I've seen to this thread. People are so stuck in needing to be right or having to see things one specific way that they cannot get excited about something that is new and cool in computing. So what if it is 4 CPUs? Who cares? Are we trying to prove a point? Because Francois isn't trying to prove one.

It's great for SETI and awesome for the volunteer grid computing community! WE ARE ALL IN THIS TOGETHER! Apple, AMD, Intel, IBM, etc. Whatever it takes!
ID: 398124 · Report as offensive
kittyman
Volunteer tester
Joined: 9 Jul 00
Posts: 50494
Credit: 1,018,363,574
RAC: 2,276
United States
Message 398109 - Posted: 15 Aug 2006, 14:53:23 UTC

BTW, would Francois' processor be considered a Core 2 Quad? I didn't think they had been released yet. Heck, you can hardly even buy an E6600 or E6700 off the shelf yet. Or do his connections to Intel get him an engineering sample or such?
"Learn from yesterday. Live for today. Hope for tomorrow." Albert Einstein
"With cats." kittyman

ID: 398109 · Report as offensive


 
©2020 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.