AMD Optimized application tests and recommendations.

Astro
Volunteer tester
Joined: 16 Apr 02
Posts: 8026
Credit: 600,015
RAC: 0
Message 660245 - Posted: 15 Oct 2007, 23:05:09 UTC
Last modified: 15 Oct 2007, 23:07:08 UTC

OK, I've collected enough to make the presentation. The absolute best Simon apps available are:

For AMD64 X2 6000, AMD64 X2 5200, and AMD64 3700 San Diego:
First place:
KWSN 2.4 SSE2 AMD MB, OR KWSN 2.4 SSE2 AMD MB-P, OR KWSN 2.4 SSE2 Intel P4-P

Second place:
KWSN 2.4 SSE3 Intel P4-P

Third place:
KWSN 2.4 SSE2 IPP Ben-Joe (64b)

Last place:
Stock 5.27

For AMD64 X2 4800
First place:
KWSN 2.4 SSE2 AMD MB-P but only due to erratic behavior.

Second place:
KWSN 2.4 SSE2 AMD MB, KWSN 2.4 SSE2 Intel P4-P

Third place:
KWSN 2.4 SSE3 Intel P4-P

Fourth place:
KWSN 2.4 SSE2 IPP Ben-Joe (64b)

Last place:
Stock 5.27

Mobile AMD64 3700
First place:
KWSN 2.4 SSE2 AMD MB, OR KWSN 2.4 SSE2 AMD MB-P, OR KWSN 2.4 SSE2 Intel P4-P

Second place:
KWSN 2.4 SSE3 Intel P4-P

Last place:
Stock 5.27

AMD64 2800 ClawHammer

First Place:
KWSN 2.4 SSE2 AMD MB, or KWSN 2.4 SSE2 AMD MB-P

Second Place:
KWSN 2.4 SSE2 Intel P4-P

Last place:
Stock 5.27

Basically, I recommend running KWSN 2.4 SSE2 AMD MB, unless other data shows an advantage I'm not aware of.






NOTE: the applications listed with a -P extension are "special edition" builds unavailable from Simon's site.
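
For anyone who wants to sanity-check the recommendation, one way is to write the placings above as a small table and look for the build with the best worst-case placing. The Python sketch below does that. The dictionary layout and the function name `best_worst_case` are made up for illustration, placings are just ordinal positions within each list (so "Last place" is the final position), and skipping the "-P" builds follows the NOTE above that they are not generally available.

```python
# Placings transcribed from the lists above. Each inner list is one place,
# ordered from first to last; builds that tied share a place.
PLACINGS = {
    "AMD64 X2 6000 / X2 5200 / 3700 San Diego": [
        ["KWSN 2.4 SSE2 AMD MB", "KWSN 2.4 SSE2 AMD MB-P", "KWSN 2.4 SSE2 Intel P4-P"],
        ["KWSN 2.4 SSE3 Intel P4-P"],
        ["KWSN 2.4 SSE2 IPP Ben-Joe (64b)"],
        ["Stock 5.27"],
    ],
    "AMD64 X2 4800": [
        ["KWSN 2.4 SSE2 AMD MB-P"],
        ["KWSN 2.4 SSE2 AMD MB", "KWSN 2.4 SSE2 Intel P4-P"],
        ["KWSN 2.4 SSE3 Intel P4-P"],
        ["KWSN 2.4 SSE2 IPP Ben-Joe (64b)"],
        ["Stock 5.27"],
    ],
    "Mobile AMD64 3700": [
        ["KWSN 2.4 SSE2 AMD MB", "KWSN 2.4 SSE2 AMD MB-P", "KWSN 2.4 SSE2 Intel P4-P"],
        ["KWSN 2.4 SSE3 Intel P4-P"],
        ["Stock 5.27"],
    ],
    "AMD64 2800 ClawHammer": [
        ["KWSN 2.4 SSE2 AMD MB", "KWSN 2.4 SSE2 AMD MB-P"],
        ["KWSN 2.4 SSE2 Intel P4-P"],
        ["Stock 5.27"],
    ],
}


def best_worst_case(placings, exclude_suffix="-P"):
    """Return (app, worst_place) for the generally available build,
    tested on every CPU, whose worst placing is the best."""
    worst = {}      # app -> worst (largest) position seen
    tested_on = {}  # app -> number of CPU groups it appears in
    for places in placings.values():
        for position, apps in enumerate(places, start=1):
            for app in apps:
                worst[app] = max(worst.get(app, 0), position)
                tested_on[app] = tested_on.get(app, 0) + 1
    candidates = {
        app: place
        for app, place in worst.items()
        if tested_on[app] == len(placings) and not app.endswith(exclude_suffix)
    }
    app = min(candidates, key=candidates.get)
    return app, candidates[app]


print(best_worst_case(PLACINGS))
```

With the placings as listed, this picks KWSN 2.4 SSE2 AMD MB, which never places worse than second on any of the tested processors.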
ID: 660245
DJStarfox

Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 660264 - Posted: 16 Oct 2007, 0:30:01 UTC - in response to Message 660245.  

Basically, I recommend running KWSN 2.4 SSE2 AMD MB, unless other data shows an advantage I'm not aware of.


Would you say "KWSN 2.4 SSE2 AMD MB" is the best application for the first generation Opteron line? They are supposed to be similar to the Athlon 64's except for larger L2 cache.
ID: 660264
Astro
Volunteer tester
Joined: 16 Apr 02
Posts: 8026
Credit: 600,015
RAC: 0
Message 660271 - Posted: 16 Oct 2007, 0:44:26 UTC

Send me one and I'll test her out. LOL. Other than that, I'd assume it's faster for any AMD64 processor. The other options were either the same or slower on the processors tested. Basically, with the processors tested, the recommended Simon app was the right one to recommend.
ID: 660271
ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 20331
Credit: 7,508,002
RAC: 20
United Kingdom
Message 660512 - Posted: 16 Oct 2007, 11:39:52 UTC

Interesting work there Astro.

Even more interesting are the results...

Why do the SSE2 clients run better than the SSE3 clients?...

Is this an artifact of the Intel compiler optimising for the Intel architecture? Or is all this from gcc?

Happy crunchin',
Martin

See new freedom: Mageia Linux
Take a look for yourself: Linux Format
The Future is what We all make IT (GPLv3)
ID: 660512
Astro
Volunteer tester
Joined: 16 Apr 02
Posts: 8026
Credit: 600,015
RAC: 0
Message 660532 - Posted: 16 Oct 2007, 13:05:34 UTC

Hi Martin, The SSE3 version was KWSN 2.4 SSE3 Intel-P4. It was designed for P4's and PD's. I had to give it the "special treatment" so it would work on my AMD processors with SSE3 instruction set. I was under the impression that the "special treatment" meant that it would use SSE3. I don't know if it did or didn't, I just know it's marginally slower in all tests.
ID: 660532
ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 20331
Credit: 7,508,002
RAC: 20
United Kingdom
Message 660717 - Posted: 16 Oct 2007, 22:28:56 UTC - in response to Message 660532.  

Hi Martin, The SSE3 version was KWSN 2.4 SSE3 Intel-P4. It was designed for P4's and PD's. I had to give it the "special treatment" so it would work on my AMD processors with SSE3 instruction set. I was under the impression that the "special treatment" meant that it would use SSE3. I don't know if it did or didn't, I just know it's marginally slower in all tests.

Indeed a very good test.

Good stuff,

Happy crunchin',
Martin

ID: 660717
Astro
Volunteer tester
Joined: 16 Apr 02
Posts: 8026
Credit: 600,015
RAC: 0
Message 660744 - Posted: 16 Oct 2007, 23:02:57 UTC

What I would like to figure out is why/how my AMD64 X2 4800 using the "special version" of KWSN 2.4 SSE2 AMD MB seems to do work about 50% faster on some WUs versus the others. What could possibly account for this? I could see it if the high ones were normal and something caused a slowdown on the others, but when I compare CC/hour across my machines it seems clear that the lower average is normal and the high spikes are oddities. I wonder what could cause that. If I could figure that out, then maybe I could cause the oddity to happen all the time (i.e., get a 50% speedup).
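
A quick way to pull the oddities out of a list of results is to compare each work unit's CC/hour against the median for the host. Here is a rough Python sketch of that idea. The function name `flag_fast_units`, the 1.4x cut-off, and the sample numbers are all made up for illustration; they are not measurements from the machines in this thread.

```python
from statistics import median


def flag_fast_units(cc_per_hour, threshold=1.4):
    """Return the indices of work units whose CC/hour is more than
    `threshold` times the median rate for the host."""
    baseline = median(cc_per_hour)
    return [i for i, rate in enumerate(cc_per_hour) if rate > threshold * baseline]


# Made-up sample: mostly ~20 CC/hour with two spikes near 30.
sample = [20.1, 19.8, 30.2, 20.4, 19.9, 29.7, 20.0]
print(flag_fast_units(sample))
```

Using the median rather than the mean keeps the spikes from dragging the baseline up. Recording the angle range next to each flagged unit would then show whether the fast ones cluster anywhere.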
ID: 660744
Jamie

Joined: 8 Feb 01
Posts: 28
Credit: 11,078,008
RAC: 0
United States
Message 660791 - Posted: 17 Oct 2007, 0:03:00 UTC - in response to Message 660264.  

Would you say "KWSN 2.4 SSE2 AMD MB" is the best application for the first generation Opteron line? They are supposed to be similar to the Athlon 64's except for larger L2 cache.


In my experience, the P-M builds run best on Opterons. I don't see that in Astro's comparison....
ID: 660791
ML1
Volunteer moderator
Volunteer tester
Joined: 25 Nov 01
Posts: 20331
Credit: 7,508,002
RAC: 20
United Kingdom
Message 660803 - Posted: 17 Oct 2007, 0:20:24 UTC - in response to Message 660744.  
Last modified: 17 Oct 2007, 0:20:39 UTC

What I would like to figure out is why/how my AMD64 X2 4800 using the "special version" of KWSN 2.4 SSE2 AMD MB seems to do work about 50% faster on some WUs versus the others. What could possibly account for this?...

The mix of calculations done for that particular WU vs the memory bandwidth of your system?

Or even that plus the mix of WUs that the two cores are running, and whether they can interleave so that each gets full memory bandwidth?

If you ran one WU only per two cores, you might see your 50% speedup all the time?

... Just guessing!

Happy crunchin',
Martin

ID: 660803
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 660962 - Posted: 17 Oct 2007, 2:49:49 UTC - in response to Message 660744.  
Last modified: 17 Oct 2007, 3:00:21 UTC

What I would like to figure out is why/how my AMD64 X2 4800 using the "special version" of KWSN 2.4 SSE2 AMD MB seems to do work about 50% faster on some WUs versus the others. What could possibly account for this? I could see it if the high ones were normal and something caused a slowdown on the others, but when I compare CC/hour across my machines it seems clear that the lower average is normal and the high spikes are oddities. I wonder what could cause that. If I could figure that out, then maybe I could cause the oddity to happen all the time (i.e., get a 50% speedup).


Some theorising on what I noticed profiling code on SSE2 & SSE3 Pentium 4s:
For what it's worth, it's probably a good point, for the sake of comparison, to mention the differences in P4 generations that I'm finding on my own two machines. I have a Northwood 2 GHz (no HT) versus a Cedar Mill 3.2 GHz (with HT) (I conspicuously skipped the 'presshot' generation).

Certain 'expensive' routines in the science app 'appear' to have issues related to data alignment and memory ordering (possibly unavoidable; it's too early for me to tell 100%, and I'm looking at other things). (This was determined using Intel's own tools, with a high degree of operator [me] error.)

Now data alignment is a function of memory allocation, which is essentially a random function of the machine/OS and particular dataset conditions in a given run. Yes, the beginning of the dataset is programmatically aligned (in the code I have, anyway), but the nature of the dataset seems to ensure crossing of cache line boundaries [split loads] and other assorted mischief (penalties especially on Intel chips).
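
To make the split-load point concrete: a 16-byte SSE load is "split" when its first and last bytes fall in different cache lines. The Python sketch below just does that address arithmetic. The 64-byte line size is an assumption for illustration (the real figure depends on the CPU and cache level), and the function names are made up.

```python
CACHE_LINE_BYTES = 64  # assumption; depends on the CPU and cache level
SSE_LOAD_BYTES = 16    # one 128-bit SSE register


def crosses_cache_line(address, size=SSE_LOAD_BYTES, line=CACHE_LINE_BYTES):
    """True if a load of `size` bytes starting at `address` spans two cache lines."""
    return address // line != (address + size - 1) // line


def count_split_loads(base, n_loads, stride=SSE_LOAD_BYTES):
    """Count the split loads in `n_loads` back-to-back loads starting at `base`."""
    return sum(crosses_cache_line(base + i * stride) for i in range(n_loads))


print(count_split_loads(0, 16))  # buffer starts on a 16-byte boundary
print(count_split_loads(8, 16))  # same buffer, offset by 8 bytes
```

Under these assumptions a 16-byte-aligned starting address never produces a split, because the line size is a multiple of the load size, while an 8-byte offset makes one load in every four straddle a line boundary.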

What this means to me is a possible explanation of why SSE2 'may be' faster on certain machines than SSE3. SSE2 P4s tend to have smaller caches (~512 KB in my case). When these miss or stall or other horrible things happen, there is less to refill. With my Cedar Mill, with its much larger 2 MB cache, a stall 'could' be more catastrophic if SSE2 were used with a large dataset, but the SSE3 variants seem to be more tolerant of this (perhaps extra instructions for dealing with misaligned data are used, preventing some performance penalties, but also perhaps they are fractionally slower on aligned data).

So I think the answer may be, if having to choose between SSE2 & 3 for now, something like "small cache: use SSE2 and tolerate some cache misses, but it uses fast 'data aligned' instructions so you end up ahead (luck of the draw on alignment)", and with large caches "use SSE3 instructions to be more tolerant of misalignment and keep the cache filled".
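
Written down as code, that rule of thumb is only a threshold check. A sketch in Python, where the function name `suggest_build` and the 1024 KiB cut-off are assumptions (the only data points above are roughly 512 KB as "small" and 2 MB as "large"), and nothing here has been checked against AMD parts:

```python
def suggest_build(cache_kib, threshold_kib=1024):
    """Tentative rule of thumb from the post above:
    small cache -> SSE2 build, large cache -> SSE3 build.
    The threshold is a guess, not a measured crossover point."""
    return "SSE2" if cache_kib < threshold_kib else "SSE3"


print(suggest_build(512))   # Northwood-sized cache
print(suggest_build(2048))  # Cedar Mill-sized cache
```

Treat it as a starting guess to test on your own host, not a replacement for benchmarking both builds.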

Now what I can't answer is where AMD ICs fit into this equation. The last AMD IC I owned was an Athlon 1.2 (which rocked, but oddly my mum's P3 was faster :S ). But from that memory (vague recollection) I would guess the smaller cache arena would fit Athlon XPs?

Just Thoughts & observations.

Jason


"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
ID: 660962
DJStarfox

Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 660976 - Posted: 17 Oct 2007, 3:08:19 UTC - in response to Message 660791.  

In my experience, the P-M builds run best on Opterons. I don't see that in Astro's comparison....


P-M builds? I don't see anything with that name on the lunatics.at site. What application are you referring to?
ID: 660976
Astro
Volunteer tester
Joined: 16 Apr 02
Posts: 8026
Credit: 600,015
RAC: 0
Message 660993 - Posted: 17 Oct 2007, 3:33:06 UTC

I'd thought about the math done at different angle ranges, but the chart shows it happening at most angle ranges across the spectrum (except higher than 4).

Here's the chart for the AMD64 4800 with just the "special edition" work done with KWSN 2.4 SSE2 AMD MB.



@martin, I've switched it to use 1 processor, and will check them in the morning.
ID: 660993
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19077
Credit: 40,757,560
RAC: 67
United Kingdom
Message 661005 - Posted: 17 Oct 2007, 3:45:45 UTC - in response to Message 660976.  

In my experience, the P-M builds run best on Opterons. I don't see that in Astro's comparison....


P-M builds? I don't see anything with that name on the lunatics.at site. What application are you referring to?

Here's the link SSE2-Intel PM
ID: 661005
tfp
Volunteer tester

Joined: 20 Feb 01
Posts: 104
Credit: 3,137,259
RAC: 0
United States
Message 661007 - Posted: 17 Oct 2007, 3:49:35 UTC - in response to Message 660962.  

I have a Northwood 2 GHz (no HT) versus a Cedar Mill 3.2 GHz (with HT) (I conspicuously skipped the 'presshot' generation).


Not to jump off topic, but Cedar Mill is really part of the 'presshot' generation; it's just that 65 nm got the heat more under control. However, I believe it is still a Prescott at its core.
ID: 661007
W-K 666 Project Donor
Volunteer tester

Joined: 18 May 99
Posts: 19077
Credit: 40,757,560
RAC: 67
United Kingdom
Message 661011 - Posted: 17 Oct 2007, 3:55:53 UTC

@Tony,
The higher claims you are seeing at VHAR (>1.2) are what I am seeing constantly on Pentium M and C2D; both have 2 MB of cache per core.

Andy
ID: 661011
jason_gee
Volunteer developer
Volunteer tester
Joined: 24 Nov 06
Posts: 7489
Credit: 91,093,184
RAC: 0
Australia
Message 661032 - Posted: 17 Oct 2007, 4:14:53 UTC - in response to Message 661007.  
Last modified: 17 Oct 2007, 4:16:08 UTC

I have a Northwood 2 GHz (no HT) versus a Cedar Mill 3.2 GHz (with HT) (I conspicuously skipped the 'presshot' generation).


Not to jump off topic, but Cedar Mill is really part of the 'presshot' generation; it's just that 65 nm got the heat more under control. However, I believe it is still a Prescott at its core.


On topic Bit:
Exploring the source has made me curious to explore further the architectural differences in SSE2/3 implementation across brands. I am not seeing the expected compiler differences across these settings, which leads me to suspect instruction and cache differences even among Intel variants, not just for AMD parts.

I think the [current science app] SETI code may be particularly susceptible to certain architectural limitations. (Which in retrospect sounds obvious, I suppose, but it's been quite a lot of fiddling just to establish that basic facet of performance.)

Off Topic Bit:
Certainly debatable as to how much, and what kind of update, justifies a name change. Arguably each stepping is a different core altogether. I see enough performance-per-watt difference to call it a different processor, though still definitely not a Core 2 of course (LOL, which I suppose you could call a P3 :P). But the thing DOES run at ~30 degrees Celsius at 100% load on stock cooling, with proper case flow, in Australian springtime, whereas a friend's Prescott in similar conditions runs at more like 60 degrees C (so your heat observations hold).

ID: 661032
DJStarfox

Joined: 23 May 01
Posts: 1066
Credit: 1,226,053
RAC: 2
United States
Message 661703 - Posted: 18 Oct 2007, 2:46:51 UTC - in response to Message 661005.  

In my experience, the P-M builds run best on Opterons. I don't see that in Astro's comparison....


P-M builds? I don't see anything with that name on the lunatics.at site. What application are you referring to?

Here's the link SSE2-Intel PM


I don't think that app will work because I'm running 64-bit Linux. :) Thanks anyway.
ID: 661703
Jamie

Joined: 8 Feb 01
Posts: 28
Credit: 11,078,008
RAC: 0
United States
Message 665158 - Posted: 23 Oct 2007, 16:06:10 UTC - in response to Message 661703.  

In my experience, the P-M builds run best on Opterons. I don't see that in Astro's comparison....


P-M builds? I don't see anything with that name on the lunatics.at site. What application are you referring to?

Here's the link SSE2-Intel PM


I don't think that app will work because I'm running 64-bit Linux. :) Thanks anyway.

Take it up with the Chicken Linux folks.

Crunch3r hasn't done a P-M build for his newer 2.4 builds anyway... apparently a change in the compiler.
ID: 665158
Keith T.
Volunteer tester
Joined: 23 Aug 99
Posts: 962
Credit: 537,293
RAC: 9
United Kingdom
Message 688099 - Posted: 3 Dec 2007, 3:55:29 UTC

Thanks for all your help Astro.
Sir Arthur C Clarke 1917-2008
ID: 688099
The Gas Giant
Volunteer tester
Joined: 22 Nov 01
Posts: 1904
Credit: 2,646,654
RAC: 0
Australia
Message 688187 - Posted: 3 Dec 2007, 7:37:23 UTC

Where did all the charts go?

ID: 688187