AMD Optimized application tests and recommendations.

Astro Volunteer tester Joined: 16 Apr 02 Posts: 8026 Credit: 600,015 RAC: 0	Message 660245 - Posted: 15 Oct 2007, 23:05:09 UTC Last modified: 15 Oct 2007, 23:07:08 UTC
OK, I've collected enough to make the presentation. The absolute best Simon apps available are:
For AMD64 X2 6000, AMD64 X2 5200, and AMD64 3700 San Diego:
  First place: KWSN 2.4 SSE2 AMD MB, OR KWSN 2.4 SSE2 AMD MB-P, OR KWSN 2.4 SSE2 Intel P4-P
  Second place: KWSN 2.4 SSE3 Intel P4-P
  Third place: KWSN 2.4 SSE2 IPP Ben-Joe (64b)
  Last place: Stock 5.27
For AMD64 X2 4800:
  First place: KWSN 2.4 SSE2 AMD MB-P, but only due to erratic behavior.
  Second place: KWSN 2.4 SSE2 AMD MB, KWSN 2.4 SSE2 Intel P4-P
  Third place: KWSN 2.4 SSE3 Intel P4-P
  Fourth place: KWSN 2.4 SSE2 IPP Ben-Joe (64b)
  Last place: Stock 5.27
Mobile AMD64 3700:
  First place: KWSN 2.4 SSE2 AMD MB, OR KWSN 2.4 SSE2 AMD MB-P, OR KWSN 2.4 SSE2 Intel P4-P
  Second place: KWSN 2.4 SSE3 Intel P4-P
  Last place: Stock 5.27
AMD64 2800 ClawHammer:
  First place: KWSN 2.4 SSE2 AMD MB, or KWSN 2.4 SSE2 AMD MB-P
  Second place: KWSN 2.4 SSE2 Intel P4-P
  Last place: Stock 5.27
Basically, I recommend running KWSN 2.4 SSE2 AMD MB, unless other data shows some advantage I'm not aware of.
NOTE: the applications listed with a -P extension are "special edition" builds unavailable from Simon's site.

DJStarfox Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2	Message 660264 - Posted: 16 Oct 2007, 0:30:01 UTC - in response to Message 660245.
> Basically, I recommend running KWSN 2.4 SSE2 AMD MB, unless other data shows some advantage I'm not aware of.
Would you say "KWSN 2.4 SSE2 AMD MB" is the best application for the first generation Opteron line? They are supposed to be similar to the Athlon 64s except for a larger L2 cache.

Astro Volunteer tester Joined: 16 Apr 02 Posts: 8026 Credit: 600,015 RAC: 0	Message 660271 - Posted: 16 Oct 2007, 0:44:26 UTC
Send me one and I'll test her out. LOL
Other than that, I'd make the assumption it's faster for any AMD64 processor. The other options were either the same or slower on the processors tested. Basically, with the processors tested, the recommended Simon app was the right one to recommend.

ML1 Volunteer moderator Volunteer tester Joined: 25 Nov 01 Posts: 20331 Credit: 7,508,002 RAC: 20	Message 660512 - Posted: 16 Oct 2007, 11:39:52 UTC
Interesting work there Astro. Even more interesting are the results... Why do the SSE2 clients run better than the SSE3 clients?... Is this an artifact of the Intel compiler optimising for the Intel architecture? Or is all this from gcc?
Happy crunchin', Martin
See new freedom: Mageia Linux Take a look for yourself: Linux Format The Future is what We all make IT (GPLv3)

Astro Volunteer tester Joined: 16 Apr 02 Posts: 8026 Credit: 600,015 RAC: 0	Message 660532 - Posted: 16 Oct 2007, 13:05:34 UTC
Hi Martin, The SSE3 version was KWSN 2.4 SSE3 Intel-P4. It was designed for P4s and PDs. I had to give it the "special treatment" so it would work on my AMD processors with the SSE3 instruction set. I was under the impression that the "special treatment" meant that it would use SSE3. I don't know if it did or didn't; I just know it's marginally slower in all tests.

ML1 Volunteer moderator Volunteer tester Joined: 25 Nov 01 Posts: 20331 Credit: 7,508,002 RAC: 20	Message 660717 - Posted: 16 Oct 2007, 22:28:56 UTC - in response to Message 660532.
> Hi Martin, The SSE3 version was KWSN 2.4 SSE3 Intel-P4. It was designed for P4s and PDs. I had to give it the "special treatment" so it would work on my AMD processors with the SSE3 instruction set. I was under the impression that the "special treatment" meant that it would use SSE3. I don't know if it did or didn't; I just know it's marginally slower in all tests.
Indeed a very good test. Good stuff,
Happy crunchin', Martin

Astro Volunteer tester Joined: 16 Apr 02 Posts: 8026 Credit: 600,015 RAC: 0	Message 660744 - Posted: 16 Oct 2007, 23:02:57 UTC
What I would like to figure out is why/how my AMD64 X2 4800 using the "special version" of KWSN 2.4 SSE2 AMD MB seems to do work about 50% faster on some WUs versus the others. What could possibly account for this? I could see it if the high ones were normal and something caused a slowdown on the others, but when I compare CC/hour across my machines it seems clear that the lower average is normal and the high spikes are oddities. I wonder what could cause that, and if I could figure that out, then maybe I could cause the oddity to happen all the time (i.e., get a 50% speedup).
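One way to make the "lower average is normal, high spikes are oddities" comparison concrete is to flag work units whose CC/hour sits well above the median for the host. The sketch below is only an illustration: the function name `flag_fast_units`, the 1.4x threshold, and the sample rates are all hypothetical, and none of the numbers come from Astro's charts.

```python
from statistics import median

def flag_fast_units(rates, threshold=1.4):
    """Return the indexes of work units whose rate exceeds threshold * median.

    rates     -- CC/hour per work unit (hypothetical values, not measured data)
    threshold -- assumed cut-off; 1.4 sits just under the ~50% spike described
    """
    baseline = median(rates)  # the median resists being dragged up by the spikes
    return [i for i, rate in enumerate(rates) if rate > threshold * baseline]

# Made-up CC/hour figures for one host: mostly ~10, with two spikes near 15.
sample_rates = [10.0, 10.5, 9.8, 15.2, 10.1, 15.0, 9.9]
print(flag_fast_units(sample_rates))  # [3, 5]
```

Using the median rather than the mean as the baseline is a design choice: a handful of fast results would raise a mean and make the spikes look smaller than they are.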

Jamie Joined: 8 Feb 01 Posts: 28 Credit: 11,078,008 RAC: 0	Message 660791 - Posted: 17 Oct 2007, 0:03:00 UTC - in response to Message 660264.
> Would you say "KWSN 2.4 SSE2 AMD MB" is the best application for the first generation Opteron line? They are supposed to be similar to the Athlon 64s except for a larger L2 cache.
In my experience, the P-M builds run best on Opterons. I don't see that in Astro's comparison....

ML1 Volunteer moderator Volunteer tester Joined: 25 Nov 01 Posts: 20331 Credit: 7,508,002 RAC: 20	Message 660803 - Posted: 17 Oct 2007, 0:20:24 UTC - in response to Message 660744. Last modified: 17 Oct 2007, 0:20:39 UTC
> What I would like to figure out is why/how my AMD64 X2 4800 using the "special version" of KWSN 2.4 SSE2 AMD MB seems to do work about 50% faster on some WUs versus the others. What could possibly account for this?...
The mix of calculations done for that particular WU vs the memory bandwidth of your system? Or even that and also the mix of WUs that the two cores are running, and whether they can interleave so that each gains full memory bandwidth? If you ran only one WU across the two cores, you might see your 50% speedup all the time? ... Just guessing!
Happy crunchin', Martin

jason_gee Volunteer developer Volunteer tester Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 660962 - Posted: 17 Oct 2007, 2:49:49 UTC - in response to Message 660744. Last modified: 17 Oct 2007, 3:00:21 UTC
> What I would like to figure out is why/how my AMD64 X2 4800 using the "special version" of KWSN 2.4 SSE2 AMD MB seems to do work about 50% faster on some WUs versus the others. What could possibly account for this? I could see it if the high ones were normal and something caused a slowdown on the others, but when I compare CC/hour across my machines it seems clear that the lower average is normal and the high spikes are oddities. I wonder what could cause that, and if I could figure that out, then maybe I could cause the oddity to happen all the time (i.e., get a 50% speedup).
Some theorising on what I noticed profiling code on SSE2 & SSE3 Pentium 4s:
For what it's worth, it is probably worth mentioning, for the sake of comparison, the differences in P4 generations that I'm finding on my own 2 machines. I have a Northwood 2 GHz (no HT) versus a Cedar Mill 3.2 (with HT) (I conspicuously skipped the 'presshot' generation).
Certain 'expensive' routines in the science app 'appear' to have issues related to data alignment and memory ordering (possibly unavoidable; too early for me to tell 100%, and I'm looking at other things). (This was determined using Intel's own tools with a high degree of operator [me] error.)
Now data alignment is a function of memory allocation, which is essentially a random function of the machine/OS and particular dataset conditions in a given run. Yes, the beginning of the dataset is programmatically aligned (in the code I have, anyway), but the nature of the dataset seems to ensure crossing of cache line boundaries [split loads], and other assorted mischief (penalties especially on Intel chips).
What this means to me is a possible explanation of why SSE2 'may be' faster on certain machines than SSE3.
SSE2 P4s tend to have smaller caches (~512k in my case). When these miss or stall or other horrible things happen, there is less to refill. With my Cedar Mill, with its much larger 2 meg cache, a stall 'could' be more catastrophic if SSE2 were used with a large dataset, but the SSE3 variants seem to be more tolerant of this (perhaps extra instructions for dealing with misaligned data are used, preventing some performance penalties, but also perhaps they are fractionally slower on aligned data).
So I think the answer may be, if having to choose between SSE2 & 3 for now, something like "small cache: use SSE2 and tolerate some cache misses, but it uses fast 'data aligned' instructions so you end up ahead (luck of the draw on alignment)"... and with large caches, "use SSE3 instructions to be more tolerant of misalignment and keep the cache filled".
Now what I can't answer is where AMD ICs fit into this equation. The last AMD IC I owned was an Athlon 1.2 (which rocked, but oddly my mum's P3 was faster :S ), but from that memory (vague recollection) I would guess the smaller cache arena would fit Athlon XPs?
Just thoughts & observations. Jason
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to live by: The computer science of human decisions.
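The "luck of the draw on alignment" point can be illustrated with plain address arithmetic. The sketch below counts how many 16-byte vector loads in a sequential sweep straddle a cache-line boundary, depending on where the buffer happens to start. It is a model, not a measurement: the 64-byte line size and 16-byte load width are assumptions, `count_split_loads` is a hypothetical helper, and nothing here is taken from the science app's source.

```python
CACHE_LINE = 64  # bytes; assumed line size, not taken from the thread
VECTOR = 16      # bytes; width of one 128-bit SSE load

def count_split_loads(base_addr, n_loads, stride=VECTOR, width=VECTOR,
                      line=CACHE_LINE):
    """Count loads whose first and last byte fall in different cache lines."""
    splits = 0
    for i in range(n_loads):
        start = base_addr + i * stride
        end = start + width - 1          # address of the last byte touched
        if start // line != end // line:
            splits += 1
    return splits

# A buffer starting on a 16-byte boundary never splits a 16-byte load.
print(count_split_loads(base_addr=0, n_loads=8))  # 0
# The same sweep starting 8 bytes off splits one load per 64-byte line.
print(count_split_loads(base_addr=8, n_loads=8))  # 2
```

In this model a misaligned start makes one load in four cross a line, which is the kind of run-to-run variation an allocator could introduce if only the start of the dataset is aligned.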

DJStarfox Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2	Message 660976 - Posted: 17 Oct 2007, 3:08:19 UTC - in response to Message 660791.
> In my experience, the P-M builds run best on Opterons. I don't see that in Astro's comparison....
P-M builds? I don't see anything with that name on the lunatics.at site. What application are you referring to?

Astro Volunteer tester Joined: 16 Apr 02 Posts: 8026 Credit: 600,015 RAC: 0	Message 660993 - Posted: 17 Oct 2007, 3:33:06 UTC
I'd thought about the math done at different angle ranges, but the chart shows it happening at most angle ranges across the spectrum (except higher than 4). Here's the chart for the AMD64 4800 with just the "special edition" work done with KWSN 2.4 SSE2 AMD MB. [chart image not preserved]
@Martin, I've switched it to use 1 processor, and will check them in the morning.
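Checking whether the spikes cluster at particular angle ranges amounts to bucketing results by angle range and comparing the mean rate per bucket. The following is a rough sketch under stated assumptions: `mean_rate_by_bucket` is a hypothetical helper, the bucket edges (0.0, 1.2, 4.0) are illustrative choices loosely based on the values mentioned in the thread, and the sample pairs are invented.

```python
from bisect import bisect_right
from collections import defaultdict

def mean_rate_by_bucket(samples, edges):
    """Average rate per angle-range bucket.

    samples -- iterable of (angle_range, rate) pairs (hypothetical data)
    edges   -- sorted lower bounds of the buckets, e.g. [0.0, 1.2, 4.0]
    """
    totals = defaultdict(lambda: [0.0, 0])       # lower bound -> [sum, count]
    for angle_range, rate in samples:
        idx = max(bisect_right(edges, angle_range) - 1, 0)
        totals[edges[idx]][0] += rate
        totals[edges[idx]][1] += 1
    return {lo: s / n for lo, (s, n) in sorted(totals.items())}

# Invented (angle range, CC/hour) pairs, for illustration only.
samples = [(0.4, 10.0), (0.8, 14.0), (1.5, 20.0), (5.0, 9.0)]
print(mean_rate_by_bucket(samples, edges=[0.0, 1.2, 4.0]))
# {0.0: 12.0, 1.2: 20.0, 4.0: 9.0}
```

If the fast results were spread evenly across buckets, as Astro describes, the per-bucket means would come out similar and angle range could be ruled out as the cause.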

W-K 666 Volunteer tester Joined: 18 May 99 Posts: 19077 Credit: 40,757,560 RAC: 67	Message 661005 - Posted: 17 Oct 2007, 3:45:45 UTC - in response to Message 660976.
>> In my experience, the P-M builds run best on Opterons. I don't see that in Astro's comparison....
> P-M builds? I don't see anything with that name on the lunatics.at site. What application are you referring to?
Here's the link: SSE2-Intel PM

tfp Volunteer tester Joined: 20 Feb 01 Posts: 104 Credit: 3,137,259 RAC: 0	Message 661007 - Posted: 17 Oct 2007, 3:49:35 UTC - in response to Message 660962.
> I have a Northwood 2 GHz (no HT) versus a Cedar Mill 3.2 (with HT) (I conspicuously skipped the 'presshot' generation).
Not to jump off topic, but Cedar Mill is really part of the 'presshot' generation; it's just that the 65 nm process got the heat more under control. However, I believe it is still a Prescott at its core.

W-K 666 Volunteer tester Joined: 18 May 99 Posts: 19077 Credit: 40,757,560 RAC: 67	Message 661011 - Posted: 17 Oct 2007, 3:55:53 UTC
@Tony, The higher claims you are seeing at VHAR (>1.2) are what I am seeing constantly on Pentium M and C2D; both have 2 MByte per core.
Andy

jason_gee Volunteer developer Volunteer tester Joined: 24 Nov 06 Posts: 7489 Credit: 91,093,184 RAC: 0	Message 661032 - Posted: 17 Oct 2007, 4:14:53 UTC - in response to Message 661007. Last modified: 17 Oct 2007, 4:16:08 UTC
>> I have a Northwood 2 GHz (no HT) versus a Cedar Mill 3.2 (with HT) (I conspicuously skipped the 'presshot' generation).
> Not to jump off topic, but Cedar Mill is really part of the 'presshot' generation; it's just that the 65 nm process got the heat more under control. However, I believe it is still a Prescott at its core.
On topic bit: Exploring the source has made me curious to explore further the architectural differences in SSE2/3 implementation across brands. I am not seeing the expected compiler differences across these settings, which leads me to suspect instruction and cache differences even among Intel variants, not just for AMD parts. I think the [current science app] SETI code may be particularly susceptible to certain architectural limitations. (Which in retrospect sounds obvious I suppose, but it's been quite a lot of fiddling just to establish that basic facet of performance.)
Off topic bit: It's certainly debatable how much, and what kind of, update justifies a name change. Arguably each stepping is a different core altogether. I see enough performance-per-watt difference to call it a different processor, though still definitely not a Core 2 of course (LOL, which I suppose you could call a P3 :P). But the thing DOES run at ~30 degrees Celsius, 100% load, on stock cooling, with proper case flow in Australian springtime, whereas a friend's Prescott in similar conditions is more like 60 degrees C (so your heat observations hold).

DJStarfox Joined: 23 May 01 Posts: 1066 Credit: 1,226,053 RAC: 2	Message 661703 - Posted: 18 Oct 2007, 2:46:51 UTC - in response to Message 661005.
>>> In my experience, the P-M builds run best on Opterons. I don't see that in Astro's comparison....
>> P-M builds? I don't see anything with that name on the lunatics.at site. What application are you referring to?
> Here's the link: SSE2-Intel PM
I don't think that app will work because I'm running 64-bit Linux. :) Thanks anyway.

Jamie Joined: 8 Feb 01 Posts: 28 Credit: 11,078,008 RAC: 0	Message 665158 - Posted: 23 Oct 2007, 16:06:10 UTC - in response to Message 661703.
>>>> In my experience, the P-M builds run best on Opterons. I don't see that in Astro's comparison....
>>> P-M builds? I don't see anything with that name on the lunatics.at site. What application are you referring to?
>> Here's the link: SSE2-Intel PM
> I don't think that app will work because I'm running 64-bit Linux. :) Thanks anyway.
Take it up with the Chicken Linux folks. Crunch3r hasn't done a P-M build for his newer 2.4 builds anyway... apparently a change in the compiler.

Keith T. Volunteer tester Joined: 23 Aug 99 Posts: 962 Credit: 537,293 RAC: 9	Message 688099 - Posted: 3 Dec 2007, 3:55:29 UTC
Thanks for all your help, Astro.
Sir Arthur C Clarke 1917-2008

The Gas Giant Volunteer tester Joined: 22 Nov 01 Posts: 1904 Credit: 2,646,654 RAC: 0	Message 688187 - Posted: 3 Dec 2007, 7:37:23 UTC
Where did all the charts go?
