Core 2 comparison - Xeon E5320 vs E6300

OzzFan (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 475247 - Posted: 7 Dec 2006, 0:24:12 UTC - in response to Message 475178.  
Last modified: 7 Dec 2006, 0:26:38 UTC

I would suspect you are seeing a prime example of the bus-memory bottleneck in its full 8 thread glory. To make it worse, the Xeon's FB-DIMM memory latency of 5-5-5-15 compounds the problem.
If I understand the bandwidth correctly, for a better apples-to-apples example: your Allendale is running 2 cores @ 1066FSB w/CL3(?) memory latency vs. the Quad Xeons essentially sharing FSB bandwidth for roughly the equivalent performance of an Allendale running at 533FSB, and at CL5 to add insult.
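
To put rough numbers on that comparison, here is a minimal sketch. It assumes a 64-bit (8-byte) FSB data path and that bus bandwidth is split evenly between the cores sharing it; both are simplifications, and the function names are made up for illustration.

```python
# Rough per-core share of front-side-bus (FSB) bandwidth.
# Assumptions (not from the thread): 8-byte data path, even split between cores.

def per_core_bandwidth_gb_s(fsb_mt_s, cores_sharing_bus, bus_bytes=8):
    """Peak FSB bandwidth in GB/s divided evenly among the cores on the bus."""
    return fsb_mt_s * 1e6 * bus_bytes / cores_sharing_bus / 1e9

def equivalent_dual_core_fsb(fsb_mt_s, cores_sharing_bus):
    """FSB speed a dual core would need to give each core the same share."""
    return fsb_mt_s * 2 / cores_sharing_bus

allendale_share = per_core_bandwidth_gb_s(1066, cores_sharing_bus=2)
quad_xeon_share = per_core_bandwidth_gb_s(1066, cores_sharing_bus=4)
quad_as_dual_fsb = equivalent_dual_core_fsb(1066, cores_sharing_bus=4)

print(f"Allendale, 2 cores on 1066FSB: {allendale_share:.3f} GB/s per core")
print(f"Quad Xeon, 4 cores on 1066FSB: {quad_xeon_share:.3f} GB/s per core")
print(f"Quad Xeon share matches a dual core at {quad_as_dual_fsb:.0f}FSB")
```

Under those assumptions each core of the quad gets half the bus share of an Allendale core, which is where the "533FSB" comparison comes from. Cache hits and real bus arbitration are ignored here.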


The newer Socket 771 Xeon boards support a dual FSB, so each Quad Core does not have to share its FSB with the other socket, getting a full 12GB/s throughput (assuming 1.333GHz FSB @ 72bit accesses [including ECC bits]) per socket, which is not enough to saturate the 32GB/s throughput of the RAM.
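
The 12GB/s figure can be checked with simple arithmetic. The sketch below only multiplies transfer rate by bus width; the 32GB/s RAM figure is the one quoted in this post rather than a measurement, and the function name is hypothetical.

```python
# Peak bus bandwidth = transfers per second * bytes per transfer.

def peak_bandwidth_gb_s(transfers_per_s, bus_bits):
    """Peak bandwidth in GB/s for a bus of the given width."""
    return transfers_per_s * bus_bits / 8 / 1e9

with_ecc_bits = peak_bandwidth_gb_s(1.333e9, 72)  # 72-bit accesses, ECC bits included
data_only = peak_bandwidth_gb_s(1.333e9, 64)      # 64 data bits only
both_sockets = 2 * with_ecc_bits                  # two independent FSBs
ram_gb_s = 32.0                                   # figure quoted in the post

print(f"per socket, 72-bit: {with_ecc_bits:.2f} GB/s")
print(f"per socket, 64-bit: {data_only:.2f} GB/s")
print(f"both sockets:       {both_sockets:.2f} GB/s of {ram_gb_s:.0f} GB/s RAM")
```

Counting only the 64 data bits gives a little under 11GB/s per socket, so the 12GB/s number includes the ECC bits, as the post says. Either way, two sockets together stay below the quoted RAM throughput.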

Of course, his RAM isn't running at full potential since it is not in dual channel interleave mode, so perhaps they are fighting for RAM accesses.

If S@H's performance is most influenced by processor clock, cache, memory speed, and FSB in this order (is this correct?), you're likely benefitting w/ the extra cache w/ the 5320, at a "wash" w/ clock speed, and disadvantage w/ memory and FSB speed vs. a straight compare to your E6300.
More threads, yes, but operating at substantially reduced efficiency to where the total benefit scales at a somewhat disappointing fraction of the implied potential.

http://www.insight64.com/downloads/IntelligentDesign.pdf

Personally, I expected to see the dual socket operate at maybe 70% of the efficiency of 1 socket. Your 50% results (time vs. time) are sobering. Just these last few days, my debate has been between 2-CPU 5320 or 1 moderately OC QX6700. My perception has been to wait for the 45nm chips to see if they open the bottleneck before jumping into a dual socket. Unless you find a magic fix for your woes, your experience may be good inspiration for others to look at QX solutions w/ some overclocking and fast memory for optimal crunching.


According to Intel, the Quad Core parts are only supposed to achieve a maximum of 1.5x (or 150%) boost over dual core parts, assuming a properly programmed application that can work with Intel's advanced pre-fetch cache.
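
As a rough illustration of what these percentages mean for throughput, the sketch below reads the "1.5x" as a speedup factor and the 70% / 50% figures as per-core efficiency relative to a single-socket reference. Both readings are assumptions, and the helper names are hypothetical.

```python
def scaling_efficiency(speedup, core_ratio):
    """Fraction of ideal linear scaling actually achieved."""
    return speedup / core_ratio

def effective_cores(physical_cores, per_core_efficiency):
    """Number of full-speed cores that would deliver the same throughput."""
    return physical_cores * per_core_efficiency

quad_vs_dual = scaling_efficiency(speedup=1.5, core_ratio=2)
expected_dual_socket = effective_cores(8, 0.70)
observed_dual_socket = effective_cores(8, 0.50)

print(f"quad vs dual scaling efficiency: {quad_vs_dual:.0%}")
print(f"8 cores at 70% efficiency: {expected_dual_socket:.1f} effective cores")
print(f"8 cores at 50% efficiency: {observed_dual_socket:.1f} effective cores")
```

On this reading, a 1.5x gain from doubling the cores is 75% of linear scaling, and eight cores at 50% efficiency crunch about as much as four unconstrained ones.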
ID: 475247
Profile Gecko
Volunteer tester
Joined: 17 Nov 99
Posts: 454
Credit: 6,946,910
RAC: 47
United States
Message 475314 - Posted: 7 Dec 2006, 1:35:38 UTC - in response to Message 475247.  
Last modified: 7 Dec 2006, 1:39:00 UTC

The newer Socket 771 Xeon boards support a dual FSB, so each Quad Core does not have to share its FSB with the other socket, getting a full 12GB/s throughput (assuming 1.333GHz FSB @ 72bit accesses [including ECC bits]) per socket, which is not enough to saturate the 32GB/s throughput of the RAM.

Of course, his RAM isn't running at full potential since it is not in dual channel interleave mode, so perhaps they are fighting for RAM accesses.


In the Quad core, doesn't the main slowdown occur when, for example, die 1 and die 2, each running at 1066, are both competing for simultaneous bus access at 1066 to the memory controller? In essence, 2 separate dies (4 cores in total) sharing the same bus access to the memory controller, per socket? The two cores on each die, however, move data at full speed between them on the same die, right? In an Allendale/Conroe, only 2 cores compete for the same FSB access. Wouldn't the memory controller also contribute to part of the performance hit coordinating twice the traffic w/ 2 busses in a 2 socket config. between the memory banks and individual cores?

ID: 475314
OzzFan (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 475368 - Posted: 7 Dec 2006, 2:39:52 UTC - in response to Message 475314.  
Last modified: 7 Dec 2006, 2:47:28 UTC

In the Quad core, doesn't the main slowdown occur when, for example, die 1 and die 2, each running at 1066, are both competing for simultaneous bus access at 1066 to the memory controller? In essence, 2 separate dies (4 cores in total) sharing the same bus access to the memory controller, per socket? The two cores on each die, however, move data at full speed between them on the same die, right? In an Allendale/Conroe, only 2 cores compete for the same FSB access.


Ah, yes. I thought you were trying to say that both Quad Cores had to compete for the same FSB, which is not true. But yes, every core in the Quad Core chip has to compete for a single FSB, whether it be 1066MHz or 1333MHz (1066 in the case of the 5320).

Sorry, I misunderstood what you said.

Wouldn't the memory controller also contribute to part of the performance hit coordinating twice the traffic w/ 2 busses in a 2 socket config. between the memory banks and individual cores?


Not necessarily. That would depend on if Intel increased the speed and bandwidth of the MCH to handle the extra traffic. But yes, there will be a small latency involved being that it has to go through the traces on the motherboard as opposed to having the MCH built right into the CPU, as with the AMDs.

However, that can be alleviated with a good code/data pre-fetch unit to keep the L2 and L1 caches full with the appropriate code/data. Intel has claimed, and some tests have confirmed, that the accuracy of this pre-fetch unit is up to 99%. On many of these tests, the Xeons have been shown to outperform the Opterons in almost every area (with a few exceptions). As long as you can keep the CPUs busy, it should be hard to notice a performance penalty using an external MCH.
ID: 475368
Profile Gecko
Volunteer tester
Joined: 17 Nov 99
Posts: 454
Credit: 6,946,910
RAC: 47
United States
Message 475422 - Posted: 7 Dec 2006, 3:54:53 UTC - in response to Message 475368.  
Last modified: 7 Dec 2006, 4:08:47 UTC


Sorry, I misunderstood what you said.


No worries, I just meant the FSB inefficiencies of a Quad in a 2-CPU system would be multiplied accordingly. The E6300 comparison would be really interesting in terms of processing per core once the memory access question of the E5320 is resolved. For S@H, 2x4 Xeons may not be the great leap over a 1x4 Kentsfield that the core count would logically suggest, especially since the QX6700 can be OC'd and use the full range of performance memory etc. The greater inefficiencies going from 2 to 4 to 8 CPUs result in compromised performance & diminishing returns based on current board & chipset designs. Don't get me wrong, I'd gladly run a Clovertown if I had access, but I don't see it as a performance/value proposition if considering building a cruncher doubling as a light duty workstation.

That would depend on if Intel increased the speed and bandwidth of the MCH to handle the extra traffic. But yes, there will be a small latency involved being that it has to go through the traces on the motherboard as opposed to having the MCH built right into the CPU, as with the AMDs.

However, that can be alleviated with a good code/data pre-fetch unit to keep the L2 and L1 caches full with the appropriate code/data. Intel has claimed, and some tests have confirmed, that the accuracy of this pre-fetch unit is up to 99%. On many of these tests, the Xeons have been shown to outperform the Opterons in almost every area (with a few exceptions). As long as you can keep the CPUs busy, it should be hard to notice a performance penalty using an external MCH.


Quite interesting and informative. Thanks for the explanation.

ID: 475422
zombie67 [MM]
Volunteer tester
Joined: 22 Apr 04
Posts: 758
Credit: 27,771,894
RAC: 0
United States
Message 475495 - Posted: 7 Dec 2006, 6:14:23 UTC
Last modified: 7 Dec 2006, 6:14:38 UTC

I wonder if Crunch3r's BOINC client 5.7.5 with affinity would improve the results? I think it would with 4 different L2 caches.
Dublin, California
Team: SETI.USA
ID: 475495
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14653
Credit: 200,643,578
RAC: 874
United Kingdom
Message 475520 - Posted: 7 Dec 2006, 7:32:42 UTC - in response to Message 475495.  

I wonder if Crunch3r's BOINC client 5.7.5 with affinity would improve the results? I think it would with 4 different L2 caches.

I wondered about that, and I'm currently running Trux's 5.3.12.tx36 affinity client. Doesn't seem to have made much difference.
ID: 475520
zombie67 [MM]
Volunteer tester
Joined: 22 Apr 04
Posts: 758
Credit: 27,771,894
RAC: 0
United States
Message 475540 - Posted: 7 Dec 2006, 8:46:02 UTC - in response to Message 475520.  

I wonder if Crunch3r's BOINC client 5.7.5 with affinity would improve the results? I think it would with 4 different L2 caches.

I wondered about that, and I'm currently running Trux's 5.3.12.tx36 affinity client. Doesn't seem to have made much difference.

When you get a chance, I would like to see if Crunch3r's version has a different result.
Dublin, California
Team: SETI.USA
ID: 475540
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14653
Credit: 200,643,578
RAC: 874
United Kingdom
Message 475549 - Posted: 7 Dec 2006, 9:37:57 UTC - in response to Message 475540.  

I wonder if Crunch3r's BOINC client 5.7.5 with affinity would improve the results? I think it would with 4 different L2 caches.

I wondered about that, and I'm currently running Trux's 5.3.12.tx36 affinity client. Doesn't seem to have made much difference.

When you get a chance, I would like to see if Crunch3r's version has a different result.

OK will do, probably over the weekend as I've got to go and get the 6300s bedded down in their new domain now.
ID: 475549
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14653
Credit: 200,643,578
RAC: 874
United Kingdom
Message 476421 - Posted: 8 Dec 2006, 12:25:07 UTC

I've now installed Crunch3r's client:

08/12/2006 11:59:16||BOINC 5.7.5.32 - 32 bit Edition by Crunch3r
08/12/2006 11:59:16||enabled features:
08/12/2006 11:59:16||-cpu_affinity
08/12/2006 11:59:16||-return_results_immediately
08/12/2006 11:59:16||
08/12/2006 11:59:16||
08/12/2006 11:59:16||Suspending network activity - user request
08/12/2006 11:59:18||Running CPU benchmarks
08/12/2006 11:59:28||Resuming network activity
08/12/2006 12:00:17||Benchmark results:
08/12/2006 12:00:17|| Number of CPUs: 8
08/12/2006 12:00:17|| 1748 floating point MIPS (Whetstone) per CPU
08/12/2006 12:00:17|| 1581 integer MIPS (Dhrystone) per CPU

(Sorry Rom, didn't mean to include RRI on a machine as fast as this, but it seems to come with the territory. Crunch3r's documentation is, shall we say, sparse: does anyone know if there's a switch to disable RRI while keeping affinity?)

I'll start a new series on the chart and post it when I've got a reasonable number of data points.

Two observations to be getting on with:

1) Crunch3r's benchmark is much slower than the one I posted at message 475123 for the Xeons, especially the Dhrystone. May be an artefact of the way I triggered them - I'll do a comparison run later.

2) I've noticed that there's a very big difference between the computers for VHAR WUs (AR > 1.25). The E6300s did these particularly well: the Xeon 5320 does them particularly badly.

Some rough figures:

E6300 did about 32 credits/hour on 'standard' 0.42AR: VHAR are scattered, but a bit of a cluster around 37 credits/hour (i.e. faster than standard)

E5320 does about 18 credits/hour on 'standard' 0.42AR: VHAR are again scattered, but in the range 8 - 15 credits/hour (i.e. significantly slower than standard). This really messes up the RDCF: on standard units I can go down to about 0.35 RDCF, but I've just reported some VHARs which have taken DCF up to over 1.1! Really messes up the cache fetch calculations....
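
To make the comparison concrete, here is a small sketch using only the rough figures above. It assumes the credits/hour numbers are per core and that every core sustains the same rate, which the post does not state outright.

```python
# Credits/hour figures quoted in the post; assumed here to be per core.
E6300 = {"cores": 2, "standard": 32.0, "vhar": 37.0}
E5320 = {"cores": 8, "standard": 18.0, "vhar_low": 8.0, "vhar_high": 15.0}

def machine_rate(cpu, key="standard"):
    """Whole-machine credits/hour if every core sustains the per-core rate."""
    return cpu["cores"] * cpu[key]

per_core_ratio = E5320["standard"] / E6300["standard"]
machine_ratio = machine_rate(E5320) / machine_rate(E6300)
e6300_vhar_vs_standard = E6300["vhar"] / E6300["standard"]
e5320_vhar_vs_standard = (
    E5320["vhar_low"] / E5320["standard"],
    E5320["vhar_high"] / E5320["standard"],
)

print(f"E5320 core vs E6300 core (standard AR): {per_core_ratio:.2f}x")
print(f"8-core Xeon box vs one E6300 box:       {machine_ratio:.2f}x")
print(f"E6300 VHAR vs standard: {e6300_vhar_vs_standard:.2f}x")
print(f"E5320 VHAR vs standard: {e5320_vhar_vs_standard[0]:.2f}x "
      f"to {e5320_vhar_vs_standard[1]:.2f}x")
```

On those assumptions the Xeon box has four times the cores but only a little over twice the output of a single E6300 machine on standard units, and the gap widens on VHAR work.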

Observations, anyone?

[P.S. for new readers - all timings etc. taken with Simon the Chicken's (and friends') app version 1.41]
ID: 476421
zombie67 [MM]
Volunteer tester
Joined: 22 Apr 04
Posts: 758
Credit: 27,771,894
RAC: 0
United States
Message 476599 - Posted: 8 Dec 2006, 14:16:24 UTC - in response to Message 476421.  

I've now installed Crunch3r's client:
2) I've noticed that there's a very big difference between the computers for VHAR WUs (AR > 1.25). The E6300s did these particularly well: the Xeon 5320 does them particularly badly.
[...]
[P.S. for new readers - all timings etc. taken with Simon the Chicken's (and friends') app version 1.41]


Perhaps the culprit could be the 1.41 app? I wonder what the comparison would look like with the stock app? Maybe it is not that the quad is so much slower, but that the 1.41 app improves the dual core to a higher degree?
Dublin, California
Team: SETI.USA
ID: 476599
Profile Reuben Gathright
Joined: 8 Mar 01
Posts: 213
Credit: 14,594,579
RAC: 0
United States
Message 476638 - Posted: 8 Dec 2006, 15:10:00 UTC

Alright, look up my Core 2 Based Xeon system and you will find that even after OVERCLOCKING MY DUAL 5120 XEON, using an optimized version of BOINC with affinity and running chicken's app... I am still slower than most Core 2 systems.

Reasons:
1) ECC memory is slower.
2) FB-DIMMs have latency issues because of their embedded control chips.

The good news is that even if I get beat WU for WU every day by single socket motherboards... as soon as I put in two quad chips no one can beat the total crunching power.
ID: 476638
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14653
Credit: 200,643,578
RAC: 874
United Kingdom
Message 476716 - Posted: 8 Dec 2006, 18:02:26 UTC - in response to Message 476599.  
Last modified: 8 Dec 2006, 18:03:01 UTC

I've now installed Crunch3r's client:
2) I've noticed that there's a very big difference between the computers for VHAR WUs (AR > 1.25). The E6300s did these particularly well: the Xeon 5320 does them particularly badly.
[...]
[P.S. for new readers - all timings etc. taken with Simon the Chicken's (and friends') app version 1.41]


Perhaps the culprit could be the 1.41 app? I wonder what the comparison would look like with the stock app? Maybe it is not that the quad is so much slower, but that the 1.41 app improves the dual core to a higher degree?

Well, that's partly why I started this thread in the first place - to see if that's a common experience: and if it is, perhaps to give a heads-up to the optimisers that their work isn't finished yet.... (sorry about that, folks - I'm really appreciative of all the work you've done already).

Also, if we can get some more experiences recorded, it might help people choosing their next rig.
ID: 476716
jrmy

Joined: 14 Jun 04
Posts: 10
Credit: 21,475
RAC: 0
United States
Message 480348 - Posted: 11 Dec 2006, 21:47:05 UTC - in response to Message 476716.  
Last modified: 11 Dec 2006, 21:47:40 UTC

Well, that's partly why I started this thread in the first place - to see if that's a common experience: and if it is, perhaps to give a heads-up to the optimisers that their work isn't finished yet.... (sorry about that, folks - I'm really appreciative of all the work you've done already).

Also, if we can get some more experiences recorded, it might help people choosing their next rig.
i am running a 6300(non OC), 1gb 667 ram, and chicken's core 2 enhanced app. I haven't noticed much of a difference in crunching times, if any difference at all, between it and the stock app.. each WU taking approx. 1.5-2h.

i can run any benchmarks and supply any further data if needed.
ID: 480348
Sisyfos

Joined: 22 Jul 00
Posts: 7
Credit: 2,796,632
RAC: 0
Denmark
Message 480392 - Posted: 11 Dec 2006, 22:28:14 UTC - in response to Message 476421.  

...This really messes up the RDCF: on standard units I can go down to about 0.35 RDCF, but I've just reported some VHARs which have taken DCF up to over 1.1! Really messes up the cache fetch calculations....


A comment on the RDCF...
I experienced an oddness with Crunch3r's 5.7.5.
Eg.: I received a batch of similar WUs with, for example, 1:30:00 to completion.
When I finished the first WU in 1:12:15 the "To completion" of the rest rose to 1:32:12 and in general my RDCF was approximately double what it "should" be.
I subsequently reverted to 5.4.11 and the RDCF quickly went back to normal.
If you take a peek at my Toledo, which still runs 5.7.5, you'll see that RDCF is around 0.6. That amounts to 04:45:00'ish on the 0.42 variety WUs, but they only take around 02:20:00.

Just a FYI.
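
The Toledo numbers can be worked through as follows. This sketch assumes the displayed "to completion" time is simply an uncorrected estimate multiplied by the correction factor; that is a simplified model of how the client uses the factor, and the variable names are made up.

```python
def hms_to_seconds(hms):
    """Convert 'HH:MM:SS' to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

shown_estimate = hms_to_seconds("04:45:00")  # estimate shown at RDCF ~0.6
actual_runtime = hms_to_seconds("02:20:00")  # what the WUs really take
reported_dcf = 0.6

# Assumed model: shown estimate = uncorrected estimate * correction factor.
uncorrected_estimate = shown_estimate / reported_dcf
matching_dcf = actual_runtime / uncorrected_estimate
overshoot = reported_dcf / matching_dcf

print(f"uncorrected estimate: {uncorrected_estimate:.0f} s")
print(f"factor that would match the real runtime: {matching_dcf:.2f}")
print(f"reported factor is {overshoot:.2f}x that value")
```

Under that model the factor would need to be about 0.29 rather than 0.6, which agrees with the "approximately double" observation.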
ID: 480392
OzzFan (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 480394 - Posted: 11 Dec 2006, 22:34:20 UTC - in response to Message 480392.  

I experienced an oddness with Crunch3r's 5.7.5.


BOINC v5.7.5 is a beta release direct from the developers. It is not put out by Crunch3r. The "bug" is still interesting to know, nonetheless. Perhaps this should be brought to JM7's or ROM's attention, if they aren't already aware of this.
ID: 480394
Profile Jakob Creutzfeld
Volunteer tester
Joined: 13 Oct 00
Posts: 611
Credit: 2,025,000
RAC: 0
Germany
Message 480396 - Posted: 11 Dec 2006, 22:41:08 UTC - in response to Message 480348.  

i am running a 6300(non OC), 1gb 667 ram, and chicken's core 2 enhanced app. I haven't noticed much of a difference in crunching times, if any difference at all, between it and the stock app.. each WU taking approx. 1.5-2h.


It's a matter of the angle range. The two stock-crunched WUs that are still shown on your results page (the two oldest ones) took about 8,000 seconds to complete for about 33 credits. The newer ones show about 7,500 seconds for about 62 credits! (For comparison: one of your results with 34.25 credits took 5,256.83 seconds with Simon's app).

So I think there's an increase of about 30%...

Andy
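
Because run time depends on angle range, credits per hour is a fairer yardstick than time per WU. The sketch below converts the figures quoted above; the "about" values are taken at face value, and which of the resulting percentages the "about 30%" refers to is not stated in the post.

```python
def credits_per_hour(credits, seconds):
    """Credit rate for a single result."""
    return credits / seconds * 3600.0

older_stock = credits_per_hour(33.0, 8000.0)       # two oldest, stock app
newer_results = credits_per_hour(62.0, 7500.0)     # newer, different angle range
similar_credit = credits_per_hour(34.25, 5256.83)  # Simon's app, similar credit

# Time saved on a WU of similar credit value (33 vs 34.25 credits).
time_saved = 1.0 - 5256.83 / 8000.0

print(f"older stock results:        {older_stock:.2f} credits/hour")
print(f"newer results:              {newer_results:.2f} credits/hour")
print(f"Simon's app, ~34 credit WU: {similar_credit:.2f} credits/hour")
print(f"time saved on similar WU:   {time_saved:.0%}")
```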
ID: 480396
Sisyfos

Joined: 22 Jul 00
Posts: 7
Credit: 2,796,632
RAC: 0
Denmark
Message 480397 - Posted: 11 Dec 2006, 22:41:51 UTC - in response to Message 480394.  
Last modified: 11 Dec 2006, 22:43:31 UTC

It was the "BOINC 5.7.5.32 - 32 bit Edition by Crunch3r" with Affinity and RRI to be specific.
I haven't tried the unaltered 5.7.5, so I wouldn't know if it has the same behaviour.

EDIT: In reply to OzzFan
ID: 480397
OzzFan (Crowdfunding Project Donor, Special Project $75 donor, Special Project $250 donor)
Volunteer tester
Joined: 9 Apr 02
Posts: 15691
Credit: 84,761,841
RAC: 28
United States
Message 480459 - Posted: 12 Dec 2006, 0:09:17 UTC - in response to Message 480397.  

It was the "BOINC 5.7.5.32 - 32 bit Edition by Crunch3r" with Affinity and RRI to be specific.
I haven't tried the unaltered 5.7.5, so I wouldn't know if it has the same behaviour.

EDIT: In reply to OzzFan


Ah, OK. That makes more sense now.
ID: 480459
jrmy

Joined: 14 Jun 04
Posts: 10
Credit: 21,475
RAC: 0
United States
Message 480479 - Posted: 12 Dec 2006, 0:22:59 UTC - in response to Message 480396.  

i am running a 6300(non OC), 1gb 667 ram, and chicken's core 2 enhanced app. I haven't noticed much of a difference in crunching times, if any difference at all, between it and the stock app.. each WU taking approx. 1.5-2h.


It's a matter of the angle range. The two stock-crunched WUs that are still shown on your results page (the two oldest ones) took about 8,000 seconds to complete for about 33 credits. The newer ones show about 7,500 seconds for about 62 credits! (For comparison: one of your results with 34.25 credits took 5,256.83 seconds with Simon's app).

So I think there's an increase of about 30%...

Andy
oh
ID: 480479
Richard Haselgrove Project Donor
Volunteer tester

Joined: 4 Jul 99
Posts: 14653
Credit: 200,643,578
RAC: 874
United Kingdom
Message 480788 - Posted: 12 Dec 2006, 11:24:53 UTC

OK, the Xeons are just two weeks old so it's time for an update.


[Chart: image not available]

I've been running the latest affinity client (Crunch3r 5.7.5.32) for about four days - green dots. It seems to be slightly quicker on average, but that may be the eye of faith! Unfortunately, I didn't get so many VHAR in this run, so not as much data as I would have liked.


[Chart: image not available]

I didn't post this one last time, but I think it's interesting. Look how the E6300s (pink dots) outperform the Xeons (blue and green dots) at high AR. Remember, same science app (Simon's 1.41) throughout.

Current RAC for the machine (8 cores) is about 2250, and it made number 80 in the top list overnight - not bad for a young 'un. I reckon it should reach just over 3000 RAC, and might just make the top 20 if people stop moving the goalposts!
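
The "just over 3000" estimate is consistent with a simple decay model. The sketch below assumes RAC behaves like an exponentially weighted average with a one-week half-life, that the machine started from zero, and that it has earned credit at a constant rate since; real RAC is updated as credit is granted and lags behind validation, so treat this as a back-of-envelope check only.

```python
HALF_LIFE_DAYS = 7.0  # assumed RAC half-life

def steady_state_rac(current_rac, days_running, half_life=HALF_LIFE_DAYS):
    """Estimate the RAC a machine converges to, given its RAC after some days.

    Assumes RAC(t) = steady_state * (1 - 0.5 ** (t / half_life)),
    i.e. a start from zero and a constant daily credit rate.
    """
    fraction_reached = 1.0 - 0.5 ** (days_running / half_life)
    return current_rac / fraction_reached

projected = steady_state_rac(current_rac=2250, days_running=14)
print(f"projected steady-state RAC: {projected:.0f}")  # prints 3000
```

After two half-lives the average has reached 75% of its final value, so 2250 at the two-week mark points to roughly 3000.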

And finally, I see that Dell are now taking orders for this chassis with dual X5355 (2.66GHz) quad cores - at a price, and delivery not until next year. I think I'll wait until we get the RAM questions sorted out.
ID: 480788


 
©2024 University of California
 
SETI@home and Astropulse are funded by grants from the National Science Foundation, NASA, and donations from SETI@home volunteers. AstroPulse is funded in part by the NSF through grant AST-0307956.