Cache is King, Pentium-M CPU is the fastest!!!

Profile Team Cake Boy's

Joined: 24 Jul 99
Posts: 22
Credit: 6,923,381
RAC: 0
United Kingdom
Message 114299 - Posted: 24 May 2005, 7:38:23 UTC - in response to Message 114296.  
Last modified: 24 May 2005, 7:53:32 UTC

Yep, it's had a lot of improvements.

The biggest one being the jump from the ageing P3 bus
to the P4 bus (think about it).

Pentium Pro
Pentium II
Pentium III
all ran on the same (ancient) bus;
it just jumped from 60 MHz (Pentium Pros came with 60 and 66 MHz buses)
to 133 MHz.

Now the mobile is on a quad-pumped 533 FSB, and no doubt 800 soon, with dual channel available to it with the right board.


SSE2
Improved branch prediction
64K L1 Cache
2MB L2 Cache
TLB Buffers
Modified Register Access System
Modified Prefetch

Etc

But at the end of the day it's still a tweaked P3.
I'm so glad they have done this, and that the P3 core is moving back to the desktop
in the form of a dual-CPU desktop chip.

I was an Intel man until the P4 came out;
have been Athlon XP since!!

Umm, no.
The new mobiles
are Pentium III CPUs with new stuff added: SSE2, more cache, etc.

As in, they are the same core.
All these CPUs use the same original core!!
But are just updated versions:

Pentium Pro
Pentium II
Pentium III
Mobile Pentium-M

Oh, and guess what:

the replacement for the Pentium 4 is going to be Pentium Pro based.


That's interesting. If my memory serves me correctly, Intel dropped the Pentium III line in favor of the Pentium 4, as they could not keep up with AMD's Athlon in the MHz race.

Also, I am pretty sure the Pentium-M is missing the SSE2 instruction set, as several motherboards that tried to use the P-M CPU drew bad reviews in places like AnandTech and Tom's Hardware for lackluster gaming performance (blamed on the lack of SSE2). They do still have MMX and SSE1.

I too believe that the Pentium-M is based on the P3. However, I'd say it has been significantly improved. For one thing, when it comes to SETI, a 1.6 GHz Pentium-M will outcrunch a 2.2 GHz AMD Athlon. Back in the day, MHz for MHz, the P3s were actually a little slower than the Athlons.


Jimmy


ID: 114299
PhonAcq

Joined: 14 Apr 01
Posts: 1656
Credit: 30,658,217
RAC: 1
United States
Message 114326 - Posted: 24 May 2005, 11:44:10 UTC

This discussion is becoming a long thread, but in my mind, and by my experience, several points are clear (and probably cause for tomato throwing, so I'll duck):

1) Out of the box, the Pentium M is significantly faster than its standard desktop peers due to the cache.

This has been well known for years. Hence the desktops coming out that use the mobile chip.

2) A larger improvement in speed derives from using an optimized client that is properly tuned for the hardware.

My Dell notebook PM 1.6 got a huge boost in speed. Again, optimizing compilers have been around since, what, the sixties. Not sure why SETI never used them before (or now).

3) The credit system in BOINC is a flop (not a MFLOP either).

Faster computers return more results to the project but get less credit, all else being equal. Right now I usually get more credit than I request because the other results take more time; this will not be the case as more people run faster computers or use optimized clients. I would think that people who return more scientific results should get more 'credit'. Why is the BOINC credit algorithm so inverted? Perhaps BOINC is trying to be 'equitable' for some reason, but this is nonsense in the capitalist marketplace, as it should be in this project. MHO

May this Farce be with You
ID: 114326
Profile Paul D. Buck
Volunteer tester

Joined: 19 Jul 00
Posts: 3898
Credit: 1,158,042
RAC: 0
United States
Message 114373 - Posted: 24 May 2005, 14:35:26 UTC - in response to Message 114326.  

This discussion is becoming a long thread, but in my mind, and by my experience, several points are clear (and probably cause for tomato throwing, so I'll duck):

Don't duck, save money and catch your lunch ...


3) The credit system in BOINC is a flop (not a MFLOP either).

Faster computers return more results to the project but get less credit, all else being equal. Right now I usually get more credit than I request because the other results take more time; this will not be the case as more people run faster computers or use optimized clients. I would think that people who return more scientific results should get more 'credit'. Why is the BOINC credit algorithm so inverted? Perhaps BOINC is trying to be 'equitable' for some reason, but this is nonsense in the capitalist marketplace, as it should be in this project. MHO

This is an anomaly of the current benchmarking scheme. In the definition of the Cobblestone reference computer, and in how we calculate the credit claimed, we have actually "embedded" the processing speed of the software.

I know it is a little hard to get your brain wrapped around that, but the credit claimed is partly based on assumptions built into the system as a whole. So you actually have credit claimed based on a hardware-plus-software system, while other parts assume that only the hardware can vary.

Since both can vary, the work performed and the grants on that claim are biased. Which gets back to my suggestion that we go back to using a calculation based on the "Reference Work Unit" to establish the execution speed of the system as a whole. With that in place, the faster systems, regardless of where they got the speed increase, will be properly "sized" and will claim the same credit as systems that are not as fast.

And you are correct, the claim should be based on the actual amount of processing done, and the ridiculous claims of 0.05 Cobblestones for a work unit will stop showing up (look at the claims of the top producing computer, a 64-CPU system) ...
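
To make the inversion concrete, here is a minimal sketch of a benchmark-times-time claim, roughly in the spirit of the scheme being criticized (my own illustration in C, not BOINC's actual source; the only grounded constant is the published Cobblestone definition of 100 credits per day on a 1 GFLOPS reference machine):

    #include <stdio.h>

    /* Illustrative sketch only, not BOINC's code: claim = benchmark speed
       times CPU time, scaled so a full day on a 1 GFLOPS reference machine
       claims 100 Cobblestones. */
    static double claimed_credit(double benchmark_gflops, double cpu_seconds)
    {
        return (cpu_seconds / 86400.0) * benchmark_gflops * 100.0;
    }

    int main(void)
    {
        double bench = 1.5;   /* hypothetical benchmark result, in GFLOPS */

        /* Same host, same work unit: an optimized science app finishes in
           half the CPU time, so it claims half the credit for identical
           scientific output -- the inversion described above. */
        printf("stock client:     %.1f claimed\n", claimed_credit(bench, 12000.0));
        printf("optimized client: %.1f claimed\n", claimed_credit(bench, 6000.0));
        return 0;
    }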

ID: 114373
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13943
Credit: 208,696,464
RAC: 304
Australia
Message 114517 - Posted: 24 May 2005, 22:52:36 UTC - in response to Message 114288.  

Umm, no.
The new mobiles
are Pentium III CPUs with new stuff added: SSE2, more cache, etc.

As in, they are the same core.
All these CPUs use the same original core!!
But are just updated versions:

Pentium Pro
Pentium II
Pentium III
Mobile Pentium-M

Ah, no.
The Pentium M is much more PIII-like than P4-like, but it is not the same as a PIII, not by a long shot.
Grant
Darwin NT
ID: 114517
Grant (SSSF)
Volunteer tester

Joined: 19 Aug 99
Posts: 13943
Credit: 208,696,464
RAC: 304
Australia
Message 114518 - Posted: 24 May 2005, 22:56:26 UTC - in response to Message 114326.  

1) Out of the box, the Pentium M is significantly faster than its standard desktop peers due to the cache.

This has been well known for years. Hence the desktops coming out that use the mobile chip.

Ah, no.

The P4 actually benefits much more from a larger cache than the Pentium M does, due to the P4's higher clock speed and deeper pipeline.
While a larger cache does help the Pentium M, its architectural improvements are what give it its performance boost at a given clock speed.
Grant
Darwin NT
ID: 114518
Profile Team Cake Boy's

Joined: 24 Jul 99
Posts: 22
Credit: 6,923,381
RAC: 0
United Kingdom
Message 114538 - Posted: 25 May 2005, 0:15:32 UTC - in response to Message 114517.  

I advise you to read up:
the Pentium-M is based on the P6 architecture.

So are:
Pentium Pro
Pentium II
Celeron
Celeron II
Pentium III
Pentium III Tualatin
Pentium-M

Are you also saying a Pentium Pro isn't using the same base core as a PIII?

Umm, no.
The new mobiles
are Pentium III CPUs with new stuff added: SSE2, more cache, etc.

As in, they are the same core.
All these CPUs use the same original core!!
But are just updated versions:

Pentium Pro
Pentium II
Pentium III
Mobile Pentium-M

Ah, no.
The Pentium M is much more PIII-like than P4-like, but it is not the same as a PIII, not by a long shot.


ID: 114538
Profile Team Cake Boy's

Joined: 24 Jul 99
Posts: 22
Credit: 6,923,381
RAC: 0
United Kingdom
Message 114539 - Posted: 25 May 2005, 0:21:00 UTC

Intel’s Dothan Part Deux: Tiny Tim Meets California’s Gov

Ten years ago, who would have thought Arnold would be in the governor’s chair, especially in deep-blue California . . .

Several months ago, who would have thought Intel had something cool, quiet and beastly powerful up their sleeves?

Matt, your friendly neighborhood CompE, is back in action, to unravel the mysteries of the (computing) world. (The Bermuda Triangle is a bit out of my reach, I think . . . ) If you haven't read the previous article, "Intel's Dothan: The Little Engine That Could?", please do so before proceeding, otherwise a mob of angry neo-anarchist nuns will invade your house. Of course, if you've already read it, I guess there's no harm in reading on . . .

Dothan, while a bit on the wild side, looks increasingly appealing as 533 FSB and Alviso-based motherboards lie on the horizon. But with all the press and marketing effort going toward the {understatement}successful{/understatement} Centrino brand, relatively little is known about the brains of the operation, the Pentium M core architecture. This humble creature boasts none of today's more forward-looking platform features: the Athlon 64 has 64-bit support, the NX bit and an on-die memory controller, with SSE3 coming soon and dual core/DDR2 coming later; Prescott has DDR2, HT, and EM64T/EDB, with rapidly approaching dual-core versions and sooner-than-expected virtualization technologies.

Despite all the aforementioned buzzwords, Dothan still brings the heat in clock-for-clock performance, based on a design from yesteryear quietly restored to life to rule the mobile roost. After all, Intel told us the P6 core was old hat, past its prime, ready to be retired. Perhaps the Pentium IIIs of the world got together, decided they'd heard enough NetBurst propaganda to invoke gag reflexes, and, a la Frankenstein, created a monster from the Ghost of CPU Engineers' Past. Now that monster has broken its mobile chains with blistering performance - but, dear Lazarus, how is this possible?

Pentium Pro Is Alive and Kicking

Anyone here remember the days of the Pentium Pro? I remember being a little kid in the barber’s chair, talking to my barber and fellow enthusiast Jeff about how sweet it would be to have a dual 200MHz Pentium Pro system. We’ve grown a lot since then – Jeff’s comfy with a laptop, 200MHz is slow for an embedded processor, and me – OK, I lied, I’m running a Vapochill and putting together an SLI rig.

In reality, the Pentium M doesn't go back quite that far. The Pentium Pro core, dubbed "P6", has changed substantially from its initial incarnation, and the P-M beats with the heart of a Pentium III (no surprise there). However, Dothan's per-clock performance is much heftier than any P3 I've ever known - back in the day, the classic Slot A Athlon (codename K7, later K75 at 180nm and Mustang with Cu interconnect) and the Slot 1 P3 (codename Katmai, later Coppermine at 180nm with on-die cache) were on arguably equal footing. The move to Socket A/370 (Thunderbird/Mustang and Coppermine, respectively) didn't make either a clear favorite, only increasing the performance divide in specific areas.

But Dothan bests even an Athlon FX in clock-for-clock performance (in the gaming benchmarks I'll focus on for this article), and the 64/FX has significantly improved IPC over the Athlon XP. Not to mention that Dothan's power loss per clock is nearly one fourth that of a .13um A64. Quite frankly, that's amazing!!

(Now a few oranges have fallen into the apple barrel – we don’t have good power figures on current 90nm A64’s, and the San Diego/Venice chips on strained SOI are not here yet – because Dothan is a mobile, 90nm chip. Still, power differences of that magnitude don’t disappear overnight.)

Dothan's power efficiency isn't difficult to explain (anyone notice me avoiding the IPC? Good . . . ). While die shrinks don't bring the power savings they used to (for numerous reasons), there's nothing inherently power-greedy about 90nm production. Transistors can be designed for two kinds of switching behavior - fast, with high leakage, or slow, with low leakage. I don't have to tell you which kind Prescott uses *wink*

Since Dothan runs at much lower clock speeds, Intel can use transistors that do not switch as fast, saving a large amount of leakage power and enabling lower-voltage operation. The P-M design needs fewer transistors for execution, with the current 2MB cache taking the lion's share of the die space, and it's easy to limit power loss in cache memory. Further, though it wasn't mentioned in anything I read, I suspect Dothan makes aggressive use of clock gating, a simple but relatively new technique - quite simply, the clock signal is "gated", or turned off, to portions of the chip that are not in use. Core architectural features also contribute to power savings, but I won't cover those here: you can read about them in my primary reference for the next section, Ars Technica's Pentium M article.

Killer IPC – Sometimes A Bit of Tweaking Makes All The Difference

Do yourself a favor and read Ars' article if you want to know more about what makes the P-M unique (warning: extremely technical!). If you'd rather just have a general idea of how it accomplishes more with less, please keep reading.

We know Banias/Dothan have SSE2 units and somewhere around 12 to 14 pipeline stages, but details on the SSE2 units and the exact pipeline depth remain a mystery. (Intel has chosen not to reveal that information - remember, Centrino is the press-maker, not the processor behind it.) However, that information is not vital to understanding the chip's inner workings, at least from a performance standpoint. (Whoa d00d, it's snowing at 4AM! ADD under control.) Intel made two key changes to the Pentium III core, based on the P4's strengths: improved branch prediction and micro-ops fusion. I won't cover the stack engine because, though it does allow the CPU to do more useful work, it primarily adds power savings, in my ever-so-humble opinion.

In microprocessor terms, a "branch" is a point at which a program can take two or more execution paths. Because CPUs continue doing work before the direction of the branch is determined, accurate branch prediction is crucial to keeping the processor doing useful work - a branch misprediction requires the CPU to flush the pipeline and start over, greatly delaying execution down the correct path. The Pentium 4 brought Intel's most aggressive dynamic branch prediction yet (the Athlon 64 caught on later), but the Pentium M takes it further, making it possibly the most advanced branch prediction unit in wide use today (the PowerPC 970, or G5, is comparable, but that's not x86, so who gives a )

Entirely new is the "loop detector", which gives the branch prediction unit better forewarning that execution is about to exit a loop. Again, if the BPU guesses wrong, a pipeline flush is required. As with all things NetBurst, the P4 has a more brute-force method of guessing the loop exit point, keeping a history of previous loop exits in the branch history table - the P-M essentially makes more efficient use of its BHT. Banias/Dothan are also equipped with an "indirect branch predictor". Much like a pre-rendered cutscene, a direct branch is one whose target the program has already specified in case the branch occurs - and much as real-time cutscenes require calculation, indirect branches must have their next path determined at runtime. Instead of merely using the BHT as a history for the branch predictor's reference, the P-M actually stores the targets that indirect branches like to take, giving the BPU more relevant information to use.

The Pentium M's micro-ops fusion capabilities, on the other hand, are actually less advanced than those of the P4. In modern x86 CPUs, x86 instructions from software are broken down into smaller "micro-ops" for the processor to handle internally. To do this, processing units called "x86 decoders" break the original x86 instructions into one or more micro-ops - the P4, for example, has two such units, while the P3 has three. However, only one of the P3's decoders can handle instructions that break down into two or more micro-ops. Two commonly used x86 instructions, which could only be handled by that one decoder in the P3, can be handled by all three decoders in the P-M, potentially allowing three times as many of these instructions to be decoded in parallel and giving certain memory-heavy operations three decoder targets instead of just one.

How does this work? In the P-M, the two so-called "simple" x86 decoders were not granted the capability to handle multiple micro-ops, as doing so would require too much power. However, for the two previously mentioned common x86 instruction types, the output micro-ops are "fused", or handled as a single unit, allowing the simple decoders to process instructions that the P3's simple decoders could not. In my yet-again-humble opinion, micro-ops fusion is the single most effective improvement in the P-M - you can see the possibilities when three operations can be handled at once!

But Mommy: Where’s the Hyper-Threading?

Whew – almost done folks.

What I find interesting is that in another Ars article, the question is raised: why no Hyper-Threading for the Pentium M? I don't believe the chip could use it - or rather, because of the chip's extremely high IPC, it's less likely that execution units are sitting idle and available to handle work from other threads. The P4's obviously low IPC leaves more execution units sitting idle in a given clock cycle, making HT a valuable asset.

As for dual core on Dothan, it will happen eventually. A little bird whispered of a likely scenario: when 65nm comes knocking, Dothan chips will be granted a second core, with the second core only in use when the system is plugged into the wall. Given that one of Centrino's highest priorities is low power consumption, I'd take this to be the most probable use of dual core, with desktop versions able to utilize both cores 24/7 at the cost of power dissipation.

That is, assuming Intel allows desktop versions . . . wait, I'm talking to the wrong audience; Intel doesn't have to allow desktop varieties for us to use them as such

What I mean is, don’t hold your breath for Intel to take this mainstream.

It’s late, I’m tired, and (hopefully! Goodness . . .) there’s nothing more to say. Here’s your friendly neighborhood computer engineer signing off . . .

*snore* . . . Kirsten . . . *snore*

ID: 114539
Profile Team Cake Boy's

Joined: 24 Jul 99
Posts: 22
Credit: 6,923,381
RAC: 0
United Kingdom
Message 114540 - Posted: 25 May 2005, 0:24:33 UTC
Last modified: 25 May 2005, 0:30:15 UTC


Pipeline and execution core
I normally discuss the front end first and the execution core second. Because the Pentium M is based on the PIII version of the P6 core, and because the execution core is where it differs least from its predecessor, I'll go ahead and cover it first.

Not much is known about the details of the PM's pipeline - only that it's a little longer than the PIII's yet shorter than the P4's. I think this slight increase in pipeline depth was made for two reasons. First, Intel wanted more headroom to ramp up the PM's clock speed, especially since they opted to tune the circuit layout for lower power consumption rather than higher clock speed. A few extra pipeline stages help offset the hit to clock speed headroom that the new layout brings. The second reason possibly has something to do with micro-ops fusion, about which we'll talk more in a moment.

Since I've never really covered the P6 architecture in much detail, I don't have an article to which I can refer you for the lowdown on the P6 execution core. So what follows is a brief breakdown of the P6 execution core in its PIII incarnation; you can pretty safely take all of this to apply to the PM, as well.
The diagram in the original article makes the core look a bit "wider" than it actually is, but note the port designations on the arrows leading from the reservation station to the execution units; there are five "dispatch ports" in the P6's reservation station through which micro-ops (a.k.a. uops) can travel to reach the execution units. The RS can dispatch up to 5 uops per cycle to the units, but the average is more like 2.5 to 3 uops per cycle.

I think it's possible that the number of entries in the PM's RS has been decreased from the P6 core's 20, and that the number of entries in the ROB has been decreased from 40. I say this because of the PM's micro-ops fusion technique, which I'll cover in the next section.

You'll notice that the P6 core's floating-point and integer capabilities are weaker than those of the 970; compared to the P4, its floating-point capabilities don't look too bad, but its integer capabilities are much weaker. Its SIMD capabilities aren't too shabby, either, when compared to the P4. Speaking of SIMD, I suspect that if the core diagram is inaccurate with respect to the PM, it's because Intel has somehow changed the way the FPU and vector units divide up floating-point and vector work. This is just sort of a random hunch, though.

All of this means that, clock for clock, the PM's execution core stands up quite well to that of the P4 - but of course, we would expect exactly this from the numerous PIII vs. P4 comparisons that accompanied the P4's launch. Now, at this point, I'd normally feel obligated to point out that clock-for-clock comparisons don't mean much when the pipelines of two architectures are as different as those of the PM and the P4. But when power is an issue, as it is here, the P4's higher clock rates spell higher power consumption, so the PM's better clock-for-clock performance is actually a mark in its favor as a mobile platform.

Aside from a deeper pipeline and thus higher clock speeds, one of the main performance-enhancing features the P4 boasted over the PIII was improved branch prediction. Well, the PM takes the P4's branch prediction and goes it not one but two better; the next section tells how.

The front end: branch prediction, micro-op fusion, and the stack engine
The front end of the Pentium M is where all the interesting work has been done, at least from the standpoint of microarchitecture. This is where Intel has improved the P6 core in a way that makes it not only more power-efficient but more powerful as well.

Branch prediction
One of the most important changes to the PM's front end is its improved branch prediction capability. If you've read many of my articles over the past two years, then you've been hearing a lot about the importance of branch prediction. This technique has emerged as one of the most important tools for keeping modern superpipelined architectures from wasting execution cycles on branch-related stalls. The PowerPC 970 spends massive resources on branch prediction, using two separate branch prediction schemes and a history table for determining which one works best. The P4 also spends quite a few transistors on branch prediction, but I'll say more about its scheme in just a moment.

The PM takes the P4's scheme and builds on it by adding two new, more specialized branch predictors that work in tandem with the primary, P4-style branch predictor: the loop detector and the indirect branch predictor.


More branch prediction
The loop detector
One of the most common types of branches a processor encounters is the test condition of a loop. In fact, loops are so common that the static method of branch prediction - in which all branches are assumed to be loop test conditions that evaluate to "taken" - works reasonably well for processors with shallow pipelines (i.e., if a loop goes through 100 iterations before the loop counter runs out and the test condition evaluates to "not taken", then a static branch predictor will have been 99% accurate for that portion of the code).

One problem with static branch predictors is that they always make a wrong prediction on the final iteration of the loop - the iteration on which the branch evaluates to "not taken" - thereby forcing a pipeline stall as the processor recovers from the erroneous prediction. And of course, static prediction works poorly for non-loop branches, like standard if-then branches. For such branches, a static prediction of "taken" is roughly the equivalent of a coin toss.
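
For concreteness, the 100-iteration case from the text looks like this in C (an illustrative snippet of mine, not from the article):

    /* The loop's backward branch is taken on iterations 1-99 and not taken
       on the 100th. A static "always taken" predictor is 99% accurate here,
       yet still eats exactly one pipeline flush per run of the loop, always
       on the final iteration. */
    int sum_to_100(void)
    {
        int sum = 0;
        for (int i = 0; i < 100; i++)
            sum += i;
        return sum;
    }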

Dynamic predictors, like the P4's branch predictor, fix this shortcoming by keeping track of the execution history of a particular branch instruction - whether it was "taken" or "not taken" the past few times it was evaluated - in order to give the processor a better idea of what its outcome on the current pass will probably be. The bigger the table used to track the branch's history, the more data the branch predictor has to work with and the more accurate its predictions can be. Now let's look at the P4's branch predictor as I described it in my first PowerPC 970 article:

The P4's branch predictor uses a 4096-entry branch history table (BHT) to keep track of the branches in a program by recording whether they were taken or not taken when they executed on previous cycles. Using the BHT in combination with an undisclosed but highly accurate branch prediction algorithm, the P4 decides whether each branch that it encounters should be taken or not taken. If the branch is taken, a 4K-entry (or 4096-entry) branch target buffer (BTB) attached to the BHT helps the P4 predict the address at which it should begin speculative execution.

One of the main shortcomings of the P4's branch predictor is that, even though its BHT is relatively sizeable, it doesn't have enough space to store all the relevant execution history for loop branches that run through a very large number of iterations.

Now that we've got enough background, let's take a look at Intel's own description of the loop detector:

The Loop Detector analyzes branches to see if they have loop behavior. Loop behavior is defined as moving in one direction (taken or not-taken) a fixed number of times interspersed with a single movement in the opposite direction. When such a branch is detected, a set of counters are allocated in the predictor such that the behavior of the program can be predicted completely accurately for larger iteration counts than typically captured by global or local history predictors.

So the loop detector plugs into the PM's P4-style BHT and augments it by providing it with extra, more specialized data on the currently executing program's loops. The PM's predictor can then use this data to avoid being caught by surprise when the loop's exit condition is fulfilled.
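
A toy model of that counter scheme might look like the following (a hypothetical sketch of the behavior Intel describes, not their implementation):

    /* One loop-detector entry: learn the trip count when the loop first
       exits, then predict "taken" until the learned count is reached. */
    typedef struct {
        int trip_count;   /* taken-count observed before the last exit   */
        int seen;         /* taken branches observed in the current pass */
    } loop_entry;

    static int predict_taken(const loop_entry *e)
    {
        return e->seen < e->trip_count;   /* exit predicted right on time */
    }

    static void update(loop_entry *e, int was_taken)
    {
        if (was_taken) {
            e->seen++;
        } else {              /* loop exited: remember the iteration count */
            e->trip_count = e->seen;
            e->seen = 0;
        }
    }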

The second type of new branch predictor that the PM introduces is the indirect branch predictor. Branches come in two flavors: direct and indirect. Direct branches have the branch target explicitly specified in the instruction, which means the target is fixed at compile time. Indirect branches, on the other hand, have to load the branch target from a register and so can have multiple potential targets. Storing these potential targets is the function of the branch target buffer (BTB) described in the P4 quote above.

Direct branches are the easiest to predict, and can often be predicted with upwards of 97% accuracy. Indirect branches, in contrast, are notoriously difficult to predict, and at least one paper that I read on the topic puts indirect branch prediction accuracies at around 75% using the standard BTB method.

The PM's indirect branch predictor works a little like the branch history table described above, but instead of storing information on whether a particular branch was "taken" or "not taken" the past few times it executed, it stores information about each indirect branch's favorite target addresses - which targets a particular branch usually jumps to, and under which conditions it jumps to them. So the PM's indirect branch predictor knows that, for a particular indirect branch in the BHT with a specific set of "favorite" target addresses stored in the BTB, under this set of conditions it tends to jump to one target address, while under that set of conditions it likes to jump to another.
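
In C terms, the direct/indirect distinction looks roughly like this (my own illustrative code, not from the article):

    void handle_audio(void);

    void dispatch(int kind, void (**handlers)(void))
    {
        if (kind == 0)
            handle_audio();    /* direct: target fixed at compile time */
        else
            handlers[kind]();  /* indirect: target loaded at runtime, so
                                  the predictor must learn this call
                                  site's "favorite" destinations */
    }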

Intel claims that the combination of the loop detector and the indirect branch predictor gives the Pentium M a 20% increase in overall branch prediction accuracy, resulting in a 7% real performance increase. Of course, the usual caveats apply to these statistics: the increase in branch prediction accuracy, and that increase's effect on real-world performance, depend heavily on the type of code being run.

Improved branch prediction gives the PM a leg up not only in performance but in power efficiency as well. Because of its improved branch prediction capabilities, the PM wastes less energy speculatively executing code that it will have to throw away once it learns that it mispredicted a branch.

The PM's "micro-op fusion" feature is one of the processor's most fascinating innovations for this reason: it functions remarkably like the PowerPC 970's instruction grouping scheme. The analogy isn't perfect, but it is striking. Let me explain.

Most of the folks who've gotten this far into one of my articles are probably aware that the P6 architecture breaks down x86 ISA instructions into smaller instructions called micro-ops, or uops. The majority of x86 instructions translate into a single uop, while a small minority translate into two or three uops, and a very small, very exotic, rarely used minority (string-handling instructions, mainly) translate into sequences of many uops.

Because uops are what the P6's execution core dynamically executes, they're what the reorder buffer (ROB) and reservation station (RS) must keep track of so that results can be put back in program order after being executed out of order. To track a large number of in-flight uops, the P6 core needs a large number of entries in its ROB and RS - entries that take up transistors and hence consume power. The PM cuts down on the number of ROB and RS entries needed by fusing together certain types of related uops and assigning them to a single ROB and RS entry. The fused uops still execute separately, as if they were normal, unfused uops, but they're tracked as a group.

If you read my 970 coverage, then what I just said should sound very familiar. The 970 does essentially the same thing: first it breaks down PPC architected instructions into multiple smaller internal operations (iops) for processing by the execution core, and then it groups them together for tracking in the group completion table (GCT); the iops still execute out of order, but they're tracked as groups to cut down on the number of GCT entries, thereby saving power and allowing the 970 to track more instructions with less logic and less overhead.

Now, you'll recall that the 970 had to introduce a number of architectural features to make this grouping scheme work - I'm talking specifically about its elaborate issue queue structure. Indeed, it's apparent that the 970 was designed from the ground up with such grouping in mind. The PM, on the other hand, is based on the P6 core, and as such its designers were more constrained in the kinds of things they could do to make micro-op fusion work. Specifically, the P6 core's reservation station and issue port structure would have to be pretty heavily rethought for a pervasively used micro-op fusion scheme to be feasible, which is why the PM's designers appear to have limited the technique to two very specific instruction types: the load-op and the store.

It's not completely clear to me whether or not the two "examples" of micro-op fusion cited in Intel's literature are indeed the only kinds of micro-op fusion that goes on. I've been led astray by examples before, so I proceed with some caution here. I do, however, have good reasons to suspect that load-op and store instructions are the only two types that are fused, reasons which I'll outline near the end of the present discussion.

Store instructions on the P6 are broken down into two uops: a store-address uop and a store-data uop. The store-address uop is the command that calculates the address where the data is to be stored, and it's sent to the address generation unit in the P6's store-address unit for execution. The store-data uop is the command that writes the data to be stored into the outgoing store data buffer, from which the data will be written out to memory when the store instruction retires; this command is executed by the P6's store-data unit. Because the two operations are inherently parallel and are performed by two separate execution units on two separate issue ports, these two uops can be executed in parallel--the data can be written to the store buffer at the same time that the store address is being calculated.

According to Intel, the PM's instruction decoder not only decodes the store operation into two separate uops but it also fuses them together. I suspect that there has been an extra stage added to the decode pipe to handle this fusion. The instructions remain fused until they're issued (or "dispatched," in Intel's language) through the issue port to the actual store unit, at which point they're treated separately by the execution core. When both uops are completed they're treated as fused by the core's retirement unit.

A load-and-op, or "read-modify" instruction is what it sounds like: an instruction that loads data from memory into a register, and then performs an operation on that data. Such instructions are broken down into two uops: a load uop that's issued to the load unit and is responsible for loading the needed data, and a second instruction that performs some type of operation on the loaded data and is executed by the appropriate execution unit.

Load-and-op instructions are treated in much the same way as store instructions with respect to decoding, fusion and execution. The main difference in practice is that load-and-op instructions are inherently serial, so they must be executed in sequence.
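
Sketched in C, with the uop breakdown in the comments (the decomposition is as described above; the code itself is just my illustration):

    int example(const int *a, int i, int x, int *out)
    {
        int sum = 0;

        /* Store: one x86 instruction, two uops -- store-address (compute
           the destination address) plus store-data (queue x). Decoded as
           a fused pair, tracked as one ROB/RS entry, split again at the
           issue ports. */
        *out = x;

        /* Load-and-op ("read-modify"): compiles to something like
           add reg, [mem] -- a load uop feeding an ALU uop. Inherently
           serial, but likewise fused for tracking. */
        sum += a[i];

        return sum;
    }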

Now let's take a look at what Intel claims for the results of this scheme.

We have found that the fused micro-ops mechanism reduces the number of micro-ops handled by the out-of-order logic by more than 10%. The reduced number of micro-ops increases performance by effectively widening the issue, rename and retire pipeline. The biggest boost is obtained during a burst of memory operations, where micro-op fusion allows all decoders, rather than the one complex decoder, to process incoming instructions. This practically widens the processor decode, allocation, and retirement bandwidth by a factor of three.

It'll probably help if I parse the above quote for you. The P6 decoding hardware has three decoders: two simple/fast decoders, which can translate only those x86 instructions that become a single uop, and one complex/slower decoder, which translates all multi-uop x86 instructions. Before the advent of uop fusion, store and load-and-op instructions, being multi-uop instructions, had to pass through the complex decoder. So in situations where there was a burst of memory operations, the complex decoder would be backed up with work while the other two decoders sat idle. With uop fusion, however, the two simple/fast decoders can now decode stores and load-and-op instructions, because each can produce a single, fused uop for these two instruction types.
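
A back-of-envelope model of that decode bottleneck (toy numbers of mine, assuming a burst made up entirely of fusible memory instructions):

    /* Without fusion, only the single complex decoder can handle multi-uop
       memory instructions; with fusion, all three decoders can. */
    static int cycles_to_decode(int n_mem_ops, int fusion)
    {
        int usable_decoders = fusion ? 3 : 1;
        return (n_mem_ops + usable_decoders - 1) / usable_decoders;
    }
    /* cycles_to_decode(12, 0) == 12 cycles; cycles_to_decode(12, 1) == 4 */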

Now that that's clear, let's take a look at the next paragraph, on the performance increases afforded by uop fusion:

The typical performance increase of the micro-op fusion is 5% for integer code and 9% for Floating Point (FP) code. The store fusion contributes most of the performance increase for integer code. The two types of fused micro-ops contribute about equally to the performance increase of FP code.

This is about what we would expect, given that floating-point code is usually more memory-intensive than integer code.

The quoted paragraphs above also lend credence to my claim that only store and load-and-op instructions are fused by Centrino's decoders. This seems to be the plain sense of the statements, and it also fits with the aforementioned fact that if more types of instructions were fused then Intel might've had to do something weird with Centrino's reservation station and issue ports (called "dispatch ports" in Intel-speak). Because the load and store units each have their own dedicated ports (the load unit is on port 2 and the store units are on ports 3 and 4), it's likely that the modifications to the RS necessary to accommodate uop fusion were kept to a minimum. Finally, there's also the fact that few arithmetic instructions break up into multiple uops, so the performance benefits to rewiring the RS to accommodate fused arithmetic instructions would be offset by increases in the complexity, and hence the power consumption, of the RS. It's worth noting, however, that a hypothetical future iteration of Centrino with a higher power budget, like for use on the desktop or in a blade server, might be under fewer restrictions and so it might use uop-fusion more extensively.

The stack execution unit is the hardest of the PM's innovations to understand, because it performs such a specialized function. I'll give my best shot at an explanation, so that you can have a general idea of what it does and what role it plays in saving power.

x86 includes stack-manipulation instructions like POP, PUSH, RET, and CALL, for use in passing parameters to functions in function calls. During the course of their execution, these instructions update x86's dedicated stack pointer register, ESP. In the PIII's and P4's cores, this update was carried out by a special uop, which was generated by the decoder and was tasked with using the IEUs (integer execution units) to update ESP by adding to it or subtracting from it as necessary.

The dedicated stack engine eliminates these special ESP-updating uops by monitoring the decoder's instruction stream for incoming stack instructions and keeping track itself of those instructions' changes to ESP; updates to the ESP are handled by a dedicated adder attached to the stack engine. So because the PM's front end has dedicated hardware for tracking the state of ESP and keeping it updated, there's no need to issue those extra ESP-related uops to the IEUs.

This technique has a few benefits. The obvious one is that it reduces the number of in-flight uops, which means fewer uops per task and less power consumed per task. And because there are fewer IEU uops in the core, the IEUs are free to process other instructions, since they no longer have to deal with the stack-related ESP updates.
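
A toy model of the bookkeeping (hypothetical, assuming 4-byte stack slots; not Intel's design):

    /* The front end tracks PUSH/POP adjustments in a small dedicated adder
       instead of issuing add/sub uops to the integer execution units. */
    typedef struct { long esp_delta; } stack_engine;

    static void on_push(stack_engine *se) { se->esp_delta -= 4; }
    static void on_pop (stack_engine *se) { se->esp_delta += 4; }

    /* Only when an instruction actually needs the architectural ESP does
       the accumulated delta get folded into the real register. */
    static long effective_esp(const stack_engine *se, long arch_esp)
    {
        return arch_esp + se->esp_delta;
    }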

Conclusions
To recap a bit from a previous section, initial benchmarks pitting the PIII against the P4 showed the PIII comparing quite favorably to its younger sibling. In fact, even with a slight clock speed advantage, the P4 still couldn't trounce its predecessor in some benchmarks. Inevitably, as the P4 started to pull away from the PIII in clock speed, it began to open up a performance gap as well. Aided by its superior branch prediction capabilities, trace cache, and deeper pipeline, the P4's NetBurst architecture eventually outran the older P6 architecture and took its place as Intel's flagship performance processor. This victory came at a cost, though: the P4's high clock speed made for high power requirements, making the architecture less than ideal for mobile applications.

This brings us to the Pentium M. The PM takes one of the P4's strengths--its branch prediction capabilities--and improves on it, adding its advantages to the strengths of the P6 architecture. The PM also deepens the P6's pipeline a bit, allowing for better clockspeed scaling, but without making clockspeed the central factor driving performance. In short, the PM looks like what the P4 might have been, had Intel not been so obsessed with the MHz race--it's a kind of alternate past, but one that may provide a glimpse of Intel's future.

With an increased power budget and, say, IA-32e support, the Pentium M would be a serious desktop contender. With a massive 1MB L2 cache and a hefty 64K L1 cache, it already has twice the cache of Intel's P4. It may not scale to the clockspeed heights of the Netburst architecture, but with the power consumption of Prescott spiraling out into the stratosphere, that might not be such a bad thing. AMD, after all, has gotten off the MHz train and is marketing based on performance. Centrino has Intel doing the same with regard to the mobile space.

Consider this report on a talk by Intel CTO Patrick Gelsinger:

Gelsinger, giving the final speech of an Intel technology forum, showed the audience a slide of the impossibly high power needs of computer processors as a way of arguing that chip designers must radically change chip architectures, and that Intel would be the company to do just that.

"We need a fresh approach," Gelsinger said. "We need an architectural paradigm shift."

That "fresh approach" might look something like a future version of the Pentium M, in a system that gangs together multiple low-power processors and relies heavily on multi-threaded software.

Not coincidentally, this looks something like what I've elsewhere described as IBM's plan for the PowerPC 970 and its offspring. In fact, let's take a look at how the 970 stacks up against the PM from a power consumption point of view:

Processor     Process   Size      Transistors   Core voltage   Power
1.2GHz PM LV  0.13um    84 mm2    77 million    1.18 V         12 W
1.2GHz 970    0.13um    121 mm2   58 million    1.1 V          19 W



As you can see, the low-voltage Pentium M and the PowerPC 970 are comparable in power requirements at the same speed on the same process. The Pentium M will never be the floating-point and SIMD monster that the 970 is, but for most desktop and server workloads it could provide a competitive alternative, especially with some of the aforementioned tweaks.

In sum, the Pentium M's future looks bright. There are lots of places that Intel could go with this architecture, and with power consumption having fully arrived as a problem with Prescott, the Pentium M's architecture just might have a brighter future than NetBurst. I look for Intel to begin rolling out non-mobile-oriented variants of the Pentium M across a variety of market niches, from small-form-factor computers to blade servers. A Pentium M derivative with IA-32e? It could certainly happen.

At the very least, the Pentium M represents a solid Plan B in case Prescott goes up in a puff of smoke at some point, and a solid Plan B is what you'd expect from a company whose motto is "only the paranoid survive."

Bibliography
"Intel Pentium M Processor: Higher Performance, Lower Power," Intel
"Pentium Pro Family Developer's Manual, Volume 1," Intel
Simcha Gochman, et al., "The Intel Pentium M Processor: Microarchitecture and Performance," Intel Technology Journal, vol. 7, issue 2.
Suggested further reading
Anand Lal Shimpi, "Intel's Centrino CPU (Pentium-M): Revolutionizing the Mobile World," AnandTech
Geoff Gasior, "Intel's Pentium M 1.4GHz Processor: More cache, less power," The Tech Report




ID: 114540
Legolas

Joined: 23 Jul 00
Posts: 5
Credit: 3,921,232
RAC: 0
United States
Message 115270 - Posted: 27 May 2005, 1:49:32 UTC

Has anyone posted their results for mobile Athlons? Mine isn't the fastest around, but I only paid $90 for it. I average about 2 hrs 15 min per unit.

AMD Mobile Athlon XP-M 2600 (overclocked to 2600 MHz)
ID: 115270
Profile jimmyhua

Joined: 16 Apr 05
Posts: 97
Credit: 369,588
RAC: 0
Guam
Message 115362 - Posted: 27 May 2005, 10:24:50 UTC

As far as overclocking goes...

I have overclocked an XP2500+ with 333 MHz FSB -> XP3200+ with 400 MHz FSB.

WU completion times are in the 2-hour range using a motherboard with the nForce2 chipset.

Changing ONLY the motherboard, same RAM, OS and all (from an Asus nForce2 to an ECS KM400), WU completion times changed from 2 hours to 4 hours (doubled)... Yikes!

OC'ing this thing that much, I had to crank core voltages up to 1.6V-1.75V. That is minor; what sucked was that core temperatures also went up 10 degrees Celsius.

Ended up not overclocking (which makes it 20% slower). The change in mobo made a bigger difference than overclocking the CPU... Geez.
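
(For reference, that FSB jump works out to the 20% he mentions; a quick check in C, assuming the Barton core's stock 11x multiplier, which is my assumption rather than something stated in the post:)

    #include <stdio.h>

    int main(void)
    {
        /* 333/400 MHz DDR FSB = 166/200 MHz bus clock */
        double stock = 11 * 166.0;   /* XP2500+:        ~1833 MHz */
        double oc    = 11 * 200.0;   /* XP3200+ speeds: ~2200 MHz */
        printf("overclock: +%.0f%%\n", (oc / stock - 1.0) * 100.0);  /* ~20% */
        return 0;
    }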

Jimmy



ID: 115362
Legolas

Joined: 23 Jul 00
Posts: 5
Credit: 3,921,232
RAC: 0
United States
Message 115612 - Posted: 28 May 2005, 1:03:18 UTC - in response to Message 115362.  

As far as overclocking goes...

I have overclocked an XP2500+ with 333 MHz FSB -> XP3200+ with 400 MHz FSB.

WU completion times are in the 2-hour range using a motherboard with the nForce2 chipset.

Changing ONLY the motherboard, same RAM, OS and all (from an Asus nForce2 to an ECS KM400), WU completion times changed from 2 hours to 4 hours (doubled)... Yikes!

OC'ing this thing that much, I had to crank core voltages up to 1.6V-1.75V. That is minor; what sucked was that core temperatures also went up 10 degrees Celsius.

Ended up not overclocking (which makes it 20% slower). The change in mobo made a bigger difference than overclocking the CPU... Geez.

Jimmy



I'm not sure I understand why you are 20% slower when overclocking. I agree that the mobo makes a big difference, but in the case of my system I was able to achieve a significant boost in performance: from 2000 MHz stock to a 2600 MHz Prime95-stable overclock. Heat is definitely an issue. I'd be lying if I said my computer was quiet; with all the fans I have, it sounds like a jet engine at times. But as long as you can keep your temps under control and make sure your system is stable, then it's a cost-effective way to squeeze extra performance from your system.

ID: 115612
Profile jimmyhua

Joined: 16 Apr 05
Posts: 97
Credit: 369,588
RAC: 0
Guam
Message 115624 - Posted: 28 May 2005, 2:06:12 UTC
Last modified: 28 May 2005, 2:06:43 UTC

Sorry, my wording was a bit unclear.

Yes, overclocking by 20% did make the machine run faster by 20%. But it also made it run hotter by about 33%, and the voltage was cranked up by 5%. Overall power consumption also went up by 33-40%.

I think the gains were not linear (compared to the losses), and so I went back to stock settings.
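
Those numbers line up with the usual dynamic-power rule of thumb, P proportional to C*V^2*f (the relation is the standard approximation, not something from the post):

    #include <stdio.h>

    int main(void)
    {
        double f_ratio = 1.20;   /* +20% clock        */
        double v_ratio = 1.05;   /* +5%  core voltage */
        double p_ratio = f_ratio * v_ratio * v_ratio;    /* P ~ C * V^2 * f */
        printf("dynamic power ratio: %.2f\n", p_ratio);  /* ~1.32 */
        return 0;
    }

    /* Roughly +32% dynamic power; extra leakage at the higher voltage and
       temperature plausibly accounts for the rest of the observed 33-40%. */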

The biggest difference in performance came from different brand/chipset motherboards. So if you don't have a good motherboard to begin with, that's where I'd start, is what I was trying to say.

Jimmy

ID: 115624
Legolas

Joined: 23 Jul 00
Posts: 5
Credit: 3,921,232
RAC: 0
United States
Message 115679 - Posted: 28 May 2005, 7:15:12 UTC - in response to Message 115624.  

Sorry, my wording was a bit unclear.

Yes, overclocking by 20% did make the machine run faster by 20%. But it also made it run hotter by about 33%, and the voltage was cranked up by 5%. Overall power consumption also went up by 33-40%.

I think the gains were not linear (compared to the losses), and so I went back to stock settings.

The biggest difference in performance came from different brand/chipset motherboards. So if you don't have a good motherboard to begin with, that's where I'd start, is what I was trying to say.

Jimmy

Jimmy, I couldn't agree with you more; a good motherboard is crucial. For me personally, I'm not all that concerned about overall power consumption; I have too many other things in my home that consume a lot of power. I would agree that the gains are not linear, but the losses you mention don't seem all that bad to me. I look at some of the other processors mentioned in these forums, most of which cost more than mine, and I can't help but wonder. It seems like a good trade to me: I paid less than $100 for a processor that crunches a SETI unit in 2 to 2.5 hours. Still, you make several valid points. I guess all I'm really trying to say is mobile processors have their advantages. :)
ID: 115679
Riley

Joined: 30 Jan 00
Posts: 30
Credit: 6,362,824
RAC: 0
United States
Message 117412 - Posted: 2 Jun 2005, 3:22:48 UTC

Here come the Pentium M XPCs.

AOpen and DFI have had their desktop Pentium M stuff out for a while, but I have always liked the XPCs.

http://www.tomshardware.com/business/20050601/computex_day_2-13.html
ID: 117412
ewitte2

Joined: 20 Mar 05
Posts: 17
Credit: 48,588
RAC: 0
United States
Message 117568 - Posted: 2 Jun 2005, 15:02:57 UTC - in response to Message 103847.  
Last modified: 2 Jun 2005, 15:05:32 UTC



*** Results:

Pentium M is the fastest at crunching WUs, with a turnaround time of between 2 and 2.5 hours.

AMD Sempron XP 2500+ is the 2nd fastest, with a turnaround time of between 4 and 6 hours.

Pentium 4 (with dual processors) takes 4 hours (only did 1 WU so far).

Duron 800 takes 8+ hours.


Nearly all of my normal WUs take between 2800 and 3700 seconds. That's between 46 and 62 minutes each. Of course, that's running an optimized client on a 2.8 GHz A64 San Diego (1MB cache). I'd still expect it to be way below 2 hours with the normal client, but I have not tried :) That's still a very good time for the normal client and a CPU running at 1.6 GHz! Next up is testing on an X2 4400. That should be pulling nearly 2 WUs/hour on average.

Eric
ID: 117568
Profile jeffusa
Joined: 21 Aug 02
Posts: 224
Credit: 1,809,275
RAC: 0
United States
Message 117869 - Posted: 3 Jun 2005, 2:02:38 UTC

I installed BOINC on a Xeon system today that has Hyper-Threading. I honestly don't think it will be much faster than a normal P4, but I would like to see how it does. Also, somebody mentioned turning off Hyper-Threading. How do you do that?

Also, do you think tweaking the server to run BOINC faster could result in speed gains in, say, running SQL Server? Just a thought...
ID: 117869
Profile Paul D. Buck
Volunteer tester

Joined: 19 Jul 00
Posts: 3898
Credit: 1,158,042
RAC: 0
United States
Message 117937 - Posted: 3 Jun 2005, 3:51:43 UTC - in response to Message 117869.  

I installed BOINC on a Xeon system today that has Hyper-Threading. I honestly don't think it will be much faster than a normal P4, but I would like to see how it does. Also, somebody mentioned turning off Hyper-Threading. How do you do that?

Also, do you think tweaking the server to run BOINC faster could result in speed gains in, say, running SQL Server? Just a thought...


You turn Hyper-Threading off, or on, in the BIOS. As the system is booting, enter the BIOS by pressing the appropriate key; it is usually DEL, F1, or F2. If the correct key is not displayed as you boot up, look in the documentation. ...

There is no way that the project is going to change database engines from an open-source "free" one to a licensed version. It is also highly unlikely that doing so would make a significant change in the speed of the system. Based on my limited use of MySQL (which is not really all that limited at this time, as I have been running tests and storing data of my own), it is no slower than Oracle (my usual first choice).
ID: 117937
Profile jeffusa
Joined: 21 Aug 02
Posts: 224
Credit: 1,809,275
RAC: 0
United States
Message 118108 - Posted: 3 Jun 2005, 15:38:37 UTC - in response to Message 117937.  

I installed BOINC on a Xeon system today that has Hyper-Threading. I honestly don't think it will be much faster than a normal P4, but I would like to see how it does. Also, somebody mentioned turning off Hyper-Threading. How do you do that?

Also, do you think tweaking the server to run BOINC faster could result in speed gains in, say, running SQL Server? Just a thought...


There is no way that the project is going to change database engines from an open-source "free" one to a licensed version. It is also highly unlikely that doing so would make a significant change in the speed of the system. Based on my limited use of MySQL (which is not really all that limited at this time, as I have been running tests and storing data of my own), it is no slower than Oracle (my usual first choice).


Maybe I didn't explain it clearly enough. What I was trying to say is: if we optimize our computers for BOINC, do you think other applications, such as SQL Server, will run faster?
ID: 118108
Profile jeffusa
Joined: 21 Aug 02
Posts: 224
Credit: 1,809,275
RAC: 0
United States
Message 118111 - Posted: 3 Jun 2005, 15:45:35 UTC - in response to Message 117869.  

I installed BOINC on a Xeon system today that has Hyper-Threading. I honestly don't think it will be much faster than a normal P4, but I would like to see how it does. Also, somebody mentioned turning off Hyper-Threading. How do you do that?

Also, do you think tweaking the server to run BOINC faster could result in speed gains in, say, running SQL Server? Just a thought...


After running the Xeon server for a day, it is processing units in around 8,000 seconds on average. Not bad compared to my regular 2.8 GHz P4, which does units in around 12,000 seconds. However, that may not be a fair comparison, since the Xeon server has Hyper-Threading and my P4 desktop does not.

I wonder if anyone out there has a 2.8 GHz P4 system with Hyper-Threading, so we can compare the chips better?

ID: 118111
Profile Paul D. Buck
Volunteer tester

Joined: 19 Jul 00
Posts: 3898
Credit: 1,158,042
RAC: 0
United States
Message 118129 - Posted: 3 Jun 2005, 16:35:48 UTC - in response to Message 118108.  

Maybe I didn't explain it clearly enough. What I was trying to say is: if we optimize our computers for BOINC, do you think other applications, such as SQL Server, will run faster?


Ah, got it.

In answer to that question ... no ... :)

BOINC and most of the projects that run on it are CPU bound. Meaning that to increase the speed substantially, the best way is to improve the CPU speed of the system. Things like memory bandwidth do play a part, but in general the best "bang" for the "buck" is to increase CPU speed (and/or cache size/speed).

Database systems are I/O bound. Meaning that the limiting factor is the speed of reading and writing to/from memory (logical I/O, as Oracle calls it) and to/from the disk drives (physical I/O). So there you may see the CPU cruising along at 50-80% utilization while the system is as fast as it can get...

Faster disk drives, RAID with striping, and a faster connection type (SCSI vs. IDE/ATA) will speed up a database system.

In essence, the two are almost mutually exclusive as far as return on the dollar. Though there is overlap ...
ID: 118129