Stefano Tommesani

  • Increase font size
  • Default font size
  • Decrease font size
Home SIMD Intel Pentium III

Intel Pentium III

The Intel P6 core, introduced with the Pentium Pro processor and used in all current Intel processors, features a RISC-like microarchitecture and an out-of-order execution unit, representing a radical shift from previous designs. 

The P6's new dynamic execution micro-architecture removes the constraint of linear instruction sequencing between the traditional fetch and execute phases. An instruction buffer opens a wide window on the instructions that are not executed yet, allowing the execute phase of the processor to have much more visibility into the instruction stream so that a better scheduling policy may be adopted. Optimal scheduling requires the execute phase to be replaced by decoupled dispatch/execute and retire phases, so that instructions can start in any order that satisfies dependency bounds, but must be completed and therefore retired in the original order. This approach greatly increases performance as it more fully utilizes the resources of the processor core.

The P6 core executes x86 instructions by breaking them into simpler micro-instructions called micro-ops. This task is performed by three parallel decoders in the D1 stage of the pipeline: the first decoder is capable of decoding one x86 instruction of four or fewer µops in each clock cycle, while the other two decoders can each decode an x86 instruction of one µop in each clock cycle. Once the µops are decoded, they will be issued from the in-order front-end into the Reservation Station (RS), which is the beginning stage of the out-of-order core. In the RS, the µops wait until their data operands are available; once a µop has all data sources available, it will be dispatched from the RS to an execution unit. Once the µop has been executed it returns to the ReOrder Buffer and waits for retirement. In this stage, all data values are written back to memory and all µops are retired in-order, three at a time. The P6 core can schedule at a peak rate of 5 micro-ops per clock, one to each resource port, but a sustained rate of 3 micro-ops per clock is more typical.
Optimizing code for the P6 core is strikingly different than on previous processors, such as the Pentium, that featured in-order execution. The developer has no control over the sequence of execution, but the goal is maximizing the efficiency of both the decoders and the execution units. 
Pushing the decoding bandwidth to the limit means scheduling instructions with a 4-1-1 pattern, where these numbers refer to the count of micro-ops generated by each instruction. When working with MMX instructions, all opcodes require only 1 micro-op except for computations that have as source operand a memory reference, and writes to memory. The MMX register set contains only 8 registers, therefore there are many instructions that use a memory reference as source operand, and the fact that this kind of instruction can only by translated by decoder 0 leads to stalls in this stage of the pipeline. The only method for relieving this problem is choosing a smart register allocation strategy that minimizes the number of memory references.

The effective usage of the execution units is even more troublesome. There are five execution units on the P6 core, and each performs a well-defined set of operations: scheduling a large bulk of instructions of the same kind will overcharge the required execution unit that will impose long latencies, while all other execution units remain idle. The key to fast performance is obtaining from the decoders a balanced stream of micro-ops that evenly exploits all execution units, and this often means that loops must be rearranged as most of them expose a great locality (i.e. loads from memory at the beginning, computations in the middle and stores to memory at the end).
Another key technique is minimizing dependency bounds among micro-ops, so that they do not stall often waiting for data operands: the easiest way to maximize the Instruction Level Parallelism (ILP) is unrolling loops and scheduling two or more computing threads together. While this is hardly a novel technique, actually implementing it is really complex due to the limited number of MMX registers available, and a clever register allocation strategy is mandatory.
It is therefore evident that writing high-performance MMX code requires much more that the knowledge of the instruction set: the developer should have a solid background on both traditional compiler designs to devise an effective register allocation strategy, and on the microarchitectures of current processors to avoid pitfalls in the hand-scheduled code.
Quexal implements an optimizing compiler that exploits all these techniques. The source listing is re-arranged to maximize the Instruction Level Parallelism (ILP), then the instructions are scheduled so that:
1. they satisfy the 4-1-1 pattern to fully use all decoders;
2. the resulting stream of micro-ops is balanced and makes effective usage of available hardware resources;
3. the number of required registers does not exceed that of MMX registers.
The Quexal compiler outputs high-quality code that matches the speed of hand-optimized samples. Performance benchmarks show that produced code usually makes optimal usage of the decoders and achieves a typical 3 micro-ops per cycle rate, without introducing excessive register spilling to memory.

Quote this article on your site

To create link towards this article on your website,
copy and paste the text below in your page.

Preview :

Intel Pentium III
Tuesday, 25 April 2000

Powered by QuoteThis © 2008
View Stefano Tommesani's profile on LinkedIn

Latest Articles

Fixing Git pull errors in SourceTree 10 April 2017, 01.44 Software
Fixing Git pull errors in SourceTree
If you encounter the following error when pulling a repository in SourceTree: VirtualAlloc pointer is null, Win32 error 487 it is due to to the Cygwin system failing to allocate a 5 MB large chunk of memory for its heap at
Castle on the hill of crappy audio quality 19 March 2017, 01.53 Audio
Castle on the hill of crappy audio quality
As the yearly dynamic range day is close (March 31st), let's have a look at one of the biggest audio massacres of the year, Ed Sheeran's "Castle on the hill". First time I heard the song, I thought my headphones just got
Necessary evil: testing private methods 29 January 2017, 21.41 Testing
Necessary evil: testing private methods
Some might say that testing private methods should be avoided because it means not testing the contract, that is the interface implemented by the class, but the internal implementation of the class itself. Still, not all
I am right and you are wrong 28 December 2016, 14.23 Web
I am right and you are wrong
Have you ever convinced anyone that disagreed with you about a deeply held belief? Better yet, have you changed your mind lately on an important topic after discussing with someone else that did not share your point of
How Commercial Insight changes R&D 06 November 2016, 01.21 Web
How Commercial Insight changes R&D
The CEB's Commercial Insight is based on three pillars: Be credible/relevant – Demonstrate an understanding of the customer’s world, substantiating claims with real-world evidence. Be frame-breaking – Disrupt the