MMX Primer

Saturday, 24 April 2010 14:21 Stefano Tommesani

The MMX technology is designed to accelerate multimedia and communications applications by including new instructions and data types that allow applications to achieve a new level of performance. It exploits the parallelism inherent in many multimedia and communications algorithms, yet maintains full compatibility with existing operating systems and applications. 
A wide range of software applications, including graphics, MPEG video, music synthesis, speech compression and recognition, image processing, games, video conferencing and more, share several fundamental characteristics: 

The MMX technology is designed as a set of general purpose integer instructions that can be applied to the needs of the wide diversity of multimedia and communications applications. The highlights of the technology are:

MMX technology introduces four new data types: three packed data types (bytes, words and doublewords, whose elements are 8, 16 and 32 bits wide respectively) and a new 64-bit entity. Each element within the packed data types is an independent fixed-point integer. The architecture does not specify the position of the fixed point within the elements: it is up to the developer to track and control that position in each element throughout the calculation. This places a burden on the developer, but it also leaves a great deal of flexibility to choose and change the precision of fixed-point numbers during the course of the application, in order to fully control the dynamic range of values.
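As a minimal sketch of what tracking the fixed point means in practice, the portable C fragment below treats a 16-bit element as a number with 8 fractional bits. The Q8.8 format, the helper names and the sample values are assumptions made for illustration; MMX itself prescribes none of them and only ever sees plain integers.

```c
#include <stdint.h>

/* Assumed format for this sketch: Q8.8, i.e. 8 integer bits and
   8 fractional bits stored in one signed 16-bit element. */
#define FRAC_BITS 8

/* Convert a small real value to Q8.8 (hypothetical helper). */
static int16_t to_q8_8(double x)
{
    return (int16_t)(x * (1 << FRAC_BITS));
}

/* Multiply two Q8.8 values. The raw product carries 16 fractional
   bits, so the developer has to shift it back to keep the point
   where the rest of the code expects it. */
static int16_t mul_q8_8(int16_t a, int16_t b)
{
    int32_t wide = (int32_t)a * (int32_t)b;   /* 16 fractional bits */
    return (int16_t)(wide >> FRAC_BITS);      /* back to Q8.8 */
}
```

With these helpers, 1.5 is stored as 384 and 2.25 as 576; their product comes back as 864, which is 3.375 in Q8.8. Choosing a different FRAC_BITS trades range for precision, which is exactly the freedom (and the bookkeeping) left to the developer.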
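The four MMX technology data types are: packed byte (eight 8-bit elements), packed word (four 16-bit elements), packed doubleword (two 32-bit elements) and quadword (a single 64-bit element).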

Figure: SIMD addition

As an example, graphics pixel data are generally represented in 8-bit integers, or bytes. With MMX technology, eight of these pixels are packed together in a 64-bit quantity and moved into an MMX register; when an MMX instruction executes, it takes all eight of the pixel values at once from the MMX register, performs the arithmetic or logical operation on all eight elements in parallel, and writes the result into an MMX register. The degree of parallelism that can be achieved with the MMX technology depends on the size of data, ranging from 8 when using 8-bit data to 1, i.e. no parallelism, when using 64-bit data.
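The idea of operating on eight packed pixels at once can be modeled in portable C without any MMX hardware. The sketch below is an emulation, not MMX code: it keeps eight 8-bit lanes in one 64-bit integer and adds them with wraparound so that no carry leaks from one lane into the next. The function name and the masking trick are illustrative choices, not taken from the original article.

```c
#include <stdint.h>

/* Portable model of a packed byte addition with wraparound:
   eight 8-bit lanes held in one 64-bit value are added at once,
   with no carry crossing from one lane into the next. */
static uint64_t packed_add_bytes(uint64_t a, uint64_t b)
{
    const uint64_t low7 = 0x7F7F7F7F7F7F7F7FULL; /* low 7 bits of each lane */
    const uint64_t top  = 0x8080808080808080ULL; /* top bit of each lane */

    /* Add the low 7 bits of every lane; a carry can reach bit 7 of
       the same lane but never the neighbouring lane. */
    uint64_t sum = (a & low7) + (b & low7);

    /* Fold in the top bits with XOR, which discards the carry out
       of each lane. */
    return sum ^ ((a ^ b) & top);
}
```

For instance, adding 0x01 to a lane holding 0xFF yields 0x00 in that lane and leaves the lane above it untouched, which is the behaviour a wraparound packed addition is expected to have.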
Figure: Aliasing of MMX over FP

The MMX technology is integrated into the Intel x86 architecture in a way that maintains full compatibility with existing operating systems. This is obtained by aliasing the MMX registers and state onto the x86 floating-point registers and state. No new registers or states are added to support MMX technology, so the operating system uses its standard mechanisms for handling the floating-point state to save and restore the MMX state as well: floating-point instructions that save/restore the floating-point state also handle the MMX state (for example, during context switching).
Aliasing the MMX state onto the floating-point state does not prevent an application from executing both MMX routines and floating-point routines, but the developer cannot freely interleave MMX and floating-point instructions: an EMMS instruction must be executed at the end of an MMX code sequence, before any floating-point instruction runs.
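The rule about EMMS can be sketched in C with compiler intrinsics. The fragment below assumes a compiler that provides `<mmintrin.h>` and defines `__MMX__` (GCC and Clang do so on x86 targets); on any other target it falls back to a scalar loop so that it still compiles. The function name is hypothetical; `_mm_add_pi8` and `_mm_empty` are the intrinsics that correspond to PADDB and EMMS.

```c
#include <stdint.h>
#include <string.h>
#if defined(__MMX__)
#include <mmintrin.h>
#endif

/* Adds eight packed bytes with wraparound and leaves the processor
   in a state where floating-point code can run safely afterwards. */
static uint64_t add_bytes_then_release(uint64_t a, uint64_t b)
{
#if defined(__MMX__)
    __m64 va, vb, vr;
    uint64_t r;
    memcpy(&va, &a, sizeof va);
    memcpy(&vb, &b, sizeof vb);
    vr = _mm_add_pi8(va, vb);   /* PADDB: eight byte additions */
    memcpy(&r, &vr, sizeof r);
    _mm_empty();                /* EMMS: end of the MMX sequence */
    return r;
#else
    /* Scalar fallback for targets without MMX (assumption of this
       sketch, so the example builds everywhere). */
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t x = (uint8_t)(a >> (8 * i));
        uint8_t y = (uint8_t)(b >> (8 * i));
        r |= (uint64_t)(uint8_t)(x + y) << (8 * i);
    }
    return r;
#endif
}
```

Because the function executes EMMS before returning, a caller can go straight on to floating-point arithmetic. On 64-bit builds, where floating-point math normally uses SSE registers, the call is mostly a formality, but on 32-bit x87 code it is what keeps the floating-point stack usable.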


2. Instruction set

The MMX instructions cover several functional areas including: 

Arithmetic, comparison and shift instructions are designed to support the different packed integer data types: these instructions have a different opcode for each supported data type. As a result, the MMX technology instructions are implemented with 57 opcodes.
All MMX instructions, except the EMMS instruction, reference and operate on two operands: the source and the destination operand. The first operand is the destination and the second operand is the source. The instruction overwrites the destination operand with the result. For example, a two-operand instruction 


would be decoded as:


A typical MMX instruction has this syntax: 

As an example, PADDSB is an MMX instruction (P) that adds (ADD) the eight signed bytes (B) of the source and destination operands with signed saturation (S): a result that would overflow is clamped to the largest or smallest representable value instead of wrapping around.
Instructions whose input and output data elements differ have two data-type suffixes: a conversion instruction, for example, converts from one data type to another, so it carries one suffix for the original data type and a second for the converted data type.
The next pages describe in depth the full set of MMX instructions, grouped by functional areas. The box on the right side represents the syntax of that instruction; here is a list of the symbols used to represent operands in the instruction statements: 
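To make the behaviour of PADDSB concrete, here is a scalar C model of it. This is a sketch of the instruction's arithmetic, not MMX code; the function names are hypothetical and the arrays stand in for the two 64-bit operands.

```c
#include <stdint.h>

/* Clamp a wide intermediate result to the signed 8-bit range. */
static int8_t saturate_s8(int v)
{
    if (v > INT8_MAX) return INT8_MAX;
    if (v < INT8_MIN) return INT8_MIN;
    return (int8_t)v;
}

/* Scalar model of PADDSB: each of the eight signed bytes of the
   destination is replaced by the saturated sum of itself and the
   corresponding source byte. */
static void paddsb_model(int8_t dst[8], const int8_t src[8])
{
    for (int i = 0; i < 8; i++)
        dst[i] = saturate_s8((int)dst[i] + (int)src[i]);
}
```

With this model, 100 + 100 gives 127 and -100 + -100 gives -128, whereas a wraparound addition would give -56 and 56. Note that the destination array is overwritten, mirroring the two-operand form described above.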
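One instruction that follows this two-suffix pattern is PACKSSWB, which packs signed words (W) into signed bytes (B) with signed saturation (SS). The instruction is not named in the text above and is used here only as an illustration; the C code is a scalar model with hypothetical function names, not MMX code.

```c
#include <stdint.h>

/* Clamp a signed 16-bit word to the signed 8-bit range. */
static int8_t saturate_word_to_byte(int16_t w)
{
    if (w > INT8_MAX) return INT8_MAX;
    if (w < INT8_MIN) return INT8_MIN;
    return (int8_t)w;
}

/* Scalar model of PACKSSWB: the four words of the destination fill
   the low four bytes of the result, the four words of the source
   fill the high four bytes. On real hardware the result replaces
   the destination register; here it is written to 'out' so that
   both inputs stay visible. */
static void packsswb_model(int8_t out[8],
                           const int16_t dst[4],
                           const int16_t src[4])
{
    for (int i = 0; i < 4; i++) {
        out[i]     = saturate_word_to_byte(dst[i]);
        out[i + 4] = saturate_word_to_byte(src[i]);
    }
}
```

Words that already fit in a byte pass through unchanged, while a word such as 300 becomes 127 and -300 becomes -128.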

As an example, 
OP mm, mm/m64
means that the destination operand of the OP instruction is an MMX register, while the source operand can either be an MMX register or a 64-bit memory operand.


3. Examples and benchmarks

The Intel MMX Application Notes offer a wide overview of the benefits achievable by using MMX instructions. All performance data was extracted from Application Notes, and it generally refers to the Pentium MMX microarchitecture.
Before starting to code in assembly for MMX, you should take a look at Quexal, the visual development environment for MMX and ISSE coding that will make your life a lot easier!
Here is a list of currently available Application Notes, grouped by topic. The column on the right shows the speed-up obtained by moving from scalar C code to MMX code, where a figure is available.

    Title Speed-up
    Audio Echo Effects 5.9x
    MPEG1 Audio Kernels
    G.728 Code Book Search 2.7x
    Levinson-Durbin Filter
    Schur-Weiner Filter
    Passband Echo Canceller
    Baseband Echo Canceller
    1/3 T Equalizer
    2/3 T Spaced Equalizer
    DSP Kernels
    Efficient Vector/Matrix Multiply Routine 14.6x
    Matrix Transpose 2x
    Real 16-bit FFT
    Dot Product - 16x16 -> 32 5x
    Real FIR - 16 bit 5x
    Vector Arithmetic and Logic Operations 6x
    High Precision Multiply
    Data Alignment
    Graphics (2D)
    Fractals with MMX Technology 1.5x
    Sprite Overlay
    Graphics (3D)
    Advanced Procedural Texturing 10x
    AGP and 3D Graphics Software
    MMX Technology for 3D Rendering
    3D Bilinear Texture Mapping 7x
    Gouraud Shading
    3D Transform 3.1x
    Image Processing
    YUV12 to RGB Color Conversion
    2X 8-bit Image Scaling 13.5x
    Bilinear Interpolation 3.9x
    Median Filter 3.8x
    Row Filter - 8 bit
    Column Filter
    Alpha Blending 8x
    24 to 16 bit Conversion
    RGB -> YUV > 10x
    Speech Recognition
    Viterbi Decoding 2x
    L1 Distance Measure 3.3x
    L2 Norm Distance Measure 7.3x
    IDCT 2D 8x8 3.5x
    Motion Compensation
    Absolute Difference 5x
    Haar Transform - 2x2 2.2x
    Get Bits 2.4x
    Video Loop Filter 1.9x