Stefano Tommesani


MMX Primer


The MMX technology is designed to accelerate multimedia and communications applications by including new instructions and data types that allow applications to achieve a new level of performance. It exploits the parallelism inherent in many multimedia and communications algorithms, yet maintains full compatibility with existing operating systems and applications. 
A wide range of software applications, including graphics, MPEG video, music synthesis, speech compression and recognition, image processing, games, video conferencing and more, share several fundamental characteristics: 

  • small integer data types (for example: 8-bit pixels, 16-bit audio samples) 
  • small, highly repetitive loops 
  • frequent multiplies and accumulates 
  • compute-intensive algorithms 
  • highly parallel operations 

The MMX technology is designed as a set of general purpose integer instructions that can be applied to the needs of the wide diversity of multimedia and communications applications. The highlights of the technology are:

  • Single Instruction, Multiple Data (SIMD) technique 
  • 57 new instructions 
  • Eight 64-bit MMX registers, named mm0 through mm7
  • 4 new data types 

MMX technology introduces four new data types: three packed data types (bytes, words and doublewords, whose elements are 8, 16 and 32 bits wide respectively) and a new 64-bit entity. Each element within the packed data types is an independent fixed-point integer. The architecture does not specify the position of the fixed point within the elements: the developer decides where it sits in each element and must keep track of it throughout the calculation. This places a burden on the developer, but it also leaves a great deal of flexibility to choose and change the precision of fixed-point numbers during the course of the application, in order to fully control the dynamic range of values.
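To make the fixed-point convention concrete, here is a minimal sketch in portable C. It assumes a Q8.8 layout (8 integer bits, 8 fractional bits) in a 16-bit element; the names `FRAC_BITS`, `to_fixed` and `fixed_mul` are illustrative only and are not part of MMX.

```c
#include <assert.h>
#include <stdint.h>

/* Number of fractional bits: chosen by the programmer, not by the hardware. */
#define FRAC_BITS 8

/* Convert a small integer to Q8.8 (8 integer bits, 8 fractional bits). */
int16_t to_fixed(int value)
{
    assert(value >= -128 && value <= 127);   /* must fit in the integer part */
    return (int16_t)(value * (1 << FRAC_BITS));
}

/* Multiply two Q8.8 values. The 32-bit product carries 16 fractional bits,
   so it is divided by 2^FRAC_BITS to return to Q8.8 (truncating toward zero).
   The caller is responsible for keeping the result within the Q8.8 range. */
int16_t fixed_mul(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;
    return (int16_t)(product / (1 << FRAC_BITS));
}
```

With this layout, 1.5 is stored as 384 and 2.5 as 640; their product comes back as 960, which is 3.75 in Q8.8. Choosing more fractional bits trades range for precision, which is the flexibility described above.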
The four MMX technology data types are: 

  • Packed byte -- 8 bytes packed into one 64-bit quantity 
  • Packed word -- 4 16-bit words packed into one 64-bit quantity 
  • Packed doubleword -- 2 32-bit doublewords packed into one 64-bit quantity 
  • Quadword -- one 64-bit quantity 
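One way to picture the four data types is as four views of the same 64 bits. The sketch below models that in portable C with a union; the type name `mmx_value` and the helper `splat_byte` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Four views of one 64-bit quantity (illustrative model, not an MMX API). */
typedef union {
    uint8_t  b[8];   /* packed byte:       8 x 8-bit  */
    uint16_t w[4];   /* packed word:       4 x 16-bit */
    uint32_t d[2];   /* packed doubleword: 2 x 32-bit */
    uint64_t q;      /* quadword:          1 x 64-bit */
} mmx_value;

/* Fill every byte lane with the same value and return the quadword view. */
uint64_t splat_byte(uint8_t value)
{
    mmx_value v;
    assert(sizeof v == 8);               /* all views must overlap exactly */
    for (int i = 0; i < 8; ++i)
        v.b[i] = value;
    return v.q;
}
```

For example, `splat_byte(0xAB)` yields `0xABABABABABABABAB`: the eight byte lanes and the single quadword are the same storage read two different ways.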

[Figure: SIMD addition]
As an example, graphics pixel data are generally represented as 8-bit integers, or bytes. With MMX technology, eight of these pixels are packed together in a 64-bit quantity and moved into an MMX register; when an MMX instruction executes, it takes all eight pixel values at once from the MMX register, performs the arithmetic or logical operation on all eight elements in parallel, and writes the result into an MMX register. The degree of parallelism that can be achieved with the MMX technology depends on the size of the data, ranging from 8 when using 8-bit data to 1, i.e. no parallelism, when using 64-bit data.
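The difference between adding eight pixels one at a time and adding them in a single packed step can be sketched in portable C. This is an emulation written for illustration, not MMX code: `add_bytes_scalar` and `add_bytes_packed` are hypothetical names, and the packed version uses ordinary 64-bit integer arithmetic with masking so that carries never cross from one byte lane into the next. Both wrap modulo 256 in each lane, as a non-saturating packed byte add does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: one addition per pixel, eight iterations. */
void add_bytes_scalar(uint8_t out[8], const uint8_t a[8], const uint8_t b[8])
{
    assert(out != NULL && a != NULL && b != NULL);
    for (int i = 0; i < 8; ++i)
        out[i] = (uint8_t)(a[i] + b[i]);          /* each lane wraps modulo 256 */
}

/* Packed emulation: all eight byte lanes are added by one 64-bit addition. */
uint64_t add_bytes_packed(uint64_t a, uint64_t b)
{
    const uint64_t top = 0x8080808080808080ULL;   /* top bit of every lane */
    /* Add the low 7 bits of each lane; the sum fits in 8 bits,
       so no carry can spill into the neighbouring lane. */
    uint64_t low = (a & ~top) + (b & ~top);
    /* Fold the top bits of both inputs back in without carrying out. */
    return low ^ ((a ^ b) & top);
}
```

In this sketch the lane 0xFF plus 0x01 gives 0x00 and leaves its neighbours untouched, which is the independence between elements that the paragraph above describes.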
[Figure: Aliasing of MMX over FP]
The MMX technology is integrated into the Intel x86 architecture in a way that maintains full compatibility with existing operating systems. This is achieved by aliasing the MMX registers and state onto the x86 floating-point registers and state. No new registers or state are added to support MMX technology, so the operating system uses its standard mechanisms for handling the floating-point state to save and restore the MMX state: floating-point instructions that save/restore the floating-point state also handle the MMX state (for example, during context switching).
Aliasing the MMX state onto the floating-point state does not prevent an application from using both MMX routines and floating-point routines, but the developer cannot freely interleave MMX and floating-point instructions: an EMMS instruction must be executed at the end of an MMX code sequence, before any floating-point code runs.


2. Instruction set

The MMX instructions cover several functional areas including: 

  • basic arithmetic operations such as add, subtract, multiply, arithmetic shift and multiply-add 
  • comparison operations 
  • conversion instructions to convert between the new data types: pack data together, and unpack from small to larger data types 
  • logical operations such as AND, AND NOT, OR, and XOR 
  • shift operations 
  • data transfer instructions for MMX register-to-register transfers, or 64-bit and 32-bit load/store to memory 
  • state management instruction to handle MMX to floating point transitions
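Of the operations listed above, multiply-add is the one that most directly serves the "frequent multiplies and accumulates" pattern. The following portable C sketch emulates what a packed multiply-add on words (PMADDWD) computes: four signed 16-bit products, with adjacent pairs summed into two 32-bit results. The function names `multiply_add_words` and `dot_product` are illustrative assumptions, and the code is an emulation, not MMX code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One packed multiply-add step: out[0] = a0*b0 + a1*b1, out[1] = a2*b2 + a3*b3. */
void multiply_add_words(int32_t out[2], const int16_t a[4], const int16_t b[4])
{
    assert(out != NULL && a != NULL && b != NULL);
    for (int i = 0; i < 2; ++i) {
        int64_t sum = (int64_t)a[2 * i] * b[2 * i]
                    + (int64_t)a[2 * i + 1] * b[2 * i + 1];
        /* The only sum that does not fit in 32 bits occurs when all four
           inputs of a pair are -32768; the hardware result then wraps to
           0x80000000, which is INT32_MIN. */
        out[i] = (sum > INT32_MAX) ? INT32_MIN : (int32_t)sum;
    }
}

/* 16-bit dot product built from the step above; n must be a multiple of 4.
   The caller must ensure the accumulated total fits in 32 bits. */
int32_t dot_product(const int16_t *x, const int16_t *y, size_t n)
{
    assert(x != NULL && y != NULL && n % 4 == 0);
    int32_t acc[2] = {0, 0};
    for (size_t i = 0; i < n; i += 4) {
        int32_t partial[2];
        multiply_add_words(partial, x + i, y + i);
        acc[0] += partial[0];
        acc[1] += partial[1];
    }
    return acc[0] + acc[1];          /* final horizontal sum of the two lanes */
}
```

For the inputs {1, 2, 3, 4} and {5, 6, 7, 8}, one step produces the pair {17, 53}, and the dot product is their sum, 70. Each loop iteration consumes four elements, which is where the speed-up over a one-element-at-a-time scalar loop comes from.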

Arithmetic, comparison and shift instructions are designed to support the different packed integer data types: these instructions have a different opcode for each supported data type. As a result, the MMX technology instructions are implemented with 57 opcodes.
All MMX instructions, except the EMMS instruction, reference and operate on two operands: the source and the destination operand. The first operand is the destination and the second operand is the source. The instruction overwrites the destination operand with the result. For example, a two-operand instruction 

OP dest, src

would be decoded as:

dest = dest OP src

A typical MMX instruction has this syntax: 

  • Prefix: P for Packed 
  • Instruction operation: for example - ADD, CMP, or XOR 
  • Suffix
    • US for Unsigned Saturation 
    • S for Signed saturation 
    • B, W, D, Q for the data type: packed byte, packed word, packed doubleword, or quadword.

As an example, PADDSB is an MMX instruction (P) that adds (ADD) the eight bytes (B) of the source operand to those of the destination operand and saturates each signed result (S).
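Saturation means that a result too large or too small for its element is clamped to the nearest representable value instead of wrapping around. A minimal emulation in portable C, assuming the lane-by-lane behaviour described above (the function names are illustrative, not MMX identifiers):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Signed saturating add of eight bytes (the behaviour of PADDSB):
   results are clamped to the range -128..127. */
void add_signed_saturate_bytes(int8_t dst[8], const int8_t src[8])
{
    assert(dst != NULL && src != NULL);
    for (int i = 0; i < 8; ++i) {
        int sum = dst[i] + src[i];       /* computed in int, cannot overflow */
        if (sum > INT8_MAX) sum = INT8_MAX;
        if (sum < INT8_MIN) sum = INT8_MIN;
        dst[i] = (int8_t)sum;
    }
}

/* Unsigned saturating add of eight bytes (the US suffix, as in PADDUSB):
   results are clamped to the range 0..255. */
void add_unsigned_saturate_bytes(uint8_t dst[8], const uint8_t src[8])
{
    assert(dst != NULL && src != NULL);
    for (int i = 0; i < 8; ++i) {
        int sum = dst[i] + src[i];
        dst[i] = (uint8_t)(sum > UINT8_MAX ? UINT8_MAX : sum);
    }
}
```

With signed saturation, 100 + 100 yields 127 rather than the wrapped value -56, which is usually the desired behaviour for pixel and audio data.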
Instructions that have different input and output data elements have two data-type suffixes: for example, the conversion instruction converts from one data type to another, so it has two suffixes, one for the original data type and the second for the converted data type.
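PACKSSWB is one such instruction: PACK, SS for signed saturation, then W and B for a conversion from words to bytes. The sketch below emulates it in portable C under the assumption that index 0 is the least significant element; the function name `pack_words_to_bytes` is illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp a signed 16-bit word to the signed 8-bit range. */
static int8_t saturate_word_to_byte(int16_t w)
{
    if (w > INT8_MAX) return INT8_MAX;
    if (w < INT8_MIN) return INT8_MIN;
    return (int8_t)w;
}

/* Emulates PACKSSWB: the four words of the destination fill the low four
   bytes of the result, the four words of the source fill the high four. */
void pack_words_to_bytes(int8_t out[8], const int16_t dst[4], const int16_t src[4])
{
    assert(out != NULL && dst != NULL && src != NULL);
    for (int i = 0; i < 4; ++i) {
        out[i]     = saturate_word_to_byte(dst[i]);
        out[i + 4] = saturate_word_to_byte(src[i]);
    }
}
```

Packing {1, 300, -300, -1} and {127, 128, -128, -129} gives {1, 127, -128, -1, 127, 127, -128, -128}: values that fit are kept, and the rest are clamped.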
The next pages describe the full set of MMX instructions in depth, grouped by functional area. The box on the right side represents the syntax of each instruction; the following symbols are used to represent operands in the instruction statements: 

  • imm8: an immediate byte value; imm8 is a signed number between -128 and +127 inclusive.
  • r/m32: a doubleword register or memory operand used for instructions whose operand-size attribute is 32 bits. 
  • mm/m32: indicates the lowest 32 bits of an MMX register or a 32-bit memory location.
  • mm/m64: indicates a 64-bit MMX register or a 64-bit memory location.

As an example, 
OP mm, mm/m64
means that the destination operand of the OP instruction is an MMX register, while the source operand can either be an MMX register or a 64-bit memory operand.


3. Examples and benchmarks

The Intel MMX Application Notes offer a wide overview of the benefits achievable by using MMX instructions. All performance data was extracted from Application Notes, and it generally refers to the Pentium MMX microarchitecture.
Before starting to code in assembly for MMX, you should take a look at Quexal, the visual development environment for MMX and ISSE coding that will make your life a lot easier!
Here is a list of currently available Application Notes, grouped by topic. The right-hand column shows the speed-up obtained by moving from scalar C code to MMX code; entries without a figure have no speed-up listed.

    Title                                     Speed-up
    Audio Echo Effects                        5.9x
    MPEG1 Audio Kernels
    G.728 Code Book Search                    2.7x
    Levinson-Durbin Filter
    Schur-Wiener Filter
    Passband Echo Canceller
    Baseband Echo Canceller
    1/3 T Equalizer
    2/3 T Spaced Equalizer
    DSP Kernels
    Efficient Vector/Matrix Multiply Routine  14.6x
    Matrix Transpose                          2x
    Real 16-bit FFT
    Dot Product - 16x16 -> 32                 5x
    Real FIR - 16 bit                         5x
    Vector Arithmetic and Logic Operations    6x
    High Precision Multiply
    Data Alignment
    Graphics (2D)
    Fractals with MMX Technology              1.5x
    Sprite Overlay
    Graphics (3D)
    Advanced Procedural Texturing             10x
    AGP and 3D Graphics Software
    MMX Technology for 3D Rendering
    3D Bilinear Texture Mapping               7x
    Gouraud Shading
    3D Transform                              3.1x
    Image Processing
    YUV12 to RGB Color Conversion
    2X 8-bit Image Scaling                    13.5x
    Bilinear Interpolation                    3.9x
    Median Filter                             3.8x
    Row Filter - 8 bit
    Column Filter
    Alpha Blending                            8x
    24 to 16 bit Conversion
    RGB -> YUV                                > 10x
    Speech Recognition
    Viterbi Decoding                          2x
    L1 Distance Measure                       3.3x
    L2 Norm Distance Measure                  7.3x
    IDCT 2D 8x8                               3.5x
    Motion Compensation
    Absolute Difference                       5x
    Haar Transform - 2x2                      2.2x
    Get Bits                                  2.4x
    Video Loop Filter                         1.9x


