SIMD on x64/x86

MMX Primer: Packed Integer SIMD on Early x86 CPUs

MMX was the first widely adopted SIMD instruction set on x86 processors.

It was introduced by Intel in the Pentium MMX generation and later supported by AMD and other x86-compatible processors. At the time, it was a major step forward for multimedia and communications software because it allowed one instruction to process multiple small integer values at once.

MMX was designed for workloads such as:

  • image processing;
  • video decoding;
  • audio processing;
  • speech compression;
  • games;
  • 2D and 3D graphics support code;
  • signal processing;
  • video conferencing;
  • data conversion and filtering.

These workloads often have the same basic structure: many small values, repeated operations, tight loops, and a high degree of data parallelism.

MMX gave x86 programmers a way to exploit that parallelism directly.

Today, MMX is mostly a legacy instruction set. For new code, SSE2 or later is almost always a better choice. But MMX is still important to understand because it was the foundation of x86 SIMD programming and many old multimedia libraries, codecs, image filters, and game engines used it heavily.

Core Idea

MMX is based on SIMD:

Single Instruction, Multiple Data

Instead of processing one value at a time, an MMX instruction processes several packed integer values inside one 64-bit register.

For example, one MMX register can hold eight 8-bit pixels:

[ p7 ][ p6 ][ p5 ][ p4 ][ p3 ][ p2 ][ p1 ][ p0 ]

A single packed add instruction can add eight pixel values at once.

That was the key advantage of MMX: it allowed small integer operations to be performed in parallel.

What MMX Added

MMX added:

  • 57 new instructions;
  • eight 64-bit MMX registers;
  • packed integer data types;
  • saturating arithmetic;
  • packed comparison instructions;
  • packing and unpacking operations;
  • logical operations;
  • packed shifts;
  • a new SIMD programming model for x86.

The MMX registers are named:

mm0 mm1 mm2 mm3 mm4 mm5 mm6 mm7

Each register is 64 bits wide.

In C and C++ intrinsics, MMX values are represented using the __m64 type:

#include <mmintrin.h>

__m64 value;

The original MMX intrinsic header is:

#include <mmintrin.h>

MMX Data Types

MMX does not define one single data type. Instead, each 64-bit MMX register can be interpreted in several ways.

MMX packed typeElements per 64-bit registerElement size
Packed byte88-bit
Packed word416-bit
Packed doubleword232-bit
Quadword164-bit

The same physical 64-bit register can be interpreted differently depending on the instruction.

For example:

64-bit register as bytes:

[ b7 ][ b6 ][ b5 ][ b4 ][ b3 ][ b2 ][ b1 ][ b0 ]

64-bit register as words:

[ w3 ][ w2 ][ w1 ][ w0 ]

64-bit register as doublewords:

[ d1 ][ d0 ]

64-bit register as one quadword:

[ q0 ]

MMX itself does not track the data type stored in a register. The instruction determines how the bits are interpreted.

That is still true in modern SIMD programming. A vector register contains bits. The instruction gives meaning to those bits.

Integer and Fixed-Point Only

MMX is an integer SIMD instruction set.

It does not provide floating-point SIMD operations. That came later through AMD 3DNow! and Intel SSE.

MMX can be used for fixed-point arithmetic, but the processor does not know where the fixed point is. The programmer decides how to interpret the integer values.

For example, a 16-bit word might represent:

1234        // ordinary integer
12.34       // fixed-point value with two decimal digits
0.3765      // normalized fixed-point value

The CPU does not know the difference. It simply performs integer operations.

This gives the programmer flexibility, but also responsibility. You must control scaling, rounding, overflow, and precision yourself.

Why MMX Was Useful

MMX was designed for the kind of data commonly found in multimedia code.

Image Processing

Pixels are often stored as 8-bit channels:

red   = 0..255
green = 0..255
blue  = 0..255
alpha = 0..255

One MMX register can process eight 8-bit values at a time.

That is useful for:

  • brightness changes;
  • contrast adjustment;
  • alpha blending;
  • grayscale conversion;
  • thresholding;
  • masks;
  • simple filters.

Audio Processing

Audio samples are often stored as 16-bit integers.

One MMX register can hold four 16-bit samples.

That is useful for:

  • mixing;
  • scaling;
  • clipping;
  • filtering;
  • format conversion.

Video Processing

Video codecs perform many repeated operations on blocks of pixels.

MMX was useful for:

  • motion compensation;
  • block difference calculations;
  • interpolation;
  • color-space conversion;
  • quantization support;
  • pixel averaging.

Later instruction sets added much better video-processing instructions, but MMX was the first mainstream x86 SIMD step in that direction.

Basic MMX Example

The following example adds eight unsigned bytes using MMX wraparound arithmetic.

#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <mmintrin.h>

void add_u8_mmx(uint8_t* dst, const uint8_t* a, const uint8_t* b, size_t count)
{
    size_t i = 0;

    for (; i + 8 <= count; i += 8)
    {
        __m64 va;
        __m64 vb;

        memcpy(&va, a + i, sizeof(va));
        memcpy(&vb, b + i, sizeof(vb));

        __m64 vr = _mm_add_pi8(va, vb);

        memcpy(dst + i, &vr, sizeof(vr));
    }

    _mm_empty();

    for (; i < count; ++i)
    {
        dst[i] = (uint8_t)(a[i] + b[i]);
    }
}

The vector loop processes eight bytes per iteration.

The scalar loop at the end handles the remaining bytes when count is not a multiple of eight.

The call to _mm_empty() is important. It clears the MMX state before the function returns.

Why _mm_empty() Is Required

MMX has an important historical design issue:

MMX registers are aliased on top of the x87 floating-point register stack.

The MMX registers mm0 through mm7 reuse the same physical register storage as the x87 FPU registers.

Because of this, after executing MMX instructions, the x87 floating-point unit may think its register stack is full or in use.

To return the processor to a clean floating-point state, you must execute the EMMS instruction.

In C and C++, this is done with:

_mm_empty();

A typical MMX function should end like this:

void mmx_function(void)
{
    // MMX work here.

    _mm_empty();
}

Do not call _mm_empty() before every MMX instruction. It is a cleanup instruction, not an initialization instruction.

Use it once after the MMX block is finished, before returning to code that may use x87 floating-point instructions.

MMX and the x87 Problem

The x87 FPU uses a stack-like register model.

MMX reused that register storage to avoid adding a completely separate architectural register file. That made MMX easier to integrate into existing CPUs and operating systems, but it created a programming inconvenience.

After MMX code runs, x87 floating-point code may fail or behave incorrectly unless EMMS is executed first.

This is one of the reasons MMX was eventually replaced by SSE2 for integer SIMD code.

SSE2 uses XMM registers, which are independent from the x87 stack. Pure SSE2 code does not require _mm_empty().

Packed Arithmetic Instructions

MMX includes packed arithmetic instructions for bytes, words, and doublewords.

Common examples include:

InstructionIntrinsicMeaning
PADDB_mm_add_pi8add packed 8-bit integers
PADDW_mm_add_pi16add packed 16-bit integers
PADDD_mm_add_pi32add packed 32-bit integers
PSUBB_mm_sub_pi8subtract packed 8-bit integers
PSUBW_mm_sub_pi16subtract packed 16-bit integers
PSUBD_mm_sub_pi32subtract packed 32-bit integers
PMULLW_mm_mullo_pi16multiply packed 16-bit integers, keep low half
PMULHW_mm_mulhi_pi16multiply packed signed 16-bit integers, keep high half

The basic packed add and subtract instructions wrap around on overflow.

For example, with 8-bit unsigned arithmetic:

250 + 20 = 14

because the result wraps modulo 256.

For multimedia programming, wraparound is often not what you want. That is why MMX also provides saturating arithmetic.

Saturating Arithmetic

Saturating arithmetic clamps results to the representable range instead of wrapping around.

For unsigned 8-bit values:

250 + 20 = 255

instead of:

250 + 20 = 14

This is very useful for pixels, audio samples, and other bounded media values.

MMX provides signed and unsigned saturating add/subtract instructions.

InstructionIntrinsicMeaning
PADDSB_mm_adds_pi8signed saturating add, 8-bit
PADDSW_mm_adds_pi16signed saturating add, 16-bit
PADDUSB_mm_adds_pu8unsigned saturating add, 8-bit
PADDUSW_mm_adds_pu16unsigned saturating add, 16-bit
PSUBSB_mm_subs_pi8signed saturating subtract, 8-bit
PSUBSW_mm_subs_pi16signed saturating subtract, 16-bit
PSUBUSB_mm_subs_pu8unsigned saturating subtract, 8-bit
PSUBUSW_mm_subs_pu16unsigned saturating subtract, 16-bit

Example: Brightness Increase With Saturation

This example increases image brightness for an array of 8-bit grayscale pixels.

#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <mmintrin.h>

void brighten_u8_mmx(uint8_t* pixels, uint8_t amount, size_t count)
{
    size_t i = 0;

    __m64 increment = _mm_set1_pi8((char)amount);

    for (; i + 8 <= count; i += 8)
    {
        __m64 p;

        memcpy(&p, pixels + i, sizeof(p));

        __m64 r = _mm_adds_pu8(p, increment);

        memcpy(pixels + i, &r, sizeof(r));
    }

    _mm_empty();

    for (; i < count; ++i)
    {
        unsigned int value = pixels[i] + amount;
        pixels[i] = (value > 255) ? 255 : (uint8_t)value;
    }
}

This is the kind of operation MMX was designed for: small data elements, simple repeated computation, and obvious data parallelism.

Packed Comparison Instructions

MMX includes packed comparison instructions.

These compare elements lane by lane and produce a mask result.

Common examples:

InstructionIntrinsicMeaning
PCMPEQB_mm_cmpeq_pi8compare packed bytes for equality
PCMPEQW_mm_cmpeq_pi16compare packed words for equality
PCMPEQD_mm_cmpeq_pi32compare packed doublewords for equality
PCMPGTB_mm_cmpgt_pi8compare signed bytes greater-than
PCMPGTW_mm_cmpgt_pi16compare signed words greater-than
PCMPGTD_mm_cmpgt_pi32compare signed doublewords greater-than

The result of a comparison is not 0 or 1 per element.

Instead, each element becomes either all zero bits or all one bits.

For example, for bytes:

false -> 0x00
true  -> 0xFF

This makes comparison results useful as masks for later logical operations.

Logical Instructions

MMX includes bitwise logical operations on 64-bit values.

InstructionIntrinsicMeaning
PAND_mm_and_si64bitwise AND
PANDN_mm_andnot_si64bitwise AND NOT
POR_mm_or_si64bitwise OR
PXOR_mm_xor_si64bitwise XOR

These instructions are commonly used for:

  • masks;
  • blending;
  • clearing bits;
  • selecting values;
  • building constants;
  • combining comparison results.

A common idiom is to zero a register using XOR:

__m64 zero = _mm_setzero_si64();

This typically maps to a PXOR instruction.

Packing and Unpacking

Packing and unpacking are essential in SIMD programming.

Packing converts wider elements into narrower elements.

Unpacking interleaves elements from two registers and often converts narrower data into a wider representation.

Common MMX pack instructions include:

InstructionMeaning
PACKSSWBpack signed words into signed bytes with saturation
PACKSSDWpack signed doublewords into signed words with saturation
PACKUSWBpack signed words into unsigned bytes with saturation

Common unpack instructions include:

InstructionMeaning
PUNPCKLBWunpack low bytes into words
PUNPCKHBWunpack high bytes into words
PUNPCKLWDunpack low words into doublewords
PUNPCKHWDunpack high words into doublewords
PUNPCKLDQunpack low doublewords into quadword
PUNPCKHDQunpack high doublewords into quadword

These instructions are useful when you need to widen data before arithmetic.

For example, adding bytes can overflow quickly. A common approach is:

  1. unpack bytes into words;
  2. perform 16-bit arithmetic;
  3. pack the result back into bytes with saturation.

This pattern remains common in later SIMD instruction sets too.

Shift Instructions

MMX includes packed shift instructions for words, doublewords, and quadwords.

InstructionMeaning
PSLLWshift packed words left logical
PSLLDshift packed doublewords left logical
PSLLQshift quadword left logical
PSRLWshift packed words right logical
PSRLDshift packed doublewords right logical
PSRLQshift quadword right logical
PSRAWshift packed words right arithmetic
PSRADshift packed doublewords right arithmetic

Logical right shifts fill with zeros.

Arithmetic right shifts preserve the sign bit and are used for signed values.

Shifts are important for fixed-point arithmetic. For example, after multiplying fixed-point values, a right shift is often used to restore the desired scale.

Data Movement

MMX uses MOVD and MOVQ style instructions to move data.

InstructionMeaning
MOVDmove 32 bits between MMX, memory, or general-purpose register
MOVQmove 64 bits between MMX registers or memory

In C/C++ intrinsics, older MMX code often used pointer casts to load and store __m64 values. In modern C and C++, using memcpy is safer because it avoids strict-aliasing problems, and optimizing compilers usually generate efficient loads and stores.

Example:

__m64 v;
memcpy(&v, source, sizeof(v));

/* use v */

memcpy(destination, &v, sizeof(v));

For historical code, you will often see this instead:

__m64 v = *(__m64 const*)source;
*(__m64*)destination = v;

That style was common in old examples, but it can be problematic in strictly conforming C/C++ because of aliasing and alignment assumptions.

MMX Instruction Groups

The MMX instruction set can be grouped like this:

GroupExamplesPurpose
Data transferMOVD, MOVQmove data to/from MMX registers
ArithmeticPADD*, PSUB*, PMUL*packed integer arithmetic
Saturating arithmeticPADDS*, PADDUS*, PSUBS*, PSUBUS*clamp instead of wrap
ComparisonPCMPEQ*, PCMPGT*produce packed masks
Conversion / packingPACK*, PUNPCK*narrow, widen, interleave
LogicalPAND, PANDN, POR, PXORbitwise operations
ShiftPSLL*, PSRL*, PSRA*packed and quadword shifts
State managementEMMSempty MMX state

This structure influenced later SIMD extensions such as SSE2, AVX2, and AVX-512.

The register width changed, but the same categories remained important.

MMX Intrinsics

MMX intrinsics expose MMX instructions through C and C++ function-like operations.

The basic header is:

#include <mmintrin.h>

The main data type is:

__m64

Examples:

__m64 a = _mm_set1_pi16(100);
__m64 b = _mm_set1_pi16(25);

__m64 sum = _mm_add_pi16(a, b);
__m64 diff = _mm_sub_pi16(a, b);
__m64 product_low = _mm_mullo_pi16(a, b);

_mm_empty();

A useful practical rule is:

If your code uses __m64, it is using MMX state.

That means you should remember _mm_empty() before returning to code that may use x87 floating-point operations.

MMX Register Layout and Element Order

When using intrinsics such as _mm_set_pi8, _mm_set_pi16, and _mm_set_pi32, be careful with element order.

For example:

__m64 v = _mm_set_pi16(4, 3, 2, 1);

This places values into the packed lanes in a high-to-low argument order.

This convention can be confusing when debugging because memory order, lane numbering, and argument order may not look the same at first glance.

For educational code, it is often clearer to load values from an array:

#include <stdint.h>
#include <string.h>
#include <mmintrin.h>

int16_t values[4] = { 1, 2, 3, 4 };

__m64 v;
memcpy(&v, values, sizeof(v));

Then the memory layout is explicit.

Alignment

MMX registers are 64 bits wide, so MMX code usually works with 8-byte chunks.

Many MMX memory operations can access memory operands directly, and old MMX code often assumes 8-byte chunks.

However, alignment still matters for performance and correctness in low-level code.

General advice:

  • process data in 8-byte blocks;
  • handle leftover elements with scalar code;
  • avoid unsafe pointer casts when writing portable C/C++;
  • be careful when reading past the end of buffers;
  • benchmark with realistic data.

SSE2 and later make alignment more visible because XMM registers are 16 bytes wide, but MMX code still benefits from predictable memory access.

Tail Processing

Most buffers are not exact multiples of eight bytes.

A safe MMX loop usually has this structure:

size_t i = 0;

for (; i + 8 <= count; i += 8)
{
    // MMX work on 8 bytes.
}

_mm_empty();

for (; i < count; ++i)
{
    // Scalar tail.
}

The scalar tail is not just a detail. It is necessary for correctness.

Do not read past the end of the input buffer unless the data format explicitly guarantees safe padding.

MMX and Modern x86-64

Modern x86-64 processors generally still support MMX for backward compatibility, but MMX is not a good target for new code.

There are several reasons:

  • MMX registers are only 64 bits wide.
  • MMX shares state with x87.
  • MMX requires _mm_empty().
  • SSE2 is part of the x86-64 baseline.
  • SSE2 provides 128-bit integer SIMD with XMM registers.
  • AVX2 and AVX-512 provide much wider SIMD on newer CPUs.
  • Modern compilers optimize SSE2 and newer SIMD much better than MMX.

For new code, MMX is mainly useful as historical knowledge.

MMX vs SSE2

SSE2 is the natural replacement for MMX integer SIMD.

FeatureMMXSSE2
Register typeMMXXMM
Register width64-bit128-bit
Intrinsic type__m64__m128i
Packed bytes816
Packed words48
Packed doublewords24
Shares x87 stateYesNo
Requires _mm_empty()YesNo
Good for new codeNoYes

Here is a simple MMX operation:

#include <mmintrin.h>

void mmx_example(void)
{
    __m64 a = _mm_set_pi16(4, 3, 2, 1);
    __m64 b = _mm_set_pi16(8, 7, 6, 5);

    __m64 c = _mm_add_pi16(a, b);

    _mm_empty();
}

Here is the equivalent SSE2-style operation:

#include <emmintrin.h>

void sse2_example(void)
{
    __m128i a = _mm_set_epi16(8, 7, 6, 5, 4, 3, 2, 1);
    __m128i b = _mm_set_epi16(16, 15, 14, 13, 12, 11, 10, 9);

    __m128i c = _mm_add_epi16(a, b);

    // No _mm_empty() required for pure SSE2 code.
}

The SSE2 version processes twice as many 16-bit values and avoids the MMX/x87 state problem.

MMX vs 3DNow!

AMD introduced 3DNow! after MMX.

3DNow! used the MMX register file but added packed single-precision floating-point operations and other media-oriented instructions.

This was AMD’s way to accelerate 3D and multimedia workloads before SSE became widely available.

However, 3DNow! did not become the long-term portable SIMD path. Intel SSE and SSE2 became the dominant model, and AMD eventually adopted the SSE/AVX family as well.

Today, 3DNow! is obsolete for new code.

MMX vs SSE, AVX, and AVX-512

MMX started the SIMD path on x86, but later instruction sets expanded the same idea.

Instruction setRegister typeWidthMain role
MMX__m6464-bitPacked integer SIMD
SSE__m128128-bitPacked single-precision floating point
SSE2__m128i, __m128d128-bitInteger SIMD and double-precision floating point
AVX__m256256-bitWider floating-point SIMD
AVX2__m256i256-bitWider integer SIMD
AVX-512__m512i, __m512, __m512d512-bitWide SIMD, masks, HPC, AI, analytics

The modern equivalent of MMX-style packed integer programming is not MMX. It is SSE2, AVX2, or AVX-512 depending on the target CPU.

When MMX Still Matters

MMX is still worth understanding when:

  • maintaining old C/C++ multimedia code;
  • reading old assembly routines;
  • studying early SIMD optimization techniques;
  • porting legacy codecs or image filters;
  • understanding the evolution from MMX to SSE2;
  • working with vintage CPUs;
  • documenting old x86 performance techniques.

MMX is not useless knowledge. It explains many SIMD ideas that still exist today:

  • packed data;
  • lane-wise arithmetic;
  • saturating arithmetic;
  • masks from comparisons;
  • widening and narrowing;
  • fixed-point arithmetic;
  • vectorized loops;
  • scalar tail handling.

Those ideas did not disappear. They simply moved to wider and cleaner register files.

When Not to Use MMX

Avoid MMX for new code unless you have a very specific legacy requirement.

Do not choose MMX when:

  • you are writing new x86-64 code;
  • SSE2 is available;
  • you need floating-point SIMD;
  • you want good compiler auto-vectorization;
  • you want portable modern SIMD paths;
  • you are targeting AVX2 or newer CPUs;
  • you want to avoid _mm_empty() and x87 interactions.

For new x86-64 code, SSE2 should be considered the minimum SIMD baseline.

For modern performance code, AVX2 is often a better target when available.

Common Mistakes

Mistake 1: Forgetting _mm_empty()

This is the classic MMX bug.

void bad_mmx_function(void)
{
    __m64 a = _mm_set1_pi16(10);
    __m64 b = _mm_set1_pi16(20);

    __m64 c = _mm_add_pi16(a, b);

    // Missing _mm_empty().
}

Correct version:

void good_mmx_function(void)
{
    __m64 a = _mm_set1_pi16(10);
    __m64 b = _mm_set1_pi16(20);

    __m64 c = _mm_add_pi16(a, b);

    _mm_empty();
}

Mistake 2: Assuming MMX Supports Floating Point

MMX does not provide floating-point SIMD.

It provides packed integer SIMD.

For floating-point SIMD, use SSE or later.

Mistake 3: Confusing Wraparound and Saturation

These are different:

_mm_add_pi8(a, b);     // wraparound
_mm_adds_pu8(a, b);    // unsigned saturation

For pixels and audio samples, saturating arithmetic is often the correct choice.

Mistake 4: Ignoring Tail Elements

If your loop processes eight bytes at a time, you still need to handle the remaining bytes.

for (; i < count; ++i)
{
    // scalar tail
}

Mistake 5: Using MMX for New Code

MMX was important historically, but SSE2 is a better minimum target for modern x86-64 code.

If you are starting from scratch, use __m128i, not __m64.

Mistake 6: Assuming Wider SIMD Automatically Fixes the Algorithm

Moving from scalar code to MMX, or from MMX to SSE2, requires careful thinking about:

  • overflow;
  • signed vs unsigned operations;
  • data layout;
  • alignment;
  • tail handling;
  • fixed-point scaling;
  • memory bandwidth.

SIMD makes operations parallel. It does not automatically make the algorithm correct.

Best Practices for Legacy MMX Code

If you must maintain MMX code:

  • keep MMX blocks small and well isolated;
  • call _mm_empty() at the end of each MMX section;
  • avoid mixing MMX and x87 floating-point code;
  • document whether arithmetic wraps or saturates;
  • handle scalar tails safely;
  • avoid reading past buffer ends;
  • use clear helper functions for loading and storing;
  • consider replacing the code with SSE2;
  • benchmark before and after any rewrite.

Best Practices for New Code

For new SIMD code:

  • use SSE2 or later instead of MMX;
  • use __m128i for 128-bit integer SIMD;
  • use AVX2 where 256-bit integer SIMD is useful and available;
  • use AVX-512 only with runtime dispatch and benchmarking;
  • keep a scalar fallback when portability matters;
  • let the compiler auto-vectorize simple loops before writing intrinsics manually;
  • write tests that cover overflow, saturation, signs, and tails.

Summary

MMX was the first major SIMD extension for x86 processors.

It introduced 64-bit packed integer operations and made it possible to process multiple bytes, words, or doublewords with one instruction. This was a major improvement for multimedia and communications workloads in the late 1990s.

The key MMX concepts are:

  • SIMD parallelism;
  • eight 64-bit MMX registers;
  • packed byte, word, doubleword, and quadword data;
  • integer and fixed-point arithmetic;
  • saturating arithmetic;
  • packed comparisons and masks;
  • packing and unpacking;
  • shift and logical operations;
  • the need for EMMS / _mm_empty().

The main limitation is that MMX shares state with the x87 floating-point unit. That makes _mm_empty() necessary and makes MMX awkward compared with later SIMD instruction sets.

For modern code, MMX is legacy.

SSE2 moved packed integer SIMD into 128-bit XMM registers, doubled the data width, avoided the x87 state problem, and became part of the x86-64 baseline.

MMX is still worth learning because it explains where x86 SIMD programming started. But for new development, the practical recommendation is clear:

Use SSE2 or later, and keep MMX for legacy code and historical understanding.

References