SIMD on x64/x86

MMX Intrinsics: Packed Integer SIMD with __m64

MMX was Intel’s first widely used SIMD extension for x86 processors. It introduced packed integer operations, allowing one instruction to process multiple small integer values at the same time.

MMX is mostly historical today, but it is still useful for understanding older multimedia, image-processing, audio, codec, and game code. Many of the concepts introduced by MMX (packed lanes, saturating arithmetic, packing, unpacking, and vector-style operations) later became central to SSE2, AVX2, and AVX-512 integer SIMD programming.

For new code, SSE2 __m128i, AVX2 __m256i, or later SIMD instruction sets are usually better choices. MMX is still worth knowing because a lot of legacy code and old optimization articles use it.

What MMX does

MMX operates on 64-bit vector values. A single MMX value can be interpreted in several ways:

8 x 8-bit integers
4 x 16-bit integers
2 x 32-bit integers
1 x 64-bit integer

The same 64 bits can represent different lane layouts depending on the instruction being used.

For example, the same MMX register could be treated as eight unsigned bytes:

[ b0 | b1 | b2 | b3 | b4 | b5 | b6 | b7 ]

or as four 16-bit words:

[ w0 | w1 | w2 | w3 ]

or as two 32-bit doublewords:

[ d0 | d1 ]

This is the basic SIMD model: one instruction operates on multiple lanes packed into a single register.
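
As a rough illustration of the lane model, the sketch below reinterprets one 64-bit value as bytes, words, and doublewords using plain C and memcpy. No MMX instruction is involved. The type lane_views and the function view_lanes are hypothetical names, and the lane order shown assumes a little-endian x86 target.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: three views of the same 64 bits. */
typedef struct {
    uint8_t  bytes[8];   /* 8 x 8-bit lanes  */
    uint16_t words[4];   /* 4 x 16-bit lanes */
    uint32_t dwords[2];  /* 2 x 32-bit lanes */
} lane_views;

/* Copy the same 8 bytes into each view. On little-endian x86,
   lane 0 holds the least significant bits of the 64-bit value. */
static lane_views view_lanes(uint64_t bits)
{
    lane_views v;
    memcpy(v.bytes,  &bits, sizeof bits);
    memcpy(v.words,  &bits, sizeof bits);
    memcpy(v.dwords, &bits, sizeof bits);
    return v;
}
```

For the input 0x0807060504030201, byte lane 0 is 0x01, word lane 0 is 0x0201, and doubleword lane 0 is 0x04030201. The bits never change; only the lane boundaries do.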

Header file

MMX intrinsics are declared in:

#include <mmintrin.h>

This header defines the MMX data type and the MMX intrinsic functions.

The __m64 data type

The main MMX intrinsic type is:

__m64

A __m64 value represents a 64-bit MMX register.

Unlike SSE’s __m128, which is commonly used for four floating-point values, MMX is an integer SIMD technology. It works with packed bytes, words, doublewords, and quadwords.

Conceptually:

__m64 value = 64 bits

Those 64 bits can be interpreted as:

Interpretation     Lane count
8-bit integers     8 lanes
16-bit integers    4 lanes
32-bit integers    2 lanes
64-bit integer     1 lane

The intrinsic name tells the compiler and the reader how those bits are meant to be interpreted.

Important historical note: MMX and x87 share state

MMX has one unusual and important limitation: MMX registers alias the old x87 floating-point register state.

Because of this, after using MMX instructions, code should call:

_mm_empty();

This emits the EMMS instruction, which marks all x87 registers as empty (it resets the x87 tag word). That clears the MMX state and allows normal x87 floating-point operations to be used safely again.

This is one of the main reasons MMX is awkward compared with SSE2 integer SIMD. SSE2 uses XMM registers and does not have the same x87/MMX state problem.

In short:

// Use MMX intrinsics here.

_mm_empty(); // Clear MMX state before returning or before x87 floating-point code.

Even if your function itself does not use floating-point arithmetic, calling _mm_empty() before returning from MMX code is a good habit in legacy MMX code.

Naming conventions

MMX intrinsic names are compact but systematic.

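Name fragment   Meaning
_mm             intrinsic prefix
pi8             packed 8-bit integers
pi16            packed 16-bit integers
pi32            packed 32-bit integers
pu8             packed unsigned 8-bit integers
pu16            packed unsigned 16-bit integers
si64            64-bit integer value
adds            saturating addition
subs            saturating subtraction
packs           pack with saturation: signed with pi16/pi32, unsigned with pu16
unpacklo        interleave low lanes
unpackhi        interleave high lanes

MMX intrinsic names do not use a packus fragment. The unsigned pack is spelled _mm_packs_pu16; the packus spelling appears later in SSE2 (_mm_packus_epi16).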

For example:

_mm_add_pi16(a, b)

means “add four packed 16-bit integer lanes.”

_mm_adds_pu8(a, b)

means “add eight packed unsigned 8-bit integer lanes with saturation.”

_mm_unpacklo_pi8(a, b)

means “interleave the low bytes of two MMX values.”

General support intrinsics

These are the most basic MMX support intrinsics.

Intrinsic              Instruction   Meaning
_mm_empty()            EMMS          Clear MMX state
_mm_cvtsi32_si64(i)    MOVD          Move a 32-bit integer into the low 32 bits of an __m64 value
_mm_cvtsi64_si32(m)    MOVD          Extract the low 32 bits of an __m64 value as an int

Example:

#include <mmintrin.h>

int example_convert(int x)
{
    __m64 v = _mm_cvtsi32_si64(x);
    int y = _mm_cvtsi64_si32(v);

    _mm_empty();

    return y;
}

This example moves an integer into an MMX value and then extracts it again.

Creating MMX values

MMX provides intrinsics to create packed values.

Set bytes

__m64 v = _mm_set_pi8(8, 7, 6, 5, 4, 3, 2, 1);

The ordering can be confusing because the arguments are written from high lane to low lane. In memory-oriented examples, it is often easier to load data from an array instead of using _mm_set_pi8 directly.
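
To make the argument order concrete, the sketch below builds the same value twice: once with _mm_set_pi8 (highest lane first) and once with _mm_setr_pi8, which takes the lanes in reverse, memory-like order. The function name set_order_demo is hypothetical; both intrinsics come from mmintrin.h.

```c
#include <mmintrin.h>
#include <string.h>

/* Hypothetical demo: write the bytes of two __m64 values to memory
   so the lane order can be inspected. */
static void set_order_demo(unsigned char from_set[8],
                           unsigned char from_setr[8])
{
    /* Arguments run from the highest lane down to the lowest lane. */
    __m64 a = _mm_set_pi8(8, 7, 6, 5, 4, 3, 2, 1);

    /* _mm_setr_pi8 takes the same lanes lowest first. */
    __m64 b = _mm_setr_pi8(1, 2, 3, 4, 5, 6, 7, 8);

    memcpy(from_set, &a, 8);
    memcpy(from_setr, &b, 8);

    _mm_empty();
}
```

Both output arrays hold 1, 2, 3, 4, 5, 6, 7, 8 in memory order, since the last argument of _mm_set_pi8 lands in lane 0.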

Broadcast one byte

__m64 v = _mm_set1_pi8(10);

Conceptually:

[10 | 10 | 10 | 10 | 10 | 10 | 10 | 10]

Broadcast one 16-bit value

__m64 v = _mm_set1_pi16(100);

Conceptually:

[100 | 100 | 100 | 100]

Create a zero value

__m64 zero = _mm_setzero_si64();

This creates a 64-bit zero value.

Loading and storing MMX values

Old MMX examples often load and store values by casting pointers:

__m64 v = *(__m64 const*)ptr;
*(__m64*)out = v;

This was common in legacy code, but it can raise alignment and strict-aliasing concerns in modern C and C++.

For simple examples, a safer approach is to use memcpy helper functions:

#include <mmintrin.h>
#include <string.h>

static __m64 load_m64(const void* ptr)
{
    __m64 value;
    memcpy(&value, ptr, sizeof(value));
    return value;
}

static void store_m64(void* ptr, __m64 value)
{
    memcpy(ptr, &value, sizeof(value));
}

Compilers usually optimize these small fixed-size copies efficiently.

Arithmetic intrinsics

MMX provides packed integer addition, subtraction, multiplication, and related operations.

Wrapping addition and subtraction

Wrapping arithmetic means the result wraps around on overflow.

For unsigned 8-bit arithmetic:

250 + 20 = 14    // wraps modulo 256

Common wrapping arithmetic intrinsics include:

Intrinsic             Operation
_mm_add_pi8(a, b)     add eight 8-bit integer lanes
_mm_add_pi16(a, b)    add four 16-bit integer lanes
_mm_add_pi32(a, b)    add two 32-bit integer lanes
_mm_sub_pi8(a, b)     subtract eight 8-bit integer lanes
_mm_sub_pi16(a, b)    subtract four 16-bit integer lanes
_mm_sub_pi32(a, b)    subtract two 32-bit integer lanes

Example:

__m64 a = _mm_set1_pi16(1000);
__m64 b = _mm_set1_pi16(25);

__m64 c = _mm_add_pi16(a, b);

// c contains four 16-bit lanes, each equal to 1025.

Saturating addition and subtraction

Saturating arithmetic clamps the result instead of wrapping.

For unsigned 8-bit arithmetic:

250 + 20 = 255   // saturates

instead of:

250 + 20 = 14    // wraps

Saturating arithmetic is especially useful for image and audio processing, where values often need to stay within a fixed range.

Intrinsic              Operation
_mm_adds_pi8(a, b)     signed saturating add, 8-bit lanes
_mm_adds_pi16(a, b)    signed saturating add, 16-bit lanes
_mm_adds_pu8(a, b)     unsigned saturating add, 8-bit lanes
_mm_adds_pu16(a, b)    unsigned saturating add, 16-bit lanes
_mm_subs_pi8(a, b)     signed saturating subtract, 8-bit lanes
_mm_subs_pi16(a, b)    signed saturating subtract, 16-bit lanes
_mm_subs_pu8(a, b)     unsigned saturating subtract, 8-bit lanes
_mm_subs_pu16(a, b)    unsigned saturating subtract, 16-bit lanes
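
The difference between the three 8-bit addition modes can be sketched side by side. The function add_modes_demo is a hypothetical helper that returns only lane 0 of each result; the input values are chosen to overflow in each mode.

```c
#include <mmintrin.h>
#include <string.h>

/* Hypothetical demo: out[0] = wrapping, out[1] = unsigned saturating,
   out[2] = signed saturating. Only lane 0 of each result is stored. */
static void add_modes_demo(unsigned char out[3])
{
    __m64 a = _mm_set1_pi8((char)250); /* bit pattern 0xFA in every lane */
    __m64 b = _mm_set1_pi8(20);
    __m64 c = _mm_set1_pi8(100);

    __m64 wrapped = _mm_add_pi8(a, b);   /* 270 mod 256 = 14            */
    __m64 sat_u   = _mm_adds_pu8(a, b);  /* 270 clamps to 255           */
    __m64 sat_s   = _mm_adds_pi8(c, c);  /* 200 clamps to signed max 127 */

    memcpy(&out[0], &wrapped, 1);
    memcpy(&out[1], &sat_u, 1);
    memcpy(&out[2], &sat_s, 1);

    _mm_empty();
}
```

After the call, out holds 14, 255, and 127.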

Example: brighten 8-bit pixels with saturation

This example adds a brightness value to an array of unsigned 8-bit pixels. Values are clamped to 255 instead of wrapping around.

#include <mmintrin.h>
#include <string.h>

static __m64 load_m64(const void* ptr)
{
    __m64 value;
    memcpy(&value, ptr, sizeof(value));
    return value;
}

static void store_m64(void* ptr, __m64 value)
{
    memcpy(ptr, &value, sizeof(value));
}

void brighten_u8_mmx(const unsigned char* input,
                     unsigned char* output,
                     int count,
                     unsigned char amount)
{
    __m64 vamount = _mm_set1_pi8((char)amount);

    int i = 0;

    for (; i + 7 < count; i += 8)
    {
        __m64 pixels = load_m64(&input[i]);

        // Unsigned saturating add:
        // values above 255 are clamped to 255.
        __m64 result = _mm_adds_pu8(pixels, vamount);

        store_m64(&output[i], result);
    }

    _mm_empty();

    for (; i < count; ++i)
    {
        int value = input[i] + amount;

        if (value > 255)
            value = 255;

        output[i] = (unsigned char)value;
    }
}

The MMX loop processes eight pixels per iteration.

This is a classic MMX-style use case: many small integer values, simple arithmetic, and saturation.

Multiplication intrinsics

MMX supports multiplication of 16-bit integer lanes.

IntrinsicOperation
_mm_mullo_pi16(a, b)    multiply four 16-bit lanes and keep the low 16 bits of each result
_mm_mulhi_pi16(a, b)    multiply four signed 16-bit lanes and keep the high 16 bits of each result
_mm_madd_pi16(a, b)     multiply pairs of 16-bit lanes and add adjacent products into 32-bit results

Example:

__m64 a = _mm_set1_pi16(100);
__m64 b = _mm_set1_pi16(3);

__m64 c = _mm_mullo_pi16(a, b);

// c contains four 16-bit lanes, each equal to 300.
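
When a product does not fit in 16 bits, the low and high halves can be combined. The sketch below is one possible way to do that: it interleaves the _mm_mullo_pi16 and _mm_mulhi_pi16 results to rebuild full signed 32-bit products. The helper name mul_s16_to_s32 is hypothetical.

```c
#include <mmintrin.h>
#include <string.h>

/* Hypothetical helper: multiply four signed 16-bit pairs and
   write the four full 32-bit products to out. */
static void mul_s16_to_s32(const short a[4], const short b[4], int out[4])
{
    __m64 va, vb;
    memcpy(&va, a, sizeof va);
    memcpy(&vb, b, sizeof vb);

    __m64 lo = _mm_mullo_pi16(va, vb); /* low 16 bits of each product  */
    __m64 hi = _mm_mulhi_pi16(va, vb); /* high 16 bits of each product */

    /* Interleaving the lo and hi words rebuilds the 32-bit products. */
    __m64 p01 = _mm_unpacklo_pi16(lo, hi); /* products 0 and 1 */
    __m64 p23 = _mm_unpackhi_pi16(lo, hi); /* products 2 and 3 */

    memcpy(&out[0], &p01, sizeof p01);
    memcpy(&out[2], &p23, sizeof p23);

    _mm_empty();
}
```

For example, 1000 * 1000 yields 1000000, which would be truncated if only the low 16 bits were kept.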

The _mm_madd_pi16 intrinsic is especially useful in signal processing, dot products, filters, and transform code.

Conceptually:

a = [a0 | a1 | a2 | a3]
b = [b0 | b1 | b2 | b3]

_mm_madd_pi16(a, b) produces:

[ a0*b0 + a1*b1 | a2*b2 + a3*b3 ]

The result contains two 32-bit integer lanes.
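
A four-element dot product is a small, hedged illustration of this. The helper dot4_s16 below is a hypothetical name; it uses _mm_madd_pi16 for the pairwise products and then folds the two 32-bit lanes together.

```c
#include <mmintrin.h>
#include <string.h>

/* Hypothetical helper: dot product of two 4-element int16 vectors. */
static int dot4_s16(const short a[4], const short b[4])
{
    __m64 va, vb;
    memcpy(&va, a, sizeof va);
    memcpy(&vb, b, sizeof vb);

    /* Two 32-bit lanes: a0*b0 + a1*b1 and a2*b2 + a3*b3. */
    __m64 sums = _mm_madd_pi16(va, vb);

    /* Move the high lane down, add it to the low lane,
       then extract the low 32 bits. */
    __m64 high = _mm_srli_si64(sums, 32);
    __m64 total = _mm_add_pi32(sums, high);
    int result = _mm_cvtsi64_si32(total);

    _mm_empty();
    return result;
}
```

With a = {1, 2, 3, 4} and b = {5, 6, 7, 8}, the function returns 5 + 12 + 21 + 32 = 70.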

Packing intrinsics

Packing converts wider integer lanes into narrower integer lanes.

This is useful when intermediate calculations are done at higher precision and then converted back to smaller output values.

For example, an image-processing filter may compute temporary 16-bit values and then pack them back into 8-bit pixels.

Intrinsic               Instruction   Meaning
_mm_packs_pi16(a, b)    PACKSSWB      Pack signed 16-bit values into signed saturated 8-bit values
_mm_packs_pi32(a, b)    PACKSSDW      Pack signed 32-bit values into signed saturated 16-bit values
_mm_packs_pu16(a, b)    PACKUSWB      Pack signed 16-bit values into unsigned saturated 8-bit values

Signed saturation example

When packing signed 16-bit values to signed 8-bit values:

-200 -> -128
 -20 ->  -20
 100 ->  100
 200 ->  127

The signed 8-bit range is:

-128 to 127

Values outside that range are clamped.
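
The clamping shown above can be reproduced with a short sketch around _mm_packs_pi16. The helper name pack_s16_to_s8 is hypothetical.

```c
#include <mmintrin.h>
#include <string.h>

/* Hypothetical helper: pack eight signed 16-bit values into eight
   signed 8-bit values with signed saturation. */
static void pack_s16_to_s8(const short in[8], signed char out[8])
{
    __m64 lo, hi;
    memcpy(&lo, &in[0], sizeof lo); /* first four words */
    memcpy(&hi, &in[4], sizeof hi); /* last four words  */

    /* The first operand fills the low four bytes of the result,
       the second operand fills the high four bytes. */
    __m64 packed = _mm_packs_pi16(lo, hi);

    memcpy(out, &packed, sizeof packed);

    _mm_empty();
}
```

Passing -200, -20, 100, 200 as the first four inputs produces -128, -20, 100, 127 in the first four outputs.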

Unsigned saturation example

When packing 16-bit values to unsigned 8-bit values:

-20 ->   0
  0 ->   0
100 -> 100
300 -> 255

The unsigned 8-bit range is:

0 to 255

Values below 0 become 0. Values above 255 become 255.

Example: pack 16-bit values to unsigned 8-bit pixels

This example converts signed 16-bit intermediate values to unsigned 8-bit output values using saturation.

#include <mmintrin.h>
#include <string.h>

static __m64 load_m64(const void* ptr)
{
    __m64 value;
    memcpy(&value, ptr, sizeof(value));
    return value;
}

static void store_m64(void* ptr, __m64 value)
{
    memcpy(ptr, &value, sizeof(value));
}

void pack_s16_to_u8_mmx(const short* input,
                        unsigned char* output,
                        int count)
{
    int i = 0;

    // Process 8 input shorts and produce 8 output bytes.
    for (; i + 7 < count; i += 8)
    {
        __m64 lowWords = load_m64(&input[i]);      // 4 x int16
        __m64 highWords = load_m64(&input[i + 4]); // 4 x int16

        __m64 packed = _mm_packs_pu16(lowWords, highWords);

        store_m64(&output[i], packed);
    }

    _mm_empty();

    for (; i < count; ++i)
    {
        int value = input[i];

        if (value < 0)
            value = 0;

        if (value > 255)
            value = 255;

        output[i] = (unsigned char)value;
    }
}

This is common in image-processing code where calculations are performed using 16-bit or 32-bit intermediates, but the final image is 8-bit.

Unpacking intrinsics

Unpacking interleaves lanes from two MMX values.

This is often used to widen smaller integer values before doing arithmetic.

For example, unsigned 8-bit pixels may be unpacked into 16-bit words before multiplication or addition, avoiding overflow.

Intrinsic                  Instruction   Meaning
_mm_unpacklo_pi8(a, b)     PUNPCKLBW     Interleave low bytes
_mm_unpackhi_pi8(a, b)     PUNPCKHBW     Interleave high bytes
_mm_unpacklo_pi16(a, b)    PUNPCKLWD     Interleave low 16-bit words
_mm_unpackhi_pi16(a, b)    PUNPCKHWD     Interleave high 16-bit words
_mm_unpacklo_pi32(a, b)    PUNPCKLDQ     Interleave low 32-bit doublewords
_mm_unpackhi_pi32(a, b)    PUNPCKHDQ     Interleave high 32-bit doublewords
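
To show what "interleave" means when both operands carry data, the sketch below interleaves two 8-byte arrays into one 16-byte array. The helper name interleave_bytes is hypothetical.

```c
#include <mmintrin.h>
#include <string.h>

/* Hypothetical helper: out = a0 b0 a1 b1 ... a7 b7. */
static void interleave_bytes(const unsigned char a[8],
                             const unsigned char b[8],
                             unsigned char out[16])
{
    __m64 va, vb;
    memcpy(&va, a, sizeof va);
    memcpy(&vb, b, sizeof vb);

    __m64 lo = _mm_unpacklo_pi8(va, vb); /* a0 b0 a1 b1 a2 b2 a3 b3 */
    __m64 hi = _mm_unpackhi_pi8(va, vb); /* a4 b4 a5 b5 a6 b6 a7 b7 */

    memcpy(&out[0], &lo, sizeof lo);
    memcpy(&out[8], &hi, sizeof hi);

    _mm_empty();
}
```

The same pattern is how planar data (for example, separate channel arrays) was often merged into interleaved form in legacy MMX code.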

Example: widen unsigned bytes to unsigned words

This example converts unsigned 8-bit values into unsigned 16-bit values.

The trick is to unpack the bytes with zero.

#include <mmintrin.h>
#include <string.h>

static __m64 load_m64(const void* ptr)
{
    __m64 value;
    memcpy(&value, ptr, sizeof(value));
    return value;
}

static void store_m64(void* ptr, __m64 value)
{
    memcpy(ptr, &value, sizeof(value));
}

void widen_u8_to_u16_mmx(const unsigned char* input,
                         unsigned short* output,
                         int count)
{
    __m64 zero = _mm_setzero_si64();

    int i = 0;

    for (; i + 7 < count; i += 8)
    {
        __m64 bytes = load_m64(&input[i]);

        __m64 lowWords = _mm_unpacklo_pi8(bytes, zero);
        __m64 highWords = _mm_unpackhi_pi8(bytes, zero);

        store_m64(&output[i], lowWords);
        store_m64(&output[i + 4], highWords);
    }

    _mm_empty();

    for (; i < count; ++i)
    {
        output[i] = input[i];
    }
}

Conceptually, this converts:

[ b0 | b1 | b2 | b3 | b4 | b5 | b6 | b7 ]

into:

[ b0 | b1 | b2 | b3 ] as 16-bit values
[ b4 | b5 | b6 | b7 ] as 16-bit values

This is a common first step before doing arithmetic on pixel data.

Logical operations

MMX also supports bitwise logical operations.

Intrinsic                Operation
_mm_and_si64(a, b)       bitwise AND: a & b
_mm_or_si64(a, b)        bitwise OR: a | b
_mm_xor_si64(a, b)       bitwise XOR: a ^ b
_mm_andnot_si64(a, b)    bitwise AND NOT: (~a) & b (the first operand is the one inverted)

These are useful for masking, clearing bits, combining comparison results, and manipulating packed data.

Example:

__m64 a = _mm_set1_pi8((char)0xF0);
__m64 b = _mm_set1_pi8((char)0x0F);

__m64 c = _mm_or_si64(a, b);

// Each byte in c is 0xFF.

Comparison intrinsics

MMX comparisons produce masks. Each lane in the result is either all bits set, meaning true, or all bits clear, meaning false.

Intrinsic               Operation
_mm_cmpeq_pi8(a, b)     compare eight 8-bit lanes for equality
_mm_cmpeq_pi16(a, b)    compare four 16-bit lanes for equality
_mm_cmpeq_pi32(a, b)    compare two 32-bit lanes for equality
_mm_cmpgt_pi8(a, b)     compare eight signed 8-bit lanes for greater-than
_mm_cmpgt_pi16(a, b)    compare four signed 16-bit lanes for greater-than
_mm_cmpgt_pi32(a, b)    compare two signed 32-bit lanes for greater-than

Example:

__m64 a = _mm_set1_pi16(100);
__m64 b = _mm_set1_pi16(50);

__m64 mask = _mm_cmpgt_pi16(a, b);

Each 16-bit lane in mask is true because 100 is greater than 50.

Internally, true lanes are represented with all bits set:

0xFFFF

False lanes are represented as:

0x0000

These masks can be combined with logical operations.
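
One common way to combine a mask with logical operations is a branch-free select. The sketch below uses that pattern to compute a per-lane signed maximum, which MMX does not offer as a single instruction. The helper name max_s16x4 is hypothetical.

```c
#include <mmintrin.h>
#include <string.h>

/* Hypothetical helper: out[i] = max(a[i], b[i]) for four signed 16-bit lanes. */
static void max_s16x4(const short a[4], const short b[4], short out[4])
{
    __m64 va, vb;
    memcpy(&va, a, sizeof va);
    memcpy(&vb, b, sizeof vb);

    __m64 mask = _mm_cmpgt_pi16(va, vb);      /* 0xFFFF where a > b       */
    __m64 fromA = _mm_and_si64(mask, va);     /* keep a where mask is set */
    __m64 fromB = _mm_andnot_si64(mask, vb);  /* (~mask) & b elsewhere    */
    __m64 result = _mm_or_si64(fromA, fromB); /* merge the two halves     */

    memcpy(out, &result, sizeof result);

    _mm_empty();
}
```

Each lane takes its value from a where the comparison was true and from b otherwise, without any per-lane branching.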

Shift intrinsics

MMX provides shift operations for packed 16-bit and 32-bit integer lanes, and for whole 64-bit values.

Intrinsic                  Operation
_mm_slli_pi16(a, count)    shift each 16-bit lane left
_mm_slli_pi32(a, count)    shift each 32-bit lane left
_mm_slli_si64(a, count)    shift the whole 64-bit value left
_mm_srli_pi16(a, count)    logical shift right of each 16-bit lane
_mm_srli_pi32(a, count)    logical shift right of each 32-bit lane
_mm_srli_si64(a, count)    logical shift right of the whole 64-bit value
_mm_srai_pi16(a, count)    arithmetic shift right of each signed 16-bit lane
_mm_srai_pi32(a, count)    arithmetic shift right of each signed 32-bit lane

Logical right shift fills with zeros.

Arithmetic right shift preserves the sign bit, so it is used for signed values.
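
The difference only shows up for negative inputs. The sketch below shifts the same 16-bit value right by 2 both ways and returns lane 0 of each result. The helper name shift_right_by_2 is hypothetical.

```c
#include <mmintrin.h>
#include <string.h>

/* Hypothetical demo: shift one 16-bit value right by 2 using both
   the arithmetic and the logical shift, returning lane 0 of each. */
static void shift_right_by_2(short value,
                             short* arithmetic,
                             unsigned short* logical)
{
    __m64 v = _mm_set1_pi16(value);

    __m64 sa = _mm_srai_pi16(v, 2); /* copies the sign bit into the top bits */
    __m64 sl = _mm_srli_pi16(v, 2); /* fills the top bits with zeros         */

    memcpy(arithmetic, &sa, sizeof *arithmetic);
    memcpy(logical, &sl, sizeof *logical);

    _mm_empty();
}
```

For the input -16 (bit pattern 0xFFF0), the arithmetic shift gives -4, while the logical shift gives 0x3FFC, which is 16380. For a non-negative input such as 16, both give 4.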

Example: average two unsigned byte arrays

MMX itself does not have all the convenience instructions that later SIMD extensions added, but basic operations can still be combined to build useful routines.

This example computes a rounded average of two unsigned byte arrays:

output[i] = (a[i] + b[i] + 1) / 2    // halves round up

To avoid 8-bit overflow, the bytes are widened to 16-bit words first.

#include <mmintrin.h>
#include <string.h>

static __m64 load_m64(const void* ptr)
{
    __m64 value;
    memcpy(&value, ptr, sizeof(value));
    return value;
}

static void store_m64(void* ptr, __m64 value)
{
    memcpy(ptr, &value, sizeof(value));
}

void average_u8_mmx(const unsigned char* a,
                    const unsigned char* b,
                    unsigned char* output,
                    int count)
{
    __m64 zero = _mm_setzero_si64();
    __m64 one = _mm_set1_pi16(1);

    int i = 0;

    for (; i + 7 < count; i += 8)
    {
        __m64 va = load_m64(&a[i]);
        __m64 vb = load_m64(&b[i]);

        __m64 aLow = _mm_unpacklo_pi8(va, zero);
        __m64 aHigh = _mm_unpackhi_pi8(va, zero);

        __m64 bLow = _mm_unpacklo_pi8(vb, zero);
        __m64 bHigh = _mm_unpackhi_pi8(vb, zero);

        __m64 sumLow = _mm_add_pi16(aLow, bLow);
        __m64 sumHigh = _mm_add_pi16(aHigh, bHigh);

        // Optional rounding: add 1 before shifting right by 1.
        sumLow = _mm_add_pi16(sumLow, one);
        sumHigh = _mm_add_pi16(sumHigh, one);

        __m64 avgLow = _mm_srli_pi16(sumLow, 1);
        __m64 avgHigh = _mm_srli_pi16(sumHigh, 1);

        __m64 packed = _mm_packs_pu16(avgLow, avgHigh);

        store_m64(&output[i], packed);
    }

    _mm_empty();

    for (; i < count; ++i)
    {
        output[i] = (unsigned char)(((int)a[i] + (int)b[i] + 1) >> 1);
    }
}

This example shows several important MMX techniques together:

  • load eight bytes,
  • unpack bytes to words,
  • do arithmetic in 16-bit lanes,
  • shift to divide by two,
  • pack back to unsigned bytes.

MMX vs SSE2

For new code, SSE2 integer intrinsics are usually preferable to MMX.

MMX uses:

__m64

which is 64 bits wide.

SSE2 uses:

__m128i

which is 128 bits wide.

That means SSE2 can process twice as much integer data per vector register:

Data type          MMX __m64   SSE2 __m128i
8-bit integers     8 lanes     16 lanes
16-bit integers    4 lanes     8 lanes
32-bit integers    2 lanes     4 lanes
64-bit integers    1 lane      2 lanes

SSE2 also avoids the MMX/x87 state problem. There is no need to call _mm_empty() after using SSE2 integer intrinsics.

For example, this MMX idea:

__m64 result = _mm_adds_pu8(a, b);

has a wider SSE2 equivalent:

__m128i result = _mm_adds_epu8(a, b);

The SSE2 version processes sixteen unsigned 8-bit values instead of eight.
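
As a sketch of what such a migration might look like, here is the brightening loop from earlier in this article rewritten with SSE2. The function name brighten_u8_sse2 is hypothetical. The unaligned load and store intrinsics (_mm_loadu_si128, _mm_storeu_si128) are used on the assumption that the buffers have no particular alignment.

```c
#include <emmintrin.h>

/* Hypothetical SSE2 counterpart of the MMX brightening loop:
   16 pixels per iteration, and no _mm_empty() is needed. */
void brighten_u8_sse2(const unsigned char* input,
                      unsigned char* output,
                      int count,
                      unsigned char amount)
{
    __m128i vamount = _mm_set1_epi8((char)amount);

    int i = 0;

    for (; i + 15 < count; i += 16)
    {
        __m128i pixels = _mm_loadu_si128((const __m128i*)&input[i]);

        /* Unsigned saturating add: sums above 255 clamp to 255. */
        __m128i result = _mm_adds_epu8(pixels, vamount);

        _mm_storeu_si128((__m128i*)&output[i], result);
    }

    /* Scalar tail for the remaining 0 to 15 pixels. */
    for (; i < count; ++i)
    {
        int value = input[i] + amount;
        output[i] = (unsigned char)(value > 255 ? 255 : value);
    }
}
```

The structure is the same as the MMX version; only the vector type, the lane count per iteration, and the intrinsic suffixes change.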

Common pitfalls

MMX code has several traps that are easy to miss.

Forgetting _mm_empty()

This is the classic MMX bug.

After using MMX instructions, call:

_mm_empty();

This clears the MMX state. Without it, later x87 floating-point code may produce incorrect results (typically NaN values caused by x87 stack faults), and some compilers warn about the missing EMMS.

Treating MMX as floating-point SIMD

MMX is packed integer SIMD.

It does not provide the floating-point SIMD programming model that SSE later introduced. If you want packed single-precision floating-point arithmetic, use SSE __m128. If you want packed double-precision floating-point arithmetic, use SSE2 __m128d.

Confusing wrapping and saturating arithmetic

Wrapping arithmetic and saturating arithmetic are different.

Unsigned 8-bit wrapping:

250 + 20 = 14

Unsigned 8-bit saturation:

250 + 20 = 255

Use wrapping operations when you want modulo arithmetic. Use saturating operations when values must stay inside a valid range.

Forgetting that __m64 is only 64 bits

MMX processes only:

8 bytes
4 words
2 doublewords
1 quadword

For modern SIMD code, this is narrow. SSE2 doubles the width to 128 bits, AVX2 doubles it again to 256 bits, and AVX-512 doubles it again to 512 bits.

Using pointer casts carelessly

Legacy MMX code often casts pointers directly to __m64*.

That may work in old code, but it can be questionable in modern C and C++ because of alignment and strict-aliasing rules.

For simple, safe examples, memcpy is a good way to express a 64-bit load or store without relying on aliasing behavior.

Assuming MMX is a good choice for new code

MMX is useful to understand, but it is rarely the right choice for new performance-sensitive x86 code.

Use MMX when:

  • maintaining old code,
  • studying legacy SIMD examples,
  • understanding older multimedia routines,
  • working with an old compiler or target where MMX is required.

Use SSE2, AVX2, or later SIMD instruction sets for new code whenever possible.

Build notes

On GCC or Clang, MMX can be enabled with:

gcc -O2 -mmmx example.c -o example

For C++:

g++ -O2 -mmmx example.cpp -o example

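Every x86-64 processor supports MMX, and GCC and Clang normally enable it by default for x86-64 targets, so the -mmmx flag mainly matters for 32-bit builds. Compiler support still varies by platform and target mode: MSVC, for example, does not provide the MMX intrinsics when targeting x64.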

For new x86-64 code, prefer SSE2 integer intrinsics instead of MMX.

Complete example program

The following small program demonstrates unsigned saturating addition on eight 8-bit values.

#include <stdio.h>
#include <mmintrin.h>
#include <string.h>

static __m64 load_m64(const void* ptr)
{
    __m64 value;
    memcpy(&value, ptr, sizeof(value));
    return value;
}

static void store_m64(void* ptr, __m64 value)
{
    memcpy(ptr, &value, sizeof(value));
}

int main(void)
{
    unsigned char a[8] = { 10, 20, 30, 40, 240, 250, 100, 200 };
    unsigned char b[8] = {  1,  2,  3,  4,  20,  20, 200, 100 };
    unsigned char out[8];

    __m64 va = load_m64(a);
    __m64 vb = load_m64(b);

    __m64 result = _mm_adds_pu8(va, vb);

    store_m64(out, result);

    _mm_empty();

    for (int i = 0; i < 8; ++i)
    {
        printf("%u ", out[i]);
    }

    printf("\n");

    return 0;
}

Expected output:

11 22 33 44 255 255 255 255

The last four results demonstrate unsigned saturation. Instead of wrapping around, values above 255 are clamped to 255.

Summary

MMX introduced packed integer SIMD programming to x86.

The key ideas are:

  • __m64 represents a 64-bit MMX value.
  • The same 64 bits can be interpreted as 8-bit, 16-bit, 32-bit, or 64-bit integer lanes.
  • MMX is integer-only SIMD, not floating-point SIMD.
  • Saturating arithmetic is one of MMX’s most useful features.
  • Packing and unpacking are essential for image, audio, and codec-style code.
  • MMX shares state with the old x87 floating-point unit, so _mm_empty() is required after MMX code.
  • For new code, SSE2 __m128i or later SIMD instruction sets are usually better choices.

MMX is old, but it remains historically important. Understanding MMX makes it easier to read legacy optimized code and to understand how later SIMD extensions evolved.
