MMX Intrinsics: Packed Integer SIMD with __m64

May 27, 2010 - By Stefano Tommesani

MMX was Intel’s first widely used SIMD extension for x86 processors. It introduced packed integer operations, allowing one instruction to process multiple small integer values at the same time.

MMX is mostly historical today, but it is still useful to understand older multimedia, image-processing, audio, codec, and game code. Many of the concepts introduced by MMX — packed lanes, saturating arithmetic, packing, unpacking, and vector-style operations — later became central to SSE2, AVX2, and AVX-512 integer SIMD programming.

For new code, SSE2 __m128i, AVX2 __m256i, or later SIMD instruction sets are usually better choices. MMX is still worth knowing because a lot of legacy code and old optimization articles use it.

What MMX does

MMX operates on 64-bit vector values. A single MMX value can be interpreted in several ways:

8 x 8-bit integers
4 x 16-bit integers
2 x 32-bit integers
1 x 64-bit integer

The same 64 bits can represent different lane layouts depending on the instruction being used.

For example, the same MMX register could be treated as eight unsigned bytes:

[ b0 | b1 | b2 | b3 | b4 | b5 | b6 | b7 ]

or as four 16-bit words:

[ w0 | w1 | w2 | w3 ]

or as two 32-bit doublewords:

[ d0 | d1 ]

This is the basic SIMD model: one instruction operates on multiple lanes packed into a single register.
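
The lane views above can be checked with a small experiment. The following is only a sketch: view_lanes is a hypothetical helper name, and it assumes a little-endian x86 target and the MMX header introduced in the next section. It builds one 64-bit value and copies the same bits out as eight bytes, four words, and two doublewords.

```c
#include <mmintrin.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copies one 64-bit MMX value out as 8, 4, and 2 lanes. */
static void view_lanes(uint8_t bytes[8], uint16_t words[4], uint32_t dwords[2])
{
    /* Arguments run from the highest byte down to the lowest byte. */
    __m64 v = _mm_set_pi8(8, 7, 6, 5, 4, 3, 2, 1);

    /* The same 64 bits, copied out under three different lane layouts. */
    memcpy(bytes, &v, sizeof(v));
    memcpy(words, &v, sizeof(v));
    memcpy(dwords, &v, sizeof(v));

    _mm_empty();
}
```

On x86, bytes[0] is 1, words[0] is 0x0201, and dwords[0] is 0x04030201: nothing in the register changes, only the way the bits are grouped.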

Header file

MMX intrinsics are declared in:

#include <mmintrin.h>

This header defines the MMX data type and the MMX intrinsic functions.

The `__m64` data type

The main MMX intrinsic type is:

__m64

A __m64 value represents a 64-bit MMX register.

Unlike SSE’s __m128, which holds four single-precision floating-point values, MMX is an integer SIMD technology. It works with packed bytes, words, doublewords, and quadwords.

Conceptually:

__m64 value = 64 bits

Those 64 bits can be interpreted as:

Interpretation	Lane count
8-bit integers	8 lanes
16-bit integers	4 lanes
32-bit integers	2 lanes
64-bit integer	1 lane

The intrinsic name tells the compiler and the reader how those bits are meant to be interpreted.

Important historical note: MMX and x87 share state

MMX has one unusual and important limitation: MMX registers alias the old x87 floating-point register state.

Because of this, after using MMX instructions, code should call:

_mm_empty();

This emits the EMMS instruction, which marks all x87 registers as empty. That clears the MMX state and allows normal x87 floating-point operations to be used safely again.

This is one of the main reasons MMX is awkward compared with SSE2 integer SIMD. SSE2 uses XMM registers and does not have the same x87/MMX state problem.

In short:

// Use MMX intrinsics here.

_mm_empty(); // Clear MMX state before returning or before x87 floating-point code.

Even if your function does no floating-point arithmetic itself, call _mm_empty() before returning from MMX code: the caller may run x87 code next and cannot know that the MMX state was left in use.
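
A minimal sketch of the pattern, with MMX work first, then _mm_empty(), then floating-point code. The function name mean_of_saturated_sum is made up for illustration. On x86-64 the floating-point part would normally compile to SSE2 rather than x87, so the call matters most for 32-bit x87 builds, but the ordering is the same.

```c
#include <mmintrin.h>
#include <string.h>

/* Hypothetical example: saturating byte add with MMX, then a floating-point mean. */
static double mean_of_saturated_sum(const unsigned char a[8],
                                    const unsigned char b[8])
{
    unsigned char sum[8];
    __m64 va, vb, vs;

    memcpy(&va, a, sizeof(va));
    memcpy(&vb, b, sizeof(vb));
    vs = _mm_adds_pu8(va, vb);      /* eight unsigned saturating adds */
    memcpy(sum, &vs, sizeof(vs));

    /* Leave MMX state before any floating-point work. */
    _mm_empty();

    double total = 0.0;
    for (int i = 0; i < 8; ++i)
        total += sum[i];

    return total / 8.0;
}
```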

Naming conventions

MMX intrinsic names are compact but systematic.

Name fragment	Meaning
`_mm`	intrinsic prefix
`pi8`	packed 8-bit integers
`pi16`	packed 16-bit integers
`pi32`	packed 32-bit integers
`pu8`	packed unsigned 8-bit integers
`pu16`	packed unsigned 16-bit integers
`si64`	64-bit integer value
`adds`	saturating addition
`subs`	saturating subtraction
`packs` with `pi16` or `pi32`	pack with signed saturation
`packs` with `pu16`	pack with unsigned saturation
`unpacklo`	interleave low lanes
`unpackhi`	interleave high lanes

For example:

_mm_add_pi16(a, b)

means “add four packed 16-bit integer lanes.”

_mm_adds_pu8(a, b)

means “add eight packed unsigned 8-bit integer lanes with saturation.”

_mm_unpacklo_pi8(a, b)

means “interleave the low bytes of two MMX values.”

General support intrinsics

These are the most basic MMX support intrinsics.

Intrinsic	Instruction	Meaning
`_mm_empty()`	`EMMS`	Clear MMX state
`_mm_cvtsi32_si64(i)`	`MOVD`	Move a 32-bit integer into the low 32 bits of an `__m64` value
`_mm_cvtsi64_si32(m)`	`MOVD`	Extract the low 32 bits of an `__m64` value as an `int`

Example:

#include <mmintrin.h>

int example_convert(int x)
{
    __m64 v = _mm_cvtsi32_si64(x);
    int y = _mm_cvtsi64_si32(v);

    _mm_empty();

    return y;
}

This example moves an integer into an MMX value and then extracts it again.

Creating MMX values

MMX provides intrinsics to create packed values.

Set bytes

__m64 v = _mm_set_pi8(8, 7, 6, 5, 4, 3, 2, 1);

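The ordering can be confusing because the arguments are written from the highest lane down to the lowest lane, so the last argument (1 here) ends up in byte 0. The companion intrinsic _mm_setr_pi8 takes its arguments in the reverse, low-to-high order. In memory-oriented examples, it is often easier to load data from an array instead of using _mm_set_pi8 directly.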
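
To make the argument order concrete, here is a small sketch that builds the same value twice, once with _mm_set_pi8 and once with _mm_setr_pi8, and copies both to memory. The helper name set_order_demo is an assumption for this example only.

```c
#include <mmintrin.h>
#include <string.h>

/* Hypothetical helper: writes the bytes of both constructions to memory. */
static void set_order_demo(unsigned char from_set[8],
                           unsigned char from_setr[8])
{
    /* Highest byte first: the last argument becomes byte 0. */
    __m64 hi_to_lo = _mm_set_pi8(8, 7, 6, 5, 4, 3, 2, 1);

    /* Lowest byte first: the first argument becomes byte 0. */
    __m64 lo_to_hi = _mm_setr_pi8(1, 2, 3, 4, 5, 6, 7, 8);

    memcpy(from_set, &hi_to_lo, sizeof(hi_to_lo));
    memcpy(from_setr, &lo_to_hi, sizeof(lo_to_hi));

    _mm_empty();
}
```

Both arrays end up holding 1, 2, 3, 4, 5, 6, 7, 8 in memory order.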

Broadcast one byte

__m64 v = _mm_set1_pi8(10);

Conceptually:

[10 | 10 | 10 | 10 | 10 | 10 | 10 | 10]

Broadcast one 16-bit value

__m64 v = _mm_set1_pi16(100);

Conceptually:

[100 | 100 | 100 | 100]

Create a zero value

__m64 zero = _mm_setzero_si64();

This creates a 64-bit zero value.

Loading and storing MMX values

Old MMX examples often load and store values by casting pointers:

__m64 v = *(__m64 const*)ptr;
*(__m64*)out = v;

This was common in legacy code, but it can raise alignment and strict-aliasing concerns in modern C and C++.

For simple examples, a safer approach is to use memcpy helper functions:

#include <mmintrin.h>
#include <string.h>

static __m64 load_m64(const void* ptr)
{
    __m64 value;
    memcpy(&value, ptr, sizeof(value));
    return value;
}

static void store_m64(void* ptr, __m64 value)
{
    memcpy(ptr, &value, sizeof(value));
}

Optimizing compilers typically turn these small fixed-size copies into a single 64-bit move.

Arithmetic intrinsics

MMX provides packed integer addition, subtraction, multiplication, and related operations.

Wrapping addition and subtraction

Wrapping arithmetic means the result wraps around on overflow.

For unsigned 8-bit arithmetic:

250 + 20 = 14    // wraps modulo 256

Common wrapping arithmetic intrinsics include:

Intrinsic	Operation
`_mm_add_pi8(a, b)`	add eight 8-bit integer lanes
`_mm_add_pi16(a, b)`	add four 16-bit integer lanes
`_mm_add_pi32(a, b)`	add two 32-bit integer lanes
`_mm_sub_pi8(a, b)`	subtract eight 8-bit integer lanes
`_mm_sub_pi16(a, b)`	subtract four 16-bit integer lanes
`_mm_sub_pi32(a, b)`	subtract two 32-bit integer lanes

Example:

__m64 a = _mm_set1_pi16(1000);
__m64 b = _mm_set1_pi16(25);

__m64 c = _mm_add_pi16(a, b);

// c contains four 16-bit lanes, each equal to 1025.

Saturating addition and subtraction

Saturating arithmetic clamps the result instead of wrapping.

For unsigned 8-bit arithmetic:

250 + 20 = 255   // saturates

instead of:

250 + 20 = 14    // wraps

Saturating arithmetic is especially useful for image and audio processing, where values often need to stay within a fixed range.

Intrinsic	Operation
`_mm_adds_pi8(a, b)`	signed saturating add, 8-bit lanes
`_mm_adds_pi16(a, b)`	signed saturating add, 16-bit lanes
`_mm_adds_pu8(a, b)`	unsigned saturating add, 8-bit lanes
`_mm_adds_pu16(a, b)`	unsigned saturating add, 16-bit lanes
`_mm_subs_pi8(a, b)`	signed saturating subtract, 8-bit lanes
`_mm_subs_pi16(a, b)`	signed saturating subtract, 16-bit lanes
`_mm_subs_pu8(a, b)`	unsigned saturating subtract, 8-bit lanes
`_mm_subs_pu16(a, b)`	unsigned saturating subtract, 16-bit lanes
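
The difference between the three byte-addition families can be checked directly. The following is a minimal sketch; add_three_ways is a hypothetical helper name. It adds the same two byte values with wrapping, unsigned saturating, and signed saturating addition and returns lane 0 of each result.

```c
#include <mmintrin.h>
#include <string.h>

/* Hypothetical helper: adds x and y in every byte lane, three different ways. */
static void add_three_ways(unsigned char x, unsigned char y,
                           unsigned char *wrapped,
                           unsigned char *unsigned_sat,
                           signed char *signed_sat)
{
    __m64 a = _mm_set1_pi8((char)x);
    __m64 b = _mm_set1_pi8((char)y);
    unsigned char out[8];
    __m64 r;

    r = _mm_add_pi8(a, b);          /* wraps modulo 256 */
    memcpy(out, &r, sizeof(r));
    *wrapped = out[0];

    r = _mm_adds_pu8(a, b);         /* clamps to 0..255 */
    memcpy(out, &r, sizeof(r));
    *unsigned_sat = out[0];

    r = _mm_adds_pi8(a, b);         /* clamps to -128..127 */
    memcpy(out, &r, sizeof(r));
    *signed_sat = (signed char)out[0];

    _mm_empty();
}
```

With 250 and 20 the wrapping result is 14 and the unsigned saturating result is 255. With 100 and 100 the signed saturating result is 127, because 200 does not fit in a signed byte.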

Example: brighten 8-bit pixels with saturation

This example adds a brightness value to an array of unsigned 8-bit pixels. Values are clamped to 255 instead of wrapping around.

#include <mmintrin.h>
#include <string.h>

static __m64 load_m64(const void* ptr)
{
    __m64 value;
    memcpy(&value, ptr, sizeof(value));
    return value;
}

static void store_m64(void* ptr, __m64 value)
{
    memcpy(ptr, &value, sizeof(value));
}

void brighten_u8_mmx(const unsigned char* input,
                     unsigned char* output,
                     int count,
                     unsigned char amount)
{
    __m64 vamount = _mm_set1_pi8((char)amount);

    int i = 0;

    for (; i + 7 < count; i += 8)
    {
        __m64 pixels = load_m64(&input[i]);

        // Unsigned saturating add:
        // values above 255 are clamped to 255.
        __m64 result = _mm_adds_pu8(pixels, vamount);

        store_m64(&output[i], result);
    }

    _mm_empty();

    for (; i < count; ++i)
    {
        int value = input[i] + amount;

        if (value > 255)
            value = 255;

        output[i] = (unsigned char)value;
    }
}

The MMX loop processes eight pixels per iteration.

This is a classic MMX-style use case: many small integer values, simple arithmetic, and saturation.

Multiplication intrinsics

MMX supports multiplication of 16-bit integer lanes.

Intrinsic	Operation
`_mm_mullo_pi16(a, b)`	multiply four 16-bit lanes and keep the low 16 bits of each result
`_mm_mulhi_pi16(a, b)`	multiply four signed 16-bit lanes and keep the high 16 bits of each result
`_mm_madd_pi16(a, b)`	multiply four pairs of signed 16-bit lanes and add adjacent products into two 32-bit results

Example:

__m64 a = _mm_set1_pi16(100);
__m64 b = _mm_set1_pi16(3);

__m64 c = _mm_mullo_pi16(a, b);

// c contains four 16-bit lanes, each equal to 300.

The _mm_madd_pi16 intrinsic is especially useful in signal processing, dot products, filters, and transform code.

Conceptually:

a = [a0 | a1 | a2 | a3]
b = [b0 | b1 | b2 | b3]

_mm_madd_pi16(a, b) produces:

[ a0*b0 + a1*b1 | a2*b2 + a3*b3 ]

The result contains two 32-bit integer lanes.
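
As an illustration, a four-element dot product can be built from _mm_madd_pi16 plus one horizontal add. This is a sketch under stated assumptions: dot4_s16 is a made-up name, and the final sum is assumed to fit in 32 bits.

```c
#include <mmintrin.h>
#include <string.h>

/* Hypothetical helper: dot product of two 4-element int16 vectors. */
static int dot4_s16(const short a[4], const short b[4])
{
    __m64 va, vb;

    memcpy(&va, a, sizeof(va));
    memcpy(&vb, b, sizeof(vb));

    /* Two 32-bit lanes: [a0*b0 + a1*b1 | a2*b2 + a3*b3]. */
    __m64 pairs = _mm_madd_pi16(va, vb);

    /* Horizontal sum: move the high lane down and add it to the low lane. */
    __m64 total = _mm_add_pi32(pairs, _mm_srli_si64(pairs, 32));
    int result = _mm_cvtsi64_si32(total);

    _mm_empty();

    return result;
}
```

Longer dot products would keep accumulating the two-lane result with _mm_add_pi32 inside the loop and do the horizontal sum once at the end.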

Packing intrinsics

Packing converts wider integer lanes into narrower integer lanes.

This is useful when intermediate calculations are done at higher precision and then converted back to smaller output values.

For example, an image-processing filter may compute temporary 16-bit values and then pack them back into 8-bit pixels.

Intrinsic	Instruction	Meaning
`_mm_packs_pi16(a, b)`	`PACKSSWB`	Pack signed 16-bit values into signed saturated 8-bit values
`_mm_packs_pi32(a, b)`	`PACKSSDW`	Pack signed 32-bit values into signed saturated 16-bit values
`_mm_packs_pu16(a, b)`	`PACKUSWB`	Pack signed 16-bit values into unsigned saturated 8-bit values

Signed saturation example

When packing signed 16-bit values to signed 8-bit values:

-200 -> -128
 -20 ->  -20
 100 ->  100
 200 ->  127

The signed 8-bit range is:

-128 to 127

Values outside that range are clamped.
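
The mapping above can be reproduced with _mm_packs_pi16. This is a minimal sketch; pack_s16_to_s8 is a hypothetical helper that packs exactly eight input values.

```c
#include <mmintrin.h>
#include <string.h>

/* Hypothetical helper: packs eight int16 values to int8 with signed saturation. */
static void pack_s16_to_s8(const short in[8], signed char out[8])
{
    __m64 lo, hi;

    memcpy(&lo, &in[0], sizeof(lo));    /* first four 16-bit values */
    memcpy(&hi, &in[4], sizeof(hi));    /* last four 16-bit values  */

    /* The low four output bytes come from lo, the high four from hi. */
    __m64 packed = _mm_packs_pi16(lo, hi);

    memcpy(out, &packed, sizeof(packed));

    _mm_empty();
}
```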

Unsigned saturation example

When packing signed 16-bit values to unsigned 8-bit values:

-20 ->   0
  0 ->   0
100 -> 100
300 -> 255

The unsigned 8-bit range is:

0 to 255

Values below 0 become 0. Values above 255 become 255.

Example: pack 16-bit values to unsigned 8-bit pixels

This example converts signed 16-bit intermediate values to unsigned 8-bit output values using saturation.

#include <mmintrin.h>
#include <string.h>

static __m64 load_m64(const void* ptr)
{
    __m64 value;
    memcpy(&value, ptr, sizeof(value));
    return value;
}

static void store_m64(void* ptr, __m64 value)
{
    memcpy(ptr, &value, sizeof(value));
}

void pack_s16_to_u8_mmx(const short* input,
                        unsigned char* output,
                        int count)
{
    int i = 0;

    // Process 8 input shorts and produce 8 output bytes.
    for (; i + 7 < count; i += 8)
    {
        __m64 lowWords = load_m64(&input[i]);      // 4 x int16
        __m64 highWords = load_m64(&input[i + 4]); // 4 x int16

        __m64 packed = _mm_packs_pu16(lowWords, highWords);

        store_m64(&output[i], packed);
    }

    _mm_empty();

    for (; i < count; ++i)
    {
        int value = input[i];

        if (value < 0)
            value = 0;

        if (value > 255)
            value = 255;

        output[i] = (unsigned char)value;
    }
}

This is common in image-processing code where calculations are performed using 16-bit or 32-bit intermediates, but the final image is 8-bit.

Unpacking intrinsics

Unpacking interleaves lanes from two MMX values.

This is often used to widen smaller integer values before doing arithmetic.

For example, unsigned 8-bit pixels may be unpacked into 16-bit words before multiplication or addition, avoiding overflow.

Intrinsic	Instruction	Meaning
`_mm_unpacklo_pi8(a, b)`	`PUNPCKLBW`	Interleave low bytes
`_mm_unpackhi_pi8(a, b)`	`PUNPCKHBW`	Interleave high bytes
`_mm_unpacklo_pi16(a, b)`	`PUNPCKLWD`	Interleave low 16-bit words
`_mm_unpackhi_pi16(a, b)`	`PUNPCKHWD`	Interleave high 16-bit words
`_mm_unpacklo_pi32(a, b)`	`PUNPCKLDQ`	Interleave low 32-bit doublewords
`_mm_unpackhi_pi32(a, b)`	`PUNPCKHDQ`	Interleave high 32-bit doublewords
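
To see what "interleave" means in practice, the sketch below unpacks two different byte vectors. The helper name interleave_bytes is an assumption for this example. Lanes from the first argument land in the even positions of the result and lanes from the second argument in the odd positions.

```c
#include <mmintrin.h>
#include <string.h>

/* Hypothetical helper: interleaves the low and high halves of a and b. */
static void interleave_bytes(const unsigned char a[8],
                             const unsigned char b[8],
                             unsigned char low[8],
                             unsigned char high[8])
{
    __m64 va, vb;

    memcpy(&va, a, sizeof(va));
    memcpy(&vb, b, sizeof(vb));

    __m64 lo = _mm_unpacklo_pi8(va, vb);    /* a0 b0 a1 b1 a2 b2 a3 b3 */
    __m64 hi = _mm_unpackhi_pi8(va, vb);    /* a4 b4 a5 b5 a6 b6 a7 b7 */

    memcpy(low, &lo, sizeof(lo));
    memcpy(high, &hi, sizeof(hi));

    _mm_empty();
}
```

When the second argument is zero, every odd byte of the result is zero, which is exactly a zero-extension of each byte to a 16-bit word on a little-endian machine.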

Example: widen unsigned bytes to unsigned words

This example converts unsigned 8-bit values into unsigned 16-bit values.

The trick is to unpack the bytes with zero.

#include <mmintrin.h>
#include <string.h>

static __m64 load_m64(const void* ptr)
{
    __m64 value;
    memcpy(&value, ptr, sizeof(value));
    return value;
}

static void store_m64(void* ptr, __m64 value)
{
    memcpy(ptr, &value, sizeof(value));
}

void widen_u8_to_u16_mmx(const unsigned char* input,
                         unsigned short* output,
                         int count)
{
    __m64 zero = _mm_setzero_si64();

    int i = 0;

    for (; i + 7 < count; i += 8)
    {
        __m64 bytes = load_m64(&input[i]);

        __m64 lowWords = _mm_unpacklo_pi8(bytes, zero);
        __m64 highWords = _mm_unpackhi_pi8(bytes, zero);

        store_m64(&output[i], lowWords);
        store_m64(&output[i + 4], highWords);
    }

    _mm_empty();

    for (; i < count; ++i)
    {
        output[i] = input[i];
    }
}

Conceptually, this converts:

[ b0 | b1 | b2 | b3 | b4 | b5 | b6 | b7 ]

into:

[ b0 | b1 | b2 | b3 ] as 16-bit values
[ b4 | b5 | b6 | b7 ] as 16-bit values

This is a common first step before doing arithmetic on pixel data.

Logical operations

MMX also supports bitwise logical operations.

Intrinsic	Operation
`_mm_and_si64(a, b)`	bitwise AND
`_mm_or_si64(a, b)`	bitwise OR
`_mm_xor_si64(a, b)`	bitwise XOR
`_mm_andnot_si64(a, b)`	bitwise AND NOT: computes `(~a) & b`

These are useful for masking, clearing bits, combining comparison results, and manipulating packed data.

Example:

__m64 a = _mm_set1_pi8((char)0xF0);
__m64 b = _mm_set1_pi8((char)0x0F);

__m64 c = _mm_or_si64(a, b);

// Each byte in c is 0xFF.

Comparison intrinsics

MMX comparisons produce masks. Each lane in the result is either all bits set, meaning true, or all bits clear, meaning false.

Intrinsic	Operation
`_mm_cmpeq_pi8(a, b)`	compare eight 8-bit lanes for equality
`_mm_cmpeq_pi16(a, b)`	compare four 16-bit lanes for equality
`_mm_cmpeq_pi32(a, b)`	compare two 32-bit lanes for equality
`_mm_cmpgt_pi8(a, b)`	compare eight signed 8-bit lanes for greater-than
`_mm_cmpgt_pi16(a, b)`	compare four signed 16-bit lanes for greater-than
`_mm_cmpgt_pi32(a, b)`	compare two signed 32-bit lanes for greater-than

Example:

__m64 a = _mm_set1_pi16(100);
__m64 b = _mm_set1_pi16(50);

__m64 mask = _mm_cmpgt_pi16(a, b);

Each 16-bit lane in mask is true because 100 is greater than 50.

Internally, true lanes are represented with all bits set:

0xFFFF

False lanes are represented as:

0x0000

These masks can be combined with logical operations.
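
One common combination is a branch-free select. The sketch below computes a per-lane signed maximum from a comparison mask and three logical operations. The function name max4_s16 is hypothetical.

```c
#include <mmintrin.h>
#include <string.h>

/* Hypothetical helper: per-lane maximum of four signed 16-bit values. */
static void max4_s16(const short a[4], const short b[4], short out[4])
{
    __m64 va, vb;

    memcpy(&va, a, sizeof(va));
    memcpy(&vb, b, sizeof(vb));

    __m64 mask   = _mm_cmpgt_pi16(va, vb);      /* 0xFFFF where a > b      */
    __m64 from_a = _mm_and_si64(mask, va);      /* keep a where mask is set */
    __m64 from_b = _mm_andnot_si64(mask, vb);   /* keep b elsewhere: (~mask) & b */
    __m64 result = _mm_or_si64(from_a, from_b);

    memcpy(out, &result, sizeof(result));

    _mm_empty();
}
```

Because each lane of the mask is either all ones or all zeros, the AND keeps or discards a whole lane at a time, and the final OR merges the two halves of the selection.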

Shift intrinsics

MMX provides shift operations for packed 16-bit and 32-bit integer lanes, and for whole 64-bit values.

Intrinsic	Operation
`_mm_slli_pi16(a, count)`	shift each 16-bit lane left
`_mm_slli_pi32(a, count)`	shift each 32-bit lane left
`_mm_slli_si64(a, count)`	shift the whole 64-bit value left
`_mm_srli_pi16(a, count)`	logical shift right of each 16-bit lane
`_mm_srli_pi32(a, count)`	logical shift right of each 32-bit lane
`_mm_srli_si64(a, count)`	logical shift right of the whole 64-bit value
`_mm_srai_pi16(a, count)`	arithmetic shift right of each signed 16-bit lane
`_mm_srai_pi32(a, count)`	arithmetic shift right of each signed 32-bit lane

Logical right shift fills with zeros.

Arithmetic right shift preserves the sign bit, so it is used for signed values.
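
The two kinds of right shift give the same answer for non-negative values and very different answers for negative ones. A small sketch follows; shift_right_by_2 is a made-up helper, and the shift count is fixed at 2 to keep it an immediate.

```c
#include <mmintrin.h>
#include <string.h>

/* Hypothetical helper: shifts x right by 2 both ways and returns lane 0. */
static void shift_right_by_2(short x, short *arithmetic, unsigned short *logical)
{
    __m64 v = _mm_set1_pi16(x);

    __m64 sra = _mm_srai_pi16(v, 2);    /* copies of the sign bit shift in */
    __m64 srl = _mm_srli_pi16(v, 2);    /* zeros shift in                  */

    short s[4];
    unsigned short u[4];

    memcpy(s, &sra, sizeof(sra));
    memcpy(u, &srl, sizeof(srl));

    *arithmetic = s[0];
    *logical = u[0];

    _mm_empty();
}
```

For x = -16 (bit pattern 0xFFF0), the arithmetic shift gives -4, while the logical shift gives 0x3FFC, which is 16380.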

Example: average two unsigned byte arrays

MMX itself lacks some convenience instructions that later SIMD extensions added (SSE, for example, introduced a packed byte average, _mm_avg_pu8), but basic operations can still be combined to build useful routines.

This example computes a rounded average of two unsigned byte arrays:

output[i] = (a[i] + b[i] + 1) / 2

To avoid 8-bit overflow, the bytes are widened to 16-bit words first.

#include <mmintrin.h>
#include <string.h>

static __m64 load_m64(const void* ptr)
{
    __m64 value;
    memcpy(&value, ptr, sizeof(value));
    return value;
}

static void store_m64(void* ptr, __m64 value)
{
    memcpy(ptr, &value, sizeof(value));
}

void average_u8_mmx(const unsigned char* a,
                    const unsigned char* b,
                    unsigned char* output,
                    int count)
{
    __m64 zero = _mm_setzero_si64();
    __m64 one = _mm_set1_pi16(1);

    int i = 0;

    for (; i + 7 < count; i += 8)
    {
        __m64 va = load_m64(&a[i]);
        __m64 vb = load_m64(&b[i]);

        __m64 aLow = _mm_unpacklo_pi8(va, zero);
        __m64 aHigh = _mm_unpackhi_pi8(va, zero);

        __m64 bLow = _mm_unpacklo_pi8(vb, zero);
        __m64 bHigh = _mm_unpackhi_pi8(vb, zero);

        __m64 sumLow = _mm_add_pi16(aLow, bLow);
        __m64 sumHigh = _mm_add_pi16(aHigh, bHigh);

        // Rounding: add 1 before shifting right by 1, matching the scalar tail.
        sumLow = _mm_add_pi16(sumLow, one);
        sumHigh = _mm_add_pi16(sumHigh, one);

        __m64 avgLow = _mm_srli_pi16(sumLow, 1);
        __m64 avgHigh = _mm_srli_pi16(sumHigh, 1);

        __m64 packed = _mm_packs_pu16(avgLow, avgHigh);

        store_m64(&output[i], packed);
    }

    _mm_empty();

    for (; i < count; ++i)
    {
        output[i] = (unsigned char)(((int)a[i] + (int)b[i] + 1) >> 1);
    }
}

This example shows several important MMX techniques together:

load eight bytes,
unpack bytes to words,
do arithmetic in 16-bit lanes,
shift to divide by two,
pack back to unsigned bytes.

MMX vs SSE2

For new code, SSE2 integer intrinsics are usually preferable to MMX.

MMX uses:

__m64

which is 64 bits wide.

SSE2 uses:

__m128i

which is 128 bits wide.

That means SSE2 can process twice as much integer data per vector register:

Data type	MMX `__m64`	SSE2 `__m128i`
8-bit integers	8 lanes	16 lanes
16-bit integers	4 lanes	8 lanes
32-bit integers	2 lanes	4 lanes
64-bit integers	1 lane	2 lanes

SSE2 also avoids the MMX/x87 state problem. There is no need to call _mm_empty() after using SSE2 integer intrinsics.

For example, this MMX idea:

__m64 result = _mm_adds_pu8(a, b);

has a wider SSE2 equivalent:

__m128i result = _mm_adds_epu8(a, b);

The SSE2 version processes sixteen unsigned 8-bit values instead of eight.
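
For comparison, here is a sketch of how the brighten routine from earlier might look with SSE2. The name brighten_u8_sse2 is hypothetical, and unaligned loads and stores are assumed so that the function places no alignment requirement on its arguments.

```c
#include <emmintrin.h>

/* Hypothetical SSE2 counterpart of the MMX brighten routine. */
static void brighten_u8_sse2(const unsigned char *input,
                             unsigned char *output,
                             int count,
                             unsigned char amount)
{
    __m128i vamount = _mm_set1_epi8((char)amount);
    int i = 0;

    /* Sixteen pixels per iteration instead of eight. */
    for (; i + 15 < count; i += 16)
    {
        __m128i pixels = _mm_loadu_si128((const __m128i *)&input[i]);
        __m128i result = _mm_adds_epu8(pixels, vamount);

        _mm_storeu_si128((__m128i *)&output[i], result);
    }

    /* Scalar tail. No _mm_empty() is needed: no MMX state was touched. */
    for (; i < count; ++i)
    {
        int value = input[i] + amount;

        output[i] = (unsigned char)(value > 255 ? 255 : value);
    }
}
```

The structure is the same as the MMX version; only the vector type, the load and store intrinsics, the loop stride, and the missing _mm_empty() call differ.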

Common pitfalls

MMX code has several traps that are easy to miss.

Forgetting `_mm_empty()`

This is the classic MMX bug.

After using MMX instructions, call:

_mm_empty();

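This clears the MMX state. Without it, later x87 floating-point code may behave incorrectly, and some compilers warn about the omission (Microsoft's compiler reports warning C4799 for a function that uses MMX without EMMS).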

Treating MMX as floating-point SIMD

MMX is packed integer SIMD.

It does not provide the floating-point SIMD programming model that SSE later introduced. If you want packed single-precision floating-point arithmetic, use SSE __m128. If you want packed double-precision floating-point arithmetic, use SSE2 __m128d.

Confusing wrapping and saturating arithmetic

Wrapping arithmetic and saturating arithmetic are different.

Unsigned 8-bit wrapping:

250 + 20 = 14

Unsigned 8-bit saturation:

250 + 20 = 255

Use wrapping operations when you want modulo arithmetic. Use saturating operations when values must stay inside a valid range.

Forgetting that `__m64` is only 64 bits

MMX processes only:

8 bytes
4 words
2 doublewords
1 quadword

For modern SIMD code, this is narrow. SSE2 doubles the width to 128 bits, AVX2 doubles it again to 256 bits, and AVX-512 doubles it again to 512 bits.

Using pointer casts carelessly

Legacy MMX code often casts pointers directly to __m64*.

That may work in old code, but it can be questionable in modern C and C++ because of alignment and strict-aliasing rules.

For simple, safe examples, memcpy is a good way to express a 64-bit load or store without relying on aliasing behavior.

Assuming MMX is a good choice for new code

MMX is useful to understand, but it is rarely the right choice for new performance-sensitive x86 code.

Use MMX when:

maintaining old code,
studying legacy SIMD examples,
understanding older multimedia routines,
working with an old compiler or target where MMX is required.

Use SSE2, AVX2, or later SIMD instruction sets for new code whenever possible.

Build notes

On GCC or Clang, MMX can be enabled with:

gcc -O2 -mmmx example.c -o example

For C++:

g++ -O2 -mmmx example.cpp -o example

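Every x86-64 processor supports MMX, and GCC and Clang enable it by default for x86-64 targets, so -mmmx mainly matters when building for older 32-bit targets. Compiler support still varies by target mode: Microsoft's compiler, for example, does not support the __m64 type or MMX intrinsics when targeting x64.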

For new x86-64 code, prefer SSE2 integer intrinsics instead of MMX.

Complete example program

The following small program demonstrates unsigned saturating addition on eight 8-bit values.

#include <stdio.h>
#include <mmintrin.h>
#include <string.h>

static __m64 load_m64(const void* ptr)
{
    __m64 value;
    memcpy(&value, ptr, sizeof(value));
    return value;
}

static void store_m64(void* ptr, __m64 value)
{
    memcpy(ptr, &value, sizeof(value));
}

int main(void)
{
    unsigned char a[8] = { 10, 20, 30, 40, 240, 250, 100, 200 };
    unsigned char b[8] = {  1,  2,  3,  4,  20,  20, 200, 100 };
    unsigned char out[8];

    __m64 va = load_m64(a);
    __m64 vb = load_m64(b);

    __m64 result = _mm_adds_pu8(va, vb);

    store_m64(out, result);

    _mm_empty();

    for (int i = 0; i < 8; ++i)
    {
        printf("%u ", out[i]);
    }

    printf("\n");

    return 0;
}

Expected output:

11 22 33 44 255 255 255 255

The last four results demonstrate unsigned saturation. Instead of wrapping around, values above 255 are clamped to 255.

Summary

MMX introduced packed integer SIMD programming to x86.

The key ideas are:

__m64 represents a 64-bit MMX value.
The same 64 bits can be interpreted as 8-bit, 16-bit, 32-bit, or 64-bit integer lanes.
MMX is integer-only SIMD, not floating-point SIMD.
Saturating arithmetic is one of MMX’s most useful features.
Packing and unpacking are essential for image, audio, and codec-style code.
MMX shares state with the old x87 floating-point unit, so _mm_empty() is required after MMX code.
For new code, SSE2 __m128i or later SIMD instruction sets are usually better choices.

MMX is old, but it remains historically important. Understanding MMX makes it easier to read legacy optimized code and to understand how later SIMD extensions evolved.

References

Intel Intrinsics Guide
https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
Intel 64 and IA-32 Architectures Software Developer Manuals
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html
Microsoft x86 intrinsics list
https://learn.microsoft.com/en-us/cpp/intrinsics/x86-intrinsics-list
Microsoft compiler warning C4799: missing EMMS instruction
https://learn.microsoft.com/en-us/cpp/error-messages/compiler-warnings/compiler-warning-level-1-c4799
