MMX was Intel’s first widely used SIMD extension for x86 processors. It introduced packed integer operations, allowing one instruction to process multiple small integer values at the same time.
MMX is mostly historical today, but it is still useful to understand older multimedia, image-processing, audio, codec, and game code. Many of the concepts introduced by MMX — packed lanes, saturating arithmetic, packing, unpacking, and vector-style operations — later became central to SSE2, AVX2, and AVX-512 integer SIMD programming.
For new code, SSE2 __m128i, AVX2 __m256i, or later SIMD instruction sets are usually better choices. MMX is still worth knowing because a lot of legacy code and old optimization articles use it.
What MMX does
MMX operates on 64-bit vector values. A single MMX value can be interpreted in several ways:
8 x 8-bit integers
4 x 16-bit integers
2 x 32-bit integers
1 x 64-bit integer
The same 64 bits can represent different lane layouts depending on the instruction being used.
For example, the same MMX register could be treated as eight unsigned bytes:
[ b0 | b1 | b2 | b3 | b4 | b5 | b6 | b7 ]
or as four 16-bit words:
[ w0 | w1 | w2 | w3 ]
or as two 32-bit doublewords:
[ d0 | d1 ]
This is the basic SIMD model: one instruction operates on multiple lanes packed into a single register.
Header file
MMX intrinsics are declared in:
#include <mmintrin.h>
This header defines the MMX data type and the MMX intrinsic functions.
The __m64 data type
The main MMX intrinsic type is:
__m64
A __m64 value represents a 64-bit MMX register.
Unlike SSE’s __m128, which is commonly used for four floating-point values, MMX is an integer SIMD technology. It works with packed bytes, words, doublewords, and quadwords.
Conceptually:
__m64 value = 64 bits
Those 64 bits can be interpreted as:
| Interpretation | Lane count |
|---|---|
| 8-bit integers | 8 lanes |
| 16-bit integers | 4 lanes |
| 32-bit integers | 2 lanes |
| 64-bit integer | 1 lane |
The intrinsic name tells the compiler and the reader how those bits are meant to be interpreted.
Important historical note: MMX and x87 share state
MMX has one unusual and important limitation: MMX registers alias the old x87 floating-point register state.
Because of this, after using MMX instructions, code should call:
_mm_empty();
This emits the EMMS instruction, which clears the MMX state and allows normal x87 floating-point operations to be used safely again.
This is one of the main reasons MMX is awkward compared with SSE2 integer SIMD. SSE2 uses XMM registers and does not have the same x87/MMX state problem.
In short:
// Use MMX intrinsics here.
_mm_empty(); // Clear MMX state before returning or before x87 floating-point code.
Even if your function itself does not use floating-point arithmetic, calling _mm_empty() before returning from MMX code is a good habit in legacy MMX code.
Naming conventions
MMX intrinsic names are compact but systematic.
| Name fragment | Meaning |
|---|---|
_mm | intrinsic prefix |
pi8 | packed 8-bit integers |
pi16 | packed 16-bit integers |
pi32 | packed 32-bit integers |
si64 | 64-bit integer value |
adds | saturating addition |
subs | saturating subtraction |
packs | pack with signed saturation |
packus | pack with unsigned saturation |
unpacklo | interleave low lanes |
unpackhi | interleave high lanes |
For example:
_mm_add_pi16(a, b)
means “add four packed 16-bit integer lanes.”
_mm_adds_pu8(a, b)
means “add eight packed unsigned 8-bit integer lanes with saturation.”
_mm_unpacklo_pi8(a, b)
means “interleave the low bytes of two MMX values.”
General support intrinsics
These are the most basic MMX support intrinsics.
| Intrinsic | Instruction | Meaning |
|---|---|---|
_mm_empty() | EMMS | Clear MMX state |
_mm_cvtsi32_si64(i) | MOVD | Move a 32-bit integer into the low 32 bits of an __m64 value |
_mm_cvtsi64_si32(m) | MOVD | Extract the low 32 bits of an __m64 value as an int |
Example:
#include <mmintrin.h>
int example_convert(int x)
{
__m64 v = _mm_cvtsi32_si64(x);
int y = _mm_cvtsi64_si32(v);
_mm_empty();
return y;
}
This example moves an integer into an MMX value and then extracts it again.
Creating MMX values
MMX provides intrinsics to create packed values.
Set bytes
__m64 v = _mm_set_pi8(8, 7, 6, 5, 4, 3, 2, 1);
The ordering can be confusing because the arguments are written from high lane to low lane. In memory-oriented examples, it is often easier to load data from an array instead of using _mm_set_pi8 directly.
Broadcast one byte
__m64 v = _mm_set1_pi8(10);
Conceptually:
[10 | 10 | 10 | 10 | 10 | 10 | 10 | 10]
Broadcast one 16-bit value
__m64 v = _mm_set1_pi16(100);
Conceptually:
[100 | 100 | 100 | 100]
Create a zero value
__m64 zero = _mm_setzero_si64();
This creates a 64-bit zero value.
Loading and storing MMX values
Old MMX examples often load and store values by casting pointers:
__m64 v = *(__m64 const*)ptr;
*(__m64*)out = v;
This was common in legacy code, but it can raise alignment and strict-aliasing concerns in modern C and C++.
For simple examples, a safer approach is to use memcpy helper functions:
#include <mmintrin.h>
#include <string.h>
static __m64 load_m64(const void* ptr)
{
__m64 value;
memcpy(&value, ptr, sizeof(value));
return value;
}
static void store_m64(void* ptr, __m64 value)
{
memcpy(ptr, &value, sizeof(value));
}
Compilers usually optimize these small fixed-size copies efficiently.
Arithmetic intrinsics
MMX provides packed integer addition, subtraction, multiplication, and related operations.
Wrapping addition and subtraction
Wrapping arithmetic means the result wraps around on overflow.
For unsigned 8-bit arithmetic:
250 + 20 = 14 // wraps modulo 256
Common wrapping arithmetic intrinsics include:
| Intrinsic | Operation |
|---|---|
_mm_add_pi8(a, b) | add eight 8-bit integer lanes |
_mm_add_pi16(a, b) | add four 16-bit integer lanes |
_mm_add_pi32(a, b) | add two 32-bit integer lanes |
_mm_sub_pi8(a, b) | subtract eight 8-bit integer lanes |
_mm_sub_pi16(a, b) | subtract four 16-bit integer lanes |
_mm_sub_pi32(a, b) | subtract two 32-bit integer lanes |
Example:
__m64 a = _mm_set1_pi16(1000);
__m64 b = _mm_set1_pi16(25);
__m64 c = _mm_add_pi16(a, b);
// c contains four 16-bit lanes, each equal to 1025.
Saturating addition and subtraction
Saturating arithmetic clamps the result instead of wrapping.
For unsigned 8-bit arithmetic:
250 + 20 = 255 // saturates
instead of:
250 + 20 = 14 // wraps
Saturating arithmetic is especially useful for image and audio processing, where values often need to stay within a fixed range.
| Intrinsic | Operation |
|---|---|
_mm_adds_pi8(a, b) | signed saturating add, 8-bit lanes |
_mm_adds_pi16(a, b) | signed saturating add, 16-bit lanes |
_mm_adds_pu8(a, b) | unsigned saturating add, 8-bit lanes |
_mm_adds_pu16(a, b) | unsigned saturating add, 16-bit lanes |
_mm_subs_pi8(a, b) | signed saturating subtract, 8-bit lanes |
_mm_subs_pi16(a, b) | signed saturating subtract, 16-bit lanes |
_mm_subs_pu8(a, b) | unsigned saturating subtract, 8-bit lanes |
_mm_subs_pu16(a, b) | unsigned saturating subtract, 16-bit lanes |
Example: brighten 8-bit pixels with saturation
This example adds a brightness value to an array of unsigned 8-bit pixels. Values are clamped to 255 instead of wrapping around.
#include <mmintrin.h>
#include <string.h>
static __m64 load_m64(const void* ptr)
{
__m64 value;
memcpy(&value, ptr, sizeof(value));
return value;
}
static void store_m64(void* ptr, __m64 value)
{
memcpy(ptr, &value, sizeof(value));
}
void brighten_u8_mmx(const unsigned char* input,
unsigned char* output,
int count,
unsigned char amount)
{
__m64 vamount = _mm_set1_pi8((char)amount);
int i = 0;
for (; i + 7 < count; i += 8)
{
__m64 pixels = load_m64(&input[i]);
// Unsigned saturating add:
// values above 255 are clamped to 255.
__m64 result = _mm_adds_pu8(pixels, vamount);
store_m64(&output[i], result);
}
_mm_empty();
for (; i < count; ++i)
{
int value = input[i] + amount;
if (value > 255)
value = 255;
output[i] = (unsigned char)value;
}
}
The MMX loop processes eight pixels per iteration.
This is a classic MMX-style use case: many small integer values, simple arithmetic, and saturation.
Multiplication intrinsics
MMX supports multiplication of 16-bit integer lanes.
| Intrinsic | Operation |
|---|---|
_mm_mullo_pi16(a, b) | multiply four 16-bit lanes and keep the low 16 bits of each result |
_mm_mulhi_pi16(a, b) | multiply four signed 16-bit lanes and keep the high 16 bits of each result |
_mm_madd_pi16(a, b) | multiply pairs of 16-bit lanes and add adjacent products into 32-bit results |
Example:
__m64 a = _mm_set1_pi16(100);
__m64 b = _mm_set1_pi16(3);
__m64 c = _mm_mullo_pi16(a, b);
// c contains four 16-bit lanes, each equal to 300.
The _mm_madd_pi16 intrinsic is especially useful in signal processing, dot products, filters, and transform code.
Conceptually:
a = [a0 | a1 | a2 | a3]
b = [b0 | b1 | b2 | b3]
_mm_madd_pi16(a, b) produces:
[ a0*b0 + a1*b1 | a2*b2 + a3*b3 ]
The result contains two 32-bit integer lanes.
Packing intrinsics
Packing converts wider integer lanes into narrower integer lanes.
This is useful when intermediate calculations are done at higher precision and then converted back to smaller output values.
For example, an image-processing filter may compute temporary 16-bit values and then pack them back into 8-bit pixels.
| Intrinsic | Instruction | Meaning |
|---|---|---|
_mm_packs_pi16(a, b) | PACKSSWB | Pack signed 16-bit values into signed saturated 8-bit values |
_mm_packs_pi32(a, b) | PACKSSDW | Pack signed 32-bit values into signed saturated 16-bit values |
_mm_packs_pu16(a, b) | PACKUSWB | Pack 16-bit values into unsigned saturated 8-bit values |
Signed saturation example
When packing signed 16-bit values to signed 8-bit values:
-200 -> -128
-20 -> -20
100 -> 100
200 -> 127
The signed 8-bit range is:
-128 to 127
Values outside that range are clamped.
Unsigned saturation example
When packing 16-bit values to unsigned 8-bit values:
-20 -> 0
0 -> 0
100 -> 100
300 -> 255
The unsigned 8-bit range is:
0 to 255
Values below 0 become 0. Values above 255 become 255.
Example: pack 16-bit values to unsigned 8-bit pixels
This example converts signed 16-bit intermediate values to unsigned 8-bit output values using saturation.
#include <mmintrin.h>
#include <string.h>
static __m64 load_m64(const void* ptr)
{
__m64 value;
memcpy(&value, ptr, sizeof(value));
return value;
}
static void store_m64(void* ptr, __m64 value)
{
memcpy(ptr, &value, sizeof(value));
}
void pack_s16_to_u8_mmx(const short* input,
unsigned char* output,
int count)
{
int i = 0;
// Process 8 input shorts and produce 8 output bytes.
for (; i + 7 < count; i += 8)
{
__m64 lowWords = load_m64(&input[i]); // 4 x int16
__m64 highWords = load_m64(&input[i + 4]); // 4 x int16
__m64 packed = _mm_packs_pu16(lowWords, highWords);
store_m64(&output[i], packed);
}
_mm_empty();
for (; i < count; ++i)
{
int value = input[i];
if (value < 0)
value = 0;
if (value > 255)
value = 255;
output[i] = (unsigned char)value;
}
}
This is common in image-processing code where calculations are performed using 16-bit or 32-bit intermediates, but the final image is 8-bit.
Unpacking intrinsics
Unpacking interleaves lanes from two MMX values.
This is often used to widen smaller integer values before doing arithmetic.
For example, unsigned 8-bit pixels may be unpacked into 16-bit words before multiplication or addition, avoiding overflow.
| Intrinsic | Instruction | Meaning |
|---|---|---|
_mm_unpacklo_pi8(a, b) | PUNPCKLBW | Interleave low bytes |
_mm_unpackhi_pi8(a, b) | PUNPCKHBW | Interleave high bytes |
_mm_unpacklo_pi16(a, b) | PUNPCKLWD | Interleave low 16-bit words |
_mm_unpackhi_pi16(a, b) | PUNPCKHWD | Interleave high 16-bit words |
_mm_unpacklo_pi32(a, b) | PUNPCKLDQ | Interleave low 32-bit doublewords |
_mm_unpackhi_pi32(a, b) | PUNPCKHDQ | Interleave high 32-bit doublewords |
Example: widen unsigned bytes to unsigned words
This example converts unsigned 8-bit values into unsigned 16-bit values.
The trick is to unpack the bytes with zero.
#include <mmintrin.h>
#include <string.h>
static __m64 load_m64(const void* ptr)
{
__m64 value;
memcpy(&value, ptr, sizeof(value));
return value;
}
static void store_m64(void* ptr, __m64 value)
{
memcpy(ptr, &value, sizeof(value));
}
void widen_u8_to_u16_mmx(const unsigned char* input,
unsigned short* output,
int count)
{
__m64 zero = _mm_setzero_si64();
int i = 0;
for (; i + 7 < count; i += 8)
{
__m64 bytes = load_m64(&input[i]);
__m64 lowWords = _mm_unpacklo_pi8(bytes, zero);
__m64 highWords = _mm_unpackhi_pi8(bytes, zero);
store_m64(&output[i], lowWords);
store_m64(&output[i + 4], highWords);
}
_mm_empty();
for (; i < count; ++i)
{
output[i] = input[i];
}
}
Conceptually, this converts:
[ b0 | b1 | b2 | b3 | b4 | b5 | b6 | b7 ]
into:
[ b0 | b1 | b2 | b3 ] as 16-bit values
[ b4 | b5 | b6 | b7 ] as 16-bit values
This is a common first step before doing arithmetic on pixel data.
Logical operations
MMX also supports bitwise logical operations.
| Intrinsic | Operation |
|---|---|
_mm_and_si64(a, b) | bitwise AND |
_mm_or_si64(a, b) | bitwise OR |
_mm_xor_si64(a, b) | bitwise XOR |
_mm_andnot_si64(a, b) | bitwise AND NOT |
These are useful for masking, clearing bits, combining comparison results, and manipulating packed data.
Example:
__m64 a = _mm_set1_pi8((char)0xF0);
__m64 b = _mm_set1_pi8((char)0x0F);
__m64 c = _mm_or_si64(a, b);
// Each byte in c is 0xFF.
Comparison intrinsics
MMX comparisons produce masks. Each lane in the result is either all bits set, meaning true, or all bits clear, meaning false.
| Intrinsic | Operation |
|---|---|
_mm_cmpeq_pi8(a, b) | compare eight 8-bit lanes for equality |
_mm_cmpeq_pi16(a, b) | compare four 16-bit lanes for equality |
_mm_cmpeq_pi32(a, b) | compare two 32-bit lanes for equality |
_mm_cmpgt_pi8(a, b) | compare eight signed 8-bit lanes for greater-than |
_mm_cmpgt_pi16(a, b) | compare four signed 16-bit lanes for greater-than |
_mm_cmpgt_pi32(a, b) | compare two signed 32-bit lanes for greater-than |
Example:
__m64 a = _mm_set1_pi16(100);
__m64 b = _mm_set1_pi16(50);
__m64 mask = _mm_cmpgt_pi16(a, b);
Each 16-bit lane in mask is true because 100 is greater than 50.
Internally, true lanes are represented with all bits set:
0xFFFF
False lanes are represented as:
0x0000
These masks can be combined with logical operations.
Shift intrinsics
MMX provides shift operations for packed 16-bit and 32-bit integer lanes, and for whole 64-bit values.
| Intrinsic | Operation |
|---|---|
_mm_slli_pi16(a, count) | shift each 16-bit lane left |
_mm_slli_pi32(a, count) | shift each 32-bit lane left |
_mm_slli_si64(a, count) | shift the whole 64-bit value left |
_mm_srli_pi16(a, count) | logical shift right of each 16-bit lane |
_mm_srli_pi32(a, count) | logical shift right of each 32-bit lane |
_mm_srli_si64(a, count) | logical shift right of the whole 64-bit value |
_mm_srai_pi16(a, count) | arithmetic shift right of each signed 16-bit lane |
_mm_srai_pi32(a, count) | arithmetic shift right of each signed 32-bit lane |
Logical right shift fills with zeros.
Arithmetic right shift preserves the sign bit, so it is used for signed values.
Example: average two unsigned byte arrays
MMX itself does not have all the convenience instructions that later SIMD extensions added, but basic operations can still be combined to build useful routines.
This example computes a simple average of two unsigned byte arrays:
output[i] = (a[i] + b[i]) / 2
To avoid 8-bit overflow, the bytes are widened to 16-bit words first.
#include <mmintrin.h>
#include <string.h>
static __m64 load_m64(const void* ptr)
{
__m64 value;
memcpy(&value, ptr, sizeof(value));
return value;
}
static void store_m64(void* ptr, __m64 value)
{
memcpy(ptr, &value, sizeof(value));
}
void average_u8_mmx(const unsigned char* a,
const unsigned char* b,
unsigned char* output,
int count)
{
__m64 zero = _mm_setzero_si64();
__m64 one = _mm_set1_pi16(1);
int i = 0;
for (; i + 7 < count; i += 8)
{
__m64 va = load_m64(&a[i]);
__m64 vb = load_m64(&b[i]);
__m64 aLow = _mm_unpacklo_pi8(va, zero);
__m64 aHigh = _mm_unpackhi_pi8(va, zero);
__m64 bLow = _mm_unpacklo_pi8(vb, zero);
__m64 bHigh = _mm_unpackhi_pi8(vb, zero);
__m64 sumLow = _mm_add_pi16(aLow, bLow);
__m64 sumHigh = _mm_add_pi16(aHigh, bHigh);
// Optional rounding: add 1 before shifting right by 1.
sumLow = _mm_add_pi16(sumLow, one);
sumHigh = _mm_add_pi16(sumHigh, one);
__m64 avgLow = _mm_srli_pi16(sumLow, 1);
__m64 avgHigh = _mm_srli_pi16(sumHigh, 1);
__m64 packed = _mm_packs_pu16(avgLow, avgHigh);
store_m64(&output[i], packed);
}
_mm_empty();
for (; i < count; ++i)
{
output[i] = (unsigned char)(((int)a[i] + (int)b[i] + 1) >> 1);
}
}
This example shows several important MMX techniques together:
- load eight bytes,
- unpack bytes to words,
- do arithmetic in 16-bit lanes,
- shift to divide by two,
- pack back to unsigned bytes.
MMX vs SSE2
For new code, SSE2 integer intrinsics are usually preferable to MMX.
MMX uses:
__m64
which is 64 bits wide.
SSE2 uses:
__m128i
which is 128 bits wide.
That means SSE2 can process twice as much integer data per vector register:
| Data type | MMX __m64 | SSE2 __m128i |
|---|---|---|
| 8-bit integers | 8 lanes | 16 lanes |
| 16-bit integers | 4 lanes | 8 lanes |
| 32-bit integers | 2 lanes | 4 lanes |
| 64-bit integers | 1 lane | 2 lanes |
SSE2 also avoids the MMX/x87 state problem. There is no need to call _mm_empty() after using SSE2 integer intrinsics.
For example, this MMX idea:
__m64 result = _mm_adds_pu8(a, b);
has a wider SSE2 equivalent:
__m128i result = _mm_adds_epu8(a, b);
The SSE2 version processes sixteen unsigned 8-bit values instead of eight.
Common pitfalls
MMX code has several traps that are easy to miss.
Forgetting _mm_empty()
This is the classic MMX bug.
After using MMX instructions, call:
_mm_empty();
This clears the MMX state. Without it, later x87 floating-point code may behave incorrectly or trigger compiler warnings.
Treating MMX as floating-point SIMD
MMX is packed integer SIMD.
It does not provide the floating-point SIMD programming model that SSE later introduced. If you want packed single-precision floating-point arithmetic, use SSE __m128. If you want packed double-precision floating-point arithmetic, use SSE2 __m128d.
Confusing wrapping and saturating arithmetic
Wrapping arithmetic and saturating arithmetic are different.
Unsigned 8-bit wrapping:
250 + 20 = 14
Unsigned 8-bit saturation:
250 + 20 = 255
Use wrapping operations when you want modulo arithmetic. Use saturating operations when values must stay inside a valid range.
Forgetting that __m64 is only 64 bits
MMX processes only:
8 bytes
4 words
2 doublewords
1 quadword
For modern SIMD code, this is narrow. SSE2 doubles the width to 128 bits, AVX2 doubles it again to 256 bits, and AVX-512 doubles it again to 512 bits.
Using pointer casts carelessly
Legacy MMX code often casts pointers directly to __m64*.
That may work in old code, but it can be questionable in modern C and C++ because of alignment and strict-aliasing rules.
For simple, safe examples, memcpy is a good way to express a 64-bit load or store without relying on aliasing behavior.
Assuming MMX is a good choice for new code
MMX is useful to understand, but it is rarely the right choice for new performance-sensitive x86 code.
Use MMX when:
- maintaining old code,
- studying legacy SIMD examples,
- understanding older multimedia routines,
- working with an old compiler or target where MMX is required.
Use SSE2, AVX2, or later SIMD instruction sets for new code whenever possible.
Build notes
On GCC or Clang, MMX can be enabled with:
gcc -O2 -mmmx example.c -o example
For C++:
g++ -O2 -mmmx example.cpp -o example
On modern x86 processors, MMX support is generally available, but compiler support and recommended usage vary by platform and target mode.
For new x86-64 code, prefer SSE2 integer intrinsics instead of MMX.
Complete example program
The following small program demonstrates unsigned saturating addition on eight 8-bit values.
#include <stdio.h>
#include <mmintrin.h>
#include <string.h>
static __m64 load_m64(const void* ptr)
{
__m64 value;
memcpy(&value, ptr, sizeof(value));
return value;
}
static void store_m64(void* ptr, __m64 value)
{
memcpy(ptr, &value, sizeof(value));
}
int main()
{
unsigned char a[8] = { 10, 20, 30, 40, 240, 250, 100, 200 };
unsigned char b[8] = { 1, 2, 3, 4, 20, 20, 200, 100 };
unsigned char out[8];
__m64 va = load_m64(a);
__m64 vb = load_m64(b);
__m64 result = _mm_adds_pu8(va, vb);
store_m64(out, result);
_mm_empty();
for (int i = 0; i < 8; ++i)
{
printf("%u ", out[i]);
}
printf("\n");
return 0;
}
Expected output:
11 22 33 44 255 255 255 255
The last four results demonstrate unsigned saturation. Instead of wrapping around, values above 255 are clamped to 255.
Summary
MMX introduced packed integer SIMD programming to x86.
The key ideas are:
__m64represents a 64-bit MMX value.- The same 64 bits can be interpreted as 8-bit, 16-bit, 32-bit, or 64-bit integer lanes.
- MMX is integer-only SIMD, not floating-point SIMD.
- Saturating arithmetic is one of MMX’s most useful features.
- Packing and unpacking are essential for image, audio, and codec-style code.
- MMX shares state with the old x87 floating-point unit, so
_mm_empty()is required after MMX code. - For new code, SSE2
__m128ior later SIMD instruction sets are usually better choices.
MMX is old, but it remains historically important. Understanding MMX makes it easier to read legacy optimized code and to understand how later SIMD extensions evolved.
References
- Intel Intrinsics Guide
https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html - Intel 64 and IA-32 Architectures Software Developer Manuals
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html - Microsoft x86 intrinsics list
https://learn.microsoft.com/en-us/cpp/intrinsics/x86-intrinsics-list - Microsoft compiler warning C4799: missing EMMS instruction
https://learn.microsoft.com/en-us/cpp/error-messages/compiler-warnings/compiler-warning-level-1-c4799



