MMX was the first widely adopted SIMD instruction set on x86 processors.
It was introduced by Intel in the Pentium MMX generation and later supported by AMD and other x86-compatible processors. At the time, it was a major step forward for multimedia and communications software because it allowed one instruction to process multiple small integer values at once.
MMX was designed for workloads such as:
- image processing;
- video decoding;
- audio processing;
- speech compression;
- games;
- 2D and 3D graphics support code;
- signal processing;
- video conferencing;
- data conversion and filtering.
These workloads often have the same basic structure: many small values, repeated operations, tight loops, and a high degree of data parallelism.
MMX gave x86 programmers a way to exploit that parallelism directly.
Today, MMX is mostly a legacy instruction set. For new code, SSE2 or later is almost always a better choice. But MMX is still important to understand because it was the foundation of x86 SIMD programming and many old multimedia libraries, codecs, image filters, and game engines used it heavily.
Core Idea
MMX is based on SIMD:
Single Instruction, Multiple Data
Instead of processing one value at a time, an MMX instruction processes several packed integer values inside one 64-bit register.
For example, one MMX register can hold eight 8-bit pixels:
[ p7 ][ p6 ][ p5 ][ p4 ][ p3 ][ p2 ][ p1 ][ p0 ]
A single packed add instruction can add eight pixel values at once.
That was the key advantage of MMX: it allowed small integer operations to be performed in parallel.
What MMX Added
MMX added:
- 57 new instructions;
- eight 64-bit MMX registers;
- packed integer data types;
- saturating arithmetic;
- packed comparison instructions;
- packing and unpacking operations;
- logical operations;
- packed shifts;
- a new SIMD programming model for x86.
The MMX registers are named:
mm0 mm1 mm2 mm3 mm4 mm5 mm6 mm7
Each register is 64 bits wide.
In C and C++ intrinsics, MMX values are represented using the __m64 type:
#include <mmintrin.h>
__m64 value;
The original MMX intrinsic header is:
#include <mmintrin.h>
MMX Data Types
MMX does not define one single data type. Instead, each 64-bit MMX register can be interpreted in several ways.
| MMX packed type | Elements per 64-bit register | Element size |
|---|---|---|
| Packed byte | 8 | 8-bit |
| Packed word | 4 | 16-bit |
| Packed doubleword | 2 | 32-bit |
| Quadword | 1 | 64-bit |
The same physical 64-bit register can be interpreted differently depending on the instruction.
For example:
64-bit register as bytes:
[ b7 ][ b6 ][ b5 ][ b4 ][ b3 ][ b2 ][ b1 ][ b0 ]
64-bit register as words:
[ w3 ][ w2 ][ w1 ][ w0 ]
64-bit register as doublewords:
[ d1 ][ d0 ]
64-bit register as one quadword:
[ q0 ]
MMX itself does not track the data type stored in a register. The instruction determines how the bits are interpreted.
That is still true in modern SIMD programming. A vector register contains bits. The instruction gives meaning to those bits.
Integer and Fixed-Point Only
MMX is an integer SIMD instruction set.
It does not provide floating-point SIMD operations. That came later through AMD 3DNow! and Intel SSE.
MMX can be used for fixed-point arithmetic, but the processor does not know where the fixed point is. The programmer decides how to interpret the integer values.
For example, a 16-bit word might represent:
1234 // ordinary integer
12.34 // fixed-point value with two decimal digits
0.3765 // normalized fixed-point value
The CPU does not know the difference. It simply performs integer operations.
This gives the programmer flexibility, but also responsibility. You must control scaling, rounding, overflow, and precision yourself.
Why MMX Was Useful
MMX was designed for the kind of data commonly found in multimedia code.
Image Processing
Pixels are often stored as 8-bit channels:
red = 0..255
green = 0..255
blue = 0..255
alpha = 0..255
One MMX register can process eight 8-bit values at a time.
That is useful for:
- brightness changes;
- contrast adjustment;
- alpha blending;
- grayscale conversion;
- thresholding;
- masks;
- simple filters.
Audio Processing
Audio samples are often stored as 16-bit integers.
One MMX register can hold four 16-bit samples.
That is useful for:
- mixing;
- scaling;
- clipping;
- filtering;
- format conversion.
Video Processing
Video codecs perform many repeated operations on blocks of pixels.
MMX was useful for:
- motion compensation;
- block difference calculations;
- interpolation;
- color-space conversion;
- quantization support;
- pixel averaging.
Later instruction sets added much better video-processing instructions, but MMX was the first mainstream x86 SIMD step in that direction.
Basic MMX Example
The following example adds eight unsigned bytes using MMX wraparound arithmetic.
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <mmintrin.h>
void add_u8_mmx(uint8_t* dst, const uint8_t* a, const uint8_t* b, size_t count)
{
size_t i = 0;
for (; i + 8 <= count; i += 8)
{
__m64 va;
__m64 vb;
memcpy(&va, a + i, sizeof(va));
memcpy(&vb, b + i, sizeof(vb));
__m64 vr = _mm_add_pi8(va, vb);
memcpy(dst + i, &vr, sizeof(vr));
}
_mm_empty();
for (; i < count; ++i)
{
dst[i] = (uint8_t)(a[i] + b[i]);
}
}
The vector loop processes eight bytes per iteration.
The scalar loop at the end handles the remaining bytes when count is not a multiple of eight.
The call to _mm_empty() is important. It clears the MMX state before the function returns.
Why _mm_empty() Is Required
MMX has an important historical design issue:
MMX registers are aliased on top of the x87 floating-point register stack.
The MMX registers mm0 through mm7 reuse the same physical register storage as the x87 FPU registers.
Because of this, after executing MMX instructions, the x87 floating-point unit may think its register stack is full or in use.
To return the processor to a clean floating-point state, you must execute the EMMS instruction.
In C and C++, this is done with:
_mm_empty();
A typical MMX function should end like this:
void mmx_function(void)
{
// MMX work here.
_mm_empty();
}
Do not call _mm_empty() before every MMX instruction. It is a cleanup instruction, not an initialization instruction.
Use it once after the MMX block is finished, before returning to code that may use x87 floating-point instructions.
MMX and the x87 Problem
The x87 FPU uses a stack-like register model.
MMX reused that register storage to avoid adding a completely separate architectural register file. That made MMX easier to integrate into existing CPUs and operating systems, but it created a programming inconvenience.
After MMX code runs, x87 floating-point code may fail or behave incorrectly unless EMMS is executed first.
This is one of the reasons MMX was eventually replaced by SSE2 for integer SIMD code.
SSE2 uses XMM registers, which are independent from the x87 stack. Pure SSE2 code does not require _mm_empty().
Packed Arithmetic Instructions
MMX includes packed arithmetic instructions for bytes, words, and doublewords.
Common examples include:
| Instruction | Intrinsic | Meaning |
|---|---|---|
PADDB | _mm_add_pi8 | add packed 8-bit integers |
PADDW | _mm_add_pi16 | add packed 16-bit integers |
PADDD | _mm_add_pi32 | add packed 32-bit integers |
PSUBB | _mm_sub_pi8 | subtract packed 8-bit integers |
PSUBW | _mm_sub_pi16 | subtract packed 16-bit integers |
PSUBD | _mm_sub_pi32 | subtract packed 32-bit integers |
PMULLW | _mm_mullo_pi16 | multiply packed 16-bit integers, keep low half |
PMULHW | _mm_mulhi_pi16 | multiply packed signed 16-bit integers, keep high half |
The basic packed add and subtract instructions wrap around on overflow.
For example, with 8-bit unsigned arithmetic:
250 + 20 = 14
because the result wraps modulo 256.
For multimedia programming, wraparound is often not what you want. That is why MMX also provides saturating arithmetic.
Saturating Arithmetic
Saturating arithmetic clamps results to the representable range instead of wrapping around.
For unsigned 8-bit values:
250 + 20 = 255
instead of:
250 + 20 = 14
This is very useful for pixels, audio samples, and other bounded media values.
MMX provides signed and unsigned saturating add/subtract instructions.
| Instruction | Intrinsic | Meaning |
|---|---|---|
PADDSB | _mm_adds_pi8 | signed saturating add, 8-bit |
PADDSW | _mm_adds_pi16 | signed saturating add, 16-bit |
PADDUSB | _mm_adds_pu8 | unsigned saturating add, 8-bit |
PADDUSW | _mm_adds_pu16 | unsigned saturating add, 16-bit |
PSUBSB | _mm_subs_pi8 | signed saturating subtract, 8-bit |
PSUBSW | _mm_subs_pi16 | signed saturating subtract, 16-bit |
PSUBUSB | _mm_subs_pu8 | unsigned saturating subtract, 8-bit |
PSUBUSW | _mm_subs_pu16 | unsigned saturating subtract, 16-bit |
Example: Brightness Increase With Saturation
This example increases image brightness for an array of 8-bit grayscale pixels.
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <mmintrin.h>
void brighten_u8_mmx(uint8_t* pixels, uint8_t amount, size_t count)
{
size_t i = 0;
__m64 increment = _mm_set1_pi8((char)amount);
for (; i + 8 <= count; i += 8)
{
__m64 p;
memcpy(&p, pixels + i, sizeof(p));
__m64 r = _mm_adds_pu8(p, increment);
memcpy(pixels + i, &r, sizeof(r));
}
_mm_empty();
for (; i < count; ++i)
{
unsigned int value = pixels[i] + amount;
pixels[i] = (value > 255) ? 255 : (uint8_t)value;
}
}
This is the kind of operation MMX was designed for: small data elements, simple repeated computation, and obvious data parallelism.
Packed Comparison Instructions
MMX includes packed comparison instructions.
These compare elements lane by lane and produce a mask result.
Common examples:
| Instruction | Intrinsic | Meaning |
|---|---|---|
PCMPEQB | _mm_cmpeq_pi8 | compare packed bytes for equality |
PCMPEQW | _mm_cmpeq_pi16 | compare packed words for equality |
PCMPEQD | _mm_cmpeq_pi32 | compare packed doublewords for equality |
PCMPGTB | _mm_cmpgt_pi8 | compare signed bytes greater-than |
PCMPGTW | _mm_cmpgt_pi16 | compare signed words greater-than |
PCMPGTD | _mm_cmpgt_pi32 | compare signed doublewords greater-than |
The result of a comparison is not 0 or 1 per element.
Instead, each element becomes either all zero bits or all one bits.
For example, for bytes:
false -> 0x00
true -> 0xFF
This makes comparison results useful as masks for later logical operations.
Logical Instructions
MMX includes bitwise logical operations on 64-bit values.
| Instruction | Intrinsic | Meaning |
|---|---|---|
PAND | _mm_and_si64 | bitwise AND |
PANDN | _mm_andnot_si64 | bitwise AND NOT |
POR | _mm_or_si64 | bitwise OR |
PXOR | _mm_xor_si64 | bitwise XOR |
These instructions are commonly used for:
- masks;
- blending;
- clearing bits;
- selecting values;
- building constants;
- combining comparison results.
A common idiom is to zero a register using XOR:
__m64 zero = _mm_setzero_si64();
This typically maps to a PXOR instruction.
Packing and Unpacking
Packing and unpacking are essential in SIMD programming.
Packing converts wider elements into narrower elements.
Unpacking interleaves elements from two registers and often converts narrower data into a wider representation.
Common MMX pack instructions include:
| Instruction | Meaning |
|---|---|
PACKSSWB | pack signed words into signed bytes with saturation |
PACKSSDW | pack signed doublewords into signed words with saturation |
PACKUSWB | pack signed words into unsigned bytes with saturation |
Common unpack instructions include:
| Instruction | Meaning |
|---|---|
PUNPCKLBW | unpack low bytes into words |
PUNPCKHBW | unpack high bytes into words |
PUNPCKLWD | unpack low words into doublewords |
PUNPCKHWD | unpack high words into doublewords |
PUNPCKLDQ | unpack low doublewords into quadword |
PUNPCKHDQ | unpack high doublewords into quadword |
These instructions are useful when you need to widen data before arithmetic.
For example, adding bytes can overflow quickly. A common approach is:
- unpack bytes into words;
- perform 16-bit arithmetic;
- pack the result back into bytes with saturation.
This pattern remains common in later SIMD instruction sets too.
Shift Instructions
MMX includes packed shift instructions for words, doublewords, and quadwords.
| Instruction | Meaning |
|---|---|
PSLLW | shift packed words left logical |
PSLLD | shift packed doublewords left logical |
PSLLQ | shift quadword left logical |
PSRLW | shift packed words right logical |
PSRLD | shift packed doublewords right logical |
PSRLQ | shift quadword right logical |
PSRAW | shift packed words right arithmetic |
PSRAD | shift packed doublewords right arithmetic |
Logical right shifts fill with zeros.
Arithmetic right shifts preserve the sign bit and are used for signed values.
Shifts are important for fixed-point arithmetic. For example, after multiplying fixed-point values, a right shift is often used to restore the desired scale.
Data Movement
MMX uses MOVD and MOVQ style instructions to move data.
| Instruction | Meaning |
|---|---|
MOVD | move 32 bits between MMX, memory, or general-purpose register |
MOVQ | move 64 bits between MMX registers or memory |
In C/C++ intrinsics, older MMX code often used pointer casts to load and store __m64 values. In modern C and C++, using memcpy is safer because it avoids strict-aliasing problems, and optimizing compilers usually generate efficient loads and stores.
Example:
__m64 v;
memcpy(&v, source, sizeof(v));
/* use v */
memcpy(destination, &v, sizeof(v));
For historical code, you will often see this instead:
__m64 v = *(__m64 const*)source;
*(__m64*)destination = v;
That style was common in old examples, but it can be problematic in strictly conforming C/C++ because of aliasing and alignment assumptions.
MMX Instruction Groups
The MMX instruction set can be grouped like this:
| Group | Examples | Purpose |
|---|---|---|
| Data transfer | MOVD, MOVQ | move data to/from MMX registers |
| Arithmetic | PADD*, PSUB*, PMUL* | packed integer arithmetic |
| Saturating arithmetic | PADDS*, PADDUS*, PSUBS*, PSUBUS* | clamp instead of wrap |
| Comparison | PCMPEQ*, PCMPGT* | produce packed masks |
| Conversion / packing | PACK*, PUNPCK* | narrow, widen, interleave |
| Logical | PAND, PANDN, POR, PXOR | bitwise operations |
| Shift | PSLL*, PSRL*, PSRA* | packed and quadword shifts |
| State management | EMMS | empty MMX state |
This structure influenced later SIMD extensions such as SSE2, AVX2, and AVX-512.
The register width changed, but the same categories remained important.
MMX Intrinsics
MMX intrinsics expose MMX instructions through C and C++ function-like operations.
The basic header is:
#include <mmintrin.h>
The main data type is:
__m64
Examples:
__m64 a = _mm_set1_pi16(100);
__m64 b = _mm_set1_pi16(25);
__m64 sum = _mm_add_pi16(a, b);
__m64 diff = _mm_sub_pi16(a, b);
__m64 product_low = _mm_mullo_pi16(a, b);
_mm_empty();
A useful practical rule is:
If your code uses __m64, it is using MMX state.
That means you should remember _mm_empty() before returning to code that may use x87 floating-point operations.
MMX Register Layout and Element Order
When using intrinsics such as _mm_set_pi8, _mm_set_pi16, and _mm_set_pi32, be careful with element order.
For example:
__m64 v = _mm_set_pi16(4, 3, 2, 1);
This places values into the packed lanes in a high-to-low argument order.
This convention can be confusing when debugging because memory order, lane numbering, and argument order may not look the same at first glance.
For educational code, it is often clearer to load values from an array:
#include <stdint.h>
#include <string.h>
#include <mmintrin.h>
int16_t values[4] = { 1, 2, 3, 4 };
__m64 v;
memcpy(&v, values, sizeof(v));
Then the memory layout is explicit.
Alignment
MMX registers are 64 bits wide, so MMX code usually works with 8-byte chunks.
Many MMX memory operations can access memory operands directly, and old MMX code often assumes 8-byte chunks.
However, alignment still matters for performance and correctness in low-level code.
General advice:
- process data in 8-byte blocks;
- handle leftover elements with scalar code;
- avoid unsafe pointer casts when writing portable C/C++;
- be careful when reading past the end of buffers;
- benchmark with realistic data.
SSE2 and later make alignment more visible because XMM registers are 16 bytes wide, but MMX code still benefits from predictable memory access.
Tail Processing
Most buffers are not exact multiples of eight bytes.
A safe MMX loop usually has this structure:
size_t i = 0;
for (; i + 8 <= count; i += 8)
{
// MMX work on 8 bytes.
}
_mm_empty();
for (; i < count; ++i)
{
// Scalar tail.
}
The scalar tail is not just a detail. It is necessary for correctness.
Do not read past the end of the input buffer unless the data format explicitly guarantees safe padding.
MMX and Modern x86-64
Modern x86-64 processors generally still support MMX for backward compatibility, but MMX is not a good target for new code.
There are several reasons:
- MMX registers are only 64 bits wide.
- MMX shares state with x87.
- MMX requires
_mm_empty(). - SSE2 is part of the x86-64 baseline.
- SSE2 provides 128-bit integer SIMD with XMM registers.
- AVX2 and AVX-512 provide much wider SIMD on newer CPUs.
- Modern compilers optimize SSE2 and newer SIMD much better than MMX.
For new code, MMX is mainly useful as historical knowledge.
MMX vs SSE2
SSE2 is the natural replacement for MMX integer SIMD.
| Feature | MMX | SSE2 |
|---|---|---|
| Register type | MMX | XMM |
| Register width | 64-bit | 128-bit |
| Intrinsic type | __m64 | __m128i |
| Packed bytes | 8 | 16 |
| Packed words | 4 | 8 |
| Packed doublewords | 2 | 4 |
| Shares x87 state | Yes | No |
Requires _mm_empty() | Yes | No |
| Good for new code | No | Yes |
Here is a simple MMX operation:
#include <mmintrin.h>
void mmx_example(void)
{
__m64 a = _mm_set_pi16(4, 3, 2, 1);
__m64 b = _mm_set_pi16(8, 7, 6, 5);
__m64 c = _mm_add_pi16(a, b);
_mm_empty();
}
Here is the equivalent SSE2-style operation:
#include <emmintrin.h>
void sse2_example(void)
{
__m128i a = _mm_set_epi16(8, 7, 6, 5, 4, 3, 2, 1);
__m128i b = _mm_set_epi16(16, 15, 14, 13, 12, 11, 10, 9);
__m128i c = _mm_add_epi16(a, b);
// No _mm_empty() required for pure SSE2 code.
}
The SSE2 version processes twice as many 16-bit values and avoids the MMX/x87 state problem.
MMX vs 3DNow!
AMD introduced 3DNow! after MMX.
3DNow! used the MMX register file but added packed single-precision floating-point operations and other media-oriented instructions.
This was AMD’s way to accelerate 3D and multimedia workloads before SSE became widely available.
However, 3DNow! did not become the long-term portable SIMD path. Intel SSE and SSE2 became the dominant model, and AMD eventually adopted the SSE/AVX family as well.
Today, 3DNow! is obsolete for new code.
MMX vs SSE, AVX, and AVX-512
MMX started the SIMD path on x86, but later instruction sets expanded the same idea.
| Instruction set | Register type | Width | Main role |
|---|---|---|---|
| MMX | __m64 | 64-bit | Packed integer SIMD |
| SSE | __m128 | 128-bit | Packed single-precision floating point |
| SSE2 | __m128i, __m128d | 128-bit | Integer SIMD and double-precision floating point |
| AVX | __m256 | 256-bit | Wider floating-point SIMD |
| AVX2 | __m256i | 256-bit | Wider integer SIMD |
| AVX-512 | __m512i, __m512, __m512d | 512-bit | Wide SIMD, masks, HPC, AI, analytics |
The modern equivalent of MMX-style packed integer programming is not MMX. It is SSE2, AVX2, or AVX-512 depending on the target CPU.
When MMX Still Matters
MMX is still worth understanding when:
- maintaining old C/C++ multimedia code;
- reading old assembly routines;
- studying early SIMD optimization techniques;
- porting legacy codecs or image filters;
- understanding the evolution from MMX to SSE2;
- working with vintage CPUs;
- documenting old x86 performance techniques.
MMX is not useless knowledge. It explains many SIMD ideas that still exist today:
- packed data;
- lane-wise arithmetic;
- saturating arithmetic;
- masks from comparisons;
- widening and narrowing;
- fixed-point arithmetic;
- vectorized loops;
- scalar tail handling.
Those ideas did not disappear. They simply moved to wider and cleaner register files.
When Not to Use MMX
Avoid MMX for new code unless you have a very specific legacy requirement.
Do not choose MMX when:
- you are writing new x86-64 code;
- SSE2 is available;
- you need floating-point SIMD;
- you want good compiler auto-vectorization;
- you want portable modern SIMD paths;
- you are targeting AVX2 or newer CPUs;
- you want to avoid
_mm_empty()and x87 interactions.
For new x86-64 code, SSE2 should be considered the minimum SIMD baseline.
For modern performance code, AVX2 is often a better target when available.
Common Mistakes
Mistake 1: Forgetting _mm_empty()
This is the classic MMX bug.
void bad_mmx_function(void)
{
__m64 a = _mm_set1_pi16(10);
__m64 b = _mm_set1_pi16(20);
__m64 c = _mm_add_pi16(a, b);
// Missing _mm_empty().
}
Correct version:
void good_mmx_function(void)
{
__m64 a = _mm_set1_pi16(10);
__m64 b = _mm_set1_pi16(20);
__m64 c = _mm_add_pi16(a, b);
_mm_empty();
}
Mistake 2: Assuming MMX Supports Floating Point
MMX does not provide floating-point SIMD.
It provides packed integer SIMD.
For floating-point SIMD, use SSE or later.
Mistake 3: Confusing Wraparound and Saturation
These are different:
_mm_add_pi8(a, b); // wraparound
_mm_adds_pu8(a, b); // unsigned saturation
For pixels and audio samples, saturating arithmetic is often the correct choice.
Mistake 4: Ignoring Tail Elements
If your loop processes eight bytes at a time, you still need to handle the remaining bytes.
for (; i < count; ++i)
{
// scalar tail
}
Mistake 5: Using MMX for New Code
MMX was important historically, but SSE2 is a better minimum target for modern x86-64 code.
If you are starting from scratch, use __m128i, not __m64.
Mistake 6: Assuming Wider SIMD Automatically Fixes the Algorithm
Moving from scalar code to MMX, or from MMX to SSE2, requires careful thinking about:
- overflow;
- signed vs unsigned operations;
- data layout;
- alignment;
- tail handling;
- fixed-point scaling;
- memory bandwidth.
SIMD makes operations parallel. It does not automatically make the algorithm correct.
Best Practices for Legacy MMX Code
If you must maintain MMX code:
- keep MMX blocks small and well isolated;
- call
_mm_empty()at the end of each MMX section; - avoid mixing MMX and x87 floating-point code;
- document whether arithmetic wraps or saturates;
- handle scalar tails safely;
- avoid reading past buffer ends;
- use clear helper functions for loading and storing;
- consider replacing the code with SSE2;
- benchmark before and after any rewrite.
Best Practices for New Code
For new SIMD code:
- use SSE2 or later instead of MMX;
- use
__m128ifor 128-bit integer SIMD; - use AVX2 where 256-bit integer SIMD is useful and available;
- use AVX-512 only with runtime dispatch and benchmarking;
- keep a scalar fallback when portability matters;
- let the compiler auto-vectorize simple loops before writing intrinsics manually;
- write tests that cover overflow, saturation, signs, and tails.
Summary
MMX was the first major SIMD extension for x86 processors.
It introduced 64-bit packed integer operations and made it possible to process multiple bytes, words, or doublewords with one instruction. This was a major improvement for multimedia and communications workloads in the late 1990s.
The key MMX concepts are:
- SIMD parallelism;
- eight 64-bit MMX registers;
- packed byte, word, doubleword, and quadword data;
- integer and fixed-point arithmetic;
- saturating arithmetic;
- packed comparisons and masks;
- packing and unpacking;
- shift and logical operations;
- the need for
EMMS/_mm_empty().
The main limitation is that MMX shares state with the x87 floating-point unit. That makes _mm_empty() necessary and makes MMX awkward compared with later SIMD instruction sets.
For modern code, MMX is legacy.
SSE2 moved packed integer SIMD into 128-bit XMM registers, doubled the data width, avoided the x87 state problem, and became part of the x86-64 baseline.
MMX is still worth learning because it explains where x86 SIMD programming started. But for new development, the practical recommendation is clear:
Use SSE2 or later, and keep MMX for legacy code and historical understanding.



