SIMD on x64/x86

SSE2 and MMX: How 128-bit Integer SIMD Replaced Legacy MMX

MMX was Intel’s first widely adopted SIMD instruction set on x86 processors. It introduced 64-bit packed integer operations and made it possible to process multiple bytes, words, or doublewords with a single instruction.

For the late 1990s, MMX was an important step forward. It was useful for image processing, audio processing, video decoding, graphics effects, and many other multimedia workloads.

However, MMX had two major limitations:

  1. MMX registers were only 64 bits wide.
  2. MMX registers were aliased on top of the old x87 floating-point register stack.

SSE2 solved both problems for integer SIMD code.

SSE2 introduced 128-bit integer SIMD operations using XMM registers. That meant code that previously processed 64 bits at a time with MMX could often be rewritten to process 128 bits at a time with SSE2.

In other words, SSE2 did not merely add double-precision floating-point support. It also provided the clean replacement path for MMX.

The Short Version

MMX code uses 64-bit mm0mm7 registers.

SSE2 integer code uses 128-bit XMM registers.

FeatureMMXSSE2 integer SIMD
Register typemm0mm7xmm0xmm7 on 32-bit x86, xmm0xmm15 on x86-64
Register width64-bit128-bit
C/C++ intrinsic type__m64__m128i
Typical header<mmintrin.h><emmintrin.h>
Packed bytes per register816
Packed 16-bit words per register48
Packed 32-bit integers per register24
Packed 64-bit integers per register12
Shares state with x87 FPUYesNo
Requires _mm_empty()Yes, after MMX useNo, for pure XMM code
Good target for new codeNoYes, as a baseline for x86-64

The most important practical rule is:

For new integer SIMD code, use SSE2 or later instead of MMX.

MMX is useful to understand legacy optimized code, but SSE2 is the cleaner and more portable foundation for modern x86 SIMD programming.

Why MMX Needed to Be Replaced

MMX was designed around 64-bit packed integer operations.

An MMX register can hold:

Data typeElements per MMX register
8-bit integers8
16-bit integers4
32-bit integers2
64-bit integer1

That was useful, but limited. Processing only eight bytes at a time is not much by modern standards.

The more serious problem is that MMX registers are not independent from the x87 floating-point unit. The MMX registers reuse the same physical register storage as the x87 FPU stack.

Because of this, after using MMX instructions, code must execute EMMS, or the C/C++ intrinsic _mm_empty(), before returning to code that may use x87 floating-point instructions.

For example:

#include <mmintrin.h>

void mmx_function(void)
{
    __m64 a = _mm_set_pi16(4, 3, 2, 1);
    __m64 b = _mm_set_pi16(8, 7, 6, 5);

    __m64 c = _mm_add_pi16(a, b);

    // Required after MMX code before returning to possible x87 floating-point code.
    _mm_empty();
}

Forgetting _mm_empty() can leave the x87 floating-point state in a bad condition.

SSE2 avoids this problem because XMM registers are a separate SIMD register file.

What SSE2 Added

SSE2 was introduced with the Intel Pentium 4. It later became part of the baseline for x86-64, which makes it widely available on modern x86 systems.

SSE2 added several important capabilities:

  • 128-bit packed integer operations;
  • 128-bit XMM versions of many MMX-style operations;
  • double-precision floating-point SIMD;
  • scalar double-precision floating-point operations;
  • integer/floating-point conversion instructions;
  • cache and memory-ordering instructions.

For this article, the most important part is:

SSE2 extended packed integer SIMD from 64-bit MMX registers to 128-bit XMM registers.

An SSE2 __m128i register can hold:

Data typeElements per XMM register
8-bit integers16
16-bit integers8
32-bit integers4
64-bit integers2

This doubles the amount of integer data processed per instruction compared with MMX.

MMX Example

Here is a simple MMX routine that adds eight unsigned bytes at a time.

#include <stdint.h>
#include <stddef.h>
#include <mmintrin.h>

void add_bytes_mmx(uint8_t* dst, const uint8_t* a, const uint8_t* b, size_t count)
{
    size_t i = 0;

    for (; i + 8 <= count; i += 8)
    {
        __m64 va = *(__m64 const*)(a + i);
        __m64 vb = *(__m64 const*)(b + i);

        __m64 vr = _mm_add_pi8(va, vb);

        *(__m64*)(dst + i) = vr;
    }

    _mm_empty();

    for (; i < count; ++i)
    {
        dst[i] = (uint8_t)(a[i] + b[i]);
    }
}

This processes eight bytes per loop iteration.

The SSE2 version can process sixteen bytes per iteration.

SSE2 Version

#include <stdint.h>
#include <stddef.h>
#include <emmintrin.h>

void add_bytes_sse2(uint8_t* dst, const uint8_t* a, const uint8_t* b, size_t count)
{
    size_t i = 0;

    for (; i + 16 <= count; i += 16)
    {
        __m128i va = _mm_loadu_si128((const __m128i*)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i*)(b + i));

        __m128i vr = _mm_add_epi8(va, vb);

        _mm_storeu_si128((__m128i*)(dst + i), vr);
    }

    for (; i < count; ++i)
    {
        dst[i] = (uint8_t)(a[i] + b[i]);
    }
}

The SSE2 version has several advantages:

  • it processes twice as many bytes per iteration;
  • it uses XMM registers instead of MMX registers;
  • it does not require _mm_empty();
  • it is a better baseline for modern x86 and x86-64 code.

This does not guarantee exactly twice the performance. Real speed depends on memory bandwidth, alignment, instruction throughput, cache behavior, and loop overhead. But SSE2 gives the code twice the vector width to work with.

Why You Cannot Just Rename MMX Registers

It is tempting to think that porting MMX code to SSE2 is just a matter of changing register names:

mm0 -> xmm0
mm1 -> xmm1
mm2 -> xmm2

Unfortunately, that is not enough.

There are several differences between MMX and SSE2 code:

  • MMX registers are 64-bit; XMM registers are 128-bit.
  • Memory loads and stores often need to change.
  • Loop counters and offsets must be updated.
  • Alignment becomes more important.
  • Some shuffle and shift instructions do not map directly.
  • Code that assumes a single 64-bit lane may not work correctly with two 64-bit lanes.
  • Intrinsic types change from __m64 to __m128i.

The conceptual operation may be the same, but the details matter.

Intrinsic Type Mapping

In C and C++, MMX and SSE2 use different intrinsic types.

MMXSSE2
__m64__m128i
<mmintrin.h><emmintrin.h>
_mm_add_pi8_mm_add_epi8
_mm_add_pi16_mm_add_epi16
_mm_add_pi32_mm_add_epi32
_mm_sub_pi8_mm_sub_epi8
_mm_sub_pi16_mm_sub_epi16
_mm_sub_pi32_mm_sub_epi32
_mm_adds_pu8_mm_adds_epu8
_mm_adds_pi16_mm_adds_epi16
_mm_subs_pu8_mm_subs_epu8
_mm_subs_pi16_mm_subs_epi16
_mm_cmpeq_pi8_mm_cmpeq_epi8
_mm_cmpgt_pi16_mm_cmpgt_epi16
_mm_unpacklo_pi8_mm_unpacklo_epi8
_mm_unpackhi_pi8_mm_unpackhi_epi8
_mm_slli_pi16_mm_slli_epi16
_mm_srli_pi16_mm_srli_epi16
_mm_mullo_pi16_mm_mullo_epi16

The names are similar, but not identical.

MMX intrinsics often use suffixes such as:

_pi8
_pi16
_pi32
_pu8

SSE2 integer intrinsics usually use suffixes such as:

_epi8
_epi16
_epi32
_epu8

The e in names such as _mm_add_epi16 means “extended” packed integer.

Load and Store Differences

In MMX, a 64-bit load often uses a simple MOVQ.

In SSE2, a 128-bit integer load usually uses one of these:

MOVDQA   aligned 128-bit integer load/store
MOVDQU   unaligned 128-bit integer load/store

In intrinsics:

__m128i _mm_load_si128(const __m128i* p);   // aligned
__m128i _mm_loadu_si128(const __m128i* p);  // unaligned

void _mm_store_si128(__m128i* p, __m128i a);   // aligned
void _mm_storeu_si128(__m128i* p, __m128i a);  // unaligned

Use aligned loads and stores only when the address is guaranteed to be 16-byte aligned.

__m128i v = _mm_load_si128((const __m128i*)ptr);

This requires ptr to be 16-byte aligned.

If the pointer may be unaligned, use:

__m128i v = _mm_loadu_si128((const __m128i*)ptr);

The unaligned version is safer and often fast enough on modern processors, although aligned data can still be beneficial in tight loops.

Alignment Matters More With SSE2

A 128-bit XMM value is 16 bytes wide. Many early SSE2 instructions expected 16-byte aligned memory operands.

For aligned integer loads and stores, the address must be divisible by 16:

address % 16 == 0

Examples of 16-byte aligned addresses:

0x1000
0x1010
0x1020

Examples of unaligned addresses:

0x1001
0x1008
0x100C

A common safe pattern is:

  1. Use scalar code until the destination pointer is aligned.
  2. Use aligned SSE2 loads/stores for the main loop if the source pointers are also aligned.
  3. Use unaligned loads if source alignment is not guaranteed.
  4. Use scalar code for the tail.

However, for many modern programs, the simpler approach is to use _mm_loadu_si128 and _mm_storeu_si128 unless profiling shows that alignment is a bottleneck.

Allocating Aligned Memory

If you want to use aligned SSE2 loads and stores, allocate memory with 16-byte alignment.

In C11:

#include <stdlib.h>

void* p = aligned_alloc(16, size);

The size passed to aligned_alloc must be a multiple of the alignment.

On POSIX systems:

#include <stdlib.h>

void* p = NULL;
posix_memalign(&p, 16, size);

On Windows:

#include <malloc.h>

void* p = _aligned_malloc(size, 16);

// Later:
_aligned_free(p);

In C++17 and later, aligned allocation is also supported by the language, but the exact best choice depends on how the memory is owned and freed.

The important point is that memory allocated with ordinary malloc on old systems was not always guaranteed to be 16-byte aligned. Modern platforms are generally better, but portable SIMD code should still be explicit about alignment when using aligned loads and stores.

Stack Alignment

Stack alignment is another issue when porting old MMX assembly to SSE2.

MMX only needs 8-byte data chunks, but SSE2 often uses 16-byte XMM values. If old assembly code spills XMM registers to the stack using aligned stores, the stack must be 16-byte aligned.

Modern x86-64 ABIs normally define stack-alignment rules, but older 32-bit code may not.

If you are porting old 32-bit hand-written assembly, check the calling convention and stack alignment carefully.

A simple defensive rule is:

Do not use aligned XMM stack stores unless you know the stack is correctly aligned.

Otherwise, use unaligned stores or fix the function prologue to align the stack.

Updating Loop Counters

Because SSE2 processes 128 bits at once, loop counters usually need to change.

For example, if MMX code processes eight bytes per iteration:

for (i = 0; i + 8 <= count; i += 8)
{
    // MMX
}

the SSE2 version usually processes sixteen bytes per iteration:

for (i = 0; i + 16 <= count; i += 16)
{
    // SSE2
}

The same applies to other element sizes.

Element typeMMX elements per iterationSSE2 elements per iteration
8-bit bytes816
16-bit words48
32-bit integers24
64-bit integers12

You must also update pointer increments, memory offsets, and tail handling.

This is one of the most common sources of bugs when converting MMX loops to SSE2.

Tail Processing

Most buffers are not exact multiples of 16 bytes.

For example, an image row may contain 1923 bytes. An SSE2 loop can process 16 bytes at a time, but there will be leftover bytes at the end.

A common structure is:

size_t i = 0;

for (; i + 16 <= count; i += 16)
{
    // SSE2 main loop.
}

for (; i < count; ++i)
{
    // Scalar tail.
}

This is simple and reliable.

For very performance-sensitive code, you can use masks, overread-safe padding, or specialized tail paths. But scalar tail handling is usually the best starting point.

Saturating Arithmetic Example

MMX was often used for image processing because it supported saturating arithmetic.

For example, unsigned 8-bit saturated addition clamps at 255 instead of wrapping around.

Normal unsigned byte addition:

250 + 20 = 14   // wraparound

Saturated unsigned byte addition:

250 + 20 = 255  // clamp

MMX version:

#include <mmintrin.h>

__m64 add_saturated_mmx(__m64 a, __m64 b)
{
    return _mm_adds_pu8(a, b);
}

SSE2 version:

#include <emmintrin.h>

__m128i add_saturated_sse2(__m128i a, __m128i b)
{
    return _mm_adds_epu8(a, b);
}

The operation is conceptually the same, but the SSE2 version processes sixteen bytes at once instead of eight.

This is one of the easiest MMX-to-SSE2 migrations.

Full Example: Saturating Add for Image Pixels

#include <stdint.h>
#include <stddef.h>
#include <emmintrin.h>

void add_saturated_u8_sse2(
    uint8_t* dst,
    const uint8_t* a,
    const uint8_t* b,
    size_t count)
{
    size_t i = 0;

    for (; i + 16 <= count; i += 16)
    {
        __m128i va = _mm_loadu_si128((const __m128i*)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i*)(b + i));

        __m128i vr = _mm_adds_epu8(va, vb);

        _mm_storeu_si128((__m128i*)(dst + i), vr);
    }

    for (; i < count; ++i)
    {
        unsigned int sum = (unsigned int)a[i] + (unsigned int)b[i];
        dst[i] = (sum > 255) ? 255 : (uint8_t)sum;
    }
}

This is a typical SSE2 replacement for old MMX image-processing code.

Shuffling: PSHUFW vs SSE2 Shuffles

One of the less direct parts of MMX-to-SSE2 migration is shuffling.

The SSE-era MMX instruction PSHUFW shuffles four 16-bit words inside a 64-bit MMX register.

In SSE2, a 128-bit XMM register contains eight 16-bit words. There is no single SSE2 instruction that performs an arbitrary shuffle of all eight 16-bit words.

Instead, SSE2 provides instructions such as:

PSHUFLW   shuffle low four 16-bit words
PSHUFHW   shuffle high four 16-bit words
PSHUFD    shuffle four 32-bit doublewords

This means a direct replacement may require more than one instruction.

For example:

  • use PSHUFLW to shuffle the low 64-bit half;
  • use PSHUFHW to shuffle the high 64-bit half;
  • use PSHUFD if the operation can be expressed at 32-bit granularity;
  • use unpack instructions when moving values between halves.

Later instruction sets make shuffling easier. SSSE3 added PSHUFB, which is much more flexible for byte-level rearrangement.

But if the target is strictly SSE2, shuffling sometimes requires careful redesign.

Shift Instructions: Lane Shifts vs Whole-Register Shifts

Another common migration problem is shifting.

MMX has 64-bit shifts such as:

PSLLQ
PSRLQ

SSE2 also has XMM versions of these instructions, but they operate on each 64-bit lane independently.

A 128-bit XMM register contains two 64-bit lanes:

[ high 64 bits ][ low 64 bits ]

If you use PSLLQ on an XMM register, each 64-bit lane is shifted separately.

That is not the same as shifting the entire 128-bit register as one large integer.

SSE2 also provides byte-shift instructions:

PSLLDQ
PSRLDQ

These shift the entire 128-bit register left or right by a number of bytes.

So the correct replacement depends on what the MMX code meant:

  • if the old code shifted independent 64-bit values, use the XMM version of PSLLQ or PSRLQ;
  • if the new code needs a whole-register byte shift, use PSLLDQ or PSRLDQ;
  • if the new code needs a true 128-bit bit shift across the whole register, you may need a combination of shifts, masks, and OR operations.

This is one place where blindly changing register names can produce wrong results.

Moving Between MMX and XMM

SSE2 includes instructions for moving 64-bit data between MMX and XMM registers:

MOVQ2DQ   move quadword from MMX to XMM
MOVDQ2Q   move quadword from XMM to MMX

In intrinsics, these correspond to operations such as:

__m128i _mm_movpi64_epi64(__m64 a);
__m64   _mm_movepi64_pi64(__m128i a);

These can be useful when bridging old MMX code and newer SSE2 code.

However, they should not be used as a long-term programming model for new code.

Mixing MMX and SSE2 creates unnecessary complexity:

  • MMX still requires _mm_empty();
  • MMX interacts with x87 state;
  • values must be moved between different register files;
  • the code becomes harder to maintain.

The better approach is usually to migrate the whole hot loop to XMM registers and remove MMX entirely.

Pure SSE2 Code Does Not Need _mm_empty()

This is one of the biggest benefits of SSE2 over MMX.

MMX code:

#include <mmintrin.h>

void old_mmx_code(void)
{
    __m64 a = _mm_set_pi16(4, 3, 2, 1);
    __m64 b = _mm_set_pi16(8, 7, 6, 5);

    __m64 c = _mm_add_pi16(a, b);

    _mm_empty();
}

SSE2 code:

#include <emmintrin.h>

void new_sse2_code(void)
{
    __m128i a = _mm_set_epi16(8, 7, 6, 5, 4, 3, 2, 1);
    __m128i b = _mm_set_epi16(16, 15, 14, 13, 12, 11, 10, 9);

    __m128i c = _mm_add_epi16(a, b);

    // No _mm_empty() needed.
}

As long as the code uses only XMM registers and does not touch __m64, mm0mm7, or MMX intrinsics, _mm_empty() is not required.

SSE2 and x86-64

SSE2 is especially important because it is part of the x86-64 baseline.

That means a normal 64-bit x86 program can generally assume SSE2 support.

This changed the role of SSE2. It is no longer an exotic optional extension for Pentium 4-era software. It is the basic SIMD foundation for modern x86-64 code.

For portable modern x86 code, the usual baseline hierarchy is:

scalar
SSE2
SSSE3 / SSE4.1
AVX2
AVX-512

MMX is no longer a good baseline for new code.

If you are maintaining old MMX routines, SSE2 is usually the first migration target.

SSE2 Does Not Automatically Mean Twice as Fast

SSE2 doubles the register width compared with MMX, but that does not automatically double performance.

Several factors can limit the speedup:

  • memory bandwidth;
  • cache misses;
  • unaligned loads and stores;
  • extra shuffle instructions;
  • loop overhead;
  • dependency chains;
  • instruction latency;
  • store-forwarding issues;
  • scalar tail handling;
  • compiler code generation;
  • whether the algorithm is compute-bound or memory-bound.

For example, a simple memory copy-like loop may already be limited by memory bandwidth. In that case, doubling vector width may not double performance.

On the other hand, an arithmetic-heavy loop with good data locality may benefit significantly.

Always benchmark the real workload.

Porting Checklist: MMX to SSE2

When converting MMX code to SSE2, use this checklist.

1. Replace __m64 with __m128i

Old MMX:

__m64 v;

New SSE2:

__m128i v;

2. Replace MMX intrinsics with SSE2 integer intrinsics

Old MMX:

_mm_add_pi16(a, b)

New SSE2:

_mm_add_epi16(a, b)

3. Replace 64-bit loads with 128-bit loads

Old MMX:

__m64 v = *(__m64 const*)ptr;

New SSE2:

__m128i v = _mm_loadu_si128((const __m128i*)ptr);

Use _mm_load_si128 only if the pointer is guaranteed to be 16-byte aligned.

4. Replace 64-bit stores with 128-bit stores

Old MMX:

*(__m64*)ptr = v;

New SSE2:

_mm_storeu_si128((__m128i*)ptr, v);

Use _mm_store_si128 only for 16-byte aligned destinations.

5. Update loop steps

Old MMX byte loop:

i += 8;

New SSE2 byte loop:

i += 16;

Old MMX 16-bit loop:

i += 4;

New SSE2 16-bit loop:

i += 8;

6. Recheck all shuffles

Do not assume PSHUFW has a direct 128-bit equivalent.

You may need:

PSHUFLW
PSHUFHW
PSHUFD
UNPCKL*
UNPCKH*

7. Recheck all shifts

Decide whether you need:

  • per-element shifts;
  • per-64-bit-lane shifts;
  • whole-register byte shifts;
  • a true 128-bit bit shift.

These are not the same operation.

8. Remove _mm_empty() if MMX is completely gone

If the function no longer uses MMX or __m64, remove _mm_empty().

If any MMX remains, keep _mm_empty() at the end of the MMX section.

9. Benchmark

SSE2 should improve many old MMX routines, but performance must be measured.

Use realistic data sizes and realistic alignment.

Example Migration: MMX to SSE2

Original MMX Version

#include <stdint.h>
#include <stddef.h>
#include <mmintrin.h>

void brighten_mmx(uint8_t* pixels, uint8_t amount, size_t count)
{
    size_t i = 0;

    __m64 increment = _mm_set1_pi8((char)amount);

    for (; i + 8 <= count; i += 8)
    {
        __m64 p = *(__m64 const*)(pixels + i);
        __m64 r = _mm_adds_pu8(p, increment);
        *(__m64*)(pixels + i) = r;
    }

    _mm_empty();

    for (; i < count; ++i)
    {
        unsigned int value = pixels[i] + amount;
        pixels[i] = (value > 255) ? 255 : (uint8_t)value;
    }
}

SSE2 Version

#include <stdint.h>
#include <stddef.h>
#include <emmintrin.h>

void brighten_sse2(uint8_t* pixels, uint8_t amount, size_t count)
{
    size_t i = 0;

    __m128i increment = _mm_set1_epi8((char)amount);

    for (; i + 16 <= count; i += 16)
    {
        __m128i p = _mm_loadu_si128((const __m128i*)(pixels + i));
        __m128i r = _mm_adds_epu8(p, increment);
        _mm_storeu_si128((__m128i*)(pixels + i), r);
    }

    for (; i < count; ++i)
    {
        unsigned int value = pixels[i] + amount;
        pixels[i] = (value > 255) ? 255 : (uint8_t)value;
    }
}

The SSE2 version is shorter, processes twice as many pixels per vector iteration, and avoids MMX cleanup.

What About SSE2 Double-Precision Floating Point?

This article focuses on SSE2 as a replacement for MMX integer SIMD, but SSE2 also added double-precision floating-point SIMD.

SSE introduced packed single-precision floating-point operations:

4 floats per XMM register

SSE2 added packed double-precision floating-point operations:

2 doubles per XMM register

For example:

#include <emmintrin.h>

void add_two_doubles(double* dst, const double* a, const double* b)
{
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);

    __m128d vr = _mm_add_pd(va, vb);

    _mm_storeu_pd(dst, vr);
}

This was another major reason SSE2 became important. It allowed x86 code to move away from x87 floating-point and toward XMM-based scalar and vector floating-point operations.

MMX, SSE2, and Compiler Support

When SSE2 first appeared, programming it required hand-written assembly or compiler intrinsics. Compiler support was not as mature as it is today.

Today the situation is different.

Modern C and C++ compilers understand SSE2 very well. They can generate SSE2 automatically for many scalar loops, especially at higher optimization levels. Intrinsics are also widely supported.

Common compiler flags include:

-msse2

for GCC and Clang on x86 targets where SSE2 is not already the default.

On x86-64, SSE2 is normally part of the default target.

For Microsoft Visual C++, SSE2 support is standard for x64 builds.

This makes SSE2 much more practical today than it was when Pentium 4 was new.

Should You Still Use MMX?

For new code, almost never.

Use MMX only when:

  • maintaining old code;
  • studying historical SIMD programming;
  • targeting very old processors;
  • preserving a legacy assembly routine that is not worth rewriting.

Use SSE2 or later when:

  • writing new x86-64 code;
  • processing packed integers;
  • replacing old MMX code;
  • mixing integer SIMD and floating-point code;
  • avoiding _mm_empty();
  • building portable SIMD paths for modern systems.

A good rule is:

If your code uses __m64, it is probably legacy code.

Modern x86 SIMD code should normally use __m128i, __m256i, or __m512i, depending on the instruction set being targeted.

From SSE2 to AVX2 and AVX-512

SSE2 is not the end of the story.

Later instruction sets extended the same idea to wider registers:

Instruction setInteger vector typeWidth
MMX__m6464-bit
SSE2__m128i128-bit
AVX2__m256i256-bit
AVX-512__m512i512-bit

AVX2 is the natural successor to SSE2 for 256-bit integer SIMD. AVX-512 extends the model further with 512-bit registers and mask registers.

However, SSE2 remains important because it is the compatibility baseline for x86-64.

A practical modern library may provide several implementations:

scalar fallback
SSE2 implementation
SSSE3 or SSE4.1 implementation
AVX2 implementation
AVX-512 implementation

At runtime, the program selects the best supported implementation.

Common Mistakes

Mistake 1: Replacing mm0 With xmm0 and Assuming the Code Is Correct

SSE2 is wider, so loop counters, memory offsets, tail handling, and data layout must be reviewed.

Mistake 2: Using Aligned Loads on Unaligned Memory

This is unsafe:

__m128i v = _mm_load_si128((const __m128i*)ptr);

unless ptr is guaranteed to be 16-byte aligned.

Use this when alignment is unknown:

__m128i v = _mm_loadu_si128((const __m128i*)ptr);

Mistake 3: Keeping _mm_empty() in Pure SSE2 Code

Pure SSE2 code does not need _mm_empty().

If the function uses only XMM registers, _mm_empty() is unnecessary.

Mistake 4: Removing _mm_empty() While Some MMX Still Remains

If any code still uses __m64 or MMX registers, _mm_empty() may still be required.

Mistake 5: Assuming SSE2 Has Perfect Replacements for Every Shuffle

SSE2 has useful shuffle instructions, but it is not as flexible as SSSE3 or later.

Some MMX shuffle patterns require redesign.

Mistake 6: Ignoring Memory Bandwidth

A wider SIMD loop can still be limited by memory speed.

If the loop is just loading, storing, and doing one cheap operation, memory bandwidth may dominate.

Best Practices

For MMX-to-SSE2 migration:

  • replace __m64 with __m128i;
  • use <emmintrin.h>;
  • use _mm_loadu_si128 and _mm_storeu_si128 unless alignment is guaranteed;
  • process twice as many elements per loop iteration;
  • review every shuffle and shift;
  • add scalar tail handling;
  • remove _mm_empty() only when all MMX code is gone;
  • benchmark before and after.

For new code:

  • start with clear scalar code;
  • let the compiler auto-vectorize if possible;
  • use SSE2 intrinsics when you need explicit control;
  • use AVX2 or AVX-512 only when the target hardware justifies it;
  • avoid MMX entirely.

Summary

MMX introduced packed integer SIMD to x86, but it was limited by 64-bit registers and its awkward relationship with the x87 floating-point unit.

SSE2 provided the real replacement path.

It moved packed integer SIMD into 128-bit XMM registers, doubled the amount of data processed per instruction, avoided the MMX/x87 state problem, and became part of the x86-64 baseline.

The migration from MMX to SSE2 is usually straightforward in concept:

__m64   -> __m128i
64-bit  -> 128-bit
MMX     -> XMM
8 bytes -> 16 bytes

But it is not a mechanical register-name replacement. Loads, stores, alignment, loop counters, shuffles, shifts, and tail handling must all be reviewed.

For modern code, the conclusion is simple:

MMX is legacy. SSE2 is the minimum practical SIMD baseline for x86-64 integer code.

If you maintain old MMX routines, SSE2 is usually the first and most natural upgrade path.

References