SIMD on x64/x86

MMX EMMS: Why _mm_empty() Is Required After MMX Code

MMX was Intel’s first widely used SIMD instruction set for x86 processors. It introduced packed integer operations that could process multiple small values at once inside a single 64-bit register.

For its time, this was extremely useful. MMX made it possible to accelerate image processing, audio processing, video decoding, graphics effects, and other multimedia workloads on ordinary desktop CPUs.

However, MMX has one important design detail that often surprises developers when they first encounter it:

MMX registers share state with the old x87 floating-point unit.

Because of this, MMX code must be cleaned up before the program returns to x87 floating-point code. The instruction that performs this cleanup is called EMMS.

In C and C++, the MMX intrinsic for this instruction is:

_mm_empty();

If you use MMX intrinsics, understanding when and why to call _mm_empty() is essential.

The Short Rule

The practical rule is simple:

After using MMX instructions, call _mm_empty() before executing floating-point code or before returning to code that might execute floating-point code.

A typical MMX function should look like this:

#include <mmintrin.h>

void process_with_mmx(void)
{
    // MMX operations here.

    _mm_empty();   // Clean up MMX/x87 state.
}

You do not need to call _mm_empty() before every MMX instruction.

You usually need to call it once, after the MMX work is finished.

Why MMX Needs Cleanup

The eight MMX registers are named:

mm0 mm1 mm2 mm3 mm4 mm5 mm6 mm7

Each register is 64 bits wide.

The unusual part is that these registers are not completely independent from the older x87 floating-point registers. Architecturally, the MMX registers are aliased on top of the x87 FPU register stack.

The x87 floating-point unit also has eight registers. Unlike normal general-purpose registers, the x87 registers behave like a stack. Internally, the processor tracks which x87 stack entries are empty and which contain valid floating-point values.

When an MMX instruction executes, it uses the same architectural register storage that x87 uses, and it also changes the x87 bookkeeping: every MMX instruction other than EMMS marks all eight x87 tag entries as valid and resets the x87 top-of-stack pointer to 0.

From the point of view of later x87 floating-point instructions, the floating-point stack therefore appears to be completely full.

That is the problem EMMS solves.

What EMMS Does

EMMS stands for:

Empty MMX State

The instruction sets every entry in the x87 tag word to empty, so the x87 floating-point register stack reads as empty again.

It does not perform arithmetic. It does not convert MMX values. It does not save your MMX registers. It does not restore previous floating-point values.

Its purpose is to tell the processor:

MMX code is finished. The x87 floating-point stack can be used normally again.

In C or C++, this instruction is exposed through the _mm_empty() intrinsic:

#include <mmintrin.h>

_mm_empty();
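The effect on the tag word can be observed directly. The following is a minimal sketch, not a definitive implementation: it assumes GCC or Clang inline assembly on an x86 or x86-64 target, and the helper names count_empty_x87_slots, touch_mmx, and run_emms are hypothetical. It stores the x87 environment with FNSTENV and counts how many stack slots are tagged as empty.

```c
#include <stdint.h>
#include <string.h>

/* Count the x87 stack slots whose tag bits read 11 (empty).
   Assumption: GCC or Clang inline assembly, x86 or x86-64 target. */
int count_empty_x87_slots(void)
{
    struct { uint8_t bytes[28]; } env;   /* 28-byte FNSTENV image */
    uint16_t tag_word;
    int empty = 0;

    /* FNSTENV masks all x87 exceptions as a side effect, so the saved
       environment is reloaded immediately with FLDENV to undo that. */
    __asm__ volatile("fnstenv %0\n\tfldenv %0" : "=m"(env));

    /* In this layout the tag word is stored at byte offset 8. */
    memcpy(&tag_word, env.bytes + 8, sizeof tag_word);

    for (int slot = 0; slot < 8; ++slot)
    {
        if (((tag_word >> (2 * slot)) & 3) == 3)
        {
            ++empty;
        }
    }
    return empty;
}

/* Execute one real MMX instruction. Inline assembly is used here because
   some compilers implement __m64 intrinsics with SSE registers on x86-64. */
void touch_mmx(void)
{
    __asm__ volatile("pxor %%mm0, %%mm0" ::: "mm0");
}

/* Execute EMMS directly. */
void run_emms(void)
{
    __asm__ volatile("emms");
}
```

Calling run_emms(), then touch_mmx(), then run_emms() again, with count_empty_x87_slots() in between, should report all eight slots empty, then none, then all eight again.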

What Happens If You Forget _mm_empty()

If MMX code is followed by x87 floating-point code without first executing EMMS, the x87 unit sees every stack slot as occupied, so the first instruction that pushes a value onto the stack fails.

This can cause several kinds of problems:

  • floating-point stack overflow on the first x87 load;
  • invalid-operation exceptions, if that exception is unmasked;
  • NaN results instead of the expected values, when the exception is masked (the default);
  • failures that only appear in some builds or on some systems;
  • bugs that disappear when the compiler changes how floating-point code is generated.

This makes missing _mm_empty() especially dangerous in library code.

Your MMX function might not use floating-point arithmetic directly, but the caller might. If your function returns without cleaning the MMX state, the caller can be affected by a hidden side effect.

Incorrect Example

The following example uses MMX and then executes floating-point code without cleaning up the MMX state:

#include <mmintrin.h>
#include <math.h>

double bad_example(__m64 a, __m64 b)
{
    __m64 c = _mm_add_pi16(a, b);

    // Problem:
    // MMX state has not been cleared before floating-point code.
    return sqrt(25.0);
}

This may appear to work on some modern systems, especially if the compiler uses SSE instructions for floating-point arithmetic instead of x87 instructions.

However, the code is still wrong as MMX code. It leaves the processor in a state that is unsafe for later x87 floating-point instructions.

Correct Example

The correct version calls _mm_empty() after the MMX work is complete:

#include <mmintrin.h>
#include <math.h>

double good_example(__m64 a, __m64 b)
{
    __m64 c = _mm_add_pi16(a, b);

    // MMX code is finished.
    _mm_empty();

    // Floating-point code can now run safely.
    return sqrt(25.0);
}

This is the essential pattern:

  1. Run MMX code.
  2. Finish all MMX operations.
  3. Call _mm_empty().
  4. Continue with ordinary floating-point code.

Place _mm_empty() at the Boundary

In most real programs, the best place to call _mm_empty() is at the boundary between MMX code and non-MMX code.

For example:

#include <mmintrin.h>

void process_pixels_mmx(const short* input, short* output, int count)
{
    // Loop-invariant constant: 1 in each packed 16-bit element.
    const __m64 one = _mm_set1_pi16(1);
    int i = 0;

    for (; i + 4 <= count; i += 4)
    {
        // Keep the const qualifier when reinterpreting the input pointer.
        __m64 values = *(const __m64*)&input[i];

        // Example operation:
        // Add 1 to each packed 16-bit element.
        __m64 result = _mm_add_pi16(values, one);

        *(__m64*)&output[i] = result;
    }

    // End of MMX section.
    _mm_empty();

    // Scalar cleanup for the last count % 4 elements.
    for (; i < count; ++i)
    {
        output[i] = (short)(input[i] + 1);
    }
}

The important point is that _mm_empty() is not part of the inner computation. It is part of the cleanup after the MMX computation.

Avoid Calling _mm_empty() Inside Tight Loops

This is usually wrong:

for (int i = 0; i + 4 <= count; i += 4)
{
    __m64 values = *(const __m64*)&input[i];

    __m64 one = _mm_set_pi16(1, 1, 1, 1);
    __m64 result = _mm_add_pi16(values, one);

    *(__m64*)&output[i] = result;

    _mm_empty();   // Usually unnecessary here.
}

Calling _mm_empty() inside the loop adds the cost of an EMMS instruction to every iteration and provides no benefit unless each iteration genuinely needs to switch from MMX to x87 floating-point code.

Prefer this:

for (int i = 0; i + 4 <= count; i += 4)
{
    __m64 values = *(const __m64*)&input[i];

    __m64 one = _mm_set_pi16(1, 1, 1, 1);
    __m64 result = _mm_add_pi16(values, one);

    *(__m64*)&output[i] = result;
}

_mm_empty();

Call _mm_empty() once after the MMX section is complete.

_mm_empty() Is Not an Initialization Instruction

A common misunderstanding is to think that _mm_empty() prepares the MMX unit before use.

That is not its purpose.

You do not need to write this:

_mm_empty();

// Start MMX code here.

_mm_empty() is a cleanup instruction, not a setup instruction.

It belongs after MMX code, not before it.

_mm_empty() Does Not Clear Sensitive Data

Another important detail is that _mm_empty() should not be treated as a secure data-clearing instruction.

The instruction marks the x87 register stack as empty. It does not exist to securely wipe register contents. If you are handling cryptographic keys or other sensitive information, _mm_empty() is not a substitute for proper data clearing.

Its purpose is architectural cleanup between MMX and x87 floating-point code.

What Counts as MMX Code?

You should consider code to be MMX code if it uses any of the following:

  • MMX intrinsics from <mmintrin.h>;
  • the __m64 data type;
  • inline assembly using mm0 through mm7;
  • older SSE integer intrinsics that operate on __m64.

This last point is important. Some old intrinsics use the __m64 type even though they are declared in SSE headers and appear in SSE-related documentation, for example _mm_avg_pu8 and _mm_cvtps_pi32 from <xmmintrin.h>. If an intrinsic operates on __m64, it uses the MMX register file and therefore shares the same cleanup requirements.

A good practical rule is:

If your code touches __m64, assume _mm_empty() is required before returning to normal floating-point code.
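As an illustration, here is a small sketch built on _mm_avg_pu8, an SSE intrinsic from <xmmintrin.h> that takes and returns __m64. The function name average_bytes8 is hypothetical, and memcpy is used for loads and stores only to avoid alignment and aliasing assumptions. Even though the header is an SSE header, the function still ends with _mm_empty().

```c
#include <stdint.h>
#include <string.h>
#include <xmmintrin.h>

/* Rounded average of two 8-byte blocks: out[i] = (a[i] + b[i] + 1) >> 1.
   Hypothetical helper name, for illustration only. */
void average_bytes8(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    __m64 va, vb, avg;

    memcpy(&va, a, sizeof va);
    memcpy(&vb, b, sizeof vb);

    /* Declared in an SSE header, but it operates on MMX registers. */
    avg = _mm_avg_pu8(va, vb);

    memcpy(out, &avg, sizeof avg);

    /* __m64 was used, so clean up before returning. */
    _mm_empty();
}
```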

Is _mm_empty() Needed After SSE or SSE2?

No, not for normal SSE or SSE2 code that uses XMM registers.

SSE introduced a separate set of 128-bit registers:

xmm0 xmm1 xmm2 ...

These registers are separate from the x87 floating-point stack. Unlike the MMX registers, they do not alias it.

Therefore, if your code uses only types such as:

__m128
__m128i
__m128d

then _mm_empty() is not needed.

For example, this kind of SSE2 code does not require _mm_empty():

#include <emmintrin.h>

void process_with_sse2(void)
{
    __m128i a = _mm_set1_epi16(10);
    __m128i b = _mm_set1_epi16(20);
    __m128i c = _mm_add_epi16(a, b);

    // No _mm_empty() required for XMM-only code.
}

The key distinction is:

  • __m64 uses MMX registers;
  • __m128, __m128i, and __m128d use XMM registers.

Only the MMX path requires EMMS.

Why Modern Code Should Prefer SSE2 or Later

MMX is historically important, but it is rarely the best choice for new code today.

For modern x86 development, SSE2 or later is usually preferable because:

  • SSE2 is available on all x86-64 processors;
  • XMM registers are wider than MMX registers;
  • XMM registers do not alias the x87 stack;
  • SSE2 supports packed integer operations similar to MMX;
  • SSE2 code avoids the need for _mm_empty();
  • newer instruction sets such as SSSE3, SSE4.1, AVX2, and AVX-512 provide even more powerful SIMD operations.

For example, instead of processing four 16-bit integers at a time with MMX:

__m64 a = _mm_set_pi16(4, 3, 2, 1);
__m64 b = _mm_set_pi16(8, 7, 6, 5);
__m64 c = _mm_add_pi16(a, b);

_mm_empty();

Modern code would usually use SSE2 and process eight 16-bit integers at a time:

#include <emmintrin.h>

__m128i a = _mm_set_epi16(8, 7, 6, 5, 4, 3, 2, 1);
__m128i b = _mm_set_epi16(16, 15, 14, 13, 12, 11, 10, 9);
__m128i c = _mm_add_epi16(a, b);

// No _mm_empty() needed.
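For comparison with an MMX loop that adds 1 to each 16-bit element, a complete SSE2 version might look like the sketch below. The function name add_one_sse2 is hypothetical. It assumes nothing about pointer alignment, so it uses the unaligned load and store intrinsics, and it finishes the last few elements with scalar code.

```c
#include <emmintrin.h>

/* Add 1 to every 16-bit element, eight elements per SSE2 iteration.
   Hypothetical function name, for illustration only. */
void add_one_sse2(const short *input, short *output, int count)
{
    const __m128i one = _mm_set1_epi16(1);
    int i = 0;

    for (; i + 8 <= count; i += 8)
    {
        /* Unaligned load and store: no alignment requirement on the arrays. */
        __m128i values = _mm_loadu_si128((const __m128i *)&input[i]);
        __m128i result = _mm_add_epi16(values, one);
        _mm_storeu_si128((__m128i *)&output[i], result);
    }

    /* Scalar tail for the last count % 8 elements. */
    for (; i < count; ++i)
    {
        output[i] = (short)(input[i] + 1);
    }

    /* XMM-only code: no _mm_empty() call is needed. */
}
```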

That said, MMX still matters when maintaining old code, studying legacy multimedia libraries, or supporting historical x86 optimization paths.

_mm_empty() in Library Code

The safest place to use _mm_empty() is inside the function that uses MMX.

Do not force callers to know that your function used MMX internally.

For example, this is good library behavior:

void apply_filter_mmx(const short* input, short* output, int count)
{
    // MMX implementation.

    _mm_empty();
}

This is dangerous:

void apply_filter_mmx(const short* input, short* output, int count)
{
    // MMX implementation.

    // No _mm_empty() here.
    // Caller is now responsible for cleanup.
}

The caller should not need to know that the implementation used MMX. A function should not leave hidden MMX/x87 state behind unless that behavior is explicitly documented and intentional.

For ordinary application and library code, clean up before returning.
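A fuller sketch of such a library routine is shown below. The function name add_saturate_i16 and its return convention are hypothetical. The early return happens before any MMX instruction runs, so it needs no cleanup; after the MMX loop, _mm_empty() is called once, and the scalar tail and the final return come after it.

```c
#include <limits.h>
#include <mmintrin.h>
#include <stddef.h>
#include <string.h>

/* Saturating add of two int16 arrays.
   Returns 0 on success, -1 on invalid arguments.
   Hypothetical function, for illustration only. */
int add_saturate_i16(const short *a, const short *b, short *out, size_t count)
{
    size_t i = 0;

    /* No MMX state has been touched yet, so returning here is safe. */
    if (a == NULL || b == NULL || out == NULL)
    {
        return -1;
    }

    for (; i + 4 <= count; i += 4)
    {
        __m64 va, vb, sum;

        memcpy(&va, a + i, sizeof va);
        memcpy(&vb, b + i, sizeof vb);
        sum = _mm_adds_pi16(va, vb);   /* signed saturating add */
        memcpy(out + i, &sum, sizeof sum);
    }

    /* End of the MMX section: clean up once, before any other code runs. */
    _mm_empty();

    /* Scalar tail with the same saturating behavior. */
    for (; i < count; ++i)
    {
        int s = a[i] + b[i];

        if (s > SHRT_MAX) s = SHRT_MAX;
        if (s < SHRT_MIN) s = SHRT_MIN;
        out[i] = (short)s;
    }

    return 0;
}
```

Callers see an ordinary C function and never need to know that MMX was involved.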

Interaction With Modern Compilers

On modern x86-64 systems, compilers typically use SSE or SSE2 instructions for ordinary floating-point arithmetic instead of x87 instructions.

Because of this, forgetting _mm_empty() may not immediately break every program.

However, that does not make the omission safe.

There are still cases where x87 code may appear:

  • legacy 32-bit builds;
  • old libraries;
  • inline assembly;
  • compiler options that select x87 floating-point;
  • long double operations on some platforms;
  • third-party binary code;
  • mixed old and new codebases.

If MMX code can be followed by any of these, _mm_empty() is still required.

The safest habit is simple:

If a function uses MMX, call _mm_empty() before the function returns.
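The long double case is easy to hit by accident. The sketch below, with the hypothetical name mean_of_four_i16, sums four 16-bit values with MMX and then divides in long double, which some platforms (for example GCC on x86-64 Linux) implement with x87 instructions. The _mm_empty() call sits between the two parts.

```c
#include <mmintrin.h>
#include <string.h>

/* Mean of four 16-bit values, returned as long double.
   Hypothetical function, for illustration only. */
long double mean_of_four_i16(const short *values)
{
    __m64 v, sums;
    int pair_sums[2];

    memcpy(&v, values, sizeof v);

    /* Multiply each element by 1 and add adjacent pairs:
       the result holds (v0 + v1) and (v2 + v3) as 32-bit integers. */
    sums = _mm_madd_pi16(v, _mm_set1_pi16(1));
    memcpy(pair_sums, &sums, sizeof sums);

    /* MMX work is finished: clean up before any long double arithmetic. */
    _mm_empty();

    return (long double)(pair_sums[0] + pair_sums[1]) / 4.0L;
}
```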

Common Mistakes

Mistake 1: Forgetting _mm_empty()

void mmx_function(void)
{
    // MMX code.

    // Missing _mm_empty().
}

This leaves the MMX/x87 state dirty.

Mistake 2: Calling _mm_empty() before every MMX operation

_mm_empty();
__m64 a = _mm_set_pi16(4, 3, 2, 1);

_mm_empty();
__m64 b = _mm_set_pi16(8, 7, 6, 5);

_mm_empty();
__m64 c = _mm_add_pi16(a, b);

This is unnecessary and inefficient.

Mistake 3: Assuming SSE2 needs _mm_empty()

#include <emmintrin.h>

__m128i a = _mm_set1_epi16(1);
__m128i b = _mm_set1_epi16(2);
__m128i c = _mm_add_epi16(a, b);

_mm_empty();   // Not needed for XMM-only SSE2 code.

_mm_empty() is for MMX state, not for ordinary XMM-based SSE2 code.

Mistake 4: Returning from MMX library code without cleanup

int compute_value_mmx(void)
{
    // MMX code.

    return 42;  // Dangerous if _mm_empty() was not called.
}

The caller may execute x87 floating-point code after this function returns. The function should clean up after itself.

Best Practice Checklist

Use _mm_empty() when:

  • your function uses MMX intrinsics;
  • your function uses the __m64 type;
  • your inline assembly touches mm0 through mm7;
  • your code may be followed by x87 floating-point operations;
  • you are writing a library function that hides its MMX implementation from callers.

Do not use _mm_empty() when:

  • your code uses only SSE/SSE2 XMM registers;
  • your code uses only AVX YMM registers;
  • your code uses only AVX-512 ZMM registers;
  • you are trying to initialize MMX before use;
  • you are trying to securely erase register contents;
  • you are inside a tight loop with no transition to x87 floating-point code.

Summary

MMX and x87 floating-point share architectural state. This is the reason the EMMS instruction exists.

When MMX instructions execute, they leave the x87 floating-point register stack marked as full, which makes it unusable by later x87 instructions. The _mm_empty() intrinsic emits EMMS, marking the x87 stack as empty again.

The practical rule is:

After using MMX, call _mm_empty() before executing floating-point code or before returning to code that might execute floating-point code.

For modern SIMD development, prefer SSE2 or later whenever possible. But when maintaining or studying MMX code, _mm_empty() remains one of the most important details to get right.