SIMD on x64/x86

SSE State Management: MXCSR, FXSAVE, FXRSTOR, and FP control

SSE state management is the part of SIMD programming concerned with the processor state used by SSE floating-point instructions.

Most SSE code does not need explicit state management. If you are writing ordinary code with intrinsics such as _mm_add_ps, _mm_mul_ps, _mm_loadu_ps, and _mm_storeu_ps, the compiler, operating system, and calling convention normally take care of the required register handling.

However, there are cases where SSE state matters:

  • changing floating-point rounding behavior,
  • enabling flush-to-zero mode,
  • handling denormal floating-point values,
  • writing low-level runtime code,
  • writing operating-system, kernel, emulator, debugger, or context-switching code,
  • mixing old MMX/x87 code with SSE code,
  • saving or restoring full floating-point and SIMD processor state.

This article explains what SSE state is, what the MXCSR register controls, how to read and write it from C/C++, and when full state-save instructions such as FXSAVE and FXRSTOR are relevant.

What “SSE state” means

SSE state is the set of registers and control bits that SSE instructions read and write.

At a high level, this includes:

State                     Meaning
XMM registers             128-bit registers used by SSE/SSE2 SIMD instructions
MXCSR register            control and status register for SSE floating-point behavior
exception flags           floating-point exception status bits
exception masks           bits controlling whether exceptions are masked
rounding mode             how floating-point results are rounded
flush-to-zero mode        whether tiny results are flushed to zero
denormals-are-zero mode   whether tiny input values are treated as zero on processors that support it

For most application code, the most relevant part is MXCSR.

The XMM registers are ordinary architectural registers. The compiler decides how to use them, and the operating system saves and restores them when switching between threads.

MXCSR is different because it controls floating-point behavior. If your code changes it and does not restore it, later code in the same thread may run with different floating-point settings.

The MXCSR register

MXCSR is the SSE control and status register.

It controls and reports SIMD floating-point behavior for SSE instructions. It includes exception flags, exception masks, rounding mode, and performance-related behavior such as flush-to-zero.

A simplified MXCSR layout looks like this:

Bits    Name      Meaning
0       IE        Invalid operation flag
1       DE        Denormal flag
2       ZE        Divide-by-zero flag
3       OE        Overflow flag
4       UE        Underflow flag
5       PE        Precision flag
6       DAZ       Denormals Are Zero, if supported
7       IM        Invalid operation mask
8       DM        Denormal mask
9       ZM        Divide-by-zero mask
10      OM        Overflow mask
11      UM        Underflow mask
12      PM        Precision mask
13–14   RC        Rounding control
15      FTZ       Flush To Zero
16–31   Reserved  Must not be set unless supported

The exact supported bits can depend on the processor. Do not write arbitrary constants into MXCSR without preserving or masking reserved bits.
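The layout above can be turned into a small decoder. The following is a minimal sketch; the constant names, the MxcsrFields struct, and decode_mxcsr are hypothetical helpers introduced for illustration and are not part of any header. It only splits a 32-bit value into fields and never touches the real register. The value 0x1F80 used in the checks is the power-on default of MXCSR: all six exceptions masked, round to nearest, FTZ and DAZ off.

```cpp
#include <cstdint>

// Hypothetical names for the MXCSR fields listed in the layout table.
constexpr std::uint32_t kMxcsrFlagsMask  = 0x3Fu;     // bits 0-5: IE, DE, ZE, OE, UE, PE
constexpr std::uint32_t kMxcsrDaz        = 1u << 6;   // bit 6: Denormals Are Zero
constexpr std::uint32_t kMxcsrMasksShift = 7;         // bits 7-12: IM, DM, ZM, OM, UM, PM
constexpr std::uint32_t kMxcsrRoundShift = 13;        // bits 13-14: rounding control
constexpr std::uint32_t kMxcsrFtz        = 1u << 15;  // bit 15: Flush To Zero

struct MxcsrFields
{
    std::uint32_t exceptionFlags;   // status flags, packed into bits 0-5
    bool          daz;
    std::uint32_t exceptionMasks;   // mask bits, shifted down into bits 0-5
    std::uint32_t roundingControl;  // 0 = nearest, 1 = down, 2 = up, 3 = toward zero
    bool          ftz;
};

// Splits a raw MXCSR value into its fields. Does not read or write the register.
constexpr MxcsrFields decode_mxcsr(std::uint32_t mxcsr)
{
    return MxcsrFields{
        mxcsr & kMxcsrFlagsMask,
        (mxcsr & kMxcsrDaz) != 0,
        (mxcsr >> kMxcsrMasksShift) & 0x3Fu,
        (mxcsr >> kMxcsrRoundShift) & 0x3u,
        (mxcsr & kMxcsrFtz) != 0
    };
}

// Sanity checks against the power-on default value 0x1F80.
static_assert(decode_mxcsr(0x1F80u).exceptionMasks == 0x3Fu, "all six exceptions masked");
static_assert(decode_mxcsr(0x1F80u).roundingControl == 0u, "round to nearest");
static_assert(!decode_mxcsr(0x1F80u).ftz && !decode_mxcsr(0x1F80u).daz, "FTZ and DAZ off");
```

Passing the result of _mm_getcsr() to decode_mxcsr gives the same breakdown for the live register.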

Reading and writing MXCSR

In C and C++, the usual way to read and write MXCSR is through SSE intrinsics:

#include <xmmintrin.h>

unsigned int mxcsr = _mm_getcsr();
_mm_setcsr(mxcsr);

These intrinsics correspond to the SSE state-management instructions:

Intrinsic          Instruction  Meaning
_mm_getcsr()       STMXCSR      Store MXCSR into memory and return it
_mm_setcsr(value)  LDMXCSR      Load MXCSR from a value

At the assembly level, STMXCSR stores the current MXCSR register to a 32-bit memory location, while LDMXCSR loads MXCSR from a 32-bit memory location.

In C/C++, prefer _mm_getcsr() and _mm_setcsr() instead of inline assembly.
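Besides the two raw intrinsics, <xmmintrin.h> also provides helper macros such as _MM_GET_FLUSH_ZERO_MODE, _MM_SET_FLUSH_ZERO_MODE, and _MM_GET_ROUNDING_MODE, which read or update a single MXCSR field and leave the other bits alone. The sketch below wraps them in three small functions; the function names are hypothetical and chosen only for this example.

```cpp
#include <xmmintrin.h>

// Returns true when the FTZ bit is currently set for the calling thread.
bool flush_to_zero_enabled()
{
    return _MM_GET_FLUSH_ZERO_MODE() == _MM_FLUSH_ZERO_ON;
}

// Sets or clears only the FTZ bit; every other MXCSR bit is preserved.
void set_flush_to_zero(bool enabled)
{
    _MM_SET_FLUSH_ZERO_MODE(enabled ? _MM_FLUSH_ZERO_ON : _MM_FLUSH_ZERO_OFF);
}

// Returns true when the rounding-control field selects round to nearest.
bool rounding_is_nearest()
{
    return _MM_GET_ROUNDING_MODE() == _MM_ROUND_NEAREST;
}
```

The macros perform the same read-modify-write on MXCSR that the manual bit operations in this article do, so the restore-before-returning advice applies to them as well.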

A simple MXCSR example

This function reads the current MXCSR value and returns it:

#include <xmmintrin.h>

unsigned int get_sse_control_status()
{
    return _mm_getcsr();
}

This function restores a previously saved MXCSR value:

#include <xmmintrin.h>

void restore_sse_control_status(unsigned int mxcsr)
{
    _mm_setcsr(mxcsr);
}

This is useful when a function temporarily changes floating-point behavior and must restore the previous state before returning.

Use scoped restoration when changing MXCSR

Changing MXCSR affects later floating-point code running in the same thread. For this reason, code that changes MXCSR should normally restore the previous value before returning.

A simple C++ RAII helper can make this safer:

#include <xmmintrin.h>

class ScopedMxcsr
{
public:
    explicit ScopedMxcsr(unsigned int newMxcsr)
        : oldMxcsr_(_mm_getcsr())
    {
        _mm_setcsr(newMxcsr);
    }

    ~ScopedMxcsr()
    {
        _mm_setcsr(oldMxcsr_);
    }

    ScopedMxcsr(const ScopedMxcsr&) = delete;
    ScopedMxcsr& operator=(const ScopedMxcsr&) = delete;

private:
    unsigned int oldMxcsr_;
};

Example usage:

void run_with_temporary_mxcsr(unsigned int temporaryMxcsr)
{
    ScopedMxcsr guard(temporaryMxcsr);

    // Code here runs with temporaryMxcsr.

} // Previous MXCSR is restored here.

This pattern avoids accidentally leaking a modified SSE floating-point environment to the caller.

Flush-to-zero mode

One of the most common reasons to modify MXCSR is to enable flush-to-zero, also known as FTZ.

Nonzero floating-point values whose magnitude is below the smallest normalized value are called subnormal or denormal values. On some processors and workloads, operations involving these values can be much slower than operations on ordinary normalized floating-point values.

Flush-to-zero mode changes the behavior so that tiny underflowed results are flushed to zero instead of being represented as subnormal values. FTZ takes effect only while the underflow exception is masked, which is the default.

FTZ is controlled by bit 15 of MXCSR.

#include <xmmintrin.h>

void enable_flush_to_zero()
{
    unsigned int mxcsr = _mm_getcsr();

    mxcsr |= (1u << 15); // FTZ: Flush To Zero

    _mm_setcsr(mxcsr);
}

This modifies only the FTZ bit and preserves the rest of MXCSR.

A scoped version is safer:

#include <xmmintrin.h>

class ScopedFlushToZero
{
public:
    ScopedFlushToZero()
        : oldMxcsr_(_mm_getcsr())
    {
        unsigned int mxcsr = oldMxcsr_;
        mxcsr |= (1u << 15); // FTZ: Flush To Zero

        _mm_setcsr(mxcsr);
    }

    ~ScopedFlushToZero()
    {
        _mm_setcsr(oldMxcsr_);
    }

    ScopedFlushToZero(const ScopedFlushToZero&) = delete;
    ScopedFlushToZero& operator=(const ScopedFlushToZero&) = delete;

private:
    unsigned int oldMxcsr_;
};

Example:

void process_audio_block(float* samples, int count)
{
    ScopedFlushToZero ftz;

    // Audio/DSP processing here.
    // Very small intermediate results may be flushed to zero.

} // Previous MXCSR is restored here.

Flush-to-zero can be useful in audio, DSP, graphics, physics, and numerical code where denormal values are unwanted and tiny underflowed results can safely be treated as zero.
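The effect of FTZ can be observed directly. The following sketch multiplies FLT_MIN, the smallest normalized float, by 0.5, which produces a subnormal result. The function name is hypothetical. The operands are volatile so the compiler cannot fold the multiplication at compile time or move it outside the region where the modified MXCSR is active, and the example assumes the underflow exception is masked, as it is by default.

```cpp
#include <cfloat>
#include <xmmintrin.h>

// Computes FLT_MIN * 0.5 with FTZ switched on or off and returns the result.
float half_of_smallest_normal(bool flushToZero)
{
    const unsigned int oldMxcsr = _mm_getcsr();

    unsigned int mxcsr = oldMxcsr;
    if (flushToZero)
        mxcsr |= (1u << 15);   // FTZ on
    else
        mxcsr &= ~(1u << 15);  // FTZ off
    _mm_setcsr(mxcsr);

    // volatile forces real loads, a real MULSS, and a real store at run time.
    volatile float a = FLT_MIN;
    volatile float b = 0.5f;
    volatile float result;
    result = _mm_cvtss_f32(_mm_mul_ss(_mm_set_ss(a), _mm_set_ss(b)));

    _mm_setcsr(oldMxcsr);  // Restore the caller's settings.
    return result;
}
```

With FTZ off the function returns a small positive subnormal value; with FTZ on the same multiplication returns zero.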

Denormals-are-zero mode

Another related mode is denormals-are-zero, often abbreviated as DAZ.

FTZ affects tiny output results.

DAZ affects tiny input values.

Mode  MXCSR bit  Effect
FTZ   15         Flush tiny results to zero
DAZ   6          Treat tiny input values as zero

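DAZ was added after the original SSE and is missing from some early SSE processors. On such a processor bit 6 is reserved, and setting it raises a general-protection fault, so code that may run on that hardware should check for support before setting DAZ.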

A simple version looks like this:

#include <xmmintrin.h>

void enable_flush_to_zero_and_daz()
{
    unsigned int mxcsr = _mm_getcsr();

    mxcsr |= (1u << 15); // FTZ: Flush To Zero
    mxcsr |= (1u << 6);  // DAZ: Denormals Are Zero, if supported

    _mm_setcsr(mxcsr);
}

For portable low-level code, do not assume every reserved or optional bit can be set. Preserve existing MXCSR bits and set only the bits that are known to be supported by the target environment.
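The difference between FTZ and DAZ shows up when the input is already subnormal but the result is not. The sketch below multiplies a subnormal input by 4, which gives a normalized result. The names are hypothetical, and the example assumes a processor that supports DAZ; it does not check for support itself.

```cpp
#include <cfloat>
#include <xmmintrin.h>

constexpr unsigned int kFtzBit = 1u << 15;
constexpr unsigned int kDazBit = 1u << 6;   // Assumes DAZ is supported.

// Half of the smallest normalized float: a subnormal value, computed at compile time.
constexpr float kSubnormalInput = FLT_MIN / 2.0f;

// Multiplies the subnormal input by 4 with exactly the given FTZ/DAZ bits active.
float scale_subnormal_input(unsigned int modeBits)
{
    const unsigned int oldMxcsr = _mm_getcsr();
    _mm_setcsr((oldMxcsr & ~(kFtzBit | kDazBit)) | modeBits);

    // volatile keeps the multiplication at run time, inside the modified region.
    volatile float input = kSubnormalInput;
    volatile float factor = 4.0f;
    volatile float result;
    result = _mm_cvtss_f32(_mm_mul_ss(_mm_set_ss(input), _mm_set_ss(factor)));

    _mm_setcsr(oldMxcsr);  // Restore the caller's settings.
    return result;
}
```

With only FTZ set, the result is the normalized value 2 * FLT_MIN, because FTZ looks at results and this result is not tiny. With DAZ set, the subnormal input is read as zero and the result is zero.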

Rounding control

MXCSR also controls the rounding mode for SSE floating-point operations.

Bits 13 and 14 select the rounding mode:

Bits 14:13  Rounding mode
00          Round to nearest (ties to even)
01          Round down (toward negative infinity)
10          Round up (toward positive infinity)
11          Round toward zero (truncate)

Most application code should not change the rounding mode globally. If you do change it, use a scoped restore pattern.

Example:

#include <xmmintrin.h>

class ScopedSseRoundingTowardZero
{
public:
    ScopedSseRoundingTowardZero()
        : oldMxcsr_(_mm_getcsr())
    {
        unsigned int mxcsr = oldMxcsr_;

        // Clear rounding-control bits 13 and 14.
        mxcsr &= ~(3u << 13);

        // Set RC = 11: round toward zero.
        mxcsr |= (3u << 13);

        _mm_setcsr(mxcsr);
    }

    ~ScopedSseRoundingTowardZero()
    {
        _mm_setcsr(oldMxcsr_);
    }

    ScopedSseRoundingTowardZero(const ScopedSseRoundingTowardZero&) = delete;
    ScopedSseRoundingTowardZero& operator=(const ScopedSseRoundingTowardZero&) = delete;

private:
    unsigned int oldMxcsr_;
};

Use this kind of change only when the numerical behavior is intentional and well tested.
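One place where the rounding mode is easy to observe is float-to-integer conversion: _mm_cvtss_si32 (CVTSS2SI) rounds according to the RC field. The sketch below uses the _MM_SET_ROUNDING_MODE and _MM_ROUND_* macros from <xmmintrin.h>; the function name is hypothetical. The input is volatile so that the conversion is not folded at compile time under the default mode.

```cpp
#include <xmmintrin.h>

// Converts value to int with CVTSS2SI while the given rounding mode is active.
// roundingMode is one of _MM_ROUND_NEAREST, _MM_ROUND_DOWN, _MM_ROUND_UP,
// or _MM_ROUND_TOWARD_ZERO.
int convert_with_rounding(float value, unsigned int roundingMode)
{
    const unsigned int oldMode = _MM_GET_ROUNDING_MODE();
    _MM_SET_ROUNDING_MODE(roundingMode);

    // volatile keeps the conversion at run time, inside the modified region.
    volatile float input = value;
    volatile int result;
    result = _mm_cvtss_si32(_mm_set_ss(input));

    _MM_SET_ROUNDING_MODE(oldMode);  // Restore the caller's rounding mode.
    return result;
}
```

For example, 2.5f converts to 2 under round to nearest (ties go to the even integer), to 3 under round up, and to 2 under round down or round toward zero.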

Exception flags and masks

MXCSR contains both exception flags and exception masks.

Exception flags record that a floating-point condition occurred:

Flag               Meaning
Invalid operation  An invalid floating-point operation occurred
Denormal           A denormal operand was used
Divide by zero     Division by zero occurred
Overflow           A result was too large
Underflow          A result was too small
Precision          The result was inexact

Exception masks control whether these conditions raise exceptions or remain masked.

Most normal application code leaves SSE floating-point exceptions masked. That means operations such as inexact arithmetic set status flags but do not interrupt the program.

You can clear the exception flags by clearing bits 0 through 5:

#include <xmmintrin.h>

void clear_sse_exception_flags()
{
    unsigned int mxcsr = _mm_getcsr();

    // Clear exception status flags: bits 0 through 5.
    mxcsr &= ~0x3Fu;

    _mm_setcsr(mxcsr);
}

Be careful not to change unrelated control bits unintentionally.
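Because the flags are sticky, a typical use is to clear them, run an operation, and then check which flags were raised. The following sketch does this for a single division and tests the ZE flag (bit 2). The function name is hypothetical. Note that the function clears all six status flags as a side effect, and that it relies on the divide-by-zero exception being masked, which is the default.

```cpp
#include <xmmintrin.h>

// Returns true if dividing numerator by denominator with DIVSS raises the
// divide-by-zero (ZE) status flag. Clears all six status flags first.
bool division_sets_divide_by_zero_flag(float numerator, float denominator)
{
    _mm_setcsr(_mm_getcsr() & ~0x3Fu);  // Clear status flags, keep control bits.

    // volatile keeps the division at run time, between the clear and the read.
    volatile float n = numerator;
    volatile float d = denominator;
    volatile float quotient;
    quotient = _mm_cvtss_f32(_mm_div_ss(_mm_set_ss(n), _mm_set_ss(d)));

    return (_mm_getcsr() & (1u << 2)) != 0;  // Bit 2: ZE.
}
```

Dividing 1.0f by 0.0f raises ZE, while dividing 1.0f by 2.0f does not.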

Do not write arbitrary MXCSR values

MXCSR contains reserved bits.

Loading a value that has a reserved bit set, whether through LDMXCSR or FXRSTOR, raises a general-protection exception (#GP). For this reason, code should not do this:

// Bad idea: arbitrary constant.
_mm_setcsr(0xFFFFFFFFu);

Instead, preserve the existing value and change only the bits you need:

unsigned int mxcsr = _mm_getcsr();

mxcsr |= (1u << 15);  // Set FTZ.
mxcsr &= ~0x3Fu;      // Clear exception flags.

_mm_setcsr(mxcsr);

For low-level runtime or OS code, the MXCSR mask stored by FXSAVE can be used to determine which bits are supported before writing a new MXCSR value.
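A minimal sketch of that check follows. In the FXSAVE image, bytes 28 to 31 hold the MXCSR_MASK field, and a stored mask of zero means the default mask 0x0000FFBF applies, which excludes DAZ. The names FxSaveArea, supported_mxcsr_bits, and try_set_mxcsr are hypothetical. The example assumes the compiler exposes the _fxsave intrinsic through <immintrin.h> with FXSR enabled for the target, which is normally the case for x86-64 builds; 32-bit GCC or Clang builds may need -mfxsr.

```cpp
#include <cstdint>
#include <cstring>
#include <immintrin.h>

// 512-byte, 16-byte aligned buffer for the FXSAVE image.
struct alignas(16) FxSaveArea
{
    std::uint8_t data[512];
};

// Returns the set of MXCSR bits this processor allows to be set.
std::uint32_t supported_mxcsr_bits()
{
    FxSaveArea area;
    std::memset(area.data, 0, sizeof area.data);  // Start from a zeroed image.
    _fxsave(area.data);

    std::uint32_t mask;
    std::memcpy(&mask, area.data + 28, sizeof mask);  // MXCSR_MASK: bytes 28-31.

    // A zero mask means the default 0x0000FFBF applies (DAZ not supported).
    return mask != 0 ? mask : 0x0000FFBFu;
}

// Writes MXCSR only if the value sets no unsupported bit.
bool try_set_mxcsr(std::uint32_t value)
{
    if ((value & ~supported_mxcsr_bits()) != 0)
        return false;  // Would fault: a reserved or unsupported bit is set.

    _mm_setcsr(value);
    return true;
}
```

DAZ support can be tested the same way: the processor supports DAZ when bit 6 of the returned mask is set.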

Full SSE state: FXSAVE and FXRSTOR

_mm_getcsr() and _mm_setcsr() deal only with MXCSR.

For full floating-point and SIMD state, x86 provides the FXSAVE and FXRSTOR instructions.

Instruction  Meaning
FXSAVE       Save x87 FPU, MMX, XMM, MXCSR, and related state to memory
FXRSTOR      Restore x87 FPU, MMX, XMM, MXCSR, and related state from memory

These instructions are mainly used by:

  • operating systems,
  • kernels,
  • hypervisors,
  • debuggers,
  • emulators,
  • exception handlers,
  • low-level runtimes,
  • context-switching code.

Ordinary numerical application code almost never needs to call FXSAVE or FXRSTOR directly.

FXSAVE saves a state image

FXSAVE writes a full floating-point and SIMD state image to memory.

The saved state includes:

  • x87 floating-point control and status,
  • x87 registers,
  • MMX register state,
  • XMM registers,
  • MXCSR,
  • MXCSR mask,
  • instruction and data pointers used by the floating-point environment.

The memory area used by FXSAVE is 512 bytes.

The save area must be aligned on a 16-byte boundary. Executing FXSAVE or FXRSTOR with a misaligned address faults.

A low-level C/C++ example could define a state area like this:

#include <stdint.h>

struct alignas(16) FxSaveArea
{
    uint8_t data[512];
};

Then low-level code could use inline assembly or compiler-specific intrinsics to execute FXSAVE and FXRSTOR.

However, this is not normally needed in portable application code.

FXSAVE does not clear the state

FXSAVE saves the current state, but it does not clear or reset the processor’s floating-point or SIMD registers.

This distinction matters.

FXSAVE  = copy current state to memory
FXRSTOR = reload state from memory

Saving the state is not the same as resetting it.

If code needs to reset part of the floating-point environment, it must use the appropriate reset or initialization instruction separately.
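The copy-only behavior can be checked with the _fxsave intrinsic. The sketch below takes a snapshot, reads the MXCSR field from the image (bytes 24 to 27), and compares it with the live register before and after. The names are hypothetical, and the example assumes _fxsave is available through <immintrin.h> with FXSR enabled for the target, which is normally the case for x86-64 builds. It deliberately does not call _fxrstor: reloading every XMM and x87 register in the middle of compiled C++ code is only safe when the surrounding code is written with that in mind.

```cpp
#include <cstdint>
#include <cstring>
#include <immintrin.h>

// 512-byte, 16-byte aligned buffer for the FXSAVE image.
struct alignas(16) FxSaveArea
{
    std::uint8_t data[512];
};

// Takes an FXSAVE snapshot and returns the MXCSR value recorded in it.
std::uint32_t mxcsr_from_fxsave_image()
{
    FxSaveArea area;
    _fxsave(area.data);

    std::uint32_t saved;
    std::memcpy(&saved, area.data + 24, sizeof saved);  // MXCSR: bytes 24-27.
    return saved;
}

// True when the snapshot matches the live MXCSR and taking it changed nothing.
bool fxsave_leaves_mxcsr_unchanged()
{
    const unsigned int before = _mm_getcsr();
    const std::uint32_t image = mxcsr_from_fxsave_image();
    const unsigned int after = _mm_getcsr();

    return before == after && image == before;
}
```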

Who normally manages SSE state?

Different layers manage different parts of SSE state.

Layer                       Responsibility
Compiler                    Allocates XMM registers and follows the platform calling convention
Calling convention          Defines which registers must be preserved across calls
Operating system            Saves/restores thread state during context switches
Runtime libraries           May initialize floating-point environment
Application code            May temporarily adjust MXCSR when needed
Kernel/hypervisor/debugger  May use full state save/restore instructions

For most C and C++ application code, the only SSE state you might explicitly touch is MXCSR.

SSE state and function calls

You usually do not need to save and restore XMM registers manually around ordinary function calls.

The platform calling convention defines which registers are caller-saved and which are callee-saved. The compiler follows those rules.

For example, if a function uses SSE intrinsics, the compiler will allocate XMM registers as needed and emit code that respects the calling convention.

Manual XMM register save/restore is only relevant in unusual low-level code, such as:

  • hand-written assembly,
  • JIT compilers,
  • context switching,
  • signal or exception handling,
  • kernel code,
  • ABI boundary code.

SSE state and threads

MXCSR is part of the thread’s floating-point/SIMD state.

That means changing MXCSR affects the current thread. Other threads have their own saved processor state.

However, within the same thread, the change remains active until something changes it again.

This is why a library function that modifies MXCSR should restore the old value before returning. Otherwise, the caller may observe different rounding, exception, or denormal behavior.

SSE state vs MMX cleanup

Do not confuse SSE state management with MMX cleanup.

MMX uses the old x87 floating-point register state. After using MMX instructions, code should call:

_mm_empty();

This emits the EMMS instruction.

SSE XMM registers do not require _mm_empty().

This is an MMX-specific issue:

#include <mmintrin.h>

void mmx_function()
{
    // MMX operations...

    _mm_empty(); // Needed after MMX.
}

But ordinary SSE code does not need this:

#include <xmmintrin.h>

void sse_function()
{
    // SSE operations...

    // No _mm_empty() needed.
}

Example: temporary flush-to-zero in a processing function

This example shows a realistic pattern: save MXCSR, enable FTZ, run performance-sensitive code, and restore MXCSR.

#include <xmmintrin.h>

void process_samples_with_ftz(float* samples, int count, float gain)
{
    unsigned int oldMxcsr = _mm_getcsr();

    unsigned int mxcsr = oldMxcsr;
    mxcsr |= (1u << 15); // Enable FTZ.

    _mm_setcsr(mxcsr);

    int i = 0;

    __m128 vgain = _mm_set1_ps(gain);

    for (; i + 3 < count; i += 4)
    {
        __m128 x = _mm_loadu_ps(&samples[i]);
        x = _mm_mul_ps(x, vgain);
        _mm_storeu_ps(&samples[i], x);
    }

    for (; i < count; ++i)
    {
        samples[i] *= gain;
    }

    _mm_setcsr(oldMxcsr);
}

This works, but it has one weakness: the restore happens only if control reaches the final _mm_setcsr(oldMxcsr). If the function is later changed so that it returns early or calls code that throws, the old value is not restored.

In C++, the RAII version is safer:

#include <xmmintrin.h>

class ScopedFlushToZero
{
public:
    ScopedFlushToZero()
        : oldMxcsr_(_mm_getcsr())
    {
        unsigned int mxcsr = oldMxcsr_;
        mxcsr |= (1u << 15);
        _mm_setcsr(mxcsr);
    }

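    ~ScopedFlushToZero()
    {
        _mm_setcsr(oldMxcsr_);
    }

    // Non-copyable: a copy would restore the saved MXCSR a second time.
    ScopedFlushToZero(const ScopedFlushToZero&) = delete;
    ScopedFlushToZero& operator=(const ScopedFlushToZero&) = delete;

private:
    unsigned int oldMxcsr_;
};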

void process_samples_with_scoped_ftz(float* samples, int count, float gain)
{
    ScopedFlushToZero ftz;

    __m128 vgain = _mm_set1_ps(gain);

    int i = 0;

    for (; i + 3 < count; i += 4)
    {
        __m128 x = _mm_loadu_ps(&samples[i]);
        x = _mm_mul_ps(x, vgain);
        _mm_storeu_ps(&samples[i], x);
    }

    for (; i < count; ++i)
    {
        samples[i] *= gain;
    }
}

This version restores MXCSR automatically when the function exits.
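The benefit of the scoped approach is easiest to see when the guarded code throws. The sketch below uses a guard equivalent to the one above, renamed FtzGuard so the example is self-contained; process_and_fail and mxcsr_survives_exception are hypothetical names. It assumes the code is built with C++ exceptions enabled.

```cpp
#include <stdexcept>
#include <xmmintrin.h>

// Same idea as ScopedFlushToZero: set FTZ now, restore the old MXCSR on scope exit.
class FtzGuard
{
public:
    FtzGuard() : oldMxcsr_(_mm_getcsr())
    {
        _mm_setcsr(oldMxcsr_ | (1u << 15));
    }

    ~FtzGuard()
    {
        _mm_setcsr(oldMxcsr_);
    }

    FtzGuard(const FtzGuard&) = delete;
    FtzGuard& operator=(const FtzGuard&) = delete;

private:
    unsigned int oldMxcsr_;
};

// Hypothetical worker that fails part-way through while FTZ is active.
void process_and_fail()
{
    FtzGuard ftz;
    throw std::runtime_error("processing failed");
}

// True when MXCSR after the failed call equals MXCSR before it.
bool mxcsr_survives_exception()
{
    const unsigned int before = _mm_getcsr();

    try
    {
        process_and_fail();
    }
    catch (const std::runtime_error&)
    {
        // Stack unwinding has already run ~FtzGuard() at this point.
    }

    return _mm_getcsr() == before;
}
```

The destructor runs during stack unwinding, so the caller sees its original MXCSR even though the worker never reached its normal exit.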

Example: inspecting MXCSR flags

The following example reads MXCSR and checks whether any floating-point exception status flag is set.

#include <xmmintrin.h>

int has_sse_exception_flags()
{
    unsigned int mxcsr = _mm_getcsr();

    // Bits 0 through 5 are exception status flags.
    return (mxcsr & 0x3Fu) != 0;
}

A debugging helper could print individual flag names:

#include <stdio.h>
#include <xmmintrin.h>

void print_mxcsr_flags()
{
    unsigned int mxcsr = _mm_getcsr();

    if (mxcsr & (1u << 0)) printf("Invalid operation flag set\n");
    if (mxcsr & (1u << 1)) printf("Denormal flag set\n");
    if (mxcsr & (1u << 2)) printf("Divide-by-zero flag set\n");
    if (mxcsr & (1u << 3)) printf("Overflow flag set\n");
    if (mxcsr & (1u << 4)) printf("Underflow flag set\n");
    if (mxcsr & (1u << 5)) printf("Precision flag set\n");
}

This can be useful when debugging numerical code.

Common pitfalls

SSE state management has several traps.

Changing MXCSR without restoring it

MXCSR is part of the thread’s floating-point environment. If a function changes rounding mode, exception masks, FTZ, or DAZ, the change remains active until changed again.

Use a scoped restore pattern whenever possible.

Writing reserved MXCSR bits

Do not write arbitrary constants to MXCSR. Preserve the existing value and modify only the bits you need.

This is safer:

unsigned int mxcsr = _mm_getcsr();
mxcsr |= (1u << 15);
_mm_setcsr(mxcsr);

This is unsafe:

_mm_setcsr(0xFFFFFFFFu);

Assuming all processors support DAZ

FTZ and DAZ are related, but DAZ is not universally supported on all old SSE processors.

If you need to support very old machines, check processor support or avoid setting optional bits blindly.

Confusing FXSAVE with ordinary application-level SSE code

FXSAVE and FXRSTOR are full state-management instructions. They are mainly for operating systems, debuggers, exception handlers, and low-level runtimes.

They are not normally needed inside ordinary SIMD functions.

Forgetting FXSAVE alignment

The FXSAVE memory area must be properly aligned. A 16-byte aligned 512-byte buffer is the usual basic requirement for SSE-era state saving.

Thinking FXSAVE clears registers

FXSAVE saves the current state to memory. It does not clear the current processor state.

Calling _mm_empty() after SSE code

_mm_empty() is for MMX cleanup, not ordinary SSE cleanup.

Use _mm_empty() after MMX code. Do not add it after normal SSE code unless MMX was actually used.

Changing rounding mode globally

Changing the SSE rounding mode can affect later calculations in the same thread. This can produce subtle numerical bugs.

Prefer local, scoped changes, and document why the rounding mode is being changed.

When should application code care?

Most application code should not care about full SSE state management.

You usually do not need to save XMM registers manually, and you usually do not need FXSAVE or FXRSTOR.

Application code may care about MXCSR when:

  • enabling flush-to-zero for performance,
  • controlling denormal behavior,
  • temporarily changing rounding behavior,
  • debugging floating-point exceptions,
  • writing numerical libraries with explicit floating-point environment requirements.

Low-level code may care about full SSE state when:

  • implementing thread context switching,
  • writing an operating system,
  • writing a hypervisor,
  • implementing an emulator,
  • writing a debugger,
  • handling signals or exceptions,
  • writing a JIT compiler or runtime.

Build notes

The MXCSR intrinsics are available through:

#include <xmmintrin.h>

Example GCC or Clang build command for 32-bit targets where SSE must be enabled explicitly:

gcc -O2 -msse example.c -o example

For C++:

g++ -O2 -msse example.cpp -o example

On modern x86-64 targets, SSE/SSE2 support is normally part of the baseline architecture, but explicit compiler flags may still be useful when controlling code generation for older or specific targets.

Complete example program

This program prints the current MXCSR value, enables flush-to-zero, prints the modified value, and then restores the original value.

#include <stdio.h>
#include <xmmintrin.h>

static void print_mxcsr(const char* label, unsigned int mxcsr)
{
    printf("%s: 0x%08X\n", label, mxcsr);

    printf("  Exception flags: 0x%02X\n", mxcsr & 0x3F);
    printf("  DAZ: %s\n", (mxcsr & (1u << 6)) ? "on" : "off");
    printf("  Exception masks: 0x%02X\n", (mxcsr >> 7) & 0x3F);
    printf("  Rounding mode: %u\n", (mxcsr >> 13) & 0x3);
    printf("  FTZ: %s\n", (mxcsr & (1u << 15)) ? "on" : "off");
}

int main()
{
    unsigned int original = _mm_getcsr();

    print_mxcsr("Original MXCSR", original);

    unsigned int modified = original;
    modified |= (1u << 15); // Enable FTZ.

    _mm_setcsr(modified);

    print_mxcsr("Modified MXCSR", _mm_getcsr());

    _mm_setcsr(original);

    print_mxcsr("Restored MXCSR", _mm_getcsr());

    return 0;
}

Example build command:

gcc -O2 -msse mxcsr_example.c -o mxcsr_example

The exact MXCSR value can vary depending on operating system, compiler, runtime initialization, and processor support.

Summary

SSE state management is mostly about understanding what the processor keeps as part of its floating-point and SIMD environment.

The most important practical points are:

  • MXCSR controls SSE floating-point behavior.
  • _mm_getcsr() reads MXCSR.
  • _mm_setcsr() writes MXCSR.
  • STMXCSR and LDMXCSR are the underlying instructions.
  • FTZ can improve performance in workloads affected by denormal values.
  • DAZ treats denormal inputs as zero on processors that support it.
  • Changing MXCSR affects the current thread until restored.
  • Use scoped restoration when modifying MXCSR.
  • Do not write arbitrary values to MXCSR because setting a reserved bit faults.
  • FXSAVE and FXRSTOR save and restore full x87/MMX/SSE state and are mainly for low-level code.
  • _mm_empty() is for MMX cleanup, not ordinary SSE code.

Most SSE application code does not need explicit state management. But when numerical behavior, denormal performance, or low-level context handling matters, understanding MXCSR and full SIMD state becomes essential.
