SIMD on x64/x86

SSE Intrinsics: a practical guide to __m128 SIMD programming

SSE, short for Streaming SIMD Extensions, is an x86 instruction set extension that allows one instruction to operate on multiple floating-point values at the same time.

The most common SSE programming model uses 128-bit XMM registers. Each register can hold four 32-bit single-precision floating-point values:

[ f0 | f1 | f2 | f3 ]

Instead of adding one float at a time, SSE can add four floats with one packed instruction. This is the basic idea behind SIMD: Single Instruction, Multiple Data.

This article introduces the most important SSE intrinsics for C and C++: data types, packed and scalar operations, arithmetic, reciprocal approximations, comparisons, logical operations, memory access, and common pitfalls.

Header file

SSE intrinsics are declared in:

#include <xmmintrin.h>

This header provides access to the original SSE intrinsics, including single-precision floating-point arithmetic, comparisons, logical operations, shuffles, loads, and stores.

The __m128 data type

The main SSE data type is:

__m128

A __m128 value represents a 128-bit XMM register containing four single-precision floating-point values.

Conceptually:

__m128 v = [ v0 | v1 | v2 | v3 ]

Each lane is a 32-bit float.

SSE intrinsics usually take one or more __m128 values and return another __m128 value.

Example:

__m128 c = _mm_add_ps(a, b);

This adds the four float lanes of a and b.

Packed single vs scalar single

Most SSE arithmetic intrinsics come in two forms:

SuffixMeaningOperation
PSPacked Single-precisionoperates on all four float lanes
SSScalar Single-precisionoperates only on the low float lane

For example:

_mm_add_ps(a, b)

adds all four lanes:

[ a0 + b0 | a1 + b1 | a2 + b2 | a3 + b3 ]

But:

_mm_add_ss(a, b)

adds only the lowest lane and copies the upper three lanes from the first operand:

[ a0 + b0 | a1 | a2 | a3 ]

This distinction is essential. Many bugs in SSE code come from accidentally using a scalar instruction where a packed instruction was intended, or from forgetting that the upper lanes of a scalar result are copied from the first operand.

Naming conventions

SSE intrinsic names are easier to understand once the suffixes are familiar.

Name fragmentMeaning
_mmSIMD intrinsic prefix
add, sub, mul, divarithmetic operation
pspacked single-precision floats
ssscalar single-precision float
loadload from memory
storestore to memory
uunaligned memory access
set1broadcast one value to all lanes
cmpcompare
movemaskextract sign bits into an integer mask

Examples:

_mm_add_ps

means “add packed single-precision floats.”

_mm_add_ss

means “add scalar single-precision floats.”

_mm_loadu_ps

means “load four unaligned single-precision floats.”

Creating SSE values

There are several ways to create __m128 values.

Set values manually

__m128 v = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);

The r in _mm_setr_ps stands for reverse order relative to _mm_set_ps. It is often easier to read because the values appear in the natural left-to-right order:

[ 1.0 | 2.0 | 3.0 | 4.0 ]

Broadcast one value to all lanes

__m128 v = _mm_set1_ps(2.0f);

Result:

[ 2.0 | 2.0 | 2.0 | 2.0 ]

This is useful when applying the same constant to several values.

Create a zero vector

__m128 zero = _mm_setzero_ps();

Result:

[ 0.0 | 0.0 | 0.0 | 0.0 ]

Loading and storing floats

SSE code usually loads data from memory into an XMM register, performs operations, and stores the result back to memory.

Unaligned load and store

__m128 v = _mm_loadu_ps(ptr);
_mm_storeu_ps(out, v);

The u means unaligned. The pointer does not need to be 16-byte aligned.

Aligned load and store

__m128 v = _mm_load_ps(ptr);
_mm_store_ps(out, v);

Aligned loads and stores require the memory address to be 16-byte aligned.

For simple and safe code, use _mm_loadu_ps and _mm_storeu_ps unless you know that the data is correctly aligned.

Arithmetic intrinsics

The following table summarizes the most common SSE arithmetic intrinsics.

In the examples below:

a = [ a0 | a1 | a2 | a3 ]
b = [ b0 | b1 | b2 | b3 ]
IntrinsicInstructionOperationResult
_mm_add_ps(a, b)ADDPSpacked add`[a0+b0
_mm_add_ss(a, b)ADDSSscalar add`[a0+b0
_mm_sub_ps(a, b)SUBPSpacked subtract`[a0-b0
_mm_sub_ss(a, b)SUBSSscalar subtract`[a0-b0
_mm_mul_ps(a, b)MULPSpacked multiply`[a0*b0
_mm_mul_ss(a, b)MULSSscalar multiply`[a0*b0
_mm_div_ps(a, b)DIVPSpacked divide`[a0/b0
_mm_div_ss(a, b)DIVSSscalar divide`[a0/b0
_mm_sqrt_ps(a)SQRTPSpacked square root`[sqrt(a0)
_mm_sqrt_ss(a)SQRTSSscalar square root`[sqrt(a0)
_mm_min_ps(a, b)MINPSpacked minimum`[min(a0,b0)
_mm_min_ss(a, b)MINSSscalar minimum`[min(a0,b0)
_mm_max_ps(a, b)MAXPSpacked maximum`[max(a0,b0)
_mm_max_ss(a, b)MAXSSscalar maximum`[max(a0,b0)

For addition and multiplication, the order of operands does not change the mathematical result. For subtraction and division, the order matters:

_mm_sub_ps(a, b); // a - b
_mm_div_ps(a, b); // a / b

Reciprocal and reciprocal square root

SSE also provides approximate reciprocal and reciprocal-square-root operations.

IntrinsicInstructionOperation
_mm_rcp_ps(a)RCPPSapproximate 1.0f / a for all four lanes
_mm_rcp_ss(a)RCPSSapproximate 1.0f / a0 for the low lane
_mm_rsqrt_ps(a)RSQRTPSapproximate 1.0f / sqrt(a) for all four lanes
_mm_rsqrt_ss(a)RSQRTSSapproximate 1.0f / sqrt(a0) for the low lane

These instructions are faster than full division or square root, but they are approximate. They are often useful in graphics, image processing, games, and simulations where a small numerical error is acceptable.

If more accuracy is needed, the approximation can be refined with a Newton-Raphson step.

For reciprocal:

// y ~= 1 / x
__m128 y = _mm_rcp_ps(x);

// One Newton-Raphson refinement:
// y = y * (2 - x * y)
__m128 two = _mm_set1_ps(2.0f);
y = _mm_mul_ps(y, _mm_sub_ps(two, _mm_mul_ps(x, y)));

For reciprocal square root:

// y ~= 1 / sqrt(x)
__m128 y = _mm_rsqrt_ps(x);

// One Newton-Raphson refinement:
// y = 0.5 * y * (3 - x * y * y)
__m128 half = _mm_set1_ps(0.5f);
__m128 three = _mm_set1_ps(3.0f);

y = _mm_mul_ps(
        _mm_mul_ps(half, y),
        _mm_sub_ps(three, _mm_mul_ps(x, _mm_mul_ps(y, y))));

Example: adding four floats

This is the simplest SSE example. It adds four floats from one array to four floats from another array.

#include <xmmintrin.h>

void add4_floats(const float* a, const float* b, float* out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);

    __m128 result = _mm_add_ps(va, vb);

    _mm_storeu_ps(out, result);
}

If the input arrays contain:

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]

then the output is:

out = [11.0, 22.0, 33.0, 44.0]

Example: packed vs scalar addition

This example shows the difference between _mm_add_ps and _mm_add_ss.

#include <xmmintrin.h>

void packed_vs_scalar(float* packedOut, float* scalarOut)
{
    __m128 a = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
    __m128 b = _mm_setr_ps(10.0f, 20.0f, 30.0f, 40.0f);

    __m128 packed = _mm_add_ps(a, b);
    __m128 scalar = _mm_add_ss(a, b);

    _mm_storeu_ps(packedOut, packed);
    _mm_storeu_ps(scalarOut, scalar);
}

The packed result is:

packedOut = [11.0, 22.0, 33.0, 44.0]

The scalar result is:

scalarOut = [11.0, 2.0, 3.0, 4.0]

Only the first lane was added. The other three lanes were copied from a.

Example: processing an array of floats

Most real SSE code processes arrays in groups of four floats.

#include <xmmintrin.h>

void add_float_arrays(const float* a,
                      const float* b,
                      float* out,
                      int count)
{
    int i = 0;

    for (; i + 3 < count; i += 4)
    {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);

        __m128 result = _mm_add_ps(va, vb);

        _mm_storeu_ps(&out[i], result);
    }

    // Handle the remaining 1, 2, or 3 elements.
    for (; i < count; ++i)
    {
        out[i] = a[i] + b[i];
    }
}

The vector loop handles four floats per iteration. The scalar cleanup loop handles the remaining elements when count is not a multiple of four.

Example: scale and offset an array

This example computes:

out[i] = input[i] * scale + offset

for every value in the input array.

#include <xmmintrin.h>

void scale_and_offset(const float* input,
                      float* output,
                      int count,
                      float scale,
                      float offset)
{
    __m128 vscale = _mm_set1_ps(scale);
    __m128 voffset = _mm_set1_ps(offset);

    int i = 0;

    for (; i + 3 < count; i += 4)
    {
        __m128 x = _mm_loadu_ps(&input[i]);

        __m128 y = _mm_mul_ps(x, vscale);
        y = _mm_add_ps(y, voffset);

        _mm_storeu_ps(&output[i], y);
    }

    for (; i < count; ++i)
    {
        output[i] = input[i] * scale + offset;
    }
}

This pattern is common in graphics, audio, image processing, and numerical code.

Example: clamping values with MINPS and MAXPS

SSE MINPS and MAXPS can clamp values to a range.

The following function clamps every value to:

[minValue, maxValue]
#include <xmmintrin.h>

void clamp_float_array(const float* input,
                       float* output,
                       int count,
                       float minValue,
                       float maxValue)
{
    __m128 vmin = _mm_set1_ps(minValue);
    __m128 vmax = _mm_set1_ps(maxValue);

    int i = 0;

    for (; i + 3 < count; i += 4)
    {
        __m128 x = _mm_loadu_ps(&input[i]);

        x = _mm_max_ps(x, vmin);
        x = _mm_min_ps(x, vmax);

        _mm_storeu_ps(&output[i], x);
    }

    for (; i < count; ++i)
    {
        float x = input[i];

        if (x < minValue)
            x = minValue;

        if (x > maxValue)
            x = maxValue;

        output[i] = x;
    }
}

This is useful when values must stay inside a valid range, such as pixel values, audio samples, simulation parameters, or normalized coordinates.

Example: square roots

The SQRTPS instruction computes four square roots in parallel.

#include <xmmintrin.h>
#include <math.h>

void sqrt_float_array(const float* input,
                      float* output,
                      int count)
{
    int i = 0;

    for (; i + 3 < count; i += 4)
    {
        __m128 x = _mm_loadu_ps(&input[i]);
        __m128 y = _mm_sqrt_ps(x);

        _mm_storeu_ps(&output[i], y);
    }

    for (; i < count; ++i)
    {
        output[i] = sqrtf(input[i]);
    }
}

This assumes that the input values are non-negative. Negative inputs follow the usual floating-point behavior for square root.

Example: normalizing 2D vectors

This example normalizes arrays of 2D vectors.

Given:

length = sqrt(x*x + y*y)

outX = x / length
outY = y / length

The SSE version processes four vectors at a time.

#include <xmmintrin.h>
#include <math.h>

void normalize_2d_vectors(const float* x,
                          const float* y,
                          float* outX,
                          float* outY,
                          int count)
{
    int i = 0;

    for (; i + 3 < count; i += 4)
    {
        __m128 vx = _mm_loadu_ps(&x[i]);
        __m128 vy = _mm_loadu_ps(&y[i]);

        __m128 x2 = _mm_mul_ps(vx, vx);
        __m128 y2 = _mm_mul_ps(vy, vy);

        __m128 lengthSquared = _mm_add_ps(x2, y2);
        __m128 length = _mm_sqrt_ps(lengthSquared);

        __m128 nx = _mm_div_ps(vx, length);
        __m128 ny = _mm_div_ps(vy, length);

        _mm_storeu_ps(&outX[i], nx);
        _mm_storeu_ps(&outY[i], ny);
    }

    for (; i < count; ++i)
    {
        float length = sqrtf(x[i] * x[i] + y[i] * y[i]);

        outX[i] = x[i] / length;
        outY[i] = y[i] / length;
    }
}

This function assumes that no vector has zero length. Production code should handle zero-length vectors before dividing by the length.

Example: approximate normalization with RSQRTPS

If full precision is not required, normalization can use _mm_rsqrt_ps.

Instead of computing:

length = sqrt(x*x + y*y)
outX = x / length
outY = y / length

we can compute:

invLength = 1 / sqrt(x*x + y*y)
outX = x * invLength
outY = y * invLength
#include <xmmintrin.h>

void normalize_2d_vectors_approx(const float* x,
                                 const float* y,
                                 float* outX,
                                 float* outY,
                                 int count)
{
    int i = 0;

    for (; i + 3 < count; i += 4)
    {
        __m128 vx = _mm_loadu_ps(&x[i]);
        __m128 vy = _mm_loadu_ps(&y[i]);

        __m128 x2 = _mm_mul_ps(vx, vx);
        __m128 y2 = _mm_mul_ps(vy, vy);

        __m128 lengthSquared = _mm_add_ps(x2, y2);
        __m128 invLength = _mm_rsqrt_ps(lengthSquared);

        __m128 nx = _mm_mul_ps(vx, invLength);
        __m128 ny = _mm_mul_ps(vy, invLength);

        _mm_storeu_ps(&outX[i], nx);
        _mm_storeu_ps(&outY[i], ny);
    }

    for (; i < count; ++i)
    {
        float invLength = 1.0f / sqrtf(x[i] * x[i] + y[i] * y[i]);

        outX[i] = x[i] * invLength;
        outY[i] = y[i] * invLength;
    }
}

This version is faster on many processors, but the result is approximate. Use it only when the precision is acceptable.

Comparisons

SSE comparisons produce masks.

Each lane in the result is either all bits set, meaning true, or all bits clear, meaning false.

Some common comparison intrinsics are:

IntrinsicMeaning
_mm_cmpeq_ps(a, b)compare packed floats for equality
_mm_cmplt_ps(a, b)compare packed floats for less-than
_mm_cmple_ps(a, b)compare packed floats for less-than-or-equal
_mm_cmpgt_ps(a, b)compare packed floats for greater-than
_mm_cmpge_ps(a, b)compare packed floats for greater-than-or-equal
_mm_cmpneq_ps(a, b)compare packed floats for not-equal

Example:

__m128 a = _mm_setr_ps(1.0f, 5.0f, 3.0f, 8.0f);
__m128 b = _mm_set1_ps(4.0f);

__m128 mask = _mm_cmpgt_ps(a, b);

Conceptually, this compares:

[1.0 > 4.0 | 5.0 > 4.0 | 3.0 > 4.0 | 8.0 > 4.0]

Result:

[false | true | false | true]

Internally, those are not Boolean values like in C. They are bit masks.

Example: zero values below a threshold

This example keeps values greater than or equal to a threshold and replaces the others with zero.

#include <xmmintrin.h>

void threshold_to_zero(const float* input,
                       float* output,
                       int count,
                       float threshold)
{
    __m128 vthreshold = _mm_set1_ps(threshold);

    int i = 0;

    for (; i + 3 < count; i += 4)
    {
        __m128 x = _mm_loadu_ps(&input[i]);

        __m128 mask = _mm_cmpge_ps(x, vthreshold);
        __m128 y = _mm_and_ps(mask, x);

        _mm_storeu_ps(&output[i], y);
    }

    for (; i < count; ++i)
    {
        output[i] = input[i] >= threshold ? input[i] : 0.0f;
    }
}

The comparison creates a mask. The bitwise AND keeps the original value where the mask is true and clears it where the mask is false.

Logical operations

SSE includes bitwise logical operations for __m128 values.

IntrinsicMeaning
_mm_and_ps(a, b)bitwise AND
_mm_or_ps(a, b)bitwise OR
_mm_xor_ps(a, b)bitwise XOR
_mm_andnot_ps(a, b)bitwise AND NOT

These operations are commonly used with comparison masks.

For example:

__m128 mask = _mm_cmpgt_ps(a, b);
__m128 selected = _mm_and_ps(mask, a);

This keeps the lanes from a where the comparison is true and produces zero in the other lanes.

A common select pattern is:

// result = mask ? valueIfTrue : valueIfFalse
__m128 result = _mm_or_ps(
    _mm_and_ps(mask, valueIfTrue),
    _mm_andnot_ps(mask, valueIfFalse));

_mm_andnot_ps(mask, valueIfFalse) computes:

(~mask) & valueIfFalse

Extracting comparison results with MOVMSKPS

The _mm_movemask_ps intrinsic extracts the sign bit of each lane and packs those bits into a normal integer.

This is useful after comparisons.

#include <xmmintrin.h>

int all_greater_than_zero(__m128 x)
{
    __m128 zero = _mm_setzero_ps();
    __m128 mask = _mm_cmpgt_ps(x, zero);

    int bits = _mm_movemask_ps(mask);

    return bits == 0xF;
}

The result of _mm_movemask_ps has one bit per lane.

bit 0 -> lane 0
bit 1 -> lane 1
bit 2 -> lane 2
bit 3 -> lane 3

If all four comparison lanes are true, the result is:

0b1111 = 0xF

Shuffles

SSE can rearrange values inside XMM registers.

The most common shuffle intrinsic is:

_mm_shuffle_ps(a, b, mask)

The mask is a compile-time constant that chooses which lanes to copy.

One simple use is broadcasting the first lane to all lanes:

__m128 v = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);

__m128 x = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));

Result:

[1.0 | 1.0 | 1.0 | 1.0]

Shuffles are useful for rearranging data, duplicating values, and building more complex vector algorithms. They can also become expensive if overused, so data layout is often just as important as instruction choice.

Data layout matters

SSE works best when the values you want to process together are stored next to each other in memory.

This layout is SIMD-friendly:

x0, x1, x2, x3, x4, x5, x6, x7

Four x values can be loaded directly:

__m128 x = _mm_loadu_ps(&values[i]);

This layout is sometimes harder to vectorize:

struct Point
{
    float x;
    float y;
    float z;
};

Point points[count];

The x, y, and z values are interleaved:

x0, y0, z0, x1, y1, z1, x2, y2, z2, ...

If you frequently process all x values together, all y values together, or all z values together, a structure-of-arrays layout may be better:

struct Points
{
    float* x;
    float* y;
    float* z;
};

This makes it easy to load four x values, four y values, or four z values with one instruction.

Common pitfalls

SSE intrinsics are powerful, but several details can lead to wrong results or disappointing performance.

Confusing packed and scalar operations

PS operations work on all four lanes.

SS operations work only on the low lane and copy the upper lanes from the first operand.

__m128 a = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
__m128 b = _mm_setr_ps(10.0f, 20.0f, 30.0f, 40.0f);

__m128 r = _mm_add_ss(a, b);

The result is:

[11.0 | 2.0 | 3.0 | 4.0]

not:

[11.0 | 22.0 | 33.0 | 44.0]

Forgetting the operand order

For subtraction and division, operand order matters:

_mm_sub_ps(a, b); // a - b
_mm_sub_ps(b, a); // b - a

_mm_div_ps(a, b); // a / b
_mm_div_ps(b, a); // b / a

When reading assembly, also remember that assembly syntax and intrinsic syntax may make the destination operand feel different. Check the documentation when in doubt.

Using aligned loads on unaligned memory

Aligned loads and stores require 16-byte alignment:

_mm_load_ps(ptr);   // ptr must be 16-byte aligned
_mm_store_ps(ptr, v);

If alignment is not guaranteed, use:

_mm_loadu_ps(ptr);
_mm_storeu_ps(ptr, v);

This is safer and usually fast enough on modern processors.

Forgetting scalar cleanup

SSE processes four floats per packed operation. If the array size is not a multiple of four, the final one, two, or three values still need to be processed.

A typical loop has two parts:

for (; i + 3 < count; i += 4)
{
    // SSE vector loop
}

for (; i < count; ++i)
{
    // scalar cleanup
}

Assuming reciprocal instructions are exact

_mm_rcp_ps and _mm_rsqrt_ps are approximate.

They are useful for speed, but they should not be used when full floating-point precision is required. Use _mm_div_ps or _mm_sqrt_ps when accuracy matters more than speed.

Assuming MIN and MAX are always simple mathematical operations

MINPS, MINSS, MAXPS, and MAXSS have specific floating-point behavior, especially for NaN values, signed zero, and equal operands.

If your data can contain NaNs or invalid values, check the exact instruction behavior instead of assuming that SSE min/max behaves like a high-level mathematical function.

Overusing shuffles

Shuffles are powerful, but they do not do arithmetic. If an algorithm spends too much time rearranging data before doing useful work, the SIMD speedup may disappear.

A better memory layout can often reduce the number of shuffles.

Hand-vectorizing too early

Modern compilers can auto-vectorize many simple loops.

Before writing explicit SSE intrinsics:

  1. Write clear scalar code.
  2. Compile with optimization enabled.
  3. Measure performance.
  4. Inspect the generated assembly.
  5. Use intrinsics only where they improve the measured hot path.

Manual SSE is most useful when the compiler cannot auto-vectorize the code, when you need a specific instruction, or when you need explicit control over data movement and numerical behavior.

Build notes

On GCC or Clang, enable SSE explicitly when targeting older 32-bit configurations:

gcc -O2 -msse example.c -o example

For C++:

g++ -O2 -msse example.cpp -o example

On modern x86-64 systems, SSE support is normally available as part of the baseline architecture. On very old 32-bit x86 systems, SSE support may need to be detected or enabled explicitly.

For MSVC, include the correct header and build with optimization enabled. On old 32-bit targets, the /arch option controls which SIMD instruction sets the compiler may use.

Complete example program

The following program demonstrates packed and scalar addition.

#include <stdio.h>
#include <xmmintrin.h>

static void print4(const char* name, __m128 value)
{
    float out[4];
    _mm_storeu_ps(out, value);

    printf("%s = [%f, %f, %f, %f]\n",
           name,
           out[0],
           out[1],
           out[2],
           out[3]);
}

int main()
{
    __m128 a = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
    __m128 b = _mm_setr_ps(10.0f, 20.0f, 30.0f, 40.0f);

    __m128 packedAdd = _mm_add_ps(a, b);
    __m128 scalarAdd = _mm_add_ss(a, b);

    print4("a", a);
    print4("b", b);
    print4("_mm_add_ps(a, b)", packedAdd);
    print4("_mm_add_ss(a, b)", scalarAdd);

    return 0;
}

Expected output:

a = [1.000000, 2.000000, 3.000000, 4.000000]
b = [10.000000, 20.000000, 30.000000, 40.000000]
_mm_add_ps(a, b) = [11.000000, 22.000000, 33.000000, 44.000000]
_mm_add_ss(a, b) = [11.000000, 2.000000, 3.000000, 4.000000]

Summary

SSE intrinsics provide direct access to 128-bit SIMD operations on single-precision floating-point values.

The key ideas are:

  • __m128 represents four 32-bit floats.
  • PS means packed single-precision: all four lanes are used.
  • SS means scalar single-precision: only the low lane is used.
  • _mm_loadu_ps and _mm_storeu_ps are safe for unaligned memory.
  • _mm_add_ps, _mm_sub_ps, _mm_mul_ps, and _mm_div_ps perform four arithmetic operations at once.
  • _mm_rcp_ps and _mm_rsqrt_ps are fast but approximate.
  • comparisons produce masks, not ordinary Boolean values.
  • data layout is often as important as instruction choice.
  • scalar cleanup is needed when the number of elements is not a multiple of four.

SSE is old, but it remains an important foundation for understanding x86 SIMD programming. The same concepts — lanes, packed operations, scalar operations, alignment, masks, shuffles, and data layout — carry forward to SSE2, SSE3, SSE4, AVX, AVX2, and AVX-512.

References

Leave a Reply

Your email address will not be published. Required fields are marked *