SIMD on x64/x86

SSE Arithmetic Instructions: packed and scalar floating-point operations

SSE arithmetic lets one instruction operate on four 32-bit floating-point values packed in a 128-bit XMM register. This page explains the arithmetic instructions, their C/C++ intrinsics, the difference between packed and scalar forms, and the practical pitfalls that matter when writing or reading SSE code.

Packed vs scalar: PS and SS

Most arithmetic instructions come in two forms:

  • `PS` — packed single-precision: operates on all four lanes.
  • `SS` — scalar single-precision: operates only on the low lane and leaves the upper three lanes unchanged from the first operand.

For example, `_mm_add_ps(a, b)` adds four floats in parallel, while `_mm_add_ss(a, b)` adds only `a0 + b0` and copies `a1`, `a2`, and `a3` from `a`.

SSE arithmetic instruction summary

InstructionIntrinsicOperation
ADDPS_mm_add_ps(a, b)[a0+b0, a1+b1, a2+b2, a3+b3]
ADDSS_mm_add_ss(a, b)[a0+b0, a1, a2, a3]
SUBPS_mm_sub_ps(a, b)[a0-b0, a1-b1, a2-b2, a3-b3]
MULPS_mm_mul_ps(a, b)[a0*b0, a1*b1, a2*b2, a3*b3]
DIVPS_mm_div_ps(a, b)[a0/b0, a1/b1, a2/b2, a3/b3]
SQRTPS_mm_sqrt_ps(a)[sqrt(a0), sqrt(a1), sqrt(a2), sqrt(a3)]
MINPS_mm_min_ps(a, b)Lane-wise min
MAXPS_mm_max_ps(a, b)Lane-wise max
sse02

ADDPS (parallel) and ADDSS (scalar) add the pair of operands.
SUBPS (parallel) and SUBSS (scalar) subtract the pair of operands.
MULPS (parallel) and MULSS (scalar) multiply the pair of operands.
DIVPS (parallel) and DIVSS (scalar) divides the pair of operands.
SQRTPS (parallel) and SQRTSS (scalar) return the square root of the source operand to the destination register.
MAXPS (parallel) and MAXSS (scalar) return the maximum of the pair of operands: DestReg[i] = Max(DestReg[i], SrcReg[i]) 
MINPS (parallel) and MINSS (scalar) return the minimum of the pair of operands: DestReg[i] = Min(DestReg[i], SrcReg[i])

ADDSS xmm1, xmm2/m32

Adds the low single-precision floating-point values from the source operand (second operand) and the destination operand (first operand), and stores the single-precision floating-point result in the destination operand. The source operand can be an XMM register or a 32-bit memory location. The destination operand is an XMM register. The three high-order doublewords of the destination operand remain unchanged.

DEST[31-0] ? DEST[31-0] + SRC[31-0];
ADDSS __m128 _mm_add_ss(__m128 a, __m128 b)

ADDPS xmm1, xmm2/m128

Performs a SIMD add of the four packed single-precision floating-point values from the source operand (second operand) and the destination operand (first operand), and stores the packed single-precision floating-point results in the destination operand. The source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register.

DEST[31-0] ? DEST[31-0] + SRC[31-0];
DEST[63-32] ? DEST[63-32] + SRC[63-32];
DEST[95-64] ? DEST[95-64] + SRC[95-64];
DEST[127-96] ? DEST[127-96] + SRC[127-96];
ADDPS __m128 _mm_add_ps(__m128 a, __m128 b)

SUBSS xmm1, xmm2/m32

Subtracts the low single-precision floating-point value in the source operand (second operand) from the low single-precision floating-point value in the destination operand (first operand), and stores the single-precision floating-point result in the destination operand. The source operand can be an XMM register or a 32-bit memory location. The destination operand is an XMM register. The three high-order doublewords of the destination operand remain unchanged.

DEST[31-0] ? DEST[31-0] – SRC[31-0];
SUBSS __m128 _mm_sub_ss(__m128 a, __m128 b)

SUBPS xmm1 xmm2/m128

Performs a SIMD subtract of the four packed single-precision floating-point values in the source operand (second operand) from the four packed single-precision floating-point values in the destination operand (first operand), and stores the packed single-precision floating-point results in the destination operand. The source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register.

DEST[31-0] ? DEST[31-0] ? SRC[31-0];
DEST[63-32] ? DEST[63-32] ? SRC[63-32];
DEST[95-64] ? DEST[95-64] ? SRC[95-64];
DEST[127-96] ? DEST[127-96] ? SRC[127-96];
SUBPS __m128 _mm_sub_ps(__m128 a, __m128 b)

MULSS xmm1, xmm2/m32

Multiplies the low single-precision floating-point value from the source operand (second operand) by the low single-precision floating-point value in the destination operand (first operand), and stores the single-precision floating-point result in the destination operand. The source operand can be an XMM register or a 32-bit memory location. The destination operand is an XMM register. The three high-order doublewords of the destination operand remain unchanged.

DEST[31-0] ? DEST[31-0] * SRC[31-0];
MULSS __m128 _mm_mul_ss(__m128 a, __m128 b)

MULPS xmm1, xmm2/m128

Performs a SIMD multiply of the four packed single-precision floating-point values from the source operand (second operand) and the destination operand (first operand), and stores the packed single-precision floating-point results in the destination operand. The source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register.
DEST[31-0] ? DEST[31-0] * SRC[31-0];
DEST[63-32] ? DEST[63-32] * SRC[63-32];
DEST[95-64] ? DEST[95-64] * SRC[95-64];
DEST[127-96] ? DEST[127-96] * SRC[127-96];
MULPS __m128 _mm_mul_ps(__m128 a, __m128 b)

DIVSS xmm1, xmm2/m32

Divides the low single-precision floating-point value in the destination operand (first operand) by the low single-precision floating-point value in the source operand (second operand), and stores the single-precision floating-point result in the destination operand. The source operand can be an XMM register or a 32-bit memory location. The destination operand is an XMM register. The three high-order doublewords of the destination operand remain unchanged.

DEST[31-0] ? DEST[31-0] / SRC[31-0];
DIVSS __m128 _mm_div_ss(__m128 a, __m128 b)

DIVPS xmm1, xmm2/m128

Performs a SIMD divide of the two packed single-precision floating-point values in the destination operand (first operand) by the two packed single-precision floating-point values in the source operand (second operand), and stores the packed single-precision floating-point results in the destination operand. The source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register.

DEST[31-0] ? DEST[31-0] / (SRC[31-0]);
DEST[63-32] ? DEST[63-32] / (SRC[63-32]);
DEST[95-64] ? DEST[95-64] / (SRC[95-64]);
DEST[127-96] ? DEST[127-96] / (SRC[127-96]);
DIVPS __m128 _mm_div_ps(__m128 a, __m128 b)

SQRTSS xmm1, xmm2/m32

Computes the square root of the low single-precision floating-point value in the source operand (second operand) and stores the single-precision floating-point result in the destination operand. The source operand can be an XMM register or a 32-bit memory location. The destination operand is an XMM register. The three high-order doublewords of the destination operand remains unchanged.

DEST[31-0] ? SQRT (SRC[31-0]);
SQRTSS __m128 _mm_sqrt_ss(__m128 a)

SQRTPS xmm1, xmm2/m128

Performs a SIMD computation of the square roots of the four packed single-precision floating point values in the source operand (second operand) stores the packed single-precision floating point results in the destination operand. The source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register.

DEST[31-0] ? SQRT(SRC[31-0]);
DEST[63-32] ? SQRT(SRC[63-32]);
DEST[95-64] ? SQRT(SRC[95-64]);
DEST[127-96] ? SQRT(SRC[127-96]);
SQRTPS __m128 _mm_sqrt_ps(__m128 a)

MAXSS xmm1, xmm2/m32

Compares the low single-precision floating-point values in the destination operand (first operand) and the source operand (second operand), and returns the maximum value to the low doubleword of the destination operand. The source operand can be an XMM register or a 32-bit memory location. The destination operand is an XMM register. The three high-order doublewords of the destination operand remain unchanged.

DEST[63-0] ? IF ((DEST[31-0] = 0.0) AND (SRC[31-0] = 0.0)) THEN SRC[31-0]
ELSE IF (DEST[31-0] = SNaN) THEN SRC[31-0];
ELSE IF SRC[31-0] = SNaN) THEN SRC[31-0];
ELSE IF (DEST[31-0] > SRC[31-0])
THEN DEST[31-0]
ELSE SRC[31-0];
__m128d _mm_max_ss(__m128d a, __m128d b)

MAXPS xmm1, xmm2/m128

Performs a SIMD compare of the packed single-precision floating-point values in the destination operand (first operand) and the source operand (second operand), and returns the maximum value for each pair of values to the destination operand. The source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register.

DEST[31-0] ? IF ((DEST[31-0] = 0.0) AND (SRC[31-0] = 0.0)) THEN SRC[31-0]
ELSE IF (DEST[31-0] = SNaN) THEN SRC[31-0];
ELSE IF SRC[31-0] = SNaN) THEN SRC[31-0];
ELSE IF (DEST[31-0] > SRC[31-0])
THEN DEST[31-0]
ELSE SRC[31-0];
repeat operation for 2nd and 3rd doublewords
DEST[127-64] ? IF ((DEST[127-96] = 0.0) AND (SRC[127-96] = 0.0))
THEN SRC[127-96]
ELSE IF (DEST[127-96] = SNaN) THEN SRC[127-96];
ELSE IF SRC[127-96] = SNaN) THEN SRC[127-96];
ELSE IF (DEST[127-96] > SRC[127-96])
THEN DEST[127-96]
ELSE SRC[127-96];
__m128d _mm_max_ps(__m128d a, __m128d b)

MINSS xmm1, xmm2/m32

Compares the low single-precision floating-point values in the destination operand (first operand) and the source operand (second operand), and returns the minimum value to the low doubleword of the destination operand. The source operand can be an XMM register or a 32-bit memory location. The destination operand is an XMM register. The three high-order doublewords of the destination operand remain unchanged.

DEST[63-0] ? IF ((DEST[31-0] = 0.0) AND (SRC[31-0] = 0.0)) THEN SRC[31-0]
ELSE IF (DEST[31-0] = SNaN) THEN SRC[31-0];
ELSE IF SRC[31-0] = SNaN) THEN SRC[31-0];
ELSE IF (DEST[31-0] < SRC[31-0])
THEN DEST[31-0]
ELSE SRC[31-0];
__m128d _mm_min_ss(__m128d a, __m128d b)

MINPS xmm1, xmm2/m128

Performs a SIMD compare of the packed single-precision floating-point values in the destination operand (first operand) and the source operand (second operand), and returns the minimum value for each pair of values to the destination operand. The source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register.

DEST[63-0] ? IF ((DEST[31-0] = 0.0) AND (SRC[31-0] = 0.0)) THEN SRC[31-0]
ELSE IF (DEST[31-0] = SNaN) THEN SRC[31-0];
ELSE IF SRC[31-0] = SNaN) THEN SRC[31-0];
ELSE IF (DEST[31-0] > SRC[31-0])
THEN DEST[31-0]
ELSE SRC[31-0];
repeat operation for 2nd and 3rd doublewords
DEST[127-64] ? IF ((DEST127-96] = 0.0) AND (SRC[127-96] = 0.0))
THEN SRC[127-96]
ELSE IF (DEST[127-96] = SNaN) THEN SRC[127-96];
ELSE IF SRC[127-96] = SNaN) THEN SRC[127-96];
ELSE IF (DEST[127-96] < SRC[127-96])
THEN DEST[127-96]
ELSE SRC[127-96];
__m128d _mm_min_ps(__m128d a, __m128d b)

Code examples

The examples below use the C/C++ SSE intrinsics declared in <xmmintrin.h>. For simplicity, they use unaligned loads and stores with _mm_loadu_ps and _mm_storeu_ps, so the input arrays do not need to be 16-byte aligned.

Each __m128 value contains four single-precision floating-point values:

[ f0, f1, f2, f3 ]

The examples use _mm_setr_ps when building registers manually. The r means “reverse” compared with _mm_set_ps, and it lets the values appear in the natural left-to-right order:

__m128 v = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
// v contains [1.0, 2.0, 3.0, 4.0]

Adding four floats

This is the simplest example of packed SSE arithmetic. Four floating-point additions are performed with one instruction.

#include <xmmintrin.h>

void add4_floats(const float* a, const float* b, float* out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);

    __m128 result = _mm_add_ps(va, vb);

    _mm_storeu_ps(out, result);
}

If the input arrays contain:

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]

then the result is:

out = [11.0, 22.0, 33.0, 44.0]

Packed arithmetic: add, subtract, multiply, divide

The same pattern applies to the other basic arithmetic instructions.

#include <xmmintrin.h>

void arithmetic4(const float* a, const float* b,
                 float* addOut,
                 float* subOut,
                 float* mulOut,
                 float* divOut)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);

    __m128 addResult = _mm_add_ps(va, vb); // a + b
    __m128 subResult = _mm_sub_ps(va, vb); // a - b
    __m128 mulResult = _mm_mul_ps(va, vb); // a * b
    __m128 divResult = _mm_div_ps(va, vb); // a / b

    _mm_storeu_ps(addOut, addResult);
    _mm_storeu_ps(subOut, subResult);
    _mm_storeu_ps(mulOut, mulResult);
    _mm_storeu_ps(divOut, divResult);
}

For subtraction and division, the order of operands matters:

_mm_sub_ps(a, b); // a - b
_mm_div_ps(a, b); // a / b

Packed vs scalar arithmetic

Packed instructions operate on all four lanes. Scalar instructions operate only on the lowest lane and copy the upper three lanes from the first operand.

#include <xmmintrin.h>

void packed_vs_scalar(float* packedOut, float* scalarOut)
{
    __m128 a = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
    __m128 b = _mm_setr_ps(10.0f, 20.0f, 30.0f, 40.0f);

    __m128 packed = _mm_add_ps(a, b);
    __m128 scalar = _mm_add_ss(a, b);

    _mm_storeu_ps(packedOut, packed);
    _mm_storeu_ps(scalarOut, scalar);
}

The packed result is:

packedOut = [11.0, 22.0, 33.0, 44.0]

The scalar result is:

scalarOut = [11.0, 2.0, 3.0, 4.0]

Only the first value was added. The remaining three values were copied from a.

Processing an array

Most real SSE code processes arrays in chunks of four floats.

#include <xmmintrin.h>

void add_arrays(const float* a, const float* b, float* out, int count)
{
    int i = 0;

    for (; i + 3 < count; i += 4)
    {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);

        __m128 result = _mm_add_ps(va, vb);

        _mm_storeu_ps(&out[i], result);
    }

    // Handle the remaining 1, 2, or 3 elements.
    for (; i < count; ++i)
    {
        out[i] = a[i] + b[i];
    }
}

The first loop handles four floats per iteration. The second loop handles the final elements when count is not a multiple of four.

Scaling and offsetting an array

This example computes:

out[i] = input[i] * scale + offset;

for every element in the input array.

#include <xmmintrin.h>

void scale_and_offset(const float* input,
                      float* output,
                      int count,
                      float scale,
                      float offset)
{
    __m128 vscale = _mm_set1_ps(scale);
    __m128 voffset = _mm_set1_ps(offset);

    int i = 0;

    for (; i + 3 < count; i += 4)
    {
        __m128 x = _mm_loadu_ps(&input[i]);

        __m128 y = _mm_mul_ps(x, vscale);
        y = _mm_add_ps(y, voffset);

        _mm_storeu_ps(&output[i], y);
    }

    for (; i < count; ++i)
    {
        output[i] = input[i] * scale + offset;
    }
}

_mm_set1_ps broadcasts the same value to all four lanes:

_mm_set1_ps(2.0f) -> [2.0, 2.0, 2.0, 2.0]

This is useful when applying the same constant to several values.

Clamping values with MINPS and MAXPS

MINPS and MAXPS can be used to clamp four values at a time.

The following example clamps every value to the range [minValue, maxValue].

#include <xmmintrin.h>

void clamp_array(const float* input,
                 float* output,
                 int count,
                 float minValue,
                 float maxValue)
{
    __m128 vmin = _mm_set1_ps(minValue);
    __m128 vmax = _mm_set1_ps(maxValue);

    int i = 0;

    for (; i + 3 < count; i += 4)
    {
        __m128 x = _mm_loadu_ps(&input[i]);

        x = _mm_max_ps(x, vmin); // x = max(x, minValue)
        x = _mm_min_ps(x, vmax); // x = min(x, maxValue)

        _mm_storeu_ps(&output[i], x);
    }

    for (; i < count; ++i)
    {
        float x = input[i];

        if (x < minValue)
            x = minValue;

        if (x > maxValue)
            x = maxValue;

        output[i] = x;
    }
}

This kind of operation is common in image processing, audio processing, physics, and numerical code.

Square root of four values

The SQRTPS instruction computes four square roots in parallel.

#include <xmmintrin.h>

void sqrt_array(const float* input, float* output, int count)
{
    int i = 0;

    for (; i + 3 < count; i += 4)
    {
        __m128 x = _mm_loadu_ps(&input[i]);
        __m128 y = _mm_sqrt_ps(x);

        _mm_storeu_ps(&output[i], y);
    }

    for (; i < count; ++i)
    {
        output[i] = sqrtf(input[i]);
    }
}

This function assumes that the input values are non-negative. Negative inputs produce floating-point results according to the processor’s floating-point rules.

Computing squared distances

SSE works particularly well when related values are stored in separate arrays. This is often called a structure-of-arrays layout.

The following example computes the squared distance from four points to a reference point:

distanceSquared = (x - refX) * (x - refX)
                + (y - refY) * (y - refY);
#include <xmmintrin.h>

void squared_distance_2d(const float* x,
                         const float* y,
                         float* distanceSquared,
                         int count,
                         float refX,
                         float refY)
{
    __m128 vxRef = _mm_set1_ps(refX);
    __m128 vyRef = _mm_set1_ps(refY);

    int i = 0;

    for (; i + 3 < count; i += 4)
    {
        __m128 vx = _mm_loadu_ps(&x[i]);
        __m128 vy = _mm_loadu_ps(&y[i]);

        __m128 dx = _mm_sub_ps(vx, vxRef);
        __m128 dy = _mm_sub_ps(vy, vyRef);

        __m128 dx2 = _mm_mul_ps(dx, dx);
        __m128 dy2 = _mm_mul_ps(dy, dy);

        __m128 d2 = _mm_add_ps(dx2, dy2);

        _mm_storeu_ps(&distanceSquared[i], d2);
    }

    for (; i < count; ++i)
    {
        float dx = x[i] - refX;
        float dy = y[i] - refY;

        distanceSquared[i] = dx * dx + dy * dy;
    }
}

This example uses only subtraction, multiplication, and addition, but it performs the calculation for four points at a time.

Normalizing four 2D vectors

This example uses multiplication, addition, square root, and division.

Given arrays x and y, it computes normalized vectors:

length = sqrt(x * x + y * y)

outX = x / length
outY = y / length
#include <xmmintrin.h>

void normalize_2d_vectors(const float* x,
                          const float* y,
                          float* outX,
                          float* outY,
                          int count)
{
    int i = 0;

    for (; i + 3 < count; i += 4)
    {
        __m128 vx = _mm_loadu_ps(&x[i]);
        __m128 vy = _mm_loadu_ps(&y[i]);

        __m128 x2 = _mm_mul_ps(vx, vx);
        __m128 y2 = _mm_mul_ps(vy, vy);

        __m128 lengthSquared = _mm_add_ps(x2, y2);
        __m128 length = _mm_sqrt_ps(lengthSquared);

        __m128 nx = _mm_div_ps(vx, length);
        __m128 ny = _mm_div_ps(vy, length);

        _mm_storeu_ps(&outX[i], nx);
        _mm_storeu_ps(&outY[i], ny);
    }

    for (; i < count; ++i)
    {
        float length = sqrtf(x[i] * x[i] + y[i] * y[i]);

        outX[i] = x[i] / length;
        outY[i] = y[i] / length;
    }
}

This code assumes that no vector has zero length. Production code should handle zero-length vectors before dividing by the length.

Using approximate reciprocal square root

For some workloads, especially graphics and real-time simulations, an approximate reciprocal square root may be good enough.

Instead of computing:

1.0f / sqrtf(x)

SSE provides _mm_rsqrt_ps:

#include <xmmintrin.h>

void reciprocal_sqrt_approx(const float* input, float* output, int count)
{
    int i = 0;

    for (; i + 3 < count; i += 4)
    {
        __m128 x = _mm_loadu_ps(&input[i]);
        __m128 y = _mm_rsqrt_ps(x);

        _mm_storeu_ps(&output[i], y);
    }

    for (; i < count; ++i)
    {
        output[i] = 1.0f / sqrtf(input[i]);
    }
}

_mm_rsqrt_ps is faster than computing a square root and then dividing, but it is approximate. It should only be used when the loss of precision is acceptable.

Example test program

The following small program demonstrates packed and scalar addition.

#include <stdio.h>
#include <xmmintrin.h>

static void print4(const char* name, __m128 value)
{
    float out[4];
    _mm_storeu_ps(out, value);

    printf("%s = [%f, %f, %f, %f]\n",
           name,
           out[0],
           out[1],
           out[2],
           out[3]);
}

int main()
{
    __m128 a = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
    __m128 b = _mm_setr_ps(10.0f, 20.0f, 30.0f, 40.0f);

    __m128 packedAdd = _mm_add_ps(a, b);
    __m128 scalarAdd = _mm_add_ss(a, b);

    print4("a", a);
    print4("b", b);
    print4("_mm_add_ps(a, b)", packedAdd);
    print4("_mm_add_ss(a, b)", scalarAdd);

    return 0;
}

Expected output:

a = [1.000000, 2.000000, 3.000000, 4.000000]
b = [10.000000, 20.000000, 30.000000, 40.000000]
_mm_add_ps(a, b) = [11.000000, 22.000000, 33.000000, 44.000000]
_mm_add_ss(a, b) = [11.000000, 2.000000, 3.000000, 4.000000]

Common pitfalls

SSE arithmetic instructions are simple once the packed-lane model is clear, but a few details can easily lead to incorrect results, slower code, or confusing bugs.

Packed and scalar instructions behave differently

The packed instructions, such as ADDPS, SUBPS, MULPS, and DIVPS, operate on all four single-precision floating-point values in the XMM register.

The scalar instructions, such as ADDSS, SUBSS, MULSS, and DIVSS, operate only on the lowest 32-bit floating-point value. The upper three values are copied from the first operand.

For example:

// a = [1.0, 2.0, 3.0, 4.0]
// b = [10.0, 20.0, 30.0, 40.0]

__m128 r1 = _mm_add_ps(a, b);
// r1 = [11.0, 22.0, 33.0, 44.0]

__m128 r2 = _mm_add_ss(a, b);
// r2 = [11.0, 2.0, 3.0, 4.0]

This is often the source of unexpected results when scalar and packed instructions are mixed.

Memory layout is often more important than the instruction

SSE works best when the data is already stored in a SIMD-friendly layout. For example, processing four consecutive floats is straightforward:

x0, x1, x2, x3

But processing structures like this can be less efficient:

struct Point
{
    float x;
    float y;
    float z;
};

With an array of Point, the x, y, and z values are interleaved in memory. This may require extra shuffle or gather-like operations before arithmetic can be performed efficiently.

For SIMD-heavy code, a structure-of-arrays layout is often more efficient than an array-of-structures layout:

struct Points
{
    float* x;
    float* y;
    float* z;
};

This makes it easier to load four x values, four y values, or four z values at once.

Division and square root are relatively expensive

Instructions such as DIVPS, DIVSS, SQRTPS, and SQRTSS are usually slower than addition, subtraction, and multiplication.

When full precision is not required, SSE also provides approximate reciprocal and reciprocal-square-root instructions:

__m128 r = _mm_rcp_ps(x);    // approximate 1 / x
__m128 s = _mm_rsqrt_ps(x);  // approximate 1 / sqrt(x)

These approximate instructions can be useful in graphics, image processing, simulations, and other workloads where a small numerical error is acceptable. If more accuracy is needed, the approximation can often be refined with one or more Newton-Raphson iterations.

Be careful with MIN and MAX

The MINPS, MINSS, MAXPS, and MAXSS instructions are not always equivalent to a simple mathematical minimum or maximum in high-level code.

Special floating-point values can affect the result, especially:

  • NaN values
  • positive zero and negative zero
  • equal operands

This matters when the input data may contain invalid values, missing values, or results from previous operations such as division by zero. When exact NaN behavior matters, check the processor documentation instead of assuming that SSE MIN and MAX behave like std::min or std::max.

Scalar cleanup is often needed

When processing an array with packed SSE instructions, each iteration usually handles four floats. If the array length is not a multiple of four, the remaining elements must still be handled.

Example:

int i = 0;

for (; i + 3 < count; i += 4)
{
    __m128 a = _mm_loadu_ps(&inputA[i]);
    __m128 b = _mm_loadu_ps(&inputB[i]);
    __m128 r = _mm_add_ps(a, b);
    _mm_storeu_ps(&output[i], r);
}

for (; i < count; ++i)
{
    output[i] = inputA[i] + inputB[i];
}

The first loop processes four elements at a time. The second loop handles the final one, two, or three elements.

Avoid unnecessary loads and stores

SSE code can become slower than scalar code if values are repeatedly loaded from memory and stored back too often.

Try to keep values in XMM registers across several operations:

__m128 a = _mm_loadu_ps(input);
__m128 b = _mm_mul_ps(a, scale);
__m128 c = _mm_add_ps(b, offset);
_mm_storeu_ps(output, c);

This is usually better than storing the result after every intermediate operation.

Be aware of floating-point differences

SSE arithmetic follows floating-point rules, so results may differ slightly from scalar code depending on compiler settings, instruction selection, and evaluation order.

This is especially visible when:

  • summing many floating-point values
  • changing the order of operations
  • replacing division with reciprocal approximation
  • enabling aggressive compiler optimizations
  • mixing SSE code with older x87 floating-point code

Small differences are normal in floating-point arithmetic. For numerical code, compare results with a tolerance instead of expecting exact bit-for-bit equality.

Do not hand-vectorize too early

Modern compilers are often able to auto-vectorize simple loops. Before writing intrinsics manually, it is worth checking whether the compiler already generates good SSE or AVX code from clear scalar C or C++.

Manual SSE is most useful when:

  • the compiler cannot auto-vectorize the loop
  • the data layout requires explicit control
  • the code needs a specific instruction
  • the hot path has been measured and is worth optimizing
  • the same operation is applied to large arrays of data

For maintainability, start with simple scalar code, measure it, inspect the generated assembly, and only then replace the critical parts with explicit SSE intrinsics.

Modern notes: SSE, SSE2, AVX and compiler auto-vectorization

SSE is no longer the newest SIMD extension, but it remains useful as a baseline for understanding x86 SIMD. The same mental model, lanes, packed operations, scalar operations, alignment, and data layout, carries forward to SSE2, AVX, AVX2, and AVX-512.

Leave a Reply

Your email address will not be published. Required fields are marked *