SSE2 Intrinsics: double-precision and integer SIMD programming

May 27, 2010 - By Stefano Tommesani

SSE2 extends the original SSE instruction set with support for double-precision floating-point arithmetic and a wider set of integer SIMD operations.

The original SSE instructions operate mainly on four 32-bit single-precision floating-point values stored in a 128-bit XMM register. SSE2 adds the ability to operate on two 64-bit double-precision floating-point values in the same kind of register. It also adds integer SIMD operations using XMM registers, replacing many older MMX-based techniques with a cleaner 128-bit model.

This article introduces the most important SSE2 data types, naming conventions, arithmetic intrinsics, integer operations, memory operations, conversions, and common pitfalls.

What SSE2 adds over SSE

SSE introduced packed single-precision arithmetic. A 128-bit XMM register can contain four 32-bit floats:

[ f0 | f1 | f2 | f3 ]

SSE2 adds packed double-precision arithmetic. The same 128-bit register can contain two 64-bit doubles:

[ d0 | d1 ]

SSE2 also adds integer SIMD support using the __m128i type. A 128-bit integer vector can be interpreted in several ways:

16 x 8-bit integers
 8 x 16-bit integers
 4 x 32-bit integers
 2 x 64-bit integers

The instruction or intrinsic determines how the bits are interpreted.
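As a minimal sketch of this point, the following function adds the same two 16-byte inputs twice, once as sixteen 8-bit lanes and once as four 32-bit lanes. The function name and parameters are hypothetical and chosen only for illustration.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Hypothetical helper: adds the same 128 bits under two lane interpretations. */
void add_as_u8_and_u32(const uint8_t a[16], const uint8_t b[16],
                       uint8_t out8[16], uint8_t out32[16])
{
    __m128i va = _mm_loadu_si128((const __m128i*)a);
    __m128i vb = _mm_loadu_si128((const __m128i*)b);

    /* 8-bit lanes: a carry out of one byte is discarded. */
    _mm_storeu_si128((__m128i*)out8, _mm_add_epi8(va, vb));

    /* 32-bit lanes: carries propagate across the four bytes of each lane. */
    _mm_storeu_si128((__m128i*)out32, _mm_add_epi32(va, vb));
}
```

If every byte of a is 0xFF and b holds a 1 in the lowest byte of each 32-bit group, the 8-bit addition clears only those lowest bytes, while the 32-bit addition wraps each whole lane to zero.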

Header file

Most SSE2 intrinsics are declared in:

#include <emmintrin.h>

This header includes the SSE2 intrinsic definitions for double-precision floating-point operations, integer vector operations, memory operations, comparisons, conversions, and related functionality.

Main SSE2 data types

SSE2 code usually works with three 128-bit vector types.

Type	Meaning	Typical contents
`__m128`	SSE single-precision vector	4 floats
`__m128d`	SSE2 double-precision vector	2 doubles
`__m128i`	SSE2 integer vector	integers of different lane widths

This article focuses mostly on __m128d and __m128i.

Packed double and scalar double

SSE2 double-precision floating-point intrinsics usually come in two forms:

Suffix	Meaning	Operation
`PD`	Packed Double-precision	operates on both 64-bit double lanes
`SD`	Scalar Double-precision	operates only on the low 64-bit double lane

For example:

_mm_add_pd(a, b)

adds both double-precision lanes:

[ a0 + b0 | a1 + b1 ]

But:

_mm_add_sd(a, b)

adds only the low lane and copies the high lane from the first operand:

[ a0 + b0 | a1 ]

This behavior is one of the most important things to understand when reading or writing SSE2 code.

Naming conventions

SSE2 intrinsic names are easier to understand once the suffixes are familiar.

Name fragment	Meaning
`_mm`	SIMD intrinsic prefix
`add`, `sub`, `mul`, `div`	arithmetic operation
`pd`	packed double-precision floating-point
`sd`	scalar double-precision floating-point
`epi8`	packed 8-bit integers
`epi16`	packed 16-bit integers
`epi32`	packed 32-bit integers
`epi64`	packed 64-bit integers
`si128`	128-bit integer vector
`loadu`	unaligned load
`storeu`	unaligned store

For example:

_mm_add_pd

means “add packed doubles.”

_mm_add_epi16

means “add packed 16-bit integers.”

_mm_storeu_si128

means “store an unaligned 128-bit integer vector.”

Floating-point arithmetic intrinsics

The following table summarizes the most common SSE2 double-precision arithmetic intrinsics.

In the examples below:

a = [ a0 | a1 ]
b = [ b0 | b1 ]

Intrinsic	Instruction	Operation	Result
`_mm_add_pd(a, b)`	`ADDPD`	packed add	`[a0 + b0 | a1 + b1]`
`_mm_add_sd(a, b)`	`ADDSD`	scalar add	`[a0 + b0 | a1]`
`_mm_sub_pd(a, b)`	`SUBPD`	packed subtract	`[a0 - b0 | a1 - b1]`
`_mm_sub_sd(a, b)`	`SUBSD`	scalar subtract	`[a0 - b0 | a1]`
`_mm_mul_pd(a, b)`	`MULPD`	packed multiply	`[a0 * b0 | a1 * b1]`
`_mm_mul_sd(a, b)`	`MULSD`	scalar multiply	`[a0 * b0 | a1]`
`_mm_div_pd(a, b)`	`DIVPD`	packed divide	`[a0 / b0 | a1 / b1]`
`_mm_div_sd(a, b)`	`DIVSD`	scalar divide	`[a0 / b0 | a1]`
`_mm_sqrt_pd(a)`	`SQRTPD`	packed square root	`[sqrt(a0) | sqrt(a1)]`
`_mm_sqrt_sd(a, b)`	`SQRTSD`	scalar square root	`[sqrt(b0) | a1]`
`_mm_min_pd(a, b)`	`MINPD`	packed minimum	`[min(a0,b0) | min(a1,b1)]`
`_mm_min_sd(a, b)`	`MINSD`	scalar minimum	`[min(a0,b0) | a1]`
`_mm_max_pd(a, b)`	`MAXPD`	packed maximum	`[max(a0,b0) | max(a1,b1)]`
`_mm_max_sd(a, b)`	`MAXSD`	scalar maximum	`[max(a0,b0) | a1]`

Notice the special case for _mm_sqrt_sd(a, b). Unlike most scalar arithmetic operations, the value being square-rooted comes from the low lane of the second operand, while the high lane is copied from the first operand.
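The operand roles of _mm_sqrt_sd can be checked with a small sketch like the one below. The helper name is hypothetical; it only wraps the intrinsic so that the two inputs and the result are visible as plain arrays.

```c
#include <emmintrin.h>

/* Hypothetical helper: out[0] = sqrt(b[0]), out[1] = a[1]. */
void sqrt_sd_demo(const double a[2], const double b[2], double out[2])
{
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);

    /* Low lane: square root of the low lane of vb.
       High lane: copied from va. */
    _mm_storeu_pd(out, _mm_sqrt_sd(va, vb));
}
```

With a = [100.0, 2.0] and b = [9.0, 50.0], the result is [3.0, 2.0]: the low lane comes from b, the high lane from a.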

Example: adding two doubles

This is the simplest packed double-precision SSE2 example. The function adds two pairs of doubles using one vector operation.

#include <emmintrin.h>

void add2_doubles(const double* a, const double* b, double* out)
{
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);

    __m128d result = _mm_add_pd(va, vb);

    _mm_storeu_pd(out, result);
}

If the input arrays contain:

a = [1.0, 2.0]
b = [10.0, 20.0]

then the output is:

out = [11.0, 22.0]

Example: packed vs scalar double arithmetic

Packed double instructions operate on both lanes. Scalar double instructions operate only on the low lane and copy the high lane from the first operand.

#include <emmintrin.h>

void packed_vs_scalar(double* packedOut, double* scalarOut)
{
    __m128d a = _mm_setr_pd(1.0, 2.0);
    __m128d b = _mm_setr_pd(10.0, 20.0);

    __m128d packed = _mm_add_pd(a, b);
    __m128d scalar = _mm_add_sd(a, b);

    _mm_storeu_pd(packedOut, packed);
    _mm_storeu_pd(scalarOut, scalar);
}

The packed result is:

packedOut = [11.0, 22.0]

The scalar result is:

scalarOut = [11.0, 2.0]

Only the first lane was added. The second lane was copied from a.

Example: processing an array of doubles

Most real SSE2 code processes arrays in chunks of two doubles.

#include <emmintrin.h>

void add_double_arrays(const double* a,
                       const double* b,
                       double* out,
                       int count)
{
    int i = 0;

    for (; i + 1 < count; i += 2)
    {
        __m128d va = _mm_loadu_pd(&a[i]);
        __m128d vb = _mm_loadu_pd(&b[i]);

        __m128d result = _mm_add_pd(va, vb);

        _mm_storeu_pd(&out[i], result);
    }

    // Handle the remaining element if count is odd.
    for (; i < count; ++i)
    {
        out[i] = a[i] + b[i];
    }
}

The vector loop handles two doubles per iteration. The scalar cleanup loop handles the final element when the array length is odd.

Example: scale and offset an array of doubles

This example computes:

out[i] = input[i] * scale + offset

for every element in the array.

#include <emmintrin.h>

void scale_and_offset_doubles(const double* input,
                              double* output,
                              int count,
                              double scale,
                              double offset)
{
    __m128d vscale = _mm_set1_pd(scale);
    __m128d voffset = _mm_set1_pd(offset);

    int i = 0;

    for (; i + 1 < count; i += 2)
    {
        __m128d x = _mm_loadu_pd(&input[i]);

        __m128d y = _mm_mul_pd(x, vscale);
        y = _mm_add_pd(y, voffset);

        _mm_storeu_pd(&output[i], y);
    }

    for (; i < count; ++i)
    {
        output[i] = input[i] * scale + offset;
    }
}

_mm_set1_pd broadcasts one double value to both lanes:

_mm_set1_pd(2.0) -> [2.0 | 2.0]

This is useful when the same constant must be applied to multiple values.

Example: clamping doubles with MINPD and MAXPD

The MINPD and MAXPD instructions can be used to clamp values to a range.

The following function clamps every value to the interval:

[minValue, maxValue]

#include <emmintrin.h>

void clamp_doubles(const double* input,
                   double* output,
                   int count,
                   double minValue,
                   double maxValue)
{
    __m128d vmin = _mm_set1_pd(minValue);
    __m128d vmax = _mm_set1_pd(maxValue);

    int i = 0;

    for (; i + 1 < count; i += 2)
    {
        __m128d x = _mm_loadu_pd(&input[i]);

        x = _mm_max_pd(x, vmin);
        x = _mm_min_pd(x, vmax);

        _mm_storeu_pd(&output[i], x);
    }

    for (; i < count; ++i)
    {
        double x = input[i];

        if (x < minValue)
            x = minValue;

        if (x > maxValue)
            x = maxValue;

        output[i] = x;
    }
}

This pattern is common in numerical code, image processing, signal processing, and physics simulations.

Integer SIMD with SSE2

SSE2 also introduced many integer SIMD operations using the __m128i type.

A __m128i register is just 128 bits. The same bits can be interpreted as different lane widths depending on the intrinsic:

16 x 8-bit integers
 8 x 16-bit integers
 4 x 32-bit integers
 2 x 64-bit integers

For example:

Intrinsic	Operation
`_mm_add_epi8(a, b)`	add sixteen 8-bit integers
`_mm_add_epi16(a, b)`	add eight 16-bit integers
`_mm_add_epi32(a, b)`	add four 32-bit integers
`_mm_add_epi64(a, b)`	add two 64-bit integers
`_mm_sub_epi8(a, b)`	subtract sixteen 8-bit integers
`_mm_sub_epi16(a, b)`	subtract eight 16-bit integers
`_mm_sub_epi32(a, b)`	subtract four 32-bit integers
`_mm_sub_epi64(a, b)`	subtract two 64-bit integers
`_mm_mullo_epi16(a, b)`	multiply 16-bit integers and keep the low 16 bits

Example: adding 32-bit integers

This function adds four 32-bit integers at a time.

#include <emmintrin.h>

void add_int32_arrays(const int* a,
                      const int* b,
                      int* out,
                      int count)
{
    int i = 0;

    for (; i + 3 < count; i += 4)
    {
        __m128i va = _mm_loadu_si128((const __m128i*)&a[i]);
        __m128i vb = _mm_loadu_si128((const __m128i*)&b[i]);

        __m128i result = _mm_add_epi32(va, vb);

        _mm_storeu_si128((__m128i*)&out[i], result);
    }

    for (; i < count; ++i)
    {
        out[i] = a[i] + b[i];
    }
}

The vector loop processes four integers per iteration.

Saturating arithmetic

Some SSE2 integer operations support saturation.

Normal integer addition wraps on overflow. Saturating addition clamps the result to the minimum or maximum value representable by the lane type.

For unsigned 8-bit integers:

250 + 20 = 255    // saturating
250 + 20 = 14     // wrapping modulo 256

SSE2 provides both signed and unsigned saturating operations for 8-bit and 16-bit lanes.

Intrinsic	Meaning
`_mm_adds_epi8(a, b)`	signed saturating add, 8-bit lanes
`_mm_adds_epi16(a, b)`	signed saturating add, 16-bit lanes
`_mm_adds_epu8(a, b)`	unsigned saturating add, 8-bit lanes
`_mm_adds_epu16(a, b)`	unsigned saturating add, 16-bit lanes
`_mm_subs_epi8(a, b)`	signed saturating subtract, 8-bit lanes
`_mm_subs_epi16(a, b)`	signed saturating subtract, 16-bit lanes
`_mm_subs_epu8(a, b)`	unsigned saturating subtract, 8-bit lanes
`_mm_subs_epu16(a, b)`	unsigned saturating subtract, 16-bit lanes

Saturating arithmetic is especially useful in image and audio processing, where values often need to stay within a fixed range.

Example: brighten 8-bit pixels with saturation

The following example adds a brightness value to unsigned 8-bit pixels and clamps the result to 255.

#include <emmintrin.h>

void brighten_u8(const unsigned char* input,
                 unsigned char* output,
                 int count,
                 unsigned char amount)
{
    __m128i vamount = _mm_set1_epi8((char)amount);

    int i = 0;

    for (; i + 15 < count; i += 16)
    {
        __m128i pixels = _mm_loadu_si128((const __m128i*)&input[i]);

        __m128i result = _mm_adds_epu8(pixels, vamount);

        _mm_storeu_si128((__m128i*)&output[i], result);
    }

    for (; i < count; ++i)
    {
        int value = input[i] + amount;

        if (value > 255)
            value = 255;

        output[i] = (unsigned char)value;
    }
}

The SSE2 loop processes sixteen pixels per iteration.

Memory operations

SSE2 includes aligned and unaligned load/store intrinsics.

For double-precision floating-point values:

Intrinsic	Meaning
`_mm_load_pd(ptr)`	load two aligned doubles
`_mm_loadu_pd(ptr)`	load two unaligned doubles
`_mm_store_pd(ptr, v)`	store two aligned doubles
`_mm_storeu_pd(ptr, v)`	store two unaligned doubles

For integer vectors:

Intrinsic	Meaning
`_mm_load_si128(ptr)`	load aligned 128-bit integer vector
`_mm_loadu_si128(ptr)`	load unaligned 128-bit integer vector
`_mm_store_si128(ptr, v)`	store aligned 128-bit integer vector
`_mm_storeu_si128(ptr, v)`	store unaligned 128-bit integer vector

Aligned loads and stores require the memory address to be 16-byte aligned. Unaligned loads and stores work with arbitrary addresses.

For simple and safe code, use the unaligned versions unless you know that the memory is correctly aligned.

__m128d a = _mm_loadu_pd(ptr);
_mm_storeu_pd(out, a);

Modern processors handle unaligned memory accesses much better than early SSE2-era processors, but alignment can still matter in performance-critical loops.
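When alignment is under your control, the aligned intrinsics can be used safely. The sketch below assumes a C11 compiler and uses _Alignas(16) to obtain suitably aligned storage; the function name is hypothetical.

```c
#include <emmintrin.h>

/* Hypothetical helper: doubles two values using aligned load and store. */
void add_to_self_aligned(double out[2], double x0, double x1)
{
    /* _Alignas(16) (C11) provides the 16-byte alignment that
       _mm_load_pd and _mm_store_pd require. */
    _Alignas(16) double buffer[2] = { x0, x1 };

    __m128d v = _mm_load_pd(buffer);   /* aligned load */
    v = _mm_add_pd(v, v);
    _mm_store_pd(buffer, v);           /* aligned store */

    /* The caller's array has unknown alignment, so copy element by element. */
    out[0] = buffer[0];
    out[1] = buffer[1];
}
```

Heap memory needs the same guarantee, for example from an aligned allocation function, before the aligned intrinsics may be applied to it.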

Comparisons

SSE2 provides comparison intrinsics for packed and scalar doubles.

Examples:

Intrinsic	Meaning
`_mm_cmpeq_pd(a, b)`	compare packed doubles for equality
`_mm_cmplt_pd(a, b)`	compare packed doubles for less-than
`_mm_cmple_pd(a, b)`	compare packed doubles for less-than-or-equal
`_mm_cmpgt_pd(a, b)`	compare packed doubles for greater-than
`_mm_cmpge_pd(a, b)`	compare packed doubles for greater-than-or-equal
`_mm_cmpneq_pd(a, b)`	compare packed doubles for not-equal

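The result of a comparison is a mask. Each lane is either all bits set (true) or all bits clear (false). If either input of a lane is NaN, every comparison listed above yields false except _mm_cmpneq_pd, which yields true.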

That mask can then be used with logical operations to select or combine values.
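One way to inspect such a mask is _mm_movemask_pd, which gathers the sign bit of each double lane into an ordinary integer. The following is a minimal sketch; the helper name is hypothetical.

```c
#include <emmintrin.h>

/* Hypothetical helper: bit 0 is set if a[0] < b[0], bit 1 if a[1] < b[1]. */
int less_than_mask(const double a[2], const double b[2])
{
    __m128d mask = _mm_cmplt_pd(_mm_loadu_pd(a), _mm_loadu_pd(b));

    /* Each true lane is all ones, so its sign bit is 1. */
    return _mm_movemask_pd(mask);
}
```

For a = [1.0, 5.0] and b = [2.0, 3.0], only the low lane satisfies the comparison, so the function returns 1.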

Logical operations

SSE2 includes bitwise logical operations for floating-point and integer vectors.

For double-precision vectors:

Intrinsic	Meaning
`_mm_and_pd(a, b)`	bitwise AND
`_mm_or_pd(a, b)`	bitwise OR
`_mm_xor_pd(a, b)`	bitwise XOR
`_mm_andnot_pd(a, b)`	bitwise AND NOT: computes (NOT a) AND b

For integer vectors:

Intrinsic	Meaning
`_mm_and_si128(a, b)`	bitwise AND
`_mm_or_si128(a, b)`	bitwise OR
`_mm_xor_si128(a, b)`	bitwise XOR
`_mm_andnot_si128(a, b)`	bitwise AND NOT: computes (NOT a) AND b

These operations are often used with comparison masks.
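A common use is a branch-free select: a comparison mask picks one value where it is true and another where it is false. The sketch below replaces every value smaller than a threshold with a fallback value. The function and parameter names are hypothetical.

```c
#include <emmintrin.h>

/* Hypothetical helper: for each of two lanes,
   out = (in < threshold) ? fallback : in. */
void replace_below_threshold(const double in[2], double out[2],
                             double threshold, double fallback)
{
    __m128d x    = _mm_loadu_pd(in);
    __m128d mask = _mm_cmplt_pd(x, _mm_set1_pd(threshold));

    /* Keep the fallback in lanes where the mask is all ones. */
    __m128d from_fallback = _mm_and_pd(mask, _mm_set1_pd(fallback));

    /* Keep x in lanes where the mask is all zeros: (NOT mask) AND x. */
    __m128d from_input = _mm_andnot_pd(mask, x);

    _mm_storeu_pd(out, _mm_or_pd(from_fallback, from_input));
}
```

With in = [1.0, 5.0], threshold 2.0 and fallback -1.0, the result is [-1.0, 5.0].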

Conversions

SSE2 includes conversion intrinsics between double-precision floating-point values and integers.

Some commonly used conversions are:

Intrinsic	Meaning
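`_mm_cvtepi32_pd(a)`	convert the two low 32-bit integers to two doubles
`_mm_cvtpd_epi32(a)`	convert two doubles to two 32-bit integers in the low lanes using the current rounding mode; the upper two lanes are zeroed
`_mm_cvttpd_epi32(a)`	convert two doubles to two 32-bit integers in the low lanes using truncation (round toward zero); the upper two lanes are zeroed
`_mm_cvtpd_ps(a)`	convert two doubles to two floats in the low lanes; the upper two lanes are zeroed
`_mm_cvtps_pd(a)`	convert the two low floats to two doubles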

The difference between rounding and truncating conversions matters.

For example:

3.9 converted with the default rounding mode (round to nearest) becomes 4
3.9 converted with truncation becomes 3

Use the intrinsic that matches the numerical behavior you need.
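The difference can be observed directly with a sketch like the following, which converts the same two doubles with both intrinsics. It assumes the default MXCSR rounding mode (round to nearest, ties to even); the helper name is hypothetical.

```c
#include <emmintrin.h>

/* Hypothetical helper: converts in[0] and in[1] with rounding and with truncation. */
void convert_round_and_truncate(const double in[2],
                                int rounded[2], int truncated[2])
{
    __m128d v = _mm_loadu_pd(in);
    int lanes[4];

    /* Uses the current rounding mode; results land in the two low lanes. */
    _mm_storeu_si128((__m128i*)lanes, _mm_cvtpd_epi32(v));
    rounded[0] = lanes[0];
    rounded[1] = lanes[1];

    /* Always rounds toward zero, regardless of the rounding mode. */
    _mm_storeu_si128((__m128i*)lanes, _mm_cvttpd_epi32(v));
    truncated[0] = lanes[0];
    truncated[1] = lanes[1];
}
```

For the input [3.9, -2.5], the rounded result is [4, -2] (the tie at -2.5 goes to the even value -2) and the truncated result is [3, -2].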

Common pitfalls

SSE2 intrinsics are powerful, but there are several details that can cause mistakes.

Confusing packed and scalar operations

PD operations work on both double lanes.

SD operations work only on the low double lane and copy the high lane from the first operand.

__m128d a = _mm_setr_pd(1.0, 2.0);
__m128d b = _mm_setr_pd(10.0, 20.0);

__m128d r = _mm_add_sd(a, b);

The result is:

[11.0 | 2.0]

not:

[11.0 | 22.0]

Forgetting that `_mm_sqrt_sd(a, b)` uses `b`

Most scalar arithmetic operations look like:

low result = a0 op b0
high result = a1

But _mm_sqrt_sd(a, b) computes:

low result = sqrt(b0)
high result = a1

That makes it slightly different from the other scalar double operations.

Assuming `MIN` and `MAX` are always simple mathematical min/max

MINPD, MINSD, MAXPD, and MAXSD have specific floating-point behavior, especially for NaN values and signed zero.

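If your data may contain NaNs (for example, from invalid operations such as 0.0 / 0.0), do not assume that these instructions behave exactly like a high-level mathematical minimum or maximum. When either input of a lane is NaN, or when both inputs are zeros of either sign, the instruction returns the value from the second operand, so the result depends on operand order. In a clamp written as _mm_max_pd(x, vmin), for example, a NaN in x is replaced by vmin, whereas a scalar test such as x < minValue leaves the NaN unchanged. Check the processor documentation for the exact behavior.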

Ignoring alignment

Aligned load/store intrinsics require 16-byte alignment.

_mm_load_pd(ptr);     // ptr must be aligned
_mm_loadu_pd(ptr);    // ptr may be unaligned

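Using an aligned load or store on an address that is not 16-byte aligned is an error: the underlying instructions (such as MOVAPD and MOVDQA) raise a general-protection fault, which usually terminates the program. Use unaligned loads unless alignment is guaranteed.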

Processing only vector-sized chunks

Packed double SSE2 code processes two doubles per iteration. Integer code may process 16, 8, 4, or 2 elements per iteration depending on lane width.

If the array length is not a multiple of the vector width, handle the remaining elements with a scalar cleanup loop.

Assuming hand-written intrinsics are always faster

Modern compilers can auto-vectorize many simple loops. Intrinsics are useful when you need explicit control, but they also make code harder to read and maintain.

Before rewriting scalar code with intrinsics:

Write clear scalar code.
Compile with optimization enabled.
Measure performance.
Inspect the generated assembly.
Use intrinsics only where they improve the measured hot path.

Mixing SSE2 and older x87 floating-point behavior

On modern x86-64 systems, floating-point code usually uses SSE/SSE2 instructions. Older 32-bit x86 code may use the x87 floating-point unit.

x87 uses an extended-precision internal format, while SSE2 double operations use 64-bit double-precision lanes. This can produce small numerical differences.

For numerical tests, compare floating-point results with a tolerance rather than expecting exact bit-for-bit equality.
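A tolerance check might look like the sketch below. The function name, the combination of a relative and an absolute tolerance, and the parameter names are assumptions made for illustration; suitable tolerance values depend on the computation being tested.

```c
/* Absolute value without depending on the math library. */
static double abs_double(double v)
{
    return v < 0.0 ? -v : v;
}

/* Hypothetical helper: returns 1 if x and y differ by no more than
   max(abs_tol, rel_tol * max(|x|, |y|)), otherwise 0. */
int nearly_equal(double x, double y, double rel_tol, double abs_tol)
{
    double diff  = abs_double(x - y);
    double ax    = abs_double(x);
    double ay    = abs_double(y);
    double scale = ax > ay ? ax : ay;
    double limit = rel_tol * scale;

    if (limit < abs_tol)
        limit = abs_tol;

    return diff <= limit;
}
```

For example, nearly_equal(0.1 + 0.2, 0.3, 1e-12, 0.0) returns 1 even though the two values are not bit-for-bit identical.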

Build notes

On x86-64, SSE2 is part of the baseline instruction set, so every 64-bit x86 processor supports it. On 32-bit x86, very old processors may not support SSE2, so software targeting old 32-bit systems may need runtime CPU feature detection.

Example GCC or Clang build command:

gcc -O2 -msse2 example.c -o example

For C++:

g++ -O2 -msse2 example.cpp -o example

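On MSVC, the intrinsics become available after including <emmintrin.h>. For x64 builds, SSE2 is always part of the baseline target. For 32-bit builds, the /arch:SSE2 option (the default in recent Visual Studio versions) allows the compiler to generate SSE2 code.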
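For 32-bit programs that must still run on processors without SSE2, a runtime check can select between an SSE2 path and a scalar fallback. The sketch below relies on __builtin_cpu_supports, a builtin specific to GCC and Clang on x86 targets; other compilers need a different mechanism, such as a CPUID query. The wrapper name is hypothetical.

```c
/* Hypothetical helper: returns 1 if the running CPU reports SSE2, else 0.
   __builtin_cpu_supports is a GCC/Clang builtin for x86 targets. */
int cpu_has_sse2(void)
{
    return __builtin_cpu_supports("sse2") ? 1 : 0;
}
```

A program would typically call this once at startup and store a function pointer to either the SSE2 or the scalar implementation.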

Complete example program

The following program demonstrates packed and scalar double addition.

#include <stdio.h>
#include <emmintrin.h>

static void print2(const char* name, __m128d value)
{
    double out[2];
    _mm_storeu_pd(out, value);

    printf("%s = [%f, %f]\n", name, out[0], out[1]);
}

int main()
{
    __m128d a = _mm_setr_pd(1.0, 2.0);
    __m128d b = _mm_setr_pd(10.0, 20.0);

    __m128d packedAdd = _mm_add_pd(a, b);
    __m128d scalarAdd = _mm_add_sd(a, b);

    print2("a", a);
    print2("b", b);
    print2("_mm_add_pd(a, b)", packedAdd);
    print2("_mm_add_sd(a, b)", scalarAdd);

    return 0;
}

Expected output:

a = [1.000000, 2.000000]
b = [10.000000, 20.000000]
_mm_add_pd(a, b) = [11.000000, 22.000000]
_mm_add_sd(a, b) = [11.000000, 2.000000]

Summary

SSE2 extends SSE in two major ways.

First, it adds double-precision floating-point SIMD operations. With __m128d, a single XMM register can hold two 64-bit doubles, and packed operations such as _mm_add_pd, _mm_mul_pd, and _mm_div_pd operate on both values in parallel.

Second, it adds integer SIMD operations using __m128i, allowing one 128-bit register to hold sixteen 8-bit integers, eight 16-bit integers, four 32-bit integers, or two 64-bit integers.

The most important concepts to remember are:

PD means packed double: both double lanes are used.
SD means scalar double: only the low double lane is used.
__m128d is used for two double-precision values.
__m128i is used for integer vectors.
Memory alignment matters for aligned loads and stores.
Scalar cleanup is needed when array lengths are not multiples of the vector width.
Intrinsics should be used where they improve measured performance, not just because they look lower-level.

SSE2 is no longer new, but it remains important. It is a baseline SIMD instruction set on modern x86-64 systems, and understanding it provides a solid foundation for later SIMD extensions such as SSE3, SSSE3, SSE4, AVX, AVX2, and AVX-512.

References

Intel Intrinsics Guide
https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
Intel 64 and IA-32 Architectures Software Developer Manuals
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html
Microsoft x86/x64 intrinsics list
https://learn.microsoft.com/en-us/cpp/intrinsics/x86-intrinsics-list
Microsoft x64/AMD64 intrinsics list
https://learn.microsoft.com/en-us/cpp/intrinsics/x64-amd64-intrinsics-list
