SSE, short for Streaming SIMD Extensions, is an x86 instruction set extension that allows one instruction to operate on multiple floating-point values at the same time.
The most common SSE programming model uses 128-bit XMM registers. Each register can hold four 32-bit single-precision floating-point values:
[ f0 | f1 | f2 | f3 ]
Instead of adding one float at a time, SSE can add four floats with one packed instruction. This is the basic idea behind SIMD: Single Instruction, Multiple Data.
This article introduces the most important SSE intrinsics for C and C++: data types, packed and scalar operations, arithmetic, reciprocal approximations, comparisons, logical operations, memory access, and common pitfalls.
Header file
SSE intrinsics are declared in:
#include <xmmintrin.h>
This header provides access to the original SSE intrinsics, including single-precision floating-point arithmetic, comparisons, logical operations, shuffles, loads, and stores.
The __m128 data type
The main SSE data type is:
__m128
A __m128 value represents a 128-bit XMM register containing four single-precision floating-point values.
Conceptually:
__m128 v = [ v0 | v1 | v2 | v3 ]
Each lane is a 32-bit float.
SSE intrinsics usually take one or more __m128 values and return another __m128 value.
Example:
__m128 c = _mm_add_ps(a, b);
This adds the four float lanes of a and b.
Packed single vs scalar single
Most SSE arithmetic intrinsics come in two forms:
| Suffix | Meaning | Operation |
|---|---|---|
PS | Packed Single-precision | operates on all four float lanes |
SS | Scalar Single-precision | operates only on the low float lane |
For example:
_mm_add_ps(a, b)
adds all four lanes:
[ a0 + b0 | a1 + b1 | a2 + b2 | a3 + b3 ]
But:
_mm_add_ss(a, b)
adds only the lowest lane and copies the upper three lanes from the first operand:
[ a0 + b0 | a1 | a2 | a3 ]
This distinction is essential. Many bugs in SSE code come from accidentally using a scalar instruction where a packed instruction was intended, or from forgetting that the upper lanes of a scalar result are copied from the first operand.
Naming conventions
SSE intrinsic names are easier to understand once the suffixes are familiar.
| Name fragment | Meaning |
|---|---|
_mm | SIMD intrinsic prefix |
add, sub, mul, div | arithmetic operation |
ps | packed single-precision floats |
ss | scalar single-precision float |
load | load from memory |
store | store to memory |
u | unaligned memory access |
set1 | broadcast one value to all lanes |
cmp | compare |
movemask | extract sign bits into an integer mask |
Examples:
_mm_add_ps
means “add packed single-precision floats.”
_mm_add_ss
means “add scalar single-precision floats.”
_mm_loadu_ps
means “load four unaligned single-precision floats.”
Creating SSE values
There are several ways to create __m128 values.
Set values manually
__m128 v = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
The r in _mm_setr_ps stands for reverse order relative to _mm_set_ps. It is often easier to read because the values appear in the natural left-to-right order:
[ 1.0 | 2.0 | 3.0 | 4.0 ]
Broadcast one value to all lanes
__m128 v = _mm_set1_ps(2.0f);
Result:
[ 2.0 | 2.0 | 2.0 | 2.0 ]
This is useful when applying the same constant to several values.
Create a zero vector
__m128 zero = _mm_setzero_ps();
Result:
[ 0.0 | 0.0 | 0.0 | 0.0 ]
Loading and storing floats
SSE code usually loads data from memory into an XMM register, performs operations, and stores the result back to memory.
Unaligned load and store
__m128 v = _mm_loadu_ps(ptr);
_mm_storeu_ps(out, v);
The u means unaligned. The pointer does not need to be 16-byte aligned.
Aligned load and store
__m128 v = _mm_load_ps(ptr);
_mm_store_ps(out, v);
Aligned loads and stores require the memory address to be 16-byte aligned.
For simple and safe code, use _mm_loadu_ps and _mm_storeu_ps unless you know that the data is correctly aligned.
Arithmetic intrinsics
The following table summarizes the most common SSE arithmetic intrinsics.
In the examples below:
a = [ a0 | a1 | a2 | a3 ]
b = [ b0 | b1 | b2 | b3 ]
| Intrinsic | Instruction | Operation | Result |
|---|---|---|---|
_mm_add_ps(a, b) | ADDPS | packed add | `[a0+b0 |
_mm_add_ss(a, b) | ADDSS | scalar add | `[a0+b0 |
_mm_sub_ps(a, b) | SUBPS | packed subtract | `[a0-b0 |
_mm_sub_ss(a, b) | SUBSS | scalar subtract | `[a0-b0 |
_mm_mul_ps(a, b) | MULPS | packed multiply | `[a0*b0 |
_mm_mul_ss(a, b) | MULSS | scalar multiply | `[a0*b0 |
_mm_div_ps(a, b) | DIVPS | packed divide | `[a0/b0 |
_mm_div_ss(a, b) | DIVSS | scalar divide | `[a0/b0 |
_mm_sqrt_ps(a) | SQRTPS | packed square root | `[sqrt(a0) |
_mm_sqrt_ss(a) | SQRTSS | scalar square root | `[sqrt(a0) |
_mm_min_ps(a, b) | MINPS | packed minimum | `[min(a0,b0) |
_mm_min_ss(a, b) | MINSS | scalar minimum | `[min(a0,b0) |
_mm_max_ps(a, b) | MAXPS | packed maximum | `[max(a0,b0) |
_mm_max_ss(a, b) | MAXSS | scalar maximum | `[max(a0,b0) |
For addition and multiplication, the order of operands does not change the mathematical result. For subtraction and division, the order matters:
_mm_sub_ps(a, b); // a - b
_mm_div_ps(a, b); // a / b
Reciprocal and reciprocal square root
SSE also provides approximate reciprocal and reciprocal-square-root operations.
| Intrinsic | Instruction | Operation |
|---|---|---|
_mm_rcp_ps(a) | RCPPS | approximate 1.0f / a for all four lanes |
_mm_rcp_ss(a) | RCPSS | approximate 1.0f / a0 for the low lane |
_mm_rsqrt_ps(a) | RSQRTPS | approximate 1.0f / sqrt(a) for all four lanes |
_mm_rsqrt_ss(a) | RSQRTSS | approximate 1.0f / sqrt(a0) for the low lane |
These instructions are faster than full division or square root, but they are approximate. They are often useful in graphics, image processing, games, and simulations where a small numerical error is acceptable.
If more accuracy is needed, the approximation can be refined with a Newton-Raphson step.
For reciprocal:
// y ~= 1 / x
__m128 y = _mm_rcp_ps(x);
// One Newton-Raphson refinement:
// y = y * (2 - x * y)
__m128 two = _mm_set1_ps(2.0f);
y = _mm_mul_ps(y, _mm_sub_ps(two, _mm_mul_ps(x, y)));
For reciprocal square root:
// y ~= 1 / sqrt(x)
__m128 y = _mm_rsqrt_ps(x);
// One Newton-Raphson refinement:
// y = 0.5 * y * (3 - x * y * y)
__m128 half = _mm_set1_ps(0.5f);
__m128 three = _mm_set1_ps(3.0f);
y = _mm_mul_ps(
_mm_mul_ps(half, y),
_mm_sub_ps(three, _mm_mul_ps(x, _mm_mul_ps(y, y))));
Example: adding four floats
This is the simplest SSE example. It adds four floats from one array to four floats from another array.
#include <xmmintrin.h>
void add4_floats(const float* a, const float* b, float* out)
{
__m128 va = _mm_loadu_ps(a);
__m128 vb = _mm_loadu_ps(b);
__m128 result = _mm_add_ps(va, vb);
_mm_storeu_ps(out, result);
}
If the input arrays contain:
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
then the output is:
out = [11.0, 22.0, 33.0, 44.0]
Example: packed vs scalar addition
This example shows the difference between _mm_add_ps and _mm_add_ss.
#include <xmmintrin.h>
void packed_vs_scalar(float* packedOut, float* scalarOut)
{
__m128 a = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
__m128 b = _mm_setr_ps(10.0f, 20.0f, 30.0f, 40.0f);
__m128 packed = _mm_add_ps(a, b);
__m128 scalar = _mm_add_ss(a, b);
_mm_storeu_ps(packedOut, packed);
_mm_storeu_ps(scalarOut, scalar);
}
The packed result is:
packedOut = [11.0, 22.0, 33.0, 44.0]
The scalar result is:
scalarOut = [11.0, 2.0, 3.0, 4.0]
Only the first lane was added. The other three lanes were copied from a.
Example: processing an array of floats
Most real SSE code processes arrays in groups of four floats.
#include <xmmintrin.h>
void add_float_arrays(const float* a,
const float* b,
float* out,
int count)
{
int i = 0;
for (; i + 3 < count; i += 4)
{
__m128 va = _mm_loadu_ps(&a[i]);
__m128 vb = _mm_loadu_ps(&b[i]);
__m128 result = _mm_add_ps(va, vb);
_mm_storeu_ps(&out[i], result);
}
// Handle the remaining 1, 2, or 3 elements.
for (; i < count; ++i)
{
out[i] = a[i] + b[i];
}
}
The vector loop handles four floats per iteration. The scalar cleanup loop handles the remaining elements when count is not a multiple of four.
Example: scale and offset an array
This example computes:
out[i] = input[i] * scale + offset
for every value in the input array.
#include <xmmintrin.h>
void scale_and_offset(const float* input,
float* output,
int count,
float scale,
float offset)
{
__m128 vscale = _mm_set1_ps(scale);
__m128 voffset = _mm_set1_ps(offset);
int i = 0;
for (; i + 3 < count; i += 4)
{
__m128 x = _mm_loadu_ps(&input[i]);
__m128 y = _mm_mul_ps(x, vscale);
y = _mm_add_ps(y, voffset);
_mm_storeu_ps(&output[i], y);
}
for (; i < count; ++i)
{
output[i] = input[i] * scale + offset;
}
}
This pattern is common in graphics, audio, image processing, and numerical code.
Example: clamping values with MINPS and MAXPS
SSE MINPS and MAXPS can clamp values to a range.
The following function clamps every value to:
[minValue, maxValue]
#include <xmmintrin.h>
void clamp_float_array(const float* input,
float* output,
int count,
float minValue,
float maxValue)
{
__m128 vmin = _mm_set1_ps(minValue);
__m128 vmax = _mm_set1_ps(maxValue);
int i = 0;
for (; i + 3 < count; i += 4)
{
__m128 x = _mm_loadu_ps(&input[i]);
x = _mm_max_ps(x, vmin);
x = _mm_min_ps(x, vmax);
_mm_storeu_ps(&output[i], x);
}
for (; i < count; ++i)
{
float x = input[i];
if (x < minValue)
x = minValue;
if (x > maxValue)
x = maxValue;
output[i] = x;
}
}
This is useful when values must stay inside a valid range, such as pixel values, audio samples, simulation parameters, or normalized coordinates.
Example: square roots
The SQRTPS instruction computes four square roots in parallel.
#include <xmmintrin.h>
#include <math.h>
void sqrt_float_array(const float* input,
float* output,
int count)
{
int i = 0;
for (; i + 3 < count; i += 4)
{
__m128 x = _mm_loadu_ps(&input[i]);
__m128 y = _mm_sqrt_ps(x);
_mm_storeu_ps(&output[i], y);
}
for (; i < count; ++i)
{
output[i] = sqrtf(input[i]);
}
}
This assumes that the input values are non-negative. Negative inputs follow the usual floating-point behavior for square root.
Example: normalizing 2D vectors
This example normalizes arrays of 2D vectors.
Given:
length = sqrt(x*x + y*y)
outX = x / length
outY = y / length
The SSE version processes four vectors at a time.
#include <xmmintrin.h>
#include <math.h>
void normalize_2d_vectors(const float* x,
const float* y,
float* outX,
float* outY,
int count)
{
int i = 0;
for (; i + 3 < count; i += 4)
{
__m128 vx = _mm_loadu_ps(&x[i]);
__m128 vy = _mm_loadu_ps(&y[i]);
__m128 x2 = _mm_mul_ps(vx, vx);
__m128 y2 = _mm_mul_ps(vy, vy);
__m128 lengthSquared = _mm_add_ps(x2, y2);
__m128 length = _mm_sqrt_ps(lengthSquared);
__m128 nx = _mm_div_ps(vx, length);
__m128 ny = _mm_div_ps(vy, length);
_mm_storeu_ps(&outX[i], nx);
_mm_storeu_ps(&outY[i], ny);
}
for (; i < count; ++i)
{
float length = sqrtf(x[i] * x[i] + y[i] * y[i]);
outX[i] = x[i] / length;
outY[i] = y[i] / length;
}
}
This function assumes that no vector has zero length. Production code should handle zero-length vectors before dividing by the length.
Example: approximate normalization with RSQRTPS
If full precision is not required, normalization can use _mm_rsqrt_ps.
Instead of computing:
length = sqrt(x*x + y*y)
outX = x / length
outY = y / length
we can compute:
invLength = 1 / sqrt(x*x + y*y)
outX = x * invLength
outY = y * invLength
#include <xmmintrin.h>
void normalize_2d_vectors_approx(const float* x,
const float* y,
float* outX,
float* outY,
int count)
{
int i = 0;
for (; i + 3 < count; i += 4)
{
__m128 vx = _mm_loadu_ps(&x[i]);
__m128 vy = _mm_loadu_ps(&y[i]);
__m128 x2 = _mm_mul_ps(vx, vx);
__m128 y2 = _mm_mul_ps(vy, vy);
__m128 lengthSquared = _mm_add_ps(x2, y2);
__m128 invLength = _mm_rsqrt_ps(lengthSquared);
__m128 nx = _mm_mul_ps(vx, invLength);
__m128 ny = _mm_mul_ps(vy, invLength);
_mm_storeu_ps(&outX[i], nx);
_mm_storeu_ps(&outY[i], ny);
}
for (; i < count; ++i)
{
float invLength = 1.0f / sqrtf(x[i] * x[i] + y[i] * y[i]);
outX[i] = x[i] * invLength;
outY[i] = y[i] * invLength;
}
}
This version is faster on many processors, but the result is approximate. Use it only when the precision is acceptable.
Comparisons
SSE comparisons produce masks.
Each lane in the result is either all bits set, meaning true, or all bits clear, meaning false.
Some common comparison intrinsics are:
| Intrinsic | Meaning |
|---|---|
_mm_cmpeq_ps(a, b) | compare packed floats for equality |
_mm_cmplt_ps(a, b) | compare packed floats for less-than |
_mm_cmple_ps(a, b) | compare packed floats for less-than-or-equal |
_mm_cmpgt_ps(a, b) | compare packed floats for greater-than |
_mm_cmpge_ps(a, b) | compare packed floats for greater-than-or-equal |
_mm_cmpneq_ps(a, b) | compare packed floats for not-equal |
Example:
__m128 a = _mm_setr_ps(1.0f, 5.0f, 3.0f, 8.0f);
__m128 b = _mm_set1_ps(4.0f);
__m128 mask = _mm_cmpgt_ps(a, b);
Conceptually, this compares:
[1.0 > 4.0 | 5.0 > 4.0 | 3.0 > 4.0 | 8.0 > 4.0]
Result:
[false | true | false | true]
Internally, those are not Boolean values like in C. They are bit masks.
Example: zero values below a threshold
This example keeps values greater than or equal to a threshold and replaces the others with zero.
#include <xmmintrin.h>
void threshold_to_zero(const float* input,
float* output,
int count,
float threshold)
{
__m128 vthreshold = _mm_set1_ps(threshold);
int i = 0;
for (; i + 3 < count; i += 4)
{
__m128 x = _mm_loadu_ps(&input[i]);
__m128 mask = _mm_cmpge_ps(x, vthreshold);
__m128 y = _mm_and_ps(mask, x);
_mm_storeu_ps(&output[i], y);
}
for (; i < count; ++i)
{
output[i] = input[i] >= threshold ? input[i] : 0.0f;
}
}
The comparison creates a mask. The bitwise AND keeps the original value where the mask is true and clears it where the mask is false.
Logical operations
SSE includes bitwise logical operations for __m128 values.
| Intrinsic | Meaning |
|---|---|
_mm_and_ps(a, b) | bitwise AND |
_mm_or_ps(a, b) | bitwise OR |
_mm_xor_ps(a, b) | bitwise XOR |
_mm_andnot_ps(a, b) | bitwise AND NOT |
These operations are commonly used with comparison masks.
For example:
__m128 mask = _mm_cmpgt_ps(a, b);
__m128 selected = _mm_and_ps(mask, a);
This keeps the lanes from a where the comparison is true and produces zero in the other lanes.
A common select pattern is:
// result = mask ? valueIfTrue : valueIfFalse
__m128 result = _mm_or_ps(
_mm_and_ps(mask, valueIfTrue),
_mm_andnot_ps(mask, valueIfFalse));
_mm_andnot_ps(mask, valueIfFalse) computes:
(~mask) & valueIfFalse
Extracting comparison results with MOVMSKPS
The _mm_movemask_ps intrinsic extracts the sign bit of each lane and packs those bits into a normal integer.
This is useful after comparisons.
#include <xmmintrin.h>
int all_greater_than_zero(__m128 x)
{
__m128 zero = _mm_setzero_ps();
__m128 mask = _mm_cmpgt_ps(x, zero);
int bits = _mm_movemask_ps(mask);
return bits == 0xF;
}
The result of _mm_movemask_ps has one bit per lane.
bit 0 -> lane 0
bit 1 -> lane 1
bit 2 -> lane 2
bit 3 -> lane 3
If all four comparison lanes are true, the result is:
0b1111 = 0xF
Shuffles
SSE can rearrange values inside XMM registers.
The most common shuffle intrinsic is:
_mm_shuffle_ps(a, b, mask)
The mask is a compile-time constant that chooses which lanes to copy.
One simple use is broadcasting the first lane to all lanes:
__m128 v = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
__m128 x = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
Result:
[1.0 | 1.0 | 1.0 | 1.0]
Shuffles are useful for rearranging data, duplicating values, and building more complex vector algorithms. They can also become expensive if overused, so data layout is often just as important as instruction choice.
Data layout matters
SSE works best when the values you want to process together are stored next to each other in memory.
This layout is SIMD-friendly:
x0, x1, x2, x3, x4, x5, x6, x7
Four x values can be loaded directly:
__m128 x = _mm_loadu_ps(&values[i]);
This layout is sometimes harder to vectorize:
struct Point
{
float x;
float y;
float z;
};
Point points[count];
The x, y, and z values are interleaved:
x0, y0, z0, x1, y1, z1, x2, y2, z2, ...
If you frequently process all x values together, all y values together, or all z values together, a structure-of-arrays layout may be better:
struct Points
{
float* x;
float* y;
float* z;
};
This makes it easy to load four x values, four y values, or four z values with one instruction.
Common pitfalls
SSE intrinsics are powerful, but several details can lead to wrong results or disappointing performance.
Confusing packed and scalar operations
PS operations work on all four lanes.
SS operations work only on the low lane and copy the upper lanes from the first operand.
__m128 a = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
__m128 b = _mm_setr_ps(10.0f, 20.0f, 30.0f, 40.0f);
__m128 r = _mm_add_ss(a, b);
The result is:
[11.0 | 2.0 | 3.0 | 4.0]
not:
[11.0 | 22.0 | 33.0 | 44.0]
Forgetting the operand order
For subtraction and division, operand order matters:
_mm_sub_ps(a, b); // a - b
_mm_sub_ps(b, a); // b - a
_mm_div_ps(a, b); // a / b
_mm_div_ps(b, a); // b / a
When reading assembly, also remember that assembly syntax and intrinsic syntax may make the destination operand feel different. Check the documentation when in doubt.
Using aligned loads on unaligned memory
Aligned loads and stores require 16-byte alignment:
_mm_load_ps(ptr); // ptr must be 16-byte aligned
_mm_store_ps(ptr, v);
If alignment is not guaranteed, use:
_mm_loadu_ps(ptr);
_mm_storeu_ps(ptr, v);
This is safer and usually fast enough on modern processors.
Forgetting scalar cleanup
SSE processes four floats per packed operation. If the array size is not a multiple of four, the final one, two, or three values still need to be processed.
A typical loop has two parts:
for (; i + 3 < count; i += 4)
{
// SSE vector loop
}
for (; i < count; ++i)
{
// scalar cleanup
}
Assuming reciprocal instructions are exact
_mm_rcp_ps and _mm_rsqrt_ps are approximate.
They are useful for speed, but they should not be used when full floating-point precision is required. Use _mm_div_ps or _mm_sqrt_ps when accuracy matters more than speed.
Assuming MIN and MAX are always simple mathematical operations
MINPS, MINSS, MAXPS, and MAXSS have specific floating-point behavior, especially for NaN values, signed zero, and equal operands.
If your data can contain NaNs or invalid values, check the exact instruction behavior instead of assuming that SSE min/max behaves like a high-level mathematical function.
Overusing shuffles
Shuffles are powerful, but they do not do arithmetic. If an algorithm spends too much time rearranging data before doing useful work, the SIMD speedup may disappear.
A better memory layout can often reduce the number of shuffles.
Hand-vectorizing too early
Modern compilers can auto-vectorize many simple loops.
Before writing explicit SSE intrinsics:
- Write clear scalar code.
- Compile with optimization enabled.
- Measure performance.
- Inspect the generated assembly.
- Use intrinsics only where they improve the measured hot path.
Manual SSE is most useful when the compiler cannot auto-vectorize the code, when you need a specific instruction, or when you need explicit control over data movement and numerical behavior.
Build notes
On GCC or Clang, enable SSE explicitly when targeting older 32-bit configurations:
gcc -O2 -msse example.c -o example
For C++:
g++ -O2 -msse example.cpp -o example
On modern x86-64 systems, SSE support is normally available as part of the baseline architecture. On very old 32-bit x86 systems, SSE support may need to be detected or enabled explicitly.
For MSVC, include the correct header and build with optimization enabled. On old 32-bit targets, the /arch option controls which SIMD instruction sets the compiler may use.
Complete example program
The following program demonstrates packed and scalar addition.
#include <stdio.h>
#include <xmmintrin.h>
static void print4(const char* name, __m128 value)
{
float out[4];
_mm_storeu_ps(out, value);
printf("%s = [%f, %f, %f, %f]\n",
name,
out[0],
out[1],
out[2],
out[3]);
}
int main()
{
__m128 a = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
__m128 b = _mm_setr_ps(10.0f, 20.0f, 30.0f, 40.0f);
__m128 packedAdd = _mm_add_ps(a, b);
__m128 scalarAdd = _mm_add_ss(a, b);
print4("a", a);
print4("b", b);
print4("_mm_add_ps(a, b)", packedAdd);
print4("_mm_add_ss(a, b)", scalarAdd);
return 0;
}
Expected output:
a = [1.000000, 2.000000, 3.000000, 4.000000]
b = [10.000000, 20.000000, 30.000000, 40.000000]
_mm_add_ps(a, b) = [11.000000, 22.000000, 33.000000, 44.000000]
_mm_add_ss(a, b) = [11.000000, 2.000000, 3.000000, 4.000000]
Summary
SSE intrinsics provide direct access to 128-bit SIMD operations on single-precision floating-point values.
The key ideas are:
__m128represents four 32-bit floats.PSmeans packed single-precision: all four lanes are used.SSmeans scalar single-precision: only the low lane is used._mm_loadu_psand_mm_storeu_psare safe for unaligned memory._mm_add_ps,_mm_sub_ps,_mm_mul_ps, and_mm_div_psperform four arithmetic operations at once._mm_rcp_psand_mm_rsqrt_psare fast but approximate.- comparisons produce masks, not ordinary Boolean values.
- data layout is often as important as instruction choice.
- scalar cleanup is needed when the number of elements is not a multiple of four.
SSE is old, but it remains an important foundation for understanding x86 SIMD programming. The same concepts — lanes, packed operations, scalar operations, alignment, masks, shuffles, and data layout — carry forward to SSE2, SSE3, SSE4, AVX, AVX2, and AVX-512.
References
- Intel Intrinsics Guide
https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html - Intel 64 and IA-32 Architectures Software Developer Manuals
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html - Microsoft x86 intrinsics list
https://learn.microsoft.com/en-us/cpp/intrinsics/x86-intrinsics-list - Microsoft x64/AMD64 intrinsics list
https://learn.microsoft.com/en-us/cpp/intrinsics/x64-amd64-intrinsics-list



