# SSE2 Intrinsics

## Floating-Point Intrinsics

**Arithmetic Operation Intrinsics**

Intrinsic name | Corresponding instruction | Operation | R0 value | R1 value |
---|---|---|---|---|
_mm_add_sd | ADDSD | Adds | a0 [op] b0 | a1 |
_mm_add_pd | ADDPD | Adds | a0 [op] b0 | a1 [op] b1 |
_mm_div_sd | DIVSD | Divides | a0 [op] b0 | a1 |
_mm_div_pd | DIVPD | Divides | a0 [op] b0 | a1 [op] b1 |
_mm_max_sd | MAXSD | Computes maximum | a0 [op] b0 | a1 |
_mm_max_pd | MAXPD | Computes maximum | a0 [op] b0 | a1 [op] b1 |
_mm_min_sd | MINSD | Computes minimum | a0 [op] b0 | a1 |
_mm_min_pd | MINPD | Computes minimum | a0 [op] b0 | a1 [op] b1 |
_mm_mul_sd | MULSD | Multiplies | a0 [op] b0 | a1 |
_mm_mul_pd | MULPD | Multiplies | a0 [op] b0 | a1 [op] b1 |
_mm_sqrt_sd | SQRTSD | Computes square root | sqrt(b0) | a1 |
_mm_sqrt_pd | SQRTPD | Computes square root | sqrt(a0) | sqrt(a1) |
_mm_sub_sd | SUBSD | Subtracts | a0 [op] b0 | a1 |
_mm_sub_pd | SUBPD | Subtracts | a0 [op] b0 | a1 [op] b1 |
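
The difference between the `_sd` (scalar) and `_pd` (packed) forms in the R0 and R1 columns can be illustrated with a minimal sketch. The helper names `add_packed` and `add_scalar` are hypothetical, and the unaligned load/store intrinsics `_mm_loadu_pd` and `_mm_storeu_pd` are SSE2 intrinsics that are not described in this section.

```c
#include <assert.h>     /* precondition checks */
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: packed add, r0 = a0 + b0 and r1 = a1 + b1. */
static void add_packed(const double a[2], const double b[2], double out[2])
{
    assert(a && b && out);
    _mm_storeu_pd(out, _mm_add_pd(_mm_loadu_pd(a), _mm_loadu_pd(b)));
}

/* Hypothetical helper: scalar add, r0 = a0 + b0 and r1 = a1 (copied from a). */
static void add_scalar(const double a[2], const double b[2], double out[2])
{
    assert(a && b && out);
    _mm_storeu_pd(out, _mm_add_sd(_mm_loadu_pd(a), _mm_loadu_pd(b)));
}
```

With `a = {1.0, 2.0}` and `b = {10.0, 20.0}`, the packed form yields `{11.0, 22.0}`, while the scalar form yields `{11.0, 2.0}` because the upper element passes through from `a`.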

## Logical Operations

__m128d _mm_andnot_pd (__m128d a, __m128d b); ANDNPD

Computes the bitwise `AND` of the 128-bit value in `b` and the bitwise `NOT` of the 128-bit value in `a`.

r0 := (~a0) & b0
r1 := (~a1) & b1

__m128d _mm_and_pd (__m128d a, __m128d b); ANDPD

Computes the bitwise `AND` of the two double-precision, floating-point values of `a` and `b`.

r0 := a0 & b0
r1 := a1 & b1

__m128d _mm_or_pd (__m128d a, __m128d b); ORPD

Computes the bitwise `OR` of the two double-precision, floating-point values of `a` and `b`.

r0 := a0 | b0
r1 := a1 | b1

__m128d _mm_xor_pd (__m128d a, __m128d b); XORPD

Computes the bitwise `XOR` of the two double-precision, floating-point values of `a` and `b`.

r0 := a0 ^ b0
r1 := a1 ^ b1
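
A common use of these bitwise operations is manipulating the sign bit of a double-precision value. The following is a sketch only: the helper name `abs_and_negate` is hypothetical, and it assumes the compiler preserves the sign of the constant `-0.0` (which may not hold under aggressive floating-point optimization flags).

```c
#include <assert.h>     /* precondition checks */
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: writes |v| and -v for both elements of `in`. */
static void abs_and_negate(const double in[2], double abs_out[2], double neg_out[2])
{
    assert(in && abs_out && neg_out);
    const __m128d sign_mask = _mm_set1_pd(-0.0);  /* only bit 63 set in each element */
    __m128d v = _mm_loadu_pd(in);

    /* (~sign_mask) & v clears the sign bit, giving the absolute value. */
    _mm_storeu_pd(abs_out, _mm_andnot_pd(sign_mask, v));

    /* v ^ sign_mask flips the sign bit, giving the negation. */
    _mm_storeu_pd(neg_out, _mm_xor_pd(v, sign_mask));
}
```

Note the operand order of `_mm_andnot_pd`: the first argument is the one that is inverted.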

**Comparison Intrinsics**

Intrinsic name | Corresponding instruction | Compare for |
---|---|---|
_mm_cmpeq_pd | CMPEQPD | Equality |
_mm_cmplt_pd | CMPLTPD | Less than |
_mm_cmple_pd | CMPLEPD | Less than or equal |
_mm_cmpgt_pd | CMPLTPDr | Greater than |
_mm_cmpge_pd | CMPLEPDr | Greater than or equal |
_mm_cmpord_pd | CMPORDPD | Ordered |
_mm_cmpunord_pd | CMPUNORDPD | Unordered |
_mm_cmpneq_pd | CMPNEQPD | Inequality |
_mm_cmpnlt_pd | CMPNLTPD | Not less than |
_mm_cmpnle_pd | CMPNLEPD | Not less than or equal |
_mm_cmpngt_pd | CMPNLTPDr | Not greater than |
_mm_cmpnge_pd | CMPNLEPDr | Not greater than or equal |
_mm_cmpeq_sd | CMPEQSD | Equality |
_mm_cmplt_sd | CMPLTSD | Less than |
_mm_cmple_sd | CMPLESD | Less than or equal |
_mm_cmpgt_sd | CMPLTSDr | Greater than |
_mm_cmpge_sd | CMPLESDr | Greater than or equal |
_mm_cmpord_sd | CMPORDSD | Ordered |
_mm_cmpunord_sd | CMPUNORDSD | Unordered |
_mm_cmpneq_sd | CMPNEQSD | Inequality |
_mm_cmpnlt_sd | CMPNLTSD | Not less than |
_mm_cmpnle_sd | CMPNLESD | Not less than or equal |
_mm_cmpngt_sd | CMPNLTSDr | Not greater than |
_mm_cmpnge_sd | CMPNLESDr | Not greater than or equal |
_mm_comieq_sd | COMISD | Equality |
_mm_comilt_sd | COMISD | Less than |
_mm_comile_sd | COMISD | Less than or equal |
_mm_comigt_sd | COMISD | Greater than |
_mm_comige_sd | COMISD | Greater than or equal |
_mm_comineq_sd | COMISD | Not equal |
_mm_ucomieq_sd | UCOMISD | Equality |
_mm_ucomilt_sd | UCOMISD | Less than |
_mm_ucomile_sd | UCOMISD | Less than or equal |
_mm_ucomigt_sd | UCOMISD | Greater than |
_mm_ucomige_sd | UCOMISD | Greater than or equal |
_mm_ucomineq_sd | UCOMISD | Not equal |
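
The `_mm_cmp*` intrinsics return a per-element mask (all ones where the comparison holds, all zeros otherwise), whereas the `_mm_comi*` and `_mm_ucomi*` intrinsics return an `int` for the lower element. The sketch below illustrates both styles; the helper names `lanes_less_than` and `low_lane_equal` are hypothetical.

```c
#include <assert.h>     /* precondition checks */
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: returns a 2-bit mask, bit i set when a[i] < b[i]. */
static int lanes_less_than(const double a[2], const double b[2])
{
    assert(a && b);
    __m128d mask = _mm_cmplt_pd(_mm_loadu_pd(a), _mm_loadu_pd(b));
    return _mm_movemask_pd(mask);  /* collects the sign bit of each element */
}

/* Hypothetical helper: returns 1 when a[0] == b[0], otherwise 0. */
static int low_lane_equal(const double a[2], const double b[2])
{
    assert(a && b);
    return _mm_comieq_sd(_mm_loadu_pd(a), _mm_loadu_pd(b));
}
```

The mask form is useful for branch-free selection with the logical intrinsics; the `int` form is convenient in ordinary `if` conditions.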

**Conversion Operations**

Intrinsic name | Corresponding instruction | Return type | Parameters |
---|---|---|---|
_mm_cvtpd_ps | CVTPD2PS | __m128 | (__m128d a) |
_mm_cvtps_pd | CVTPS2PD | __m128d | (__m128 a) |
_mm_cvtepi32_pd | CVTDQ2PD | __m128d | (__m128i a) |
_mm_cvtpd_epi32 | CVTPD2DQ | __m128i | (__m128d a) |
_mm_cvtsd_si32 | CVTSD2SI | int | (__m128d a) |
_mm_cvtsd_ss | CVTSD2SS | __m128 | (__m128 a, __m128d b) |
_mm_cvtsi32_sd | CVTSI2SD | __m128d | (__m128d a, int b) |
_mm_cvtss_sd | CVTSS2SD | __m128d | (__m128d a, __m128 b) |
_mm_cvttpd_epi32 | CVTTPD2DQ | __m128i | (__m128d a) |
_mm_cvttsd_si32 | CVTTSD2SI | int | (__m128d a) |
_mm_cvtepi32_ps | CVTDQ2PS | __m128 | (__m128i a) |
_mm_cvtps_epi32 | CVTPS2DQ | __m128i | (__m128 a) |
_mm_cvttps_epi32 | CVTTPS2DQ | __m128i | (__m128 a) |
_mm_cvtpd_pi32 | CVTPD2PI | __m64 | (__m128d a) |
_mm_cvttpd_pi32 | CVTTPD2PI | __m64 | (__m128d a) |
_mm_cvtpi32_pd | CVTPI2PD | __m128d | (__m64 a) |
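
The `cvt` and `cvtt` variants differ in how they handle the fractional part: the `cvtt` forms truncate toward zero, while the `cvt` forms round according to the current rounding mode. The sketch below assumes the default rounding mode (round to nearest); the helper names are hypothetical, and `_mm_storel_epi64` is used only to write the two 32-bit results to memory.

```c
#include <assert.h>     /* precondition checks */
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: converts two doubles to ints using the current rounding mode. */
static void to_int_rounded(const double in[2], int out[2])
{
    assert(in && out);
    __m128i r = _mm_cvtpd_epi32(_mm_loadu_pd(in));  /* results in the low two 32-bit elements */
    _mm_storel_epi64((__m128i *)out, r);            /* store the low 64 bits (two ints) */
}

/* Hypothetical helper: converts two doubles to ints, truncating toward zero. */
static void to_int_truncated(const double in[2], int out[2])
{
    assert(in && out);
    __m128i r = _mm_cvttpd_epi32(_mm_loadu_pd(in));
    _mm_storel_epi64((__m128i *)out, r);
}
```

For the input `{2.7, -2.7}`, rounding yields `{3, -3}` under the assumed default mode and truncation yields `{2, -2}`.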

## Miscellaneous Operations

__m128d _mm_unpackhi_pd (__m128d a, __m128d b); UNPCKHPD

Interleaves the upper double-precision, floating-point values of `a` and `b`.

r0 := a1
r1 := b1

__m128d _mm_unpacklo_pd (__m128d a, __m128d b); UNPCKLPD

Interleaves the lower double-precision, floating-point values of `a` and `b`.

r0 := a0
r1 := b0

int _mm_movemask_pd (__m128d a); MOVMSKPD

Creates a two-bit mask from the sign bits of the two double-precision, floating-point values of `a`.

r := sign(a1) << 1 | sign(a0)

__m128d _mm_shuffle_pd (__m128d a, __m128d b, int i); SHUFPD

Selects two specific double-precision, floating-point values from `a` and `b`, based on the mask `i`. The mask must be an immediate. See the section Macro Function for Shuffle Using Streaming SIMD Extensions 2 Instructions for a description of the shuffle semantics.
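
As an illustration of the unpack and move-mask intrinsics, the sketch below transposes a 2x2 matrix and extracts sign bits. The helper names `transpose2x2` and `sign_mask` are hypothetical.

```c
#include <assert.h>     /* precondition checks */
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: transposes a 2x2 matrix stored as two rows. */
static void transpose2x2(const double row0[2], const double row1[2],
                         double col0[2], double col1[2])
{
    assert(row0 && row1 && col0 && col1);
    __m128d r0 = _mm_loadu_pd(row0);
    __m128d r1 = _mm_loadu_pd(row1);
    _mm_storeu_pd(col0, _mm_unpacklo_pd(r0, r1));  /* { row0[0], row1[0] } */
    _mm_storeu_pd(col1, _mm_unpackhi_pd(r0, r1));  /* { row0[1], row1[1] } */
}

/* Hypothetical helper: bit 0 is the sign of v[0], bit 1 is the sign of v[1]. */
static int sign_mask(const double v[2])
{
    assert(v);
    return _mm_movemask_pd(_mm_loadu_pd(v));
}
```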

## Integer Intrinsics

**Integer Arithmetic Operations**

Intrinsic | Instruction | Operation |
---|---|---|
_mm_add_epi8 | PADDB | Addition |
_mm_add_epi16 | PADDW | Addition |
_mm_add_epi32 | PADDD | Addition |
_mm_add_si64 | PADDQ | Addition |
_mm_add_epi64 | PADDQ | Addition |
_mm_adds_epi8 | PADDSB | Addition |
_mm_adds_epi16 | PADDSW | Addition |
_mm_adds_epu8 | PADDUSB | Addition |
_mm_adds_epu16 | PADDUSW | Addition |
_mm_avg_epu8 | PAVGB | Computes average |
_mm_avg_epu16 | PAVGW | Computes average |
_mm_madd_epi16 | PMADDWD | Multiplication/addition |
_mm_max_epi16 | PMAXSW | Computes maxima |
_mm_max_epu8 | PMAXUB | Computes maxima |
_mm_min_epi16 | PMINSW | Computes minima |
_mm_min_epu8 | PMINUB | Computes minima |
_mm_mulhi_epi16 | PMULHW | Multiplication |
_mm_mulhi_epu16 | PMULHUW | Multiplication |
_mm_mullo_epi16 | PMULLW | Multiplication |
_mm_mul_su32 | PMULUDQ | Multiplication |
_mm_mul_epu32 | PMULUDQ | Multiplication |
_mm_sad_epu8 | PSADBW | Computes difference/adds |
_mm_sub_epi8 | PSUBB | Subtraction |
_mm_sub_epi16 | PSUBW | Subtraction |
_mm_sub_epi32 | PSUBD | Subtraction |
_mm_sub_si64 | PSUBQ | Subtraction |
_mm_sub_epi64 | PSUBQ | Subtraction |
_mm_subs_epi8 | PSUBSB | Subtraction |
_mm_subs_epi16 | PSUBSW | Subtraction |
_mm_subs_epu8 | PSUBUSB | Subtraction |
_mm_subs_epu16 | PSUBUSW | Subtraction |
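
The table lists both `_mm_add_*` and `_mm_adds_*` as "Addition"; the difference is that the `adds`/`subs` forms saturate at the limits of the element type, while the plain forms wrap around. The sketch below contrasts the two on unsigned bytes. The helper name `add_bytes` is hypothetical, and `_mm_loadu_si128`/`_mm_storeu_si128` are used for memory access.

```c
#include <assert.h>     /* precondition checks */
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: adds 16 byte pairs with wraparound and with unsigned saturation. */
static void add_bytes(const unsigned char a[16], const unsigned char b[16],
                      unsigned char wrapped[16], unsigned char saturated[16])
{
    assert(a && b && wrapped && saturated);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    /* PADDB: each byte sum is taken modulo 256. */
    _mm_storeu_si128((__m128i *)wrapped, _mm_add_epi8(va, vb));

    /* PADDUSB: each byte sum is clamped to 255. */
    _mm_storeu_si128((__m128i *)saturated, _mm_adds_epu8(va, vb));
}
```

Adding 200 and 100 in every byte gives 44 with `_mm_add_epi8` (300 modulo 256) and 255 with `_mm_adds_epu8`.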

## Logical Operations Intrinsics

For an explanation of the syntax used in code samples in this topic, see Floating-Point Intrinsics Using Streaming SIMD Extensions.

__m128i _mm_and_si128 (__m128i a, __m128i b); PAND

Computes the bitwise `AND` of the 128-bit value in `a` and the 128-bit value in `b`.

r := a & b

__m128i _mm_andnot_si128 (__m128i a, __m128i b); PANDN

Computes the bitwise `AND` of the 128-bit value in `b` and the bitwise `NOT` of the 128-bit value in `a`.

r := (~a) & b

__m128i _mm_or_si128 (__m128i a, __m128i b); POR

Computes the bitwise `OR` of the 128-bit value in `a` and the 128-bit value in `b`.

r := a | b

__m128i _mm_xor_si128 (__m128i a, __m128i b); PXOR

Computes the bitwise `XOR` of the 128-bit value in `a` and the 128-bit value in `b`.

r := a ^ b

**Shift Operation Intrinsics**

Intrinsic | Shift direction | Shift type | Corresponding instruction |
---|---|---|---|
_mm_slli_si128 | Left | Logical | PSLLDQ |
_mm_slli_epi16 | Left | Logical | PSLLW |
_mm_sll_epi16 | Left | Logical | PSLLW |
_mm_slli_epi32 | Left | Logical | PSLLD |
_mm_sll_epi32 | Left | Logical | PSLLD |
_mm_slli_epi64 | Left | Logical | PSLLQ |
_mm_sll_epi64 | Left | Logical | PSLLQ |
_mm_srai_epi16 | Right | Arithmetic | PSRAW |
_mm_sra_epi16 | Right | Arithmetic | PSRAW |
_mm_srai_epi32 | Right | Arithmetic | PSRAD |
_mm_sra_epi32 | Right | Arithmetic | PSRAD |
_mm_srli_si128 | Right | Logical | PSRLDQ |
_mm_srli_epi16 | Right | Logical | PSRLW |
_mm_srl_epi16 | Right | Logical | PSRLW |
_mm_srli_epi32 | Right | Logical | PSRLD |
_mm_srl_epi32 | Right | Logical | PSRLD |
_mm_srli_epi64 | Right | Logical | PSRLQ |
_mm_srl_epi64 | Right | Logical | PSRLQ |
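
The distinction between arithmetic and logical right shifts matters for negative values: an arithmetic shift replicates the sign bit, while a logical shift fills with zeros. The sketch below shifts four 32-bit integers right by 4 bits both ways; the helper name `shift_right_by_4` is hypothetical. Note that `_mm_slli_si128` and `_mm_srli_si128` shift the whole register by bytes rather than bits.

```c
#include <assert.h>     /* precondition checks */
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: shifts each 32-bit element right by 4 bits, both ways. */
static void shift_right_by_4(const int in[4], int arith[4], int logical[4])
{
    assert(in && arith && logical);
    __m128i v = _mm_loadu_si128((const __m128i *)in);

    /* PSRAD: the sign bit is shifted in, so negative values stay negative. */
    _mm_storeu_si128((__m128i *)arith, _mm_srai_epi32(v, 4));

    /* PSRLD: zeros are shifted in, so negative values become large positives. */
    _mm_storeu_si128((__m128i *)logical, _mm_srli_epi32(v, 4));
}
```

For the element `-16` (`0xFFFFFFF0`), the arithmetic shift gives `-1` and the logical shift gives `0x0FFFFFFF`.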

## Conversion Intrinsics

__m128i _mm_cvtsi32_si128 (int a); MOVD

Moves 32-bit integer `a` to the least significant 32 bits of an `__m128i` object, zero-extending the upper bits.

r0 := a
r1 := 0x0 ; r2 := 0x0 ; r3 := 0x0

int _mm_cvtsi128_si32 (__m128i a); MOVD

Moves the least significant 32 bits of `a` to a 32-bit integer.

r := a0

**Comparison Intrinsics**

Intrinsic name | Instruction | Comparison | Elements | Size of elements (bits) |
---|---|---|---|---|
_mm_cmpeq_epi8 | PCMPEQB | Equality | 16 | 8 |
_mm_cmpeq_epi16 | PCMPEQW | Equality | 8 | 16 |
_mm_cmpeq_epi32 | PCMPEQD | Equality | 4 | 32 |
_mm_cmpgt_epi8 | PCMPGTB | Greater than | 16 | 8 |
_mm_cmpgt_epi16 | PCMPGTW | Greater than | 8 | 16 |
_mm_cmpgt_epi32 | PCMPGTD | Greater than | 4 | 32 |
_mm_cmplt_epi8 | PCMPGTBr | Less than | 16 | 8 |
_mm_cmplt_epi16 | PCMPGTWr | Less than | 8 | 16 |
_mm_cmplt_epi32 | PCMPGTDr | Less than | 4 | 32 |
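
Integer comparisons produce an all-ones or all-zeros mask per element, which combines naturally with the integer logical intrinsics for branch-free selection. As a sketch, the hypothetical helper `max_ints` below builds a signed 32-bit maximum, an operation for which the arithmetic table above lists no direct intrinsic.

```c
#include <assert.h>     /* precondition checks */
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: element-wise signed maximum of four 32-bit integers. */
static void max_ints(const int a[4], const int b[4], int out[4])
{
    assert(a && b && out);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    __m128i a_gt_b = _mm_cmpgt_epi32(va, vb);  /* all ones where a > b, else zero */

    /* Select a where the mask is set and b elsewhere: (mask & a) | (~mask & b). */
    __m128i r = _mm_or_si128(_mm_and_si128(a_gt_b, va),
                             _mm_andnot_si128(a_gt_b, vb));
    _mm_storeu_si128((__m128i *)out, r);
}
```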

**Miscellaneous Operations Intrinsics**

Intrinsic | Corresponding instruction | Operation |
---|---|---|
_mm_packs_epi16 | PACKSSWB | Packed saturation |
_mm_packs_epi32 | PACKSSDW | Packed saturation |
_mm_packus_epi16 | PACKUSWB | Packed saturation |
_mm_extract_epi16 | PEXTRW | Extraction |
_mm_insert_epi16 | PINSRW | Insertion |
_mm_movemask_epi8 | PMOVMSKB | Mask creation |
_mm_shuffle_epi32 | PSHUFD | Shuffle |
_mm_shufflehi_epi16 | PSHUFHW | Shuffle |
_mm_shufflelo_epi16 | PSHUFLW | Shuffle |
_mm_unpackhi_epi8 | PUNPCKHBW | Interleave |
_mm_unpackhi_epi16 | PUNPCKHWD | Interleave |
_mm_unpackhi_epi32 | PUNPCKHDQ | Interleave |
_mm_unpackhi_epi64 | PUNPCKHQDQ | Interleave |
_mm_unpacklo_epi8 | PUNPCKLBW | Interleave |
_mm_unpacklo_epi16 | PUNPCKLWD | Interleave |
_mm_unpacklo_epi32 | PUNPCKLDQ | Interleave |
_mm_unpacklo_epi64 | PUNPCKLQDQ | Interleave |
_mm_movepi64_pi64 | MOVDQ2Q | Move |
_mm_movpi64_epi64 | MOVQ2DQ | Move |
_mm_move_epi64 | MOVQ | Move |
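
One typical use of `_mm_movemask_epi8` is turning a byte-wise comparison into a scalar bit mask that can be scanned. The sketch below searches a 16-byte block for a byte value; the helper name `find_byte16` is hypothetical, and the bit scan is written as a plain loop rather than a compiler-specific builtin.

```c
#include <assert.h>     /* precondition checks */
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: returns the index of the first byte equal to `needle`
   in a 16-byte block, or -1 when there is no match. */
static int find_byte16(const unsigned char block[16], unsigned char needle)
{
    assert(block);
    __m128i data = _mm_loadu_si128((const __m128i *)block);
    __m128i eq = _mm_cmpeq_epi8(data, _mm_set1_epi8((char)needle));

    /* Bit i of the mask is the high bit of byte i, i.e. 1 where the bytes matched. */
    int mask = _mm_movemask_epi8(eq);
    if (mask == 0)
        return -1;

    int index = 0;
    while ((mask & 1) == 0) {  /* scan for the lowest set bit */
        mask >>= 1;
        index++;
    }
    return index;
}
```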

## Cache Support Intrinsics

void _mm_stream_pd (double *p, __m128d a); MOVNTPD

Stores the data in `a` to the address `p` without polluting the caches. The address `p` must be 16-byte aligned. If the cache line containing address `p` is already in the cache, the cache will be updated.

p[0] := a0
p[1] := a1

## Integer Load Operation

__m128i _mm_load_si128 (__m128i *p); MOVDQA

Loads a 128-bit value. Address `p` must be 16-byte aligned.

r := *p

__m128i _mm_loadu_si128 (__m128i *p); MOVDQU

Loads a 128-bit value. Address `p` does not need to be 16-byte aligned.

r := *p

__m128i _mm_loadl_epi64 (__m128i const *p); MOVQ

Loads the lower 64 bits of the value pointed to by `p` into the lower 64 bits of the result, zeroing the upper 64 bits of the result.

r0 := *p[63:0]
r1 := 0x0

**Integer Set Operation Intrinsics**

Intrinsic | Corresponding instruction |
---|---|
_mm_set_epi64 | Composite |
_mm_set_epi32 | Composite |
_mm_set_epi16 | Composite |
_mm_set_epi8 | Composite |
_mm_set1_epi64 | Composite |
_mm_set1_epi32 | Composite |
_mm_set1_epi16 | Composite |
_mm_set1_epi8 | Composite |
_mm_setr_epi64 | Composite |
_mm_setr_epi32 | Composite |
_mm_setr_epi16 | Composite |
_mm_setr_epi8 | Composite |
_mm_setzero_si128 | PXOR |
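
The `set` and `setr` forms differ only in argument order: `_mm_set_epi32` takes the most significant element first, while `_mm_setr_epi32` takes the elements in memory (reversed) order. The following sketch stores both results to memory to make the ordering visible; the helper names are hypothetical.

```c
#include <assert.h>     /* precondition checks */
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: stores _mm_set_epi32(w, x, y, z) to out. */
static void store_set(int w, int x, int y, int z, int out[4])
{
    assert(out);
    /* The last argument becomes element 0 (lowest address when stored). */
    _mm_storeu_si128((__m128i *)out, _mm_set_epi32(w, x, y, z));
}

/* Hypothetical helper: stores _mm_setr_epi32(w, x, y, z) to out. */
static void store_setr(int w, int x, int y, int z, int out[4])
{
    assert(out);
    /* The first argument becomes element 0 (lowest address when stored). */
    _mm_storeu_si128((__m128i *)out, _mm_setr_epi32(w, x, y, z));
}
```

Calling `store_set(3, 2, 1, 0, out)` leaves `out` as `{0, 1, 2, 3}`, whereas `store_setr(3, 2, 1, 0, out)` leaves it as `{3, 2, 1, 0}`.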

## Integer Store Operation Intrinsics

void _mm_store_si128 (__m128i *p, __m128i a); MOVDQA

Stores a 128-bit value. Address `p` must be 16-byte aligned.

*p := a

void _mm_storeu_si128 (__m128i *p, __m128i a); MOVDQU

Stores a 128-bit value. Address `p` does not need to be 16-byte aligned.

*p := a

void _mm_maskmoveu_si128 (__m128i d, __m128i n, char *p); MASKMOVDQU

Conditionally stores byte elements of `d` to address `p`. The high bit of each byte in the selector `n` determines whether the corresponding byte in `d` will be stored. Address `p` does not need to be 16-byte aligned.

if (n0[7]) p[0] := d0
if (n1[7]) p[1] := d1
...
if (n15[7]) p[15] := d15

void _mm_storel_epi64 (__m128i *p, __m128i a); MOVQ

Stores the lower 64 bits of `a` to the address `p`.

*p[63:0] := a0
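
The conditional store can be sketched as follows. The helper name `store_selected_bytes` is hypothetical; the selector is passed as a byte array in which `0x80` (high bit set) marks the bytes to be written.

```c
#include <assert.h>     /* precondition checks */
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: copies src[i] to dst[i] only where select[i] has its high bit set. */
static void store_selected_bytes(unsigned char dst[16],
                                 const unsigned char src[16],
                                 const unsigned char select[16])
{
    assert(dst && src && select);
    __m128i d = _mm_loadu_si128((const __m128i *)src);
    __m128i n = _mm_loadu_si128((const __m128i *)select);

    /* MASKMOVDQU: bytes whose selector high bit is clear leave dst untouched. */
    _mm_maskmoveu_si128(d, n, (char *)dst);
}
```

Bytes of `dst` whose selector byte is `0x00` keep their previous contents, which makes this useful for writing partial vectors at the end of a buffer.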

## Cache Support

void _mm_stream_si128(__m128i *p, __m128i a)

Stores the data in `a` to the address `p` without polluting the caches. If the cache line containing address `p` is already in the cache, the cache will be updated. Address `p` must be 16-byte aligned.

*p := a

void _mm_stream_si32(int *p, int a)

Stores the data in `a` to the address `p` without polluting the caches. If the cache line containing address `p` is already in the cache, the cache will be updated.

*p := a

void _mm_clflush(void const*p)

Cache line containing `p` is flushed and invalidated from all caches in the coherency domain.

void _mm_lfence(void)

Guarantees that every load instruction that precedes, in program order, the load fence instruction is globally visible before any load instruction that follows the fence in program order.

void _mm_mfence(void)

Guarantees that every memory access that precedes, in program order, the memory fence instruction is globally visible before any memory instruction that follows the fence in program order.

void _mm_pause(void)

The execution of the next instruction is delayed an implementation-specific amount of time. The instruction does not modify the architectural state.
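
A streaming store is typically followed by a fence before other threads are expected to observe the data. The sketch below fills a buffer with `_mm_stream_si128` and then issues `_mm_mfence`; the helper name `stream_fill` is hypothetical, and the choice of `_mm_mfence` (rather than a lighter store-only fence) is an assumption made to stay within the intrinsics described here.

```c
#include <assert.h>     /* alignment precondition check */
#include <stddef.h>     /* size_t */
#include <stdint.h>     /* uintptr_t */
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: writes `value` to vector_count * 4 ints starting at dst
   using non-temporal stores. dst must be 16-byte aligned. */
static void stream_fill(int *dst, size_t vector_count, int value)
{
    assert(dst && ((uintptr_t)dst & 15) == 0);
    __m128i v = _mm_set1_epi32(value);  /* broadcast value to all four elements */

    for (size_t i = 0; i < vector_count; i++)
        _mm_stream_si128((__m128i *)(dst + 4 * i), v);

    /* Make the streamed data globally visible before any later memory access. */
    _mm_mfence();
}
```

Streaming stores tend to pay off for large buffers that will not be read again soon; for small, frequently reused data an ordinary `_mm_store_si128` is usually the better choice.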

**Shuffle Function Macro**

_MM_SHUFFLE2(x, y) /* expands to the value of */ (x<<1) | y

You can view the two integers as selectors: `y` chooses which double-precision value of the first input operand is placed in the lower half of the result, and `x` chooses which double-precision value of the second input operand is placed in the upper half.
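
The macro is intended for use as the immediate argument of `_mm_shuffle_pd`. The sketch below shows two selector combinations; the helper names are hypothetical, and separate functions are used because the mask must be a compile-time constant.

```c
#include <assert.h>     /* precondition checks */
#include <emmintrin.h>  /* SSE2 intrinsics, including _MM_SHUFFLE2 */

/* Hypothetical helper: out = { a[1], b[0] } using _MM_SHUFFLE2(0, 1). */
static void shuffle_hi_a_lo_b(const double a[2], const double b[2], double out[2])
{
    assert(a && b && out);
    /* Mask value 1: bit 0 set picks a1 for r0, bit 1 clear picks b0 for r1. */
    __m128d r = _mm_shuffle_pd(_mm_loadu_pd(a), _mm_loadu_pd(b), _MM_SHUFFLE2(0, 1));
    _mm_storeu_pd(out, r);
}

/* Hypothetical helper: out = { a[0], b[1] } using _MM_SHUFFLE2(1, 0). */
static void shuffle_lo_a_hi_b(const double a[2], const double b[2], double out[2])
{
    assert(a && b && out);
    /* Mask value 2: bit 0 clear picks a0 for r0, bit 1 set picks b1 for r1. */
    __m128d r = _mm_shuffle_pd(_mm_loadu_pd(a), _mm_loadu_pd(b), _MM_SHUFFLE2(1, 0));
    _mm_storeu_pd(out, r);
}
```

Passing the same vector as both operands with `_MM_SHUFFLE2(0, 1)` swaps its two elements.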

**View of Original and Result Words with Shuffle Function Macro**