
Floating-Point Intrinsics
Arithmetic Operation Intrinsics
Intrinsic name  Corresponding instruction  Operation  R0 value  R1 value

_mm_add_sd  ADDSD  Adds  a0 [op] b0  a1
_mm_add_pd  ADDPD  Adds  a0 [op] b0  a1 [op] b1
_mm_div_sd  DIVSD  Divides  a0 [op] b0  a1
_mm_div_pd  DIVPD  Divides  a0 [op] b0  a1 [op] b1
_mm_max_sd  MAXSD  Computes maximum  a0 [op] b0  a1
_mm_max_pd  MAXPD  Computes maximum  a0 [op] b0  a1 [op] b1
_mm_min_sd  MINSD  Computes minimum  a0 [op] b0  a1
_mm_min_pd  MINPD  Computes minimum  a0 [op] b0  a1 [op] b1
_mm_mul_sd  MULSD  Multiplies  a0 [op] b0  a1
_mm_mul_pd  MULPD  Multiplies  a0 [op] b0  a1 [op] b1
_mm_sqrt_sd  SQRTSD  Computes square root  [op] b0  a1
_mm_sqrt_pd  SQRTPD  Computes square root  [op] a0  [op] a1
_mm_sub_sd  SUBSD  Subtracts  a0 [op] b0  a1
_mm_sub_pd  SUBPD  Subtracts  a0 [op] b0  a1 [op] b1
Logical Operations
__m128d _mm_andnot_pd (__m128d a, __m128d b); ANDNPD
Computes the bitwise AND of the 128-bit value in b and the bitwise NOT of the 128-bit value in a.
r0 := (~a0) & b0
r1 := (~a1) & b1

__m128d _mm_and_pd (__m128d a, __m128d b); ANDPD
Computes the bitwise AND of the two double-precision floating-point values of a and b.
r0 := a0 & b0
r1 := a1 & b1

__m128d _mm_or_pd (__m128d a, __m128d b); ORPD
Computes the bitwise OR of the two double-precision floating-point values of a and b.
r0 := a0 | b0
r1 := a1 | b1

__m128d _mm_xor_pd (__m128d a, __m128d b); XORPD
Computes the bitwise XOR of the two double-precision floating-point values of a and b.
r0 := a0 ^ b0
r1 := a1 ^ b1
Comparison Intrinsics
Intrinsic name  Corresponding instruction  Compare for 

_mm_cmpeq_pd  CMPEQPD  Equality 
_mm_cmplt_pd  CMPLTPD  Less than 
_mm_cmple_pd  CMPLEPD  Less than or equal 
_mm_cmpgt_pd  CMPLTPDr  Greater than 
_mm_cmpge_pd  CMPLEPDr  Greater than or equal 
_mm_cmpord_pd  CMPORDPD  Ordered 
_mm_cmpunord_pd  CMPUNORDPD  Unordered 
_mm_cmpneq_pd  CMPNEQPD  Inequality 
_mm_cmpnlt_pd  CMPNLTPD  Not less than 
_mm_cmpnle_pd  CMPNLEPD  Not less than or equal 
_mm_cmpngt_pd  CMPNLTPDr  Not greater than 
_mm_cmpnge_pd  CMPNLEPDr  Not greater than or equal 
_mm_cmpeq_sd  CMPEQSD  Equality 
_mm_cmplt_sd  CMPLTSD  Less than 
_mm_cmple_sd  CMPLESD  Less than or equal 
_mm_cmpgt_sd  CMPLTSDr  Greater than 
_mm_cmpge_sd  CMPLESDr  Greater than or equal 
_mm_cmpord_sd  CMPORDSD  Ordered 
_mm_cmpunord_sd  CMPUNORDSD  Unordered 
_mm_cmpneq_sd  CMPNEQSD  Inequality 
_mm_cmpnlt_sd  CMPNLTSD  Not less than 
_mm_cmpnle_sd  CMPNLESD  Not less than or equal 
_mm_cmpngt_sd  CMPNLTSDr  Not greater than 
_mm_cmpnge_sd  CMPNLESDr  Not greater than or equal 
_mm_comieq_sd  COMISD  Equality 
_mm_comilt_sd  COMISD  Less than 
_mm_comile_sd  COMISD  Less than or equal 
_mm_comigt_sd  COMISD  Greater than 
_mm_comige_sd  COMISD  Greater than or equal 
_mm_comineq_sd  COMISD  Not equal 
_mm_ucomieq_sd  UCOMISD  Equality 
_mm_ucomilt_sd  UCOMISD  Less than 
_mm_ucomile_sd  UCOMISD  Less than or equal 
_mm_ucomigt_sd  UCOMISD  Greater than 
_mm_ucomige_sd  UCOMISD  Greater than or equal 
_mm_ucomineq_sd  UCOMISD  Not equal 
Conversion Operations
Intrinsic name  Corresponding instruction  Return type  Parameters 

_mm_cvtpd_ps  CVTPD2PS  __m128  (__m128d a) 
_mm_cvtps_pd  CVTPS2PD  __m128d  (__m128 a) 
_mm_cvtepi32_pd  CVTDQ2PD  __m128d  (__m128i a) 
_mm_cvtpd_epi32  CVTPD2DQ  __m128i  (__m128d a) 
_mm_cvtsd_si32  CVTSD2SI  int  (__m128d a) 
_mm_cvtsd_ss  CVTSD2SS  __m128  (__m128 a, __m128d b) 
_mm_cvtsi32_sd  CVTSI2SD  __m128d  (__m128d a, int b) 
_mm_cvtss_sd  CVTSS2SD  __m128d  (__m128d a, __m128 b) 
_mm_cvttpd_epi32  CVTTPD2DQ  __m128i  (__m128d a) 
_mm_cvttsd_si32  CVTTSD2SI  int  (__m128d a) 
_mm_cvtepi32_ps  CVTDQ2PS  __m128  (__m128i a) 
_mm_cvtps_epi32  CVTPS2DQ  __m128i  (__m128 a) 
_mm_cvttps_epi32  CVTTPS2DQ  __m128i  (__m128 a) 
_mm_cvtpd_pi32  CVTPD2PI  __m64  (__m128d a) 
_mm_cvttpd_pi32  CVTTPD2PI  __m64  (__m128d a) 
_mm_cvtpi32_pd  CVTPI2PD  __m128d  (__m64 a) 
Miscellaneous Operations
__m128d _mm_unpackhi_pd (__m128d a, __m128d b); UNPCKHPD
Interleaves the upper double-precision floating-point values of a and b.
r0 := a1
r1 := b1

__m128d _mm_unpacklo_pd (__m128d a, __m128d b); UNPCKLPD
Interleaves the lower double-precision floating-point values of a and b.
r0 := a0
r1 := b0

int _mm_movemask_pd (__m128d a); MOVMSKPD
Creates a two-bit mask from the sign bits of the two double-precision floating-point values of a.
r := (sign(a1) << 1) | sign(a0)

__m128d _mm_shuffle_pd (__m128d a, __m128d b, int i); SHUFPD
Selects two specific double-precision floating-point values from a and b, based on the mask i. The mask must be an immediate. See the Macro Function for Shuffle Using Streaming SIMD Extensions 2 Instructions section for a description of the shuffle semantics.
Integer Intrinsics
Integer Arithmetic Operations
Intrinsic  Instruction  Operation 

_mm_add_epi8  PADDB  Addition 
_mm_add_epi16  PADDW  Addition 
_mm_add_epi32  PADDD  Addition 
_mm_add_si64  PADDQ  Addition 
_mm_add_epi64  PADDQ  Addition 
_mm_adds_epi8  PADDSB  Addition 
_mm_adds_epi16  PADDSW  Addition 
_mm_adds_epu8  PADDUSB  Addition 
_mm_adds_epu16  PADDUSW  Addition 
_mm_avg_epu8  PAVGB  Computes average 
_mm_avg_epu16  PAVGW  Computes average 
_mm_madd_epi16  PMADDWD  Multiplication/addition 
_mm_max_epi16  PMAXSW  Computes maxima 
_mm_max_epu8  PMAXUB  Computes maxima 
_mm_min_epi16  PMINSW  Computes minima 
_mm_min_epu8  PMINUB  Computes minima 
_mm_mulhi_epi16  PMULHW  Multiplication 
_mm_mulhi_epu16  PMULHUW  Multiplication 
_mm_mullo_epi16  PMULLW  Multiplication 
_mm_mul_su32  PMULUDQ  Multiplication 
_mm_mul_epu32  PMULUDQ  Multiplication 
_mm_sad_epu8  PSADBW  Computes difference/adds 
_mm_sub_epi8  PSUBB  Subtraction 
_mm_sub_epi16  PSUBW  Subtraction 
_mm_sub_epi32  PSUBD  Subtraction 
_mm_sub_si64  PSUBQ  Subtraction 
_mm_sub_epi64  PSUBQ  Subtraction 
_mm_subs_epi8  PSUBSB  Subtraction 
_mm_subs_epi16  PSUBSW  Subtraction 
_mm_subs_epu8  PSUBUSB  Subtraction 
_mm_subs_epu16  PSUBUSW  Subtraction 
Logical Operations Intrinsics
For an explanation of the syntax used in code samples in this topic, see Floating-Point Intrinsics Using Streaming SIMD Extensions.
__m128i _mm_and_si128 (__m128i a, __m128i b); PAND
Computes the bitwise AND of the 128-bit value in a and the 128-bit value in b.
r := a & b

__m128i _mm_andnot_si128 (__m128i a, __m128i b); PANDN
Computes the bitwise AND of the 128-bit value in b and the bitwise NOT of the 128-bit value in a.
r := (~a) & b

__m128i _mm_or_si128 (__m128i a, __m128i b); POR
Computes the bitwise OR of the 128-bit value in a and the 128-bit value in b.
r := a | b

__m128i _mm_xor_si128 (__m128i a, __m128i b); PXOR
Computes the bitwise XOR of the 128-bit value in a and the 128-bit value in b.
r := a ^ b
Shift Operation Intrinsics
Intrinsic  Shift direction  Type  Corresponding instruction 

_mm_slli_si128  Left  Logical  PSLLDQ 
_mm_slli_epi16  Left  Logical  PSLLW 
_mm_sll_epi16  Left  Logical  PSLLW 
_mm_slli_epi32  Left  Logical  PSLLD 
_mm_sll_epi32  Left  Logical  PSLLD 
_mm_slli_epi64  Left  Logical  PSLLQ 
_mm_sll_epi64  Left  Logical  PSLLQ 
_mm_srai_epi16  Right  Arithmetic  PSRAW 
_mm_sra_epi16  Right  Arithmetic  PSRAW 
_mm_srai_epi32  Right  Arithmetic  PSRAD 
_mm_sra_epi32  Right  Arithmetic  PSRAD 
_mm_srli_si128  Right  Logical  PSRLDQ 
_mm_srli_epi16  Right  Logical  PSRLW 
_mm_srl_epi16  Right  Logical  PSRLW 
_mm_srli_epi32  Right  Logical  PSRLD 
_mm_srl_epi32  Right  Logical  PSRLD 
_mm_srli_epi64  Right  Logical  PSRLQ 
_mm_srl_epi64  Right  Logical  PSRLQ 
Conversion Intrinsics
__m128i _mm_cvtsi32_si128 (int a); MOVD
Moves the 32-bit integer a to the least significant 32 bits of an __m128i object, zero-extending the upper bits.
r0 := a
r1 := 0x0
r2 := 0x0
r3 := 0x0

int _mm_cvtsi128_si32 (__m128i a); MOVD
Moves the least significant 32 bits of a to a 32-bit integer.
r := a0
Comparison Intrinsics
Intrinsic name  Instruction  Comparison  Elements  Size of elements 

_mm_cmpeq_epi8  PCMPEQB  Equality  16  8 
_mm_cmpeq_epi16  PCMPEQW  Equality  8  16 
_mm_cmpeq_epi32  PCMPEQD  Equality  4  32 
_mm_cmpgt_epi8  PCMPGTB  Greater than  16  8 
_mm_cmpgt_epi16  PCMPGTW  Greater than  8  16 
_mm_cmpgt_epi32  PCMPGTD  Greater than  4  32 
_mm_cmplt_epi8  PCMPGTBr  Less than  16  8 
_mm_cmplt_epi16  PCMPGTWr  Less than  8  16 
_mm_cmplt_epi32  PCMPGTDr  Less than  4  32 
Miscellaneous Operations Intrinsics
Intrinsic  Corresponding instruction  Operation 

_mm_packs_epi16  PACKSSWB  Packed saturation 
_mm_packs_epi32  PACKSSDW  Packed saturation 
_mm_packus_epi16  PACKUSWB  Packed saturation 
_mm_extract_epi16  PEXTRW  Extraction 
_mm_insert_epi16  PINSRW  Insertion 
_mm_movemask_epi8  PMOVMSKB  Mask creation 
_mm_shuffle_epi32  PSHUFD  Shuffle 
_mm_shufflehi_epi16  PSHUFHW  Shuffle 
_mm_shufflelo_epi16  PSHUFLW  Shuffle 
_mm_unpackhi_epi8  PUNPCKHBW  Interleave 
_mm_unpackhi_epi16  PUNPCKHWD  Interleave 
_mm_unpackhi_epi32  PUNPCKHDQ  Interleave 
_mm_unpackhi_epi64  PUNPCKHQDQ  Interleave 
_mm_unpacklo_epi8  PUNPCKLBW  Interleave 
_mm_unpacklo_epi16  PUNPCKLWD  Interleave 
_mm_unpacklo_epi32  PUNPCKLDQ  Interleave 
_mm_unpacklo_epi64  PUNPCKLQDQ  Interleave 
_mm_movepi64_pi64  MOVDQ2Q  Move 
_mm_movpi64_epi64  MOVQ2DQ  Move 
_mm_move_epi64  MOVQ  Move 
Cache Support Intrinsics
void _mm_stream_pd (double *p, __m128d a); MOVNTPD
Stores the data in a to the address p without polluting the caches. The address p must be 16-byte aligned. If the cache line containing address p is already in the cache, the cache will be updated.
p[0] := a0
p[1] := a1
Integer Load Operation
__m128i _mm_load_si128 (__m128i *p); MOVDQA
Loads a 128-bit value. The address p must be 16-byte aligned.
r := *p

__m128i _mm_loadu_si128 (__m128i *p); MOVDQU
Loads a 128-bit value. The address p does not need to be 16-byte aligned.
r := *p

__m128i _mm_loadl_epi64 (__m128i const *p); MOVQ
Loads the lower 64 bits of the value pointed to by p into the lower 64 bits of the result, zeroing the upper 64 bits of the result.
r0 := *p[63:0]
r1 := 0x0
Integer Set Operation Intrinsics
Intrinsic  Corresponding instruction 

_mm_set_epi64  Composite 
_mm_set_epi32  Composite 
_mm_set_epi16  Composite 
_mm_set_epi8  Composite 
_mm_set1_epi64  Composite 
_mm_set1_epi32  Composite 
_mm_set1_epi16  Composite 
_mm_set1_epi8  Composite 
_mm_setr_epi64  Composite 
_mm_setr_epi32  Composite 
_mm_setr_epi16  Composite 
_mm_setr_epi8  Composite 
_mm_setzero_si128  PXOR 
Integer Store Operation Intrinsics
void _mm_store_si128 (__m128i *p, __m128i a); MOVDQA
Stores a 128-bit value. The address p must be 16-byte aligned.
*p := a

void _mm_storeu_si128 (__m128i *p, __m128i a); MOVDQU
Stores a 128-bit value. The address p does not need to be 16-byte aligned.
*p := a

void _mm_maskmoveu_si128 (__m128i d, __m128i n, char *p); MASKMOVDQU
Conditionally stores byte elements of d to address p. The high bit of each byte in the selector n determines whether the corresponding byte in d will be stored. The address p does not need to be 16-byte aligned.
if (n0[7]) p[0] := d0
if (n1[7]) p[1] := d1
...
if (n15[7]) p[15] := d15

void _mm_storel_epi64 (__m128i *p, __m128i a); MOVQ
Stores the lower 64 bits of a to the address p.
*p[63:0] := a0
Cache Support
void _mm_stream_si128 (__m128i *p, __m128i a); MOVNTDQ
Stores the data in a to the address p without polluting the caches. If the cache line containing address p is already in the cache, the cache will be updated. The address p must be 16-byte aligned.
*p := a

void _mm_stream_si32 (int *p, int a); MOVNTI
Stores the data in a to the address p without polluting the caches. If the cache line containing address p is already in the cache, the cache will be updated.
*p := a

void _mm_clflush (void const *p); CLFLUSH
The cache line containing p is flushed and invalidated from all caches in the coherency domain.

void _mm_lfence (void); LFENCE
Guarantees that every load instruction that precedes the load fence instruction in program order is globally visible before any load instruction that follows the fence in program order.

void _mm_mfence (void); MFENCE
Guarantees that every memory access that precedes the memory fence instruction in program order is globally visible before any memory instruction that follows the fence in program order.

void _mm_pause (void); PAUSE
The execution of the next instruction is delayed by an implementation-specific amount of time. The instruction does not modify the architectural state.
Shuffle Function Macro
_MM_SHUFFLE2(x, y) /* expands to the value of */ (x << 1) | y
You can view the two integers as selectors: y chooses which value of the first input operand goes into the low element of the result, and x chooses which value of the second input operand goes into the high element.
View of Original and Result Words with Shuffle Function Macro