Floating-Point Intrinsics
Arithmetic Operation Intrinsics
Intrinsic name | Corresponding instruction | Operation | R0 value | R1 value |
---|---|---|---|---|
_mm_add_sd | ADDSD | Adds | a0 + b0 | a1 |
_mm_add_pd | ADDPD | Adds | a0 + b0 | a1 + b1 |
_mm_div_sd | DIVSD | Divides | a0 / b0 | a1 |
_mm_div_pd | DIVPD | Divides | a0 / b0 | a1 / b1 |
_mm_max_sd | MAXSD | Computes maximum | max(a0, b0) | a1 |
_mm_max_pd | MAXPD | Computes maximum | max(a0, b0) | max(a1, b1) |
_mm_min_sd | MINSD | Computes minimum | min(a0, b0) | a1 |
_mm_min_pd | MINPD | Computes minimum | min(a0, b0) | min(a1, b1) |
_mm_mul_sd | MULSD | Multiplies | a0 * b0 | a1 |
_mm_mul_pd | MULPD | Multiplies | a0 * b0 | a1 * b1 |
_mm_sqrt_sd | SQRTSD | Computes square root | sqrt(b0) | a1 |
_mm_sqrt_pd | SQRTPD | Computes square root | sqrt(a0) | sqrt(a1) |
_mm_sub_sd | SUBSD | Subtracts | a0 - b0 | a1 |
_mm_sub_pd | SUBPD | Subtracts | a0 - b0 | a1 - b1 |
Logical Operations
__m128d _mm_andnot_pd (__m128d a, __m128d b); ANDNPD
Computes the bitwise AND of the 128-bit value in b and the bitwise NOT of the 128-bit value in a.
r0 := (~a0) & b0
r1 := (~a1) & b1

__m128d _mm_and_pd (__m128d a, __m128d b); ANDPD
Computes the bitwise AND of the two double-precision, floating-point values of a and b.
r0 := a0 & b0
r1 := a1 & b1

__m128d _mm_or_pd (__m128d a, __m128d b); ORPD
Computes the bitwise OR of the two double-precision, floating-point values of a and b.
r0 := a0 | b0
r1 := a1 | b1

__m128d _mm_xor_pd (__m128d a, __m128d b); XORPD
Computes the bitwise XOR of the two double-precision, floating-point values of a and b.
r0 := a0 ^ b0
r1 := a1 ^ b1
Comparison Intrinsics
Intrinsic name | Corresponding instruction | Compare for |
---|---|---|
_mm_cmpeq_pd | CMPEQPD | Equality |
_mm_cmplt_pd | CMPLTPD | Less than |
_mm_cmple_pd | CMPLEPD | Less than or equal |
_mm_cmpgt_pd | CMPLTPDr | Greater than |
_mm_cmpge_pd | CMPLEPDr | Greater than or equal |
_mm_cmpord_pd | CMPORDPD | Ordered |
_mm_cmpunord_pd | CMPUNORDPD | Unordered |
_mm_cmpneq_pd | CMPNEQPD | Inequality |
_mm_cmpnlt_pd | CMPNLTPD | Not less than |
_mm_cmpnle_pd | CMPNLEPD | Not less than or equal |
_mm_cmpngt_pd | CMPNLTPDr | Not greater than |
_mm_cmpnge_pd | CMPNLEPDr | Not greater than or equal |
_mm_cmpeq_sd | CMPEQSD | Equality |
_mm_cmplt_sd | CMPLTSD | Less than |
_mm_cmple_sd | CMPLESD | Less than or equal |
_mm_cmpgt_sd | CMPLTSDr | Greater than |
_mm_cmpge_sd | CMPLESDr | Greater than or equal |
_mm_cmpord_sd | CMPORDSD | Ordered |
_mm_cmpunord_sd | CMPUNORDSD | Unordered |
_mm_cmpneq_sd | CMPNEQSD | Inequality |
_mm_cmpnlt_sd | CMPNLTSD | Not less than |
_mm_cmpnle_sd | CMPNLESD | Not less than or equal |
_mm_cmpngt_sd | CMPNLTSDr | Not greater than |
_mm_cmpnge_sd | CMPNLESDr | Not greater than or equal |
_mm_comieq_sd | COMISD | Equality |
_mm_comilt_sd | COMISD | Less than |
_mm_comile_sd | COMISD | Less than or equal |
_mm_comigt_sd | COMISD | Greater than |
_mm_comige_sd | COMISD | Greater than or equal |
_mm_comineq_sd | COMISD | Not equal |
_mm_ucomieq_sd | UCOMISD | Equality |
_mm_ucomilt_sd | UCOMISD | Less than |
_mm_ucomile_sd | UCOMISD | Less than or equal |
_mm_ucomigt_sd | UCOMISD | Greater than |
_mm_ucomige_sd | UCOMISD | Greater than or equal |
_mm_ucomineq_sd | UCOMISD | Not equal |
Conversion Operations
Intrinsic name | Corresponding instruction | Return type | Parameters |
---|---|---|---|
_mm_cvtpd_ps | CVTPD2PS | __m128 | (__m128d a) |
_mm_cvtps_pd | CVTPS2PD | __m128d | (__m128 a) |
_mm_cvtepi32_pd | CVTDQ2PD | __m128d | (__m128i a) |
_mm_cvtpd_epi32 | CVTPD2DQ | __m128i | (__m128d a) |
_mm_cvtsd_si32 | CVTSD2SI | int | (__m128d a) |
_mm_cvtsd_ss | CVTSD2SS | __m128 | (__m128 a, __m128d b) |
_mm_cvtsi32_sd | CVTSI2SD | __m128d | (__m128d a, int b) |
_mm_cvtss_sd | CVTSS2SD | __m128d | (__m128d a, __m128 b) |
_mm_cvttpd_epi32 | CVTTPD2DQ | __m128i | (__m128d a) |
_mm_cvttsd_si32 | CVTTSD2SI | int | (__m128d a) |
_mm_cvtepi32_ps | CVTDQ2PS | __m128 | (__m128i a) |
_mm_cvtps_epi32 | CVTPS2DQ | __m128i | (__m128 a) |
_mm_cvttps_epi32 | CVTTPS2DQ | __m128i | (__m128 a) |
_mm_cvtpd_pi32 | CVTPD2PI | __m64 | (__m128d a) |
_mm_cvttpd_pi32 | CVTTPD2PI | __m64 | (__m128d a) |
_mm_cvtpi32_pd | CVTPI2PD | __m128d | (__m64 a) |
Miscellaneous Operations
__m128d _mm_unpackhi_pd (__m128d a, __m128d b); UNPCKHPD
Interleaves the upper double-precision, floating-point values of a and b.
r0 := a1
r1 := b1

__m128d _mm_unpacklo_pd (__m128d a, __m128d b); UNPCKLPD
Interleaves the lower double-precision, floating-point values of a and b.
r0 := a0
r1 := b0

int _mm_movemask_pd (__m128d a); MOVMSKPD
Creates a two-bit mask from the sign bits of the two double-precision, floating-point values of a.
r := sign(a1) << 1 | sign(a0)

__m128d _mm_shuffle_pd (__m128d a, __m128d b, int i); SHUFPD
Selects two specific double-precision, floating-point values from a and b, based on the mask i. The mask must be an immediate. See the Macro Function for Shuffle Using Streaming SIMD Extensions 2 Instructions section for a description of the shuffle semantics.
Integer Intrinsics
Integer Arithmetic Operations
Intrinsic | Instruction | Operation |
---|---|---|
_mm_add_epi8 | PADDB | Addition |
_mm_add_epi16 | PADDW | Addition |
_mm_add_epi32 | PADDD | Addition |
_mm_add_si64 | PADDQ | Addition |
_mm_add_epi64 | PADDQ | Addition |
_mm_adds_epi8 | PADDSB | Addition |
_mm_adds_epi16 | PADDSW | Addition |
_mm_adds_epu8 | PADDUSB | Addition |
_mm_adds_epu16 | PADDUSW | Addition |
_mm_avg_epu8 | PAVGB | Computes average |
_mm_avg_epu16 | PAVGW | Computes average |
_mm_madd_epi16 | PMADDWD | Multiplication/addition |
_mm_max_epi16 | PMAXSW | Computes maxima |
_mm_max_epu8 | PMAXUB | Computes maxima |
_mm_min_epi16 | PMINSW | Computes minima |
_mm_min_epu8 | PMINUB | Computes minima |
_mm_mulhi_epi16 | PMULHW | Multiplication |
_mm_mulhi_epu16 | PMULHUW | Multiplication |
_mm_mullo_epi16 | PMULLW | Multiplication |
_mm_mul_su32 | PMULUDQ | Multiplication |
_mm_mul_epu32 | PMULUDQ | Multiplication |
_mm_sad_epu8 | PSADBW | Computes difference/adds |
_mm_sub_epi8 | PSUBB | Subtraction |
_mm_sub_epi16 | PSUBW | Subtraction |
_mm_sub_epi32 | PSUBD | Subtraction |
_mm_sub_si64 | PSUBQ | Subtraction |
_mm_sub_epi64 | PSUBQ | Subtraction |
_mm_subs_epi8 | PSUBSB | Subtraction |
_mm_subs_epi16 | PSUBSW | Subtraction |
_mm_subs_epu8 | PSUBUSB | Subtraction |
_mm_subs_epu16 | PSUBUSW | Subtraction |
Logical Operations Intrinsics
For an explanation of the syntax used in code samples in this topic, see Floating-Point Intrinsics Using Streaming SIMD Extensions.
__m128i _mm_and_si128 (__m128i a, __m128i b); PAND
Computes the bitwise AND of the 128-bit value in a and the 128-bit value in b.
r := a & b

__m128i _mm_andnot_si128 (__m128i a, __m128i b); PANDN
Computes the bitwise AND of the 128-bit value in b and the bitwise NOT of the 128-bit value in a.
r := (~a) & b

__m128i _mm_or_si128 (__m128i a, __m128i b); POR
Computes the bitwise OR of the 128-bit value in a and the 128-bit value in b.
r := a | b

__m128i _mm_xor_si128 (__m128i a, __m128i b); PXOR
Computes the bitwise XOR of the 128-bit value in a and the 128-bit value in b.
r := a ^ b
Shift Operation Intrinsics
Intrinsic | Shift direction | Shift type | Corresponding instruction |
---|---|---|---|
_mm_slli_si128 | Left | Logical | PSLLDQ |
_mm_slli_epi16 | Left | Logical | PSLLW |
_mm_sll_epi16 | Left | Logical | PSLLW |
_mm_slli_epi32 | Left | Logical | PSLLD |
_mm_sll_epi32 | Left | Logical | PSLLD |
_mm_slli_epi64 | Left | Logical | PSLLQ |
_mm_sll_epi64 | Left | Logical | PSLLQ |
_mm_srai_epi16 | Right | Arithmetic | PSRAW |
_mm_sra_epi16 | Right | Arithmetic | PSRAW |
_mm_srai_epi32 | Right | Arithmetic | PSRAD |
_mm_sra_epi32 | Right | Arithmetic | PSRAD |
_mm_srli_si128 | Right | Logical | PSRLDQ |
_mm_srli_epi16 | Right | Logical | PSRLW |
_mm_srl_epi16 | Right | Logical | PSRLW |
_mm_srli_epi32 | Right | Logical | PSRLD |
_mm_srl_epi32 | Right | Logical | PSRLD |
_mm_srli_epi64 | Right | Logical | PSRLQ |
_mm_srl_epi64 | Right | Logical | PSRLQ |
Conversion Intrinsics
__m128i _mm_cvtsi32_si128 (int a); MOVD
Moves 32-bit integer a to the least significant 32 bits of an __m128i object, zero extending the upper bits.
r0 := a
r1 := 0x0 ; r2 := 0x0 ; r3 := 0x0

int _mm_cvtsi128_si32 (__m128i a); MOVD
Moves the least significant 32 bits of a to a 32-bit integer.
r := a0
Comparison Intrinsics
Intrinsic name | Instruction | Comparison | Elements | Size of elements |
---|---|---|---|---|
_mm_cmpeq_epi8 | PCMPEQB | Equality | 16 | 8 |
_mm_cmpeq_epi16 | PCMPEQW | Equality | 8 | 16 |
_mm_cmpeq_epi32 | PCMPEQD | Equality | 4 | 32 |
_mm_cmpgt_epi8 | PCMPGTB | Greater than | 16 | 8 |
_mm_cmpgt_epi16 | PCMPGTW | Greater than | 8 | 16 |
_mm_cmpgt_epi32 | PCMPGTD | Greater than | 4 | 32 |
_mm_cmplt_epi8 | PCMPGTBr | Less than | 16 | 8 |
_mm_cmplt_epi16 | PCMPGTWr | Less than | 8 | 16 |
_mm_cmplt_epi32 | PCMPGTDr | Less than | 4 | 32 |
Miscellaneous Operations Intrinsics
Intrinsic | Corresponding instruction | Operation |
---|---|---|
_mm_packs_epi16 | PACKSSWB | Packed saturation |
_mm_packs_epi32 | PACKSSDW | Packed saturation |
_mm_packus_epi16 | PACKUSWB | Packed saturation |
_mm_extract_epi16 | PEXTRW | Extraction |
_mm_insert_epi16 | PINSRW | Insertion |
_mm_movemask_epi8 | PMOVMSKB | Mask creation |
_mm_shuffle_epi32 | PSHUFD | Shuffle |
_mm_shufflehi_epi16 | PSHUFHW | Shuffle |
_mm_shufflelo_epi16 | PSHUFLW | Shuffle |
_mm_unpackhi_epi8 | PUNPCKHBW | Interleave |
_mm_unpackhi_epi16 | PUNPCKHWD | Interleave |
_mm_unpackhi_epi32 | PUNPCKHDQ | Interleave |
_mm_unpackhi_epi64 | PUNPCKHQDQ | Interleave |
_mm_unpacklo_epi8 | PUNPCKLBW | Interleave |
_mm_unpacklo_epi16 | PUNPCKLWD | Interleave |
_mm_unpacklo_epi32 | PUNPCKLDQ | Interleave |
_mm_unpacklo_epi64 | PUNPCKLQDQ | Interleave |
_mm_movepi64_pi64 | MOVDQ2Q | Move |
_mm_movpi64_epi64 | MOVQ2DQ | Move |
_mm_move_epi64 | MOVQ | Move |
Cache Support Intrinsics
void _mm_stream_pd (double *p, __m128d a); MOVNTPD
Stores the data in a to the address p without polluting the caches. The address p must be 16-byte aligned. If the cache line containing address p is already in the cache, the cache will be updated.
p[0] := a0
p[1] := a1
Integer Load Operation
__m128i _mm_load_si128 (__m128i *p); MOVDQA
Loads a 128-bit value. The address p must be 16-byte aligned.
r := *p

__m128i _mm_loadu_si128 (__m128i *p); MOVDQU
Loads a 128-bit value. The address p does not need to be 16-byte aligned.
r := *p

__m128i _mm_loadl_epi64 (__m128i const *p); MOVQ
Loads the lower 64 bits of the value pointed to by p into the lower 64 bits of the result, zeroing the upper 64 bits of the result.
r0 := *p[63:0]
r1 := 0x0
Integer Set Operation Intrinsics
Intrinsic | Corresponding instruction |
---|---|
_mm_set_epi64 | Composite |
_mm_set_epi32 | Composite |
_mm_set_epi16 | Composite |
_mm_set_epi8 | Composite |
_mm_set1_epi64 | Composite |
_mm_set1_epi32 | Composite |
_mm_set1_epi16 | Composite |
_mm_set1_epi8 | Composite |
_mm_setr_epi64 | Composite |
_mm_setr_epi32 | Composite |
_mm_setr_epi16 | Composite |
_mm_setr_epi8 | Composite |
_mm_setzero_si128 | PXOR |
Integer Store Operation Intrinsics
void _mm_store_si128 (__m128i *p, __m128i a); MOVDQA
Stores a 128-bit value. The address p must be 16-byte aligned.
*p := a

void _mm_storeu_si128 (__m128i *p, __m128i a); MOVDQU
Stores a 128-bit value. The address p does not need to be 16-byte aligned.
*p := a

void _mm_maskmoveu_si128 (__m128i d, __m128i n, char *p); MASKMOVDQU
Conditionally stores byte elements of d to address p. The high bit of each byte in the selector n determines whether the corresponding byte in d will be stored. The address p does not need to be 16-byte aligned.
if (n0[7]) p[0] := d0
if (n1[7]) p[1] := d1
...
if (n15[7]) p[15] := d15

void _mm_storel_epi64 (__m128i *p, __m128i a); MOVQ
Stores the lower 64 bits of a to the address p.
*p[63:0] := a0
Cache Support
void _mm_stream_si128 (__m128i *p, __m128i a); MOVNTDQ
Stores the data in a to the address p without polluting the caches. If the cache line containing address p is already in the cache, the cache will be updated. The address p must be 16-byte aligned.
*p := a

void _mm_stream_si32 (int *p, int a); MOVNTI
Stores the data in a to the address p without polluting the caches. If the cache line containing address p is already in the cache, the cache will be updated.
*p := a

void _mm_clflush (void const *p); CLFLUSH
The cache line containing p is flushed and invalidated from all caches in the coherency domain.

void _mm_lfence (void); LFENCE
Guarantees that every load instruction that precedes, in program order, the load fence instruction is globally visible before any load instruction that follows the fence in program order.

void _mm_mfence (void); MFENCE
Guarantees that every memory access that precedes, in program order, the memory fence instruction is globally visible before any memory instruction that follows the fence in program order.

void _mm_pause (void); PAUSE
The execution of the next instruction is delayed for an implementation-specific amount of time. The instruction does not modify the architectural state.
Shuffle Function Macro
_MM_SHUFFLE2(x, y) /* expands to the value of */ (x<<1) | y
You can view the two integers as selectors for choosing which two words from the first input operand and which two words from the second are to be put into the result word.
View of Original and Result Words with Shuffle Function Macro