Stefano Tommesani


SSE2 Intrinsics

Floating-Point Intrinsics

Arithmetic Operation Intrinsics

Intrinsic name  Corresponding instruction  Operation             R0 value     R1 value
_mm_add_sd      ADDSD                      Adds                  a0 + b0      a1
_mm_add_pd      ADDPD                      Adds                  a0 + b0      a1 + b1
_mm_div_sd      DIVSD                      Divides               a0 / b0      a1
_mm_div_pd      DIVPD                      Divides               a0 / b0      a1 / b1
_mm_max_sd      MAXSD                      Computes maximum      max(a0, b0)  a1
_mm_max_pd      MAXPD                      Computes maximum      max(a0, b0)  max(a1, b1)
_mm_min_sd      MINSD                      Computes minimum      min(a0, b0)  a1
_mm_min_pd      MINPD                      Computes minimum      min(a0, b0)  min(a1, b1)
_mm_mul_sd      MULSD                      Multiplies            a0 * b0      a1
_mm_mul_pd      MULPD                      Multiplies            a0 * b0      a1 * b1
_mm_sqrt_sd     SQRTSD                     Computes square root  sqrt(b0)     a1
_mm_sqrt_pd     SQRTPD                     Computes square root  sqrt(a0)     sqrt(a1)
_mm_sub_sd      SUBSD                      Subtracts             a0 - b0      a1
_mm_sub_pd      SUBPD                      Subtracts             a0 - b0      a1 - b1

 

Logical Operations

__m128d _mm_andnot_pd (__m128d a, __m128d b);
ANDNPD

Computes the bitwise AND of the 128-bit value in b and the bitwise NOT of the 128-bit value in a.

r0 := (~a0) & b0 
r1 := (~a1) & b1 
__m128d _mm_and_pd (__m128d a, __m128d b);
ANDPD 

Computes the bitwise AND of the two double-precision, floating-point values of a and b.

r0 := a0 & b0
r1 := a1 & b1
__m128d _mm_or_pd (__m128d a, __m128d b);
ORPD

Computes the bitwise OR of the two double-precision, floating-point values of a and b.

r0 := a0 | b0
r1 := a1 | b1
__m128d _mm_xor_pd (__m128d a, __m128d b);
XORPD

Computes the bitwise XOR of the two double-precision, floating-point values of a and b .

r0 := a0 ^ b0
r1 := a1 ^ b1

 

Comparison Intrinsics

Intrinsic name Corresponding instruction Compare for
_mm_cmpeq_pd CMPEQPD Equality
_mm_cmplt_pd CMPLTPD Less than
_mm_cmple_pd CMPLEPD Less than or equal
_mm_cmpgt_pd CMPLTPDr Greater than
_mm_cmpge_pd CMPLEPDr Greater than or equal
_mm_cmpord_pd CMPORDPD Ordered
_mm_cmpunord_pd CMPUNORDPD Unordered
_mm_cmpneq_pd CMPNEQPD Inequality
_mm_cmpnlt_pd CMPNLTPD Not less than
_mm_cmpnle_pd CMPNLEPD Not less than or equal
_mm_cmpngt_pd CMPNLTPDr Not greater than
_mm_cmpnge_pd CMPNLEPDr Not greater than or equal
_mm_cmpeq_sd CMPEQSD Equality
_mm_cmplt_sd CMPLTSD Less than
_mm_cmple_sd CMPLESD Less than or equal
_mm_cmpgt_sd CMPLTSDr Greater than
_mm_cmpge_sd CMPLESDr Greater than or equal
_mm_cmpord_sd CMPORDSD Ordered
_mm_cmpunord_sd CMPUNORDSD Unordered
_mm_cmpneq_sd CMPNEQSD Inequality
_mm_cmpnlt_sd CMPNLTSD Not less than
_mm_cmpnle_sd CMPNLESD Not less than or equal
_mm_cmpngt_sd CMPNLTSDr Not greater than
_mm_cmpnge_sd CMPNLESDr Not greater than or equal
_mm_comieq_sd COMISD Equality
_mm_comilt_sd COMISD Less than
_mm_comile_sd COMISD Less than or equal
_mm_comigt_sd COMISD Greater than
_mm_comige_sd COMISD Greater than or equal
_mm_comineq_sd COMISD Not equal
_mm_ucomieq_sd UCOMISD Equality
_mm_ucomilt_sd UCOMISD Less than
_mm_ucomile_sd UCOMISD Less than or equal
_mm_ucomigt_sd UCOMISD Greater than
_mm_ucomige_sd UCOMISD Greater than or equal
_mm_ucomineq_sd UCOMISD Not equal

 

Conversion Operations

Intrinsic name Corresponding instruction Return type Parameters
_mm_cvtpd_ps CVTPD2PS __m128 (__m128d a)
_mm_cvtps_pd CVTPS2PD __m128d (__m128 a)
_mm_cvtepi32_pd CVTDQ2PD __m128d (__m128i a)
_mm_cvtpd_epi32 CVTPD2DQ __m128i (__m128d a)
_mm_cvtsd_si32 CVTSD2SI int (__m128d a)
_mm_cvtsd_ss CVTSD2SS __m128 (__m128 a, __m128d b)
_mm_cvtsi32_sd CVTSI2SD __m128d (__m128d a, int b)
_mm_cvtss_sd CVTSS2SD __m128d (__m128d a, __m128 b)
_mm_cvttpd_epi32 CVTTPD2DQ __m128i (__m128d a)
_mm_cvttsd_si32 CVTTSD2SI int (__m128d a)
_mm_cvtepi32_ps CVTDQ2PS __m128 (__m128i a)
_mm_cvtps_epi32 CVTPS2DQ __m128i (__m128 a)
_mm_cvttps_epi32 CVTTPS2DQ __m128i (__m128 a)
_mm_cvtpd_pi32 CVTPD2PI __m64 (__m128d a)
_mm_cvttpd_pi32 CVTTPD2PI __m64 (__m128d a)
_mm_cvtpi32_pd CVTPI2PD __m128d (__m64 a)

Miscellaneous Operations

__m128d _mm_unpackhi_pd (__m128d a, __m128d b);
UNPCKHPD

Interleaves the upper double-precision, floating-point values of a and b.

r0 := a1
r1 := b1
__m128d _mm_unpacklo_pd (__m128d a, __m128d b);
UNPCKLPD

Interleaves the lower double-precision, floating-point values of a and b.

r0 := a0
r1 := b0
int _mm_movemask_pd (__m128d a);
MOVMSKPD

Creates a two-bit mask from the sign bits of the two double-precision, floating-point values of a.

r := sign(a1) << 1 | sign(a0)
__m128d _mm_shuffle_pd (__m128d a, __m128d b, int i);
SHUFPD

Selects two specific double-precision, floating-point values from a and b, based on the mask i. The mask must be an immediate. See the Shuffle Function Macro section below for a description of the shuffle semantics.

 

Integer Intrinsics

 

Integer Arithmetic Operations

Intrinsic Instruction Operation
_mm_add_epi8 PADDB Addition
_mm_add_epi16 PADDW Addition
_mm_add_epi32 PADDD Addition
_mm_add_si64 PADDQ Addition
_mm_add_epi64 PADDQ Addition
_mm_adds_epi8 PADDSB Addition
_mm_adds_epi16 PADDSW Addition
_mm_adds_epu8 PADDUSB Addition
_mm_adds_epu16 PADDUSW Addition
_mm_avg_epu8 PAVGB Computes average
_mm_avg_epu16 PAVGW Computes average
_mm_madd_epi16 PMADDWD Multiplication/addition
_mm_max_epi16 PMAXSW Computes maxima
_mm_max_epu8 PMAXUB Computes maxima
_mm_min_epi16 PMINSW Computes minima
_mm_min_epu8 PMINUB Computes minima
_mm_mulhi_epi16 PMULHW Multiplication
_mm_mulhi_epu16 PMULHUW Multiplication
_mm_mullo_epi16 PMULLW Multiplication
_mm_mul_su32 PMULUDQ Multiplication
_mm_mul_epu32 PMULUDQ Multiplication
_mm_sad_epu8 PSADBW Computes difference/adds
_mm_sub_epi8 PSUBB Subtraction
_mm_sub_epi16 PSUBW Subtraction
_mm_sub_epi32 PSUBD Subtraction
_mm_sub_si64 PSUBQ Subtraction
_mm_sub_epi64 PSUBQ Subtraction
_mm_subs_epi8 PSUBSB Subtraction
_mm_subs_epi16 PSUBSW Subtraction
_mm_subs_epu8 PSUBUSB Subtraction
_mm_subs_epu16 PSUBUSW Subtraction

 

Logical Operations Intrinsics

For an explanation of the syntax used in code samples in this topic, see Floating-Point Intrinsics Using Streaming SIMD Extensions.

__m128i _mm_and_si128 (__m128i a, __m128i b);
PAND

Computes the bitwise AND of the 128-bit value in a and the 128-bit value in b.

r := a & b
__m128i _mm_andnot_si128 (__m128i a, __m128i b);
PANDN

Computes the bitwise AND of the 128-bit value in b and the bitwise NOT of the 128-bit value in a.

r := (~a) & b
__m128i _mm_or_si128 (__m128i a, __m128i b);
POR

Computes the bitwise OR of the 128-bit value in a and the 128-bit value in b.

r := a | b
__m128i _mm_xor_si128 ( __m128i a, __m128i b);
PXOR

Computes the bitwise XOR of the 128-bit value in a and the 128-bit value in b.

r := a ^ b

 

Shift Operation Intrinsics

Intrinsic Shift direction Shift type Corresponding instruction
_mm_slli_si128 Left Logical PSLLDQ
_mm_slli_epi16 Left Logical PSLLW
_mm_sll_epi16 Left Logical PSLLW
_mm_slli_epi32 Left Logical PSLLD
_mm_sll_epi32 Left Logical PSLLD
_mm_slli_epi64 Left Logical PSLLQ
_mm_sll_epi64 Left Logical PSLLQ
_mm_srai_epi16 Right Arithmetic PSRAW
_mm_sra_epi16 Right Arithmetic PSRAW
_mm_srai_epi32 Right Arithmetic PSRAD
_mm_sra_epi32 Right Arithmetic PSRAD
_mm_srli_si128 Right Logical PSRLDQ
_mm_srli_epi16 Right Logical PSRLW
_mm_srl_epi16 Right Logical PSRLW
_mm_srli_epi32 Right Logical PSRLD
_mm_srl_epi32 Right Logical PSRLD
_mm_srli_epi64 Right Logical PSRLQ
_mm_srl_epi64 Right Logical PSRLQ

 

Conversion Intrinsics

__m128i _mm_cvtsi32_si128 (int a);
MOVD

Moves 32-bit integer a to the least significant 32 bits of an __m128i object, zero extending the upper bits.

r0 := a
r1 := 0x0 ; r2 := 0x0 ; r3 := 0x0
int _mm_cvtsi128_si32 (__m128i a);
MOVD

Moves the least significant 32 bits of a to a 32-bit integer.

r := a0

 

Comparison Intrinsics

Intrinsic name Instruction Comparison Elements Size of elements (bits)
_mm_cmpeq_epi8 PCMPEQB Equality 16 8
_mm_cmpeq_epi16 PCMPEQW Equality 8 16
_mm_cmpeq_epi32 PCMPEQD Equality 4 32
_mm_cmpgt_epi8 PCMPGTB Greater than 16 8
_mm_cmpgt_epi16 PCMPGTW Greater than 8 16
_mm_cmpgt_epi32 PCMPGTD Greater than 4 32
_mm_cmplt_epi8 PCMPGTBr Less than 16 8
_mm_cmplt_epi16 PCMPGTWr Less than 8 16
_mm_cmplt_epi32 PCMPGTDr Less than 4 32

 

Miscellaneous Operations Intrinsics

Intrinsic Corresponding instruction Operation
_mm_packs_epi16 PACKSSWB Packed saturation
_mm_packs_epi32 PACKSSDW Packed saturation
_mm_packus_epi16 PACKUSWB Packed saturation
_mm_extract_epi16 PEXTRW Extraction
_mm_insert_epi16 PINSRW Insertion
_mm_movemask_epi8 PMOVMSKB Mask creation
_mm_shuffle_epi32 PSHUFD Shuffle
_mm_shufflehi_epi16 PSHUFHW Shuffle
_mm_shufflelo_epi16 PSHUFLW Shuffle
_mm_unpackhi_epi8 PUNPCKHBW Interleave
_mm_unpackhi_epi16 PUNPCKHWD Interleave
_mm_unpackhi_epi32 PUNPCKHDQ Interleave
_mm_unpackhi_epi64 PUNPCKHQDQ Interleave
_mm_unpacklo_epi8 PUNPCKLBW Interleave
_mm_unpacklo_epi16 PUNPCKLWD Interleave
_mm_unpacklo_epi32 PUNPCKLDQ Interleave
_mm_unpacklo_epi64 PUNPCKLQDQ Interleave
_mm_movepi64_pi64 MOVDQ2Q Move
_mm_movpi64_epi64 MOVQ2DQ Move
_mm_move_epi64 MOVQ Move

 

Cache Support Intrinsics

void _mm_stream_pd (double *p, __m128d a);
MOVNTPD

Stores the data in a to the address p without polluting caches. The address p must be 16-byte aligned. If the cache line containing address p is already in the cache, the cache will be updated.

p[0] := a0
p[1] := a1 

 

Integer Load Operation

__m128i _mm_load_si128 (__m128i *p);
MOVDQA

Loads 128-bit value. Address p must be 16-byte aligned.

r := *p
__m128i _mm_loadu_si128 (__m128i *p);
MOVDQU

Loads 128-bit value. Address p does not need to be 16-byte aligned.

r := *p
__m128i _mm_loadl_epi64(__m128i const*p);
MOVQ

Load the lower 64 bits of the value pointed to by p into the lower 64 bits of the result, zeroing the upper 64 bits of the result.

r0:= *p[63:0]
r1:=0x0

 

Integer Set Operation Intrinsics

Intrinsic Corresponding instruction
_mm_set_epi64 Composite
_mm_set_epi32 Composite
_mm_set_epi16 Composite
_mm_set_epi8 Composite
_mm_set1_epi64 Composite
_mm_set1_epi32 Composite
_mm_set1_epi16 Composite
_mm_set1_epi8 Composite
_mm_setr_epi64 Composite
_mm_setr_epi32 Composite
_mm_setr_epi16 Composite
_mm_setr_epi8 Composite
_mm_setzero_si128 PXOR

 

Integer Store Operation Intrinsics

void _mm_store_si128 (__m128i *p, __m128i a);
MOVDQA

Stores 128-bit value. Address p must be 16-byte aligned.

*p := a
void _mm_storeu_si128 (__m128i *p, __m128i a);
MOVDQU

Stores 128-bit value. Address p does not need to be 16-byte aligned.

*p := a
void _mm_maskmoveu_si128(__m128i d, __m128i n, char *p);
MASKMOVDQU

Conditionally store byte elements of d to address p. The high bit of each byte in the selector n determines whether the corresponding byte in d will be stored. Address p does not need to be 16-byte aligned.

if (n0[7]) p[0] := d0
if (n1[7]) p[1] := d1
...
if (n15[7]) p[15] := d15
void _mm_storel_epi64(__m128i *p, __m128i a);
MOVQ

Stores the lower 64 bits of a to the address p.

*p[63:0] := a0

 

Cache Support

void _mm_stream_si128(__m128i *p, __m128i a)

Stores the data in a to the address p without polluting the caches. If the cache line containing address p is already in the cache, the cache will be updated. Address p must be 16-byte aligned.

*p := a
void _mm_stream_si32(int *p, int a)

Stores the data in a to the address p without polluting the caches. If the cache line containing address p is already in the cache, the cache will be updated.

*p := a
void _mm_clflush(void const*p)

Cache line containing p is flushed and invalidated from all caches in the coherency domain.

void _mm_lfence(void)

Guarantees that every load instruction that precedes, in program order, the load fence instruction is globally visible before any load instruction that follows the fence in program order.

void _mm_mfence(void)

Guarantees that every memory access that precedes, in program order, the memory fence instruction is globally visible before any memory instruction that follows the fence in program order.

void _mm_pause(void)

The execution of the next instruction is delayed an implementation specific amount of time. The instruction does not modify the architectural state.

 

Shuffle Function Macro

_MM_SHUFFLE2(x, y)
/* expands to the value of */
(x<<1) | y

You can view the two integers as selectors: one chooses which double-precision value from the first input operand goes into the low half of the result, the other chooses which value from the second operand goes into the high half.

View of Original and Result Words with Shuffle Function Macro

 

Published: Thursday, 27 May 2010
Last Updated: Monday, 27 May 2013