SSE Intrinsics

Thursday, 27 May 2010 14:19 Stefano Tommesani

Packed Arithmetic Intrinsics

Intrinsic     Instruction  Operation                        R0            R1            R2            R3
_mm_add_ss    ADDSS        Adds                             a0 [op] b0    a1            a2            a3
_mm_add_ps    ADDPS        Adds                             a0 [op] b0    a1 [op] b1    a2 [op] b2    a3 [op] b3
_mm_sub_ss    SUBSS        Subtracts                        a0 [op] b0    a1            a2            a3
_mm_sub_ps    SUBPS        Subtracts                        a0 [op] b0    a1 [op] b1    a2 [op] b2    a3 [op] b3
_mm_mul_ss    MULSS        Multiplies                       a0 [op] b0    a1            a2            a3
_mm_mul_ps    MULPS        Multiplies                       a0 [op] b0    a1 [op] b1    a2 [op] b2    a3 [op] b3
_mm_div_ss    DIVSS        Divides                          a0 [op] b0    a1            a2            a3
_mm_div_ps    DIVPS        Divides                          a0 [op] b0    a1 [op] b1    a2 [op] b2    a3 [op] b3
_mm_sqrt_ss   SQRTSS       Computes square root             [op] a0       a1            a2            a3
_mm_sqrt_ps   SQRTPS       Computes square root             [op] a0       [op] a1       [op] a2       [op] a3
_mm_rcp_ss    RCPSS        Computes reciprocal              [op] a0       a1            a2            a3
_mm_rcp_ps    RCPPS        Computes reciprocal              [op] a0       [op] a1       [op] a2       [op] a3
_mm_rsqrt_ss  RSQRTSS      Computes reciprocal square root  [op] a0       a1            a2            a3
_mm_rsqrt_ps  RSQRTPS      Computes reciprocal square root  [op] a0       [op] a1       [op] a2       [op] a3
_mm_min_ss    MINSS        Computes minimum                 [op](a0, b0)  a1            a2            a3
_mm_min_ps    MINPS        Computes minimum                 [op](a0, b0)  [op](a1, b1)  [op](a2, b2)  [op](a3, b3)
_mm_max_ss    MAXSS        Computes maximum                 [op](a0, b0)  a1            a2            a3
_mm_max_ps    MAXPS        Computes maximum                 [op](a0, b0)  [op](a1, b1)  [op](a2, b2)  [op](a3, b3)
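
The difference between the scalar (_ss) and packed (_ps) forms is easiest to see side by side. The following is a minimal sketch, not code from the original reference; the helper names add4_packed and add4_scalar are hypothetical, and unaligned loads and stores are used only so the functions accept any float pointer.

```c
#include <xmmintrin.h>

/* Packed form: adds all four elements with a single ADDPS. */
void add4_packed(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}

/* Scalar form: adds only element 0 (ADDSS); elements 1..3 are passed through from a. */
void add4_scalar(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ss(va, vb));
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, the packed version writes {11, 22, 33, 44}, while the scalar version writes {11, 2, 3, 4}.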

Logical Intrinsics

Intrinsic name Operation                  Corresponding instruction
_mm_and_ps     Bitwise AND                ANDPS
_mm_andnot_ps  Bitwise AND NOT, (~a) & b  ANDNPS
_mm_or_ps      Bitwise OR                 ORPS
_mm_xor_ps     Bitwise Exclusive OR       XORPS
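
A common use of the logical intrinsics is manipulating the sign bit of floating-point values. The sketch below computes an absolute value with _mm_andnot_ps, which inverts its first operand before ANDing it with the second. The function name abs4 is hypothetical, and the sign mask built from -0.0f assumes IEEE 754 single precision, where negative zero has only the top bit set.

```c
#include <xmmintrin.h>

/* Clears the sign bit of four floats: out[i] = |in[i]|. */
void abs4(const float *in, float *out)
{
    /* -0.0f has only the sign bit set in each 32-bit element. */
    const __m128 sign_mask = _mm_set1_ps(-0.0f);
    __m128 v = _mm_loadu_ps(in);
    /* (~sign_mask) & v keeps every bit except the sign bit. */
    _mm_storeu_ps(out, _mm_andnot_ps(sign_mask, v));
}
```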

Compare Intrinsics

Intrinsic name   Comparison                 Corresponding instruction
_mm_cmpeq_ss     Equal                      CMPEQSS
_mm_cmpeq_ps     Equal                      CMPEQPS
_mm_cmplt_ss     Less than                  CMPLTSS
_mm_cmplt_ps     Less than                  CMPLTPS
_mm_cmple_ss     Less than or equal         CMPLESS
_mm_cmple_ps     Less than or equal         CMPLEPS
_mm_cmpgt_ss     Greater than               CMPLTSS (operands swapped)
_mm_cmpgt_ps     Greater than               CMPLTPS (operands swapped)
_mm_cmpge_ss     Greater than or equal      CMPLESS (operands swapped)
_mm_cmpge_ps     Greater than or equal      CMPLEPS (operands swapped)
_mm_cmpneq_ss    Not equal                  CMPNEQSS
_mm_cmpneq_ps    Not equal                  CMPNEQPS
_mm_cmpnlt_ss    Not less than              CMPNLTSS
_mm_cmpnlt_ps    Not less than              CMPNLTPS
_mm_cmpnle_ss    Not less than or equal     CMPNLESS
_mm_cmpnle_ps    Not less than or equal     CMPNLEPS
_mm_cmpngt_ss    Not greater than           CMPNLTSS (operands swapped)
_mm_cmpngt_ps    Not greater than           CMPNLTPS (operands swapped)
_mm_cmpnge_ss    Not greater than or equal  CMPNLESS (operands swapped)
_mm_cmpnge_ps    Not greater than or equal  CMPNLEPS (operands swapped)
_mm_cmpord_ss    Ordered                    CMPORDSS
_mm_cmpord_ps    Ordered                    CMPORDPS
_mm_cmpunord_ss  Unordered                  CMPUNORDSS
_mm_cmpunord_ps  Unordered                  CMPUNORDPS
_mm_comieq_ss    Equal                      COMISS
_mm_comilt_ss    Less than                  COMISS
_mm_comile_ss    Less than or equal         COMISS
_mm_comigt_ss    Greater than               COMISS
_mm_comige_ss    Greater than or equal      COMISS
_mm_comineq_ss   Not equal                  COMISS
_mm_ucomieq_ss   Equal                      UCOMISS
_mm_ucomilt_ss   Less than                  UCOMISS
_mm_ucomile_ss   Less than or equal         UCOMISS
_mm_ucomigt_ss   Greater than               UCOMISS
_mm_ucomige_ss   Greater than or equal      UCOMISS
_mm_ucomineq_ss  Not equal                  UCOMISS
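
The _mm_cmp* intrinsics return a mask (all ones in each element where the comparison holds, all zeros elsewhere), whereas the _mm_comi* and _mm_ucomi* intrinsics return an int. The sketch below illustrates both; select_lt4 and low_equal are hypothetical helper names, and the AND / AND NOT / OR combination is one common way to use such a mask for a branch-free select.

```c
#include <xmmintrin.h>

/* out[i] = (a[i] < b[i]) ? x[i] : y[i], for four floats, without branches. */
void select_lt4(const float *a, const float *b,
                const float *x, const float *y, float *out)
{
    /* Each element of mask is all ones where a < b, all zeros otherwise. */
    __m128 mask = _mm_cmplt_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    __m128 take_x = _mm_and_ps(mask, _mm_loadu_ps(x));     /* x where mask is set   */
    __m128 take_y = _mm_andnot_ps(mask, _mm_loadu_ps(y));  /* y where mask is clear */
    _mm_storeu_ps(out, _mm_or_ps(take_x, take_y));
}

/* Returns 1 if p == q, 0 otherwise, using COMISS on the low elements. */
int low_equal(float p, float q)
{
    return _mm_comieq_ss(_mm_set_ss(p), _mm_set_ss(q));
}
```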

Conversion Operations

Intrinsic name    Corresponding instruction
_mm_cvtss_si32    CVTSS2SI
_mm_cvtps_pi32    CVTPS2PI
_mm_cvttss_si32   CVTTSS2SI
_mm_cvttps_pi32   CVTTPS2PI
_mm_cvtsi32_ss    CVTSI2SS
_mm_cvtpi32_ps    CVTPI2PS
_mm_cvtpi16_ps    Composite
_mm_cvtpu16_ps    Composite
_mm_cvtpi8_ps     Composite
_mm_cvtpu8_ps     Composite
_mm_cvtpi32x2_ps  Composite
_mm_cvtps_pi16    Composite
_mm_cvtps_pi8     Composite
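
The intrinsics with a double "t" (_mm_cvttss_si32, _mm_cvttps_pi32) truncate toward zero, while the plain forms round according to the current MXCSR rounding mode. A minimal sketch follows; the function names are hypothetical, and the results mentioned afterwards assume the default rounding mode (round to nearest, ties to even).

```c
#include <xmmintrin.h>

/* CVTSS2SI: converts using the current MXCSR rounding mode. */
int round_to_int(float x)
{
    return _mm_cvtss_si32(_mm_set_ss(x));
}

/* CVTTSS2SI: always truncates toward zero. */
int trunc_to_int(float x)
{
    return _mm_cvttss_si32(_mm_set_ss(x));
}
```

Under that assumption, round_to_int(2.7f) yields 3 and trunc_to_int(2.7f) yields 2, and a tie such as 2.5f rounds to the even value 2.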

Miscellaneous Intrinsics

Intrinsic name    Operation                          Corresponding instruction
_mm_shuffle_ps    Shuffles                           SHUFPS
_mm_shuffle_pi16  Shuffles                           PSHUFW
_mm_unpackhi_ps   Unpacks high                       UNPCKHPS
_mm_unpacklo_ps   Unpacks low                        UNPCKLPS
_mm_loadh_pi      Loads high                         MOVHPS reg, mem
_mm_storeh_pi     Stores high                        MOVHPS mem, reg
_mm_movehl_ps     Moves high to low                  MOVHLPS
_mm_movelh_ps     Moves low to high                  MOVLHPS
_mm_loadl_pi      Loads low                          MOVLPS reg, mem
_mm_storel_pi     Stores low                         MOVLPS mem, reg
_mm_movemask_ps   Creates four-bit mask              MOVMSKPS
_mm_getcsr        Returns control register contents  STMXCSR
_mm_setcsr        Sets control register              LDMXCSR
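
As an illustration of how the move and shuffle intrinsics combine, the sketch below sums the four elements of a vector and extracts the sign bits with _mm_movemask_ps. The names hsum4 and sign_bits4 are hypothetical, and _mm_cvtss_f32 (which returns the low element as a float) is an SSE intrinsic not listed in the tables on this page.

```c
#include <xmmintrin.h>

/* Horizontal sum: returns p[0] + p[1] + p[2] + p[3]. */
float hsum4(const float *p)
{
    __m128 v    = _mm_loadu_ps(p);             /* {p0, p1, p2, p3}          */
    __m128 hi   = _mm_movehl_ps(v, v);         /* {p2, p3, p2, p3}          */
    __m128 sum2 = _mm_add_ps(v, hi);           /* {p0+p2, p1+p3, ...}       */
    /* Broadcast element 1 of sum2 so it can be added to element 0. */
    __m128 odd  = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    return _mm_cvtss_f32(_mm_add_ss(sum2, odd));
}

/* Bit i of the result is the sign bit of p[i]. */
int sign_bits4(const float *p)
{
    return _mm_movemask_ps(_mm_loadu_ps(p));
}
```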

Memory and Initialization Load Operations

Intrinsic name  Operation                                             Corresponding instruction
_mm_load_ss     Loads the low value and clears the three high values  MOVSS
_mm_load1_ps    Loads one value into all four words                   MOVSS + Shuffling
_mm_load_ps     Loads four values, address aligned                    MOVAPS
_mm_loadu_ps    Loads four values, address unaligned                  MOVUPS
_mm_loadr_ps    Loads four values, in reverse order                   MOVAPS + Shuffling
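
The two single-value loads behave differently in the upper elements, as the following sketch shows. The function names load_low and load_broadcast are hypothetical; the results are written back with an unaligned store only so they can be inspected from any float array.

```c
#include <xmmintrin.h>

/* _mm_load_ss: out becomes {*p, 0, 0, 0}. */
void load_low(const float *p, float *out)
{
    _mm_storeu_ps(out, _mm_load_ss(p));
}

/* _mm_load1_ps: out becomes {*p, *p, *p, *p}. */
void load_broadcast(const float *p, float *out)
{
    _mm_storeu_ps(out, _mm_load1_ps(p));
}
```

Note that _mm_load_ps and _mm_loadr_ps, unlike the functions above, require the source address to be 16-byte aligned.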

Memory and Initialization Set Operations

Intrinsic name  Operation                                            Corresponding instruction
_mm_set_ss      Sets the low value and clears the three high values  Composite
_mm_set1_ps     Sets all four words with the same value              Composite
_mm_set_ps      Sets four values, highest element first              Composite
_mm_setr_ps     Sets four values, in reverse order                   Composite
_mm_setzero_ps  Clears all four values                               Composite
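
The argument order of _mm_set_ps and _mm_setr_ps is a frequent source of confusion: _mm_set_ps takes the highest element first, so its last argument ends up in element 0. A minimal sketch, with the hypothetical helper name set_order:

```c
#include <xmmintrin.h>

/* Stores the same argument list through _mm_set_ps and _mm_setr_ps. */
void set_order(float *from_set, float *from_setr)
{
    /* Last argument goes to element 0: memory holds 0, 1, 2, 3. */
    _mm_storeu_ps(from_set, _mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f));
    /* First argument goes to element 0: memory holds 3, 2, 1, 0. */
    _mm_storeu_ps(from_setr, _mm_setr_ps(3.0f, 2.0f, 1.0f, 0.0f));
}
```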

Memory and Initialization Store Operations

Intrinsic name  Operation                                                      Corresponding instruction
_mm_store_ss    Stores the low value                                           MOVSS
_mm_store1_ps   Stores the low value across all four words                     MOVSS + Shuffling
_mm_store_ps    Stores four values, address aligned                            MOVAPS
_mm_storeu_ps   Stores four values, address unaligned                          MOVUPS
_mm_storer_ps   Stores four values, in reverse order                           MOVAPS + Shuffling
_mm_move_ss     Sets the low word from b and keeps the three high values of a  MOVSS
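
The sketch below shows _mm_move_ss merging two vectors and _mm_store_ss writing a single float. The helper names replace_low and store_low are hypothetical.

```c
#include <xmmintrin.h>

/* out = {b[0], a[1], a[2], a[3]}: low element from b, the rest from a. */
void replace_low(const float *a, const float *b, float *out)
{
    __m128 merged = _mm_move_ss(_mm_loadu_ps(a), _mm_loadu_ps(b));
    _mm_storeu_ps(out, merged);
}

/* Writes only element 0 of the vector loaded from a to *out_scalar. */
void store_low(const float *a, float *out_scalar)
{
    _mm_store_ss(out_scalar, _mm_loadu_ps(a));
}
```

As with the loads, _mm_store_ps, _mm_store1_ps, and _mm_storer_ps require a 16-byte aligned destination, while _mm_storeu_ps and _mm_store_ss do not.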

Integer Intrinsics

Intrinsic name     Operation                             Corresponding instruction
_mm_extract_pi16   Extracts one of four words            PEXTRW
_mm_insert_pi16    Inserts a word                        PINSRW
_mm_max_pi16       Computes the maximum                  PMAXSW
_mm_max_pu8        Computes the maximum, unsigned        PMAXUB
_mm_min_pi16       Computes the minimum                  PMINSW
_mm_min_pu8        Computes the minimum, unsigned        PMINUB
_mm_movemask_pi8   Creates an 8-bit mask                 PMOVMSKB
_mm_mulhi_pu16     Multiplies, returning high bits       PMULHUW
_mm_shuffle_pi16   Returns a combination of four words   PSHUFW
_mm_maskmove_si64  Performs conditional store            MASKMOVQ
_mm_avg_pu8        Computes rounded average              PAVGB
_mm_avg_pu16       Computes rounded average              PAVGW
_mm_sad_pu8        Computes sum of absolute differences  PSADBW

Cache support

void _mm_prefetch(char *p, int i);
PREFETCHT0, PREFETCHT1, PREFETCHT2, PREFETCHNTA

Loads the cache line containing address p into a location closer to the processor. The value i selects the type of prefetch operation and should be one of the constants _MM_HINT_T0, _MM_HINT_T1, _MM_HINT_T2, or _MM_HINT_NTA, each corresponding to one variant of the prefetch instruction.
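
A typical pattern is to request data some distance ahead of the element currently being processed. The following is only a sketch: the function name sum_with_prefetch and the distance PREFETCH_AHEAD are assumptions chosen for illustration, and a useful distance depends on the processor and the access pattern.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical prefetch distance, in elements; tune for the target machine. */
#define PREFETCH_AHEAD 64

/* Sums n floats while hinting that data further ahead will be needed soon. */
float sum_with_prefetch(const float *data, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_AHEAD < n)
            _mm_prefetch((const char *)(data + i + PREFETCH_AHEAD), _MM_HINT_T0);
        total += data[i];
    }
    return total;
}
```

The prefetch is only a hint, so the result of the function is the same with or without it.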

void _mm_stream_pi(__m64 *p, __m64 a);
MOVNTQ

Stores the data in a to the address p without polluting the caches. Because this intrinsic uses an MMX register, the multimedia state must be emptied with the EMMS instruction (_mm_empty intrinsic) before any subsequent x87 floating-point code runs.

void _mm_stream_ps(float *p, __m128 a);
MOVNTPS

Stores the data in a to the address p without polluting the caches. The address must be 16-byte aligned.

void _mm_sfence(void);
SFENCE

Guarantees that every preceding store is globally visible before any subsequent store.
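
Streaming stores and the store fence are usually used together, for example when filling a large buffer that will not be read again soon. The sketch below assumes a hypothetical helper named fill_stream, a destination that is 16-byte aligned, and a count that is a multiple of four.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Fills dst[0..count-1] with value using non-temporal stores.
   Assumes dst is 16-byte aligned and count is a multiple of 4. */
void fill_stream(float *dst, size_t count, float value)
{
    const __m128 v = _mm_set1_ps(value);
    for (size_t i = 0; i < count; i += 4)
        _mm_stream_ps(dst + i, v);   /* MOVNTPS: bypasses the caches */
    /* Make the streamed data globally visible before any later store. */
    _mm_sfence();
}
```

An aligned buffer can be obtained, for instance, with _mm_malloc(size, 16) and released with _mm_free.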

Last Updated on Monday, 27 May 2013 15:09