## SSE Intrinsics

Thursday, 27 May 2010 14:19 Stefano Tommesani

## Packed Arithmetic Intrinsics

In the table, `a` and `b` are the first and second operands, and R0 to R3 are the elements of the result, with R0 the lowest.

| Intrinsic | Instruction | Operation | R0 | R1 | R2 | R3 |
|---|---|---|---|---|---|---|
| `_mm_add_ss` | ADDSS | Adds | a0 + b0 | a1 | a2 | a3 |
| `_mm_add_ps` | ADDPS | Adds | a0 + b0 | a1 + b1 | a2 + b2 | a3 + b3 |
| `_mm_sub_ss` | SUBSS | Subtracts | a0 - b0 | a1 | a2 | a3 |
| `_mm_sub_ps` | SUBPS | Subtracts | a0 - b0 | a1 - b1 | a2 - b2 | a3 - b3 |
| `_mm_mul_ss` | MULSS | Multiplies | a0 * b0 | a1 | a2 | a3 |
| `_mm_mul_ps` | MULPS | Multiplies | a0 * b0 | a1 * b1 | a2 * b2 | a3 * b3 |
| `_mm_div_ss` | DIVSS | Divides | a0 / b0 | a1 | a2 | a3 |
| `_mm_div_ps` | DIVPS | Divides | a0 / b0 | a1 / b1 | a2 / b2 | a3 / b3 |
| `_mm_sqrt_ss` | SQRTSS | Computes square root | sqrt(a0) | a1 | a2 | a3 |
| `_mm_sqrt_ps` | SQRTPS | Computes square root | sqrt(a0) | sqrt(a1) | sqrt(a2) | sqrt(a3) |
| `_mm_rcp_ss` | RCPSS | Computes approximate reciprocal | 1/a0 | a1 | a2 | a3 |
| `_mm_rcp_ps` | RCPPS | Computes approximate reciprocal | 1/a0 | 1/a1 | 1/a2 | 1/a3 |
| `_mm_rsqrt_ss` | RSQRTSS | Computes approximate reciprocal square root | 1/sqrt(a0) | a1 | a2 | a3 |
| `_mm_rsqrt_ps` | RSQRTPS | Computes approximate reciprocal square root | 1/sqrt(a0) | 1/sqrt(a1) | 1/sqrt(a2) | 1/sqrt(a3) |
| `_mm_min_ss` | MINSS | Computes minimum | min(a0, b0) | a1 | a2 | a3 |
| `_mm_min_ps` | MINPS | Computes minimum | min(a0, b0) | min(a1, b1) | min(a2, b2) | min(a3, b3) |
| `_mm_max_ss` | MAXSS | Computes maximum | max(a0, b0) | a1 | a2 | a3 |
| `_mm_max_ps` | MAXPS | Computes maximum | max(a0, b0) | max(a1, b1) | max(a2, b2) | max(a3, b3) |
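To make the difference between the `_ss` and `_ps` forms concrete, here is a minimal sketch (the helper name `add_demo` is hypothetical) that applies ADDPS and ADDSS to the same operands. It also uses `_mm_loadu_ps`, the unaligned-load intrinsic from `<xmmintrin.h>`, which the tables here do not list.

```c
#include <xmmintrin.h>

/* Hypothetical helper: adds two 4-float arrays with ADDPS (all lanes)
   and with ADDSS (lane 0 only). Every array holds 4 floats. */
void add_demo(const float a[4], const float b[4],
              float packed_out[4], float scalar_out[4])
{
    __m128 va = _mm_loadu_ps(a);   /* unaligned load, so any float array works */
    __m128 vb = _mm_loadu_ps(b);

    /* ADDPS: every lane i becomes a[i] + b[i]. */
    _mm_storeu_ps(packed_out, _mm_add_ps(va, vb));

    /* ADDSS: lane 0 becomes a[0] + b[0]; lanes 1..3 are copied from a. */
    _mm_storeu_ps(scalar_out, _mm_add_ss(va, vb));
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, the packed result is {11, 22, 33, 44} while the scalar result is {11, 2, 3, 4}.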

## Logical Intrinsics

| Intrinsic name | Operation | Corresponding instruction |
|---|---|---|
| `_mm_and_ps` | Bitwise AND | ANDPS |
| `_mm_andnot_ps` | Bitwise AND NOT, computes `(~a) & b` | ANDNPS |
| `_mm_or_ps` | Bitwise OR | ORPS |
| `_mm_xor_ps` | Bitwise exclusive OR | XORPS |
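A common use of the logical intrinsics is manipulating the sign bit of each float. This is a minimal sketch, assuming IEEE-754 single precision, where `-0.0f` has only bit 31 set; `abs_ps` and `neg_ps` are hypothetical helper names.

```c
#include <xmmintrin.h>

/* Absolute value of four floats: clear the sign bit (bit 31) of each lane.
   _mm_andnot_ps(m, x) computes (~m) & x, so passing the sign mask as m
   keeps every bit except the sign. */
__m128 abs_ps(__m128 x)
{
    const __m128 sign_mask = _mm_set1_ps(-0.0f);   /* only bit 31 set */
    return _mm_andnot_ps(sign_mask, x);
}

/* Negation: flip the sign bit of each lane with XOR. */
__m128 neg_ps(__m128 x)
{
    return _mm_xor_ps(x, _mm_set1_ps(-0.0f));
}
```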

## Compare Intrinsics

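The `_mm_cmp*` intrinsics return an `__m128` in which each compared element is set to all ones where the comparison is true and to all zeros where it is false; the `_ss` forms compare only the low element and pass the three high elements of the first operand through. The `_mm_comi*_ss` and `_mm_ucomi*_ss` intrinsics compare only the low elements and return an `int` (0 or 1).

| Intrinsic name | Comparison | Corresponding instruction |
|---|---|---|
| `_mm_cmpeq_ss` | Equal | CMPEQSS |
| `_mm_cmpeq_ps` | Equal | CMPEQPS |
| `_mm_cmplt_ss` | Less than | CMPLTSS |
| `_mm_cmplt_ps` | Less than | CMPLTPS |
| `_mm_cmple_ss` | Less than or equal | CMPLESS |
| `_mm_cmple_ps` | Less than or equal | CMPLEPS |
| `_mm_cmpgt_ss` | Greater than | CMPLTSS (operands swapped) |
| `_mm_cmpgt_ps` | Greater than | CMPLTPS (operands swapped) |
| `_mm_cmpge_ss` | Greater than or equal | CMPLESS (operands swapped) |
| `_mm_cmpge_ps` | Greater than or equal | CMPLEPS (operands swapped) |
| `_mm_cmpneq_ss` | Not equal | CMPNEQSS |
| `_mm_cmpneq_ps` | Not equal | CMPNEQPS |
| `_mm_cmpnlt_ss` | Not less than | CMPNLTSS |
| `_mm_cmpnlt_ps` | Not less than | CMPNLTPS |
| `_mm_cmpnle_ss` | Not less than or equal | CMPNLESS |
| `_mm_cmpnle_ps` | Not less than or equal | CMPNLEPS |
| `_mm_cmpngt_ss` | Not greater than | CMPNLTSS (operands swapped) |
| `_mm_cmpngt_ps` | Not greater than | CMPNLTPS (operands swapped) |
| `_mm_cmpnge_ss` | Not greater than or equal | CMPNLESS (operands swapped) |
| `_mm_cmpnge_ps` | Not greater than or equal | CMPNLEPS (operands swapped) |
| `_mm_cmpord_ss` | Ordered | CMPORDSS |
| `_mm_cmpord_ps` | Ordered | CMPORDPS |
| `_mm_cmpunord_ss` | Unordered | CMPUNORDSS |
| `_mm_cmpunord_ps` | Unordered | CMPUNORDPS |
| `_mm_comieq_ss` | Equal | COMISS |
| `_mm_comilt_ss` | Less than | COMISS |
| `_mm_comile_ss` | Less than or equal | COMISS |
| `_mm_comigt_ss` | Greater than | COMISS |
| `_mm_comige_ss` | Greater than or equal | COMISS |
| `_mm_comineq_ss` | Not equal | COMISS |
| `_mm_ucomieq_ss` | Equal | UCOMISS |
| `_mm_ucomilt_ss` | Less than | UCOMISS |
| `_mm_ucomile_ss` | Less than or equal | UCOMISS |
| `_mm_ucomigt_ss` | Greater than | UCOMISS |
| `_mm_ucomige_ss` | Greater than or equal | UCOMISS |
| `_mm_ucomineq_ss` | Not equal | UCOMISS |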
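Because packed compares return a per-lane mask rather than a boolean, they are usually combined with the logical intrinsics to pick values without branching. The sketch below shows that pattern; `select_ps`, `clamp_below` and `low_less_than` are hypothetical names. For this particular clamp `_mm_max_ps` would do the job directly; the point is the mask-and-select idiom.

```c
#include <xmmintrin.h>

/* Per-lane select: a[i] where mask[i] is all ones, b[i] where it is all zeros.
   Packed compares such as _mm_cmplt_ps produce exactly this kind of mask. */
static __m128 select_ps(__m128 mask, __m128 a, __m128 b)
{
    return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
}

/* Replaces every lane of x that is below `limit` with `limit`, without branches. */
__m128 clamp_below(__m128 x, float limit)
{
    __m128 lim  = _mm_set1_ps(limit);
    __m128 mask = _mm_cmplt_ps(x, lim);   /* all ones where x < limit */
    return select_ps(mask, lim, x);
}

/* The COMISS-based intrinsics compare only lane 0 and return an int (0 or 1). */
int low_less_than(__m128 a, __m128 b)
{
    return _mm_comilt_ss(a, b);
}
```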

## Conversion Operations

| Intrinsic name | Corresponding instruction |
|---|---|
| `_mm_cvtss_si32` | CVTSS2SI |
| `_mm_cvtps_pi32` | CVTPS2PI |
| `_mm_cvttss_si32` | CVTTSS2SI |
| `_mm_cvttps_pi32` | CVTTPS2PI |
| `_mm_cvtsi32_ss` | CVTSI2SS |
| `_mm_cvtpi32_ps` | CVTPI2PS |
| `_mm_cvtpi16_ps` | Composite |
| `_mm_cvtpu16_ps` | Composite |
| `_mm_cvtpi8_ps` | Composite |
| `_mm_cvtpu8_ps` | Composite |
| `_mm_cvtpi32x2_ps` | Composite |
| `_mm_cvtps_pi16` | Composite |
| `_mm_cvtps_pi8` | Composite |
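The difference between the rounding conversion (CVTSS2SI) and the truncating one (CVTTSS2SI) is easy to miss. This is a minimal sketch, assuming the MXCSR rounding mode is still the default round-to-nearest-even; `round_low`, `trunc_low` and `int_to_low` are hypothetical helper names.

```c
#include <xmmintrin.h>

/* CVTSS2SI rounds using the current MXCSR rounding mode
   (round-to-nearest-even unless something has changed it). */
int round_low(float f)
{
    return _mm_cvtss_si32(_mm_set_ss(f));
}

/* CVTTSS2SI always truncates toward zero, like a C cast. */
int trunc_low(float f)
{
    return _mm_cvttss_si32(_mm_set_ss(f));
}

/* CVTSI2SS converts an int into lane 0; lanes 1..3 are kept from `a`. */
__m128 int_to_low(__m128 a, int i)
{
    return _mm_cvtsi32_ss(a, i);
}
```

Under the default mode, `round_low(2.5f)` returns 2 and `round_low(3.5f)` returns 4 (ties go to even), while `trunc_low(-1.7f)` returns -1.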

## Miscellaneous Intrinsics

| Intrinsic name | Operation | Corresponding instruction |
|---|---|---|
| `_mm_shuffle_ps` | Shuffles | SHUFPS |
| `_mm_shuffle_pi16` | Shuffles | PSHUFW |
| `_mm_unpackhi_ps` | Unpacks high | UNPCKHPS |
| `_mm_unpacklo_ps` | Unpacks low | UNPCKLPS |
| `_mm_storeh_pi` | Stores the two high elements | MOVHPS mem, reg |
| `_mm_movehl_ps` | Moves high to low | MOVHLPS |
| `_mm_movelh_ps` | Moves low to high | MOVLHPS |
| `_mm_storel_pi` | Stores the two low elements | MOVLPS mem, reg |
| `_mm_getcsr` | Returns the contents of the MXCSR control/status register | STMXCSR |
| `_mm_setcsr` | Sets the MXCSR control/status register | LDMXCSR |
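A sketch of three of these intrinsics: reversing lanes with `_mm_shuffle_ps` and the `_MM_SHUFFLE` macro, interleaving with `_mm_unpacklo_ps`, and temporarily changing the rounding mode through `_mm_getcsr`/`_mm_setcsr`. The helper names are hypothetical; `_MM_SHUFFLE`, `_MM_ROUND_MASK` and the `_MM_ROUND_*` constants come from `<xmmintrin.h>`. Compilers generally assume the default floating-point environment, so restoring MXCSR right after the conversion, as here, is the safer pattern.

```c
#include <xmmintrin.h>

/* Reverses the four lanes: result = {x3, x2, x1, x0}.
   _MM_SHUFFLE(z, y, x, w) lists the source lanes from the highest result lane down. */
__m128 reverse_ps(__m128 x)
{
    return _mm_shuffle_ps(x, x, _MM_SHUFFLE(0, 1, 2, 3));
}

/* Interleaves the low halves: result = {a0, b0, a1, b1}. */
__m128 interleave_lo(__m128 a, __m128 b)
{
    return _mm_unpacklo_ps(a, b);
}

/* Converts f with CVTSS2SI under a temporarily selected rounding mode,
   then restores the caller's MXCSR. `mode` is one of the _MM_ROUND_* constants. */
int convert_with_rounding(float f, unsigned int mode)
{
    unsigned int saved = _mm_getcsr();
    _mm_setcsr((saved & ~_MM_ROUND_MASK) | mode);
    int r = _mm_cvtss_si32(_mm_set_ss(f));
    _mm_setcsr(saved);
    return r;
}
```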

## Memory and Initialization Load Operations

| Intrinsic name | Operation | Corresponding instruction |
|---|---|---|
| `_mm_load_ss` | Loads the low value and clears the three high values | MOVSS |
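A minimal sketch showing that `_mm_load_ss` zeroes the three upper lanes; `load_low_demo` is a hypothetical helper, and `_mm_storeu_ps` is used only to inspect the result.

```c
#include <xmmintrin.h>

/* Loads *p into lane 0 with MOVSS; the three upper lanes become 0.0f.
   MOVSS reads only 4 bytes, so p needs no 16-byte alignment. */
void load_low_demo(const float *p, float out[4])
{
    __m128 v = _mm_load_ss(p);
    _mm_storeu_ps(out, v);
}
```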

## Memory and Initialization Set Operations

| Intrinsic name | Operation | Corresponding instruction |
|---|---|---|
| `_mm_set_ss` | Sets the low value and clears the three high values | Composite |
| `_mm_set1_ps` | Sets all four elements to the same value | Composite |
| `_mm_set_ps` | Sets four values; the first argument goes to the highest element | Composite |
| `_mm_setr_ps` | Sets four values in reverse order; the first argument goes to the lowest element | Composite |
| `_mm_setzero_ps` | Clears all four values | Composite |
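The argument order of `_mm_set_ps` and `_mm_setr_ps` is a frequent source of bugs. This sketch (with the hypothetical helper `set_order_demo`) stores both results so the memory order can be compared.

```c
#include <xmmintrin.h>

/* _mm_set_ps lists its arguments from lane 3 down to lane 0;
   _mm_setr_ps lists them from lane 0 up, matching memory order. */
void set_order_demo(float from_set[4], float from_setr[4])
{
    _mm_storeu_ps(from_set,  _mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f));  /* memory: 0 1 2 3 */
    _mm_storeu_ps(from_setr, _mm_setr_ps(3.0f, 2.0f, 1.0f, 0.0f)); /* memory: 3 2 1 0 */
}
```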

## Memory and Initialization Store Operations

| Intrinsic name | Operation | Corresponding instruction |
|---|---|---|
| `_mm_store_ss` | Stores the low value | MOVSS |
| `_mm_store1_ps` | Stores the low value into all four elements, address aligned | MOVSS + shuffling |
| `_mm_store_ps` | Stores four values, address aligned | MOVAPS |
| `_mm_storeu_ps` | Stores four values, address unaligned | MOVUPS |
| `_mm_storer_ps` | Stores four values in reverse order, address aligned | MOVAPS + shuffling |
| `_mm_move_ss` | Sets the low element from the second operand and passes the three high elements through from the first | MOVSS |
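A short sketch of the alignment rule and of `_mm_storer_ps` and `_mm_move_ss`. It assumes a C11 compiler for `_Alignas(16)`, which supplies the 16-byte alignment that `_mm_storer_ps` needs; `store_demo` is a hypothetical helper.

```c
#include <xmmintrin.h>

/* _mm_store_ps and _mm_storer_ps need a 16-byte-aligned destination;
   _mm_storeu_ps accepts any address. */
void store_demo(float out_reversed[4], float out_moved[4])
{
    _Alignas(16) float tmp[4];                 /* C11 alignment specifier */
    __m128 a = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
    __m128 b = _mm_setr_ps(9.0f, 8.0f, 7.0f, 6.0f);

    _mm_storer_ps(tmp, a);                     /* tmp = {4, 3, 2, 1} */
    for (int i = 0; i < 4; ++i)
        out_reversed[i] = tmp[i];

    /* MOVSS reg, reg: lane 0 from b, lanes 1..3 from a -> {9, 2, 3, 4}. */
    _mm_storeu_ps(out_moved, _mm_move_ss(a, b));
}
```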

## Integer Intrinsics

| Intrinsic name | Operation | Corresponding instruction |
|---|---|---|
| `_mm_extract_pi16` | Extracts one of four words | PEXTRW |
| `_mm_insert_pi16` | Inserts a word | PINSRW |
| `_mm_max_pi16` | Computes the maximum | PMAXSW |
| `_mm_max_pu8` | Computes the maximum, unsigned | PMAXUB |
| `_mm_min_pi16` | Computes the minimum | PMINSW |
| `_mm_min_pu8` | Computes the minimum, unsigned | PMINUB |
| `_mm_mulhi_pu16` | Multiplies, returning high bits | PMULHUW |
| `_mm_shuffle_pi16` | Returns a combination of four words | PSHUFW |
| `_mm_avg_pu8` | Computes rounded average | PAVGB |
| `_mm_avg_pu16` | Computes rounded average | PAVGW |
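These intrinsics operate on 64-bit `__m64` (MMX) values. A minimal sketch, assuming GCC or Clang targeting x86 (MSVC does not provide the `__m64` intrinsics when targeting x64); `max4_i16` is a hypothetical helper. Note that PEXTRW zero-extends the extracted word, so the result is cast back to `short` to recover negative values.

```c
#include <xmmintrin.h>  /* SSE; also pulls in the MMX __m64 intrinsics */

/* Element-wise signed 16-bit maximum of two 4-element vectors (PMAXSW).
   out[i] receives max(a[i], b[i]). */
void max4_i16(const short a[4], const short b[4], short out[4])
{
    /* _mm_set_pi16 takes its arguments from word 3 down to word 0. */
    __m64 va = _mm_set_pi16(a[3], a[2], a[1], a[0]);
    __m64 vb = _mm_set_pi16(b[3], b[2], b[1], b[0]);
    __m64 vm = _mm_max_pi16(va, vb);

    /* PEXTRW zero-extends, so cast back to short to restore the sign. */
    out[0] = (short)_mm_extract_pi16(vm, 0);
    out[1] = (short)_mm_extract_pi16(vm, 1);
    out[2] = (short)_mm_extract_pi16(vm, 2);
    out[3] = (short)_mm_extract_pi16(vm, 3);

    /* Leave the MMX state (EMMS) before any x87 floating-point code runs. */
    _mm_empty();
}
```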

## Cache support

```c
void _mm_prefetch(char *p, int i);  /* PREFETCHT0, PREFETCHT1, PREFETCHT2 or PREFETCHNTA */
```

Loads one cache line of data from address `p` to a location closer to the processor. The value `i` specifies the type of prefetch operation: use one of the constants `_MM_HINT_T0`, `_MM_HINT_T1`, `_MM_HINT_T2`, or `_MM_HINT_NTA`, each of which selects the corresponding prefetch instruction.
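A minimal sketch of software prefetching in a streaming loop. The prefetch distance `PREFETCH_AHEAD` is a hypothetical tuning value, and the code assumes 64-byte cache lines (16 floats); real distances depend on the processor and the work per element. The pointer cast to `const char *` is accepted by the GCC, Clang and MSVC declarations of `_mm_prefetch`.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical tuning value: prefetch 64 floats (256 bytes, four assumed
   64-byte lines) ahead of the element being summed. */
#define PREFETCH_AHEAD 64

/* Sums n floats, issuing a T0 prefetch roughly once per assumed 64-byte line.
   The bounds check keeps the prefetch address inside the array. */
double sum_with_prefetch(const float *p, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (i % 16 == 0 && i + PREFETCH_AHEAD < n)
            _mm_prefetch((const char *)(p + i + PREFETCH_AHEAD), _MM_HINT_T0);
        sum += p[i];
    }
    return sum;
}
```

The prefetch is only a hint: removing it does not change the result, only (possibly) the speed.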

```c
void _mm_stream_pi(__m64 *p, __m64 a);  /* MOVNTQ */
```

Stores the data in `a` to the address `p` without polluting the caches. Because this intrinsic uses an MMX register, you must empty the MMX state with `_mm_empty()` (which emits EMMS) before any subsequent x87 floating-point code.

```c
void _mm_stream_ps(float *p, __m128 a);  /* MOVNTPS */
```

Stores the data in `a` to the address `p` without polluting the caches. The address must be 16-byte aligned.

```c
void _mm_sfence(void);  /* SFENCE */
```

Guarantees that every preceding store is globally visible before any subsequent store. This matters after streaming stores such as `_mm_stream_ps`, which are weakly ordered.
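Streaming stores and the fence are typically used together when writing a large buffer that will not be read again soon. This is a minimal sketch; `stream_fill` is a hypothetical helper, and it assumes `dst` is 16-byte aligned as MOVNTPS requires.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Fills dst[0..n) with value using non-temporal stores.
   Assumes dst is 16-byte aligned; the last n % 4 floats use ordinary stores. */
void stream_fill(float *dst, size_t n, float value)
{
    __m128 v = _mm_set1_ps(value);
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_stream_ps(dst + i, v);   /* bypasses the caches */
    for (; i < n; ++i)
        dst[i] = value;              /* scalar tail */

    /* Streaming stores are weakly ordered: fence them before later stores. */
    _mm_sfence();
}
```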

Last Updated on Monday, 27 May 2013 15:09