
The MMX technology supports both saturating and wraparound arithmetic. In wraparound mode, results that overflow or underflow are truncated, and only the lower (least significant) bits of the result are returned. In saturation mode, results that overflow or underflow are clipped (saturated) to the data-range limit of the data type: a result above the range saturates to the maximum value of the range, while a result below the range saturates to the minimum value. This way of handling overflow and underflow is useful in many applications, such as color calculations, where a sum that is too bright should stay at full intensity rather than wrap around to a dark value.
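The difference between the two modes can be sketched for a single unsigned byte in portable C. This is only a scalar model of the behaviour described above, not MMX code; the function names add_wrap_u8 and add_sat_u8 are hypothetical and chosen for illustration.

```c
#include <stdint.h>

/* Wraparound add: only the low 8 bits of the sum are kept. */
static uint8_t add_wrap_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Saturating add: the sum is clipped to the unsigned byte range 0..255. */
static uint8_t add_sat_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b;
    return (uint8_t)(sum > 255u ? 255u : sum);
}
```

For example, adding 200 and 100 gives 44 in wraparound mode (300 modulo 256) and 255 in saturation mode.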
PADDB mm,mm/m64 
The PADD (Packed Add) instructions add the data elements of the source operand to the data elements of the destination register, and the result is written to the destination register. If a result exceeds the data-range limit for the data type, it wraps around. The PADD instructions support packed byte (PADDB), packed word (PADDW), and packed doubleword (PADDD) data types.

PADDB instruction with 64-bit operands:
    DEST[7..0] ← DEST[7..0] + SRC[7..0];
    * repeat add operation for 2nd through 7th byte *;
    DEST[63..56] ← DEST[63..56] + SRC[63..56];
PADDW instruction with 64-bit operands:
    DEST[15..0] ← DEST[15..0] + SRC[15..0];
    * repeat add operation for 2nd and 3rd word *;
    DEST[63..48] ← DEST[63..48] + SRC[63..48];
PADDD instruction with 64-bit operands:
    DEST[31..0] ← DEST[31..0] + SRC[31..0];
    DEST[63..32] ← DEST[63..32] + SRC[63..32];
PADDB __m64 _mm_add_pi8(__m64 m1, __m64 m2)
PADDW __m64 _mm_add_pi16(__m64 m1, __m64 m2)
PADDD __m64 _mm_add_pi32(__m64 m1, __m64 m2)
PADDSB mm, mm/m64 PADDSW mm, mm/m64 
The PADDS (Packed Add with Saturation) instructions add the packed signed data elements of the source operand to the packed signed data elements of the destination operand and saturate the results. The PADDS instructions support packed byte (PADDSB) and packed word (PADDSW) data types.

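PADDSB instruction with 64-bit operands:
    DEST[7..0] ← SaturateToSignedByte(DEST[7..0] + SRC[7..0]);
    * repeat add operation for 2nd through 7th bytes *;
    DEST[63..56] ← SaturateToSignedByte(DEST[63..56] + SRC[63..56]);
PADDSW instruction with 64-bit operands:
    DEST[15..0] ← SaturateToSignedWord(DEST[15..0] + SRC[15..0]);
    * repeat add operation for 2nd and 3rd words *;
    DEST[63..48] ← SaturateToSignedWord(DEST[63..48] + SRC[63..48]);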
PADDSB __m64 _mm_adds_pi8(__m64 m1, __m64 m2)
PADDSW __m64 _mm_adds_pi16(__m64 m1, __m64 m2) 
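As an illustration of the SaturateToSignedByte step, the following portable C sketch models PADDSB on eight signed bytes. It is a scalar model under the assumption that the 64-bit operand is represented as an array of eight int8_t lanes; the names saturate_to_signed_byte and paddsb_model are hypothetical and do not belong to any Intel header.

```c
#include <stdint.h>

/* Clip an intermediate result to the signed byte range -128..127. */
static int8_t saturate_to_signed_byte(int v)
{
    if (v > INT8_MAX) return INT8_MAX;
    if (v < INT8_MIN) return INT8_MIN;
    return (int8_t)v;
}

/* Scalar model of PADDSB: dest[i] = saturate(dest[i] + src[i]) for 8 lanes. */
static void paddsb_model(int8_t dest[8], const int8_t src[8])
{
    for (int i = 0; i < 8; i++)
        dest[i] = saturate_to_signed_byte((int)dest[i] + (int)src[i]);
}
```

With this model, 100 + 100 yields 127 and -100 + (-100) yields -128, whereas a wraparound add would give -56 and 56 respectively.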
PADDUSB mm, mm/m64 
The PADDUS (Packed Add Unsigned with Saturation) instructions add the packed unsigned data elements of the source operand to the packed unsigned data elements of the destination operand and saturate the results. The PADDUS instructions support packed byte (PADDUSB) and packed word (PADDUSW) data types.
PADDUSB instruction with 64-bit operands:
    DEST[7..0] ← SaturateToUnsignedByte(DEST[7..0] + SRC[7..0]);
    * repeat add operation for 2nd through 7th bytes *;
    DEST[63..56] ← SaturateToUnsignedByte(DEST[63..56] + SRC[63..56]);
PADDUSW instruction with 64-bit operands:
    DEST[15..0] ← SaturateToUnsignedWord(DEST[15..0] + SRC[15..0]);
    * repeat add operation for 2nd and 3rd words *;
    DEST[63..48] ← SaturateToUnsignedWord(DEST[63..48] + SRC[63..48]);
PADDUSB __m64 _mm_adds_pu8(__m64 m1, __m64 m2)
PADDUSW __m64 _mm_adds_pu16(__m64 m1, __m64 m2) 
PSUBB mm, mm/m64 
The PSUB (Packed Subtract) instructions subtract the data elements of the source operand from the data elements of the destination operand. If a result is larger or smaller than the data-range limit for the data type, it wraps around. The PSUB instructions support packed byte (PSUBB), packed word (PSUBW), and packed doubleword (PSUBD) data types.

PSUBB instruction with 64-bit operands:
    DEST[7..0] ← DEST[7..0] − SRC[7..0];
    * repeat subtract operation for 2nd through 7th byte *;
    DEST[63..56] ← DEST[63..56] − SRC[63..56];

PSUBB __m64 _mm_sub_pi8(__m64 m1, __m64 m2)
PSUBW __m64 _mm_sub_pi16(__m64 m1, __m64 m2)
PSUBD __m64 _mm_sub_pi32(__m64 m1, __m64 m2)
PSUBSB mm, mm/m64 
The PSUBS (Packed Subtract with Saturation) instructions subtract the signed data elements of the source operand from the signed data elements of the destination operand; the results are then saturated to the limits of a signed data element and written to the destination operand. The PSUBS instructions support packed byte (PSUBSB) and packed word (PSUBSW) data types.

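PSUBSB instruction with 64-bit operands:
    DEST[7..0] ← SaturateToSignedByte(DEST[7..0] − SRC[7..0]);
    * repeat subtract operation for 2nd through 7th bytes *;
    DEST[63..56] ← SaturateToSignedByte(DEST[63..56] − SRC[63..56]);
PSUBSW instruction with 64-bit operands:
    DEST[15..0] ← SaturateToSignedWord(DEST[15..0] − SRC[15..0]);
    * repeat subtract operation for 2nd and 3rd words *;
    DEST[63..48] ← SaturateToSignedWord(DEST[63..48] − SRC[63..48]);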
PSUBSB __m64 _mm_subs_pi8(__m64 m1, __m64 m2)
PSUBSW __m64 _mm_subs_pi16(__m64 m1, __m64 m2) 
PSUBUSB mm, mm/m64 PSUBUSW mm, mm/m64 
The PSUBUS (Packed Subtract Unsigned with Saturation) instructions subtract the unsigned data elements of the source operand from the unsigned data elements of the destination register; the results are then saturated to the limits of an unsigned data element and written to the destination operand. The PSUBUS instructions support packed byte (PSUBUSB) and packed word (PSUBUSW) data types.
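PSUBUSB instruction with 64-bit operands:
    DEST[7..0] ← SaturateToUnsignedByte(DEST[7..0] − SRC[7..0]);
    * repeat subtract operation for 2nd through 7th bytes *;
    DEST[63..56] ← SaturateToUnsignedByte(DEST[63..56] − SRC[63..56]);
PSUBUSW instruction with 64-bit operands:
    DEST[15..0] ← SaturateToUnsignedWord(DEST[15..0] − SRC[15..0]);
    * repeat subtract operation for 2nd and 3rd words *;
    DEST[63..48] ← SaturateToUnsignedWord(DEST[63..48] − SRC[63..48]);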
PSUBUSB __m64 _mm_subs_pu8(__m64 m1, __m64 m2)
PSUBUSW __m64 _mm_subs_pu16(__m64 m1, __m64 m2)
As an example of saturated arithmetic, consider the absolute difference of two arrays of unsigned bytes. MMX has no IF statements, yet the following algorithm must be implemented:
if (a > b)
then c = a - b
else c = b - a
This algorithm can be coded using saturated subtractions. Subtracting a from b and b from a yields a zero in one result (the subtraction that would have gone negative saturates to 0) and the desired absolute difference in the other. Since it is impossible to know in advance which is which, the final result is obtained by ORing them together:
c = (a - b) OR (b - a)
Assuming that the MMX registers named MM0 and MM1 hold the source vectors, the following code will compute the absolute difference and store it into MM0:
MOVQ    MM2, MM0    ; make a copy of MM0
PSUBUSB MM0, MM1    ; compute difference one way
PSUBUSB MM1, MM2    ; compute difference the other way
POR     MM0, MM1    ; OR them together
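The same trick can be checked with a scalar model in portable C. This is a sketch, not the MMX code itself: sub_sat_u8 and abs_diff_u8 are hypothetical names, and the conditional inside sub_sat_u8 only models what PSUBUSB does in hardware for each byte lane.

```c
#include <stdint.h>

/* Scalar model of one PSUBUSB lane: a - b, clipped at 0 when b > a. */
static uint8_t sub_sat_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : 0);
}

/* Absolute difference of two byte arrays: one of the two saturated
   differences is always 0, so ORing them yields |a - b|. */
static void abs_diff_u8(uint8_t *c, const uint8_t *a, const uint8_t *b, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = (uint8_t)(sub_sat_u8(a[i], b[i]) | sub_sat_u8(b[i], a[i]));
}
```

For a = 10 and b = 30, the two saturated differences are 0 and 20, and their OR is 20.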
PMULHW mm, mm/m64 
The PMULHW (Packed Multiply High) and PMULLW (Packed Multiply Low) instructions multiply the four signed words of the source and destination operands and write the high-order or low-order 16 bits of each 32-bit intermediate result to the destination operand.

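PMULHW instruction with 64-bit operands:
    TEMP0[31..0] ← DEST[15..0] * SRC[15..0]; * Signed multiplication *
    TEMP1[31..0] ← DEST[31..16] * SRC[31..16];
    TEMP2[31..0] ← DEST[47..32] * SRC[47..32];
    TEMP3[31..0] ← DEST[63..48] * SRC[63..48];
    DEST[15..0] ← TEMP0[31..16];
    DEST[31..16] ← TEMP1[31..16];
    DEST[47..32] ← TEMP2[31..16];
    DEST[63..48] ← TEMP3[31..16];
PMULLW instruction with 64-bit operands:
    TEMP0[31..0] ← DEST[15..0] * SRC[15..0]; * Signed multiplication *
    TEMP1[31..0] ← DEST[31..16] * SRC[31..16];
    TEMP2[31..0] ← DEST[47..32] * SRC[47..32];
    TEMP3[31..0] ← DEST[63..48] * SRC[63..48];
    DEST[15..0] ← TEMP0[15..0];
    DEST[31..16] ← TEMP1[15..0];
    DEST[47..32] ← TEMP2[15..0];
    DEST[63..48] ← TEMP3[15..0];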
PMULHW __m64 _mm_mulhi_pi16 (__m64 m1, __m64 m2)
PMULLW __m64 _mm_mullo_pi16(__m64 m1, __m64 m2) 
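The split of the 32-bit product into a high and a low word can be sketched for one lane in portable C. The names mul_high_s16 and mul_low_s16 are hypothetical; the sketch assumes the usual two's-complement conversion from uint16_t to int16_t, which holds on mainstream compilers.

```c
#include <stdint.h>

/* One PMULHW lane: bits 31..16 of the signed 32-bit product. */
static int16_t mul_high_s16(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;
    /* Shift the bit pattern as unsigned to avoid shifting a negative value. */
    return (int16_t)(uint16_t)((uint32_t)product >> 16);
}

/* One PMULLW lane: bits 15..0 of the same product. */
static int16_t mul_low_s16(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;
    return (int16_t)(uint16_t)((uint32_t)product & 0xFFFFu);
}
```

For 300 * 300 = 90000 (0x00015F90), the high word is 1 and the low word is 0x5F90 (24464); combining both results recovers the full 32-bit product.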
PMADDWD mm, mm/m64 
The PMADDWD (Packed Multiply and Add) instruction multiplies the four signed words of the destination operand by the corresponding four signed words of the source operand, producing four signed 32-bit products. The products of the two high-order words are summed and stored in the upper doubleword of the destination operand, and the products of the two low-order words are summed and stored in the lower doubleword of the destination operand.

PMADDWD instruction with 64-bit operands:
    DEST[31..0] ← (DEST[15..0] * SRC[15..0]) + (DEST[31..16] * SRC[31..16]);
    DEST[63..32] ← (DEST[47..32] * SRC[47..32]) + (DEST[63..48] * SRC[63..48]);
PMADDWD __m64 _mm_madd_pi16(__m64 m1, __m64 m2) 
Complex multiplication requires four multiplications and two additions, which leads naturally to the use of the PMADDWD instruction. To use this instruction, the data must be formatted as four 16-bit values, each holding a real or imaginary component: the constant vector can be laid out as [Re -Im Im Re].
The following code fragment multiplies the complex number stored in the MMX register MM0 by the complex constant held in register MM1 with the pattern explained above. The real component of the complex product is given by
Re(Data)*Re(Const) - Im(Data)*Im(Const)
and the imaginary component of the complex product by
Re(Data)*Im(Const) + Im(Data)*Re(Const).
PUNPCKLDQ MM0, MM0    ; convert the data into the [Re Im Re Im] format
PMADDWD   MM0, MM1    ; perform the complex multiply
Note that the output is a packed doubleword (one 32-bit real part and one 32-bit imaginary part), so a pack instruction may be used to convert the result back to 16 bits, matching the format of the input.
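The data layout can be verified with a scalar model of PMADDWD in portable C. This is a sketch under stated assumptions: lanes are numbered from the low-order word, the data is laid out as {Re, Im, Re, Im} and the constant as {Re, -Im, Im, Re}; the names pmaddwd_model and complex_mul_s16 are hypothetical. It also assumes the inputs avoid -32768, where negating Im(Const) or summing two products could overflow.

```c
#include <stdint.h>

/* Scalar model of PMADDWD: two sums of adjacent 16x16-bit signed products. */
static void pmaddwd_model(int32_t out[2], const int16_t dest[4], const int16_t src[4])
{
    out[0] = (int32_t)dest[0] * src[0] + (int32_t)dest[1] * src[1];
    out[1] = (int32_t)dest[2] * src[2] + (int32_t)dest[3] * src[3];
}

/* Complex product: out[0] = real part, out[1] = imaginary part. */
static void complex_mul_s16(int32_t out[2],
                            int16_t d_re, int16_t d_im,
                            int16_t c_re, int16_t c_im)
{
    /* Data duplicated as PUNPCKLDQ would do: {Re, Im, Re, Im}. */
    const int16_t data[4] = { d_re, d_im, d_re, d_im };
    /* Constant pre-arranged as {Re, -Im, Im, Re}. */
    const int16_t konst[4] = { c_re, (int16_t)-c_im, c_im, c_re };
    pmaddwd_model(out, data, konst);
}
```

For (1 + 2i) * (3 + 4i), the model produces a real part of -5 and an imaginary part of 10, matching the two formulas above.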