Intel’s Streaming SIMD Extensions, better known as SSE, are often associated with 128-bit floating-point operations on XMM registers. That is the part of SSE most developers remember today: four single-precision floating-point values packed into one register and processed with one instruction.
However, the original SSE instruction set did more than add XMM floating-point operations.
It also extended the older MMX instruction set with several useful 64-bit SIMD integer instructions. These instructions still operate on MMX registers, but they provide operations that were either missing from the original MMX instruction set or required several instructions to emulate.
This article focuses on those original SSE integer instructions:
PAVGB
PAVGW
PEXTRW
PINSRW
PMAXSW
PMAXUB
PMINSW
PMINUB
PMOVMSKB
PMULHUW
PSADBW
PSHUFW
These instructions were especially useful for image processing, video encoding, audio processing, color manipulation, motion estimation, and other multimedia workloads.
They are historically important because they filled several gaps in MMX. They also show the direction SIMD programming was taking before SSE2 moved packed integer SIMD into 128-bit XMM registers.
SSE and MMX: What Changed?
MMX introduced eight 64-bit SIMD registers:
mm0 mm1 mm2 mm3 mm4 mm5 mm6 mm7
Each MMX register can hold multiple packed integer values. For example, one 64-bit MMX register can hold:
8 unsigned bytes
4 signed or unsigned 16-bit words
2 signed or unsigned 32-bit doublewords
1 64-bit value
This made MMX useful for media processing, but the original MMX instruction set had several limitations. Some common operations needed awkward instruction sequences.
For example:
- averaging unsigned pixels required manual unpacking and correction;
- finding unsigned byte minimums and maximums required workarounds;
- extracting a single word from an MMX register was cumbersome;
- inserting a word into an MMX register required several steps;
- computing sum of absolute differences for motion estimation was expensive;
- shuffling words inside a register was limited.
The original SSE integer instructions addressed these problems.
They did not replace MMX. Instead, they extended it.
That distinction matters: these instructions still operate on MMX registers, not XMM registers.
Important Modern Note: These Are Still MMX-Register Instructions
Although these instructions belong to the SSE generation, most of them operate on __m64 values and MMX registers.
That means they still have the same MMX state-management issue discussed in MMX programming: after using MMX registers, you should call _mm_empty() before returning to x87 floating-point code.
For example:
#include <xmmintrin.h>
void process_with_sse_integer_mmx(void)
{
__m64 a = _mm_set1_pi16(100);
__m64 b = _mm_set1_pi16(200);
__m64 result = _mm_avg_pu16(a, b);
// Required after MMX-register code if later code may use x87 floating point.
_mm_empty();
}
This surprises many programmers because the instructions are “SSE” instructions, but they still use the MMX register file.
A practical rule is:
If your code uses __m64, treat it as MMX-register code and clean up with _mm_empty() before leaving the MMX section.
Packed Average: PAVGB and PAVGW
The PAVG instructions compute the rounded average of unsigned packed integers.
There are two variants:
PAVGB average packed unsigned bytes
PAVGW average packed unsigned words
The corresponding intrinsics are:
__m64 _mm_avg_pu8(__m64 a, __m64 b);
__m64 _mm_avg_pu16(__m64 a, __m64 b);
PAVGB operates on eight unsigned bytes:
a0 a1 a2 a3 a4 a5 a6 a7
b0 b1 b2 b3 b4 b5 b6 b7
The result is:
(a0 + b0 + 1) >> 1
(a1 + b1 + 1) >> 1
...
(a7 + b7 + 1) >> 1
The + 1 means the result is rounded upward.
For example:
average(10, 20) = (10 + 20 + 1) >> 1 = 15
average(10, 21) = (10 + 21 + 1) >> 1 = 16
This is useful in image processing because averaging pixels is common in blending, filtering, interpolation, and motion compensation.
Example: Averaging Packed Bytes
#include <xmmintrin.h>
void average_pixels_example(void)
{
__m64 a = _mm_set_pi8(80, 70, 60, 50, 40, 30, 20, 10);
__m64 b = _mm_set_pi8(8, 7, 6, 5, 4, 3, 2, 1);
__m64 avg = _mm_avg_pu8(a, b);
_mm_empty();
}
This computes eight byte averages in one instruction.
Without PAVGB, the same operation is more complicated. You need to avoid overflow while adding unsigned bytes, usually by unpacking to wider elements or applying bit tricks. PAVGB made this common pixel operation much simpler.
Extracting a Word: PEXTRW
PEXTRW extracts one 16-bit word from an MMX register and copies it into a general-purpose integer register.
The intrinsic is:
int _mm_extract_pi16(__m64 a, int n);
The index n selects which 16-bit word to extract. Since a 64-bit MMX register contains four 16-bit words, the valid logical indexes are:
0 1 2 3
For example:
#include <xmmintrin.h>
int extract_word_example(void)
{
__m64 values = _mm_set_pi16(4000, 3000, 2000, 1000);
int word0 = _mm_extract_pi16(values, 0);
int word1 = _mm_extract_pi16(values, 1);
int word2 = _mm_extract_pi16(values, 2);
int word3 = _mm_extract_pi16(values, 3);
_mm_empty();
return word0 + word1 + word2 + word3;
}
This is useful when an SIMD computation produces packed results, but scalar code needs to inspect or use one element.
Examples include:
- extracting a pixel component;
- reading an intermediate score;
- pulling out a table index;
- moving part of a SIMD result into scalar control logic.
Before PEXTRW, extracting a single word from an MMX register required more awkward sequences involving stores, shifts, or unpacking.
Inserting a Word: PINSRW
PINSRW performs the opposite operation of PEXTRW.
It inserts a 16-bit word from an integer register or memory into one selected word position inside an MMX register, while leaving the other words unchanged.
The intrinsic is:
__m64 _mm_insert_pi16(__m64 a, int d, int n);
Example:
#include <xmmintrin.h>
void insert_word_example(void)
{
__m64 values = _mm_setzero_si64();
values = _mm_insert_pi16(values, 1000, 0);
values = _mm_insert_pi16(values, 2000, 1);
values = _mm_insert_pi16(values, 3000, 2);
values = _mm_insert_pi16(values, 4000, 3);
_mm_empty();
}
This builds an MMX register from scalar 16-bit values.
PINSRW is useful when scalar code and SIMD code need to cooperate. For example, a function may compute a few values using scalar code and then pack them into an MMX register for parallel processing.
Packed Maximum: PMAXSW and PMAXUB
The PMAX instructions compute the element-wise maximum of packed integers.
There are two original SSE integer variants:
PMAXSW maximum of packed signed 16-bit words
PMAXUB maximum of packed unsigned 8-bit bytes
The corresponding intrinsics are:
__m64 _mm_max_pi16(__m64 a, __m64 b);
__m64 _mm_max_pu8(__m64 a, __m64 b);
PMAXSW compares four signed 16-bit values.
PMAXUB compares eight unsigned 8-bit values.
This is useful for clamping, thresholding, masking, image compositing, and signal processing.
Example: Clamp Pixel Values to a Lower Bound
Suppose you want every unsigned byte value to be at least 16. With PMAXUB, this is straightforward:
#include <xmmintrin.h>
void clamp_low_example(void)
{
__m64 pixels = _mm_set_pi8(4, 20, 8, 90, 120, 2, 200, 15);
__m64 minimum = _mm_set1_pi8(16);
__m64 clamped = _mm_max_pu8(pixels, minimum);
_mm_empty();
}
Every byte lower than 16 becomes 16. Every byte already greater than or equal to 16 remains unchanged.
For image processing, this kind of operation is common when enforcing a floor on brightness, alpha, mask values, or intermediate filter results.
Packed Minimum: PMINSW and PMINUB
The PMIN instructions compute the element-wise minimum of packed integers.
There are two original SSE integer variants:
PMINSW minimum of packed signed 16-bit words
PMINUB minimum of packed unsigned 8-bit bytes
The corresponding intrinsics are:
__m64 _mm_min_pi16(__m64 a, __m64 b);
__m64 _mm_min_pu8(__m64 a, __m64 b);
These are the natural counterparts to PMAXSW and PMAXUB.
Example: Clamp Pixel Values to an Upper Bound
Suppose you want every unsigned byte value to be at most 235:
#include <xmmintrin.h>
void clamp_high_example(void)
{
__m64 pixels = _mm_set_pi8(255, 240, 220, 180, 100, 80, 60, 40);
__m64 maximum = _mm_set1_pi8((char)235);
__m64 clamped = _mm_min_pu8(pixels, maximum);
_mm_empty();
}
Every byte greater than 235 becomes 235. Every byte already lower than or equal to 235 remains unchanged.
Example: Clamp to a Range
To clamp unsigned byte values to the range [16, 235], use both instructions:
#include <xmmintrin.h>
void clamp_range_example(void)
{
__m64 pixels = _mm_set_pi8(255, 240, 220, 180, 10, 8, 60, 40);
__m64 low = _mm_set1_pi8(16);
__m64 high = _mm_set1_pi8((char)235);
__m64 result = _mm_max_pu8(pixels, low);
result = _mm_min_pu8(result, high);
_mm_empty();
}
This is a compact SIMD implementation of a very common image-processing operation.
Moving a Byte Mask: PMOVMSKB
PMOVMSKB extracts the most significant bit of each byte in an MMX register and packs those bits into an integer mask.
The intrinsic is:
int _mm_movemask_pi8(__m64 a);
An MMX register contains eight bytes. PMOVMSKB collects the sign bit of each byte and produces an 8-bit mask.
Conceptually:
result bit 0 = high bit of byte 0
result bit 1 = high bit of byte 1
...
result bit 7 = high bit of byte 7
This is extremely useful after SIMD comparisons.
Many SIMD comparison instructions produce elements that are either all zero bits or all one bits. For bytes, that means each byte is either:
0x00
or:
0xFF
The high bit of 0xFF is 1, while the high bit of 0x00 is 0. PMOVMSKB turns that SIMD comparison result into a scalar bitmask that ordinary code can test.
Example: Detect Whether Any Byte Has Its High Bit Set
#include <xmmintrin.h>
int has_high_bit_set_example(void)
{
__m64 values = _mm_set_pi8(0x01, 0x02, 0x7F, (char)0x80,
0x10, 0x20, 0x30, 0x40);
int mask = _mm_movemask_pi8(values);
_mm_empty();
return mask != 0;
}
If any byte has its most significant bit set, the mask will be non-zero.
This kind of operation is useful in:
- scanning bytes;
- detecting threshold results;
- checking alpha masks;
- accelerating string or buffer processing;
- converting SIMD comparisons into scalar control flow.
Unsigned Multiply High: PMULHUW
The original MMX instruction set included signed high-word multiplication, but unsigned high-word multiplication was missing.
SSE added:
PMULHUW
The intrinsic is:
__m64 _mm_mulhi_pu16(__m64 a, __m64 b);
PMULHUW multiplies four pairs of unsigned 16-bit integers. Each multiplication produces a 32-bit intermediate result. The instruction keeps the high 16 bits of each 32-bit result.
Conceptually:
result[i] = (a[i] * b[i]) >> 16
This is useful for fixed-point arithmetic.
For example, if you represent a scale factor as a 16-bit fixed-point fraction, multiplying by that scale and keeping the high half approximates a normalized multiply.
Example: Fixed-Point Scaling
#include <xmmintrin.h>
void unsigned_multiply_high_example(void)
{
__m64 values = _mm_set_pi16(60000, 40000, 20000, 10000);
__m64 scale = _mm_set1_pi16((short)32768); // Approximately 0.5 in 16-bit fixed point.
__m64 result = _mm_mulhi_pu16(values, scale);
_mm_empty();
}
This kind of operation is useful in:
- color scaling;
- alpha blending;
- lighting calculations;
- texture mapping;
- audio volume scaling;
- fixed-point filters.
Before PMULHUW, unsigned fixed-point multiplication often required extra correction steps or more complicated instruction sequences.
Sum of Absolute Differences: PSADBW
PSADBW is one of the most important original SSE integer instructions for video processing.
The instruction name means:
Packed Sum of Absolute Differences of Bytes
The intrinsic is:
__m64 _mm_sad_pu8(__m64 a, __m64 b);
It takes two MMX registers, each containing eight unsigned bytes.
For each byte position, it computes:
abs(a[i] - b[i])
Then it sums the eight absolute differences and stores the result in the low 16 bits of the destination.
Conceptually:
result = abs(a0 - b0)
+ abs(a1 - b1)
+ abs(a2 - b2)
+ abs(a3 - b3)
+ abs(a4 - b4)
+ abs(a5 - b5)
+ abs(a6 - b6)
+ abs(a7 - b7)
This is exactly the kind of operation used in motion estimation for video compression.
When a video encoder searches for a block in a previous or future frame that best matches the current block, it often compares candidate blocks using a sum of absolute differences. The smaller the sum, the better the match.
Example: Compare Two 8-Byte Blocks
#include <xmmintrin.h>
int sad_8_bytes_example(void)
{
__m64 blockA = _mm_set_pi8(10, 20, 30, 40, 50, 60, 70, 80);
__m64 blockB = _mm_set_pi8(12, 18, 33, 39, 55, 58, 73, 79);
__m64 sad = _mm_sad_pu8(blockA, blockB);
int result = _mm_cvtsi64_si32(sad) & 0xFFFF;
_mm_empty();
return result;
}
Without PSADBW, the same operation requires several instructions:
- subtract bytes;
- take absolute values;
- widen partial results;
- sum the differences;
- reduce the partial sums.
PSADBW compresses this common video-processing operation into a single instruction.
Packed Shuffle Word: PSHUFW
PSHUFW rearranges the four 16-bit words inside an MMX register.
The intrinsic is:
__m64 _mm_shuffle_pi16(__m64 a, int n);
The immediate control byte selects which source word goes into each destination word.
A 64-bit MMX register contains four 16-bit words:
w0 w1 w2 w3
PSHUFW can duplicate, reorder, or broadcast those words.
Example: Reverse Four Words
#include <xmmintrin.h>
void reverse_words_example(void)
{
__m64 values = _mm_set_pi16(4, 3, 2, 1);
__m64 reversed = _mm_shuffle_pi16(values, _MM_SHUFFLE(0, 1, 2, 3));
_mm_empty();
}
Example: Broadcast One Word
To replicate one 16-bit word across all four positions:
#include <xmmintrin.h>
void broadcast_word_example(void)
{
__m64 values = _mm_set_pi16(400, 300, 200, 100);
__m64 broadcast = _mm_shuffle_pi16(values, _MM_SHUFFLE(3, 3, 3, 3));
_mm_empty();
}
This is useful when one component must be compared or combined with several other components.
For example, in image processing, you may need to replicate an alpha value across multiple lanes so it can be applied to red, green, and blue components.
Before PSHUFW, this kind of rearrangement required multiple MMX instructions. With PSHUFW, it can often be done in one instruction.
Cache and Store-Ordering Instructions
The original SSE generation also introduced instructions related to cache control, non-temporal stores, prefetching, and store ordering.
Some examples are:
PREFETCHNTA
PREFETCHT0
PREFETCHT1
PREFETCHT2
MOVNTQ
MOVNTPS
MASKMOVQ
SFENCE
These are not arithmetic instructions, but they matter for high-performance media processing.
SIMD arithmetic is only part of the performance story. Many image and video algorithms move large amounts of memory. If memory access is inefficient, SIMD execution units may sit idle.
Prefetch and non-temporal store instructions were intended to help programmers control memory behavior in streaming workloads.
For example, if an algorithm writes a large output buffer that will not be read again soon, non-temporal stores can reduce cache pollution.
However, these instructions should be used carefully. Modern CPUs have sophisticated prefetchers and cache hierarchies, and manual prefetching can easily make code slower if used incorrectly.
Why These Instructions Mattered
The original SSE integer instructions were valuable because they targeted operations that appeared constantly in multimedia code.
Image Blending
PAVGB and PAVGW made averaging pixels easy and efficient.
This helped with:
- interpolation;
- simple blending;
- filtering;
- anti-aliasing;
- half-pixel operations in video processing.
Clamping and Thresholding
PMINUB, PMAXUB, PMINSW, and PMAXSW made packed min/max operations directly available.
This helped with:
- brightness limits;
- color channel clamping;
- mask processing;
- signed signal limits;
- saturation-like operations.
Motion Estimation
PSADBW directly accelerated one of the core operations in block-based video compression.
This was especially important because motion estimation can dominate the runtime of a video encoder.
Fixed-Point Arithmetic
PMULHUW improved unsigned fixed-point multiplication.
This helped with:
- lighting;
- blending;
- color transforms;
- texture mapping;
- audio and video filters.
SIMD-to-Scalar Decisions
PMOVMSKB made it much easier to convert SIMD comparison results into scalar branch decisions.
This helped with:
- scanning;
- early exits;
- mask generation;
- threshold detection;
- vectorized search loops.
Data Rearrangement
PSHUFW made it easier to reorganize packed 16-bit values inside an MMX register.
This helped with:
- channel replication;
- component swizzling;
- alpha processing;
- preparing operands for later SIMD operations.
Limitations of SSE Integer Instructions
These instructions were useful, but they also had important limitations.
They Use 64-Bit MMX Registers
The original SSE integer instructions discussed here operate on MMX registers, not XMM registers.
That means they process only:
8 bytes
4 words
2 doublewords
at a time.
SSE2 later introduced 128-bit integer operations on XMM registers, which doubled the SIMD width for many integer workloads.
They Require MMX State Cleanup
Because they use MMX registers, code using these instructions should call _mm_empty() before returning to x87 floating-point code.
This is one of the historical inconveniences of MMX-register SIMD.
They Are Mostly Legacy for New Code
For new code, SSE2 or later is usually a better target.
On x86-64, SSE2 is part of the baseline architecture, so 128-bit XMM integer SIMD is normally available. Later instruction sets such as SSSE3, SSE4.1, AVX2, and AVX-512 add even more useful operations.
That does not make the original SSE integer instructions irrelevant. They are still important when reading, maintaining, or porting older optimized code.
SSE Integer Instructions vs SSE2 Integer Instructions
A common source of confusion is the difference between:
SSE integer instructions
and:
SSE2 integer instructions
The original SSE integer instructions operate mostly on MMX registers and __m64.
SSE2 introduced many packed integer operations on XMM registers and __m128i.
For example, original SSE has:
__m64 _mm_avg_pu8(__m64 a, __m64 b);
SSE2 has the 128-bit version:
__m128i _mm_avg_epu8(__m128i a, __m128i b);
The SSE2 version processes sixteen bytes at once instead of eight and uses XMM registers instead of MMX registers.
For modern code, the SSE2 version is usually preferable.
Example: MMX-Register SSE vs SSE2
Here is the original SSE/MMX-register style:
#include <xmmintrin.h>
void average_with_sse_integer(void)
{
__m64 a = _mm_set_pi8(80, 70, 60, 50, 40, 30, 20, 10);
__m64 b = _mm_set_pi8(8, 7, 6, 5, 4, 3, 2, 1);
__m64 result = _mm_avg_pu8(a, b);
_mm_empty();
}
Here is the SSE2/XMM-register style:
#include <emmintrin.h>
void average_with_sse2(void)
{
__m128i a = _mm_set_epi8(160, 150, 140, 130, 120, 110, 100, 90,
80, 70, 60, 50, 40, 30, 20, 10);
__m128i b = _mm_set_epi8(16, 15, 14, 13, 12, 11, 10, 9,
8, 7, 6, 5, 4, 3, 2, 1);
__m128i result = _mm_avg_epu8(a, b);
// No _mm_empty() required for XMM-only SSE2 code.
}
The SSE2 version is wider and avoids the MMX/x87 cleanup issue.
CPU Detection and Compatibility
Historically, programs that used these instructions needed a runtime CPU check.
The original SSE instructions were introduced with the Pentium III generation, and AMD also supported SSE in later Athlon processors. On very old systems, a program could not assume SSE support.
A typical old multimedia application might include:
- a plain scalar implementation;
- an MMX implementation;
- an SSE-enhanced MMX implementation;
- later, an SSE2 implementation.
At runtime, the program would choose the best available implementation based on CPUID feature flags.
For modern x86-64 software, SSE2 is normally assumed, so these original SSE integer instructions are mostly relevant for legacy code, retro-computing, old codecs, old image-processing libraries, or educational purposes.
Practical Guidance for Modern Developers
If you are writing new SIMD code today, prefer this order:
- Start with clear scalar code.
- Let the compiler auto-vectorize if possible.
- Use SSE2 or later intrinsics for manual SIMD.
- Use AVX2 or AVX-512 when the target hardware justifies it.
- Use original MMX-register SSE integer instructions only when maintaining old code or targeting historical CPUs.
The original SSE integer instructions are still worth learning because they explain many patterns that later SIMD instruction sets kept and expanded:
- packed averaging;
- packed min/max;
- fixed-point multiplication;
- sum of absolute differences;
- shuffle operations;
- movemask extraction.
These ideas remain important in modern SIMD programming.
The register size changed. The instruction names changed. The same core problems are still present.
Common Mistakes
Mistake 1: Assuming All SSE Uses XMM Registers
Not all original SSE instructions operate on XMM registers.
The floating-point instructions use XMM registers, but the 64-bit integer instructions discussed here operate on MMX registers.
If you see __m64, you are dealing with MMX-register SIMD.
Mistake 2: Forgetting _mm_empty()
Because these instructions operate on MMX registers, remember to clean up the MMX state:
_mm_empty();
Call this once after the MMX-register SIMD block is finished.
Mistake 3: Using These Instructions for New 64-Bit Code Without Considering SSE2
If you are writing new code for modern x86 processors, SSE2 or later is usually better.
For example, prefer __m128i operations over __m64 operations when possible.
Mistake 4: Overusing Manual Prefetch
Manual prefetch instructions can help in some streaming memory workloads, but they are not magic.
Modern CPUs already perform hardware prefetching. Manual prefetch should be tested carefully and kept only if profiling shows a real improvement.
Mistake 5: Ignoring Data Layout
SIMD performance depends heavily on data layout.
Packed byte and word instructions are most effective when the data is already arranged in the format the instruction expects. If most of the time is spent rearranging data, the SIMD arithmetic may not provide much benefit.
Summary Table
| Instruction | Intrinsic | Operates On | Purpose |
|---|---|---|---|
PAVGB | _mm_avg_pu8 | 8 unsigned bytes | Rounded average |
PAVGW | _mm_avg_pu16 | 4 unsigned words | Rounded average |
PEXTRW | _mm_extract_pi16 | 1 selected word | Extract word to integer |
PINSRW | _mm_insert_pi16 | 1 selected word | Insert word from integer |
PMAXSW | _mm_max_pi16 | 4 signed words | Packed maximum |
PMAXUB | _mm_max_pu8 | 8 unsigned bytes | Packed maximum |
PMINSW | _mm_min_pi16 | 4 signed words | Packed minimum |
PMINUB | _mm_min_pu8 | 8 unsigned bytes | Packed minimum |
PMOVMSKB | _mm_movemask_pi8 | 8 bytes | Extract byte sign mask |
PMULHUW | _mm_mulhi_pu16 | 4 unsigned words | High half of unsigned multiply |
PSADBW | _mm_sad_pu8 | 8 unsigned bytes | Sum of absolute differences |
PSHUFW | _mm_shuffle_pi16 | 4 words | Shuffle packed words |
Final Thoughts
The original SSE integer instructions were an important step in the evolution of x86 SIMD programming.
They made MMX much more practical for real multimedia workloads by adding operations that were directly useful in image processing, video encoding, color manipulation, fixed-point arithmetic, and packed data rearrangement.
Today, most new code should use SSE2, SSSE3, SSE4.1, AVX2, or newer SIMD instruction sets instead. These later extensions offer wider registers, better integer support, and a cleaner programming model.
However, understanding these original SSE integer instructions is still valuable.
They explain how many classic optimized media routines worked, why certain SIMD idioms exist, and how later SIMD instruction sets evolved from the limitations of MMX.
If you encounter old code using __m64, PAVGB, PMINUB, PSADBW, or PSHUFW, remember the two most important points:
These are SSE-era instructions, but they operate on MMX registers.
And:
After using them, call _mm_empty() before returning to code that may use x87 floating-point instructions.


