SSSE3: the SIMD extension that made integer shuffling and packed arithmetic much better

June 24, 2020 - By Stefano Tommesani

SSSE3 is one of the most useful but most poorly named SIMD extensions in the x86 family. The name stands for Supplemental Streaming SIMD Extensions 3, and the extra “S” is important. SSSE3 is not the same thing as SSE3. It is a later extension, with a very different focus.

SSE3 refined floating-point SIMD with horizontal add and subtract instructions, duplicated floating-point moves, alternating add/subtract operations, and a few non-SIMD instructions such as MONITOR and MWAIT. SSSE3, by contrast, is mostly about packed integer SIMD.

It adds instructions that make byte manipulation, integer horizontal arithmetic, sign handling, absolute values, alignment, and fixed-point multiply operations much more efficient.

For multimedia, compression, image processing, audio, text processing, and codec workloads, SSSE3 was often more practically useful than SSE3. Its most famous instruction, PSHUFB, became one of the most powerful SIMD byte-manipulation tools available before AVX2 and AVX-512.

Where SSSE3 fits in the SIMD timeline

The x86 SIMD timeline up to SSSE3 looks like this:

Instruction set	Main role	Main contribution
MMX	First-generation packed integer SIMD	64-bit integer vectors using MMX registers
SSE	128-bit floating-point SIMD	XMM registers and packed single-precision floating-point operations
SSE2	General-purpose XMM SIMD foundation	Double-precision floating point and packed integer SIMD in XMM registers
SSE3	Floating-point and synchronization refinement	Horizontal floating-point add/subtract, duplicated moves, `LDDQU`, `FISTTP`, `MONITOR`, `MWAIT`
SSSE3	Packed integer SIMD refinement	Byte shuffle, integer horizontal add/subtract, absolute value, sign manipulation, alignment, fixed-point multiply helpers
SSE4.1 / SSE4.2	Further specialization	Blends, dot products, insert/extract operations, string/text processing helpers, more integer operations
AVX / AVX2	Wider SIMD and newer encoding	256-bit vectors, three-operand forms, broader integer SIMD in AVX2
AVX-512	Masked and wider SIMD	512-bit vectors, mask registers, richer instruction families

SSSE3 is still part of the 128-bit SSE/XMM era. It does not introduce wider vectors. It does not introduce AVX-style three-operand instructions. It does not replace SSE2.

Instead, SSSE3 fills in several missing operations that programmers often needed when writing integer SIMD code.

SSSE3 vs SSE3: the naming trap

The most important clarification is this:

SSE3 and SSSE3 are separate instruction sets.

A processor that supports SSE3 does not necessarily support SSSE3. Software must check the correct CPU feature bit before using SSSE3 instructions.

The difference in focus is also significant:

Feature	SSE3	SSSE3
Main focus	Floating-point refinements	Packed integer refinements
Famous instructions	`HADDPS`, `ADDSUBPS`, `MOVDDUP`	`PSHUFB`, `PMADDUBSW`, `PALIGNR`
Data types	Mostly packed single/double floating-point values	Mostly packed bytes, words, and doublewords
Typical uses	Reductions, complex floating-point math	Image processing, video codecs, audio codecs, text processing, fixed-point math
Intrinsics header	`pmmintrin.h`	`tmmintrin.h`
CPUID feature	SSE3	SSSE3

This naming confusion has caused bugs in real software. Checking for SSE3 is not enough if your code uses PSHUFB, PMADDUBSW, or PALIGNR. Those are SSSE3 instructions.

What SSSE3 added

SSSE3 is commonly described as adding 32 instructions, but this depends on how the instructions are counted. A clearer way to describe it is:

SSSE3 added a set of new packed integer operations.
Many of those operations exist in both 64-bit MMX and 128-bit XMM forms.
There are 16 distinct mnemonics; Intel's count of 32 treats the MMX and XMM forms of each one as separate instructions.

For modern code, the XMM forms are the important ones. MMX is mostly of historical interest and should generally be avoided in new code.

The key SSSE3 instruction groups are:

Category	Instructions	Purpose
Byte shuffle	`PSHUFB`	Rearrange bytes inside a vector using a per-byte control mask
Horizontal add	`PHADDW`, `PHADDD`, `PHADDSW`	Add adjacent integer elements horizontally
Horizontal subtract	`PHSUBW`, `PHSUBD`, `PHSUBSW`	Subtract adjacent integer elements horizontally
Absolute value	`PABSB`, `PABSW`, `PABSD`	Compute absolute values of signed packed integers
Sign manipulation	`PSIGNB`, `PSIGNW`, `PSIGND`	Conditionally negate, zero, or preserve packed integers based on the sign of another vector
Multiply-add	`PMADDUBSW`	Multiply unsigned bytes by signed bytes, add adjacent pairs, and saturate to signed words
Rounded high multiply	`PMULHRSW`	Fixed-point signed word multiply with rounding and scaling
Byte alignment	`PALIGNR`	Concatenate two vectors and extract a byte-aligned window

This is a very practical collection. Unlike SSE3, which mostly improved a few floating-point idioms, SSSE3 directly addressed many awkward integer SIMD patterns.

Why SSSE3 mattered

Before SSSE3, integer SIMD programming with SSE2 was powerful but often awkward.

SSE2 could add, subtract, multiply, compare, pack, unpack, and shift packed integers. However, some essential operations still required long instruction sequences.

For example:

Rearranging bytes inside a vector required multiple unpack, shift, and logical operations.
Computing absolute values required compare and subtract tricks.
Horizontal integer sums required shuffle and add sequences.
Aligning bytes across adjacent vectors required several instructions.
Fixed-point multiply-and-round operations were clumsy.
Sign-dependent negation required compare masks and conditional logic.

SSSE3 turned several of these patterns into direct instructions.

That is why SSSE3 was important for codecs, image filters, audio processing, and other packed-data workloads. These domains often work with bytes and 16-bit words rather than floating-point values. SSSE3 gave them tools that SSE2 was missing.

`PSHUFB`: the star of SSSE3

The most famous SSSE3 instruction is PSHUFB, packed shuffle bytes.

PSHUFB performs a byte-wise shuffle inside a vector. Each byte in the control mask selects which source byte should be placed into the corresponding destination byte. If the high bit of a control byte is set, the corresponding output byte becomes zero.

Conceptually, for a 128-bit XMM register:

source = [s0, s1, s2, s3, ..., s15]
mask   = [m0, m1, m2, m3, ..., m15]

If mask[i] has bit 7 set:
    result[i] = 0
otherwise:
    result[i] = source[mask[i] & 0x0F]

This made byte-level table rearrangement far easier.
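To make the selection rule concrete, here is a minimal scalar sketch of the 128-bit PSHUFB behavior described above. The function name pshufb_model is a hypothetical helper for illustration only; it is not part of any library, and it models the semantics rather than the performance.

```c
#include <stdint.h>

/* Scalar reference model of 128-bit PSHUFB (hypothetical helper name).
   Bit 7 of a control byte forces a zero; otherwise the low 4 bits
   select one of the 16 source bytes. Bits 4..6 are ignored. */
void pshufb_model(const uint8_t src[16], const uint8_t mask[16], uint8_t out[16])
{
    uint8_t tmp[16];   /* temporary so that out may alias src or mask */

    for (int i = 0; i < 16; ++i) {
        if (mask[i] & 0x80)
            tmp[i] = 0;
        else
            tmp[i] = src[mask[i] & 0x0F];
    }
    for (int i = 0; i < 16; ++i)
        out[i] = tmp[i];
}
```

A control byte such as 0x12 therefore selects source byte 2, while 0x80 or any larger value produces zero.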

A simple intrinsic example:

#include <tmmintrin.h>

__m128i reverse_bytes_16(__m128i v)
{
    const __m128i mask = _mm_set_epi8(
        0, 1, 2, 3,
        4, 5, 6, 7,
        8, 9, 10, 11,
        12, 13, 14, 15
    );

    return _mm_shuffle_epi8(v, mask);
}

Because _mm_set_epi8 lists values from the most significant byte to the least significant byte, this mask reverses the byte order of the 16-byte vector.

PSHUFB is useful for:

Pixel channel rearrangement.
Byte-order conversion.
Lookup-table style operations.
Nibble unpacking.
Text and parser acceleration.
Cryptographic and checksum kernels.
Compression and decompression algorithms.
Video codec inner loops.

Before AVX2, PSHUFB was one of the most powerful byte-manipulation instructions available in mainstream x86 SIMD.
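As an illustration of the lookup-table use, the following sketch treats a 16-byte constant as a table and uses the low nibble of each input byte as the index. The function name low_nibbles_to_hex is made up for this example, and the target attribute is a GCC/Clang extension used here so that the snippet compiles without -mssse3; other compilers need a different mechanism.

```c
#include <stdint.h>
#include <tmmintrin.h>

/* Hypothetical example: map the low nibble of each of 16 input bytes to
   its lowercase hexadecimal digit using PSHUFB as a 16-entry table lookup. */
__attribute__((target("ssse3")))
void low_nibbles_to_hex(const uint8_t in[16], char out[16])
{
    /* Table entry k holds the ASCII digit for value k. */
    const __m128i table = _mm_setr_epi8('0', '1', '2', '3', '4', '5', '6', '7',
                                        '8', '9', 'a', 'b', 'c', 'd', 'e', 'f');

    __m128i v   = _mm_loadu_si128((const __m128i *)in);
    /* Keep only the low nibble so bit 7 is never set in the index. */
    __m128i idx = _mm_and_si128(v, _mm_set1_epi8(0x0F));
    /* The table is the "source"; the data supplies the shuffle control. */
    __m128i hex = _mm_shuffle_epi8(table, idx);

    _mm_storeu_si128((__m128i *)out, hex);
}
```

Note the reversed roles compared with a plain permutation: the constant is the shuffled operand and the data acts as the mask.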

Absolute value instructions: `PABSB`, `PABSW`, `PABSD`

SSSE3 added packed integer absolute value instructions:

Instruction	Element size
`PABSB`	8-bit signed integers
`PABSW`	16-bit signed integers
`PABSD`	32-bit signed integers

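These compute the absolute value of each signed integer element independently. The most negative value has no positive counterpart in the same width, so its bit pattern is returned unchanged (for example, 0x8000 for 16-bit elements).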

Example:

input  = [-5,  7, -2,  9]
output = [ 5,  7,  2,  9]

With SSE2, absolute value usually required a sequence involving comparisons, masks, XOR, and subtraction. SSSE3 made it direct.
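For comparison, one common SSE2-only formulation looks like the sketch below. It is one of several possible sequences, not necessarily the one any particular codebase used, and the function name abs_i16_sse2 is chosen for this example.

```c
#include <stdint.h>
#include <emmintrin.h>

/* SSE2-only absolute value of eight 16-bit integers (illustrative sketch).
   Uses the identity abs(v) = (v XOR m) - m, where m is all ones for
   negative elements and zero otherwise. */
void abs_i16_sse2(const int16_t in[8], int16_t out[8])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    /* Compare: m = 0xFFFF where v < 0, else 0x0000. */
    __m128i m = _mm_cmpgt_epi16(_mm_setzero_si128(), v);
    /* XOR flips the bits of negative elements; subtracting m adds 1. */
    __m128i r = _mm_sub_epi16(_mm_xor_si128(v, m), m);

    _mm_storeu_si128((__m128i *)out, r);
}
```

That is three dependent operations plus a zero register, where SSSE3 needs a single PABSW.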

Example with 16-bit integers:

#include <tmmintrin.h>

__m128i abs_i16_ssse3(__m128i v)
{
    return _mm_abs_epi16(v);
}

This is especially useful in:

Image differences.
Audio sample processing.
Motion estimation.
Sum of absolute differences preparation.
Signal-processing kernels.
Error metrics.

For image and video workloads, absolute differences are extremely common. SSSE3 did not replace every specialized instruction, but it made many absolute-value patterns much cleaner.

Sign manipulation: `PSIGNB`, `PSIGNW`, `PSIGND`

The PSIGN* instructions negate, zero, or pass through each element of one vector, depending on the sign of the corresponding element in a second vector.

For each element:

If the corresponding sign element is negative, the value is negated.
If the corresponding sign element is zero, the result is zero.
If the corresponding sign element is positive, the value is preserved.

The instruction variants are:

Instruction	Element size
`PSIGNB`	8-bit signed integers
`PSIGNW`	16-bit signed integers
`PSIGND`	32-bit signed integers

Example:

values = [10, 10, 10, 10]
signs  = [-1,  0,  1, -5]

result = [-10, 0, 10, -10]

This is useful when working with transforms, prediction residuals, signed magnitudes, and algorithms where sign and magnitude are handled separately.
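The three rules can be written as a short scalar sketch for the 16-bit case. The name psignw_model is hypothetical, and the code assumes the usual two's-complement wraparound when converting to int16_t.

```c
#include <stdint.h>

/* Scalar reference model of PSIGNW on eight 16-bit lanes
   (hypothetical helper name, for illustration only). */
void psignw_model(const int16_t values[8], const int16_t signs[8], int16_t out[8])
{
    for (int i = 0; i < 8; ++i) {
        if (signs[i] < 0) {
            /* Negate with wraparound: -(-32768) stays -32768. */
            out[i] = (int16_t)(uint16_t)(0u - (uint16_t)values[i]);
        } else if (signs[i] == 0) {
            out[i] = 0;
        } else {
            out[i] = values[i];
        }
    }
}
```

Note that a negative value paired with a negative sign element becomes positive: the instruction negates, it does not copy the sign.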

The intrinsic form for 16-bit integers is:

#include <tmmintrin.h>

__m128i apply_sign_i16_ssse3(__m128i values, __m128i signs)
{
    return _mm_sign_epi16(values, signs);
}

Without SSSE3, this kind of operation typically required multiple compare, mask, XOR, and subtract instructions.

Horizontal integer add and subtract

SSE3 introduced horizontal add and subtract for floating-point values. SSSE3 brought a similar idea to packed integers.

The horizontal add instructions are:

Instruction	Operation
`PHADDW`	Horizontally add adjacent 16-bit words
`PHADDD`	Horizontally add adjacent 32-bit doublewords
`PHADDSW`	Horizontally add adjacent signed 16-bit words with saturation

The horizontal subtract instructions are:

Instruction	Operation
`PHSUBW`	Horizontally subtract adjacent 16-bit words
`PHSUBD`	Horizontally subtract adjacent 32-bit doublewords
`PHSUBSW`	Horizontally subtract adjacent signed 16-bit words with saturation

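For example, PHADDW adds adjacent pairs within each operand. Sums from the first operand fill the low half of the result, and sums from the second operand fill the high half:

a = [a0, a1, a2, a3, a4, a5, a6, a7]
b = [b0, b1, b2, b3, b4, b5, b6, b7]
=>
[a0+a1, a2+a3, a4+a5, a6+a7, b0+b1, b2+b3, b4+b5, b6+b7]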

A simple intrinsic example:

#include <tmmintrin.h>

__m128i pairwise_add_i16_ssse3(__m128i a, __m128i b)
{
    return _mm_hadd_epi16(a, b);
}
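To show where each sum lands, here is a scalar sketch of what _mm_hadd_epi16(a, b) computes. The name phaddw_model is hypothetical; sums wrap around on overflow, as the non-saturating instruction does.

```c
#include <stdint.h>

/* Scalar reference model of PHADDW (hypothetical helper name).
   Lanes 0..3 hold pair sums from a, lanes 4..7 hold pair sums from b. */
void phaddw_model(const int16_t a[8], const int16_t b[8], int16_t out[8])
{
    int16_t tmp[8];   /* temporary so that out may alias a or b */

    for (int i = 0; i < 4; ++i) {
        /* Add in unsigned arithmetic so overflow wraps modulo 2^16. */
        tmp[i]     = (int16_t)(uint16_t)((uint16_t)a[2 * i] + (uint16_t)a[2 * i + 1]);
        tmp[i + 4] = (int16_t)(uint16_t)((uint16_t)b[2 * i] + (uint16_t)b[2 * i + 1]);
    }
    for (int i = 0; i < 8; ++i)
        out[i] = tmp[i];
}
```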

These instructions are useful for:

Partial reductions.
Audio mixing.
Image filters.
Transform stages.
Dot-product preparation.
Summing neighboring samples.

However, as with SSE3 floating-point horizontal operations, they are not automatically faster in every case. A carefully written shuffle-and-add sequence may be competitive on some processors. The benefit depends on the microarchitecture and the surrounding code.
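As one possible alternative, the sketch below reduces eight 16-bit values to a single 32-bit sum with SSE2 only. It uses PMADDWD against a vector of ones as a widening pairwise add, followed by two shuffle-and-add steps. The function name is made up for this example, and whether this beats a PHADDW chain depends on the processor, so it should be measured rather than assumed.

```c
#include <stdint.h>
#include <emmintrin.h>

/* SSE2-only horizontal sum of eight int16 values into one int32
   (illustrative sketch, hypothetical function name). */
int32_t sum_i16x8_sse2(const int16_t in[8])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);

    /* Multiply by 1 and add adjacent pairs, widening to 4 x int32. */
    __m128i s = _mm_madd_epi16(v, _mm_set1_epi16(1));

    /* Add the upper 64 bits onto the lower 64 bits. */
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    /* Add the neighboring 32-bit lane. */
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));

    return _mm_cvtsi128_si32(s);
}
```

Because the first step widens to 32 bits, this version cannot overflow for eight 16-bit inputs.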

`PMADDUBSW`: a very useful multiply-add

PMADDUBSW is one of the most important SSSE3 arithmetic instructions.

It performs this operation:

Take packed unsigned bytes from one operand.
Take packed signed bytes from the other operand.
Multiply corresponding byte pairs.
Add adjacent products.
Saturate the result to signed 16-bit words.

Conceptually:

unsigned bytes: [a0, a1, a2, a3, ...]
signed bytes:   [b0, b1, b2, b3, ...]

result words:

[a0*b0 + a1*b1,
a2*b2 + a3*b3,
…]

This is extremely useful in fixed-point arithmetic, image processing, filtering, and codec kernels.
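The five steps above can be sketched in scalar C as follows. The names saturate_i16 and pmaddubsw_model are hypothetical helpers for this illustration.

```c
#include <stdint.h>

/* Clamp a 32-bit value to the signed 16-bit range. */
static int16_t saturate_i16(int32_t x)
{
    if (x > 32767)  return 32767;
    if (x < -32768) return -32768;
    return (int16_t)x;
}

/* Scalar reference model of 128-bit PMADDUBSW (hypothetical helper name).
   a holds unsigned bytes, b holds signed bytes. */
void pmaddubsw_model(const uint8_t a[16], const int8_t b[16], int16_t out[8])
{
    for (int i = 0; i < 8; ++i) {
        int32_t sum = (int32_t)a[2 * i]     * b[2 * i]
                    + (int32_t)a[2 * i + 1] * b[2 * i + 1];
        out[i] = saturate_i16(sum);
    }
}
```

The saturation is real: 255 * 127 + 255 * 127 is 64770, which does not fit in a signed word and is clamped to 32767.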

Example intrinsic:

#include <tmmintrin.h>

__m128i madd_unsigned_signed_bytes_ssse3(
    __m128i unsigned_bytes,
    __m128i signed_bytes)
{
    return _mm_maddubs_epi16(unsigned_bytes, signed_bytes);
}

Typical uses include:

Color conversion.
Convolution filters.
Audio transforms.
Video codec prediction.
Packed dot-product-like operations.
Quantized integer math.

This instruction became especially relevant in algorithms that work with 8-bit data and produce 16-bit intermediate results.

`PMULHRSW`: rounded fixed-point multiply

PMULHRSW multiplies signed 16-bit elements, then rounds and scales each 32-bit product back to 16 bits. The result is effectively (a * b + 0x4000) >> 15, which is a rounded multiply of two Q15 fixed-point values.

This is useful for fixed-point arithmetic. Many multimedia algorithms represent fractional values as integers with an implied scaling factor. For example, a 16-bit value might represent a fixed-point number rather than a plain integer.

Without a rounded high multiply instruction, programmers often had to combine multiply, add, shift, and pack operations manually. PMULHRSW makes that pattern more direct.
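For a single lane, the rounding and scaling can be sketched like this. The name pmulhrsw_model is hypothetical, and the code assumes that right-shifting a negative int is an arithmetic shift, which holds on mainstream compilers but is implementation-defined in C.

```c
#include <stdint.h>

/* Scalar reference model of one PMULHRSW lane (hypothetical helper name).
   Treating a and b as Q15 fixed-point values, the result is their
   rounded Q15 product. */
int16_t pmulhrsw_model(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;

    /* Drop 14 bits, add the rounding bit, then drop one more bit. */
    int32_t rounded = ((product >> 14) + 1) >> 1;

    /* Keep the low 16 bits; only -32768 * -32768 wraps here. */
    return (int16_t)(uint16_t)rounded;
}
```

For example, 16384 represents 0.5 in Q15, and 0.5 * 0.5 yields 8192, which represents 0.25.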

Example intrinsic:

#include <tmmintrin.h>

__m128i rounded_fixed_multiply_i16_ssse3(__m128i a, __m128i b)
{
    return _mm_mulhrs_epi16(a, b);
}

This is useful in:

Audio processing.
Image scaling.
Video transforms.
Fixed-point filters.
Codec inner loops.

For floating-point-heavy code, this instruction may not matter. For fixed-point SIMD code, it can be very valuable.

`PALIGNR`: byte alignment across two vectors

PALIGNR concatenates two vectors and extracts a byte-shifted window from the result.

Conceptually:

combined = first_operand : second_operand   (32 bytes, first operand in the high half)
result   = low 16 bytes of (combined shifted right by the immediate byte count)

This is useful when processing sliding windows of bytes.

For example, many filters and codecs need neighboring bytes from adjacent vector loads. Before SSSE3, constructing these shifted windows often required several shift and OR instructions. PALIGNR made the operation much simpler.

Example:

#include <tmmintrin.h>

__m128i align_bytes_ssse3(__m128i previous, __m128i current)
{
    // Result: bytes 4..15 of previous followed by bytes 0..3 of current.
    return _mm_alignr_epi8(current, previous, 4);
}
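Because the operand order is easy to get backwards, a scalar sketch may help. The name palignr_model and its parameter names are hypothetical; high corresponds to the first intrinsic argument and low to the second.

```c
#include <stdint.h>

/* Scalar reference model of 128-bit PALIGNR (hypothetical helper name).
   Equivalent in spirit to _mm_alignr_epi8(high, low, n). */
void palignr_model(const uint8_t high[16], const uint8_t low[16],
                   int n, uint8_t out[16])
{
    uint8_t combined[32];

    for (int i = 0; i < 16; ++i) {
        combined[i]      = low[i];    /* bytes 0..15  */
        combined[16 + i] = high[i];   /* bytes 16..31 */
    }
    for (int i = 0; i < 16; ++i) {
        int k = i + n;
        /* Bytes shifted in from beyond the 32-byte value are zero. */
        out[i] = (n >= 0 && k < 32) ? combined[k] : 0;
    }
}
```

With the low vector holding bytes 0..15 of a stream and the high vector holding bytes 16..31, a shift of 4 yields stream bytes 4..19.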

PALIGNR is useful for:

Sliding-window filters.
Video block processing.
String scanning.
Hashing and checksums.
Compression algorithms.
Unaligned stream processing.

It is one of the instructions that made SSSE3 particularly helpful in byte-oriented algorithms.

SSSE3 and MMX forms

Many SSSE3 instructions have both MMX and XMM forms. This reflects the transition period in SIMD history.

For modern code, prefer XMM forms. MMX has several disadvantages:

MMX registers alias the x87 floating-point register file.
Mixing MMX and x87 requires extra state-management care.
MMX is only 64 bits wide.
XMM registers are the normal SIMD path in modern x86 and x86-64 code.
Compilers and libraries generally optimize around XMM, AVX, and later vector models.

SSSE3’s MMX forms are historically interesting, but new code should normally use the 128-bit XMM intrinsics.

Detecting SSSE3 support

SSSE3 support is detected with CPUID.

The relevant feature is:

CPUID leaf 1
ECX bit 9 = SSSE3 support

This is separate from SSE3:

CPUID leaf 1
ECX bit 0 = SSE3 support

Do not check SSE3 when you need SSSE3.
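A minimal detection sketch for GCC and Clang follows. It assumes the compiler-provided <cpuid.h> header and its __get_cpuid helper; MSVC uses a different intrinsic, which is not shown. The function names are chosen for this example.

```c
#include <cpuid.h>

/* Returns 1 if CPUID leaf 1 reports SSSE3 (ECX bit 9), else 0. */
int cpu_has_ssse3(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                      /* leaf 1 not available */
    return (int)((ecx >> 9) & 1u);
}

/* Returns 1 if CPUID leaf 1 reports SSE3 (ECX bit 0), else 0.
   Shown only to emphasize that this is a different bit. */
int cpu_has_sse3(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)(ecx & 1u);
}
```

On GCC and Clang, __builtin_cpu_supports("ssse3") is a shorter way to ask the same question.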

In C or C++, you normally have three choices:

Compile a binary that requires SSSE3.
Use compiler-specific target attributes for selected functions.
Use runtime dispatch and select the SSSE3 path only when available.

With GCC or Clang, SSSE3 can be enabled with:

-mssse3

The common intrinsic header is:

#include <tmmintrin.h>

The name tmmintrin.h comes from older Intel compiler naming history. It is not obvious, but it is the standard header for SSSE3 intrinsics.

Example: byte reversal with `PSHUFB`

Here is a simple example that reverses 16 bytes in a vector:

#include <tmmintrin.h>

__m128i reverse16_ssse3(__m128i x)
{
    const __m128i reverse_mask = _mm_set_epi8(
        0, 1, 2, 3,
        4, 5, 6, 7,
        8, 9, 10, 11,
        12, 13, 14, 15
    );

    return _mm_shuffle_epi8(x, reverse_mask);
}

This example is small, but it shows the power of PSHUFB. A byte permutation that would otherwise require several operations can be represented by a single shuffle mask.

The same idea can be extended to:

Swapping color channels.
Expanding packed nibbles.
Reordering small lookup-table results.
Formatting bytes for hashing.
Preparing input for vectorized parsers.
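As a sketch of the first item, the following function swaps the red and blue channels of four packed 8-bit RGBA pixels. The function name and the assumed byte layout (R, G, B, A in memory order) are illustrative, and the target attribute is a GCC/Clang extension that lets the snippet compile without -mssse3.

```c
#include <stdint.h>
#include <tmmintrin.h>

/* Hypothetical example: convert four RGBA pixels (16 bytes) to BGRA. */
__attribute__((target("ssse3")))
void rgba_to_bgra_4px(const uint8_t in[16], uint8_t out[16])
{
    /* _mm_setr_epi8 lists bytes in memory order: for each pixel,
       output byte 0 takes input byte 2, and output byte 2 takes input byte 0. */
    const __m128i swap_rb = _mm_setr_epi8( 2,  1,  0,  3,
                                           6,  5,  4,  7,
                                          10,  9,  8, 11,
                                          14, 13, 12, 15);

    __m128i px = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, _mm_shuffle_epi8(px, swap_rb));
}
```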

Example: sum of absolute values

SSSE3 can also simplify absolute-value based computations.

#include <tmmintrin.h>
#include <emmintrin.h>

__m128i abs_and_pairwise_sum_i16_ssse3(__m128i values)
{
    __m128i abs_values = _mm_abs_epi16(values);

    // Add adjacent 16-bit values. The second operand is zero, so the
    // upper four lanes are zero: [a0+a1, a2+a3, a4+a5, a6+a7, 0, 0, 0, 0]
    return _mm_hadd_epi16(abs_values, _mm_setzero_si128());
}

This kind of pattern appears in signal processing, error metrics, and transform code.

In real code, you would usually continue reducing the partial sums, widen to avoid overflow where needed, or accumulate into 32-bit lanes.

Example: fixed-point byte multiply-add

PMADDUBSW is very useful when unsigned input bytes are multiplied by signed coefficients.

#include <tmmintrin.h>

__m128i filter_step_ssse3(
    __m128i pixels_unsigned_u8,
    __m128i coefficients_signed_i8)
{
    return _mm_maddubs_epi16(
        pixels_unsigned_u8,
        coefficients_signed_i8);
}

This is not a full filter by itself, but it is the central operation in many fixed-point kernels: multiply small integer samples by coefficients, add adjacent pairs, and produce wider intermediate results.

This is the kind of operation that made SSSE3 especially valuable for multimedia code.
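To show how the multiply-add fits into a slightly larger step, here is a sketch of a two-tap weighted average with weights 3 and 1 and rounding. The function name, the weights, and the choice of output format are assumptions made for this example, and the target attribute is a GCC/Clang extension.

```c
#include <stdint.h>
#include <tmmintrin.h>

/* Hypothetical example: for each adjacent pixel pair (p0, p1), compute
   (3*p0 + 1*p1 + 2) >> 2 and store it as a 16-bit value. */
__attribute__((target("ssse3")))
void blend_pairs_3_1(const uint8_t px[16], int16_t out[8])
{
    /* Each 16-bit word 0x0103 is the byte pair (3, 1) in memory order. */
    const __m128i coeff = _mm_set1_epi16(0x0103);

    __m128i p   = _mm_loadu_si128((const __m128i *)px);
    /* Unsigned pixels times signed coefficients, adjacent pairs added. */
    __m128i sum = _mm_maddubs_epi16(p, coeff);

    sum = _mm_add_epi16(sum, _mm_set1_epi16(2));   /* rounding term */
    sum = _mm_srai_epi16(sum, 2);                  /* divide by 4   */

    _mm_storeu_si128((__m128i *)out, sum);
}
```

With these weights the largest intermediate value is 3 * 255 + 255 + 2 = 1022, so the saturation in PMADDUBSW never triggers.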

SSSE3 compared with previous SIMD instruction sets

SSSE3 is best understood as a practical extension to the SSE2 integer SIMD model. It did not replace earlier instruction sets, but it made several common operations much easier.

SSSE3 vs MMX

MMX introduced packed integer SIMD, but it used 64-bit MMX registers and shared state with x87 floating point. It was useful for early multimedia workloads, but awkward by modern standards.

SSSE3 can be seen as a much more mature integer SIMD extension. It provides byte shuffles, absolute values, horizontal integer operations, and fixed-point helpers in the XMM register model.

For new code, SSSE3 is far preferable to MMX.

SSSE3 vs SSE

SSE’s main contribution was 128-bit XMM registers and packed single-precision floating-point arithmetic. It was not primarily an integer extension.

SSSE3 is almost the opposite: it is mostly about integer data. It is far more useful than SSE for byte-level manipulation, packed 16-bit arithmetic, and fixed-point multimedia code.

SSSE3 vs SSE2

SSE2 is the true baseline for modern x86 SIMD. It added packed integer operations to XMM registers and made 128-bit SIMD generally useful.

SSSE3 builds directly on that foundation. It does not replace SSE2; it fills in missing pieces.

SSE2 can perform many integer SIMD operations, but SSSE3 makes some of the most common patterns shorter and clearer:

Task	SSE2 approach	SSSE3 improvement
Byte rearrangement	Multiple unpack/shift/logical operations	`PSHUFB`
Absolute value	Compare/mask/subtract sequence	`PABSB`, `PABSW`, `PABSD`
Pairwise integer sums	Shuffle and add	`PHADDW`, `PHADDD`
Sign-dependent negation	Compare and mask logic	`PSIGNB`, `PSIGNW`, `PSIGND`
Sliding byte alignment	Shifts and ORs	`PALIGNR`
Fixed-point rounded multiply	Multiply, add, shift sequence	`PMULHRSW`
Unsigned/signed byte multiply-add	Multiple unpack and multiply steps	`PMADDUBSW`

SSE2 is broader and more fundamental. SSSE3 is smaller but extremely practical.

SSSE3 vs SSE3

SSE3 is mostly floating-point oriented. SSSE3 is mostly integer oriented.

SSE3 helps with:

Horizontal floating-point addition and subtraction.
Complex floating-point arithmetic patterns.
Duplicate floating-point moves.
Unaligned load cases.
x87 truncating conversion.
Low-level monitor/wait instructions.

SSSE3 helps with:

Byte shuffling.
Integer absolute values.
Integer horizontal addition and subtraction.
Fixed-point multiply and rounding.
Sign manipulation.
Byte alignment across vectors.

For multimedia and byte-heavy workloads, SSSE3 often matters more than SSE3.

Practical uses of SSSE3

SSSE3 is useful in many domains:

Image processing.
Video decoding and encoding.
Audio codecs.
Compression and decompression.
Cryptography support code.
Checksums and hashing.
Text parsing.
Pixel format conversion.
Color-space conversion.
Fixed-point digital signal processing.
Motion estimation.
Block transforms.
Small table lookups.

The common theme is packed integer data. If your algorithm works with bytes, 16-bit samples, signed coefficients, packed pixels, or small fixed-point values, SSSE3 is worth understanding.

Performance considerations

SSSE3 instructions are powerful, but they should still be used carefully.

`PSHUFB` is powerful but lane-local

In the 128-bit SSSE3 form, PSHUFB shuffles bytes within a 128-bit XMM register.

Later AVX2 versions operate within 128-bit lanes inside a 256-bit register, not across the entire 256-bit vector. This lane-local behavior matters when porting algorithms to wider vectors.
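The lane-local rule for the 256-bit form can be sketched in scalar C. The name vpshufb256_model is hypothetical; the point is only that an index can never reach into the other 16-byte half.

```c
#include <stdint.h>

/* Scalar model of the 256-bit (AVX2) byte shuffle, hypothetical helper name.
   Each output byte can only select from its own 128-bit lane. */
void vpshufb256_model(const uint8_t src[32], const uint8_t mask[32], uint8_t out[32])
{
    uint8_t tmp[32];   /* temporary so that out may alias src or mask */

    for (int i = 0; i < 32; ++i) {
        int lane_base = i & 16;   /* 0 for bytes 0..15, 16 for bytes 16..31 */

        if (mask[i] & 0x80)
            tmp[i] = 0;
        else
            tmp[i] = src[lane_base + (mask[i] & 0x0F)];
    }
    for (int i = 0; i < 32; ++i)
        out[i] = tmp[i];
}
```

A mask of all zeros therefore broadcasts byte 0 into the low lane and byte 16 into the high lane, not byte 0 everywhere.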

Horizontal operations are not always fastest

PHADDW, PHADDD, PHSUBW, and PHSUBD can make code shorter, but shorter code is not always faster code.

On some CPUs, explicit shuffle and add sequences may perform just as well or better.

Saturation must be intentional

Instructions such as PHADDSW, PHSUBSW, and PMADDUBSW use signed saturation.

This is useful in multimedia code, but it changes numerical behavior. Do not use saturating operations unless saturation is part of the intended algorithm.
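The difference is easy to see on a single lane. The two helpers below are hypothetical scalar stand-ins for one lane of PHADDW (wrapping) and PHADDSW (saturating).

```c
#include <stdint.h>

/* One lane of a wrapping 16-bit add, as in PHADDW. */
int16_t add_wrap_i16(int16_t a, int16_t b)
{
    /* Unsigned arithmetic wraps modulo 2^16. */
    return (int16_t)(uint16_t)((uint16_t)a + (uint16_t)b);
}

/* One lane of a saturating 16-bit add, as in PHADDSW. */
int16_t add_sat_i16(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;

    if (s > 32767)  return 32767;
    if (s < -32768) return -32768;
    return (int16_t)s;
}
```

Adding 30000 and 30000 gives -5536 with wrapping and 32767 with saturation. Neither is the mathematically exact 60000, so the choice has to match what the algorithm expects.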

Data layout matters more than instruction count

SSSE3 is excellent when data is already packed in a SIMD-friendly format.

If the algorithm constantly requires scattered reads or complicated rearrangement, the benefit may be reduced.

Runtime dispatch is still important

If your program must support old CPUs, provide a fallback path. A good structure is:

Scalar baseline.
SSE2 implementation.
SSSE3 implementation.
SSE4.1 or AVX2 implementation where useful.

This is especially important for libraries distributed to unknown machines.
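A minimal two-level version of that structure might look like the sketch below. It assumes GCC or Clang: the target attribute and __builtin_cpu_supports are compiler extensions, and all function names are invented for this example. A production library would typically resolve the choice once and cache a function pointer.

```c
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h>

/* Scalar baseline: absolute value of n 16-bit integers. */
static void abs_i16_array_scalar(const int16_t *in, int16_t *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = (int16_t)(in[i] < 0 ? -in[i] : in[i]);
}

/* SSSE3 path: eight elements per iteration, scalar tail for the rest. */
__attribute__((target("ssse3")))
static void abs_i16_array_ssse3(const int16_t *in, int16_t *out, size_t n)
{
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        __m128i v = _mm_loadu_si128((const __m128i *)(in + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_abs_epi16(v));
    }
    for (; i < n; ++i)
        out[i] = (int16_t)(in[i] < 0 ? -in[i] : in[i]);
}

/* Dispatcher: use the SSSE3 path only when the CPU reports support. */
void abs_i16_dispatch(const int16_t *in, int16_t *out, size_t n)
{
    if (__builtin_cpu_supports("ssse3"))
        abs_i16_array_ssse3(in, out, n);
    else
        abs_i16_array_scalar(in, out, n);
}
```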

Common mistakes

Confusing SSE3 and SSSE3

This is the big one. SSE3 and SSSE3 are different. Check the right CPUID bit and use the correct compiler target.

Assuming SSSE3 is guaranteed on x86-64

x86-64 guarantees SSE2, not SSSE3. Early 64-bit processors, such as AMD's K8 and K10 families and Intel's NetBurst-based parts, lack SSSE3, so software that needs broad compatibility should still detect it.

Forgetting that `PSHUFB` can zero bytes

If the high bit of a shuffle-control byte is set, the output byte is zero.

This is a feature, not a bug, but it can surprise programmers who expected a simple modulo-16 byte index.

Using MMX forms in new code

Use XMM intrinsics unless there is a very specific reason not to. MMX is a legacy path.

Ignoring signedness in `PMADDUBSW`

PMADDUBSW multiplies unsigned bytes from one operand by signed bytes from the other.

Operand interpretation matters. Reversing the logical role of the operands can break the algorithm.

Should you use SSSE3 today?

Yes, if your target machines support it and your workload benefits from packed integer operations.

SSSE3 remains relevant because many byte-oriented SIMD idioms still map naturally to its instructions. Even when writing AVX2 code, the AVX2 versions of some operations are conceptual extensions of SSSE3 ideas, especially byte shuffling and packed integer multiply-add patterns.

Use SSSE3 when:

You need byte rearrangement with PSHUFB.
You are optimizing image, audio, or video code.
You use fixed-point arithmetic.
You need packed absolute values.
You need efficient sliding windows over byte streams.
SSE2 code is dominated by shuffle/mask sequences that SSSE3 can simplify.

Do not use SSSE3 blindly when:

The code is not performance-critical.
The compiler already generates good vector code.
Your bottleneck is memory bandwidth.
You need to support CPUs without SSSE3 and cannot add runtime dispatch.
AVX2 or newer extensions are available and produce a better implementation.

Conclusion

SSSE3 was a major practical improvement for integer SIMD programming on x86.

MMX introduced packed integer SIMD, but it was limited by 64-bit registers and awkward x87 interaction. SSE introduced the XMM register model. SSE2 made XMM registers useful for general integer SIMD. SSE3 refined floating-point operations. SSSE3 then added the missing integer tools that many multimedia and data-processing algorithms needed.

Its most important contribution was PSHUFB, but the extension is broader than that. Absolute value, sign manipulation, horizontal integer arithmetic, byte alignment, and fixed-point multiply helpers all made SSSE3 a valuable step in the evolution of SIMD.

SSSE3 is not as famous as AVX or as foundational as SSE2, but for packed byte and word processing, it was one of the most important SSE-era improvements. It made many real algorithms shorter, clearer, and faster, and its influence is still visible in modern SIMD programming today.
