SIMD on x64/x86

Map of SIMD Instruction Sets and CPUs

The original version of this article was written in 2000, when the practical SIMD landscape on x86 processors was still small enough to fit in a compact table.

At that time, the important questions were simple:

  • Does the CPU support MMX?
  • Does it support Extended MMX?
  • Does it support SSE?
  • Does it support SSE2?
  • Does it support AMD 3DNow!?

That map was useful because the market was transitioning from scalar x86 code to the first generation of multimedia SIMD code.

Today the landscape is much larger.

Modern x86 CPUs may support MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, FMA, F16C, AVX-512, AVX-VNNI, AVX10, AMX, and many smaller specialized extensions. AMD also had its own historical path with 3DNow!, SSE4a, FMA4, XOP, and later AVX-512 support in Zen 4 and Zen 5.

This article updates the old map into a modern guide to SIMD instruction sets on Intel and AMD processors.

The goal is not to list every instruction mnemonic. The goal is to explain the major SIMD families, when they appeared, which processor generations support them, and what a programmer should check before using them.

Important Caveat: Always Check CPUID

Processor generation tables are useful, but they are not a replacement for runtime feature detection.

Different SKUs, steppings, BIOS settings, operating systems, virtual machines, and cloud instances may expose different instruction sets.

For production software, always check CPU capabilities at runtime using CPUID. For AVX, AVX2, AVX-512, and AVX10, also verify that the operating system supports saving and restoring the required extended register state. That usually means checking OSXSAVE and XGETBV, not only the raw CPU feature bit.

Use the tables in this article as a historical and practical map, not as the final authority for one specific machine.

What SIMD Means on x86

SIMD means:

Single Instruction, Multiple Data

A SIMD instruction applies one operation to several data elements packed into a vector register.

For example, instead of adding one pair of 32-bit integers:

a + b

a SIMD instruction can add several pairs at once:

a0 + b0
a1 + b1
a2 + b2
a3 + b3
...

The wider the vector register, the more elements can be processed by one instruction.
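As a minimal sketch of the idea above, the following C function adds four pairs of 32-bit integers. When the compiler targets SSE2 (an extension covered later in this article), it uses one packed add; otherwise it falls back to a scalar loop. The function name add4_i32 is made up for this illustration.

```c
#include <stdint.h>

#ifdef __SSE2__
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for four 32-bit lanes. */
void add4_i32(const int32_t a[4], const int32_t b[4], int32_t out[4]) {
#ifdef __SSE2__
    /* Unaligned loads, one packed add (PADDD), one unaligned store. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
#else
    /* Scalar fallback: the same four additions, one at a time. */
    for (int i = 0; i < 4; i++)
        out[i] = a[i] + b[i];
#endif
}
```

Both paths give the same results for inputs that do not overflow. The packed add wraps on overflow, while signed overflow in the scalar loop is undefined in C.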

A simplified map looks like this:

SIMD family | Main register type | Register width | Typical use
MMX | mm0-mm7 | 64-bit | Packed integer multimedia
SSE | xmm0-… | 128-bit | Single-precision floating point, later integer and double
AVX | ymm0-… | 256-bit | Wider floating point
AVX2 | ymm0-… | 256-bit | Wider integer SIMD
AVX-512 | zmm0-… plus mask registers | 512-bit | HPC, AI, media, compression, crypto
AVX10 | versioned AVX-family ISA | 128/256/512-bit model | Future Intel converged vector ISA
AMX | tile registers | tile/matrix state | Matrix multiplication, AI acceleration

MMX, SSE, AVX, and AVX-512 are SIMD instruction sets.

AMX is different. It is not traditional lane-based SIMD. It is a tile/matrix extension. It belongs in the modern performance map because it is part of the same trend: moving more data-parallel work directly into the CPU.

The Short Timeline

Era | Intel | AMD | SIMD milestone
1997 | Pentium MMX, Pentium II | K6 | MMX
1998-1999 | Pentium III | K6-2, Athlon | SSE on Intel, 3DNow! on AMD
2000-2001 | Pentium 4 | Athlon XP, Athlon 64 later | SSE2 starts becoming important
2004-2006 | Prescott, Core, Core 2 | Athlon 64, Opteron | SSE3, SSSE3
2007-2009 | Penryn, Nehalem | Phenom, Phenom II | SSE4.1, SSE4.2, SSE4a on AMD
2011 | Sandy Bridge | Bulldozer | AVX
2012-2013 | Ivy Bridge, Haswell | Piledriver | F16C, FMA, AVX2
2015-2017 | Skylake, Skylake-X, Xeon Phi | Zen | AVX2 becomes mainstream; AVX-512 appears on Intel high-end/server
2019-2021 | Ice Lake, Tiger Lake, Rocket Lake | Zen 2, Zen 3 | Wider AVX-512 coverage on some Intel CPUs
2022-2024 | Sapphire Rapids, Meteor Lake, Arrow Lake, Xeon 6 | Zen 4 | Intel AMX; AMD adds AVX-512 in Zen 4
2024-2026 | Granite Rapids, Xeon 6, Core Ultra generations | Zen 5, EPYC Turin | AVX-512 expands on AMD; Intel moves toward AVX10

MMX

MMX was the first widely adopted SIMD extension on x86.

It introduced eight 64-bit registers:

mm0 mm1 mm2 mm3 mm4 mm5 mm6 mm7

These registers can hold packed integers:

Data type | Elements per MMX register
8-bit integers | 8
16-bit integers | 4
32-bit integers | 2
64-bit integer | 1

MMX was useful for:

  • image processing;
  • audio processing;
  • video decoding;
  • graphics effects;
  • color conversion;
  • packed integer arithmetic.

The main limitation of MMX is that it aliases the old x87 floating-point register stack. After using MMX, code must execute EMMS, or the _mm_empty() intrinsic in C/C++, before returning to x87 floating-point code.

Typical processors with MMX:

Vendor | Processor generation
Intel | Pentium MMX, Pentium II, Celeron, Pentium III, Pentium 4
AMD | K6, K6-2, K6-III, Athlon, Duron, Athlon XP, Athlon 64, Opteron

MMX is now obsolete for new code, but it still matters when reading old multimedia libraries and early SIMD code.

AMD 3DNow!

3DNow! was AMD’s alternative SIMD extension introduced with the K6-2.

Unlike MMX, which was mainly integer-oriented, 3DNow! added packed single-precision floating-point operations using MMX registers.

This gave AMD a way to accelerate 3D games and multimedia software before SSE was widely available.

Main 3DNow! variants:

Extension | Description
3DNow! | Original AMD packed floating-point SIMD extension
Enhanced 3DNow! | Added more DSP/media-oriented instructions
3DNow! Professional | Marketing name used around the Athlon XP era, including SSE support

Typical AMD processors with 3DNow!:

Processor generation | SIMD support
K6-2 | MMX, 3DNow!
K6-III | MMX, 3DNow!
Athlon | MMX, Extended MMX, Enhanced 3DNow!
Duron | MMX, Extended MMX, Enhanced 3DNow!
Athlon XP | MMX, Extended MMX, Enhanced 3DNow!, SSE

3DNow! is no longer relevant for new code. Modern AMD processors support the Intel-compatible SSE/AVX instruction families instead.

SSE

SSE stands for:

Streaming SIMD Extensions

It was introduced by Intel with the Pentium III.

SSE added 128-bit XMM registers:

xmm0 xmm1 xmm2 xmm3 ...

The original SSE instruction set mainly focused on packed single-precision floating-point arithmetic.

One XMM register can hold four 32-bit floats:

float0 float1 float2 float3

That made SSE useful for:

  • 3D graphics;
  • geometry processing;
  • game engines;
  • audio processing;
  • image filters;
  • physics calculations.

SSE also added a few MMX-related integer instructions and memory/cache control instructions, but its most important architectural change was the introduction of the XMM register file.

Typical first-generation SSE processors:

Vendor | Processor generation
Intel | Pentium III, Celeron II
AMD | Athlon XP and later

SSE2

SSE2 was introduced with the Intel Pentium 4 and became one of the most important x86 SIMD extensions ever.

SSE2 expanded XMM usage to include:

  • packed double-precision floating-point;
  • packed integer operations;
  • 128-bit integer SIMD;
  • better replacement paths for old MMX code.

One XMM register can hold:

Data type | Elements per XMM register
8-bit integers | 16
16-bit integers | 8
32-bit integers | 4
64-bit integers | 2
32-bit floats | 4
64-bit doubles | 2
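The element counts in tables like this one are simply the register width divided by the element width. A tiny sketch of that arithmetic, where the function name lanes is an illustrative assumption:

```c
/* Hypothetical helper: how many elements of element_bits fit in a
 * vector register of register_bits (64 for MMX, 128 for XMM,
 * 256 for YMM, 512 for ZMM). */
unsigned lanes(unsigned register_bits, unsigned element_bits) {
    return register_bits / element_bits;
}
```

For example, lanes(128, 8) gives the 16 bytes of an XMM register, and the same call with 256 or 512 gives the YMM and ZMM counts.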

SSE2 is especially important because it is part of the baseline for x86-64. If you are writing 64-bit x86 code, SSE2 is normally assumed to be available.

Typical processors with SSE2:

Vendor | Processor generation
Intel | Pentium 4 and later
AMD | Athlon 64, Opteron, Sempron 64 and later

For new code, SSE2 is a much better minimum target than MMX.

SSE3

SSE3 was introduced with later Pentium 4 Prescott processors.

It added a small set of instructions, including:

  • horizontal add/subtract operations;
  • improved complex-number support;
  • some thread synchronization and memory-related instructions.

SSE3 was not as large a jump as SSE2, but it filled practical gaps.

Typical processors with SSE3:

Vendor | Processor generation
Intel | Pentium 4 Prescott, Pentium D, Core, Core 2
AMD | Later Athlon 64, Opteron, Phenom and newer

SSSE3

SSSE3 stands for:

Supplemental Streaming SIMD Extensions 3

Despite the similar name, SSSE3 is not the same as SSE3.

SSSE3 added very useful integer SIMD instructions, especially for media processing. The most famous is probably PSHUFB, a byte shuffle instruction that became extremely valuable for:

  • codecs;
  • text processing;
  • lookup-table tricks;
  • cryptography;
  • pixel manipulation;
  • byte rearrangement.
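To make the usefulness of PSHUFB concrete, here is a scalar C model of what the 128-bit form computes. This is a sketch of the instruction's behavior, not an implementation that uses the instruction, and the name pshufb_model is an assumption made for this example.

```c
#include <stdint.h>

/* Scalar model of 128-bit PSHUFB: each control byte selects one source
 * byte by its low 4 bits, or produces zero when its top bit is set. */
void pshufb_model(const uint8_t src[16], const uint8_t ctrl[16],
                  uint8_t out[16]) {
    uint8_t tmp[16];  /* temporary so that out may alias src */
    for (int i = 0; i < 16; i++)
        tmp[i] = (ctrl[i] & 0x80) ? 0 : src[ctrl[i] & 0x0F];
    for (int i = 0; i < 16; i++)
        out[i] = tmp[i];
}
```

With a control vector of 15, 14, ..., 0 this reverses the 16 bytes. With a table in src and data in ctrl it acts as a 16-entry parallel lookup, which is the basis of many of the lookup-table tricks listed above. The corresponding intrinsic is _mm_shuffle_epi8.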

Typical processors with SSSE3:

Vendor | Processor generation
Intel | Core 2 and later
AMD | Bobcat, Bulldozer and later, Zen and later

AMD did not support SSSE3 in the K8 (Athlon 64) or K10 (Phenom, Phenom II) generations. It arrived on AMD with Bobcat and Bulldozer in 2011.

SSE4.1

SSE4.1 was introduced with Intel Penryn, the 45 nm Core 2 generation.

It added many practical instructions for integer and floating-point SIMD, including:

  • packed blends;
  • dot product;
  • rounding;
  • packed min/max improvements;
  • integer widening and packing operations;
  • better support for media and graphics workloads.

Typical processors with SSE4.1:

Vendor | Processor generation
Intel | Penryn Core 2, Nehalem and later
AMD | Bulldozer and later, Zen and later

SSE4.1 is very useful for image processing and video code because it reduces the number of instructions needed for common packed operations.

SSE4.2

SSE4.2 was introduced with Intel Nehalem.

It added:

  • string and text comparison instructions;
  • CRC32;
  • additional comparison support.

SSE4.2 is especially associated with faster string processing and checksum code.
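One detail worth knowing is that the SSE4.2 CRC32 instruction computes CRC-32C (the Castagnoli polynomial), not the CRC-32 used by zlib and Ethernet. The sketch below pairs the hardware path with a bitwise software fallback. The function names are illustrative, and the runtime check relies on the GCC/Clang builtin __builtin_cpu_supports, which is an assumption about the toolchain.

```c
#include <stddef.h>
#include <stdint.h>

#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__))
#include <nmmintrin.h>  /* SSE4.2 intrinsics */
#define HAVE_X86 1
#else
#define HAVE_X86 0
#endif

/* Bitwise CRC-32C update (reflected polynomial 0x82F63B78). */
static uint32_t crc32c_sw(uint32_t crc, const uint8_t *p, size_t n) {
    while (n--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

#if HAVE_X86
/* Same update using the SSE4.2 CRC32 instruction, one byte at a time. */
__attribute__((target("sse4.2")))
static uint32_t crc32c_hw(uint32_t crc, const uint8_t *p, size_t n) {
    while (n--)
        crc = _mm_crc32_u8(crc, *p++);
    return crc;
}
#endif

/* Hypothetical entry point: CRC-32C with the usual init and final XOR. */
uint32_t crc32c(const void *data, size_t n) {
    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;
#if HAVE_X86
    if (__builtin_cpu_supports("sse4.2"))
        return ~crc32c_hw(crc, p, n);
#endif
    return ~crc32c_sw(crc, p, n);
}
```

The byte-at-a-time loop keeps the sketch short. Production code would normally consume 4 or 8 bytes per instruction and often interleave several independent streams.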

Typical processors with SSE4.2:

Vendor | Processor generation
Intel | Nehalem, Westmere and later
AMD | Bulldozer and later, Zen and later

SSE4a

SSE4a is an AMD-specific extension.

Despite the name, it is not the same as Intel SSE4.1 or SSE4.2.

SSE4a appeared in AMD K10-family processors such as Phenom and Opteron Barcelona. It added only four instructions: the scalar streaming (non-temporal) stores MOVNTSS and MOVNTSD, and the bit-field extract/insert operations EXTRQ and INSERTQ.

Typical AMD processors with SSE4a:

Vendor | Processor generation
AMD | K10 / Barcelona / Phenom / Phenom II / some Opteron generations
Intel | Not supported

Because SSE4a is AMD-specific and relatively small, portable software rarely targets it as a primary SIMD path.

AES-NI and PCLMULQDQ

AES-NI and PCLMULQDQ are not usually described as “SSE versions,” but they are closely related to the SIMD evolution of x86.

AES-NI accelerates AES encryption and decryption.

PCLMULQDQ performs carry-less multiplication, useful for cryptography, CRC algorithms, and Galois field arithmetic.

Typical support:

Vendor | Processor generation
Intel | Westmere and later, broadly
AMD | Bulldozer and later, broadly in modern Ryzen and EPYC

These instructions are very important for cryptographic performance.

AVX

AVX stands for:

Advanced Vector Extensions

AVX introduced 256-bit YMM registers.

Conceptually, each YMM register extends an XMM register:

xmm0 = lower 128 bits of ymm0
ymm0 = 256-bit register

AVX also introduced the VEX instruction encoding, which improved instruction encoding and allowed three-operand non-destructive forms.

For example, older SSE code often overwrote one input:

a = a + b

AVX can encode:

c = a + b

This is a major improvement for register allocation and instruction scheduling.

AVX mainly extended floating-point SIMD to 256 bits:

Data type | Elements per YMM register
32-bit floats | 8
64-bit doubles | 4

Typical processors with AVX:

Vendor | Processor generation
Intel | Sandy Bridge and later
AMD | Bulldozer and later, Zen and later

Important note: AVX requires operating system support for saving and restoring YMM state. Runtime detection must check both CPU support and OS support.

F16C

F16C added conversion instructions between 16-bit half-precision floating-point and 32-bit single-precision floating-point.

It is not a full half-precision arithmetic extension. It is mainly a conversion extension.
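To show what such a conversion involves, here is a scalar C model of the half-to-single direction, the work an instruction like VCVTPH2PS performs for four or eight elements at once. It is a sketch written from the IEEE 754 binary16 layout (1 sign bit, 5 exponent bits, 10 mantissa bits). The name half_to_float is an assumption, and NaN handling is simplified to copying the payload.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model: convert one IEEE 754 half (as raw bits) to float. */
float half_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;

    if (exp == 0x1Fu) {            /* infinity or NaN */
        bits = sign | 0x7F800000u | (man << 13);
    } else if (exp != 0) {         /* normal: rebias exponent 15 -> 127 */
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    } else if (man == 0) {         /* signed zero */
        bits = sign;
    } else {                       /* subnormal half: normalize it */
        int shift = 0;
        while (!(man & 0x400u)) { man <<= 1; shift++; }
        man &= 0x3FFu;             /* drop the now-implicit leading 1 */
        bits = sign | ((uint32_t)(113 - shift) << 23) | (man << 13);
    }

    float f;
    memcpy(&f, &bits, sizeof f);   /* bit cast without aliasing issues */
    return f;
}
```

Because every half value is exactly representable as a float, this direction never rounds. The reverse direction does round, which is why the hardware instruction for it takes a rounding-control operand.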

Typical processors with F16C:

Vendor | Processor generation
Intel | Ivy Bridge and later
AMD | Piledriver and later, Zen and later

F16C became important for graphics, machine learning data conversion, and storage of compact floating-point data.

FMA3

FMA means:

Fused Multiply-Add

A fused multiply-add computes:

a * b + c

as one fused operation.

FMA improves both performance and numerical behavior for many floating-point workloads.

FMA3 is the dominant x86 FMA form today.

Typical processors with FMA3:

Vendor | Processor generation
Intel | Haswell and later
AMD | Piledriver and later, Zen and later

FMA3 is very important for:

  • matrix multiplication;
  • DSP;
  • physics;
  • machine learning;
  • linear algebra;
  • scientific computing.

FMA4

FMA4 was an AMD extension introduced around the Bulldozer era.

It used a four-operand form, which was elegant from a programmer’s point of view, but it did not become the long-term x86 standard.

Typical processors with FMA4:

Vendor | Processor generation
AMD | Bulldozer-family processors
Intel | Not supported

FMA4 is now considered a historical AMD-specific path. New software should use FMA3.

XOP

XOP was another AMD-specific SIMD extension from the Bulldozer era.

It added integer vector operations, including permutes, shifts, comparisons, and multiply-accumulate style operations.

Typical processors with XOP:

Vendor | Processor generation
AMD | Bulldozer-family processors
Intel | Not supported

XOP did not become a portable x86 SIMD target. AMD dropped it with Zen, so new code should not rely on it.

AVX2

AVX2 was introduced by Intel Haswell.

It extended most 128-bit integer SIMD operations from SSE2/SSSE3/SSE4 to 256-bit YMM registers.

This was a major milestone because AVX1 was mainly about floating-point SIMD, while AVX2 made 256-bit integer SIMD broadly useful.

AVX2 added or expanded support for:

  • 256-bit integer arithmetic;
  • 256-bit logical operations;
  • packed shifts;
  • packed compares;
  • gathers;
  • wider byte/word/dword processing;
  • stronger general-purpose SIMD support.

Typical processors with AVX2:

Vendor | Processor generation
Intel | Haswell, Broadwell, Skylake and later
AMD | Excavator, Zen and later

AVX2 is currently one of the best practical SIMD targets for portable high-performance x86 code because it is widely available across modern Intel and AMD processors.

AVX-VNNI

AVX-VNNI brings Vector Neural Network Instructions to AVX-width encodings, without requiring full AVX-512.

VNNI-style instructions are useful for integer dot products, especially in neural-network inference.
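As an illustration of what "integer dot product" means here, the following scalar C model describes one 32-bit lane of the non-saturating VNNI instruction VPDPBUSD: four unsigned bytes are multiplied by four signed bytes and the products are added to a 32-bit accumulator. The function name is made up for this sketch.

```c
#include <stdint.h>

/* Scalar model of one 32-bit lane of VPDPBUSD:
 * acc += a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3]
 * where a holds unsigned bytes and b holds signed bytes. */
int32_t dpbusd_lane(int32_t acc, const uint8_t a[4], const int8_t b[4]) {
    uint32_t sum = (uint32_t)acc;  /* unsigned arithmetic wraps like the hardware */
    for (int i = 0; i < 4; i++)
        sum += (uint32_t)((int32_t)a[i] * (int32_t)b[i]);
    return (int32_t)sum;
}
```

A 256-bit register holds eight such lanes, so one instruction performs 32 multiplies and 32 additions. That is why these instructions matter for quantized inference kernels.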

There are two related ideas:

Extension | Meaning
AVX-512 VNNI | VNNI instructions in the AVX-512 family
AVX-VNNI | VNNI-style instructions available without requiring the full AVX-512 programming model

AVX-VNNI matters because Intel client CPUs after the removal of client AVX-512 still needed efficient AI/inference instructions.

Typical support appears in newer Intel client and server generations, but exact support should be checked with CPUID.

AVX-IFMA

AVX-IFMA provides the 52-bit integer fused multiply-add instructions of AVX-512 IFMA in VEX-encoded 128-bit and 256-bit forms, so they can be used on processors without AVX-512.

It is useful for big-number arithmetic, cryptography, and workloads that benefit from packed integer multiply-add operations.

As with other newer AVX-family subsets, support is generation- and SKU-dependent. Check CPUID.

AVX-VNNI-INT8 and AVX-VNNI-INT16

These are newer AVX-family extensions focused on integer dot-product operations for low-precision AI and inference workloads.

They target common neural-network data types:

Extension | Main focus
AVX-VNNI-INT8 | 8-bit integer neural-network operations
AVX-VNNI-INT16 | 16-bit integer neural-network operations

These are part of the modern trend toward CPU-side AI acceleration without requiring every workload to move to a GPU or NPU.

AVX-NE-CONVERT

AVX-NE-CONVERT is another newer extension. It adds no-exception ("NE") conversion instructions between FP32 and the low-precision BF16 and FP16 formats.

This belongs to the same broad family of AI-oriented CPU instructions as VNNI, BF16, FP16, and future AVX10/ACE work.

For portable software, this is not a baseline assumption. It is a specialized path selected by CPUID.

AVX-512

AVX-512 is not a single instruction set in the same simple sense as SSE2 or AVX2.

It is a family of extensions based around 512-bit ZMM registers and mask registers.

A 512-bit ZMM register can hold:

Data type | Elements per ZMM register
8-bit integers | 64
16-bit integers | 32
32-bit integers | 16
64-bit integers | 8
32-bit floats | 16
64-bit doubles | 8

AVX-512 also introduced mask registers:

k0 k1 k2 k3 k4 k5 k6 k7

These mask registers allow predicated operations, meaning each lane can be conditionally written without needing separate blend instructions.
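A scalar C model can make predication concrete. The sketch below mimics a merge-masked 512-bit add of sixteen 32-bit integers, roughly what the intrinsic _mm512_mask_add_epi32 expresses. Lanes whose mask bit is 0 keep their previous destination value. The function name is an assumption for this example.

```c
#include <stdint.h>

/* Scalar model of a merge-masked add over 16 lanes of 32-bit integers.
 * Bit i of k controls lane i: 1 = write a[i] + b[i], 0 = leave dst[i]. */
void masked_add_i32x16(int32_t dst[16], const int32_t a[16],
                       const int32_t b[16], uint16_t k) {
    for (int i = 0; i < 16; i++) {
        if ((k >> i) & 1u)
            dst[i] = a[i] + b[i];
    }
}
```

AVX-512 also offers zero-masking, where unselected lanes are set to zero instead of preserved. Masks are especially handy for loop tails: the final partial vector can be processed with a mask instead of a separate scalar loop.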

AVX-512 is important for:

  • high-performance computing;
  • scientific simulations;
  • AI inference;
  • data compression;
  • database processing;
  • media processing;
  • cryptography;
  • genomics;
  • large-scale analytics.

AVX-512 Variants

AVX-512 is made of many feature subsets. A CPU can support some subsets and not others.

This is one of the reasons AVX-512 software often needs careful runtime dispatch.

Core AVX-512 subsets

Extension | Description
AVX-512F | Foundation; required base for AVX-512 implementations
AVX-512CD | Conflict detection
AVX-512ER | Exponential and reciprocal instructions, mainly Xeon Phi
AVX-512PF | Prefetch instructions, mainly Xeon Phi
AVX-512DQ | Doubleword and quadword instructions
AVX-512BW | Byte and word instructions
AVX-512VL | Allows many AVX-512 instructions to operate on 128-bit and 256-bit vectors

Integer, byte, and bit manipulation subsets

Extension | Description
AVX-512IFMA | Integer fused multiply-add
AVX-512VBMI | Vector byte manipulation
AVX-512VBMI2 | Additional byte/word manipulation
AVX-512BITALG | Bit algorithms
AVX-512VPOPCNTDQ | Vector population count for doubleword/quadword elements

AI and numerical-format subsets

Extension | Description
AVX-512VNNI | Vector neural-network instructions
AVX-512BF16 | bfloat16 dot-product and conversion support
AVX-512FP16 | Half-precision floating-point arithmetic

Crypto and Galois-field related vector extensions

Extension | Description
VAES | Vector AES
VPCLMULQDQ | Vector carry-less multiply
GFNI | Galois Field New Instructions

These may be available with AVX, AVX2, or AVX-512 encodings depending on the processor.

Specialized or limited AVX-512 subsets

Extension | Description
AVX-512 4VNNIW | Xeon Phi-oriented neural-network instructions
AVX-512 4FMAPS | Xeon Phi-oriented fused multiply-accumulate instructions
AVX-512 VP2INTERSECT | Vector pair intersection

Some of these appeared only in narrow product families or were not widely adopted.

AVX-512 on Intel CPUs

Intel introduced AVX-512 first in Xeon Phi and then in Xeon server and high-end desktop processors.

A simplified Intel AVX-512 map:

Intel generation | AVX-512 status
Xeon Phi Knights Landing | Early AVX-512: F, CD, ER, PF
Xeon Phi Knights Mill | Added specialized AI/HPC subsets such as 4VNNIW and 4FMAPS
Skylake-X / Skylake-SP | AVX-512F, CD, BW, DQ, VL
Cascade Lake Xeon | Added AVX-512VNNI
Cooper Lake Xeon | Added AVX-512BF16
Ice Lake client/server | Broader AVX-512 support including VNNI and byte/bit manipulation subsets
Tiger Lake | Client AVX-512 on many SKUs
Rocket Lake | Desktop AVX-512 on supported SKUs
Alder Lake and later mainstream hybrid client CPUs | AVX-512 not officially supported
Sapphire Rapids Xeon | AVX-512 plus AMX, BF16, FP16 support
Emerald Rapids Xeon | Similar server-class AVX-512/AMX direction
Granite Rapids / Xeon 6 P-core | Server/workstation AVX-512 and transition toward AVX10
Xeon 6 E-core-only lines (Sierra Forest) | No AVX-512; AVX2-class SIMD with newer AVX extensions such as AVX-VNNI; check SKU documentation

The most important practical point is that Intel client CPUs and Intel server CPUs diverged.

For several years, high-end Intel servers had strong AVX-512 support while many mainstream client CPUs did not.

AVX-512 on AMD CPUs

AMD did not support AVX-512 in Zen, Zen+, Zen 2, or Zen 3.

AMD added AVX-512 support with Zen 4.

A simplified AMD AVX-512 map:

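AMD generation | AVX-512 status
Zen / Zen+ | No AVX-512
Zen 2 | No AVX-512
Zen 3 | No AVX-512
Zen 4 | AVX-512 support added; 512-bit operations execute on 256-bit datapaths
Zen 5 / EPYC Turin | AVX-512 support continues; desktop and server Zen 5 cores widen the datapath to a full 512 bits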

AMD Zen 4 processors support a practical subset of AVX-512 that includes important features such as AVX-512F, DQ, IFMA, CD, BW, VL, VBMI, VNNI, BITALG, VPOPCNTDQ, BF16, and related vector crypto/Galois-field extensions depending on the exact model.

Zen 5, including 5th Gen AMD EPYC Turin, continues the AVX-512 direction and is especially relevant for HPC, AI, analytics, and cloud workloads.

The important lesson is that AVX-512 is no longer Intel-only. Modern portable high-performance code should consider AVX-512 dispatch paths for both recent Intel server CPUs and recent AMD Zen 4 / Zen 5 CPUs.

AVX10

AVX10 is Intel’s attempt to simplify the future of the AVX family.

The problem with AVX-512 is fragmentation. There are many feature bits, and software has to check which subset is available.

AVX10 moves toward a versioned model.

Instead of thinking only in terms of dozens of separate AVX-512 feature flags, AVX10 introduces an AVX10 version number.

The idea is:

AVX10.1
AVX10.2
future AVX10 versions

A later version is expected to include the earlier version’s capabilities.

AVX10 is also designed to make the AVX-512-style programming model available across future Intel P-core and E-core processors.

Important AVX10 points:

  • AVX10 is based on the AVX-512 programming model.
  • It uses versioned enumeration.
  • It is intended to reduce feature-detection complexity.
  • Future new Intel vector instructions are expected to be enumerated under AVX10 rather than by adding more AVX-512 feature flags.
  • AVX10.1 is a transition version.
  • AVX10.2 adds new instructions, including AI data type and conversion support.

For developers, AVX10 is important because it points to the future direction of Intel vector programming.

However, for software that must run on existing machines, AVX2 and AVX-512 runtime dispatch remain essential.

AMX

AMX stands for:

Advanced Matrix Extensions

AMX is not traditional SIMD. It introduces tile registers and tile operations designed for matrix multiplication and AI workloads.

Important AMX subsets include:

Extension | Description
AMX-TILE | Tile register architecture
AMX-INT8 | 8-bit integer matrix operations
AMX-BF16 | bfloat16 matrix operations
AMX-FP16 | FP16 matrix operations on newer/future processors
AMX-COMPLEX | Complex-number tile operations on newer/future processors

AMX is especially relevant for:

  • neural-network inference;
  • matrix multiplication;
  • deep learning kernels;
  • server-side AI;
  • dense numerical compute.

Intel introduced AMX in Sapphire Rapids Xeon processors.

AMX requires operating system support because it adds new architectural state. Like AVX and AVX-512, detecting the CPU feature alone is not enough; the OS must support saving and restoring the state.

ACE: AI Compute Extensions

ACE, or AI Compute Extensions, is a newer x86 ecosystem direction jointly associated with Intel and AMD.

ACE is focused on AI-oriented matrix acceleration and reduced-precision numerical formats. It is intended to provide a more consistent cross-vendor target for future x86 AI workloads.

ACE is not a replacement for SSE, AVX, or AVX-512. It belongs to the same broader evolution: the CPU is gaining more native support for dense data-parallel and matrix-heavy workloads.

For the purpose of a SIMD map, ACE should be considered a future-facing matrix/AI extension rather than a classic lane-based SIMD family.

Intel Processor Generations and SIMD Support

The following table summarizes major Intel generations and their most important SIMD support.

This table is intentionally practical rather than exhaustive. Some product lines, steppings, low-end SKUs, embedded parts, and disabled features differ.

Intel processor generation | Approx. era | Important SIMD support
Pentium | 1993 | No MMX in original Pentium
Pentium MMX | 1997 | MMX
Pentium II | 1997 | MMX
Original Celeron | 1998 | MMX
Pentium III | 1999 | MMX, SSE
Celeron II | 2000 | MMX, SSE
Pentium 4 Willamette / Northwood | 2000-2002 | MMX, SSE, SSE2
Pentium 4 Prescott | 2004 | MMX, SSE, SSE2, SSE3
Pentium D | 2005 | SSE2, SSE3
Pentium M | 2003-2005 | MMX, SSE, SSE2; no SSE3
Core Solo / Core Duo | 2006 | SSE2, SSE3
Core 2 Merom / Conroe | 2006 | SSE2, SSE3, SSSE3
Core 2 Penryn | 2007 | SSSE3, SSE4.1
Nehalem | 2008 | SSE4.1, SSE4.2
Westmere | 2010 | SSE4.2, AES-NI, PCLMULQDQ
Sandy Bridge | 2011 | AVX
Ivy Bridge | 2012 | AVX, F16C
Haswell | 2013 | AVX2, FMA3
Broadwell | 2014-2015 | AVX2, FMA3
Skylake client | 2015 | AVX2, FMA3
Skylake-X / Skylake-SP | 2017 | AVX-512F, CD, BW, DQ, VL
Kaby Lake / Coffee Lake / Comet Lake | 2016-2020 | AVX2, FMA3
Cannon Lake | 2018 | AVX-512 on limited client products
Cascade Lake Xeon | 2019 | AVX-512VNNI
Cooper Lake Xeon | 2020 | AVX-512BF16
Ice Lake client/server | 2019-2021 | Broad AVX-512 support on many SKUs
Tiger Lake | 2020 | AVX-512 on many client SKUs
Rocket Lake | 2021 | AVX-512 on supported desktop SKUs
Alder Lake | 2021 | AVX2, FMA3, AVX-VNNI on many SKUs; AVX-512 not officially supported
Raptor Lake | 2022-2023 | AVX2, FMA3, AVX-VNNI on many SKUs; no official AVX-512
Sapphire Rapids Xeon | 2023 | AVX-512, BF16, FP16, AMX
Emerald Rapids Xeon | 2023 | AVX-512, AMX server-class support
Meteor Lake / Core Ultra Series 1 | 2023-2024 | AVX2/FMA-class client SIMD; no mainstream AVX-512
Sierra Forest Xeon 6 E-core | 2024 | E-core server line; AVX-512 support differs from P-core Xeon
Granite Rapids Xeon 6 P-core | 2024-2025 | AVX-512, AMX, AVX10 transition generation
Arrow Lake / Core Ultra 200 | 2024-2025 | AVX2/FMA-class client SIMD; no mainstream AVX-512
Lunar Lake / Core Ultra 200V | 2024-2025 | AVX2/FMA-class client SIMD; AI acceleration also via NPU
Xeon 600 workstation / Xeon 6 workstation | 2026 | Server/workstation-class AVX-512 and AMX on P-core products
Panther Lake / Core Ultra Series 3 | 2026 generation | Check final SKU documentation; client hybrid direction, not a simple AVX-512 baseline

AMD Processor Generations and SIMD Support

The following table summarizes major AMD generations and their most important SIMD support.

Again, check CPUID for exact systems.

AMD processor generation | Approx. era | Important SIMD support
K5 | 1996 | No MMX
K6 | 1997 | MMX
K6-2 | 1998 | MMX, 3DNow!
K6-III | 1999 | MMX, 3DNow!
Athlon | 1999 | MMX, Enhanced 3DNow!, Extended MMX
Duron | 2000 | MMX, Enhanced 3DNow!, Extended MMX
Athlon XP | 2001 | MMX, Enhanced 3DNow!, SSE
Athlon 64 / Opteron | 2003 | SSE, SSE2; later revisions added SSE3
Sempron 64 | 2004 | SSE, SSE2; later SSE3, depending on model
K10 / Barcelona / Phenom | 2007 | SSE3, SSE4a, 3DNow! legacy support
Phenom II / Athlon II | 2008-2009 | SSE3, SSE4a
Bobcat | 2011 | SSE2/SSE3/SSSE3-class low-power support, depending on model
Bulldozer | 2011 | AVX, SSE4.1, SSE4.2, XOP, FMA4
Piledriver | 2012 | AVX, FMA3, FMA4, F16C, XOP
Steamroller | 2014 | AVX/FMA-class Bulldozer-family SIMD
Excavator | 2015 | AVX2 added to the Bulldozer-family feature set
Jaguar / Puma | 2013-2014 | SSE4.x/AVX-class low-power SIMD, depending on model
Zen | 2017 | SSE4.2, AVX, AVX2, FMA3, AES, F16C
Zen+ | 2018 | Similar to Zen
Zen 2 | 2019 | AVX2, FMA3; no AVX-512
Zen 3 | 2020 | AVX2, FMA3; no AVX-512
Zen 4 | 2022 | AVX-512 support added
Zen 4c | 2023 | AVX-512 support in dense-core server/client variants where exposed
Zen 5 | 2024 | AVX-512 support continues; stronger vector capability
5th Gen EPYC Turin | 2024-2026 | AVX-512, including VNNI and BF16 support; verify other subsets with CPUID
Future Zen generations | 2026+ | Check AMD documentation and CPUID; ACE/AVX10 ecosystem direction may matter in future

Practical SIMD Baselines for Software

The best SIMD target depends on what kind of software you are writing.

If you need maximum compatibility

Use scalar code plus SSE2.

SSE2 is a safe baseline for x86-64 and works on a very wide range of Intel and AMD processors.

Good for:

  • general-purpose libraries;
  • small utilities;
  • long-tail compatibility;
  • software that must run on old machines.

If you target reasonably modern desktops and laptops

Use AVX2 plus a fallback.

AVX2 is a strong practical baseline for many modern machines from the last decade.

Good for:

  • image processing;
  • compression;
  • video processing;
  • game engines;
  • data scanning;
  • numerical kernels;
  • DSP;
  • high-performance C/C++ libraries.

If you target recent servers or high-performance workstations

Use AVX-512 with runtime dispatch.

AVX-512 can provide significant gains, but support varies across generations and vendors.

Good for:

  • HPC;
  • machine learning;
  • analytics;
  • genomics;
  • cryptography;
  • compression;
  • vectorized database operations;
  • scientific computing.

If you target AI matrix workloads on recent Intel Xeon

Consider AMX.

AMX is not a general replacement for AVX-512. It is especially useful for matrix multiplication and deep learning kernels.

Good for:

  • neural-network inference;
  • BF16 matrix multiplication;
  • INT8 inference;
  • server-side AI.

If you target future Intel vector code

Track AVX10.

AVX10 is designed to simplify the future AVX programming model, but existing deployed systems still require AVX2 and AVX-512 dispatch paths.

Why Width Is Not Everything

It is tempting to rank SIMD instruction sets only by vector width:

MMX    = 64-bit
SSE    = 128-bit
AVX2   = 256-bit
AVX-512 = 512-bit

But real performance depends on much more than width.

Important factors include:

  • instruction latency;
  • instruction throughput;
  • number of execution ports;
  • load/store bandwidth;
  • cache behavior;
  • memory alignment;
  • downclocking behavior;
  • register pressure;
  • compiler quality;
  • data layout;
  • branch behavior;
  • whether the workload is compute-bound or memory-bound.

A 512-bit instruction is not automatically twice as fast as a 256-bit instruction. If the workload is memory-bound, wider vectors may not help much. If the CPU executes a 512-bit operation internally as multiple narrower operations, peak throughput may be different from the architectural width.

Always benchmark on the target CPU.

Data Layout Matters

SIMD code works best when data is arranged in a vector-friendly layout.

For example, consider pixels stored as:

RGB RGB RGB RGB

This is convenient for scalar code, but it can be awkward for SIMD code if you want to process all red values together, then all green values, then all blue values.

A SIMD-friendly layout may look like this:

RRRR GGGG BBBB

or use separate arrays:

R[] G[] B[]

This is the classic difference between:

Array of Structures

and:

Structure of Arrays

Instruction set support matters, but data layout often matters just as much.
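The two layouts can be sketched in C as follows. The type and function names are assumptions made for this illustration. The point is that in the Structure of Arrays form, each channel is a contiguous byte array that a compiler or hand-written SIMD loop can load 16, 32, or 64 elements at a time.

```c
#include <stddef.h>
#include <stdint.h>

/* Array of Structures: R, G, B interleaved in memory. */
typedef struct { uint8_t r, g, b; } PixelAoS;

/* Structure of Arrays: one contiguous array per channel. */
typedef struct { uint8_t *r, *g, *b; } ImageSoA;

/* Split n interleaved pixels into three planar channels. */
void aos_to_soa(const PixelAoS *in, size_t n, ImageSoA out) {
    for (size_t i = 0; i < n; i++) {
        out.r[i] = in[i].r;
        out.g[i] = in[i].g;
        out.b[i] = in[i].b;
    }
}

/* Saturating brighten of the red channel only. On the planar layout this
 * is a unit-stride byte loop, the shape that maps onto a packed
 * saturating add such as PADDUSB. */
void brighten_red(ImageSoA img, size_t n, uint8_t delta) {
    for (size_t i = 0; i < n; i++) {
        unsigned v = (unsigned)img.r[i] + delta;
        img.r[i] = (uint8_t)(v > 255u ? 255u : v);
    }
}
```

Converting between layouts has its own cost, so the planar form pays off mainly when several passes run over the same data.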

Runtime Dispatch Strategy

A good modern x86 library often contains multiple implementations of the same hot loop.

For example:

scalar
SSE2
SSSE3 or SSE4.1
AVX2
AVX-512
AMX or specialized AI path

At startup or first use, the library checks CPU features and selects the best implementation.

A simplified dispatch order might be:

if AMX and workload is matrix-heavy:
    use AMX path
else if AVX-512 suitable subset is available:
    use AVX-512 path
else if AVX2 and FMA are available:
    use AVX2/FMA path
else if SSE4.1 is available:
    use SSE4.1 path
else if SSSE3 is available:
    use SSSE3 path
else if SSE2 is available:
    use SSE2 path
else:
    use scalar path

For AVX and later, remember that CPU support alone is not enough. The operating system must support the extended register state.
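A reduced version of this dispatch pattern might look like the following C sketch, with only two tiers (AVX2 and scalar). It assumes GCC or Clang on x86 for the __builtin_cpu_supports builtin and the target attribute. All function names are illustrative, and the one-time initialization shown is not thread-safe.

```c
#include <stddef.h>
#include <stdint.h>

#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1
#else
#define HAVE_X86_DISPATCH 0
#endif

typedef int64_t (*sum_fn)(const int32_t *, size_t);

/* Portable fallback: always available. */
static int64_t sum_scalar(const int32_t *v, size_t n) {
    int64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

#if HAVE_X86_DISPATCH
/* AVX2 path: widen four int32 values to int64 and accumulate. */
__attribute__((target("avx2")))
static int64_t sum_avx2(const int32_t *v, size_t n) {
    __m256i acc = _mm256_setzero_si256();  /* four 64-bit partial sums */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i x = _mm_loadu_si128((const __m128i *)(v + i));
        acc = _mm256_add_epi64(acc, _mm256_cvtepi32_epi64(x));
    }
    int64_t lanes[4];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    int64_t s = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; i++)                     /* scalar tail */
        s += v[i];
    return s;
}
#endif

/* Pick the best implementation the running CPU supports. */
static sum_fn select_sum(void) {
#if HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return sum_avx2;
#endif
    return sum_scalar;
}

/* Public entry point: resolves the implementation on first use. */
int64_t sum_i32(const int32_t *v, size_t n) {
    static sum_fn impl = NULL;  /* sketch only: not thread-safe */
    if (!impl)
        impl = select_sum();
    return impl(v, n);
}
```

Keeping each tier in its own function with its own target attribute lets the rest of the program be compiled for the baseline, so the binary still runs on CPUs without AVX2.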

Recommended Feature Checks

For old SSE code, checking the CPUID feature bit is usually enough.

For AVX and later, check both the CPU and the OS.

A practical checklist:

Feature family | What to check
MMX | CPUID MMX
SSE | CPUID SSE
SSE2 | CPUID SSE2
SSE3/SSSE3/SSE4.x | CPUID feature bits
AVX | CPUID AVX, OSXSAVE, XGETBV for XMM/YMM state
AVX2 | AVX checks plus CPUID AVX2
FMA/F16C | AVX checks plus CPUID FMA/F16C
AVX-512 | AVX checks plus CPUID AVX-512 bits plus XGETBV for opmask/ZMM state
AMX | CPUID AMX bits plus OSXSAVE and XGETBV for tile configuration and tile data state (XCR0 bits 17 and 18)
AVX10 | CPUID AVX10 bit and AVX10 version enumeration, plus the same OS state checks as AVX-512

Do not assume that a CPU family name is enough.
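The AVX2 row of the checklist could be implemented along these lines. This is a sketch for GCC or Clang on x86, using the compiler-provided cpuid.h header and inline assembly for XGETBV. The function names are assumptions, and on other compilers or architectures the check simply reports 0.

```c
#include <stdint.h>

#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>
#define HAVE_CPUID 1
#else
#define HAVE_CPUID 0
#endif

/* XCR0 bits: 1 = SSE (XMM), 2 = AVX (upper YMM halves),
 * 5 = opmask, 6 = upper halves of ZMM0-15, 7 = ZMM16-31. */
int os_saves_ymm(uint64_t xcr0) { return (xcr0 & 0x06u) == 0x06u; }
int os_saves_zmm(uint64_t xcr0) { return (xcr0 & 0xE6u) == 0xE6u; }

#if HAVE_CPUID
/* Read XCR0 with XGETBV; only valid once OSXSAVE is known to be set. */
static uint64_t read_xcr0(void) {
    uint32_t lo, hi;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return ((uint64_t)hi << 32) | lo;
}
#endif

/* Returns 1 if AVX2 is usable: CPU bit set and OS saves YMM state. */
int can_use_avx2(void) {
#if HAVE_CPUID
    unsigned eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    int osxsave = (ecx >> 27) & 1;  /* CPUID.1:ECX bit 27 */
    int avx     = (ecx >> 28) & 1;  /* CPUID.1:ECX bit 28 */
    if (!osxsave || !avx)
        return 0;
    if (!os_saves_ymm(read_xcr0()))
        return 0;

    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ebx >> 5) & 1;          /* CPUID.(EAX=7,ECX=0):EBX bit 5 */
#else
    return 0;
#endif
}
```

An AVX-512 check follows the same shape: test os_saves_zmm instead of os_saves_ymm, then test the individual AVX-512 feature bits the code path needs.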

Complete SIMD Family Map

The following table summarizes the major x86 SIMD and SIMD-adjacent instruction-set families.

Instruction set | Vendor origin | Register width/model | Main purpose | Modern relevance
MMX | Intel | 64-bit MMX | Packed integer multimedia | Legacy
Extended MMX / MMX+ | AMD / SSE-era | 64-bit MMX | Extra integer/media operations | Legacy
3DNow! | AMD | 64-bit MMX | Packed floating-point | Obsolete
Enhanced 3DNow! | AMD | 64-bit MMX | More media/DSP operations | Obsolete
SSE | Intel | 128-bit XMM | Packed single-precision FP | Historical baseline
SSE2 | Intel | 128-bit XMM | Integer and double FP SIMD | x86-64 baseline
SSE3 | Intel | 128-bit XMM | Horizontal ops, complex arithmetic helpers | Common
SSSE3 | Intel | 128-bit XMM | Byte shuffle, integer media ops | Very useful
SSE4a | AMD | 128-bit XMM | Small AMD-specific extension | Niche
SSE4.1 | Intel | 128-bit XMM | Blends, dot product, rounding, media ops | Useful
SSE4.2 | Intel | 128-bit XMM | Text/string compare, CRC32 | Common
AES-NI | Intel | XMM-based | AES crypto acceleration | Very important
PCLMULQDQ | Intel | XMM-based | Carry-less multiply | Very important
AVX | Intel | 256-bit YMM | Wider FP SIMD, VEX encoding | Common
F16C | Intel | XMM/YMM | FP16/FP32 conversion | Common
FMA3 | Intel/AMD | XMM/YMM | Fused multiply-add | Common
FMA4 | AMD | XMM/YMM | Four-operand FMA | Historical
XOP | AMD | XMM/YMM | AMD-specific vector ops | Historical
AVX2 | Intel | 256-bit YMM | 256-bit integer SIMD | Modern baseline
AVX-VNNI | Intel | XMM/YMM | Neural-network dot products | Newer client/server
AVX-IFMA | Intel | XMM/YMM | Integer multiply-add | Specialized
AVX-VNNI-INT8 | Intel | XMM/YMM | INT8 AI operations | New/future-facing
AVX-VNNI-INT16 | Intel | XMM/YMM | INT16 AI operations | New/future-facing
AVX-NE-CONVERT | Intel | XMM/YMM | Low-precision conversion | New/future-facing
AVX-512F | Intel | 512-bit ZMM | AVX-512 foundation | Server/HPC/AI
AVX-512CD | Intel | 512-bit ZMM | Conflict detection | Server/HPC
AVX-512ER | Intel | 512-bit ZMM | Exp/reciprocal, Xeon Phi | Limited
AVX-512PF | Intel | 512-bit ZMM | Prefetch, Xeon Phi | Limited
AVX-512DQ | Intel | 512-bit ZMM | Dword/qword operations | Common AVX-512 subset
AVX-512BW | Intel | 512-bit ZMM | Byte/word operations | Common AVX-512 subset
AVX-512VL | Intel | 128/256-bit forms of AVX-512 ops | Makes AVX-512 more flexible | Important
AVX-512IFMA | Intel | 512-bit ZMM | Integer fused multiply-add | Crypto/bignum
AVX-512VBMI | Intel | 512-bit ZMM | Byte manipulation | Text/media
AVX-512VBMI2 | Intel | 512-bit ZMM | More byte/word manipulation | Text/media
AVX-512VNNI | Intel | 512-bit ZMM | Neural-network inference | AI
AVX-512BITALG | Intel | 512-bit ZMM | Bit algorithms | Specialized
AVX-512VPOPCNTDQ | Intel | 512-bit ZMM | Vector popcount | Data/search/analytics
AVX-512BF16 | Intel/AMD | 512-bit ZMM | bfloat16 operations | AI
AVX-512FP16 | Intel/AMD | 512-bit ZMM | FP16 arithmetic | AI/HPC/media
AVX-512VP2INTERSECT | Intel | 512-bit ZMM | Vector pair intersection | Limited
AVX-5124VNNIW | Intel | 512-bit ZMM | Xeon Phi AI instructions | Historical/limited
AVX-5124FMAPS | Intel | 512-bit ZMM | Xeon Phi FMA instructions | Historical/limited
VAES | Intel/AMD | XMM/YMM/ZMM depending on CPU | Vector AES | Crypto
VPCLMULQDQ | Intel/AMD | XMM/YMM/ZMM depending on CPU | Vector carry-less multiply | Crypto
GFNI | Intel/AMD | XMM/YMM/ZMM depending on CPU | Galois-field operations | Crypto/coding
AVX10.1 | Intel | AVX-512-style versioned ISA | Transition from AVX-512 | Future/current transition
AVX10.2 | Intel | AVX10 versioned ISA | New AI/data movement/conversion ops | Future-facing
AMX-TILE | Intel | Tile state | Matrix/tile base | AI/server
AMX-INT8 | Intel | Tile state | INT8 matrix operations | AI/server
AMX-BF16 | Intel | Tile state | BF16 matrix operations | AI/server
AMX-FP16 | Intel | Tile state | FP16 matrix operations | Newer/future server
ACE | Intel/AMD ecosystem | Matrix/AVX10-related model | AI matrix acceleration | Future-facing

Practical Recommendations

For old code

If the code uses MMX or 3DNow!, consider rewriting it using SSE2 or AVX2.

MMX and 3DNow! were important historically, but they are poor targets for modern code.

For portable 64-bit x86 code

Use SSE2 as the minimum SIMD baseline.

SSE2 is part of the x86-64 baseline, so every 64-bit x86 processor supports it, and it avoids the old MMX/x87 state-sharing problem.

For modern desktop software

Use AVX2 and FMA when available.

AVX2 is widely supported on Intel Haswell-and-newer and AMD Zen-and-newer processors, although some low-end Pentium, Celeron, and Atom parts from those generations lack it.
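One possible way to apply "when available" is sketched below: a dot product with a scalar fallback and an AVX2/FMA kernel chosen at run time. It assumes GCC or Clang, uses the compiler's __builtin_cpu_supports helper and the target function attribute, and the function names (dot_scalar, dot_avx2_fma, dot) are invented for the example. On other compilers or architectures only the scalar path is built.

```c
#include <assert.h>
#include <stddef.h>

/* Portable fallback: always available. */
float dot_scalar(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

#if defined(__GNUC__) && defined(__x86_64__)
#include <immintrin.h>

/* Compiled with AVX2 and FMA enabled for this function only, so the
   rest of the file still targets the baseline instruction set. */
__attribute__((target("avx2,fma")))
float dot_avx2_fma(const float *a, const float *b, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;

    /* Eight floats per iteration: acc = a[i..i+7] * b[i..i+7] + acc */
    for (; i + 8 <= n; i += 8)
        acc = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                              _mm256_loadu_ps(b + i), acc);

    /* Reduce the eight partial sums, then handle the leftover tail. */
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float sum = 0.0f;
    for (int k = 0; k < 8; k++)
        sum += lanes[k];
    for (; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
#endif

/* Public entry point: pick the best kernel the running machine supports. */
float dot(const float *a, const float *b, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));
#if defined(__GNUC__) && defined(__x86_64__)
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma"))
        return dot_avx2_fma(a, b, n);
#endif
    return dot_scalar(a, b, n);
}
```

Because the vector kernel adds the products in a different order than the scalar loop, the two paths can differ in the last bits for general floating-point input. A real library would also cache the selection instead of querying the CPU on every call.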

For high-performance server software

Add AVX-512 dispatch paths.

Recent Intel Xeon and AMD EPYC processors can benefit significantly from AVX-512, especially for compute-heavy workloads.

For AI and matrix-heavy workloads

Consider AMX on supported Intel Xeon processors, and track future ACE developments.

For AMD Zen 4 and Zen 5, AVX-512 is the important CPU-side vector path today.

For future Intel vector code

Track AVX10.

AVX10 is intended to reduce fragmentation and provide a more consistent future AVX programming model.

Summary

The x86 SIMD map has grown from a simple MMX/SSE/3DNow! table into a complex family tree.

The big historical steps are:

  1. MMX introduced packed integer SIMD on x86.
  2. 3DNow! gave AMD an early packed floating-point SIMD path.
  3. SSE introduced 128-bit XMM registers.
  4. SSE2 made 128-bit integer and double-precision SIMD central to x86-64.
  5. SSSE3 and SSE4.x added many practical media, text, and integer operations.
  6. AVX introduced 256-bit YMM registers and the VEX instruction encoding.
  7. AVX2 made 256-bit integer SIMD broadly useful.
  8. FMA and F16C improved numerical and conversion-heavy workloads.
  9. AVX-512 introduced 512-bit vectors, mask registers, and many specialized subsets.
  10. AMX added matrix/tile acceleration for AI workloads.
  11. AVX10 points toward a more unified future Intel vector ISA.
  12. ACE points toward future cross-vendor AI matrix acceleration.

For developers, the most important practical lesson is simple:

Do not choose a SIMD path from the processor name alone. Detect the actual instruction sets at runtime.

For broad compatibility, start with scalar and SSE2.

For modern performance, add AVX2.

For recent servers and high-end compute, add AVX-512.

For AI matrix workloads, consider AMX where available.

And for future Intel platforms, keep an eye on AVX10.

References