The original version of this article was written in 2000, when the practical SIMD landscape on x86 processors was still small enough to fit in a compact table.
At that time, the important questions were simple:
- Does the CPU support MMX?
- Does it support Extended MMX?
- Does it support SSE?
- Does it support SSE2?
- Does it support AMD 3DNow!?
That map was useful because the market was transitioning from scalar x86 code to the first generation of multimedia SIMD code.
Today the landscape is much larger.
Modern x86 CPUs may support MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, FMA, F16C, AVX-512, AVX-VNNI, AVX10, AMX, and many smaller specialized extensions. AMD also had its own historical path with 3DNow!, SSE4a, FMA4, XOP, and later AVX-512 support in Zen 4 and Zen 5.
This article updates the old map into a modern guide to SIMD instruction sets on Intel and AMD processors.
The goal is not to list every instruction mnemonic. The goal is to explain the major SIMD families, when they appeared, which processor generations support them, and what a programmer should check before using them.
Important Caveat: Always Check CPUID
Processor generation tables are useful, but they are not a replacement for runtime feature detection.
Different SKUs, steppings, BIOS settings, operating systems, virtual machines, and cloud instances may expose different instruction sets.
For production software, always check CPU capabilities at runtime using CPUID. For AVX, AVX2, AVX-512, and AVX10, also verify that the operating system supports saving and restoring the required extended register state. That usually means checking OSXSAVE and XGETBV, not only the raw CPU feature bit.
Use the tables in this article as a historical and practical map, not as the final authority for one specific machine.
What SIMD Means on x86
SIMD means:
Single Instruction, Multiple Data
A SIMD instruction applies one operation to several data elements packed into a vector register.
For example, instead of adding one pair of 32-bit integers:
a + b
a SIMD instruction can add several pairs at once:
a0 + b0
a1 + b1
a2 + b2
a3 + b3
...
The wider the vector register, the more elements can be processed by one instruction.
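As a minimal sketch in C, the four additions above can be expressed with a single SSE2 intrinsic. The helper name add4_i32 is an assumption made for this illustration; the intrinsics themselves come from the standard emmintrin.h header, and SSE2 needs no special compiler flags on x86-64.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Adds four pairs of 32-bit integers with one packed add (PADDD).
   add4_i32 is a hypothetical helper name used only for this example. */
void add4_i32(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
    assert(a && b && out);
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* a0 a1 a2 a3 */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);  /* b0 b1 b2 b3 */
    __m128i vsum = _mm_add_epi32(va, vb);              /* four adds at once */
    _mm_storeu_si128((__m128i *)out, vsum);
}
```

The unaligned load and store forms are used so the sketch works with any array address; aligned variants exist for data known to sit on 16-byte boundaries.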
A simplified map looks like this:
| SIMD family | Main register type | Register width | Typical use |
|---|---|---|---|
| MMX | mm0–mm7 | 64-bit | Packed integer multimedia |
| SSE | xmm0-… | 128-bit | Single-precision floating point, later integer and double |
| AVX | ymm0-… | 256-bit | Wider floating point |
| AVX2 | ymm0-… | 256-bit | Wider integer SIMD |
| AVX-512 | zmm0-… plus mask registers | 512-bit | HPC, AI, media, compression, crypto |
| AVX10 | versioned AVX-family ISA | 128/256/512-bit model | Future Intel converged vector ISA |
| AMX | tile registers | tile/matrix state | Matrix multiplication, AI acceleration |
MMX, SSE, AVX, and AVX-512 are SIMD instruction sets.
AMX is different. It is not traditional lane-based SIMD. It is a tile/matrix extension. It belongs in the modern performance map because it is part of the same trend: moving more data-parallel work directly into the CPU.
The Short Timeline
| Era | Intel | AMD | SIMD milestone |
|---|---|---|---|
| 1997 | Pentium MMX, Pentium II | K6 | MMX |
| 1998-1999 | Pentium III | K6-2, Athlon | SSE on Intel, 3DNow! on AMD |
| 2000-2001 | Pentium 4 | Athlon XP, Athlon 64 later | SSE2 starts becoming important |
| 2004-2006 | Prescott, Core, Core 2 | Athlon 64, Opteron | SSE3, SSSE3 |
| 2007-2009 | Penryn, Nehalem | Phenom, Phenom II | SSE4.1, SSE4.2, SSE4a on AMD |
| 2011 | Sandy Bridge | Bulldozer | AVX |
| 2012-2013 | Ivy Bridge, Haswell | Piledriver | F16C, FMA, AVX2 |
| 2015-2017 | Skylake, Skylake-X, Xeon Phi | Zen | AVX2 becomes mainstream; AVX-512 appears on Intel high-end/server |
| 2019-2021 | Ice Lake, Tiger Lake, Rocket Lake | Zen 2, Zen 3 | Wider AVX-512 coverage on some Intel CPUs |
| 2022-2024 | Sapphire Rapids, Meteor Lake, Arrow Lake, Xeon 6 | Zen 4 | Intel AMX; AMD adds AVX-512 in Zen 4 |
| 2024-2026 | Granite Rapids, Xeon 6, Core Ultra generations | Zen 5, EPYC Turin | AVX-512 expands on AMD; Intel moves toward AVX10 |
MMX
MMX was the first widely adopted SIMD extension on x86.
It introduced eight 64-bit registers:
mm0 mm1 mm2 mm3 mm4 mm5 mm6 mm7
These registers can hold packed integers:
| Data type | Elements per MMX register |
|---|---|
| 8-bit integers | 8 |
| 16-bit integers | 4 |
| 32-bit integers | 2 |
| 64-bit integer | 1 |
MMX was useful for:
- image processing;
- audio processing;
- video decoding;
- graphics effects;
- color conversion;
- packed integer arithmetic.
The main limitation of MMX is that it aliases the old x87 floating-point register stack. After using MMX, code must execute EMMS, or the _mm_empty() intrinsic in C/C++, before returning to x87 floating-point code.
Typical processors with MMX:
| Vendor | Processor generation |
|---|---|
| Intel | Pentium MMX, Pentium II, Celeron, Pentium III, Pentium 4 |
| AMD | K6, K6-2, K6-III, Athlon, Duron, Athlon XP, Athlon 64, Opteron |
MMX is now obsolete for new code, but it still matters when reading old multimedia libraries and early SIMD code.
AMD 3DNow!
3DNow! was AMD’s alternative SIMD extension introduced with the K6-2.
Unlike MMX, which was mainly integer-oriented, 3DNow! added packed single-precision floating-point operations using MMX registers.
This gave AMD a way to accelerate 3D games and multimedia software before SSE was widely available.
Main 3DNow! variants:
| Extension | Description |
|---|---|
| 3DNow! | Original AMD packed floating-point SIMD extension |
| Enhanced 3DNow! | Added more DSP/media-oriented instructions |
| 3DNow! Professional | Marketing name used around the Athlon XP era, including SSE support |
Typical AMD processors with 3DNow!:
| Processor generation | SIMD support |
|---|---|
| K6-2 | MMX, 3DNow! |
| K6-III | MMX, 3DNow! |
| Athlon | MMX, Extended MMX, Enhanced 3DNow! |
| Duron | MMX, Extended MMX, Enhanced 3DNow! |
| Athlon XP | MMX, Extended MMX, Enhanced 3DNow!, SSE |
3DNow! is no longer relevant for new code. Modern AMD processors support the Intel-compatible SSE/AVX instruction families instead.
SSE
SSE stands for:
Streaming SIMD Extensions
It was introduced by Intel with the Pentium III.
SSE added 128-bit XMM registers:
xmm0 xmm1 xmm2 xmm3 ...
The original SSE instruction set mainly focused on packed single-precision floating-point arithmetic.
One XMM register can hold four 32-bit floats:
float0 float1 float2 float3
That made SSE useful for:
- 3D graphics;
- geometry processing;
- game engines;
- audio processing;
- image filters;
- physics calculations.
SSE also added a few integer instructions that operate on MMX registers (the subset AMD exposed on the original Athlon as Extended MMX) and memory/cache control instructions, but its most important architectural change was the introduction of the XMM register file.
Typical first-generation SSE processors:
| Vendor | Processor generation |
|---|---|
| Intel | Pentium III, Celeron II |
| AMD | Athlon XP and later |
SSE2
SSE2 was introduced with the Intel Pentium 4 and became one of the most important x86 SIMD extensions ever.
SSE2 expanded XMM usage to include:
- packed double-precision floating-point;
- packed integer operations;
- 128-bit integer SIMD;
- better replacement paths for old MMX code.
One XMM register can hold:
| Data type | Elements per XMM register |
|---|---|
| 8-bit integers | 16 |
| 16-bit integers | 8 |
| 32-bit integers | 4 |
| 64-bit integers | 2 |
| 32-bit floats | 4 |
| 64-bit doubles | 2 |
SSE2 is especially important because it is part of the architectural baseline for x86-64. If you are writing 64-bit x86 code, you can assume SSE2 is available.
Typical processors with SSE2:
| Vendor | Processor generation |
|---|---|
| Intel | Pentium 4 and later |
| AMD | Athlon 64, Opteron, Sempron 64 and later |
For new code, SSE2 is a much better minimum target than MMX.
SSE3
SSE3 was introduced with the Prescott revision of the Pentium 4.
It added a small set of instructions, including:
- horizontal add/subtract operations;
- improved complex-number support;
- some thread synchronization and memory-related instructions.
SSE3 was not as large a jump as SSE2, but it filled practical gaps.
Typical processors with SSE3:
| Vendor | Processor generation |
|---|---|
| Intel | Pentium 4 Prescott, Pentium D, Core, Core 2 |
| AMD | Later Athlon 64, Opteron, Phenom and newer |
SSSE3
SSSE3 stands for:
Supplemental Streaming SIMD Extensions 3
Despite the similar name, SSSE3 is not the same as SSE3.
SSSE3 added very useful integer SIMD instructions, especially for media processing. The most famous is probably PSHUFB, a byte shuffle instruction that became extremely valuable for:
- codecs;
- text processing;
- lookup-table tricks;
- cryptography;
- pixel manipulation;
- byte rearrangement.
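To make the byte-shuffle idea concrete, here is a small sketch that reverses a 16-byte block with one PSHUFB. The function name reverse16_ssse3 is hypothetical, and the target attribute is a GCC/Clang feature assumed here so the example compiles without a global -mssse3 flag. The caller is responsible for confirming SSSE3 support first.

```c
#include <assert.h>
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3 intrinsics */

/* Reverses the 16 bytes of a block with a single PSHUFB.
   reverse16_ssse3 is a hypothetical helper name for this illustration.
   The target attribute enables SSSE3 for this function only, so callers
   must check for SSSE3 at runtime before calling it. */
__attribute__((target("ssse3")))
void reverse16_ssse3(const uint8_t in[16], uint8_t out[16])
{
    assert(in && out);
    /* Index byte i selects which source byte lands in output position i. */
    const __m128i idx = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                      7, 6, 5, 4, 3, 2, 1, 0);
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, _mm_shuffle_epi8(v, idx));
}
```

Changing only the index vector turns the same instruction into a byte swap, a channel reorder, or a 16-entry table lookup, which is why PSHUFB appears in so many different kinds of code.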
Typical processors with SSSE3:
| Vendor | Processor generation |
|---|---|
| Intel | Core 2 and later |
| AMD | Bobcat, Bulldozer and later, Zen and later |
AMD did not support SSSE3 in the K8 and K10 generations (Athlon 64, Phenom, Phenom II). It arrived with Bobcat and Bulldozer in 2011.
SSE4.1
SSE4.1 was introduced with Intel Penryn, the 45 nm Core 2 generation.
It added many practical instructions for integer and floating-point SIMD, including:
- packed blends;
- dot product;
- rounding;
- packed min/max improvements;
- integer widening and packing operations;
- better support for media and graphics workloads.
Typical processors with SSE4.1:
| Vendor | Processor generation |
|---|---|
| Intel | Penryn Core 2, Nehalem and later |
| AMD | Bulldozer and later, Zen and later |
SSE4.1 is very useful for image processing and video code because it reduces the number of instructions needed for common packed operations.
SSE4.2
SSE4.2 was introduced with Intel Nehalem.
It added:
- string and text comparison instructions;
- CRC32;
- 64-bit packed integer comparison (PCMPGTQ).
SSE4.2 is especially associated with faster string processing and checksum code.
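One detail worth knowing before using the CRC32 instruction: it implements the Castagnoli polynomial (CRC-32C), not the CRC-32 used by zlib and Ethernet. The following is a minimal byte-at-a-time sketch; the name crc32c_sse42 is an assumption for illustration, and the target attribute is GCC/Clang-specific. Production code would normally process 8 bytes per step with _mm_crc32_u64 on x86-64.

```c
#include <assert.h>
#include <nmmintrin.h>  /* SSE4.2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Computes CRC-32C (Castagnoli) one byte at a time with the SSE4.2
   CRC32 instruction. crc32c_sse42 is a hypothetical helper name.
   Callers must confirm SSE4.2 support before calling it. */
__attribute__((target("sse4.2")))
uint32_t crc32c_sse42(const void *data, size_t len)
{
    assert(data != NULL || len == 0);
    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;            /* conventional initial value */
    for (size_t i = 0; i < len; ++i)
        crc = _mm_crc32_u8(crc, p[i]);     /* one CRC32 instruction per byte */
    return crc ^ 0xFFFFFFFFu;              /* conventional final inversion */
}
```

With the usual initial value and final inversion shown here, the standard check string "123456789" produces 0xE3069283.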
Typical processors with SSE4.2:
| Vendor | Processor generation |
|---|---|
| Intel | Nehalem, Westmere and later |
| AMD | Bulldozer and later, Zen and later |
SSE4a
SSE4a is an AMD-specific extension.
Despite the name, it is not the same as Intel SSE4.1 or SSE4.2.
SSE4a appeared in AMD K10-family processors such as Phenom and Opteron Barcelona. It added a small number of instructions: scalar streaming (non-temporal) stores (MOVNTSS, MOVNTSD) and bit-field extract/insert operations (EXTRQ, INSERTQ).
Typical AMD processors with SSE4a:
| Vendor | Processor generation |
|---|---|
| AMD | K10 (Barcelona, Phenom, Phenom II, some Opteron generations) and later AMD families |
| Intel | Not supported |
Because SSE4a is AMD-specific and relatively small, portable software rarely targets it as a primary SIMD path.
AES-NI and PCLMULQDQ
AES-NI and PCLMULQDQ are not usually described as “SSE versions,” but they are closely related to the SIMD evolution of x86.
AES-NI accelerates AES encryption and decryption.
PCLMULQDQ performs carry-less multiplication, useful for cryptography, CRC algorithms, and Galois field arithmetic.
Typical support:
| Vendor | Processor generation |
|---|---|
| Intel | Westmere and later, broadly |
| AMD | Bulldozer and later, broadly in modern Ryzen and EPYC |
These instructions are very important for cryptographic performance.
AVX
AVX stands for:
Advanced Vector Extensions
AVX introduced 256-bit YMM registers.
Conceptually, each YMM register extends an XMM register:
xmm0 = lower 128 bits of ymm0
ymm0 = 256-bit register
AVX also introduced the VEX prefix, a new instruction encoding that allows three-operand non-destructive forms.
For example, older SSE code often overwrote one input:
a = a + b
AVX can encode:
c = a + b
This is a major improvement for register allocation and instruction scheduling.
AVX mainly extended floating-point SIMD to 256 bits:
| Data type | Elements per YMM register |
|---|---|
| 32-bit floats | 8 |
| 64-bit doubles | 4 |
Typical processors with AVX:
| Vendor | Processor generation |
|---|---|
| Intel | Sandy Bridge and later |
| AMD | Bulldozer and later, Zen and later |
Important note: AVX requires operating system support for saving and restoring YMM state. Runtime detection must check both CPU support and OS support.
F16C
F16C added conversion instructions between 16-bit half-precision floating-point and 32-bit single-precision floating-point.
It is not a full half-precision arithmetic extension. It is mainly a conversion extension.
Typical processors with F16C:
| Vendor | Processor generation |
|---|---|
| Intel | Ivy Bridge and later |
| AMD | Piledriver and later, Zen and later |
F16C became important for graphics, machine learning data conversion, and storage of compact floating-point data.
FMA3
FMA means:
Fused Multiply-Add
A fused multiply-add computes:
a * b + c
as one fused operation.
FMA improves both performance and numerical behavior for many floating-point workloads.
FMA3 is the dominant x86 FMA form today.
Typical processors with FMA3:
| Vendor | Processor generation |
|---|---|
| Intel | Haswell and later |
| AMD | Piledriver and later, Zen and later |
FMA3 is very important for:
- matrix multiplication;
- DSP;
- physics;
- machine learning;
- linear algebra;
- scientific computing.
FMA4
FMA4 was an AMD extension introduced around the Bulldozer era.
It used a four-operand form, which was elegant from a programmer’s point of view, but it did not become the long-term x86 standard.
Typical processors with FMA4:
| Vendor | Processor generation |
|---|---|
| AMD | Bulldozer-family processors |
| Intel | Not supported |
FMA4 is now considered a historical AMD-specific path. New software should use FMA3.
XOP
XOP was another AMD-specific SIMD extension from the Bulldozer era.
It added integer vector operations, including permutes, shifts, comparisons, and multiply-accumulate style operations.
Typical processors with XOP:
| Vendor | Processor generation |
|---|---|
| AMD | Bulldozer-family processors |
| Intel | Not supported |
XOP did not become a portable x86 SIMD target. AMD Zen processors do not support it.
AVX2
AVX2 was introduced by Intel Haswell.
It extended most 128-bit integer SIMD operations from SSE2/SSSE3/SSE4 to 256-bit YMM registers.
This was a major milestone because AVX1 was mainly about floating-point SIMD, while AVX2 made 256-bit integer SIMD broadly useful.
AVX2 added or expanded support for:
- 256-bit integer arithmetic;
- 256-bit logical operations;
- packed shifts;
- packed compares;
- gathers;
- wider byte/word/dword processing;
- stronger general-purpose SIMD support.
Typical processors with AVX2:
| Vendor | Processor generation |
|---|---|
| Intel | Haswell, Broadwell, Skylake and later |
| AMD | Excavator, Zen and later |
AVX2 is currently one of the best practical SIMD targets for portable high-performance x86 code because it is widely available across modern Intel and AMD processors.
AVX-VNNI
AVX-VNNI brings Vector Neural Network Instructions to AVX-width encodings, without requiring full AVX-512.
VNNI-style instructions are useful for integer dot products, especially in neural-network inference.
There are two related ideas:
| Extension | Meaning |
|---|---|
| AVX-512 VNNI | VNNI instructions in the AVX-512 family |
| AVX-VNNI | VNNI-style instructions available without requiring the full AVX-512 programming model |
AVX-VNNI matters because Intel client CPUs after the removal of client AVX-512 still needed efficient AI/inference instructions.
Typical support appears in newer Intel client and server generations, but exact support should be checked with CPUID.
AVX-IFMA
AVX-IFMA provides integer fused multiply-add style operations outside the older AVX-512-only naming path.
It is useful for big-number arithmetic, cryptography, and workloads that benefit from packed integer multiply-add operations.
As with other newer AVX-family subsets, support is generation- and SKU-dependent. Check CPUID.
AVX-VNNI-INT8 and AVX-VNNI-INT16
These are newer AVX-family extensions focused on integer dot-product operations for low-precision AI and inference workloads.
They target common neural-network data types:
| Extension | Main focus |
|---|---|
| AVX-VNNI-INT8 | 8-bit integer neural-network operations |
| AVX-VNNI-INT16 | 16-bit integer neural-network operations |
These are part of the modern trend toward CPU-side AI acceleration without requiring every workload to move to a GPU or NPU.
AVX-NE-CONVERT
AVX-NE-CONVERT is another newer extension aimed at efficient conversion involving low-precision numerical formats.
This belongs to the same broad family of AI-oriented CPU instructions as VNNI, BF16, FP16, and future AVX10/ACE work.
For portable software, this is not a baseline assumption. It is a specialized path selected by CPUID.
AVX-512
AVX-512 is not a single instruction set in the same simple sense as SSE2 or AVX2.
It is a family of extensions based around 512-bit ZMM registers and mask registers.
A 512-bit ZMM register can hold:
| Data type | Elements per ZMM register |
|---|---|
| 8-bit integers | 64 |
| 16-bit integers | 32 |
| 32-bit integers | 16 |
| 64-bit integers | 8 |
| 32-bit floats | 16 |
| 64-bit doubles | 8 |
AVX-512 also introduced mask registers:
k0 k1 k2 k3 k4 k5 k6 k7
These mask registers allow predicated operations, meaning each lane can be conditionally written without needing separate blend instructions.
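The sketch below illustrates merge-masking with a 16-lane integer add. Both function names are hypothetical. The scalar version is only a reference model of the semantics; the AVX-512 version uses the _mm512_mask_add_epi32 intrinsic and relies on the GCC/Clang target attribute, so it must only be called after a runtime AVX-512F check.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Scalar reference model of merge-masking: lane i is written only when
   bit i of the mask is set; other lanes keep their previous value. */
void masked_add16_scalar(int32_t dst[16], const int32_t a[16],
                         const int32_t b[16], uint16_t mask)
{
    assert(dst && a && b);
    for (int i = 0; i < 16; ++i)
        if ((mask >> i) & 1u)
            dst[i] = a[i] + b[i];
}

/* Same operation as one masked AVX-512 add. Hypothetical helper name;
   callers must confirm AVX-512F (CPU and OS support) before calling. */
__attribute__((target("avx512f")))
void masked_add16_avx512(int32_t dst[16], const int32_t a[16],
                         const int32_t b[16], uint16_t mask)
{
    assert(dst && a && b);
    __m512i vdst = _mm512_loadu_si512(dst);
    __m512i va = _mm512_loadu_si512(a);
    __m512i vb = _mm512_loadu_si512(b);
    /* Lanes with a clear mask bit are copied from vdst unchanged. */
    vdst = _mm512_mask_add_epi32(vdst, (__mmask16)mask, va, vb);
    _mm512_storeu_si512(dst, vdst);
}
```

On pre-AVX-512 instruction sets the same effect usually takes a full-width add followed by a separate blend.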
AVX-512 is important for:
- high-performance computing;
- scientific simulations;
- AI inference;
- data compression;
- database processing;
- media processing;
- cryptography;
- genomics;
- large-scale analytics.
AVX-512 Variants
AVX-512 is made of many feature subsets. A CPU can support some subsets and not others.
This is one of the reasons AVX-512 software often needs careful runtime dispatch.
Core AVX-512 subsets
| Extension | Description |
|---|---|
| AVX-512F | Foundation; required base for AVX-512 implementations |
| AVX-512CD | Conflict detection |
| AVX-512ER | Exponential and reciprocal instructions, mainly Xeon Phi |
| AVX-512PF | Prefetch instructions, mainly Xeon Phi |
| AVX-512DQ | Doubleword and quadword instructions |
| AVX-512BW | Byte and word instructions |
| AVX-512VL | Allows many AVX-512 instructions to operate on 128-bit and 256-bit vectors |
Integer, byte, and bit manipulation subsets
| Extension | Description |
|---|---|
| AVX-512IFMA | Integer fused multiply-add |
| AVX-512VBMI | Vector byte manipulation |
| AVX-512VBMI2 | Additional byte/word manipulation |
| AVX-512BITALG | Bit algorithms |
| AVX-512VPOPCNTDQ | Vector population count for doubleword/quadword elements |
AI and numerical-format subsets
| Extension | Description |
|---|---|
| AVX-512VNNI | Vector neural-network instructions |
| AVX-512BF16 | bfloat16 dot-product and conversion support |
| AVX-512FP16 | Half-precision floating-point arithmetic |
Crypto and Galois-field related vector extensions
| Extension | Description |
|---|---|
| VAES | Vector AES |
| VPCLMULQDQ | Vector carry-less multiply |
| GFNI | Galois Field New Instructions |
These may be available with AVX, AVX2, or AVX-512 encodings depending on the processor.
Specialized or limited AVX-512 subsets
| Extension | Description |
|---|---|
| AVX-5124VNNIW | Xeon Phi-oriented neural-network instructions |
| AVX-5124FMAPS | Xeon Phi-oriented fused multiply-accumulate instructions |
| AVX-512VP2INTERSECT | Vector pair intersection |
Some of these appeared only in narrow product families or were not widely adopted.
AVX-512 on Intel CPUs
Intel introduced AVX-512 first in Xeon Phi and then in Xeon server and high-end desktop processors.
A simplified Intel AVX-512 map:
| Intel generation | AVX-512 status |
|---|---|
| Xeon Phi Knights Landing | Early AVX-512: F, CD, ER, PF |
| Xeon Phi Knights Mill | Added specialized AI/HPC subsets such as 4VNNIW and 4FMAPS |
| Skylake-X / Skylake-SP | AVX-512F, CD, BW, DQ, VL |
| Cascade Lake Xeon | Added AVX-512VNNI |
| Cooper Lake Xeon | Added AVX-512BF16 |
| Ice Lake client/server | Broader AVX-512 support including VNNI and byte/bit manipulation subsets |
| Tiger Lake | Client AVX-512 on many SKUs |
| Rocket Lake | Desktop AVX-512 on supported SKUs |
| Alder Lake and later mainstream hybrid client CPUs | AVX-512 not officially supported |
| Sapphire Rapids Xeon | AVX-512 plus AMX, BF16, FP16 support |
| Emerald Rapids Xeon | Similar server-class AVX-512/AMX direction |
| Granite Rapids / Xeon 6 P-core | Server/workstation AVX-512 and transition toward AVX10 |
| Xeon 6 E-core-only lines | AVX-512 support is not the same as P-core Xeon; check SKU documentation |
The most important practical point is that Intel client CPUs and Intel server CPUs diverged.
For several years, high-end Intel servers had strong AVX-512 support while many mainstream client CPUs did not.
AVX-512 on AMD CPUs
AMD did not support AVX-512 in Zen, Zen+, Zen 2, or Zen 3.
AMD added AVX-512 support with Zen 4.
A simplified AMD AVX-512 map:
| AMD generation | AVX-512 status |
|---|---|
| Zen / Zen+ | No AVX-512 |
| Zen 2 | No AVX-512 |
| Zen 3 | No AVX-512 |
| Zen 4 | AVX-512 support added |
| Zen 5 / EPYC Turin | AVX-512 support continues and becomes stronger |
AMD Zen 4 processors support a practical subset of AVX-512 that includes important features such as AVX-512F, DQ, IFMA, CD, BW, VL, VBMI, VNNI, BITALG, VPOPCNTDQ, BF16, and related vector crypto/Galois-field extensions depending on the exact model.
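Zen 5, including 5th Gen AMD EPYC Turin, continues the AVX-512 direction. Desktop and server Zen 5 parts execute 512-bit operations on a full-width datapath, whereas Zen 4 split them across 256-bit units. This makes Zen 5 especially relevant for HPC, AI, analytics, and cloud workloads.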
The important lesson is that AVX-512 is no longer Intel-only. Modern portable high-performance code should consider AVX-512 dispatch paths for both recent Intel server CPUs and recent AMD Zen 4 / Zen 5 CPUs.
AVX10
AVX10 is Intel’s attempt to simplify the future of the AVX family.
The problem with AVX-512 is fragmentation. There are many feature bits, and software has to check which subset is available.
AVX10 moves toward a versioned model.
Instead of thinking only in terms of dozens of separate AVX-512 feature flags, AVX10 introduces an AVX10 version number.
The idea is:
AVX10.1
AVX10.2
future AVX10 versions
A later version is expected to include the earlier version’s capabilities.
AVX10 is also designed to make the AVX-512-style programming model available across future Intel P-core and E-core processors.
Important AVX10 points:
- AVX10 is based on the AVX-512 programming model.
- It uses versioned enumeration.
- It is intended to reduce feature-detection complexity.
- Future new Intel vector instructions are expected to be enumerated under AVX10 rather than by adding more AVX-512 feature flags.
- AVX10.1 is a transition version.
- AVX10.2 adds new instructions, including AI data type and conversion support.
For developers, AVX10 is important because it points to the future direction of Intel vector programming.
However, for software that must run on existing machines, AVX2 and AVX-512 runtime dispatch remain essential.
AMX
AMX stands for:
Advanced Matrix Extensions
AMX is not traditional SIMD. It introduces tile registers and tile operations designed for matrix multiplication and AI workloads.
Important AMX subsets include:
| Extension | Description |
|---|---|
| AMX-TILE | Tile register architecture |
| AMX-INT8 | 8-bit integer matrix operations |
| AMX-BF16 | bfloat16 matrix operations |
| AMX-FP16 | FP16 matrix operations on newer/future processors |
| AMX-COMPLEX | Complex-number tile operations on newer/future processors |
AMX is especially relevant for:
- neural-network inference;
- matrix multiplication;
- deep learning kernels;
- server-side AI;
- dense numerical compute.
Intel introduced AMX in Sapphire Rapids Xeon processors.
AMX requires operating system support because it adds new architectural state. Like AVX and AVX-512, detecting the CPU feature alone is not enough; the OS must support saving and restoring the state.
ACE: AI Compute Extensions
ACE, or AI Compute Extensions, is a newer x86 ecosystem direction jointly associated with Intel and AMD.
ACE is focused on AI-oriented matrix acceleration and reduced-precision numerical formats. It is intended to provide a more consistent cross-vendor target for future x86 AI workloads.
ACE is not a replacement for SSE, AVX, or AVX-512. It belongs to the same broader evolution: the CPU is gaining more native support for dense data-parallel and matrix-heavy workloads.
For the purpose of a SIMD map, ACE should be considered a future-facing matrix/AI extension rather than a classic lane-based SIMD family.
Intel Processor Generations and SIMD Support
The following table summarizes major Intel generations and their most important SIMD support.
This table is intentionally practical rather than exhaustive. Some product lines, steppings, low-end SKUs, embedded parts, and disabled features differ.
| Intel processor generation | Approx. era | Important SIMD support |
|---|---|---|
| Pentium | 1993 | No MMX in original Pentium |
| Pentium MMX | 1997 | MMX |
| Pentium II | 1997 | MMX |
| Original Celeron | 1998 | MMX |
| Pentium III | 1999 | MMX, SSE |
| Celeron II | 2000 | MMX, SSE |
| Pentium 4 Willamette / Northwood | 2000-2002 | MMX, SSE, SSE2 |
| Pentium 4 Prescott | 2004 | MMX, SSE, SSE2, SSE3 |
| Pentium D | 2005 | SSE2, SSE3 |
| Pentium M | 2003-2005 | MMX, SSE, SSE2; SSE3 arrived with the Yonah-based Core Solo / Core Duo |
| Core Solo / Core Duo | 2006 | SSE2, SSE3 |
| Core 2 Merom / Conroe | 2006 | SSE2, SSE3, SSSE3 |
| Core 2 Penryn | 2007 | SSSE3, SSE4.1 |
| Nehalem | 2008 | SSE4.1, SSE4.2 |
| Westmere | 2010 | SSE4.2, AES-NI, PCLMULQDQ |
| Sandy Bridge | 2011 | AVX |
| Ivy Bridge | 2012 | AVX, F16C |
| Haswell | 2013 | AVX2, FMA3 |
| Broadwell | 2014-2015 | AVX2, FMA3 |
| Skylake client | 2015 | AVX2, FMA3 |
| Skylake-X / Skylake-SP | 2017 | AVX-512F, CD, BW, DQ, VL |
| Kaby Lake / Coffee Lake / Comet Lake | 2016-2020 | AVX2, FMA3 |
| Cannon Lake | 2018 | AVX-512 on limited client products |
| Cascade Lake Xeon | 2019 | AVX-512VNNI |
| Cooper Lake Xeon | 2020 | AVX-512BF16 |
| Ice Lake client/server | 2019-2021 | Broad AVX-512 support on many SKUs |
| Tiger Lake | 2020 | AVX-512 on many client SKUs |
| Rocket Lake | 2021 | AVX-512 on supported desktop SKUs |
| Alder Lake | 2021 | AVX2, FMA3, AVX-VNNI on many SKUs; AVX-512 not officially supported |
| Raptor Lake | 2022-2023 | AVX2, FMA3, AVX-VNNI on many SKUs; no official AVX-512 |
| Sapphire Rapids Xeon | 2023 | AVX-512, BF16, FP16, AMX |
| Emerald Rapids Xeon | 2023 | AVX-512, AMX server-class support |
| Meteor Lake / Core Ultra Series 1 | 2023-2024 | AVX2/FMA-class client SIMD; no mainstream AVX-512 |
| Sierra Forest Xeon 6 E-core | 2024 | E-core server line; AVX-512 support differs from P-core Xeon |
| Granite Rapids Xeon 6 P-core | 2024-2025 | AVX-512, AMX, AVX10 transition generation |
| Arrow Lake / Core Ultra 200 | 2024-2025 | AVX2/FMA-class client SIMD; no mainstream AVX-512 |
| Lunar Lake / Core Ultra 200V | 2024-2025 | AVX2/FMA-class client SIMD; AI acceleration also via NPU |
| Xeon 600 workstation / Xeon 6 workstation | 2026 | Server/workstation-class AVX-512 and AMX on P-core products |
| Panther Lake / Core Ultra Series 3 | 2026 generation | Check final SKU documentation; client hybrid direction, not a simple AVX-512 baseline |
AMD Processor Generations and SIMD Support
The following table summarizes major AMD generations and their most important SIMD support.
Again, check CPUID for exact systems.
| AMD processor generation | Approx. era | Important SIMD support |
|---|---|---|
| K5 | 1996 | No MMX baseline |
| K6 | 1997 | MMX |
| K6-2 | 1998 | MMX, 3DNow! |
| K6-III | 1999 | MMX, 3DNow! |
| Athlon | 1999 | MMX, Enhanced 3DNow!, Extended MMX |
| Duron | 2000 | MMX, Enhanced 3DNow!, Extended MMX |
| Athlon XP | 2001 | MMX, Enhanced 3DNow!, SSE |
| Athlon 64 / Opteron | 2003 | SSE, SSE2; later revisions added SSE3 |
| Sempron 64 | 2004 | SSE, SSE2; later SSE3 depending on model |
| K10 / Barcelona / Phenom | 2007 | SSE3, SSE4a, 3DNow! legacy support |
| Phenom II / Athlon II | 2008-2009 | SSE3, SSE4a |
| Bobcat | 2011 | SSE2/SSE3/SSSE3-class low-power support depending on model |
| Bulldozer | 2011 | AVX, SSE4.1, SSE4.2, XOP, FMA4 |
| Piledriver | 2012 | AVX, FMA3, FMA4, F16C, XOP |
| Steamroller | 2014 | AVX/FMA-class Bulldozer-family SIMD |
| Excavator | 2015 | AVX2 support in some products |
| Jaguar / Puma | 2013-2014 | SSE4.x/AVX-class low-power SIMD, depending on model |
| Zen | 2017 | SSE4.2, AVX, AVX2, FMA3, AES, F16C |
| Zen+ | 2018 | Similar to Zen |
| Zen 2 | 2019 | AVX2, FMA3; no AVX-512 |
| Zen 3 | 2020 | AVX2, FMA3; no AVX-512 |
| Zen 4 | 2022 | AVX-512 support added |
| Zen 4c | 2023 | AVX-512 support in dense-core server/client variants where exposed |
| Zen 5 | 2024 | AVX-512 support continues; stronger vector capability |
| 5th Gen EPYC Turin | 2024-2026 | AVX-512, including BF16 support; exact subsets depend on SKU/platform |
| Future Zen generations | 2026+ | Check AMD documentation and CPUID; ACE/AVX10 ecosystem direction may matter in future |
Practical SIMD Baselines for Software
The best SIMD target depends on what kind of software you are writing.
If you need maximum compatibility
Use scalar code plus SSE2.
SSE2 is a safe baseline for x86-64 and works on a very wide range of Intel and AMD processors.
Good for:
- general-purpose libraries;
- small utilities;
- long-tail compatibility;
- software that must run on old machines.
If you target reasonably modern desktops and laptops
Use AVX2 plus a fallback.
AVX2 is a strong practical baseline for many modern machines from the last decade.
Good for:
- image processing;
- compression;
- video processing;
- game engines;
- data scanning;
- numerical kernels;
- DSP;
- high-performance C/C++ libraries.
If you target recent servers or high-performance workstations
Use AVX-512 with runtime dispatch.
AVX-512 can provide significant gains, but support varies across generations and vendors.
Good for:
- HPC;
- machine learning;
- analytics;
- genomics;
- cryptography;
- compression;
- vectorized database operations;
- scientific computing.
If you target AI matrix workloads on recent Intel Xeon
Consider AMX.
AMX is not a general replacement for AVX-512. It is especially useful for matrix multiplication and deep learning kernels.
Good for:
- neural-network inference;
- BF16 matrix multiplication;
- INT8 inference;
- server-side AI.
If you target future Intel vector code
Track AVX10.
AVX10 is designed to simplify the future AVX programming model, but existing deployed systems still require AVX2 and AVX-512 dispatch paths.
Why Width Is Not Everything
It is tempting to rank SIMD instruction sets only by vector width:
MMX = 64-bit
SSE = 128-bit
AVX2 = 256-bit
AVX-512 = 512-bit
But real performance depends on much more than width.
Important factors include:
- instruction latency;
- instruction throughput;
- number of execution ports;
- load/store bandwidth;
- cache behavior;
- memory alignment;
- downclocking behavior;
- register pressure;
- compiler quality;
- data layout;
- branch behavior;
- whether the workload is compute-bound or memory-bound.
A 512-bit instruction is not automatically twice as fast as a 256-bit instruction. If the workload is memory-bound, wider vectors may not help much. If the CPU executes a 512-bit operation internally as multiple narrower operations, peak throughput may be different from the architectural width.
Always benchmark on the target CPU.
Data Layout Matters
SIMD code works best when data is arranged in a vector-friendly layout.
For example, consider pixels stored as:
RGB RGB RGB RGB
This is convenient for scalar code, but it can be awkward for SIMD code if you want to process all red values together, then all green values, then all blue values.
A SIMD-friendly layout may look like this:
RRRR GGGG BBBB
or use separate arrays:
R[] G[] B[]
This is the classic difference between:
Array of Structures
and:
Structure of Arrays
Instruction set support matters, but data layout often matters just as much.
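The two layouts can be sketched in plain C as follows. All type and function names here (PixelAoS, ImageSoA, aos_to_soa, brighten_red) are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Array of Structures: channels interleaved in memory (RGB RGB RGB ...). */
typedef struct {
    uint8_t r, g, b;
} PixelAoS;

/* Structure of Arrays: each channel is contiguous (RRRR... GGGG... BBBB...).
   The caller owns the three buffers, each holding `count` bytes. */
typedef struct {
    uint8_t *r;
    uint8_t *g;
    uint8_t *b;
    size_t count;
} ImageSoA;

/* Splits interleaved pixels into separate channel arrays. */
void aos_to_soa(const PixelAoS *src, ImageSoA *dst)
{
    assert(src && dst && dst->r && dst->g && dst->b);
    for (size_t i = 0; i < dst->count; ++i) {
        dst->r[i] = src[i].r;
        dst->g[i] = src[i].g;
        dst->b[i] = src[i].b;
    }
}

/* Adds `amount` to every red value, saturating at 255. The loop touches
   one contiguous byte array, which is the access pattern that packed
   byte instructions and auto-vectorizers handle well. */
void brighten_red(ImageSoA *img, uint8_t amount)
{
    assert(img && img->r);
    for (size_t i = 0; i < img->count; ++i) {
        unsigned v = (unsigned)img->r[i] + amount;
        img->r[i] = (uint8_t)(v > 255u ? 255u : v);
    }
}
```

With the interleaved layout, the same red-only operation would read every third byte, which typically forces extra shuffle work before any packed arithmetic can start.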
Runtime Dispatch Strategy
A good modern x86 library often contains multiple implementations of the same hot loop.
For example:
scalar
SSE2
SSSE3 or SSE4.1
AVX2
AVX-512
AMX or specialized AI path
At startup or first use, the library checks CPU features and selects the best implementation.
A simplified dispatch order might be:
    if AMX is available and the workload is matrix-heavy:
        use AMX path
    else if a suitable AVX-512 subset is available:
        use AVX-512 path
    else if AVX2 and FMA are available:
        use AVX2/FMA path
    else if SSE4.1 is available:
        use SSE4.1 path
    else if SSSE3 is available:
        use SSSE3 path
    else if SSE2 is available:
        use SSE2 path
    else:
        use scalar path
For AVX and later, remember that CPU support alone is not enough. The operating system must support the extended register state.
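A two-tier version of this dispatch can be sketched in C with a function pointer. The sketch assumes GCC or Clang on x86, since it relies on their `__builtin_cpu_supports` builtin and `target` function attribute; on other compilers or architectures it falls back to the scalar path. The names sum_scalar, sum_avx2, select_sum, and sum_u32 are hypothetical, and further tiers (SSE4.1, AVX-512, and so on) would follow the same pattern.

```c
#include <stddef.h>
#include <stdint.h>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1
#endif

typedef uint32_t (*sum_fn)(const uint32_t *, size_t);

/* Portable fallback; unsigned arithmetic wraps on overflow. */
static uint32_t sum_scalar(const uint32_t *v, size_t n) {
    uint32_t s = 0;
    for (size_t i = 0; i < n; ++i) s += v[i];
    return s;
}

#ifdef HAVE_X86_DISPATCH
/* Compiled for AVX2 even if the rest of the file targets baseline x86. */
__attribute__((target("avx2")))
static uint32_t sum_avx2(const uint32_t *v, size_t n) {
    __m256i acc = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)  /* eight 32-bit lanes per iteration */
        acc = _mm256_add_epi32(acc,
                  _mm256_loadu_si256((const __m256i *)(v + i)));
    uint32_t lanes[8];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    uint32_t s = 0;
    for (int k = 0; k < 8; ++k) s += lanes[k];
    for (; i < n; ++i) s += v[i];  /* scalar tail */
    return s;
}
#endif

/* Pick the best implementation the running CPU reports. */
static sum_fn select_sum(void) {
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return sum_avx2;
#endif
    return sum_scalar;
}

/* Public entry point: resolves once, on first use.
 * Not thread-safe as written; a real library would guard this. */
static uint32_t sum_u32(const uint32_t *v, size_t n) {
    static sum_fn impl = NULL;
    if (!impl) impl = select_sum();
    return impl(v, n);
}
```

Callers only ever see sum_u32, so adding or removing a tier does not change the public interface. Resolving the pointer once keeps the feature check out of the hot loop.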
Recommended Feature Checks
For old SSE code, checking the CPUID feature bit is usually enough.
For AVX and later, check both the CPU and the OS.
A practical checklist:
| Feature family | What to check |
|---|---|
| MMX | CPUID MMX |
| SSE | CPUID SSE |
| SSE2 | CPUID SSE2 |
| SSE3/SSSE3/SSE4.x | CPUID feature bits |
| AVX | CPUID AVX, OSXSAVE, XGETBV for XMM/YMM state (XCR0 bits 1 and 2) |
| AVX2 | AVX checks plus CPUID AVX2 |
| FMA/F16C | AVX checks plus CPUID FMA/F16C |
| AVX-512 | AVX checks plus CPUID AVX-512 bits plus XGETBV for opmask/ZMM state (XCR0 bits 5 to 7) |
| AMX | CPUID AMX bits plus OS support for tile state (XCR0 bits 17 and 18); Linux also requires a per-process permission request |
| AVX10 | CPUID AVX10 bit and AVX10 version enumeration, plus the same OS state checks as AVX/AVX-512 |
Do not assume that a CPU family name is enough.
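The AVX, AVX2, and AVX-512F rows of the checklist can be sketched in C as below. This is a minimal sketch assuming GCC or Clang on x86, using `__get_cpuid` and `__get_cpuid_count` from `<cpuid.h>` and inline assembly for XGETBV; on other targets it reports no support. The SimdSupport struct and the function names are hypothetical, and a real AVX-512 path would also test the specific subsets it uses (for example BW, DQ, VL).

```c
#include <stdint.h>

typedef struct { int avx, avx2, avx512f; } SimdSupport;

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>

/* Read XCR0 with XGETBV (ECX = 0). Only call when OSXSAVE is set. */
static uint64_t read_xcr0(void) {
    uint32_t lo, hi;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return ((uint64_t)hi << 32) | lo;
}

static SimdSupport detect_simd(void) {
    SimdSupport s = {0, 0, 0};
    unsigned eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return s;
    int osxsave = (ecx >> 27) & 1;  /* OS has enabled XSAVE/XGETBV */
    int cpu_avx = (ecx >> 28) & 1;  /* CPU implements AVX */
    if (!osxsave || !cpu_avx) return s;

    uint64_t xcr0 = read_xcr0();
    if ((xcr0 & 0x6) != 0x6) return s;  /* XMM (bit 1) and YMM (bit 2) */
    s.avx = 1;

    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        s.avx2 = (ebx >> 5) & 1;
        /* opmask (bit 5), ZMM_Hi256 (bit 6), Hi16_ZMM (bit 7) */
        int zmm_state = (xcr0 & 0xE0) == 0xE0;
        s.avx512f = ((ebx >> 16) & 1) && zmm_state;
    }
    return s;
}
#else
/* Non-x86 or unsupported compiler: report nothing. */
static SimdSupport detect_simd(void) {
    SimdSupport s = {0, 0, 0};
    return s;
}
#endif
```

Note the ordering: XGETBV is only executed after CPUID confirms OSXSAVE, because the instruction faults when the OS has not enabled it.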
Complete SIMD Family Map
The following table summarizes the major x86 SIMD and SIMD-adjacent instruction-set families.
| Instruction set | Vendor origin | Register width/model | Main purpose | Modern relevance |
|---|---|---|---|---|
| MMX | Intel | 64-bit MMX | Packed integer multimedia | Legacy |
| Extended MMX / MMX+ | AMD / SSE-era | 64-bit MMX | Extra integer/media operations | Legacy |
| 3DNow! | AMD | 64-bit MMX | Packed floating-point | Obsolete |
| Enhanced 3DNow! | AMD | 64-bit MMX | More media/DSP operations | Obsolete |
| SSE | Intel | 128-bit XMM | Packed single-precision FP | Historical baseline |
| SSE2 | Intel | 128-bit XMM | Integer and double FP SIMD | x86-64 baseline |
| SSE3 | Intel | 128-bit XMM | Horizontal ops, complex arithmetic helpers | Common |
| SSSE3 | Intel | 128-bit XMM | Byte shuffle, integer media ops | Very useful |
| SSE4a | AMD | 128-bit XMM | Small AMD-specific extension | Niche |
| SSE4.1 | Intel | 128-bit XMM | Blends, dot product, rounding, media ops | Useful |
| SSE4.2 | Intel | 128-bit XMM | Text/string compare, CRC32 | Common |
| AES-NI | Intel | XMM-based | AES crypto acceleration | Very important |
| PCLMULQDQ | Intel | XMM-based | Carry-less multiply | Very important |
| AVX | Intel | 256-bit YMM | Wider FP SIMD, VEX encoding | Common |
| F16C | Intel | XMM/YMM | FP16/FP32 conversion | Common |
| FMA3 | Intel/AMD | XMM/YMM | Fused multiply-add | Common |
| FMA4 | AMD | XMM/YMM | Four-operand FMA | Historical |
| XOP | AMD | XMM/YMM | AMD-specific vector ops | Historical |
| AVX2 | Intel | 256-bit YMM | 256-bit integer SIMD | Modern baseline |
| AVX-VNNI | Intel | XMM/YMM | Neural-network dot products | Newer client/server |
| AVX-IFMA | Intel | XMM/YMM | Integer multiply-add | Specialized |
| AVX-VNNI-INT8 | Intel | XMM/YMM | INT8 AI operations | New/future-facing |
| AVX-VNNI-INT16 | Intel | XMM/YMM | INT16 AI operations | New/future-facing |
| AVX-NE-CONVERT | Intel | XMM/YMM | Low-precision conversion | New/future-facing |
| AVX-512F | Intel | 512-bit ZMM | AVX-512 foundation | Server/HPC/AI |
| AVX-512CD | Intel | 512-bit ZMM | Conflict detection | Server/HPC |
| AVX-512ER | Intel | 512-bit ZMM | Exp/reciprocal, Xeon Phi | Limited |
| AVX-512PF | Intel | 512-bit ZMM | Prefetch, Xeon Phi | Limited |
| AVX-512DQ | Intel | 512-bit ZMM | Dword/qword operations | Common AVX-512 subset |
| AVX-512BW | Intel | 512-bit ZMM | Byte/word operations | Common AVX-512 subset |
| AVX-512VL | Intel | XMM/YMM (128/256-bit) | 128/256-bit forms of AVX-512 instructions | Important |
| AVX-512IFMA | Intel | 512-bit ZMM | Integer fused multiply-add | Crypto/bignum |
| AVX-512VBMI | Intel | 512-bit ZMM | Byte manipulation | Text/media |
| AVX-512VBMI2 | Intel | 512-bit ZMM | More byte/word manipulation | Text/media |
| AVX-512VNNI | Intel | 512-bit ZMM | Neural-network inference | AI |
| AVX-512BITALG | Intel | 512-bit ZMM | Bit algorithms | Specialized |
| AVX-512VPOPCNTDQ | Intel | 512-bit ZMM | Vector popcount | Data/search/analytics |
| AVX-512BF16 | Intel/AMD | 512-bit ZMM | bfloat16 operations | AI |
| AVX-512FP16 | Intel | 512-bit ZMM | FP16 arithmetic | AI/HPC/media |
| AVX-512VP2INTERSECT | Intel | 512-bit ZMM | Vector pair intersection | Limited |
| AVX-5124VNNIW | Intel | 512-bit ZMM | Xeon Phi AI instructions | Historical/limited |
| AVX-5124FMAPS | Intel | 512-bit ZMM | Xeon Phi FMA instructions | Historical/limited |
| VAES | Intel/AMD | XMM/YMM/ZMM depending on CPU | Vector AES | Crypto |
| VPCLMULQDQ | Intel/AMD | XMM/YMM/ZMM depending on CPU | Vector carry-less multiply | Crypto |
| GFNI | Intel/AMD | XMM/YMM/ZMM depending on CPU | Galois-field operations | Crypto/coding |
| AVX10.1 | Intel | AVX-512-style versioned ISA | Transition from AVX-512 | Future/current transition |
| AVX10.2 | Intel | AVX10 versioned ISA | New AI/data movement/conversion ops | Future-facing |
| AMX-TILE | Intel | Tile state | Matrix/tile base | AI/server |
| AMX-INT8 | Intel | Tile state | INT8 matrix operations | AI/server |
| AMX-BF16 | Intel | Tile state | BF16 matrix operations | AI/server |
| AMX-FP16 | Intel | Tile state | FP16 matrix operations | Newer/future server |
| ACE | Intel/AMD ecosystem | Matrix/AVX10-related model | AI matrix acceleration | Future-facing |
Practical Recommendations
For old code
If the code uses MMX or 3DNow!, consider rewriting it using SSE2 or AVX2.
MMX and 3DNow! were important historically, but they are poor targets for modern code.
For portable 64-bit x86 code
Use SSE2 as the minimum SIMD baseline.
SSE2 is guaranteed on every x86-64 processor and avoids the old MMX/x87 state-sharing problem.
For modern desktop software
Use AVX2 and FMA when available.
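AVX2 is widely supported: Intel Core processors from Haswell onward and AMD processors from Zen onward provide it. Some low-end Intel Pentium and Celeron parts and older Atom designs do not, so runtime detection still matters.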
For high-performance server software
Add AVX-512 dispatch paths.
Recent Intel Xeon processors and AMD EPYC processors based on Zen 4 or newer can benefit significantly from AVX-512, especially for compute-heavy workloads.
For AI and matrix-heavy workloads
Consider AMX on supported Intel Xeon processors, and track future ACE developments.
For AMD Zen 4 and Zen 5, AVX-512 is the important CPU-side vector path today.
For future Intel vector code
Track AVX10.
AVX10 is intended to reduce fragmentation and provide a more consistent future AVX programming model.
Summary
The x86 SIMD map has grown from a simple MMX/SSE/3DNow! table into a complex family tree.
The big historical steps are:
- MMX introduced packed integer SIMD on x86.
- 3DNow! gave AMD an early packed floating-point SIMD path.
- SSE introduced 128-bit XMM registers.
- SSE2 made 128-bit integer and double-precision SIMD central to x86-64.
- SSSE3 and SSE4.x added many practical media, text, and integer operations.
- AVX introduced 256-bit registers and better instruction encoding.
- AVX2 made 256-bit integer SIMD broadly useful.
- FMA and F16C improved numerical and conversion-heavy workloads.
- AVX-512 introduced 512-bit vectors, mask registers, and many specialized subsets.
- AMX added matrix/tile acceleration for AI workloads.
- AVX10 points toward a more unified future Intel vector ISA.
- ACE points toward future cross-vendor AI matrix acceleration.
For developers, the most important practical lesson is simple:
Do not choose a SIMD path from the processor name alone. Detect the actual instruction sets at runtime.
For broad compatibility, start with scalar and SSE2.
For modern performance, add AVX2.
For recent servers and high-end compute, add AVX-512.
For AI matrix workloads, consider AMX where available.
And for future Intel platforms, keep an eye on AVX10.
References
- Original article: Map of Instruction sets / CPU
- Intel Intrinsics Guide
- Intel 64 and IA-32 Architectures Software Developer’s Manual
- Intel Architecture Instruction Set Extensions and Future Features Programming Reference
- Intel AVX-512 Instructions
- Intel AVX10 Technical Paper: The Converged Vector ISA
- Intel Xeon 6 Product Brief
- AMD64 Architecture Programmer’s Manual
- AMD Software Optimization Guide for the Zen 4 Microarchitecture
- AMD Software Optimization Guide for the Zen 5 Microarchitecture
- AMD: Understanding AVX-512 and Validating Usage on AMD EPYC
- x86 AI Compute Extensions ACE Whitepaper
- Agner Fog: Instruction Tables and Optimization Manuals
- GCC x86 Options
- LLVM X86 Backend Documentation



