SIMD on x64/x86

SIMD Instruction Latency Map

Instruction latency is one of the most important details to understand when optimizing SIMD code.

A SIMD instruction may look simple at the source-code level, but the number of cycles required before its result can be used depends heavily on the exact instruction, operand type, vector width, instruction encoding, and CPU microarchitecture.

This matters because modern x86 SIMD now spans many generations of instruction sets:

  • MMX;
  • 3DNow!;
  • SSE;
  • SSE2;
  • SSE3;
  • SSSE3;
  • SSE4.1;
  • SSE4.2;
  • SSE4a;
  • AES-NI;
  • PCLMULQDQ;
  • AVX;
  • F16C;
  • FMA;
  • FMA4;
  • XOP;
  • AVX2;
  • VAES;
  • VPCLMULQDQ;
  • GFNI;
  • AVX-VNNI;
  • AVX-IFMA;
  • AVX-VNNI-INT8;
  • AVX-VNNI-INT16;
  • AVX-512;
  • AVX-512 VNNI;
  • AVX-512 BF16;
  • AVX-512 FP16;
  • AVX10;
  • AMX.

Across these generations, x86 gained wider registers, new execution units, new instruction encodings, and new performance trade-offs, although not every extension changed all of these.

A single flat table with every SIMD instruction, every operand form, every vector width, and every recent processor would be too large for a blog post. It would also become outdated quickly.

Instead, this article provides a practical SIMD latency map.

The goal is to explain:

  • what instruction latency means;
  • why latency is different from throughput;
  • how latency changes across SIMD instruction families;
  • which instruction groups are usually cheap;
  • which instruction groups are usually expensive;
  • how recent Intel and AMD processors compare;
  • what changed in recent Intel and AMD generations;
  • when exact instruction data must be checked in a detailed reference;
  • how to use latency information when optimizing real code.

What Instruction Latency Means

Instruction latency is the number of clock cycles between an instruction receiving its input and producing a result that a dependent instruction can use.

For example:

vaddps ymm0, ymm0, ymm1
vaddps ymm0, ymm0, ymm2
vaddps ymm0, ymm0, ymm3

Each instruction depends on the result of the previous one because every instruction reads and writes ymm0.

If vaddps has a latency of 4 cycles on a given CPU, then the second vaddps cannot use the result of the first until roughly 4 cycles later.

This kind of code is latency-bound.

A dependency chain looks like this:

a0 -> a1 -> a2 -> a3 -> a4

Each step must wait for the previous step.

By contrast, independent instructions can overlap:

vaddps ymm0, ymm0, ymm1
vaddps ymm2, ymm2, ymm3
vaddps ymm4, ymm4, ymm5
vaddps ymm6, ymm6, ymm7

These instructions do not depend on each other. A modern out-of-order CPU can execute several of them in parallel if enough execution resources are available.

That is why latency is important, but it is not the whole performance story.

Latency vs Throughput

Latency answers this question:

How long before the result of this instruction is ready for a dependent instruction?

Throughput answers this question:

How often can the CPU start or complete this instruction when many independent instructions are available?

For example, a SIMD floating-point add may have:

latency:    3 or 4 cycles
throughput: 0.5 cycles per instruction (reciprocal throughput)

This means a dependent chain of additions advances every 3 or 4 cycles, but the CPU may be able to execute two independent additions per cycle.
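As a rough illustration of how the two numbers combine, the sketch below estimates the cycle count for n additions split across several independent chains. It is a back-of-the-envelope model, not a simulator: the function name `estimate_cycles` and its parameters are hypothetical, and the model ignores loads, front-end limits, and port conflicts.

```c
#include <stddef.h>

/* Rough cycle estimate for `n` additions split evenly across `chains`
 * independent dependency chains (hypothetical model; ignores memory,
 * front-end and port effects).
 *   latency          : cycles before a dependent add can start
 *   recip_throughput : cycles per add when all adds are independent */
double estimate_cycles(size_t n, size_t chains,
                       double latency, double recip_throughput)
{
    if (n == 0 || chains == 0)
        return 0.0;

    /* Longest chain: ceil(n / chains) dependent adds back to back. */
    size_t per_chain = (n + chains - 1) / chains;
    double latency_bound = (double)per_chain * latency;

    /* Even with unlimited parallelism, the adders cap the rate. */
    double throughput_bound = (double)n * recip_throughput;

    return latency_bound > throughput_bound ? latency_bound
                                            : throughput_bound;
}
```

With a latency of 4 and a reciprocal throughput of 0.5, 1024 additions in one chain come out latency-bound, while eight chains are enough for the throughput limit to take over in this model.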

Both values matter.

| Metric | Meaning | Most important for |
| --- | --- | --- |
| Latency | Time from input to dependent output | Dependency chains |
| Throughput | Rate with many independent operations | Unrolled loops and vector kernels |
| µops | Internal operations generated by the instruction | Front-end and scheduler pressure |
| Port usage | Which execution units are used | Execution-resource conflicts |
| Memory latency | Delay caused by loads, stores, cache misses, and memory hierarchy | Data-dependent and streaming workloads |

A latency table alone cannot predict performance.

A real SIMD loop can be limited by:

  • dependency chains;
  • instruction throughput;
  • load bandwidth;
  • store bandwidth;
  • L1, L2, L3, or memory latency;
  • cache misses;
  • address-generation limits;
  • branch prediction;
  • front-end decode bandwidth;
  • µop cache bandwidth;
  • execution-port conflicts;
  • register pressure;
  • vector-width frequency effects;
  • data layout;
  • alignment;
  • compiler code generation.

Instruction latency is still useful, but only when interpreted in context.

Why Modern SIMD Latency Tables Are Complicated

Modern x86 instruction latency depends on many details.

The same instruction family can have different latency depending on:

  • scalar vs packed form;
  • integer vs floating-point operation;
  • 128-bit, 256-bit, or 512-bit vector width;
  • register operand vs memory operand;
  • legacy SSE encoding vs VEX encoding vs EVEX encoding;
  • exact source and destination dependency;
  • whether the instruction crosses vector lanes;
  • whether the result changes execution domain;
  • whether the instruction is implemented as one µop or several µops;
  • whether the instruction uses a general vector ALU, shuffle unit, multiplier, divider, crypto unit, or matrix unit;
  • whether the core is a performance core or efficiency core;
  • whether the processor lowers frequency during wide-vector execution.

For this reason, this article uses latency classes and representative numbers.

For exact work, use detailed references such as:

  • Intel Intrinsics Guide;
  • Intel Optimization Reference Manual;
  • AMD Software Optimization Guides;
  • uops.info;
  • Agner Fog’s instruction tables.

Latency Classes Used in This Article

The tables below use approximate latency classes.

They are meant to be useful for reasoning, not as a replacement for exact instruction-specific data.

| Class | Typical latency | Meaning |
| --- | --- | --- |
| Very low | 1 cycle | Simple integer/vector logic or simple data movement |
| Low | 2-3 cycles | Common simple SIMD operations, some shuffles |
| Medium | 3-5 cycles | Floating-point add/multiply/FMA, many conversions |
| High | 5-10 cycles | Complex shuffles, integer multiply variants, carry-less multiply, crypto-style operations |
| Very high | 10+ cycles | Division, square root, gathers, some horizontal reductions, special-purpose or microcoded operations |
| Variable | Depends heavily on data/cache/operand form | Gathers, scatters, cache operations, masked memory, AMX tile operations |

A table entry such as 3-5 means “usually in this range on recent CPUs, but check exact data for the exact instruction and core.”

Processor Generations Covered

This article focuses on the most relevant recent Intel and AMD generations.

For Intel client CPUs, the practical recent generations are:

| Intel client generation | Typical core generation | SIMD context |
| --- | --- | --- |
| Raptor Lake / 13th and 14th Gen Core | Raptor Cove + Gracemont | AVX2/FMA-class client SIMD; no official AVX-512 |
| Meteor Lake / Core Ultra Series 1 | Redwood Cove + Crestmont | AVX2/FMA-class SIMD; AVX-VNNI on many SKUs |
| Arrow Lake and Lunar Lake / Core Ultra Series 2 | Lion Cove / Skymont family | AVX2/FMA-class SIMD; newer VNNI-style extensions depending on SKU |

For Intel server and workstation CPUs, the recent generations are different:

| Intel server generation | SIMD context |
| --- | --- |
| Sapphire Rapids / 4th Gen Xeon Scalable | AVX-512 and AMX |
| Emerald Rapids / 5th Gen Xeon Scalable | AVX-512 and AMX |
| Granite Rapids / Xeon 6 P-core | AVX-512, AMX, AVX10 transition generation |

Intel E-core server lines such as Sierra Forest have a different SIMD profile from P-core Xeon lines. Always check the exact SKU.

For AMD, the relevant recent generations are:

| AMD generation | SIMD context |
| --- | --- |
| Zen 3 | AVX2 and FMA; no AVX-512 |
| Zen 4 | AVX-512 support added |
| Zen 5 | AVX-512 support continues and becomes stronger, especially in server and high-performance products |

This is a practical simplification. Exact support still depends on the processor model, BIOS, operating system, and virtualization environment.

SIMD Instruction Set Coverage

The following table summarizes the major x86 SIMD and SIMD-adjacent instruction sets.

| Instruction set | Main width/model | Main role | Modern status |
| --- | --- | --- | --- |
| MMX | 64-bit | Packed integer SIMD | Legacy |
| 3DNow! | 64-bit | AMD packed floating-point SIMD | Obsolete |
| SSE | 128-bit | Packed single-precision floating point | Historical baseline |
| SSE2 | 128-bit | Integer SIMD and double-precision floating point | x86-64 baseline |
| SSE3 | 128-bit | Horizontal and complex arithmetic helpers | Common |
| SSSE3 | 128-bit | Byte shuffle and media operations | Very useful |
| SSE4.1 | 128-bit | Blends, dot products, min/max, media | Useful |
| SSE4.2 | 128-bit | Text/string compare and CRC32 | Common |
| SSE4a | 128-bit | AMD-specific small extension | Legacy/niche |
| AES-NI | 128-bit | AES crypto acceleration | Common and important |
| PCLMULQDQ | 128-bit | Carry-less multiply | Crypto/CRC |
| AVX | 256-bit | Wider floating-point SIMD | Common |
| F16C | 128/256-bit | FP16 conversion | Common |
| FMA3 | 128/256-bit | Fused multiply-add | Common |
| FMA4 | 128/256-bit | AMD four-operand FMA | Legacy |
| XOP | 128/256-bit | AMD-specific vector ops | Legacy |
| AVX2 | 256-bit | 256-bit integer SIMD | Modern baseline |
| VAES | 128/256/512-bit | Vector AES | Crypto |
| VPCLMULQDQ | 128/256/512-bit | Vector carry-less multiply | Crypto/CRC |
| GFNI | 128/256/512-bit | Galois-field operations | Crypto/coding |
| AVX-VNNI | 128/256-bit | Neural-network dot products | Newer client/server |
| AVX-IFMA | 128/256-bit | Integer fused multiply-add | Specialized |
| AVX-VNNI-INT8 | 128/256-bit | INT8 AI inference | Newer/future-facing |
| AVX-VNNI-INT16 | 128/256-bit | INT16 AI inference | Newer/future-facing |
| AVX-512F | 512-bit | AVX-512 foundation | Server/HPC/AI |
| AVX-512BW/DQ/VL | 512-bit and smaller forms | Byte/word/dword/qword support | Important AVX-512 subsets |
| AVX-512VNNI | 512-bit | INT8 inference | AI/server |
| AVX-512BF16 | 512-bit | bfloat16 operations | AI/server |
| AVX-512FP16 | 512-bit | FP16 arithmetic | AI/HPC/media |
| AVX10 | 128/256/512-bit model | Future versioned Intel vector ISA | Emerging |
| AMX | Tile state | Matrix/tile acceleration | AI/server |

AMX is not traditional lane-based SIMD. It is a matrix/tile extension, but it belongs in this map because it accelerates data-parallel numerical workloads on recent Intel Xeon processors.

Historical Baseline: MMX and Early SSE Latency

MMX and early SSE instructions were relatively narrow by modern standards, but they already showed the basic latency patterns that still matter today.

Simple packed integer operations such as add, subtract, compare, unpack, and logical operations were usually low-latency. Packed multiply, horizontal operations, state-transition instructions, and specialized media instructions were more expensive.

A simplified historical pattern looks like this:

| Instruction group | Typical latency pattern |
| --- | --- |
| Packed add/subtract | Low |
| Packed logical operations | Low |
| Packed comparisons | Low |
| Packed unpack | Low |
| Packed shifts | Low to moderate |
| Packed multiply | Moderate to high |
| Sum of absolute differences | Moderate |
| MMX state cleanup with EMMS | Expensive enough to avoid inside tight loops |

Two lessons from early SIMD programming are still valid today:

  1. Simple SIMD operations are usually cheap.
  2. Data movement, dependencies, shuffles, conversions, and special-purpose instructions often dominate performance.

The register widths have grown from 64-bit MMX to 128-bit SSE, 256-bit AVX, and 512-bit AVX-512, but the same optimization principle remains true:

The fastest SIMD code is usually the code that keeps data independent, avoids unnecessary shuffles, and gives the CPU many independent operations to execute.

Latency Map by Instruction Family

The following table gives a practical latency map for major SIMD instruction families.

These are representative ranges for recent x86 CPUs, not exact values for every instruction.

| Instruction family | Examples | Typical latency (cycles or class) |
| --- | --- | --- |
| Integer vector add/subtract | padd*, vpadd*, psub*, vpsub* | 1-2 |
| Integer vector logical | pand, por, pxor, vpand, vpor, vpxor | 1 |
| Integer vector compare | pcmpeq*, pcmpgt*, vpcmp* | 1-3 |
| Integer vector shifts | psll*, psrl*, psra*, vpsll* | 1-3 |
| Integer multiply | pmullw, pmulld, vpmull* | 3-5; pmulld/vpmulld is about 10 on many Intel cores |
| Integer multiply high / wide | pmulhw, pmuludq, vpmul* | 3-6 |
| Saturating arithmetic | paddusb, psubsw, vpaddus* | 1-3 |
| Unpack / interleave | punpck*, vpunpck* | 1-3 |
| Simple shuffle | pshufd, shufps, vpshufd | 1-3 |
| Byte shuffle | pshufb, vpshufb | 1-4 |
| Cross-lane permute | vperm2i128, vpermd, vpermps, vshufi* | 3-8 |
| Blend/select | blend*, vpblend*, masked AVX-512 operations | 1-3 |
| Floating-point add/subtract | addps, vaddps, vaddpd | 3-4 |
| Floating-point multiply | mulps, vmulps, vmulpd | 3-5 |
| Floating-point FMA | vfmadd* | 4-5 |
| Floating-point min/max | minps, maxps, vmin*, vmax* | 3-5 |
| Floating-point compare | cmpps, vcmpps, vcmppd | 3-5 |
| Floating-point conversion | cvt*, vcvt* | 3-8 |
| Floating-point divide | divps, vdivps, vdivpd | 10-30+ |
| Floating-point square root | sqrtps, vsqrtps, vsqrtpd | 10-30+ |
| Reciprocal approximation | rcpps, vrcp* | Low to medium |
| Reciprocal square-root approximation | rsqrtps, vrsqrt* | Low to medium |
| Gather | vgather* | Variable, often expensive |
| Scatter | vscatter* | Variable, often expensive |
| AES rounds | aesenc, vaesenc | Medium |
| Carry-less multiply | pclmulqdq, vpclmulqdq | Medium to high |
| VNNI dot product | vpdpbusd, vpdpwssd | Medium |
| BF16 dot product | vdpbf16ps | Medium |
| FP16 arithmetic | vaddph, vmulph, vfmaddph | Medium |
| AMX tile dot product | tdpbusd, tdpbf16ps | Variable, high-latency but high-throughput |
| Cache control | clflush, clflushopt, clwb | Variable |
| Memory fences | lfence, mfence, sfence | Variable, often pipeline-impacting |

The table shows the most important pattern:

Simple element-wise SIMD operations are usually cheap. Cross-lane movement, conversions, division, square root, gathers, scatters, and matrix/tile operations require more care.

Intel Client CPUs: Last Three Generations

Intel client CPUs in recent generations generally provide strong AVX2/FMA-class SIMD, but not official mainstream AVX-512.

A practical summary:

| Intel client generation | SIMD support profile | Practical latency notes |
| --- | --- | --- |
| Raptor Lake | SSE through AVX2, FMA, AES, PCLMUL, VNNI-style support on many SKUs | Excellent AVX2 throughput; avoid assuming AVX-512 |
| Meteor Lake | SSE through AVX2, FMA, AVX-VNNI on many SKUs | Similar AVX2 programming model; P-core/E-core differences matter |
| Arrow Lake / Lunar Lake | SSE through AVX2, FMA, newer VNNI-style extensions depending on SKU | Good AVX2-class SIMD; check exact CPUID for newer AI/vector subsets |

Representative latency classes on recent Intel client cores:

| Operation group | Raptor Lake class | Meteor Lake class | Arrow/Lunar Lake class |
| --- | --- | --- | --- |
| Vector integer add/logical | 1 | 1 | 1 |
| Vector integer multiply | 3-5 | 3-5 | 3-5 |
| 128/256-bit FP add | 3-4 | 3-4 | 3-4 |
| 128/256-bit FP multiply | 3-5 | 3-5 | 3-5 |
| 128/256-bit FMA | 4-5 | 4-5 | 4-5 |
| Simple shuffle | 1-3 | 1-3 | 1-3 |
| Cross-lane permute | 3-8 | 3-8 | 3-8 |
| Conversion | 3-8 | 3-8 | 3-8 |
| Divide/sqrt | 10-30+ | 10-30+ | 10-30+ |
| Gather | Variable | Variable | Variable |
| AES | Medium | Medium | Medium |
| PCLMUL | Medium to high | Medium to high | Medium to high |
| AVX-VNNI | SKU-dependent | Usually available on many SKUs | SKU-dependent; check CPUID |
| AVX-512 | Not official mainstream support | Not mainstream | Not mainstream |

The most important Intel client rule is:

Treat AVX2/FMA as the main wide SIMD target, and detect newer VNNI-style features explicitly. Do not assume AVX-512 on mainstream client CPUs.
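A minimal sketch of that kind of explicit detection is shown below. It assumes GCC or Clang on x86, where the `__builtin_cpu_supports` builtin is available; the `simd_level` enum and `detect_simd_level` function are hypothetical names used only for illustration, and other compilers or targets fall through to the scalar path.

```c
/* Pick the widest SIMD level this CPU reports, using the GCC/Clang
 * builtin __builtin_cpu_supports. The enum and function name are
 * hypothetical; on other compilers or non-x86 targets the sketch
 * falls back to a scalar path. */
typedef enum {
    SIMD_SCALAR = 0,
    SIMD_SSE2,
    SIMD_AVX2_FMA,
    SIMD_AVX512F
} simd_level;

simd_level detect_simd_level(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f"))
        return SIMD_AVX512F;
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma"))
        return SIMD_AVX2_FMA;
    if (__builtin_cpu_supports("sse2"))
        return SIMD_SSE2;
#endif
    return SIMD_SCALAR;
}
```

A real dispatcher would call this once at startup and select function pointers accordingly. Newer subsets such as VNNI need their own checks on top of the foundation flag.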

Intel Server CPUs: Last Three Generations

Intel server CPUs have a different SIMD profile from Intel client CPUs.

Recent Xeon generations are where Intel AVX-512 and AMX are most important.

| Intel server generation | SIMD support profile | Practical latency notes |
| --- | --- | --- |
| Sapphire Rapids | AVX-512, AVX-512 BF16/FP16 on supported SKUs, AMX | Strong server SIMD and matrix acceleration |
| Emerald Rapids | AVX-512 and AMX continuation | Similar programming model, improved platform generation |
| Granite Rapids / Xeon 6 P-core | AVX-512, AMX, AVX10 transition direction | Strong wide-vector and matrix focus |

Representative latency classes on recent Intel server cores:

| Operation group | Sapphire Rapids | Emerald Rapids | Granite Rapids / Xeon 6 P-core |
| --- | --- | --- | --- |
| 128/256-bit integer add/logical | 1 | 1 | 1 |
| 512-bit integer add/logical | 1-2 | 1-2 | 1-2 |
| 128/256-bit FP add | 3-4 | 3-4 | 3-4 |
| 512-bit FP add | 3-5 | 3-5 | 3-5 |
| 128/256-bit FMA | 4-5 | 4-5 | 4-5 |
| 512-bit FMA | 4-6 | 4-6 | 4-6 |
| AVX-512 mask operations | 1-3 | 1-3 | 1-3 |
| AVX-512 permutes | 3-8 | 3-8 | 3-8 |
| AVX-512 gathers/scatters | Variable | Variable | Variable |
| AVX-512 VNNI | Medium | Medium | Medium |
| AVX-512 BF16 | Medium | Medium | Medium |
| AVX-512 FP16 | Medium | Medium | Medium |
| AMX tile operations | Variable; high-latency but high-throughput | Variable; high-latency but high-throughput | Variable; high-latency but high-throughput |
| Divide/sqrt | 10-30+ | 10-30+ | 10-30+ |

For Intel server optimization, latency is not the only concern. Wide-vector and AMX code can be limited by:

  • data packing;
  • tile loading and storing;
  • memory bandwidth;
  • cache blocking;
  • register pressure;
  • mixed-width transitions;
  • frequency behavior;
  • NUMA effects;
  • thread scheduling.

The key rule is:

Use AVX-512 and AMX where the workload has enough arithmetic intensity to justify them. For memory-bound code, wider instructions alone may not help.
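To make "arithmetic intensity" concrete, the sketch below computes floating-point operations per byte moved for two textbook kernels on float data. The helper names are hypothetical, and the matrix-multiply figure assumes the ideal case in which each matrix crosses the memory bus only once.

```c
/* Arithmetic intensity = floating-point operations per byte moved.
 * Hypothetical helpers for two textbook kernels on float data. */

/* Dot product of two length-n vectors: 2n flops, 2 * 4n bytes read. */
double dot_intensity(double n)
{
    return (2.0 * n) / (2.0 * 4.0 * n);
}

/* n x n matrix multiply: 2n^3 flops; at best each of the three
 * matrices crosses the memory bus once: 3 * 4n^2 bytes. */
double matmul_intensity(double n)
{
    return (2.0 * n * n * n) / (3.0 * 4.0 * n * n);
}
```

The dot product stays at 0.25 flops per byte regardless of size, which is why it tends to be memory-bound, while the matrix multiply's intensity grows with n and can justify wide vectors or tiles when blocked well.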

AMD CPUs: Zen 3, Zen 4, and Zen 5

AMD’s recent SIMD story is easy to summarize at a high level:

| AMD generation | SIMD support profile |
| --- | --- |
| Zen 3 | SSE through AVX2 and FMA; no AVX-512 |
| Zen 4 | AVX-512 support added |
| Zen 5 | AVX-512 support continues and becomes stronger |

Representative latency classes:

| Operation group | Zen 3 | Zen 4 | Zen 5 |
| --- | --- | --- | --- |
| 128-bit integer add/logical | 1 | 1 | 1 |
| 256-bit integer add/logical | 1-2 | 1-2 | 1-2 |
| 512-bit integer add/logical | n/a | 1-3 | 1-2 |
| 128/256-bit FP add | 3-4 | 3-4 | 3-4 |
| 128/256-bit FP multiply | 3-5 | 3-5 | 3-5 |
| 128/256-bit FMA | 4-5 | 4-5 | 4-5 |
| 512-bit FP add | n/a | 3-5 | 3-5 |
| 512-bit FMA | n/a | 4-6 | 4-5 |
| Simple shuffle | 1-3 | 1-3 | 1-3 |
| Cross-lane permute | 3-8 | 3-8 | 3-8 |
| Conversion | 3-8 | 3-8 | 3-8 |
| Divide/sqrt | 10-30+ | 10-30+ | 10-30+ |
| Gather | Variable | Variable | Variable |
| AVX-512 VNNI | n/a | Medium | Medium |
| AVX-512 BF16 | n/a | Medium | Medium |
| AVX-512 FP16 | n/a | n/a | n/a |
| AMX | n/a | n/a | n/a |

Zen 4 added AVX-512 support, but developers should still check exact instruction subsets with CPUID.

Zen 5 strengthens AMD’s AVX-512 position and is especially important in server workloads such as EPYC Turin.

The practical AMD rule is:

For Zen 3, AVX2/FMA is the main target. For Zen 4 and Zen 5, AVX-512 becomes a realistic optimization path, especially for server, HPC, AI, analytics, compression, and data-processing workloads.

MMX and 3DNow! Latency

MMX and 3DNow! are legacy instruction sets, but they are still worth understanding.

MMX uses 64-bit MMX registers and packed integer operations.

3DNow! was AMD’s early packed floating-point SIMD extension using the MMX register file.

Typical latency patterns:

| Operation group | Typical latency (cycles or class) |
| --- | --- |
| MMX add/subtract/logical | 1-2 |
| MMX compare | 1-2 |
| MMX unpack/pack | 1-3 |
| MMX shift | 1-3 |
| MMX multiply | 3-5 |
| MMX state cleanup with EMMS | High enough to avoid inside loops |
| 3DNow! floating-point add/multiply | Medium |
| 3DNow! reciprocal/rsqrt approximations | Medium |
| 3DNow! special instructions | Look up exact data |

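For modern code, avoid MMX and 3DNow!. AMD dropped 3DNow! from its processors years ago, so code that depends on it does not run on current hardware.

Use SSE2 or later instead.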

SSE and SSE2 Latency

SSE introduced 128-bit XMM registers and packed single-precision floating-point SIMD.

SSE2 added double-precision floating-point and 128-bit integer SIMD.

Typical latency patterns:

| Operation group | Examples | Typical latency (cycles or class) |
| --- | --- | --- |
| SSE scalar/packed FP add | addss, addps | 3-4 |
| SSE scalar/packed FP multiply | mulss, mulps | 3-5 |
| SSE FP compare | cmpps, cmpss | 3-5 |
| SSE shuffle | shufps | 1-3 |
| SSE reciprocal approximation | rcpps, rsqrtps | Low to medium |
| SSE divide/sqrt | divps, sqrtps | Very high |
| SSE2 integer add/logical | padd*, pand, pxor | 1 |
| SSE2 integer multiply | pmullw, pmuludq | 3-5 |
| SSE2 double add/multiply | addpd, mulpd | 3-5 |
| SSE2 conversion | cvt* | 3-8 |
| SSE2 shift/unpack | psll*, punpck* | 1-3 |

SSE2 remains important because it is the baseline for x86-64.

Even if a program has AVX2 or AVX-512 optimized paths, SSE2 is often the first SIMD fallback.

SSE3, SSSE3, SSE4.1, and SSE4.2 Latency

The later SSE extensions added many useful instructions, especially for horizontal operations, byte shuffling, blending, string processing, and media code.

Representative latency patterns:

| Instruction set | Important instructions | Typical latency class |
| --- | --- | --- |
| SSE3 | Horizontal add/subtract | Medium |
| SSSE3 | pshufb, pmaddubsw, phadd*, pabs* | Low to medium |
| SSE4.1 | Blends, dot product, min/max, insert/extract | Low to medium; the dot-product instructions (dpps, dppd) are noticeably higher |
| SSE4.2 | String compare, CRC32 | Medium to variable |
| SSE4a | AMD-specific extract/insert/misaligned helpers | Look up exact data |

SSSE3’s pshufb is especially important. It is often one of the most useful byte-level SIMD instructions in real-world code.

However, shuffles are rarely free. They can become a bottleneck if the algorithm constantly rearranges data.

A useful rule:

If your SIMD loop does more shuffling than arithmetic, the shuffle unit may be the bottleneck.
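For readers unfamiliar with what pshufb actually computes, a scalar reference model of the 128-bit form can help. The sketch below is for illustration only; `pshufb_ref` is a hypothetical name, and the real instruction does all sixteen selections at once.

```c
#include <stdint.h>

/* Scalar reference model of the 128-bit pshufb: each output byte is
 * selected from `src` by the low 4 bits of the matching control byte,
 * or zeroed when the control byte has its high bit set.
 * The function name is hypothetical. */
void pshufb_ref(uint8_t dst[16], const uint8_t src[16],
                const uint8_t ctrl[16])
{
    uint8_t tmp[16]; /* allows dst to alias src */
    for (int i = 0; i < 16; ++i)
        tmp[i] = (ctrl[i] & 0x80) ? 0 : src[ctrl[i] & 0x0F];
    for (int i = 0; i < 16; ++i)
        dst[i] = tmp[i];
}
```

Because the control vector is data, one instruction can implement byte reversal, table lookups on 4-bit indices, or arbitrary in-lane permutations, which is why it appears so often in real kernels.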

AES-NI and PCLMULQDQ Latency

AES-NI and PCLMULQDQ are SIMD-adjacent cryptographic extensions.

They use XMM registers and later gained wider vector forms through VAES and VPCLMULQDQ.

Typical latency patterns:

| Instruction family | Examples | Typical latency class |
| --- | --- | --- |
| AES rounds | aesenc, aesenclast, aesdec | Medium |
| AES keygen assist | aeskeygenassist | Medium |
| Carry-less multiply | pclmulqdq | Medium to high |
| Vector AES | vaesenc | Medium |
| Vector carry-less multiply | vpclmulqdq | Medium to high |
| GFNI | gf2p8* | Medium |

Crypto code is often throughput-sensitive rather than latency-sensitive.

For example, AES can be optimized by processing many independent blocks in parallel. That hides the latency of individual AES rounds.

AVX and FMA Latency

AVX introduced 256-bit YMM registers and the VEX instruction encoding.

AVX1 mainly widened floating-point SIMD, while AVX2 later widened integer SIMD.

FMA added fused multiply-add operations.

Typical AVX/FMA latency patterns:

| Operation group | Examples | Typical latency (cycles or class) |
| --- | --- | --- |
| 128-bit VEX FP add | vaddps xmm | 3-4 |
| 256-bit FP add | vaddps ymm | 3-4 |
| 128-bit VEX FP multiply | vmulps xmm | 3-5 |
| 256-bit FP multiply | vmulps ymm | 3-5 |
| FMA | vfmadd* | 4-5 |
| AVX shuffle | vshufps, vunpck* | 1-3 |
| AVX cross-lane permute | vperm2f128 | 3-8 |
| FP conversion | vcvt* | 3-8 |
| Divide/sqrt | vdiv*, vsqrt* | Very high |

FMA latency is equal to or slightly higher than that of a simple add, depending on the core, but it does more work. It computes:

a * b + c

as one fused operation with a single rounding step.

For numerical kernels, FMA is usually excellent when the code has enough independent accumulators to hide latency.
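As an illustration of "enough independent accumulators", here is a plain-C dot product with four separate chains. This is a sketch under stated assumptions: `dot4` is a hypothetical name, and whether each multiply-add becomes an FMA instruction depends on the compiler and build flags, which are not specified here.

```c
#include <stddef.h>

/* Dot product with four independent accumulators. Each `acc += a*b`
 * is a multiply-add that a compiler may turn into an FMA instruction
 * when built with FMA enabled (an assumption about the build, not a
 * guarantee). The function name is hypothetical. */
float dot4(const float *a, const float *b, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;

    /* Four separate chains: no accumulator waits on another. */
    for (; i + 4 <= n; i += 4) {
        acc0 += a[i + 0] * b[i + 0];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }

    float sum = (acc0 + acc1) + (acc2 + acc3);

    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; ++i)
        sum += a[i] * b[i];

    return sum;
}
```

Splitting the sum changes the order of floating-point additions, so the result can differ in the last bits from a strictly sequential loop.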

AVX2 Latency

AVX2 extended 256-bit SIMD to integer operations.

This made AVX2 one of the most important modern SIMD targets.

Typical AVX2 latency patterns:

| Operation group | Examples | Typical latency (cycles or class) |
| --- | --- | --- |
| Integer add/subtract | vpaddb, vpaddw, vpaddd | 1 |
| Integer logical | vpand, vpor, vpxor | 1 |
| Integer compare | vpcmpeq*, vpcmpgt* | 1-3 |
| Integer shift | vpsll*, vpsrl*, vpsra* | 1-3 |
| Integer multiply | vpmullw, vpmulld | 3-5; vpmulld is about 10 on many Intel cores |
| Byte shuffle | vpshufb | 1-4 |
| Lane-crossing permute | vperm2i128, vpermd | 3-8 |
| Gather | vgather* | Variable, often expensive |
| Blend | vpblend* | 1-3 |

AVX2 is often the best practical target for portable high-performance x86 code because it is widely available on modern Intel and AMD processors.

However, not all AVX2 instructions are equally cheap.

The main danger areas are:

  • gathers;
  • cross-lane permutations;
  • variable shifts;
  • complex shuffles;
  • memory bandwidth;
  • frequency effects on some Intel CPUs.

AVX-512 Latency

AVX-512 introduced 512-bit ZMM registers and mask registers.

It also introduced many feature subsets, including:

  • AVX-512F;
  • AVX-512CD;
  • AVX-512BW;
  • AVX-512DQ;
  • AVX-512VL;
  • AVX-512IFMA;
  • AVX-512VBMI;
  • AVX-512VBMI2;
  • AVX-512BITALG;
  • AVX-512VPOPCNTDQ;
  • AVX-512VNNI;
  • AVX-512BF16;
  • AVX-512FP16.

Typical AVX-512 latency patterns:

| Operation group | Examples | Typical latency (cycles or class) |
| --- | --- | --- |
| 512-bit integer add/logical | vpadd*, vpand*, vpxor* | 1-2 |
| 512-bit integer compare | vpcmp* | 1-3 |
| 512-bit integer multiply | vpmull* | 3-6 |
| 512-bit FP add | vaddps, vaddpd | 3-5 |
| 512-bit FP multiply | vmulps, vmulpd | 3-5 |
| 512-bit FMA | vfmadd* | 4-6 |
| Mask operations | kortest, kand, kor, kxor | 1-3 |
| Masked vector operations | EVEX masked forms | Similar to unmasked or slightly more complex |
| Compress/expand | vcompress*, vexpand* | Medium to variable |
| Permute | vperm*, vshuf* | 3-8 |
| Gather/scatter | vgather*, vscatter* | Variable, often expensive |
| VNNI dot product | vpdpbusd, vpdpwssd | Medium |
| BF16 dot product | vdpbf16ps | Medium |
| FP16 arithmetic | vaddph, vmulph, vfmaddph | Medium |
| Divide/sqrt | vdiv*, vsqrt* | Very high |

AVX-512 can be very powerful, but it requires careful use.

The advantages are:

  • wider vectors;
  • mask registers;
  • better predication;
  • more powerful integer and floating-point operations;
  • better support for AI, analytics, compression, and HPC kernels.

The risks are:

  • frequency reduction on some processors;
  • higher register pressure;
  • expensive gathers/scatters;
  • complex permutes;
  • larger code size;
  • need for runtime dispatch;
  • differences between Intel and AMD implementations;
  • differences between AVX-512 subsets.

A useful rule:

Use AVX-512 when the algorithm benefits from masks, wide vectors, or specialized instructions. Do not use it blindly just because it is available.
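Since masks are one of the main reasons to choose AVX-512, a scalar model of merge masking may be useful. The sketch below mimics a masked 16-lane float add; `masked_add_ref` is a hypothetical name, and the code models only the result, not the performance.

```c
#include <stdint.h>

/* Scalar model of a merge-masked 16-lane float add, the behaviour of
 * an EVEX instruction such as `vaddps zmm0 {k1}, zmm1, zmm2`:
 * lanes whose mask bit is 1 receive a + b, the others keep their old
 * value. The function name is hypothetical. */
void masked_add_ref(float dst[16], uint16_t mask,
                    const float a[16], const float b[16])
{
    for (int lane = 0; lane < 16; ++lane) {
        if ((mask >> lane) & 1u)
            dst[lane] = a[lane] + b[lane];
        /* else: merge masking leaves dst[lane] unchanged */
    }
}
```

This per-lane predication is what lets AVX-512 handle loop tails and conditional updates without separate blend instructions.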

AVX-VNNI and AVX-512 VNNI Latency

VNNI stands for Vector Neural Network Instructions.

The core idea is to combine multiply and add operations commonly used in integer neural-network inference.

Representative instructions include:

vpdpbusd
vpdpbusds
vpdpwssd
vpdpwssds

Typical latency is medium, but throughput and data reuse are usually more important than single-instruction latency.

| Instruction family | Main use | Typical latency class |
| --- | --- | --- |
| AVX-VNNI | INT8/INT16 dot products without full AVX-512 | Medium |
| AVX-512 VNNI | 512-bit INT8/INT16 dot products | Medium |
| AVX-VNNI-INT8 | Newer INT8 dot-product forms | Medium |
| AVX-VNNI-INT16 | Newer INT16 dot-product forms | Medium |

For inference kernels, performance depends heavily on:

  • data layout;
  • cache blocking;
  • quantization format;
  • accumulation strategy;
  • number of independent accumulators;
  • memory bandwidth;
  • whether the workload fits in cache.
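As a scalar reference for what one lane of vpdpbusd computes, the sketch below multiplies four unsigned bytes by four signed bytes and adds the products to a 32-bit accumulator. `dpbusd_lane` is a hypothetical name; the real instruction performs this for every 32-bit lane of the vector at once.

```c
#include <stdint.h>

/* Scalar reference for one 32-bit lane of vpdpbusd: four unsigned
 * bytes from `a` are multiplied by four signed bytes from `b`, and the
 * four products are added to the 32-bit accumulator without
 * saturation (vpdpbusds is the saturating form). Hypothetical name. */
int32_t dpbusd_lane(int32_t acc, const uint8_t a[4], const int8_t b[4])
{
    int32_t sum = 0;
    for (int i = 0; i < 4; ++i)
        sum += (int32_t)a[i] * (int32_t)b[i];
    /* Wrap-around add, done in unsigned arithmetic to avoid
     * signed-overflow undefined behaviour in C. */
    return (int32_t)((uint32_t)acc + (uint32_t)sum);
}
```

The accumulator is both an input and the output, so a kernel that keeps feeding the same accumulator forms a dependency chain. That is why the number of independent accumulators appears in the list above.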

BF16 and FP16 Latency

BF16 and FP16 are important for AI and some numerical workloads.

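BF16 keeps the 8-bit exponent of FP32 but has only 7 explicit mantissa bits instead of 23, so it trades precision for FP32-like range. It is common in machine learning.

FP16 (IEEE half precision) has a 5-bit exponent and a 10-bit mantissa, which gives it more precision than BF16 but a much smaller range. It is common in graphics, AI, and storage formats.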
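Because BF16 is simply the upper half of an FP32 value, conversion can be sketched in a few lines of portable C. The function names below are hypothetical, the rounding mode shown is round-to-nearest-even, and NaN inputs are deliberately not special-cased.

```c
#include <stdint.h>
#include <string.h>

/* FP32 -> BF16 with round-to-nearest-even: BF16 is the top 16 bits of
 * an FP32 value (1 sign, 8 exponent, 7 mantissa bits). NaN inputs are
 * not special-cased in this sketch. Function names are hypothetical. */
uint16_t fp32_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);           /* safe type pun */
    uint32_t rounding = 0x7FFFu + ((bits >> 16) & 1u);
    return (uint16_t)((bits + rounding) >> 16);
}

/* BF16 -> FP32 is exact: the 16 dropped mantissa bits become zero. */
float bf16_to_fp32(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Hardware conversion instructions do the same job on whole vectors; the scalar version is mainly useful for reference results and tests.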

Representative latency classes:

| Instruction family | Main use | Typical latency class |
| --- | --- | --- |
| F16C conversion | FP16 to/from FP32 conversion | Medium |
| AVX-512 BF16 | BF16 dot products and conversion | Medium |
| AVX-512 FP16 | FP16 arithmetic | Medium |
| AVX10.2 low-precision extensions | Future AI/numerical formats | Check exact data |

For AI workloads, BF16/FP16 performance is often throughput-bound rather than latency-bound.

The key is to feed the units with enough independent work.

AVX10 Latency

AVX10 is Intel’s future-facing vector ISA direction.

It is designed to converge the AVX-512 programming model across future Intel P-core and E-core processors using a versioned feature model.

AVX10 should not be treated as a single fixed-latency instruction set. It is a versioned family.

From a latency perspective, the right way to think about AVX10 is:

  • AVX10 inherits much of the AVX-512-style programming model;
  • exact latency depends on the AVX10 version;
  • exact latency depends on whether the implementation supports 128-bit, 256-bit, or 512-bit vector lengths;
  • exact latency depends on the core type;
  • future instructions must be checked in current references.

A practical early AVX10 latency table is therefore:

| AVX10 category | Expected latency style |
| --- | --- |
| Simple integer/vector logical | Very low to low |
| FP add/multiply/FMA | Medium |
| Mask operations | Low |
| Permutes/shuffles | Low to high depending on complexity |
| New AI/data conversion operations | Medium to variable |
| Wider 512-bit operations | Check exact CPU and frequency behavior |

The rule for developers is:

Treat AVX10 as a future dispatch target, not as a replacement for checking exact CPU features and measured latency.

AMX Latency

AMX stands for Advanced Matrix Extensions.

AMX is not traditional SIMD. It uses tile registers and tile operations.

It is designed for dense matrix operations such as:

  • INT8 matrix multiplication;
  • BF16 matrix multiplication;
  • FP16 matrix multiplication on newer products;
  • AI inference and training kernels.

AMX instructions can have high latency, but that is not the main issue. AMX is designed for high throughput over large blocks of work.

Representative AMX latency map:

| AMX operation group | Examples | Latency interpretation |
| --- | --- | --- |
| Tile configuration | ldtilecfg | Setup overhead; keep outside hot inner loops |
| Tile load/store | tileloadd, tilestored | Memory/cache dependent |
| INT8 tile dot product | tdpbusd, tdpbuud | High-latency but high-throughput |
| BF16 tile dot product | tdpbf16ps | High-latency but high-throughput |
| FP16 tile dot product | Newer AMX FP16 forms | Check exact CPU |
| Tile release | tilerelease | State-management overhead |

The optimization rule for AMX is different from the rule for simple SIMD:

Do not think about AMX as a single instruction latency problem. Think about blocking, packing, tile reuse, memory hierarchy, and throughput.
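The blocking idea can be sketched in plain C without any AMX instructions. The example below is only an illustration of loop blocking for an unsigned-by-signed 8-bit matrix multiply with 32-bit accumulation; the function name and the block size `BLK` are hypothetical and do not correspond to real AMX tile dimensions.

```c
#include <stddef.h>
#include <stdint.h>

#define BLK 16 /* hypothetical block size; real AMX tiles differ */

/* C (MxN, int32) += A (MxK, uint8) * B (KxN, int8), row-major.
 * Plain-C sketch of cache blocking: each BLK x BLK block of C stays
 * hot while the K dimension is swept in BLK-sized steps. It does not
 * use AMX instructions. */
void matmul_u8s8_blocked(int32_t *C, const uint8_t *A, const int8_t *B,
                         size_t M, size_t N, size_t K)
{
    for (size_t i0 = 0; i0 < M; i0 += BLK)
    for (size_t j0 = 0; j0 < N; j0 += BLK)
    for (size_t k0 = 0; k0 < K; k0 += BLK) {
        size_t i1 = i0 + BLK < M ? i0 + BLK : M;
        size_t j1 = j0 + BLK < N ? j0 + BLK : N;
        size_t k1 = k0 + BLK < K ? k0 + BLK : K;

        for (size_t i = i0; i < i1; ++i)
            for (size_t k = k0; k < k1; ++k) {
                int32_t a = A[i * K + k];   /* reused across j */
                for (size_t j = j0; j < j1; ++j)
                    C[i * N + j] += a * (int32_t)B[k * N + j];
            }
    }
}
```

In a real AMX kernel the innermost block would become tile loads and a tile dot-product instruction, but the outer blocking structure and the goal of reusing loaded data stay the same.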

Representative Latency Map: Recent Intel vs AMD

The following table provides a compact view of recent Intel and AMD SIMD latency classes.

| Operation group | Intel recent client | Intel recent server | AMD Zen 3 | AMD Zen 4 | AMD Zen 5 |
| --- | --- | --- | --- | --- | --- |
| 128-bit integer add/logical | 1 | 1 | 1 | 1 | 1 |
| 256-bit integer add/logical | 1 | 1 | 1-2 | 1-2 | 1-2 |
| 512-bit integer add/logical | n/a | 1-2 | n/a | 1-3 | 1-2 |
| 128-bit FP add | 3-4 | 3-4 | 3-4 | 3-4 | 3-4 |
| 256-bit FP add | 3-4 | 3-4 | 3-4 | 3-4 | 3-4 |
| 512-bit FP add | n/a | 3-5 | n/a | 3-5 | 3-5 |
| 128/256-bit FMA | 4-5 | 4-5 | 4-5 | 4-5 | 4-5 |
| 512-bit FMA | n/a | 4-6 | n/a | 4-6 | 4-5 |
| Integer multiply | 3-5 | 3-6 | 3-5 | 3-6 | 3-6 |
| Simple shuffle | 1-3 | 1-3 | 1-3 | 1-3 | 1-3 |
| Cross-lane permute | 3-8 | 3-8 | 3-8 | 3-8 | 3-8 |
| Conversion | 3-8 | 3-8 | 3-8 | 3-8 | 3-8 |
| Divide/sqrt | 10-30+ | 10-30+ | 10-30+ | 10-30+ | 10-30+ |
| Gather | Variable | Variable | Variable | Variable | Variable |
| Scatter | n/a or limited | Variable | n/a | Variable | Variable |
| AES/VAES | Medium | Medium | Medium | Medium | Medium |
| PCLMUL/VPCLMUL | Medium to high | Medium to high | Medium to high | Medium to high | Medium to high |
| VNNI | SKU-dependent | Medium | n/a | Medium with AVX-512 VNNI | Medium |
| BF16 | SKU-dependent | Medium | n/a | Medium where supported | Medium |
| FP16 | SKU-dependent | Medium | n/a | n/a (no AVX-512 FP16) | n/a (no AVX-512 FP16) |
| AMX | n/a | Variable/high-throughput | n/a | n/a | n/a |

This table intentionally avoids pretending that every instruction has one universal latency.

The correct conclusion is:

Recent Intel and AMD CPUs have broadly similar latency classes for simple SIMD arithmetic, but differ significantly in supported instruction sets, vector width, execution resources, AMX availability, AVX-512 implementation, and frequency behavior.

Why Division and Square Root Are Special

Floating-point division and square root are much slower than add, multiply, or FMA.

For example:

vaddps ymm0, ymm1, ymm2
vmulps ymm3, ymm4, ymm5
vdivps ymm6, ymm7, ymm8
vsqrtps ymm9, ymm10

The add and multiply instructions are usually medium-latency and high-throughput.

The divide and square-root instructions are much higher latency and lower throughput.

This is why optimized numerical code often tries to replace division with multiplication by a reciprocal when acceptable:

x / y

can sometimes become:

x * (1 / y)

If several values use the same divisor, computing the reciprocal once and multiplying many times can be much faster.

For approximate math, reciprocal approximation instructions may be useful, followed by one or more Newton-Raphson refinement steps when more precision is needed.
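Both ideas can be sketched in scalar C. The helper names below are hypothetical: `divide_all` hoists a single division out of a loop, and `refine_reciprocal` is one Newton-Raphson step of the kind that would follow a hardware reciprocal approximation.

```c
#include <stddef.h>

/* One Newton-Raphson step for r ~ 1/y:  r' = r * (2 - y * r).
 * Each step roughly doubles the number of correct bits. */
float refine_reciprocal(float y, float r)
{
    return r * (2.0f - y * r);
}

/* Divide every element by the same divisor using one real division
 * and n multiplications. Results can differ from x[i] / y in the last
 * bit. Function names are hypothetical. */
void divide_all(float *x, size_t n, float y)
{
    float inv = 1.0f / y;       /* hoisted out of the loop */
    for (size_t i = 0; i < n; ++i)
        x[i] *= inv;
}
```

Whether the small rounding difference is acceptable depends on the application, which is why compilers only make this transformation themselves under relaxed floating-point options.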

Why Shuffles Often Dominate SIMD Performance

SIMD arithmetic is usually cheap.

Data rearrangement is often expensive.

For example, image code may need to transform data from this layout:

RGB RGB RGB RGB

into this layout:

RRRR GGGG BBBB

That transformation requires unpacking, shuffling, permuting, or blending.

In many SIMD kernels, the actual arithmetic is not the bottleneck. The bottleneck is moving data into the right lanes.

Common expensive or bottleneck-prone operations include:

  • byte shuffles;
  • cross-lane permutes;
  • horizontal reductions;
  • gather/scatter;
  • compress/expand;
  • format conversions;
  • matrix packing;
  • transposes.

A useful SIMD optimization rule is:

Before making arithmetic faster, make data layout easier.
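For the RGB example, the layout change itself can be written as a simple scalar pass. The sketch below uses a hypothetical function name and converts interleaved pixels into three separate planes; an optimized version would use shuffles, but the target layout is the same.

```c
#include <stddef.h>
#include <stdint.h>

/* Split interleaved RGBRGB... pixels into separate R, G and B planes.
 * Done once up front, this lets later per-channel loops read
 * contiguous data that maps directly onto vector lanes.
 * The function name is hypothetical. */
void rgb_deinterleave(const uint8_t *rgb, size_t pixels,
                      uint8_t *r, uint8_t *g, uint8_t *b)
{
    for (size_t i = 0; i < pixels; ++i) {
        r[i] = rgb[3 * i + 0];
        g[i] = rgb[3 * i + 1];
        b[i] = rgb[3 * i + 2];
    }
}
```

If several processing stages run on the planar data, the conversion cost is paid once instead of in every inner loop.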

Why Gathers and Scatters Are Variable

Gather instructions load multiple elements from unrelated memory addresses into one vector.

Scatter instructions store multiple vector elements to unrelated memory addresses.

These are powerful, but their latency is highly variable because memory dominates the cost.

A gather from L1 cache may be reasonable.

A gather from L3 cache or main memory may be extremely expensive.

A gather with repeated cache misses is not really an “instruction latency” problem anymore. It is a memory-system problem.

The same applies to scatter.

Use gathers and scatters when they simplify an algorithm or when the memory pattern is unavoidable, but do not expect them to behave like ordinary aligned vector loads and stores.
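A scalar model makes the cost structure easier to see. The sketch below mimics an 8-lane gather of 32-bit floats with 32-bit indices; `gather8_ref` is a hypothetical name, and the model describes only what is loaded, not how the hardware schedules it.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of an 8-lane 32-bit gather (the idea behind
 * vgatherdps): every lane is its own load from base[idx[lane]], so
 * the cost follows where those addresses fall in the cache hierarchy.
 * The function name is hypothetical. */
void gather8_ref(float dst[8], const float *base, const int32_t idx[8])
{
    for (int lane = 0; lane < 8; ++lane)
        dst[lane] = base[idx[lane]];
}
```

Eight lanes can touch up to eight different cache lines, which is why index locality matters more than the instruction's nominal latency.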

Wide Vectors and Frequency Effects

On some CPUs, especially some Intel generations, heavy AVX2 or AVX-512 code can reduce core frequency.

This happens because wide vector operations consume more power and create more thermal pressure.

The performance trade-off is not always obvious.

A 512-bit instruction may process twice as much data as a 256-bit instruction, but if the CPU lowers frequency significantly, the overall speedup may be smaller than expected.

This is workload-dependent.

A practical rule is:

Benchmark 128-bit, 256-bit, and 512-bit implementations on the actual target CPU. Do not assume wider is always faster.

For server code, this is especially important when a small amount of AVX-512 code is mixed into a mostly scalar or AVX2 service. The wide-vector section may affect the frequency of surrounding code.

Latency and Dependency Chains

Latency matters most when operations depend on previous results.

Example: a reduction sum.

float sum = 0.0f;

for (int i = 0; i < n; ++i)
{
    sum += a[i];
}

Even if vectorized, a naive reduction forms a dependency chain, because each add needs the result of the previous one:

sum0 -> sum1 -> sum2 -> sum3 -> ...

The solution is to use multiple accumulators:

sum0 += a[i + 0];
sum1 += a[i + 1];
sum2 += a[i + 2];
sum3 += a[i + 3];

Then combine the partial sums at the end.
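Put together as a complete scalar function, this could look like the following sketch. The name sum4 and the choice of four accumulators are illustrative:

```c
#include <stddef.h>

/* Sum an array with four independent accumulators.
   Floating-point addition is not associative, so the result can differ
   slightly from a strict left-to-right sum. */
static float sum4(const float *a, size_t n)
{
    float sum0 = 0.0f, sum1 = 0.0f, sum2 = 0.0f, sum3 = 0.0f;
    size_t i = 0;

    /* Four independent chains: no add waits on another chain's result. */
    for (; i + 4 <= n; i += 4)
    {
        sum0 += a[i + 0];
        sum1 += a[i + 1];
        sum2 += a[i + 2];
        sum3 += a[i + 3];
    }

    /* Combine the partial sums once, after the loop. */
    float sum = (sum0 + sum1) + (sum2 + sum3);

    /* Handle the 0 to 3 leftover elements. */
    for (; i < n; ++i)
    {
        sum += a[i];
    }

    return sum;
}
```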

In SIMD code, this often means using several independent vector accumulators:

__m256 acc0 = _mm256_setzero_ps();
__m256 acc1 = _mm256_setzero_ps();
__m256 acc2 = _mm256_setzero_ps();
__m256 acc3 = _mm256_setzero_ps();

for (size_t i = 0; i + 32 <= n; i += 32)
{
    acc0 = _mm256_add_ps(acc0, _mm256_loadu_ps(a + i + 0));
    acc1 = _mm256_add_ps(acc1, _mm256_loadu_ps(a + i + 8));
    acc2 = _mm256_add_ps(acc2, _mm256_loadu_ps(a + i + 16));
    acc3 = _mm256_add_ps(acc3, _mm256_loadu_ps(a + i + 24));
}

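This gives the CPU independent work and hides add latency. As a rough guide, the number of independent accumulators needed is about the add latency multiplied by the number of adds the core can start per cycle: a 4-cycle add that can start twice per cycle needs about eight chains to stay fully busy.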
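The loop above leaves out the final combine and the leftover elements. A sketch of a complete version follows, assuming GCC or Clang (the target attribute is compiler-specific and lets one function use AVX without building the whole file with -mavx). The name sum_avx is hypothetical:

```c
#include <stddef.h>
#include <immintrin.h>

/* Assumption: GCC or Clang; the caller has verified that AVX is available. */
static __attribute__((target("avx"))) float sum_avx(const float *a, size_t n)
{
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps();
    __m256 acc3 = _mm256_setzero_ps();
    size_t i = 0;

    for (; i + 32 <= n; i += 32)
    {
        acc0 = _mm256_add_ps(acc0, _mm256_loadu_ps(a + i + 0));
        acc1 = _mm256_add_ps(acc1, _mm256_loadu_ps(a + i + 8));
        acc2 = _mm256_add_ps(acc2, _mm256_loadu_ps(a + i + 16));
        acc3 = _mm256_add_ps(acc3, _mm256_loadu_ps(a + i + 24));
    }

    /* Combine the four partial vectors pairwise. */
    __m256 acc = _mm256_add_ps(_mm256_add_ps(acc0, acc1),
                               _mm256_add_ps(acc2, acc3));

    /* Reduce the eight lanes once, outside the hot loop. */
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float sum = 0.0f;
    for (int k = 0; k < 8; ++k)
    {
        sum += lanes[k];
    }

    /* Scalar tail for the last 0 to 31 elements. */
    for (; i < n; ++i)
    {
        sum += a[i];
    }

    return sum;
}
```

The final lane reduction goes through memory for simplicity. It runs once per call, so its cost rarely matters next to the main loop.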

Latency and Loop Unrolling

Loop unrolling can improve SIMD performance because it exposes more independent instructions.

For example, instead of processing one vector per iteration:

for (size_t i = 0; i + 8 <= n; i += 8)
{
    __m256 v = _mm256_loadu_ps(a + i);
    acc = _mm256_add_ps(acc, v);
}

process multiple vectors:

for (size_t i = 0; i + 32 <= n; i += 32)
{
    __m256 v0 = _mm256_loadu_ps(a + i + 0);
    __m256 v1 = _mm256_loadu_ps(a + i + 8);
    __m256 v2 = _mm256_loadu_ps(a + i + 16);
    __m256 v3 = _mm256_loadu_ps(a + i + 24);

    acc0 = _mm256_add_ps(acc0, v0);
    acc1 = _mm256_add_ps(acc1, v1);
    acc2 = _mm256_add_ps(acc2, v2);
    acc3 = _mm256_add_ps(acc3, v3);
}

This helps hide latency and improve throughput.

However, too much unrolling can increase register pressure and code size.

The right amount depends on:

  • instruction latency;
  • throughput;
  • available registers;
  • compiler register allocation;
  • cache behavior;
  • target microarchitecture.

Latency and Memory Loads

A SIMD instruction that uses a memory operand may look like one instruction, but the load still has to happen.

For example:

vaddps ymm0, ymm0, [rdi]

This performs a memory load and a vector add.

If the data is in L1 cache, the load may be fast.

If the data misses in cache, the operation can take far longer than the arithmetic latency.

Approximate memory latency classes:

Source                 Typical latency scale
Register               No memory load
L1 cache               Low
L2 cache               Medium
L3 cache               High
Main memory            Very high
Page miss / TLB miss   Very high
Remote NUMA memory     Extremely high

For many SIMD loops, memory bandwidth and cache locality matter more than arithmetic instruction latency.

Alignment and Latency

Modern CPUs handle unaligned vector loads much better than early SSE processors did.

However, alignment can still matter.

Unaligned loads and stores may be slower when they:

  • cross cache-line boundaries;
  • cross page boundaries;
  • interact badly with store forwarding;
  • create split loads;
  • increase memory-system pressure.

A practical rule:

  • use unaligned loads when alignment is unknown;
  • align data when designing performance-critical data structures;
  • avoid crossing cache-line boundaries unnecessarily;
  • benchmark before writing complicated alignment prologues.

For most modern code, simple unaligned loads are often good enough, especially when the data is in cache and the loop is not close to the memory bandwidth limit.
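When investigating split loads, it can help to test whether an access touches more than one cache line. A small sketch, assuming 64-byte cache lines (typical for current x86 CPUs) and a hypothetical helper name:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. */
#define CACHE_LINE_BYTES 64u

/* Returns 1 if an access of `bytes` bytes starting at `p` touches
   more than one cache line, 0 otherwise. */
static int crosses_cache_line(const void *p, size_t bytes)
{
    uintptr_t offset = (uintptr_t)p % CACHE_LINE_BYTES;

    return offset + bytes > CACHE_LINE_BYTES;
}
```

For example, a 32-byte load at offset 48 within a line crosses into the next line, while the same load at offset 32 does not.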

Instruction Encoding Matters: SSE vs VEX vs EVEX

The same logical operation may exist in several encodings.

For example:

addps   xmm0, xmm1        ; legacy SSE
vaddps  xmm0, xmm1, xmm2  ; VEX encoded AVX
vaddps  zmm0, zmm1, zmm2  ; EVEX encoded AVX-512

The VEX and EVEX encodings can provide advantages:

  • three-operand non-destructive forms;
  • better register usage;
  • access to wider registers;
  • access to masks in AVX-512;
  • cleaner dependency behavior in some cases.

Legacy SSE instructions leave the upper bits of the wider register unchanged, which can create false dependencies or transition penalties when they are mixed with AVX code. Modern compilers usually prefer VEX-encoded instructions when AVX is enabled.

A practical rule:

When targeting AVX or later, let the compiler generate VEX/EVEX encodings consistently. Avoid mixing legacy SSE and AVX code unnecessarily.

SIMD Latency by Data Type

The same operation can have different latency depending on data type.

Data type        Typical latency notes
8-bit integer    Simple operations are very cheap; multiplication is limited or indirect
16-bit integer   Common in media/audio; multiply support is good
32-bit integer   Add/logical cheap; multiply moderate
64-bit integer   Add/logical cheap; multiply can be more expensive
32-bit float     Add/mul/FMA well optimized
64-bit double    Add/mul/FMA well optimized but fewer lanes per vector
FP16             Newer support; check exact CPU
BF16             AI-oriented; throughput more important than scalar latency
INT8             Important for inference; VNNI/AMX can be much faster than plain SIMD
Mask registers   Usually cheap, but interaction with vector operations matters

Data type matters because execution units are not identical for every operation.

For example, integer add and integer multiply are very different internally. Floating-point add and floating-point divide are also very different.

SIMD Latency and Reductions

Horizontal reductions are often latency-sensitive.

For example, summing all lanes of a vector requires moving data across lanes:

[a0 a1 a2 a3 a4 a5 a6 a7] -> a0+a1+...+a7

This is not a simple element-wise operation.

It usually requires:

  • shuffles;
  • adds;
  • extracts;
  • lane crossing;
  • scalar cleanup.

A typical reduction strategy is:

  1. Accumulate many vectors independently.
  2. Reduce within each vector only at the end.
  3. Combine partial sums.

Avoid reducing inside the main loop unless necessary.
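A sketch of the in-register reduction for eight floats is shown here, to make the shuffle and lane-crossing steps concrete. It assumes GCC or Clang for the target attribute, and the name hsum8_avx is hypothetical. The function takes a pointer so that callers do not need AVX enabled themselves:

```c
#include <immintrin.h>

/* Sum eight consecutive floats starting at p.
   Assumption: GCC or Clang; the caller has verified that AVX is available. */
static __attribute__((target("avx"))) float hsum8_avx(const float *p)
{
    __m256 v  = _mm256_loadu_ps(p);

    __m128 lo = _mm256_castps256_ps128(v);      /* elements 0..3 */
    __m128 hi = _mm256_extractf128_ps(v, 1);    /* elements 4..7 (lane crossing) */
    __m128 s  = _mm_add_ps(lo, hi);             /* four partial sums */

    __m128 sh = _mm_movehl_ps(s, s);            /* bring elements 2,3 down to 0,1 */
    s = _mm_add_ps(s, sh);                      /* elements 0,1 hold two partial sums */

    sh = _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 1, 1, 1)); /* broadcast element 1 */
    s = _mm_add_ss(s, sh);                      /* element 0 holds the total */

    return _mm_cvtss_f32(s);                    /* extract the scalar */
}
```

Every step depends on the one before it, which is why this sequence is latency-bound and belongs outside the main loop.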

SIMD Latency and Dot Products

Dot products combine multiplication and addition.

For floating-point code, FMA is ideal:

acc = a * b + acc

FMA latency is typically 4 to 5 cycles, but a well-unrolled loop with multiple accumulators can still approach peak FMA throughput.

For integer neural-network inference, VNNI and AMX provide more specialized dot-product operations.

Dot-product style   Best instruction family
FP32 dot product    AVX2/FMA or AVX-512/FMA
FP64 dot product    AVX2/FMA or AVX-512/FMA
INT8 dot product    AVX-VNNI, AVX-512 VNNI, or AMX-INT8
BF16 dot product    AVX-512 BF16 or AMX-BF16
FP16 dot product    AVX-512 FP16 or AMX-FP16 where available

Dot products are usually throughput-bound if written correctly.

The optimization target is not one instruction’s latency, but enough independent accumulators to keep the execution units busy.
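A sketch of an FP32 dot product along these lines, assuming GCC or Clang for the target attribute and a CPU with AVX and FMA. The name dot_fma is hypothetical, and four accumulators is an illustrative choice; cores that can start two FMAs per cycle may need more to stay busy:

```c
#include <stddef.h>
#include <immintrin.h>

/* Assumption: GCC or Clang; the caller has verified AVX and FMA support. */
static __attribute__((target("avx,fma")))
float dot_fma(const float *a, const float *b, size_t n)
{
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps();
    __m256 acc3 = _mm256_setzero_ps();
    size_t i = 0;

    /* Each accumulator forms its own dependency chain: acc = a * b + acc. */
    for (; i + 32 <= n; i += 32)
    {
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 0),  _mm256_loadu_ps(b + i + 0),  acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8),  _mm256_loadu_ps(b + i + 8),  acc1);
        acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 16), _mm256_loadu_ps(b + i + 16), acc2);
        acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 24), _mm256_loadu_ps(b + i + 24), acc3);
    }

    /* Combine the partial vectors, then reduce the lanes once. */
    __m256 acc = _mm256_add_ps(_mm256_add_ps(acc0, acc1),
                               _mm256_add_ps(acc2, acc3));
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float sum = 0.0f;
    for (int k = 0; k < 8; ++k)
    {
        sum += lanes[k];
    }

    /* Scalar tail for the last 0 to 31 elements. */
    for (; i < n; ++i)
    {
        sum += a[i] * b[i];
    }

    return sum;
}
```

The summation order differs from a simple scalar loop, so results can differ in the low bits.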

How to Use Latency Data in Practice

When optimizing SIMD code, follow this process.

1. Identify the hot loop

Do not optimize random code.

Use profiling first.

Find the loop or function that actually matters.

2. Determine the bottleneck

Ask whether the loop is limited by:

  • arithmetic latency;
  • arithmetic throughput;
  • memory bandwidth;
  • cache misses;
  • shuffles;
  • gathers/scatters;
  • branch misprediction;
  • stores;
  • conversions;
  • horizontal reductions.

Latency tables help mainly when the loop is dependency-bound.

3. Look for dependency chains

Dependency chains are where latency hurts most.

Common examples:

  • reductions;
  • prefix sums;
  • recurrence relations;
  • repeated multiply-add into one accumulator;
  • scalar control values extracted from vectors;
  • long chains of conversions or shuffles.

4. Add independent accumulators

If the loop is latency-bound, create independent work.

For example:

one accumulator  -> latency-bound
four accumulators -> much easier to pipeline

5. Reduce shuffles

If shuffles dominate, change the data layout.

Better data layout often beats clever instruction selection.

6. Avoid unnecessary conversions

Conversions between integer and floating-point domains can add latency and reduce throughput.

Try to keep data in one representation as long as possible.

7. Use the right vector width

Try 128-bit, 256-bit, and 512-bit implementations when possible.

The best width depends on:

  • CPU;
  • workload;
  • frequency behavior;
  • memory bandwidth;
  • register pressure;
  • instruction mix.

8. Check exact instruction data

When performance really matters, look up exact latency, throughput, µop count, and port usage for the target CPU.

Do not rely on generic assumptions.

Common Mistakes

Mistake 1: Confusing latency with throughput

An instruction with 4-cycle latency may still have excellent throughput if many independent operations are available.

Mistake 2: Optimizing for latency when the loop is memory-bound

If the loop is waiting on memory, changing an add from 4 cycles to 3 cycles will not help.

Mistake 3: Assuming wider SIMD is always faster

AVX-512 is not automatically faster than AVX2.

Wider vectors can increase throughput, but they can also increase register pressure, memory pressure, and frequency effects.

Mistake 4: Ignoring shuffles

Many SIMD loops are limited by data rearrangement, not arithmetic.

Mistake 5: Using gathers as if they were normal loads

Gather latency is highly variable and often expensive.

Mistake 6: Using one accumulator in a reduction

One accumulator creates a dependency chain. Use multiple accumulators.

Mistake 7: Assuming CPU generation is enough

Always check the CPUID feature bits, and confirm that the operating system has enabled the corresponding register state (reported through XGETBV).

A product name is not enough to know which SIMD features are available.
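A minimal runtime check might look like the sketch below. It assumes GCC or Clang on x86, because __builtin_cpu_supports is a compiler builtin rather than standard C. The enum and function names are hypothetical. These builtins are generally understood to account for operating-system support of the wider register state as well, but that is worth confirming in the compiler documentation:

```c
/* Assumption: GCC or Clang on x86; these builtins are not standard C. */
enum simd_level
{
    SIMD_SSE2    = 0,   /* x86-64 baseline */
    SIMD_AVX2    = 1,   /* AVX2 and FMA available */
    SIMD_AVX512F = 2    /* AVX-512 Foundation available */
};

static enum simd_level detect_simd_level(void)
{
    __builtin_cpu_init();   /* harmless if already initialized */

    if (__builtin_cpu_supports("avx512f"))
    {
        return SIMD_AVX512F;
    }
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma"))
    {
        return SIMD_AVX2;
    }
    return SIMD_SSE2;
}
```

A dispatcher would typically call this once at startup and select a function pointer per kernel. Real code usually needs finer-grained checks, since AVX-512 is split into several subsets.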

Mistake 8: Copying latency numbers without checking operand form

The latency of an instruction may differ depending on which input operand the output depends on.

Detailed references often list a separate latency for each input operand.

Mistake 9: Forgetting P-core vs E-core differences

Modern Intel CPUs may contain different core types with different execution resources.

Latency and throughput can differ between P-cores and E-cores.

Mistake 10: Ignoring compiler output

Intrinsics do not guarantee ideal machine code.

Inspect generated assembly when performance matters.

Practical Recommendations

For broad x86-64 compatibility:

  • use SSE2 as the baseline;
  • keep scalar fallbacks where needed;
  • avoid MMX in new code.

For modern desktop performance:

  • use AVX2 and FMA when available;
  • consider AVX-VNNI for inference workloads on supported CPUs;
  • do not assume AVX-512 on Intel client CPUs.

For recent AMD performance:

  • use AVX2/FMA for Zen 3;
  • add AVX-512 paths for Zen 4 and Zen 5 where appropriate;
  • check exact AVX-512 subsets.

For recent Intel server performance:

  • use AVX-512 for suitable HPC, analytics, compression, and AI kernels;
  • use AMX for dense matrix AI workloads;
  • benchmark vector width and frequency effects.

For latency-sensitive code:

  • break dependency chains;
  • use multiple accumulators;
  • avoid unnecessary conversions;
  • avoid horizontal reductions in the inner loop;
  • reduce cross-lane shuffles.

For throughput-sensitive code:

  • unroll loops;
  • expose independent operations;
  • keep data in cache;
  • use aligned and contiguous data layouts where possible;
  • avoid memory bottlenecks.

Summary

SIMD instruction latency has become much more complex since the early days of MMX and SSE.

Modern x86 CPUs support a wide range of SIMD instruction families, from 64-bit MMX to 128-bit SSE, 256-bit AVX2, 512-bit AVX-512, versioned AVX10, and matrix-oriented AMX.

The most important lessons are:

  1. Simple SIMD add, subtract, logical, compare, and shift operations are usually cheap.
  2. Floating-point add, multiply, and FMA have moderate latency but excellent throughput when independent work exists.
  3. Integer multiply, conversions, carry-less multiply, crypto operations, and complex shuffles require more attention.
  4. Division, square root, gathers, scatters, and cache-control operations are expensive or variable.
  5. AVX-512 and AMX can be extremely powerful, but only when the workload fits their execution model.
  6. Recent Intel client CPUs are mainly AVX2/FMA-class SIMD targets.
  7. Recent Intel server CPUs are AVX-512 and AMX targets.
  8. AMD Zen 3 is mainly an AVX2/FMA target.
  9. AMD Zen 4 and Zen 5 make AVX-512 a realistic cross-vendor optimization path.
  10. Exact latency must be checked for the exact instruction, operand form, vector width, and CPU.

The practical SIMD optimization rule is:

Use latency data to understand dependency chains, but optimize the whole loop: data layout, memory behavior, throughput, shuffles, vector width, and CPU-specific execution resources all matter.

References