Instruction latency is one of the most important details to understand when optimizing SIMD code.
A SIMD instruction may look simple at the source-code level, but the number of cycles required before its result can be used depends heavily on the exact instruction, operand type, vector width, instruction encoding, and CPU microarchitecture.
This matters because modern x86 SIMD now spans many generations of instruction sets:
- MMX;
- 3DNow!;
- SSE;
- SSE2;
- SSE3;
- SSSE3;
- SSE4.1;
- SSE4.2;
- SSE4a;
- AES-NI;
- PCLMULQDQ;
- AVX;
- F16C;
- FMA;
- FMA4;
- XOP;
- AVX2;
- VAES;
- VPCLMULQDQ;
- GFNI;
- AVX-VNNI;
- AVX-IFMA;
- AVX-VNNI-INT8;
- AVX-VNNI-INT16;
- AVX-512;
- AVX-512 VNNI;
- AVX-512 BF16;
- AVX-512 FP16;
- AVX10;
- AMX.
Across these generations, x86 gained wider registers, new execution units, new instruction encodings, and new performance trade-offs, although no single extension changed all of them at once.
A single flat table with every SIMD instruction, every operand form, every vector width, and every recent processor would be too large for a blog post. It would also become outdated quickly.
Instead, this article provides a practical SIMD latency map.
The goal is to explain:
- what instruction latency means;
- why latency is different from throughput;
- how latency changes across SIMD instruction families;
- which instruction groups are usually cheap;
- which instruction groups are usually expensive;
- how recent Intel and AMD processors compare;
- what changed in recent Intel and AMD generations;
- when exact instruction data must be checked in a detailed reference;
- how to use latency information when optimizing real code.
What Instruction Latency Means
Instruction latency is the number of clock cycles between an instruction receiving its input and producing a result that a dependent instruction can use.
For example:
vaddps ymm0, ymm0, ymm1
vaddps ymm0, ymm0, ymm2
vaddps ymm0, ymm0, ymm3
Each instruction depends on the result of the previous one because every instruction reads and writes ymm0.
If vaddps has a latency of 4 cycles on a given CPU, then the second vaddps cannot use the result of the first until roughly 4 cycles later.
This kind of code is latency-bound.
A dependency chain looks like this:
a0 -> a1 -> a2 -> a3 -> a4
Each step must wait for the previous step.
By contrast, independent instructions can overlap:
vaddps ymm0, ymm0, ymm1
vaddps ymm2, ymm2, ymm3
vaddps ymm4, ymm4, ymm5
vaddps ymm6, ymm6, ymm7
These instructions do not depend on each other. A modern out-of-order CPU can execute several of them in parallel if enough execution resources are available.
That is why latency is important, but it is not the whole performance story.
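The difference between one dependency chain and several independent ones can be sketched in C with SSE intrinsics. This is a minimal illustration, not a tuned kernel: the function names are made up for this example, the input length is assumed to be a multiple of 16 floats, and the choice of four accumulators is arbitrary.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics, part of the x86-64 baseline */

/* Add the four lanes of a vector into one scalar. */
static float hsum4(__m128 v)
{
    float lanes[4];
    _mm_storeu_ps(lanes, v);
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}

/* One accumulator: every add must wait for the previous add,
   so the loop advances at the latency of the add instruction.
   Assumes n is a multiple of 16. */
float sum_one_chain(const float *x, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(x + i));
    return hsum4(acc);
}

/* Four accumulators: four independent chains that an
   out-of-order core can overlap. Assumes n is a multiple of 16. */
float sum_four_chains(const float *x, size_t n)
{
    __m128 a0 = _mm_setzero_ps(), a1 = a0, a2 = a0, a3 = a0;
    for (size_t i = 0; i < n; i += 16) {
        a0 = _mm_add_ps(a0, _mm_loadu_ps(x + i));
        a1 = _mm_add_ps(a1, _mm_loadu_ps(x + i + 4));
        a2 = _mm_add_ps(a2, _mm_loadu_ps(x + i + 8));
        a3 = _mm_add_ps(a3, _mm_loadu_ps(x + i + 12));
    }
    /* Combine the partial sums once, after the loop. */
    return hsum4(_mm_add_ps(_mm_add_ps(a0, a1), _mm_add_ps(a2, a3)));
}
```

Both functions add the same values, but in a different order, so floating-point results can differ in the last bits for general inputs. That reordering is also why compilers normally do not perform this transformation on their own without relaxed floating-point options.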
Latency vs Throughput
Latency answers this question:
How long before the result of this instruction is ready for a dependent instruction?
Throughput answers this question:
How often can the CPU start or complete this instruction when many independent instructions are available?
For example, a SIMD floating-point add may have:
latency: 3 or 4 cycles
reciprocal throughput: 0.5 cycles per instruction
This means a dependent chain of additions advances every 3 or 4 cycles, but the CPU may be able to execute two independent additions per cycle.
Both values matter.
| Metric | Meaning | Most important for |
|---|---|---|
| Latency | Time from input to dependent output | Dependency chains |
| Throughput | Rate with many independent operations | Unrolled loops and vector kernels |
| µops | Internal operations generated by the instruction | Front-end and scheduler pressure |
| Port usage | Which execution units are used | Execution-resource conflicts |
| Memory latency | Delay caused by loads, stores, cache misses, and memory hierarchy | Data-dependent and streaming workloads |
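Latency and reciprocal throughput can be combined into a rough first-order estimate. The sketch below is a simplified model only: it ignores memory, front-end, and port conflicts, and the helper names are hypothetical. It captures the common rule of thumb that roughly latency divided by reciprocal throughput independent chains are needed before a loop stops being latency-bound.

```c
/* Cycles for n dependent operations: each one waits for the previous result. */
double dependent_chain_cycles(unsigned n, double latency)
{
    return n * latency;
}

/* Cycles for n independent operations once the pipeline is full. */
double independent_ops_cycles(unsigned n, double recip_throughput)
{
    return n * recip_throughput;
}

/* Independent chains needed to keep the execution units busy:
   ceil(latency / recip_throughput), computed without <math.h>. */
unsigned chains_to_hide_latency(double latency, double recip_throughput)
{
    double ratio = latency / recip_throughput;
    unsigned whole = (unsigned)ratio;
    return (ratio > (double)whole) ? whole + 1 : whole;
}
```

With the example figures above (latency 4, reciprocal throughput 0.5), 100 dependent adds take about 400 cycles, 100 independent adds take about 50, and about 8 independent chains are needed to reach the throughput limit.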
A latency table alone cannot predict performance.
A real SIMD loop can be limited by:
- dependency chains;
- instruction throughput;
- load bandwidth;
- store bandwidth;
- L1, L2, L3, or memory latency;
- cache misses;
- address-generation limits;
- branch prediction;
- front-end decode bandwidth;
- µop cache bandwidth;
- execution-port conflicts;
- register pressure;
- vector-width frequency effects;
- data layout;
- alignment;
- compiler code generation.
Instruction latency is still useful, but only when interpreted in context.
Why Modern SIMD Latency Tables Are Complicated
Modern x86 instruction latency depends on many details.
The same instruction family can have different latency depending on:
- scalar vs packed form;
- integer vs floating-point operation;
- 128-bit, 256-bit, or 512-bit vector width;
- register operand vs memory operand;
- legacy SSE encoding vs VEX encoding vs EVEX encoding;
- exact source and destination dependency;
- whether the instruction crosses vector lanes;
- whether the result changes execution domain;
- whether the instruction is implemented as one µop or several µops;
- whether the instruction uses a general vector ALU, shuffle unit, multiplier, divider, crypto unit, or matrix unit;
- whether the core is a performance core or efficiency core;
- whether the processor lowers frequency during wide-vector execution.
For this reason, this article uses latency classes and representative numbers.
For exact work, use detailed references such as:
- Intel Intrinsics Guide;
- Intel Optimization Reference Manual;
- AMD Software Optimization Guides;
- uops.info;
- Agner Fog’s instruction tables.
Latency Classes Used in This Article
The tables below use approximate latency classes.
They are meant to be useful for reasoning, not as a replacement for exact instruction-specific data.
| Class | Typical latency | Meaning |
|---|---|---|
| Very low | 1 cycle | Simple integer/vector logic or simple data movement |
| Low | 2-3 cycles | Common simple SIMD operations, some shuffles |
| Medium | 3-5 cycles | Floating-point add/multiply/FMA, many conversions |
| High | 5-10 cycles | Complex shuffles, integer multiply variants, carry-less multiply, crypto-style operations |
| Very high | 10+ cycles | Division, square root, gathers, some horizontal reductions, special-purpose or microcoded operations |
| Variable | Depends heavily on data/cache/operand form | Gathers, scatters, cache operations, masked memory, AMX tile operations |
A table entry such as 3-5 means “usually in this range on recent CPUs, but check exact data for the exact instruction and core.”
Processor Generations Covered
This article focuses on the most relevant recent Intel and AMD generations.
For Intel client CPUs, the practical recent generations are:
| Intel client generation | Typical core generation | SIMD context |
|---|---|---|
| Raptor Lake / 13th and 14th Gen Core | Raptor Cove + Gracemont | AVX2/FMA-class client SIMD; no official AVX-512 |
| Meteor Lake / Core Ultra Series 1 | Redwood Cove + Crestmont | AVX2/FMA-class SIMD; AVX-VNNI on many SKUs |
| Arrow Lake and Lunar Lake / Core Ultra Series 2 | Lion Cove / Skymont family | AVX2/FMA-class SIMD; newer VNNI-style extensions depending on SKU |
For Intel server and workstation CPUs, the recent generations are different:
| Intel server generation | SIMD context |
|---|---|
| Sapphire Rapids / 4th Gen Xeon Scalable | AVX-512 and AMX |
| Emerald Rapids / 5th Gen Xeon Scalable | AVX-512 and AMX |
| Granite Rapids / Xeon 6 P-core | AVX-512, AMX, AVX10 transition generation |
Intel E-core server lines such as Sierra Forest have a different SIMD profile from P-core Xeon lines. Always check the exact SKU.
For AMD, the relevant recent generations are:
| AMD generation | SIMD context |
|---|---|
| Zen 3 | AVX2 and FMA; no AVX-512 |
| Zen 4 | AVX-512 support added |
| Zen 5 | AVX-512 support continues; desktop and server parts widen the vector datapath to a full 512 bits |
This is a practical simplification. Exact support still depends on the processor model, BIOS, operating system, and virtualization environment.
SIMD Instruction Set Coverage
The following table summarizes the major x86 SIMD and SIMD-adjacent instruction sets.
| Instruction set | Main width/model | Main role | Modern status |
|---|---|---|---|
| MMX | 64-bit | Packed integer SIMD | Legacy |
| 3DNow! | 64-bit | AMD packed floating-point SIMD | Obsolete |
| SSE | 128-bit | Packed single-precision floating point | Historical baseline |
| SSE2 | 128-bit | Integer SIMD and double-precision floating point | x86-64 baseline |
| SSE3 | 128-bit | Horizontal and complex arithmetic helpers | Common |
| SSSE3 | 128-bit | Byte shuffle and media operations | Very useful |
| SSE4.1 | 128-bit | Blends, dot products, min/max, media | Useful |
| SSE4.2 | 128-bit | Text/string compare and CRC32 | Common |
| SSE4a | 128-bit | AMD-specific small extension | Legacy/niche |
| AES-NI | 128-bit | AES crypto acceleration | Common and important |
| PCLMULQDQ | 128-bit | Carry-less multiply | Crypto/CRC |
| AVX | 256-bit | Wider floating-point SIMD | Common |
| F16C | 128/256-bit | FP16 conversion | Common |
| FMA3 | 128/256-bit | Fused multiply-add | Common |
| FMA4 | 128/256-bit | AMD four-operand FMA | Legacy |
| XOP | 128/256-bit | AMD-specific vector ops | Legacy |
| AVX2 | 256-bit | 256-bit integer SIMD | Modern baseline |
| VAES | 128/256/512-bit | Vector AES | Crypto |
| VPCLMULQDQ | 128/256/512-bit | Vector carry-less multiply | Crypto/CRC |
| GFNI | 128/256/512-bit | Galois-field operations | Crypto/coding |
| AVX-VNNI | 128/256-bit | Neural-network dot products | Newer client/server |
| AVX-IFMA | 128/256-bit | Integer fused multiply-add | Specialized |
| AVX-VNNI-INT8 | 128/256-bit | INT8 AI inference | Newer/future-facing |
| AVX-VNNI-INT16 | 128/256-bit | INT16 AI inference | Newer/future-facing |
| AVX-512F | 512-bit | AVX-512 foundation | Server/HPC/AI |
| AVX-512BW/DQ/VL | 512-bit and smaller forms | Byte/word/dword/qword support | Important AVX-512 subsets |
| AVX-512VNNI | 512-bit | INT8 inference | AI/server |
| AVX-512BF16 | 512-bit | bfloat16 operations | AI/server |
| AVX-512FP16 | 512-bit | FP16 arithmetic | AI/HPC/media |
| AVX10 | 128/256/512-bit model | Future versioned Intel vector ISA | Emerging |
| AMX | Tile state | Matrix/tile acceleration | AI/server |
AMX is not traditional lane-based SIMD. It is a matrix/tile extension, but it belongs in this map because it accelerates data-parallel numerical workloads on recent Intel Xeon processors.
Historical Baseline: MMX and Early SSE Latency
MMX and early SSE instructions were relatively narrow by modern standards, but they already showed the basic latency patterns that still matter today.
Simple packed integer operations such as add, subtract, compare, unpack, and logical operations were usually low-latency. Packed multiply, horizontal operations, state-transition instructions, and specialized media instructions were more expensive.
A simplified historical pattern looks like this:
| Instruction group | Typical latency pattern |
|---|---|
| Packed add/subtract | Low |
| Packed logical operations | Low |
| Packed comparisons | Low |
| Packed unpack | Low |
| Packed shifts | Low to moderate |
| Packed multiply | Moderate to high |
| Sum of absolute differences | Moderate |
| MMX state cleanup with EMMS | Expensive enough to avoid inside tight loops |
Two lessons from early SIMD programming are still valid today:
- Simple SIMD operations are usually cheap.
- Data movement, dependencies, shuffles, conversions, and special-purpose instructions often dominate performance.
The register widths have grown from 64-bit MMX to 128-bit SSE, 256-bit AVX, and 512-bit AVX-512, but the same optimization principle remains true:
The fastest SIMD code is usually the code that keeps data independent, avoids unnecessary shuffles, and gives the CPU many independent operations to execute.
Latency Map by Instruction Family
The following table gives a practical latency map for major SIMD instruction families.
These are representative ranges for recent x86 CPUs, not exact values for every instruction.
| Instruction family | Examples | Typical latency class |
|---|---|---|
| Integer vector add/subtract | padd*, vpadd*, psub*, vpsub* | 1-2 |
| Integer vector logical | pand, por, pxor, vpand, vpor, vpxor | 1 |
| Integer vector compare | pcmpeq*, pcmpgt*, vpcmp* | 1-3 |
| Integer vector shifts | psll*, psrl*, psra*, vpsll* | 1-3 |
| Integer multiply | pmullw, pmulld, vpmull* | 3-5 |
| Integer multiply high / wide | pmulhw, pmuludq, vpmul* | 3-6 |
| Saturating arithmetic | paddusb, psubsw, vpaddus* | 1-3 |
| Unpack / interleave | punpck*, vpunpck* | 1-3 |
| Simple shuffle | pshufd, shufps, vpshufd | 1-3 |
| Byte shuffle | pshufb, vpshufb | 1-4 |
| Cross-lane permute | vperm2i128, vpermd, vpermps, vshufi* | 3-8 |
| Blend/select | blend*, vpblend*, masked AVX-512 operations | 1-3 |
| Floating-point add/subtract | addps, vaddps, vaddpd | 3-4 |
| Floating-point multiply | mulps, vmulps, vmulpd | 3-5 |
| Floating-point FMA | vfmadd* | 4-5 |
| Floating-point min/max | minps, maxps, vmin*, vmax* | 3-5 |
| Floating-point compare | cmpps, vcmpps, vcmppd | 3-5 |
| Floating-point conversion | cvt*, vcvt* | 3-8 |
| Floating-point divide | divps, vdivps, vdivpd | 10-30+ |
| Floating-point square root | sqrtps, vsqrtps, vsqrtpd | 10-30+ |
| Reciprocal approximation | rcpps, vrcp* | Low to medium |
| Reciprocal square-root approximation | rsqrtps, vrsqrt* | Low to medium |
| Gather | vgather* | Variable, often expensive |
| Scatter | vscatter* | Variable, often expensive |
| AES rounds | aesenc, vaesenc | Medium |
| Carry-less multiply | pclmulqdq, vpclmulqdq | Medium to high |
| VNNI dot product | vpdpbusd, vpdpwssd | Medium |
| BF16 dot product | vdpbf16ps | Medium |
| FP16 arithmetic | vaddph, vmulph, vfmaddph | Medium |
| AMX tile dot product | tdpbusd, tdpbf16ps | Variable, high-latency but high-throughput |
| Cache control | clflush, clflushopt, clwb | Variable |
| Memory fences | lfence, mfence, sfence | Variable, often pipeline-impacting |
The table shows the most important pattern:
Simple element-wise SIMD operations are usually cheap. Cross-lane movement, conversions, division, square root, gathers, scatters, and matrix/tile operations require more care.
Intel Client CPUs: Last Three Generations
Intel client CPUs in recent generations generally provide strong AVX2/FMA-class SIMD, but not official mainstream AVX-512.
A practical summary:
| Intel client generation | SIMD support profile | Practical latency notes |
|---|---|---|
| Raptor Lake | SSE through AVX2, FMA, AES, PCLMUL, VNNI-style support on many SKUs | Excellent AVX2 throughput; avoid assuming AVX-512 |
| Meteor Lake | SSE through AVX2, FMA, AVX-VNNI on many SKUs | Similar AVX2 programming model; P-core/E-core differences matter |
| Arrow Lake / Lunar Lake | SSE through AVX2, FMA, newer VNNI-style extensions depending on SKU | Good AVX2-class SIMD; check exact CPUID for newer AI/vector subsets |
Representative latency classes on recent Intel client cores:
| Operation group | Raptor Lake class | Meteor Lake class | Arrow/Lunar Lake class |
|---|---|---|---|
| Vector integer add/logical | 1 | 1 | 1 |
| Vector integer multiply | 3-5 | 3-5 | 3-5 |
| 128/256-bit FP add | 3-4 | 3-4 | 3-4 |
| 128/256-bit FP multiply | 3-5 | 3-5 | 3-5 |
| 128/256-bit FMA | 4-5 | 4-5 | 4-5 |
| Simple shuffle | 1-3 | 1-3 | 1-3 |
| Cross-lane permute | 3-8 | 3-8 | 3-8 |
| Conversion | 3-8 | 3-8 | 3-8 |
| Divide/sqrt | 10-30+ | 10-30+ | 10-30+ |
| Gather | Variable | Variable | Variable |
| AES | Medium | Medium | Medium |
| PCLMUL | Medium to high | Medium to high | Medium to high |
| AVX-VNNI | SKU-dependent | Usually available on many SKUs | SKU-dependent; check CPUID |
| AVX-512 | Not official mainstream support | Not mainstream | Not mainstream |
The most important Intel client rule is:
Treat AVX2/FMA as the main wide SIMD target, and detect newer VNNI-style features explicitly. Do not assume AVX-512 on mainstream client CPUs.
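Explicit detection can be sketched with the `__builtin_cpu_supports` builtin available in GCC and Clang on x86. This is a minimal example under those assumptions: the enum and function names are invented for illustration, only three paths are distinguished, and on other compilers or architectures the code simply falls back to the baseline path.

```c
/* Dispatch targets used in this sketch (hypothetical names). */
typedef enum {
    SIMD_BASELINE = 0,   /* SSE2 on x86-64, or a scalar path elsewhere */
    SIMD_AVX2_FMA = 1,
    SIMD_AVX512F  = 2
} simd_path;

/* Pick the widest path the running CPU reports, checked at run time. */
simd_path choose_simd_path(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f"))
        return SIMD_AVX512F;
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma"))
        return SIMD_AVX2_FMA;
#endif
    return SIMD_BASELINE;
}

/* Human-readable name, useful for logging which path was selected. */
const char *simd_path_name(simd_path p)
{
    switch (p) {
    case SIMD_AVX512F:  return "AVX-512F";
    case SIMD_AVX2_FMA: return "AVX2+FMA";
    default:            return "baseline";
    }
}
```

A real dispatcher would usually run this check once at startup and store a function pointer per kernel. Having AVX-512F available does not by itself mean the 512-bit path is the fastest one, so measuring each path on the target machine is still worthwhile.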
Intel Server CPUs: Last Three Generations
Intel server CPUs have a different SIMD profile from Intel client CPUs.
Recent Xeon generations are where Intel AVX-512 and AMX are most important.
| Intel server generation | SIMD support profile | Practical latency notes |
|---|---|---|
| Sapphire Rapids | AVX-512, AVX-512 BF16/FP16 on supported SKUs, AMX | Strong server SIMD and matrix acceleration |
| Emerald Rapids | AVX-512 and AMX continuation | Similar programming model, improved platform generation |
| Granite Rapids / Xeon 6 P-core | AVX-512, AMX, AVX10 transition direction | Strong wide-vector and matrix focus |
Representative latency classes on recent Intel server cores:
| Operation group | Sapphire Rapids | Emerald Rapids | Granite Rapids / Xeon 6 P-core |
|---|---|---|---|
| 128/256-bit integer add/logical | 1 | 1 | 1 |
| 512-bit integer add/logical | 1-2 | 1-2 | 1-2 |
| 128/256-bit FP add | 3-4 | 3-4 | 3-4 |
| 512-bit FP add | 3-5 | 3-5 | 3-5 |
| 128/256-bit FMA | 4-5 | 4-5 | 4-5 |
| 512-bit FMA | 4-6 | 4-6 | 4-6 |
| AVX-512 mask operations | 1-3 | 1-3 | 1-3 |
| AVX-512 permutes | 3-8 | 3-8 | 3-8 |
| AVX-512 gathers/scatters | Variable | Variable | Variable |
| AVX-512 VNNI | Medium | Medium | Medium |
| AVX-512 BF16 | Medium | Medium | Medium |
| AVX-512 FP16 | Medium | Medium | Medium |
| AMX tile operations | Variable; high-latency but high-throughput | Variable; high-latency but high-throughput | Variable; high-latency but high-throughput |
| Divide/sqrt | 10-30+ | 10-30+ | 10-30+ |
For Intel server optimization, latency is not the only concern. Wide-vector and AMX code can be limited by:
- data packing;
- tile loading and storing;
- memory bandwidth;
- cache blocking;
- register pressure;
- mixed-width transitions;
- frequency behavior;
- NUMA effects;
- thread scheduling.
The key rule is:
Use AVX-512 and AMX where the workload has enough arithmetic intensity to justify them. For memory-bound code, wider instructions alone may not help.
AMD CPUs: Zen 3, Zen 4, and Zen 5
AMD’s recent SIMD story is easy to summarize at a high level:
| AMD generation | SIMD support profile |
|---|---|
| Zen 3 | SSE through AVX2 and FMA; no AVX-512 |
| Zen 4 | AVX-512 support added |
| Zen 5 | AVX-512 support continues and becomes stronger |
Representative latency classes:
| Operation group | Zen 3 | Zen 4 | Zen 5 |
|---|---|---|---|
| 128-bit integer add/logical | 1 | 1 | 1 |
| 256-bit integer add/logical | 1-2 | 1-2 | 1-2 |
| 512-bit integer add/logical | n/a | 1-3 | 1-2 |
| 128/256-bit FP add | 3-4 | 3-4 | 3-4 |
| 128/256-bit FP multiply | 3-5 | 3-5 | 3-5 |
| 128/256-bit FMA | 4-5 | 4-5 | 4-5 |
| 512-bit FP add | n/a | 3-5 | 3-5 |
| 512-bit FMA | n/a | 4-6 | 4-5 |
| Simple shuffle | 1-3 | 1-3 | 1-3 |
| Cross-lane permute | 3-8 | 3-8 | 3-8 |
| Conversion | 3-8 | 3-8 | 3-8 |
| Divide/sqrt | 10-30+ | 10-30+ | 10-30+ |
| Gather | Variable | Variable | Variable |
| AVX-512 VNNI | n/a | Medium | Medium |
| AVX-512 BF16 | n/a | Medium | Medium |
| AVX-512 FP16 | n/a | n/a | n/a |
| AMX | n/a | n/a | n/a |
Zen 4 added AVX-512 support, but developers should still check exact instruction subsets with CPUID.
Zen 5 strengthens AMD’s AVX-512 position and is especially important in server workloads such as EPYC Turin.
The practical AMD rule is:
For Zen 3, AVX2/FMA is the main target. For Zen 4 and Zen 5, AVX-512 becomes a realistic optimization path, especially for server, HPC, AI, analytics, compression, and data-processing workloads.
MMX and 3DNow! Latency
MMX and 3DNow! are legacy instruction sets, but they are still worth understanding.
MMX uses 64-bit MMX registers and packed integer operations.
3DNow! was AMD’s early packed floating-point SIMD extension using the MMX register file.
Typical latency patterns:
| Operation group | Typical latency class |
|---|---|
| MMX add/subtract/logical | 1-2 |
| MMX compare | 1-2 |
| MMX unpack/pack | 1-3 |
| MMX shift | 1-3 |
| MMX multiply | 3-5 |
| MMX state cleanup with EMMS | High enough to avoid inside loops |
| 3DNow! floating-point add/multiply | Medium |
| 3DNow! reciprocal/rsqrt approximations | Medium |
| 3DNow! special instructions | Lookup exact data |
For modern code, avoid MMX and 3DNow!.
Use SSE2 or later instead.
SSE and SSE2 Latency
SSE introduced 128-bit XMM registers and packed single-precision floating-point SIMD.
SSE2 added double-precision floating-point and 128-bit integer SIMD.
Typical latency patterns:
| Operation group | Examples | Typical latency class |
|---|---|---|
| SSE scalar/packed FP add | addss, addps | 3-4 |
| SSE scalar/packed FP multiply | mulss, mulps | 3-5 |
| SSE FP compare | cmpps, cmpss | 3-5 |
| SSE shuffle | shufps | 1-3 |
| SSE reciprocal approximation | rcpps, rsqrtps | Low to medium |
| SSE divide/sqrt | divps, sqrtps | Very high |
| SSE2 integer add/logical | padd*, pand, pxor | 1 |
| SSE2 integer multiply | pmullw, pmuludq | 3-5 |
| SSE2 double add/multiply | addpd, mulpd | 3-5 |
| SSE2 conversion | cvt* | 3-8 |
| SSE2 shift/unpack | psll*, punpck* | 1-3 |
SSE2 remains important because it is the baseline for x86-64.
Even if a program has AVX2 or AVX-512 optimized paths, SSE2 is often the first SIMD fallback.
SSE3, SSSE3, SSE4.1, and SSE4.2 Latency
The later SSE extensions added many useful instructions, especially for horizontal operations, byte shuffling, blending, string processing, and media code.
Representative latency patterns:
| Instruction set | Important instructions | Typical latency class |
|---|---|---|
| SSE3 | horizontal add/subtract | Medium |
| SSSE3 | pshufb, pmaddubsw, phadd*, pabs* | Low to medium |
| SSE4.1 | blends, dot product, min/max, insert/extract | Low to medium |
| SSE4.2 | string compare, CRC32 | Medium to variable |
| SSE4a | AMD-specific extract/insert/misaligned helpers | Lookup exact data |
SSSE3’s pshufb is especially important. It is often one of the most useful byte-level SIMD instructions in real-world code.
However, shuffles are rarely free. They can become a bottleneck if the algorithm constantly rearranges data.
A useful rule:
If your SIMD loop does more shuffling than arithmetic, the shuffle unit may be the bottleneck.
AES-NI and PCLMULQDQ Latency
AES-NI and PCLMULQDQ are SIMD-adjacent cryptographic extensions.
They use XMM registers and later gained wider vector forms through VAES and VPCLMULQDQ.
Typical latency patterns:
| Instruction family | Examples | Typical latency class |
|---|---|---|
| AES rounds | aesenc, aesenclast, aesdec | Medium |
| AES keygen assist | aeskeygenassist | Medium |
| Carry-less multiply | pclmulqdq | Medium to high |
| Vector AES | vaesenc | Medium |
| Vector carry-less multiply | vpclmulqdq | Medium to high |
| GFNI | gf2p8* | Medium |
Crypto code is often throughput-sensitive rather than latency-sensitive.
For example, AES can be optimized by processing many independent blocks in parallel. That hides the latency of individual AES rounds.
AVX and FMA Latency
AVX introduced 256-bit YMM registers and the VEX instruction encoding.
AVX1 mainly widened floating-point SIMD, while AVX2 later widened integer SIMD.
FMA added fused multiply-add operations.
Typical AVX/FMA latency patterns:
| Operation group | Examples | Typical latency class |
|---|---|---|
| 128-bit VEX FP add | vaddps xmm | 3-4 |
| 256-bit FP add | vaddps ymm | 3-4 |
| 128-bit VEX FP multiply | vmulps xmm | 3-5 |
| 256-bit FP multiply | vmulps ymm | 3-5 |
| FMA | vfmadd* | 4-5 |
| AVX shuffle | vshufps, vunpck* | 1-3 |
| AVX cross-lane permute | vperm2f128 | 3-8 |
| FP conversion | vcvt* | 3-8 |
| Divide/sqrt | vdiv*, vsqrt* | Very high |
FMA has higher latency than a simple add, but it does more work:
a * b + c
as one fused operation with a single rounding step.
For numerical kernels, FMA is usually excellent when the code has enough independent accumulators to hide latency.
AVX2 Latency
AVX2 extended 256-bit SIMD to integer operations.
This made AVX2 one of the most important modern SIMD targets.
Typical AVX2 latency patterns:
| Operation group | Examples | Typical latency class |
|---|---|---|
| Integer add/subtract | vpaddb, vpaddw, vpaddd | 1 |
| Integer logical | vpand, vpor, vpxor | 1 |
| Integer compare | vpcmpeq*, vpcmpgt* | 1-3 |
| Integer shift | vpsll*, vpsrl*, vpsra* | 1-3 |
| Integer multiply | vpmullw, vpmulld | 3-5 |
| Byte shuffle | vpshufb | 1-4 |
| Lane-crossing permute | vperm2i128, vpermd | 3-8 |
| Gather | vgather* | Variable, often expensive |
| Blend | vpblend* | 1-3 |
AVX2 is often the best practical target for portable high-performance x86 code because it is widely available on modern Intel and AMD processors.
However, not all AVX2 instructions are equally cheap.
The main danger areas are:
- gathers;
- cross-lane permutations;
- variable shifts;
- complex shuffles;
- memory bandwidth;
- frequency effects on some Intel CPUs.
AVX-512 Latency
AVX-512 introduced 512-bit ZMM registers and mask registers.
It also introduced many feature subsets, including:
- AVX-512F;
- AVX-512CD;
- AVX-512BW;
- AVX-512DQ;
- AVX-512VL;
- AVX-512IFMA;
- AVX-512VBMI;
- AVX-512VBMI2;
- AVX-512BITALG;
- AVX-512VPOPCNTDQ;
- AVX-512VNNI;
- AVX-512BF16;
- AVX-512FP16.
Typical AVX-512 latency patterns:
| Operation group | Examples | Typical latency class |
|---|---|---|
| 512-bit integer add/logical | vpadd*, vpand*, vpxor* | 1-2 |
| 512-bit integer compare | vpcmp* | 1-3 |
| 512-bit integer multiply | vpmull* | 3-6 |
| 512-bit FP add | vaddps, vaddpd | 3-5 |
| 512-bit FP multiply | vmulps, vmulpd | 3-5 |
| 512-bit FMA | vfmadd* | 4-6 |
| Mask operations | kortest, kand, kor, kxor | 1-3 |
| Masked vector operations | EVEX masked forms | Similar to unmasked or slightly more complex |
| Compress/expand | vcompress*, vexpand* | Medium to variable |
| Permute | vperm*, vshuf* | 3-8 |
| Gather/scatter | vgather*, vscatter* | Variable, often expensive |
| VNNI dot product | vpdpbusd, vpdpwssd | Medium |
| BF16 dot product | vdpbf16ps | Medium |
| FP16 arithmetic | vaddph, vmulph, vfmaddph | Medium |
| Divide/sqrt | vdiv*, vsqrt* | Very high |
AVX-512 can be very powerful, but it requires careful use.
The advantages are:
- wider vectors;
- mask registers;
- better predication;
- more powerful integer and floating-point operations;
- better support for AI, analytics, compression, and HPC kernels.
The risks are:
- frequency reduction on some processors;
- higher register pressure;
- expensive gathers/scatters;
- complex permutes;
- larger code size;
- need for runtime dispatch;
- differences between Intel and AMD implementations;
- differences between AVX-512 subsets.
A useful rule:
Use AVX-512 when the algorithm benefits from masks, wide vectors, or specialized instructions. Do not use it blindly just because it is available.
AVX-VNNI and AVX-512 VNNI Latency
VNNI stands for Vector Neural Network Instructions.
The core idea is to combine multiply and add operations commonly used in integer neural-network inference.
Representative instructions include:
vpdpbusd
vpdpbusds
vpdpwssd
vpdpwssds
Typical latency is medium, but throughput and data reuse are usually more important than single-instruction latency.
| Instruction family | Main use | Typical latency class |
|---|---|---|
| AVX-VNNI | INT8/INT16 dot products without full AVX-512 | Medium |
| AVX-512 VNNI | 512-bit INT8/INT16 dot products | Medium |
| AVX-VNNI-INT8 | Newer INT8 dot-product forms | Medium |
| AVX-VNNI-INT16 | Newer INT16 dot-product forms | Medium |
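What a VNNI dot-product instruction computes can be written as a scalar reference model. The sketch below models one 32-bit lane of `vpdpbusd` (unsigned bytes times signed bytes, four products summed into a 32-bit accumulator, no saturation). The function names are illustrative, and the code is a plain C model for reasoning and testing, not a replacement for the instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one 32-bit lane of vpdpbusd:
   acc += a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3]
   where a[] holds unsigned bytes and b[] holds signed bytes. */
int32_t vpdpbusd_lane(int32_t acc, const uint8_t a[4], const int8_t b[4])
{
    int32_t sum = 0;
    for (int j = 0; j < 4; ++j)
        sum += (int32_t)a[j] * (int32_t)b[j];
    /* Wrap on overflow instead of saturating (vpdpbusds is the saturating form). */
    return (int32_t)((uint32_t)acc + (uint32_t)sum);
}

/* Dot product of n bytes (n assumed to be a multiple of 4),
   built from the lane model above. */
int32_t dot_u8s8(const uint8_t *a, const int8_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i += 4)
        acc = vpdpbusd_lane(acc, a + i, b + i);
    return acc;
}
```

In this model every step feeds the same accumulator, which is the dependency chain a real kernel breaks up by keeping several accumulator registers in flight.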
For inference kernels, performance depends heavily on:
- data layout;
- cache blocking;
- quantization format;
- accumulation strategy;
- number of independent accumulators;
- memory bandwidth;
- whether the workload fits in cache.
BF16 and FP16 Latency
BF16 and FP16 are important for AI and some numerical workloads.
BF16 keeps the same exponent width as FP32 but uses fewer mantissa bits. It is common in machine learning.
FP16 (IEEE half precision) uses a 5-bit exponent and a 10-bit mantissa, so compared with BF16 it offers more precision but much less range. It is common in graphics, AI, and storage formats.
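The relationship between FP32 and BF16 can be shown with a small software conversion. This is an illustrative sketch with made-up function names: it keeps the sign, the 8 exponent bits, and the top 7 mantissa bits, and rounds to nearest with ties to even. Hardware conversion instructions may treat special cases such as denormals differently, so check the exact instruction when bit-exact results matter.

```c
#include <stdint.h>
#include <string.h>

/* FP32 -> BF16: keep the upper 16 bits, rounding to nearest even. */
uint16_t float_to_bf16(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);            /* type-pun without undefined behavior */
    if ((bits & 0x7FFFFFFFu) > 0x7F800000u)    /* NaN: return a quiet NaN */
        return (uint16_t)((bits >> 16) | 0x0040u);
    bits += 0x7FFFu + ((bits >> 16) & 1u);     /* round to nearest, ties to even */
    return (uint16_t)(bits >> 16);
}

/* BF16 -> FP32: exact, the low 16 mantissa bits become zero. */
float bf16_to_float(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

For example, 3.14159265f becomes the BF16 pattern 0x4049, which converts back to 3.140625f: the range of FP32 is preserved, but only about two to three decimal digits survive.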
Representative latency classes:
| Instruction family | Main use | Typical latency class |
|---|---|---|
| F16C conversion | FP16 to/from FP32 conversion | Medium |
| AVX-512 BF16 | BF16 dot products and conversion | Medium |
| AVX-512 FP16 | FP16 arithmetic | Medium |
| AVX10.2 low-precision extensions | Future AI/numerical formats | Check exact data |
For AI workloads, BF16/FP16 performance is often throughput-bound rather than latency-bound.
The key is to feed the units with enough independent work.
AVX10 Latency
AVX10 is Intel’s future-facing vector ISA direction.
It is designed to converge the AVX-512 programming model across future Intel P-core and E-core processors using a versioned feature model.
AVX10 should not be treated as a single fixed-latency instruction set. It is a versioned family.
From a latency perspective, the right way to think about AVX10 is:
- AVX10 inherits much of the AVX-512-style programming model;
- exact latency depends on the AVX10 version;
- exact latency depends on whether the implementation supports 128-bit, 256-bit, or 512-bit vector lengths;
- exact latency depends on the core type;
- future instructions must be checked in current references.
A practical early AVX10 latency table is therefore:
| AVX10 category | Expected latency style |
|---|---|
| Simple integer/vector logical | Very low to low |
| FP add/multiply/FMA | Medium |
| Mask operations | Low |
| Permutes/shuffles | Low to high depending on complexity |
| New AI/data conversion operations | Medium to variable |
| Wider 512-bit operations | Check exact CPU and frequency behavior |
The rule for developers is:
Treat AVX10 as a future dispatch target, not as a replacement for checking exact CPU features and measured latency.
AMX Latency
AMX stands for Advanced Matrix Extensions.
AMX is not traditional SIMD. It uses tile registers and tile operations.
It is designed for dense matrix operations such as:
- INT8 matrix multiplication;
- BF16 matrix multiplication;
- FP16 matrix multiplication on newer products;
- AI inference and training kernels.
AMX instructions can have high latency, but that is not the main issue. AMX is designed for high throughput over large blocks of work.
Representative AMX latency map:
| AMX operation group | Examples | Latency interpretation |
|---|---|---|
| Tile configuration | ldtilecfg | Setup overhead; keep outside hot inner loops |
| Tile load/store | tileloadd, tilestored | Memory/cache dependent |
| INT8 tile dot product | tdpbusd, tdpbuud | High-latency but high-throughput |
| BF16 tile dot product | tdpbf16ps | High-latency but high-throughput |
| FP16 tile dot product | newer AMX FP16 forms | Check exact CPU |
| Tile release | tilerelease | State-management overhead |
The optimization rule for AMX is different from the rule for simple SIMD:
Do not think about AMX as a single instruction latency problem. Think about blocking, packing, tile reuse, memory hierarchy, and throughput.
Representative Latency Map: Recent Intel vs AMD
The following table provides a compact view of recent Intel and AMD SIMD latency classes.
| Operation group | Intel recent client | Intel recent server | AMD Zen 3 | AMD Zen 4 | AMD Zen 5 |
|---|---|---|---|---|---|
| 128-bit integer add/logical | 1 | 1 | 1 | 1 | 1 |
| 256-bit integer add/logical | 1 | 1 | 1-2 | 1-2 | 1-2 |
| 512-bit integer add/logical | n/a | 1-2 | n/a | 1-3 | 1-2 |
| 128-bit FP add | 3-4 | 3-4 | 3-4 | 3-4 | 3-4 |
| 256-bit FP add | 3-4 | 3-4 | 3-4 | 3-4 | 3-4 |
| 512-bit FP add | n/a | 3-5 | n/a | 3-5 | 3-5 |
| 128/256-bit FMA | 4-5 | 4-5 | 4-5 | 4-5 | 4-5 |
| 512-bit FMA | n/a | 4-6 | n/a | 4-6 | 4-5 |
| Integer multiply | 3-5 | 3-6 | 3-5 | 3-6 | 3-6 |
| Simple shuffle | 1-3 | 1-3 | 1-3 | 1-3 | 1-3 |
| Cross-lane permute | 3-8 | 3-8 | 3-8 | 3-8 | 3-8 |
| Conversion | 3-8 | 3-8 | 3-8 | 3-8 | 3-8 |
| Divide/sqrt | 10-30+ | 10-30+ | 10-30+ | 10-30+ | 10-30+ |
| Gather | Variable | Variable | Variable | Variable | Variable |
| Scatter | n/a or limited | Variable | n/a | Variable | Variable |
| AES/VAES | Medium | Medium | Medium | Medium | Medium |
| PCLMUL/VPCLMUL | Medium to high | Medium to high | Medium to high | Medium to high | Medium to high |
| VNNI | SKU-dependent | Medium | n/a | Medium with AVX-512 VNNI | Medium |
| BF16 | SKU-dependent | Medium | n/a | Medium where supported | Medium |
| FP16 | SKU-dependent | Medium | n/a | SKU-dependent | Medium where supported |
| AMX | n/a | Variable/high-throughput | n/a | n/a | n/a |
This table intentionally avoids pretending that every instruction has one universal latency.
The correct conclusion is:
Recent Intel and AMD CPUs have broadly similar latency classes for simple SIMD arithmetic, but differ significantly in supported instruction sets, vector width, execution resources, AMX availability, AVX-512 implementation, and frequency behavior.
Why Division and Square Root Are Special
Floating-point division and square root are much slower than add, multiply, or FMA.
For example:
vaddps ymm0, ymm1, ymm2
vmulps ymm3, ymm4, ymm5
vdivps ymm6, ymm7, ymm8
vsqrtps ymm9, ymm10
The add and multiply instructions are usually medium-latency and high-throughput.
The divide and square-root instructions are much higher latency and lower throughput.
This is why optimized numerical code often tries to replace division with multiplication by a reciprocal when acceptable:
x / y
can sometimes become:
x * (1 / y)
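If several values use the same divisor, computing the reciprocal once and multiplying many times can be much faster. The result can differ from true division in the last bit, which is why compilers only apply this transformation automatically under relaxed floating-point options such as -ffast-math.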
For approximate math, reciprocal approximation instructions may be useful, followed by one or more Newton-Raphson refinement steps when more precision is needed.
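As a rough illustration of the reciprocal-plus-refinement idea, the sketch below uses the SSE estimate instruction (`_mm_rcp_ps`) followed by one Newton-Raphson step, r1 = r0 * (2 - y * r0). The function names `recip_nr_ps` and `div_approx_ps` are made up for this example, and the 128-bit width is chosen only so that it builds on any x86-64 compiler without extra flags. The result is approximate, and inputs such as zero or infinity are not handled.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: _mm_rcp_ps, _mm_mul_ps, ... */

/* Approximate 1/y for four floats: hardware estimate plus one
   Newton-Raphson step, r1 = r0 * (2 - y * r0).
   Hypothetical helper; not suitable for y == 0 or infinities. */
static inline __m128 recip_nr_ps(__m128 y)
{
    __m128 r0  = _mm_rcp_ps(y);            /* low-precision estimate */
    __m128 two = _mm_set1_ps(2.0f);
    return _mm_mul_ps(r0, _mm_sub_ps(two, _mm_mul_ps(y, r0)));
}

/* out[i] is approximately x[i] / y[i]. Hypothetical helper name. */
void div_approx_ps(float *out, const float *x, const float *y, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(out + i, _mm_mul_ps(vx, recip_nr_ps(vy)));
    }
    for (; i < n; ++i)                     /* scalar tail: exact division */
        out[i] = x[i] / y[i];
}
```

Whether this beats a plain vdivps depends on the CPU and on how much precision the caller needs, so it is worth measuring both.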
Why Shuffles Often Dominate SIMD Performance
SIMD arithmetic is usually cheap.
Data rearrangement is often expensive.
For example, image code may need to transform data from this layout:
RGB RGB RGB RGB
into this layout:
RRRR GGGG BBBB
That transformation requires unpacking, shuffling, permuting, or blending.
In many SIMD kernels, the actual arithmetic is not the bottleneck. The bottleneck is moving data into the right lanes.
Common expensive or bottleneck-prone operations include:
- byte shuffles;
- cross-lane permutes;
- horizontal reductions;
- gather/scatter;
- compress/expand;
- format conversions;
- matrix packing;
- transposes.
A useful SIMD optimization rule is:
Before making arithmetic faster, make data layout easier.
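To make the layout rule concrete, here is a minimal sketch in plain C that converts interleaved RGB pixels to three planar arrays once, so that later per-channel work becomes a simple contiguous loop. The function names and the fixed-point scaling step are illustrative assumptions, not code from any particular library.

```c
#include <stddef.h>
#include <stdint.h>

/* Split interleaved RGBRGB... pixels into three planar arrays.
   Doing this once, outside the hot loop, pays for the rearrangement
   a single time. Hypothetical helper name. */
void rgb_to_planar(const uint8_t *rgb, size_t pixels,
                   uint8_t *r, uint8_t *g, uint8_t *b)
{
    for (size_t i = 0; i < pixels; ++i) {
        r[i] = rgb[3 * i + 0];
        g[i] = rgb[3 * i + 1];
        b[i] = rgb[3 * i + 2];
    }
}

/* With planar data, a per-channel operation is a contiguous loop with
   no lane rearrangement; compilers can usually auto-vectorize it.
   Computes c[i] = (c[i] * num) >> shift as an example operation. */
void scale_channel(uint8_t *c, size_t n, uint8_t num, uint8_t shift)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = (uint8_t)((c[i] * num) >> shift);
}
```

The conversion itself still costs shuffles or strided accesses, so this only pays off when the planar data is reused across several passes.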
Why Gathers and Scatters Are Variable
Gather instructions load multiple elements from unrelated memory addresses into one vector.
Scatter instructions store multiple vector elements to unrelated memory addresses.
These are powerful, but their latency is highly variable because memory dominates the cost.
A gather from L1 cache may be reasonable.
A gather from L3 cache or main memory may be extremely expensive.
A gather with repeated cache misses is not really an “instruction latency” problem anymore. It is a memory-system problem.
The same applies to scatter.
Use gathers and scatters when they simplify an algorithm or when the memory pattern is unavoidable, but do not expect them to behave like ordinary aligned vector loads and stores.
Wide Vectors and Frequency Effects
On some CPUs, especially some Intel generations, heavy AVX2 or AVX-512 code can reduce core frequency.
This happens because wide vector operations consume more power and create more thermal pressure.
The performance trade-off is not always obvious.
A 512-bit instruction may process twice as much data as a 256-bit instruction, but if the CPU lowers frequency significantly, the overall speedup may be smaller than expected.
This is workload-dependent.
A practical rule is:
Benchmark 128-bit, 256-bit, and 512-bit implementations on the actual target CPU. Do not assume wider is always faster.
For server code, this is especially important when a small amount of AVX-512 code is mixed into a mostly scalar or AVX2 service. The wide-vector section may affect the frequency of surrounding code.
Latency and Dependency Chains
Latency matters most when operations depend on previous results.
Example: a reduction sum.
float sum = 0.0f;
for (int i = 0; i < n; ++i)
{
    sum += a[i];
}
Even if vectorized, a naive reduction can become a dependency chain:
sum0 -> sum1 -> sum2 -> sum3 -> ...
The solution is to use multiple accumulators:
sum0 += a[i + 0];
sum1 += a[i + 1];
sum2 += a[i + 2];
sum3 += a[i + 3];
Then combine the partial sums at the end.
In SIMD code, this often means using several independent vector accumulators:
__m256 acc0 = _mm256_setzero_ps();
__m256 acc1 = _mm256_setzero_ps();
__m256 acc2 = _mm256_setzero_ps();
__m256 acc3 = _mm256_setzero_ps();
for (size_t i = 0; i + 32 <= n; i += 32)
{
    acc0 = _mm256_add_ps(acc0, _mm256_loadu_ps(a + i + 0));
    acc1 = _mm256_add_ps(acc1, _mm256_loadu_ps(a + i + 8));
    acc2 = _mm256_add_ps(acc2, _mm256_loadu_ps(a + i + 16));
    acc3 = _mm256_add_ps(acc3, _mm256_loadu_ps(a + i + 24));
}
This gives the CPU independent work and hides add latency.
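For completeness, a full reduction also has to combine the accumulators and handle the elements left over after the vector loop. The sketch below shows one way to do that. It uses 128-bit SSE intrinsics rather than the 256-bit ones above only so that it builds on any x86-64 compiler without extra flags; the structure is the same for `__m256`. The name `sum_f32` is an assumption for this example.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE */

/* Sum n floats using four independent vector accumulators.
   Note: the summation order differs from a simple scalar loop,
   so the floating-point result may differ slightly. */
float sum_f32(const float *a, size_t n)
{
    __m128 acc0 = _mm_setzero_ps();
    __m128 acc1 = _mm_setzero_ps();
    __m128 acc2 = _mm_setzero_ps();
    __m128 acc3 = _mm_setzero_ps();
    size_t i = 0;

    /* Main loop: four independent dependency chains. */
    for (; i + 16 <= n; i += 16) {
        acc0 = _mm_add_ps(acc0, _mm_loadu_ps(a + i + 0));
        acc1 = _mm_add_ps(acc1, _mm_loadu_ps(a + i + 4));
        acc2 = _mm_add_ps(acc2, _mm_loadu_ps(a + i + 8));
        acc3 = _mm_add_ps(acc3, _mm_loadu_ps(a + i + 12));
    }

    /* Combine the accumulators element-wise, then reduce the lanes once. */
    __m128 acc = _mm_add_ps(_mm_add_ps(acc0, acc1), _mm_add_ps(acc2, acc3));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float sum = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);

    /* Scalar tail for the remaining 0..15 elements. */
    for (; i < n; ++i)
        sum += a[i];
    return sum;
}
```

Four accumulators is an illustrative choice; the best count depends on the add latency and throughput of the target CPU.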
Latency and Loop Unrolling
Loop unrolling can improve SIMD performance because it exposes more independent instructions.
For example, instead of processing one vector per iteration:
for (size_t i = 0; i + 8 <= n; i += 8)
{
    __m256 v = _mm256_loadu_ps(a + i);
    acc = _mm256_add_ps(acc, v);
}
process multiple vectors:
for (size_t i = 0; i + 32 <= n; i += 32)
{
    __m256 v0 = _mm256_loadu_ps(a + i + 0);
    __m256 v1 = _mm256_loadu_ps(a + i + 8);
    __m256 v2 = _mm256_loadu_ps(a + i + 16);
    __m256 v3 = _mm256_loadu_ps(a + i + 24);
    acc0 = _mm256_add_ps(acc0, v0);
    acc1 = _mm256_add_ps(acc1, v1);
    acc2 = _mm256_add_ps(acc2, v2);
    acc3 = _mm256_add_ps(acc3, v3);
}
This helps hide latency and improve throughput.
However, too much unrolling can increase register pressure and code size.
The right amount depends on:
- instruction latency;
- throughput;
- available registers;
- compiler register allocation;
- cache behavior;
- target microarchitecture.
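A common rule of thumb for picking the unroll factor is that a pipelined unit stays busy when roughly latency × (instructions issued per cycle) independent chains are in flight. The tiny sketch below encodes that rule and a crude register-budget check. Both function names are hypothetical, the model ignores everything except latency, issue rate, and register count, and any numbers fed into it should come from measurements or a detailed reference for the target CPU.

```c
/* Rule of thumb: number of independent dependency chains needed to
   hide latency, given how many such instructions can start per cycle. */
unsigned min_independent_chains(unsigned latency_cycles,
                                unsigned issues_per_cycle)
{
    return latency_cycles * issues_per_cycle;
}

/* Crude register-pressure check: one accumulator per chain plus some
   temporaries per chain must fit in the architectural vector registers
   (16 for SSE/AVX2 in 64-bit mode, 32 with AVX-512). Returns 1 or 0. */
int chains_fit(unsigned chains, unsigned temps_per_chain,
               unsigned vector_regs)
{
    return chains * (1u + temps_per_chain) <= vector_regs;
}
```

For example, an instruction with an assumed 4-cycle latency that can issue twice per cycle would want about 8 chains, and 8 chains with one temporary each exactly fill 16 registers. Treat that as a starting point for benchmarking, not a guarantee.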
Latency and Memory Loads
A SIMD instruction that uses a memory operand may look like one instruction, but the load still has to happen.
For example:
vaddps ymm0, ymm0, [rdi]
This performs a memory load and a vector add.
If the data is in L1 cache, the load may be fast.
If the data misses in cache, the operation can take far longer than the arithmetic latency.
Approximate memory latency classes:
| Source | Typical latency scale |
|---|---|
| Register | No memory load |
| L1 cache | Low |
| L2 cache | Medium |
| L3 cache | High |
| Main memory | Very high |
| Page miss / TLB miss (page-table walk) | Very high; adds to the cost of the underlying cache or memory access |
| Remote NUMA memory | Extremely high |
For many SIMD loops, memory bandwidth and cache locality matter more than arithmetic instruction latency.
Alignment and Latency
Modern CPUs handle unaligned vector loads much better than early SSE processors did.
However, alignment can still matter.
Unaligned loads and stores may be slower when they:
- cross cache-line boundaries;
- cross page boundaries;
- interact badly with store forwarding;
- create split loads;
- increase memory-system pressure.
A practical rule:
- use unaligned loads when alignment is unknown;
- align data when designing performance-critical data structures;
- avoid crossing cache-line boundaries unnecessarily;
- benchmark before writing complicated alignment prologues.
For most modern code, simple unaligned loads are good enough, especially when the data is in cache and the loop is not close to the memory bandwidth limit.
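When deciding whether an alignment prologue is worth writing, it can help to check how often accesses actually straddle a cache line. The helper below is a minimal sketch of that check. It assumes a 64-byte cache line, which is typical for current x86 CPUs but is still an assumption, and the function name is invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u  /* assumption: typical x86 line size */

/* Returns 1 if an access of access_bytes starting at p touches two
   cache lines, 0 if it stays inside one. */
int crosses_cache_line(const void *p, size_t access_bytes)
{
    uintptr_t offset = (uintptr_t)p % CACHE_LINE_BYTES;
    return offset + access_bytes > CACHE_LINE_BYTES;
}
```

A 32-byte load at offset 48 within a line crosses into the next line, while the same load at offset 0 or 32 does not. Counting such cases in a debug build gives a quick sense of whether alignment work could matter at all.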
Instruction Encoding Matters: SSE vs VEX vs EVEX
The same logical operation may exist in several encodings.
For example:
addps xmm0, xmm1 ; legacy SSE
vaddps xmm0, xmm1, xmm2 ; VEX encoded AVX
vaddps zmm0, zmm1, zmm2 ; EVEX encoded AVX-512
The VEX and EVEX encodings can provide advantages:
- three-operand non-destructive forms;
- better register usage;
- access to wider registers;
- access to masks in AVX-512;
- cleaner dependency behavior in some cases.
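Legacy SSE code can create false dependencies if not written carefully, for example when it is mixed with 256-bit VEX-encoded code without an intervening vzeroupper, which on some Intel microarchitectures also causes state-transition penalties. Modern compilers usually prefer VEX-encoded instructions when AVX is enabled.
A practical rule:
When targeting AVX or later, let the compiler generate VEX/EVEX encodings consistently. Avoid mixing legacy SSE and AVX code unnecessarily, and make sure vzeroupper is executed before 256-bit code hands control to code that may use legacy SSE encodings.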
SIMD Latency by Data Type
The same operation can have different latency depending on data type.
| Data type | Typical latency notes |
|---|---|
| 8-bit integer | Simple operations are very cheap; multiplication is limited or indirect |
| 16-bit integer | Common in media/audio; multiply support is good |
| 32-bit integer | Add/logical cheap; multiply moderate |
| 64-bit integer | Add/logical cheap; multiply can be more expensive |
| 32-bit float | Add/mul/FMA well optimized |
| 64-bit double | Add/mul/FMA well optimized but fewer lanes per vector |
| FP16 | Newer support; check exact CPU |
| BF16 | AI-oriented; throughput matters more than single-instruction latency |
| INT8 | Important for inference; VNNI/AMX can be much faster than plain SIMD |
| Mask registers | Usually cheap, but interaction with vector operations matters |
Data type matters because execution units are not identical for every operation.
For example, integer add and integer multiply are very different internally. Floating-point add and floating-point divide are also very different.
SIMD Latency and Reductions
Horizontal reductions are often latency-sensitive.
For example, summing all lanes of a vector requires moving data across lanes:
[a0 a1 a2 a3 a4 a5 a6 a7] -> a0+a1+...+a7
This is not a simple element-wise operation.
It usually requires:
- shuffles;
- adds;
- extracts;
- lane crossing;
- scalar cleanup.
A typical reduction strategy is:
- Accumulate into several independent vector accumulators.
- After the loop, combine the accumulators with element-wise adds.
- Perform a single horizontal reduction on the combined vector.
Avoid reducing inside the main loop unless necessary.
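The horizontal step itself can be sketched as follows for a 128-bit vector, using one move, one shuffle, and two adds. The name `hsum_ps` is an assumption for this example. For a 256-bit vector, a common approach is to first add the upper and lower 128-bit halves and then apply the same sequence.

```c
#include <xmmintrin.h>  /* SSE */

/* Horizontal sum of the four lanes of v = [a0 a1 a2 a3]. */
float hsum_ps(__m128 v)
{
    __m128 hi   = _mm_movehl_ps(v, v);     /* [a2 a3 a2 a3]            */
    __m128 sum2 = _mm_add_ps(v, hi);       /* [a0+a2, a1+a3, ., .]     */
    __m128 odd  = _mm_shuffle_ps(sum2, sum2,
                                 _MM_SHUFFLE(1, 1, 1, 1)); /* lane 1 everywhere */
    __m128 sum1 = _mm_add_ss(sum2, odd);   /* lane 0 = (a0+a2)+(a1+a3) */
    return _mm_cvtss_f32(sum1);
}
```

Every instruction here depends on the previous one, which is exactly why this belongs after the loop rather than inside it.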
SIMD Latency and Dot Products
Dot products combine multiplication and addition.
For floating-point code, FMA is ideal:
acc = a * b + acc
The latency of FMA may be 4-5 cycles, but a well-unrolled loop with multiple accumulators can reach high throughput.
For integer neural-network inference, VNNI and AMX provide more specialized dot-product operations.
| Dot-product style | Best instruction family |
|---|---|
| FP32 dot product | AVX2/FMA or AVX-512/FMA |
| FP64 dot product | AVX2/FMA or AVX-512/FMA |
| INT8 dot product | AVX-VNNI, AVX-512 VNNI, or AMX-INT8 |
| BF16 dot product | AVX-512 BF16 or AMX-BF16 |
| FP16 dot product | AVX-512 FP16 or AMX-FP16 where available |
Dot products are usually throughput-bound if written correctly.
The optimization target is not one instruction’s latency, but enough independent accumulators to keep the execution units busy.
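To show what the INT8 dot-product instructions compute and how the multiple-accumulator idea carries over, here is a scalar reference model in plain C. `dot4_u8s8` mimics one 32-bit lane of a vpdpbusd-style operation (four unsigned-byte by signed-byte products added to a 32-bit accumulator), and `dot_u8s8` keeps four independent accumulators. Both names are invented for this sketch, and it assumes the sums do not overflow a 32-bit integer.

```c
#include <stddef.h>
#include <stdint.h>

/* One 32-bit lane of a vpdpbusd-style step:
   acc + u[0]*s[0] + u[1]*s[1] + u[2]*s[2] + u[3]*s[3].
   Assumes no 32-bit overflow. Hypothetical helper. */
static int32_t dot4_u8s8(int32_t acc, const uint8_t *u, const int8_t *s)
{
    for (int k = 0; k < 4; ++k)
        acc += (int32_t)u[k] * (int32_t)s[k];
    return acc;
}

/* Dot product of n unsigned bytes with n signed bytes, using four
   independent accumulators so the chains can overlap. */
int32_t dot_u8s8(const uint8_t *u, const int8_t *s, size_t n)
{
    int32_t acc[4] = {0, 0, 0, 0};
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        acc[0] = dot4_u8s8(acc[0], u + i + 0,  s + i + 0);
        acc[1] = dot4_u8s8(acc[1], u + i + 4,  s + i + 4);
        acc[2] = dot4_u8s8(acc[2], u + i + 8,  s + i + 8);
        acc[3] = dot4_u8s8(acc[3], u + i + 12, s + i + 12);
    }
    int32_t total = (acc[0] + acc[1]) + (acc[2] + acc[3]);
    for (; i < n; ++i)                     /* scalar tail */
        total += (int32_t)u[i] * (int32_t)s[i];
    return total;
}
```

A model like this is mainly useful as a correctness reference when testing a VNNI or AMX implementation against known inputs.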
How to Use Latency Data in Practice
When optimizing SIMD code, follow this process.
1. Identify the hot loop
Do not optimize random code.
Use profiling first.
Find the loop or function that actually matters.
2. Determine the bottleneck
Ask whether the loop is limited by:
- arithmetic latency;
- arithmetic throughput;
- memory bandwidth;
- cache misses;
- shuffles;
- gathers/scatters;
- branch misprediction;
- stores;
- conversions;
- horizontal reductions.
Latency tables help mainly when the loop is dependency-bound.
3. Look for dependency chains
Dependency chains are where latency hurts most.
Common examples:
- reductions;
- prefix sums;
- recurrence relations;
- repeated multiply-add into one accumulator;
- scalar control values extracted from vectors;
- long chains of conversions or shuffles.
4. Add independent accumulators
If the loop is latency-bound, create independent work.
For example:
one accumulator -> latency-bound
four accumulators -> much easier to pipeline
5. Reduce shuffles
If shuffles dominate, change the data layout.
Better data layout often beats clever instruction selection.
6. Avoid unnecessary conversions
Conversions between integer and floating-point domains can add latency and reduce throughput.
Try to keep data in one representation as long as possible.
7. Use the right vector width
Try 128-bit, 256-bit, and 512-bit implementations when possible.
The best width depends on:
- CPU;
- workload;
- frequency behavior;
- memory bandwidth;
- register pressure;
- instruction mix.
8. Check exact instruction data
When performance really matters, look up exact latency, throughput, µop count, and port usage for the target CPU.
Do not rely on generic assumptions.
Common Mistakes
Mistake 1: Confusing latency with throughput
A 4-cycle latency instruction may still have excellent throughput if many independent operations are available.
Mistake 2: Optimizing for latency when the loop is memory-bound
If the loop is waiting on memory, changing an add from 4 cycles to 3 cycles will not help.
Mistake 3: Assuming wider SIMD is always faster
AVX-512 is not automatically faster than AVX2.
Wider vectors can increase throughput, but they can also increase register pressure, memory pressure, and frequency effects.
Mistake 4: Ignoring shuffles
Many SIMD loops are limited by data rearrangement, not arithmetic.
Mistake 5: Using gathers as if they were normal loads
Gather latency is highly variable and often expensive.
Mistake 6: Using one accumulator in a reduction
One accumulator creates a dependency chain. Use multiple accumulators.
Mistake 7: Assuming CPU generation is enough
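Always check the CPUID feature bits and confirm operating-system support: the OS must enable the wider register state (reported through XGETBV/XCR0) before AVX, AVX-512, or AMX instructions can be used.
A product name is not enough to know which SIMD features are available.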
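One possible way to act on this in C is runtime dispatch. The sketch below assumes GCC or Clang on x86, where `__builtin_cpu_supports` and the `target` function attribute are available; the function names are invented for the example. In current GCC and Clang runtimes the AVX check is understood to take OS-enabled register state into account as well, but that is worth verifying for your toolchain.

```c
#include <stddef.h>
#include <immintrin.h>

/* Portable fallback. */
static float sum_scalar(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* The target attribute lets this one function use AVX even when the
   rest of the file is compiled for baseline x86-64 (GCC/Clang only).
   A single accumulator is used here purely to keep the sketch short. */
__attribute__((target("avx")))
static float sum_avx(const float *a, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(a + i));
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float s = 0.0f;
    for (int k = 0; k < 8; ++k)
        s += lanes[k];
    for (; i < n; ++i)                     /* scalar tail */
        s += a[i];
    return s;
}

/* Pick an implementation at run time instead of trusting the CPU name. */
float sum_dispatch(const float *a, size_t n)
{
    if (__builtin_cpu_supports("avx"))
        return sum_avx(a, n);
    return sum_scalar(a, n);
}
```

In a real project the check would usually be done once at startup and cached in a function pointer, rather than repeated on every call.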
Mistake 8: Copying latency numbers without checking operand form
The latency of an instruction may differ depending on which input operand the output depends on.
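Exact references often list a separate latency for each input-operand-to-result pair, so make sure the number you copy matches the operand that sits on your dependency chain.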
Mistake 9: Forgetting P-core vs E-core differences
Modern Intel CPUs may contain different core types with different execution resources.
Latency and throughput can differ between P-cores and E-cores.
Mistake 10: Ignoring compiler output
Intrinsics do not guarantee ideal machine code.
Inspect generated assembly when performance matters.
Practical Recommendations
For broad x86-64 compatibility:
- use SSE2 as the baseline;
- keep scalar fallbacks where needed;
- avoid MMX in new code.
For modern desktop performance:
- use AVX2 and FMA when available;
- consider AVX-VNNI for inference workloads on supported CPUs;
- do not assume AVX-512 on Intel client CPUs.
For recent AMD performance:
- use AVX2/FMA for Zen 3;
- add AVX-512 paths for Zen 4 and Zen 5 where appropriate;
- check exact AVX-512 subsets.
For recent Intel server performance:
- use AVX-512 for suitable HPC, analytics, compression, and AI kernels;
- use AMX for dense matrix AI workloads;
- benchmark vector width and frequency effects.
For latency-sensitive code:
- break dependency chains;
- use multiple accumulators;
- avoid unnecessary conversions;
- avoid horizontal reductions in the inner loop;
- reduce cross-lane shuffles.
For throughput-sensitive code:
- unroll loops;
- expose independent operations;
- keep data in cache;
- use aligned and contiguous data layouts where possible;
- avoid memory bottlenecks.
Summary
SIMD instruction latency has become much more complex since the early days of MMX and SSE.
Modern x86 CPUs support a wide range of SIMD instruction families, from 64-bit MMX to 128-bit SSE, 256-bit AVX and AVX2, 512-bit AVX-512, versioned AVX10, and matrix-oriented AMX.
The most important lessons are:
- Simple SIMD add, subtract, logical, compare, and shift operations are usually cheap.
- Floating-point add, multiply, and FMA have moderate latency but excellent throughput when independent work exists.
- Integer multiply, conversions, carry-less multiply, crypto operations, and complex shuffles require more attention.
- Division, square root, gathers, scatters, and cache-control operations are expensive or variable.
- AVX-512 and AMX can be extremely powerful, but only when the workload fits their execution model.
- Recent Intel client CPUs are mainly AVX2/FMA-class SIMD targets.
- Recent Intel server CPUs are AVX-512 and AMX targets.
- AMD Zen 3 is mainly an AVX2/FMA target.
- AMD Zen 4 and Zen 5 make AVX-512 a realistic cross-vendor optimization path.
- Exact latency must be checked for the exact instruction, operand form, vector width, and CPU.
The practical SIMD optimization rule is:
Use latency data to understand dependency chains, but optimize the whole loop: data layout, memory behavior, throughput, shuffles, vector width, and CPU-specific execution resources all matter.
References
- uops.info Instruction Table
- uops.info: Characterizing Latency, Throughput, and Port Usage of Instructions on Intel Microarchitectures
- Intel Intrinsics Guide
- Intel 64 and IA-32 Architectures Optimization Reference Manual
- Intel 64 and IA-32 Architectures Software Developer’s Manual
- Intel Advanced Vector Extensions 10 Technical Paper
- Intel Advanced Vector Extensions 10.2 Architecture Specification
- AMD Processor Programming Reference and Optimization Guides
- AMD: Leadership HPC Performance with 5th Generation AMD EPYC Processors
- Agner Fog Optimization Manuals and Instruction Tables
- Microsoft x86 Intrinsics List
- GCC x86 Options
- LLVM X86 Backend Documentation


