  • SIMD on x64/x86

    SSE Data Movement

    MOVAPS transfers 128 bits of packed data between memory and a SIMD floating-point (XMM) register, or between two XMM registers, and requires the memory operand to be 16-byte aligned; MOVUPS performs the same transfer but makes no assumption about alignment. MOVHPS transfers 64 bits of packed data from memory to the upper two fields of a SIMD floating-point register and vice versa,…
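    As a rough illustration of the three moves described above, here is a minimal C sketch using the standard SSE intrinsics from `<xmmintrin.h>`. The helper names (`copy4_aligned`, `copy4_unaligned`, `load_high_pair`) are hypothetical and exist only for this example, and a compiler will usually, though not necessarily, map these intrinsics to MOVAPS, MOVUPS, and MOVHPS respectively.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h> /* SSE intrinsics */

/* Copy four floats with an aligned load/store pair (typically MOVAPS).
   Both pointers must be 16-byte aligned; the assert documents that rule. */
static void copy4_aligned(const float *src, float *dst) {
    assert(((uintptr_t)src & 15u) == 0 && ((uintptr_t)dst & 15u) == 0);
    _mm_store_ps(dst, _mm_load_ps(src));
}

/* Copy four floats with no alignment assumption (typically MOVUPS). */
static void copy4_unaligned(const float *src, float *dst) {
    _mm_storeu_ps(dst, _mm_loadu_ps(src));
}

/* Replace the upper two fields of v with two floats read from memory
   (the load form of MOVHPS); the lower two fields of v are kept. */
static __m128 load_high_pair(__m128 v, const float *pair) {
    return _mm_loadh_pi(v, (const __m64 *)pair);
}
```

    Calling `copy4_aligned` on a misaligned pointer would fault on real hardware if the assert were compiled out, which is the practical difference between the aligned and unaligned forms.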

  • SIMD on x64/x86

    SSE Conversion Instructions: Converting Between Floats, Integers, MMX, and XMM Registers

    SSE introduced 128-bit XMM registers and a new set of SIMD instructions for single-precision floating-point arithmetic. Alongside arithmetic, comparison, shuffle, and logical operations, SSE also added several important conversion instructions. These conversion instructions move data between two worlds: single-precision floating-point values in XMM registers and integers in MMX or general-purpose registers. The original SSE conversion instructions are CVTPI2PS, CVTPS2PI, CVTTPS2PI, CVTSI2SS, CVTSS2SI, and CVTTSS2SI. They are easy to overlook, but…
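    To make the scalar conversions concrete, the following C sketch wraps three standard SSE intrinsics that usually compile to CVTSS2SI, CVTTSS2SI, and CVTSI2SS. The wrapper names are hypothetical, and the rounding comments assume the default MXCSR rounding mode (round to nearest, ties to even) has not been changed.

```c
#include <xmmintrin.h> /* SSE intrinsics */

/* Float to int using the current MXCSR rounding mode (CVTSS2SI).
   Assumes the default mode: round to nearest, ties to even. */
static int float_to_int_rounded(float x) {
    return _mm_cvtss_si32(_mm_set_ss(x));
}

/* Float to int with truncation toward zero (CVTTSS2SI),
   independent of the MXCSR rounding mode. */
static int float_to_int_truncated(float x) {
    return _mm_cvttss_si32(_mm_set_ss(x));
}

/* Int to float in the low field of an XMM register (CVTSI2SS). */
static float int_to_float(int n) {
    return _mm_cvtss_f32(_mm_cvtsi32_ss(_mm_setzero_ps(), n));
}
```

    With the default rounding mode, `float_to_int_rounded(2.7f)` gives 3 while `float_to_int_truncated(2.7f)` gives 2, which is the behavior C programmers expect from a cast.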

  • SIMD on x64/x86

    MMX Performance on Intel Pentium 4

    The recent arrival of the Intel Pentium 4 processor has generated the usual flurry of benchmarks and comments, most of them emphasizing that current software does not fully exploit the power of this new architecture. However, until the Pentium 4…

  • SIMD on x64/x86

    SIMD Instruction Latency Map

    Instruction latency is one of the most important details to understand when optimizing SIMD code. A SIMD instruction may look simple at the source-code level, but the number of cycles required before its result can be used depends heavily on the exact instruction, operand type, vector width, instruction encoding, and…

  • SIMD on x64/x86

    Map of SIMD Instruction Sets and CPUs

    The original version of this article was written in 2000, when the practical SIMD landscape on x86 processors was still small enough to fit in a compact table. At that time, the important questions were simple. That map was useful because the market was transitioning from scalar x86 code to…

  • SIMD on x64/x86

    Intel Pentium III

    The Intel P6 core, introduced with the Pentium Pro processor and used in all current Intel processors, features a RISC-like microarchitecture and an out-of-order execution unit, representing a radical shift from previous designs. The P6’s new dynamic execution microarchitecture removes the constraint of linear instruction sequencing between the traditional fetch…