The recent arrival of the Intel Pentium 4 processor has generated the usual flurry of benchmarks and comments, most of them emphasizing that current software does not fully exploit the power of this new architecture (click here for an overview of the SSE2 instruction set).
However, until the Pentium 4 gains a significant share of the market, most applications will not be tuned for it, so it is interesting to analyze how the P4 runs current MMX/SSE code.
The following table summarizes the latencies and throughputs of MMX/SSE instructions on the P4:
Instruction | Latency | Throughput | Execution Unit |
MOVD mm,r32 | 2 | 1 | MMX_ALU |
MOVD r32,mm | 5 | 1 | FP_MISC |
MOVQ mm,mm | 6 | 1 | FP_MOV |
PACKSSWB / PACKSSDW / PACKUSWB mm,mm | 2 | 1 | MMX_SHFT |
PADDB / PADDW / PADDD mm,mm | 2 | 1 | MMX_ALU |
PADDSB / PADDSW / PADDUSB / PADDUSW mm,mm | 2 | 1 | MMX_ALU |
PAND / PANDN / POR / PXOR mm,mm | 2 | 1 | MMX_ALU |
PCMPEQB / PCMPEQW / PCMPEQD mm,mm | 2 | 1 | MMX_ALU |
PCMPGTB / PCMPGTW / PCMPGTD mm,mm | 2 | 1 | MMX_ALU |
PMADDWD mm,mm | 8 | 1 | FP_MUL |
PMULHW / PMULLW / PMULHUW mm,mm | 8 | 1 | FP_MUL |
PSLLW / PSLLD / PSLLQ mm,mm/imm8 | 2 | 1 | MMX_SHFT |
PSRAW / PSRAD mm,mm/imm8 | 2 | 1 | MMX_SHFT |
PSUBB / PSUBW / PSUBD mm,mm | 2 | 1 | MMX_ALU |
PSUBSB / PSUBSW / PSUBUSB / PSUBUSW mm,mm | 2 | 1 | MMX_ALU |
PUNPCKHBW / PUNPCKHWD / PUNPCKHDQ mm,mm | 2 | 1 | MMX_SHFT |
PUNPCKLBW / PUNPCKLWD / PUNPCKLDQ mm,mm | 2 | 1 | MMX_SHFT |
EMMS | 12 | 12 | |
PAVGB / PAVGW mm,mm | 2 | 1 | MMX_ALU |
PEXTRW r32,mm,imm8 | 7 | 2 | MMX_SHFT,FP_MISC |
PINSRW mm,r32,imm8 | 4 | 1 | MMX_SHFT,MMX_MISC |
PMAX / PMIN mm,mm | 2 | 1 | MMX_ALU |
PMOVMSKB r32,mm | 7 | 2 | FP_MISC |
PSADBW mm,mm | 4 | 1 | MMX_ALU |
PSHUFW mm,mm,imm8 | 2 | 1 | MMX_SHFT |
Assuming that P4s running at 2 GHz and more arrive fairly soon, I would not worry about the doubling in latency of most MMX instructions. But the latency of the multiply instructions (PMADDWD / PMULHW / PMULLW) jumped from 3 cycles on the P6 core to 8 cycles on the Pentium 4! This will affect the convolution kernels that are widely used, for example, in audio applications. Another troublesome latency is MOVQ's 6 cycles versus only 1 cycle on the P6 core, given that it is widely used to move memory blocks and copy results.
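The cost of the longer multiply latency in a dependent chain, such as a running sum of products, can be estimated with a toy model. The Python sketch below is not a cycle-accurate simulator. It assumes a single issue port that accepts one instruction per cycle (the throughput of 1 listed in the table) and ignores loads, decoding, and all other instructions. The name `simulate_cycles` and its parameters are made up for this illustration.

```python
def simulate_cycles(ops_per_chain, n_chains, latency, issue_interval=1):
    """Estimate cycles needed to retire n_chains independent dependency
    chains of ops_per_chain instructions each, on one issue port.

    Assumed model: an instruction may issue only when the previous
    instruction of its own chain has completed (issue time + latency),
    and the port accepts one instruction every issue_interval cycles.
    """
    ready = [0] * n_chains            # cycle at which each chain may issue next
    remaining = [ops_per_chain] * n_chains
    port_free = 0                     # cycle at which the port is free again
    finish = 0
    while any(remaining):
        # Greedy choice: the pending chain that becomes ready first.
        chain = min((i for i in range(n_chains) if remaining[i]),
                    key=lambda i: ready[i])
        issue = max(port_free, ready[chain])
        done = issue + latency
        ready[chain] = done
        remaining[chain] -= 1
        port_free = issue + issue_interval
        finish = max(finish, done)
    return finish


if __name__ == "__main__":
    # 16 multiplies feeding one accumulator (fully dependent chain).
    print("P6 core, 1 chain :", simulate_cycles(16, 1, latency=3))
    print("P4,      1 chain :", simulate_cycles(16, 1, latency=8))
    # Same 16 multiplies split across independent accumulators.
    print("P4,      4 chains:", simulate_cycles(4, 4, latency=8))
    print("P4,      8 chains:", simulate_cycles(2, 8, latency=8))
```

Under these assumptions, 16 fully dependent multiplies take 128 cycles at a latency of 8, against 48 cycles at a latency of 3. Splitting the same work over 8 independent chains brings the P4 figure down to 23 cycles. This is why code written around a 3-cycle multiply may need more independent accumulators to run well on the P4. The eight MMX registers limit how far this can be pushed in practice.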
But the troubles do not stop here. The image above outlines how instructions are dispatched to specific ports in the P4 execution engine. All MMX instructions are queued in Port 1! This is a major drawback compared to the P6 core, in which most MMX instructions could be issued to Port 0 or Port 1.
Intel's P4 Optimization Guide also reveals that: "Floating-point, MMX technology, Streaming SIMD Extensions and Streaming SIMD Extension 2 instructions with load operations require 6 more clocks in latency than the register-only version of the instructions", i.e. twice the clocks required by the P6 core.
Summing up, the P4 can issue only one MMX instruction per cycle, and the latency is at best twice that of the older Pentium III processor. In pathological conditions, this adds up to bring the P4's SIMD performance down to about one third of the P-III's. Until the P4 ramps up into the 2+ GHz frequency range, its integer SIMD execution speed will simply lag behind the venerable P6 core.
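The clock-speed argument comes down to simple arithmetic, sketched below in Python. The sketch assumes that run time is just cycle count divided by clock frequency, and it uses a 1 GHz P-III as an illustrative reference point. The function name `break_even_clock_ghz` and the chosen ratios are hypothetical, not measurements.

```python
def break_even_clock_ghz(p3_clock_ghz, cycle_ratio):
    """Clock the P4 would need to match a P-III at p3_clock_ghz, if the
    P4 spends cycle_ratio times as many cycles on the same code.

    Assumes run time = cycles / clock frequency and nothing else.
    """
    return p3_clock_ghz * cycle_ratio


if __name__ == "__main__":
    scenarios = [
        ("most MMX latencies doubled", 2.0),
        ("multiply latency, 8 vs 3 cycles", 8 / 3),
        ("pathological case, one third the speed", 3.0),
    ]
    for label, ratio in scenarios:
        clock = break_even_clock_ghz(1.0, ratio)
        print(f"{label}: {clock:.2f} GHz")
```

With these inputs, purely latency-bound code would need a P4 at about 2 GHz to match a 1 GHz P-III when latencies double, and closer to 3 GHz in the pathological case.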