The recent arrival of the Intel Pentium 4 processor has generated the usual flurry of benchmarks and comments, most of them emphasizing that current software does not fully exploit the power of this new architecture (see the overview of the SSE2 instruction set).
However, until the Pentium 4 gains a significant share of the market, most applications will not be tuned for it, so it is interesting to analyze how the P4 runs current MMX/SSE code.
The following table summarizes the latencies and throughputs of MMX/SSE instructions on the P4:
Instruction | Latency | Throughput | Execution Unit |
MOVD mm,r32 | 2 | 1 | MMX_ALU |
MOVD r32,mm | 5 | 1 | FP_MISC |
MOVQ mm,mm | 6 | 1 | FP_MOV |
PACKSSWB / PACKSSDW / PACKUSWB mm,mm | 2 | 1 | MMX_SHFT |
PADDB / PADDW / PADDD mm,mm | 2 | 1 | MMX_ALU |
PADDSB / PADDSW / PADDUSB / PADDUSW mm,mm | 2 | 1 | MMX_ALU |
PAND / PANDN / POR / PXOR mm,mm | 2 | 1 | MMX_ALU |
PCMPEQB / PCMPEQW / PCMPEQD mm,mm | 2 | 1 | MMX_ALU |
PCMPGTB / PCMPGTW / PCMPGTD mm,mm | 2 | 1 | MMX_ALU |
PMADDWD mm,mm | 8 | 1 | FP_MUL |
PMULHW / PMULLW / PMULHUW mm,mm | 8 | 1 | FP_MUL |
PSLLW / PSLLD / PSLLQ mm,mm/imm8 | 2 | 1 | MMX_SHFT |
PSRAW / PSRAD mm,mm/imm8 | 2 | 1 | MMX_SHFT |
PSUBB / PSUBW / PSUBD mm,mm | 2 | 1 | MMX_ALU |
PSUBSB / PSUBSW / PSUBUSB / PSUBUSW mm,mm | 2 | 1 | MMX_ALU |
PUNPCKHBW / PUNPCKHWD / PUNPCKHDQ mm,mm | 2 | 1 | MMX_SHFT |
PUNPCKLBW / PUNPCKLWD / PUNPCKLDQ mm,mm | 2 | 1 | MMX_SHFT |
EMMS | 12 | 12 | |
PAVGB / PAVGW mm,mm | 2 | 1 | MMX_ALU |
PEXTRW r32,mm,imm8 | 7 | 2 | MMX_SHFT,FP_MISC |
PINSRW mm,r32,imm8 | 4 | 1 | MMX_SHFT,MMX_MISC |
PMAX / PMIN mm,mm | 2 | 1 | MMX_ALU |
PMOVMSKB r32,mm | 7 | 2 | FP_MISC |
PSADBW mm,mm | 4 | 1 | MMX_ALU |
PSHUFW mm,mm,imm8 | 2 | 1 | MMX_SHFT |
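
To make the two numeric columns concrete, here is a rough Python sketch that estimates the cycle count of a short instruction sequence from the table. It is only a back-of-the-envelope model, not a simulator: the function names and the `P4_TIMINGS` dictionary are hypothetical, the timings are copied from a few rows above, and it assumes that all instructions issue in order through a single port and that throughput is the number of cycles before that port can accept the next instruction. Memory accesses, the trace cache, and the split between execution units are ignored.

```python
# (latency, throughput) pairs in clock cycles, copied from the table above.
P4_TIMINGS = {
    "MOVQ": (6, 1),
    "PADDW": (2, 1),
    "PMULLW": (8, 1),
    "PMADDWD": (8, 1),
    "PSADBW": (4, 1),
    "PMOVMSKB": (7, 2),
}


def dependent_chain_cycles(instructions):
    """Each instruction consumes the previous result, so latencies add up."""
    return sum(P4_TIMINGS[name][0] for name in instructions)


def independent_stream_cycles(instructions):
    """No instruction depends on another.

    Assumption: all instructions issue in order through one port, each
    blocking it for `throughput` cycles; a result is ready `latency`
    cycles after its instruction issues.
    """
    issue_cycle = 0
    finish_cycle = 0
    for name in instructions:
        latency, throughput = P4_TIMINGS[name]
        finish_cycle = max(finish_cycle, issue_cycle + latency)
        issue_cycle += throughput
    return finish_cycle


if __name__ == "__main__":
    multiplies = ["PMULLW"] * 4
    print(dependent_chain_cycles(multiplies))     # 32
    print(independent_stream_cycles(multiplies))  # 11
```

Under these assumptions, four multiplies that each depend on the previous one cost about 32 cycles, while four independent multiplies finish in about 11. With latencies this long, code that interleaves independent operations should hide most of the cost, whereas tight dependency chains pay the full latency at every step.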
Latency | the number of clock cycles that are required to complete the execution of all of the
|