SIMD

MMX Performance on Intel Pentium 4

The recent arrival of the Intel Pentium 4 processor has generated the usual flurry of benchmarks and comments, most of them emphasizing that current software does not fully exploit the power of this new architecture (see the overview of the SSE2 instruction set).
However, until the Pentium 4 gains a significant share of the market, most applications will not be tuned for it, so it is interesting to analyze how the P4 runs current MMX/SSE code.
The following table summarizes the latencies and throughputs, in clock cycles, of MMX/SSE instructions on the P4:
 

 

Instruction                                Latency  Throughput  Execution Unit
MOVD mm,r32                                2        1           MMX_ALU
MOVD r32,mm                                5        1           FP_MISC
MOVQ mm,mm                                 6        1           FP_MOV
PACKSSWB / PACKSSDW / PACKUSWB mm,mm       2        1           MMX_SHFT
PADDB / PADDW / PADDD mm,mm                2        1           MMX_ALU
PADDSB / PADDSW / PADDUSB / PADDUSW mm,mm  2        1           MMX_ALU
PAND / PANDN / POR / PXOR mm,mm            2        1           MMX_ALU
PCMPEQB / PCMPEQW / PCMPEQD mm,mm          2        1           MMX_ALU
PCMPGTB / PCMPGTW / PCMPGTD mm,mm          2        1           MMX_ALU
PMADDWD mm,mm                              8        1           FP_MUL
PMULHW / PMULLW / PMULHUW mm,mm            8        1           FP_MUL
PSLLW / PSLLD / PSLLQ mm,mm/imm8           2        1           MMX_SHFT
PSRAW / PSRAD mm,mm/imm8                   2        1           MMX_SHFT
PSUBB / PSUBW / PSUBD mm,mm                2        1           MMX_ALU
PSUBSB / PSUBSW / PSUBUSB / PSUBUSW mm,mm  2        1           MMX_ALU
PUNPCKHBW / PUNPCKHWD / PUNPCKHDQ mm,mm    2        1           MMX_SHFT
PUNPCKLBW / PUNPCKLWD / PUNPCKLDQ mm,mm    2        1           MMX_SHFT
EMMS                                       12       12
PAVGB / PAVGW mm,mm                        2        1           MMX_ALU
PEXTRW r32,mm,imm8                         7        2           MMX_SHFT,FP_MISC
PINSRW mm,r32,imm8                         4        1           MMX_SHFT,MMX_MISC
PMAX / PMIN mm,mm                          2        1           MMX_ALU
PMOVMSKB r32,mm                            7        2           FP_MISC
PSADBW mm,mm                               4        1           MMX_ALU
PSHUFW mm,mm,imm8                          2        1           MMX_SHFT
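
To get a feel for how the two numeric columns interact, here is a rough Python sketch that estimates cycle counts for a run of identical instructions. It is a deliberately simplified model, not a description of how the P4 actually schedules code: it assumes a single execution unit, ignores decoding, the trace cache and memory accesses, and takes throughput to mean the number of cycles before the same instruction can be issued again. The names InstrCost, dependent_chain_cycles and independent_cycles are hypothetical, introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstrCost:
    """Latency/throughput pair for one instruction, in clock cycles."""
    name: str
    latency: int      # cycles until the result is available
    throughput: int   # cycles before the same instruction can issue again


# Values copied from the table above.
PADDW = InstrCost("PADDW mm,mm", latency=2, throughput=1)
PMADDWD = InstrCost("PMADDWD mm,mm", latency=8, throughput=1)


def dependent_chain_cycles(op, count):
    """Each instruction consumes the previous result, so latencies add up."""
    return op.latency * count


def independent_cycles(op, count):
    """Independent instructions issue every `throughput` cycles;
    only the latency of the last one is fully exposed."""
    if count <= 0:
        return 0
    return (count - 1) * op.throughput + op.latency


if __name__ == "__main__":
    for op in (PADDW, PMADDWD):
        print(op.name,
              "dependent:", dependent_chain_cycles(op, 4),
              "independent:", independent_cycles(op, 4))
```

Under these assumptions, four dependent PMADDWD instructions cost about 32 cycles, while four independent ones cost about 11. This suggests why code with long multiply chains is likely to suffer most on the P4 unless independent work can be interleaved.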

 

 

Latency: the number of clock cycles that are required to complete the execution of all of the micro-ops (µops) that form an instruction.
