Stefano Tommesani

MMX Performance on Intel Pentium 4

The recent arrival of the Intel Pentium 4 processor has generated the usual flurry of benchmarks and comments, most of them emphasizing that current software does not fully exploit the power of this new architecture (see the overview of the SSE2 instruction set).
However, until the Pentium 4 gains a significant share of the market, most applications will not be tuned for it, so it is interesting to analyze how the P4 runs current MMX/SSE code.
The following table summarizes the latencies and throughputs of MMX/SSE instructions on the P4:
 

 
| Instruction | Latency (cycles) | Throughput (cycles) | Execution Unit |
|---|---|---|---|
| MOVD mm,r32 | 2 | 1 | MMX_ALU |
| MOVD r32,mm | 5 | 1 | FP_MISC |
| MOVQ mm,mm | 6 | 1 | FP_MOV |
| PACKSSWB / PACKSSDW / PACKUSWB mm,mm | 2 | 1 | MMX_SHFT |
| PADDB / PADDW / PADDD mm,mm | 2 | 1 | MMX_ALU |
| PADDSB / PADDSW / PADDUSB / PADDUSW mm,mm | 2 | 1 | MMX_ALU |
| PAND / PANDN / POR / PXOR mm,mm | 2 | 1 | MMX_ALU |
| PCMPEQB / PCMPEQW / PCMPEQD mm,mm | 2 | 1 | MMX_ALU |
| PCMPGTB / PCMPGTW / PCMPGTD mm,mm | 2 | 1 | MMX_ALU |
| PMADDWD mm,mm | 8 | 1 | FP_MUL |
| PMULHW / PMULLW / PMULHUW mm,mm | 8 | 1 | FP_MUL |
| PSLLW / PSLLD / PSLLQ mm,mm/imm8 | 2 | 1 | MMX_SHFT |
| PSRAW / PSRAD mm,mm/imm8 | 2 | 1 | MMX_SHFT |
| PSUBB / PSUBW / PSUBD mm,mm | 2 | 1 | MMX_ALU |
| PSUBSB / PSUBSW / PSUBUSB / PSUBUSW mm,mm | 2 | 1 | MMX_ALU |
| PUNPCKHBW / PUNPCKHWD / PUNPCKHDQ mm,mm | 2 | 1 | MMX_SHFT |
| PUNPCKLBW / PUNPCKLWD / PUNPCKLDQ mm,mm | 2 | 1 | MMX_SHFT |
| EMMS | 12 | 12 | |
| PAVGB / PAVGW mm,mm | 2 | 1 | MMX_ALU |
| PEXTRW r32,mm,imm8 | 7 | 2 | MMX_SHFT, FP_MISC |
| PINSRW mm,r32,imm8 | 4 | 1 | MMX_SHFT, MMX_MISC |
| PMAX / PMIN mm,mm | 2 | 1 | MMX_ALU |
| PMOVMSKB r32,mm | 7 | 2 | FP_MISC |
| PSADBW mm,mm | 4 | 1 | MMX_ALU |
| PSHUFW mm,mm,imm8 | 2 | 1 | MMX_SHFT |


 

 
Latency: the number of clock cycles that are required to complete the execution of all of the µops that form an instruction.
Throughput: the number of clock cycles required to wait before the issue ports are free to accept the same instruction again.
Execution Unit: the names of the execution units in the execution core that are utilized to execute the µops for each instruction.
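
To make the difference between the two figures concrete, here is a rough C sketch of how they combine for a run of identical instructions. The function names and the model itself are illustrative assumptions (no cache misses, no competition from other instructions for the same port), not a description of the real P4 scheduler.

```c
#include <assert.h>

/* Simplified cycle estimates for n copies of one instruction.
   Assumed model: nothing else competes for the port, no memory stalls. */

/* Dependent chain: each instruction consumes the previous result,
   so every one waits for the full latency of its predecessor. */
unsigned long chain_cycles(unsigned long n, unsigned latency)
{
    return n * (unsigned long)latency;
}

/* Independent instructions: a new one can issue every `throughput`
   cycles, and the last one still needs `latency` cycles to finish. */
unsigned long independent_cycles(unsigned long n, unsigned latency,
                                 unsigned throughput)
{
    assert(latency >= 1 && throughput >= 1);
    if (n == 0)
        return 0;
    return (n - 1) * (unsigned long)throughput + latency;
}
```

With the PMADDWD figures from the table (latency 8, throughput 1), this model gives 800 cycles for 100 dependent instructions but only 107 cycles for 100 independent ones, which is why latency matters most in code with long dependency chains.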

 

Assuming that we should have P4s running at 2 GHz and more pretty soon, I would not worry about the doubling in latency of most MMX instructions. But the latency of the multiply instructions (PMADDWD / PMULHW / PMULLW) jumped from 3 cycles in the P6 core to 8 cycles in the Pentium 4! This will affect all the convolution kernels that are widely used, for example, in audio applications. Another troublesome latency is MOVQ's 6 cycles versus only 1 cycle on the P6 core, given that it is widely used to move memory blocks and copy results.
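
For readers unfamiliar with how PMADDWD appears in such kernels, the following is a minimal scalar C sketch of one FIR output sample. The helpers `pmaddwd_model` and `fir_sample` are hypothetical names introduced here for illustration; real code would use the MMX instruction directly, and how much of the 8-cycle latency is actually exposed depends on how the loop is scheduled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of PMADDWD mm,mm: four signed 16-bit products, with
   adjacent pairs summed into two signed 32-bit lanes. The single
   wrap-around case (all four inputs of a pair equal to -32768) is
   not modelled here. */
static void pmaddwd_model(const int16_t a[4], const int16_t b[4],
                          int32_t out[2])
{
    out[0] = (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1];
    out[1] = (int32_t)a[2] * b[2] + (int32_t)a[3] * b[3];
}

/* One output sample of an FIR filter: dot product of n taps with n
   input samples, where n is a multiple of 4. Successive PMADDWD
   steps are independent of each other; only the PADDD accumulation
   forms a loop-carried chain. */
int32_t fir_sample(const int16_t *x, const int16_t *h, size_t n)
{
    int32_t acc[2] = {0, 0};
    int32_t prod[2];

    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        pmaddwd_model(x + i, h + i, prod);  /* PMADDWD */
        acc[0] += prod[0];                  /* PADDD   */
        acc[1] += prod[1];
    }
    return acc[0] + acc[1];                 /* fold the two lanes */
}
```

Every group of four taps costs one multiply, so a kernel like this issues a PMADDWD on almost every iteration and is directly exposed to the multiplier's timing.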

But the troubles do not stop here. The image above outlines how instructions are dispatched to specific ports in the P4 execution engine. All MMX instructions are queued on Port 1! This is a major drawback compared to the P6 core, in which most MMX instructions could be issued to either Port 0 or Port 1.
Intel's P4 Optimization Guide also reveals that: "Floating-point, MMX technology, Streaming SIMD Extensions and Streaming SIMD Extension 2 instructions with load operations require 6 more clocks in latency than the register-only version of the instructions", i.e. twice the clocks required by the P6 core.
Summing up, the P4 can issue only one MMX instruction per cycle, and its latencies are at best twice those of the older Pentium III processor. In pathological conditions, this adds up to bring the P4's SIMD performance down to about one third of the Pentium III's. Until the P4 ramps up into the 2+ GHz frequency range, its integer SIMD execution speed will simply lag behind the venerable P6 core.
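
The clock-speed argument can be checked with a little arithmetic. The sketch below assumes the worst case of a fully dependent instruction chain, where time per instruction is simply latency divided by clock frequency; it ignores issue width, caches and memory. The function name `break_even_mhz` and the 1000 MHz Pentium III used in the note that follows are illustrative assumptions.

```c
#include <assert.h>

/* Clock (in MHz) at which a P4 matches a P6-core CPU on a chain of
   dependent instructions, assuming time per instruction equals
   latency / clock. Issue width, caches and memory are ignored. */
double break_even_mhz(double p6_mhz, unsigned p6_latency,
                      unsigned p4_latency)
{
    assert(p6_latency > 0);
    return p6_mhz * (double)p4_latency / (double)p6_latency;
}
```

Against a hypothetical 1000 MHz Pentium III, this gives 2000 MHz for a chain of simple MMX operations (1 cycle versus 2), about 2667 MHz for a chain of multiplies (3 cycles versus 8), and 6000 MHz for a chain of register-to-register MOVQs (1 cycle versus 6).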

Tuesday, 25 April 2000
