Stefano Tommesani


MMX Performance on Intel Pentium 4


The recent arrival of the Intel Pentium 4 processor has generated the usual flurry of benchmarks and comments, most of them emphasizing that current software does not fully exploit the power of this new architecture.
However, until the Pentium 4 gains a significant share of the market, most applications will not be tuned for it, so it is interesting to analyze how the P4 runs current MMX/SSE code.
The following table summarizes the latencies and throughputs of MMX/SSE instructions on the P4:
 

 
Instruction                                 Latency  Throughput  Execution Unit
MOVD mm,r32                                 2        1           MMX_ALU
MOVD r32,mm                                 5        1           FP_MISC
MOVQ mm,mm                                  6        1           FP_MOV
PACKSSWB / PACKSSDW / PACKUSWB mm,mm        2        1           MMX_SHFT
PADDB / PADDW / PADDD mm,mm                 2        1           MMX_ALU
PADDSB / PADDSW / PADDUSB / PADDUSW mm,mm   2        1           MMX_ALU
PAND / PANDN / POR / PXOR mm,mm             2        1           MMX_ALU
PCMPEQB / PCMPEQW / PCMPEQD mm,mm           2        1           MMX_ALU
PCMPGTB / PCMPGTW / PCMPGTD mm,mm           2        1           MMX_ALU
PMADDWD mm,mm                               8        1           FP_MUL
PMULHW / PMULLW / PMULHUW mm,mm             8        1           FP_MUL
PSLLW / PSLLD / PSLLQ mm,mm/imm8            2        1           MMX_SHFT
PSRAW / PSRAD mm,mm/imm8                    2        1           MMX_SHFT
PSUBB / PSUBW / PSUBD mm,mm                 2        1           MMX_ALU
PSUBSB / PSUBSW / PSUBUSB / PSUBUSW mm,mm   2        1           MMX_ALU
PUNPCKHBW / PUNPCKHWD / PUNPCKHDQ mm,mm     2        1           MMX_SHFT
PUNPCKLBW / PUNPCKLWD / PUNPCKLDQ mm,mm     2        1           MMX_SHFT
EMMS                                        12       12
PAVGB / PAVGW mm,mm                         2        1           MMX_ALU
PEXTRW r32,mm,imm8                          7        2           MMX_SHFT,FP_MISC
PINSRW mm,r32,imm8                          4        1           MMX_SHFT,MMX_MISC
PMAX / PMIN mm,mm                           2        1           MMX_ALU
PMOVMSKB r32,mm                             7        2           FP_MISC
PSADBW mm,mm                                4        1           MMX_ALU
PSHUFW mm,mm,imm8                           2        1           MMX_SHFT


 

 
Latency: the number of clock cycles required to complete the execution of all of the µops that form an instruction.
Throughput: the number of clock cycles to wait before the issue ports are free to accept the same instruction again.
Execution Unit: the names of the execution units in the execution core that are used to execute the µops for each instruction.
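
As a rough illustration of how the two numbers interact, the sketch below estimates the cycles needed to run n copies of one instruction, first as a dependent chain and then as independent operations. It is a simplified model and not something taken from Intel's documentation: it ignores port contention with other instructions, front-end limits and memory effects, and the function names are hypothetical.

```c
#include <assert.h>

/* Cycles to finish n copies of an instruction when each one consumes the
   result of the previous one: every copy waits for the full latency. */
unsigned long chain_cycles(unsigned long n, unsigned latency)
{
    return n * latency;
}

/* Cycles to finish n independent copies: a new copy can issue every
   `throughput` cycles, and the last one still needs its full latency. */
unsigned long pipelined_cycles(unsigned long n, unsigned latency,
                               unsigned throughput)
{
    assert(throughput > 0);
    if (n == 0)
        return 0;
    return (n - 1) * throughput + latency;
}
```

With the PMADDWD figures from the table (latency 8, throughput 1), 16 dependent multiplies take 128 cycles in this model, while 16 independent ones take 23.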

 

Assuming that P4s running at 2 GHz and above will arrive fairly soon, I would not worry about the doubling in latency of most MMX instructions. But the latency of the multiply instructions (PMADDWD / PMULHW / PMULLW) jumped from 3 cycles on the P6 core to 8 cycles on the Pentium 4! This will affect the convolution kernels that are widely used, for example, in audio applications. Another troublesome latency is MOVQ's 6 cycles, versus only 1 cycle on the P6 core, given that the instruction is widely used to move memory blocks and copy results.
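
One common way to tolerate a longer multiply latency in such kernels is to keep several independent multiply-accumulate chains in flight instead of funnelling every product into a single accumulator. The sketch below shows that structure in portable C rather than real MMX code; the function name, the unroll factor of 4 and the assumption that the length is a multiple of 4 are illustrative choices, and whether this recovers the lost cycles on an actual P4 is not something the article measures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Dot product of two signed 16-bit arrays, the inner loop of a FIR or
   convolution kernel. Portable C stand-in for an MMX loop: each accN
   plays the role of a separate accumulator register, so the four
   multiply-add chains do not depend on one another.
   Assumes n is a multiple of 4 and that the sums fit in 32 bits. */
int32_t dot_s16_unrolled(const int16_t *x, const int16_t *h, size_t n)
{
    int32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    size_t i;

    assert(n % 4 == 0);
    for (i = 0; i < n; i += 4) {
        acc0 += (int32_t)x[i]     * h[i];
        acc1 += (int32_t)x[i + 1] * h[i + 1];
        acc2 += (int32_t)x[i + 2] * h[i + 2];
        acc3 += (int32_t)x[i + 3] * h[i + 3];
    }
    /* Combine the partial sums once, after the loop. */
    return acc0 + acc1 + acc2 + acc3;
}
```

In an MMX version each accumulator would be a separate mm register updated by PMADDWD followed by PADDD, with the final reduction done once outside the loop.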

But the troubles do not stop here. The image above outlines how instructions are dispatched to specific ports in the P4 execution engine. All MMX instructions are queued on Port 1! This is a major drawback compared to the P6 core, in which most MMX instructions could be issued to Port 0 or Port 1.
Intel's P4 Optimization Guide also reveals that: "Floating-point, MMX technology, Streaming SIMD Extensions and Streaming SIMD Extension 2 instructions with load operations require 6 more clocks in latency than the register-only version of the instructions", i.e. twice the clocks required by the P6 core.
Summing up, the P4 can issue only one MMX instruction per cycle, and the latency is at best twice that of the older Pentium III processor. In pathological conditions, this adds up to bring the P4's SIMD performance down to about one third of the Pentium III's. Until the P4 ramps up into the 2+ GHz frequency range, its integer SIMD execution speed will simply lag behind the venerable P6 core.
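
The frequency argument can be made concrete with a back-of-the-envelope sketch: if the same MMX loop needs k times as many cycles on the P4, the P4 needs k times the clock to finish in the same time. The helper name and the 1000 MHz Pentium III used in the note below are illustrative assumptions, and the cycle ratios of 2 and 3 come from the summary above rather than from measurements.

```c
#include <assert.h>

/* Clock (in MHz) at which a core that needs `cycle_ratio` times as many
   cycles for the same work matches a reference core running at `ref_mhz`.
   Hypothetical helper; assumes run time = cycles / frequency and nothing
   else (no memory or front-end effects). */
unsigned breakeven_mhz(unsigned ref_mhz, unsigned cycle_ratio)
{
    assert(cycle_ratio > 0);
    return ref_mhz * cycle_ratio;
}
```

Against a hypothetical 1000 MHz Pentium III, a cycle ratio of 2 gives a break-even clock of 2000 MHz, and the pathological ratio of 3 gives 3000 MHz.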

Tuesday, 25 April 2000

Powered by QuoteThis © 2008
 
View Stefano Tommesani's profile on LinkedIn
