Stefano Tommesani


SSE Cacheability Control


Data referenced by a program can have temporal (data will be used again) or spatial (data will be in adjacent locations, such as the same cache line) locality, but some multimedia data types are referenced once and not reused in the immediate future (called non-temporal data). Thus, non-temporal data should not overwrite the application's cached code and data: the cacheability control instructions enable the programmer to control caching so that non-temporal accesses will minimize cache pollution.
In addition, the execution engine needs to be fed so that it does not stall waiting for data. SSE allows the programmer to prefetch data long before its final use to minimize memory latency. Prior to SSE, read-miss latency, execution, and the subsequent store-miss latency added up serially to give the total execution time. SSE lets read-miss latency overlap execution through prefetching, and it allows store-miss latency to be reduced and overlapped with execution through streaming stores.

Cacheability Control

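The first three instructions below (MASKMOVQ, MOVNTQ, and MOVNTPS) provide programmatic control for minimizing cache pollution when writing data to memory from either MMX or SSE registers. PREFETCH and SFENCE complement them by controlling how data is brought into the cache and how weakly-ordered stores become visible.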
MASKMOVQ stores data from an MMX register to the location specified by the EDI register. The most significant bit in each byte of the second MMX mask register is used to selectively write the data of the first register on a per-byte basis. This instruction does not write-allocate (i.e., the processor will not fetch the corresponding cache line into the cache hierarchy, prior to performing the store), and so minimizes cache pollution.
MOVNTQ stores data from an MMX register to memory; this instruction is implicitly weakly-ordered, does not write-allocate, and minimizes cache pollution.
MOVNTPS stores data from a SIMD floating-point register to memory. The memory address must be aligned to a 16-byte boundary; if it is not aligned, a general protection exception will occur. The instruction is implicitly weakly ordered, does not write-allocate, and minimizes cache pollution.
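As an illustration of the streaming stores described above, here is a minimal sketch in C, assuming an x86 compiler that provides the SSE intrinsics header `<xmmintrin.h>`. The helper name `stream_fill` is hypothetical; `_mm_stream_ps` is the intrinsic that maps to MOVNTPS, and the assertions encode the 16-byte alignment requirement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_set1_ps, _mm_stream_ps, _mm_sfence */

/* Hypothetical helper: fill `count` floats at `dst` with `value` using
   non-temporal stores, so the written data does not displace cached data.
   Assumes `dst` is 16-byte aligned and `count` is a multiple of 4. */
static void stream_fill(float *dst, size_t count, float value)
{
    assert(((uintptr_t)dst % 16) == 0);  /* MOVNTPS faults on unaligned addresses */
    assert(count % 4 == 0);              /* one store writes four floats */

    __m128 v = _mm_set1_ps(value);       /* broadcast value to all four lanes */
    for (size_t i = 0; i < count; i += 4) {
        _mm_stream_ps(dst + i, v);       /* non-temporal store (MOVNTPS) */
    }
    _mm_sfence();  /* order the weakly-ordered stores before later stores */
}
```

Streaming stores tend to pay off only when the destination will not be read again soon; for buffers that are reused immediately, ordinary stores are usually the better choice.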
PREFETCH, in its PREFETCHNTA, PREFETCHT0, PREFETCHT1, and PREFETCHT2 forms, loads either non-temporal data or temporal data into the specified cache level. As this instruction merely provides a hint to the hardware, it will not generate exceptions or faults.
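The following sketch shows one way prefetching might be used from C, assuming an x86 compiler with `<xmmintrin.h>`. The function name `sum_with_prefetch` and the constants `FLOATS_PER_LINE` and `PREFETCH_AHEAD` are illustrative assumptions (a 64-byte cache line and a distance of four lines), not values taken from the text; the right distance depends on the machine and the loop body.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch and the _MM_HINT_* constants */

#define FLOATS_PER_LINE 16  /* assumed 64-byte cache line / 4-byte float */
#define PREFETCH_AHEAD  64  /* assumed prefetch distance, in floats */

/* Hypothetical helper: sum an array that is read once, hinting the
   hardware to fetch upcoming cache lines as non-temporal data. */
static float sum_with_prefetch(const float *data, size_t count)
{
    assert(data != NULL || count == 0);

    float sum = 0.0f;
    for (size_t i = 0; i < count; i++) {
        /* Issue one hint per cache line, staying inside the array. */
        if (i % FLOATS_PER_LINE == 0 && i + PREFETCH_AHEAD < count) {
            _mm_prefetch((const char *)(data + i + PREFETCH_AHEAD),
                         _MM_HINT_NTA);  /* PREFETCHNTA */
        }
        sum += data[i];
    }
    return sum;
}
```

Replacing `_MM_HINT_NTA` with `_MM_HINT_T0`, `_MM_HINT_T1`, or `_MM_HINT_T2` selects the temporal variants, which suit data that will be reused.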
SFENCE guarantees that every store instruction that precedes the store fence instruction in program order is globally visible before any store instruction that follows the fence. The SFENCE instruction provides an efficient way of ensuring ordering between routines that produce weakly-ordered results and routines that consume this data. The use of weakly-ordered memory types can be important under certain data-sharing relationships, such as a producer-consumer relationship. The use of weakly-ordered memory can make the assembling of data more efficient, but care must be taken to ensure that the consumer obtains the data that the producer intended it to see.
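The producer side of such a relationship can be sketched as follows, assuming an x86 compiler with `<xmmintrin.h>` and GCC/Clang-style alignment attributes. The `mailbox_t` type and `publish` function are hypothetical names introduced for illustration. The `volatile` flag only stands in for a real hand-off mechanism; a multithreaded consumer would need proper synchronization, such as C11 atomics.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_loadu_ps, _mm_stream_ps, _mm_sfence */

/* Hypothetical mailbox: a 16-byte-aligned payload plus a ready flag. */
typedef struct {
    float payload[4] __attribute__((aligned(16)));
    volatile int ready;  /* illustration only, not a full synchronization scheme */
} mailbox_t;

/* Hypothetical producer: write the payload with a weakly-ordered
   streaming store, fence, and only then raise the flag. */
static void publish(mailbox_t *box, const float src[4])
{
    assert(((uintptr_t)box->payload % 16) == 0);

    __m128 v = _mm_loadu_ps(src);     /* unaligned load of four floats */
    _mm_stream_ps(box->payload, v);   /* weakly-ordered store (MOVNTPS) */
    _mm_sfence();                     /* payload visible before the flag store */
    box->ready = 1;
}
```

Without the fence, the flag store could become visible before the streamed payload, so a consumer that polls `ready` might read stale data.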

