Stefano Tommesani

Slow blits with latest nVidia drivers

An application of mine uses DirectDraw to draw video frames on multiple screens. So far, its visualization pipeline has used a group of off-screen YUV surfaces that, at the end of the process, are drawn into the primary surface. This is the code that creates the primary surface:

// Describe the primary surface: only the caps field is needed.
DDSURFACEDESC ddsd;
ZeroMemory(&ddsd, sizeof(ddsd));
ddsd.dwSize = sizeof(DDSURFACEDESC);
ddsd.dwFlags = DDSD_CAPS;
ddsd.ddsCaps.dwCaps = DDSCAPS_PRIMARYSURFACE;
if ((ddrval = lpDD->CreateSurface(&ddsd, &lpDDSPrimary, NULL)) != DD_OK)
{
    return VISLIB_ERROR;
}

and this is the code that creates an off-screen YUV surface:

// Describe an off-screen surface with an explicit size and pixel format.
ddsd.dwFlags = DDSD_CAPS | DDSD_HEIGHT | DDSD_WIDTH | DDSD_PIXELFORMAT;
ddsd.ddsCaps.dwCaps = DDSCAPS_OFFSCREENPLAIN;
ddsd.dwWidth = SurfaceWidth;
ddsd.dwHeight = SurfaceHeight;
// Set up the pixel format for packed 16-bit UYVY.
ddsd.ddpfPixelFormat.dwSize = sizeof(DDPIXELFORMAT);
ddsd.ddpfPixelFormat.dwFlags = DDPF_FOURCC;
ddsd.ddpfPixelFormat.dwFourCC = MAKEFOURCC('U','Y','V','Y');
ddsd.ddpfPixelFormat.dwYUVBitCount = 16;
if ((ddrval = lpDD->CreateSurface(&ddsd, &lpDDSOff, NULL)) != DD_OK)
{
    return VISLIB_ERROR;
}

At the end of the pipeline, the processed image is in the lpDDSOff surface, and gets copied to the primary surface lpDDSPrimary with a simple blit:

ddrval = lpDDSPrimary->Blt(&DestRect, lpDDSOff, &SourceRect, DDBLT_WAIT, NULL);

This last blit does not even rescale the image, as the image in the off-screen surface already has the dimensions of DestRect in the primary surface. It only copies data from the off-screen surface to the primary surface and converts it from YUV to the color space of the primary surface (these days, RGB32 is a safe bet).

This visualization pipeline ran fine for years, up to the nVidia 28x driver series. Starting with the 29x driver series, and including the latest 30x versions, performance dropped dramatically: blits became so slow that they dragged down the whole system. So I started benchmarking the various steps of the visualization pipeline, and it turned out that the last step, that humble blit you see above, was about 100 times slower with the newer drivers than with the older ones! Performance was even poorer when blitting to a secondary monitor, where drawing a single image to the screen took many milliseconds. Even worse, GPU usage was very high even when drawing only a few video streams at the same time, so the GPU was clearly the bottleneck, and it should not be, as there are no complex operations going on.
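For reference, a step of the pipeline can be timed with a small helper like the one below. This is only a minimal sketch, not the benchmarking code actually used in the application: the name AverageMilliseconds is hypothetical, and it measures wall-clock time around the call, so it may miss work that the driver queues and completes later.

```cpp
#include <cassert>
#include <chrono>

// Run `step` `iterations` times and return the average wall-clock
// duration of one call, in milliseconds.
// Hypothetical helper, not part of DirectDraw or of the original code.
template <typename Fn>
double AverageMilliseconds(Fn&& step, int iterations)
{
    assert(iterations > 0);
    using Clock = std::chrono::steady_clock;  // monotonic, safe for intervals
    const auto start = Clock::now();
    for (int i = 0; i < iterations; ++i)
    {
        step();
    }
    const std::chrono::duration<double, std::milli> elapsed = Clock::now() - start;
    return elapsed.count() / iterations;
}
```

To time the final blit, the Blt call shown above can be wrapped in a lambda and passed as `step`, then the averages obtained with the two driver series can be compared.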

The solution is switching the off-screen surfaces from YUV to RGB32, so the declaration of the surfaces becomes:

ddsd.dwFlags = DDSD_CAPS | DDSD_HEIGHT | DDSD_WIDTH | DDSD_PIXELFORMAT;
ddsd.ddsCaps.dwCaps = DDSCAPS_OFFSCREENPLAIN;
ddsd.dwWidth = SurfaceWidth;
ddsd.dwHeight = SurfaceHeight;
// Set up the pixel format for 32-bit RGB (8 bits per channel, top byte unused).
ddsd.ddpfPixelFormat.dwSize = sizeof(DDPIXELFORMAT);
ddsd.ddpfPixelFormat.dwFlags = DDPF_RGB;
ddsd.ddpfPixelFormat.dwRGBBitCount = 32;
ddsd.ddpfPixelFormat.dwRBitMask = 0x00FF0000;
ddsd.ddpfPixelFormat.dwGBitMask = 0x0000FF00;
ddsd.ddpfPixelFormat.dwBBitMask = 0x000000FF;

The blit to the primary surface becomes lightning-quick, with minimal GPU usage. The downsides of this solution are that

  • copying an RGB32 image to video memory instead of a YUV image takes twice the bandwidth (32 bits per pixel instead of the 16 bits of UYVY)
  • the CPU must perform a YUV -> RGB32 conversion while copying data to video memory

After benchmarking, both downsides turn out to be quite minor, thanks to the speed of system-memory to video-memory copies and to the highly optimized YUV->RGB32 color-space conversions available in the Intel IPP library. Summing up, the performance of the RGB32 pipeline on the 30x driver series is on a par with that of the YUV pipeline on the 28x driver series, and well within the required performance boundaries.
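To make the CPU-side conversion concrete, here is a plain scalar sketch of a UYVY to RGB32 conversion. It is not the Intel IPP routine mentioned above, and it will be much slower than an optimized version. It assumes BT.601 studio-range video and uses the common fixed-point coefficients for that case. The function names are hypothetical. The output layout matches the bit masks of the RGB32 surface declared earlier (red in bits 16-23, green in bits 8-15, blue in bits 0-7).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Clamp an intermediate result to the 0..255 range of an 8-bit channel.
static inline std::uint32_t ClampToByte(int value)
{
    return static_cast<std::uint32_t>(std::min(255, std::max(0, value)));
}

// Convert one Y/U/V sample triple to a 0x00RRGGBB pixel.
// Assumption: BT.601 studio-range video (Y in 16..235), coefficients scaled by 256.
static inline std::uint32_t YuvToRgb32(int y, int u, int v)
{
    const int c = 298 * (y - 16) + 128;  // scaled luma plus rounding term
    const int d = u - 128;
    const int e = v - 128;
    // Division truncates toward zero; negative results clamp to 0 anyway.
    const std::uint32_t r = ClampToByte((c + 409 * e) / 256);
    const std::uint32_t g = ClampToByte((c - 100 * d - 208 * e) / 256);
    const std::uint32_t b = ClampToByte((c + 516 * d) / 256);
    return (r << 16) | (g << 8) | b;
}

// Convert a UYVY image (bytes U0 Y0 V0 Y1 for each pair of pixels) to RGB32.
// Pitches are in bytes, so dst may point into a locked surface whose rows
// are padded. Each dst row must be 4-byte aligned.
void ConvertUyvyToRgb32(const std::uint8_t* src, std::size_t srcPitch,
                        std::uint8_t* dst, std::size_t dstPitch,
                        int width, int height)
{
    assert(width > 0 && width % 2 == 0);  // UYVY stores pixels in pairs
    for (int row = 0; row < height; ++row)
    {
        const std::uint8_t* in = src + row * srcPitch;
        std::uint32_t* out = reinterpret_cast<std::uint32_t*>(dst + row * dstPitch);
        for (int x = 0; x < width; x += 2, in += 4, out += 2)
        {
            const int u = in[0], y0 = in[1], v = in[2], y1 = in[3];
            out[0] = YuvToRgb32(y0, u, v);  // both pixels share one U/V pair
            out[1] = YuvToRgb32(y1, u, v);
        }
    }
}
```

Each group of 4 input bytes produces 8 output bytes, which is where the doubled bandwidth comes from. In the pipeline, dst and dstPitch would come from locking the off-screen RGB32 surface, so conversion and copy to video memory happen in a single pass.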

Last Updated on Friday, 27 July 2012 11:12  