Stefano Tommesani


Slow blits with latest nVidia drivers

An application of mine uses DirectDraw to draw video frames on multiple screens. Until now, its visualization pipeline has used a group of off-screen YUV surfaces that, at the end of the process, are drawn into the primary surface. This is the code that creates the primary surface:

DDSURFACEDESC ddsd;
ZeroMemory(&ddsd, sizeof(ddsd));
ddsd.dwSize = sizeof(DDSURFACEDESC);
ddsd.dwFlags = DDSD_CAPS;
ddsd.ddsCaps.dwCaps = DDSCAPS_PRIMARYSURFACE;
if ((ddrval = lpDD->CreateSurface(&ddsd, &lpDDSPrimary, NULL)) != DD_OK)
{
    return VISLIB_ERROR;
}

and this is the code that creates an off-screen YUV surface:

ddsd.dwFlags = DDSD_CAPS | DDSD_HEIGHT | DDSD_WIDTH | DDSD_PIXELFORMAT;
ddsd.ddsCaps.dwCaps = DDSCAPS_OFFSCREENPLAIN;
ddsd.dwWidth = SurfaceWidth;
ddsd.dwHeight = SurfaceHeight;
// Set up the pixel format for packed 16-bit YUV (UYVY).
ddsd.ddpfPixelFormat.dwSize = sizeof(DDPIXELFORMAT);
ddsd.ddpfPixelFormat.dwFlags = DDPF_FOURCC;
ddsd.ddpfPixelFormat.dwFourCC = MAKEFOURCC('U','Y','V','Y');
ddsd.ddpfPixelFormat.dwYUVBitCount = 16;
if ((ddrval = lpDD->CreateSurface(&ddsd, &lpDDSOff, NULL)) != DD_OK)
{
    return VISLIB_ERROR;
}

At the end of the pipeline, the processed image is in the lpDDSOff surface, and gets copied to the primary surface lpDDSPrimary with a simple blit:

ddrval = lpDDSPrimary->Blt(&DestRect, lpDDSOff, &SourceRect, DDBLT_WAIT, NULL);

The final blit does not even rescale the image, as the image in the off-screen surface already has the dimensions of DestRect in the primary surface. It just copies data from the off-screen surface to the primary surface and converts it from YUV to the color space of the primary surface (these days, RGB32 is a safe bet). This visualization pipeline ran fine for years, up to the nVidia 28x driver series. Starting with the 29x series, and continuing with the latest 30x versions, performance dropped dramatically: blits were so slow that they dragged down the whole system.

So I started benchmarking the various steps of the visualization pipeline, and it turned out that the last step, that humble blit you see above, was about 100 times slower with the newer drivers than with the older ones! Performance was even worse when blitting to a secondary monitor, where drawing a single image to the screen took many milliseconds. Worse still, GPU usage was very high even when drawing only a few video streams at the same time, so the GPU had clearly become a bottleneck, which it should not be, as no complex operations are going on.
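The kind of per-step timing described above can be sketched with a small helper. This is only an illustration, not the code used for the original measurements: the name MeanMilliseconds and the idea of averaging over a fixed number of iterations are assumptions.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs `step` `iterations` times and returns the
// mean wall-clock duration of one run, in milliseconds.
template <typename Step>
double MeanMilliseconds(Step&& step, int iterations)
{
    assert(iterations > 0);
    using Clock = std::chrono::steady_clock;
    const Clock::time_point start = Clock::now();
    for (int i = 0; i < iterations; ++i)
    {
        step();
    }
    const std::chrono::duration<double, std::milli> elapsed = Clock::now() - start;
    return elapsed.count() / iterations;
}
```

In the application, the step would be a lambda wrapping the Blt call shown earlier, for example MeanMilliseconds([&] { lpDDSPrimary->Blt(&DestRect, lpDDSOff, &SourceRect, DDBLT_WAIT, NULL); }, 100). Averaging over many iterations is a deliberate choice here, since a single blit may return before the hardware has finished the copy.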

The solution is to switch the off-screen surfaces from YUV to RGB32, so the surface description becomes:

ddsd.dwFlags = DDSD_CAPS | DDSD_HEIGHT | DDSD_WIDTH | DDSD_PIXELFORMAT;
ddsd.ddsCaps.dwCaps = DDSCAPS_OFFSCREENPLAIN;
ddsd.dwWidth = SurfaceWidth;
ddsd.dwHeight = SurfaceHeight;
// Set up the pixel format for 32-bit RGB (8 bits each for R, G and B; the top byte is unused).
ddsd.ddpfPixelFormat.dwSize = sizeof(DDPIXELFORMAT);
ddsd.ddpfPixelFormat.dwFlags = DDPF_RGB;
ddsd.ddpfPixelFormat.dwRGBBitCount = 32;
ddsd.ddpfPixelFormat.dwRBitMask = 0x00FF0000;
ddsd.ddpfPixelFormat.dwGBitMask = 0x0000FF00;
ddsd.ddpfPixelFormat.dwBBitMask = 0x000000FF;

The blit to the primary surface becomes lightning-quick, with minimal GPU usage. The downsides of this solution are that

  • copying an RGB32 image to video memory instead of a UYVY image takes twice the bandwidth (32 bits per pixel instead of 16)
  • the CPU must perform a YUV -> RGB32 conversion while copying data to video memory
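To put a number on the bandwidth cost, here is a minimal sketch of the arithmetic. The helper names are hypothetical, the 720x576 at 25 fps stream is only an assumed example (the article does not state the frame size), and surface pitch padding is ignored.

```cpp
#include <cassert>
#include <cstddef>

// Bytes per pixel of the two surface formats discussed in the article.
const std::size_t kBytesPerPixelUYVY  = 2; // packed YUV 4:2:2, 16 bits per pixel
const std::size_t kBytesPerPixelRGB32 = 4; // 32 bits per pixel

// Size of one tightly packed frame, in bytes.
inline std::size_t FrameBytes(std::size_t width, std::size_t height,
                              std::size_t bytesPerPixel)
{
    return width * height * bytesPerPixel;
}

// Upload rate for one stream, in megabytes (10^6 bytes) per second.
inline double MegabytesPerSecond(std::size_t frameBytes, double fps)
{
    assert(fps > 0.0);
    return static_cast<double>(frameBytes) * fps / 1.0e6;
}
```

For the assumed 720x576 stream, a UYVY frame is 829,440 bytes and an RGB32 frame is 1,658,880 bytes, so at 25 fps the upload goes from about 20.7 MB/s to about 41.5 MB/s per stream.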

After benchmarking, both downsides seem to be quite minor, thanks to the speed of copies from system memory to video memory and to the highly optimized YUV -> RGB32 color-space conversions available in the Intel IPP library. Summing up, the performance of the RGB32 pipeline on the 30x driver series is on a par with that of the YUV pipeline on the 28x driver series, and well within the required performance boundaries.
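For readers who want to see what the CPU-side conversion does, below is a plain scalar sketch of UYVY -> RGB32 for one row of pixels. It is not the Intel IPP routine and it will be much slower than an optimized version. The function names are hypothetical, and the coefficients assume BT.601 limited-range video, which the article does not specify.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Clamps an intermediate result to the 0..255 range of one color channel.
static inline std::uint32_t ClampToByte(int v)
{
    return static_cast<std::uint32_t>(v < 0 ? 0 : (v > 255 ? 255 : v));
}

// Converts one YUV sample to a 0x00RRGGBB pixel, matching the bit masks
// used for the RGB32 surface (R = 0x00FF0000, G = 0x0000FF00, B = 0x000000FF).
// Assumption: BT.601 limited range (Y in 16..235, U and V centered on 128).
static inline std::uint32_t YuvToRgb32(int y, int u, int v)
{
    const int c = y - 16;
    const int d = u - 128;
    const int e = v - 128;
    const std::uint32_t r = ClampToByte((298 * c + 409 * e + 128) >> 8);
    const std::uint32_t g = ClampToByte((298 * c - 100 * d - 208 * e + 128) >> 8);
    const std::uint32_t b = ClampToByte((298 * c + 516 * d + 128) >> 8);
    return (r << 16) | (g << 8) | b;
}

// Converts one row of UYVY data (byte order U0 Y0 V0 Y1, two pixels per
// four bytes) into `width` RGB32 pixels. `width` must be even.
void ConvertUyvyRowToRgb32(const std::uint8_t* src, std::uint32_t* dst,
                           std::size_t width)
{
    assert(width % 2 == 0);
    for (std::size_t x = 0; x < width; x += 2)
    {
        const int u  = src[0];
        const int y0 = src[1];
        const int v  = src[2];
        const int y1 = src[3];
        dst[0] = YuvToRgb32(y0, u, v); // both pixels share the same chroma
        dst[1] = YuvToRgb32(y1, u, v);
        src += 4;
        dst += 2;
    }
}
```

When writing into a locked DirectDraw surface, the destination pointer should advance by the surface pitch from one row to the next rather than by width * 4 bytes, since rows in video memory may be padded.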

Last Updated on Friday, 27 July 2012 11:12  