Stefano Tommesani

  • Increase font size
  • Default font size
  • Decrease font size
Home Programming Different types of parallel loops with SSE2, SSSE3 and Visual C++ 2012

Different types of parallel loops with SSE2, SSSE3 and Visual C++ 2012

This is not the first article on this site that discusses how to use the Intel Thread Building Blocks library to spread the computation of an image-processing kernel over multiple threads: the article named "Multi thread loops with Intel TBB" showed how to do it with Intel TBB 2.x. However, due to the enhancements of Intel TBB 4.x, and the availability of C++ 11 compliant compilers such as Microsoft Visual C++ 2012, it is now possible to write more compact and easy to understand code than before.

TBBScreenShot

The starting point is a scalar C function that converts a RGB or RGBA image into a grayscale, luma-only image:

// ProcessRGBSerial
// reference serial code that converts a given RGB image to a luma-only image
// both input and output image are unidimensional arrays of ImageWidth * ImageHeight pixels
// PixelOffset is the distance in bytes between two pixels in the input RGB image (RGB24 -> PixelOffset = 3, RGBA32 -> PixelOffset = 4)

void ProcessRGBSerial(void *SourceImage, void *YImage, int ImageWidth, int ImageHeight, int PixelOffset)
{
    unsigned char *SourceImagePtr = (unsigned char *)SourceImage;
    unsigned char *YImagePtr = (unsigned char *)YImage;

    for (int i = (ImageWidth * ImageHeight); i > 0; i--)
    {
        int YValue = (SourceImagePtr[0] * Y_RED_SCALE  ) +
                     (SourceImagePtr[1] * Y_GREEN_SCALE) +
                     (SourceImagePtr[2] * Y_BLUE_SCALE );
        SourceImagePtr += PixelOffset;
        YValue += 1 << (SCALING_LOG - 1);
        YValue >>= SCALING_LOG;
        if (YValue > 255)
            YValue = 255;
        *YImagePtr = (unsigned char)YValue;
        YImagePtr++;
    }
}

Please refer to the full source code at the bottom of the page for the declaration of constants, or just download the whole project and open it in Microsoft Visual C++ 2012.

Before starting to rewrite this function to exploit multiple threads, we add the needed header files of Intel TBB:

#include <task_scheduler_init.h>
using namespace tbb;
#include <parallel_for.h>
#include <blocked_range.h>
#include <blocked_range2d.h>

Now we are ready to rewrite the ProcessRGBSerial function to use TBB's threads. In the first step, we will use the same method of TBB 2.x:

// ProcessRGBTBB1
// first version of multi-threaded code using Intel TBB
// it is necessary to define a struct (RGBToGrayConverter) with a () operator that hosts the kernel of the computation
// all the parameters necessary to the computation are copied inside the struct

struct RGBToGrayConverter {
    unsigned char *SourceImagePtr;
    unsigned char *YImagePtr;
    int PixelOffset;
    void operator( )( const blocked_range<int>& range ) const {
        unsigned char *LocalSourceImagePtr = SourceImagePtr;
        unsigned char *LocalYImagePtr = YImagePtr;
        LocalSourceImagePtr += range.begin() * PixelOffset;
        LocalYImagePtr += range.begin();
        for( int i=range.begin(); i!=range.end( ); ++i )
        {
            int YValue = (LocalSourceImagePtr[0] * Y_RED_SCALE  ) +
                         (LocalSourceImagePtr[1] * Y_GREEN_SCALE) +
                         (LocalSourceImagePtr[2] * Y_BLUE_SCALE );
            LocalSourceImagePtr += PixelOffset;
            YValue += 1 << (SCALING_LOG - 1);
            YValue >>= SCALING_LOG;
            if (YValue > 255)
                YValue = 255;
            *LocalYImagePtr = (unsigned char)YValue;
            LocalYImagePtr++;
        }
    }
};

void ProcessRGBTBB1(void *SourceImage, void *YImage, int ImageWidth, int ImageHeight, int PixelOffset)
{
    RGBToGrayConverter RGBToGrayConverterPtr;
    RGBToGrayConverterPtr.PixelOffset = PixelOffset;
    RGBToGrayConverterPtr.SourceImagePtr = (unsigned char *)SourceImage;
    RGBToGrayConverterPtr.YImagePtr = (unsigned char *)YImage;

    parallel_for( blocked_range<int>( 0, (ImageWidth * ImageHeight)), RGBToGrayConverterPtr);
}

The computational kernel is contained in the () operator of the RGBToGrayConverter struct. All the I/O parameters of the computation kernel must be added to the struct, and they must be initialized by the calling ProcessTGBTBB1 function before starting the processing via parallel_for. Each time the operator () of the struct is called, the thread gets a sub-range of the whole image to work on, contained in the blocked_range parameter, so it computes the pixels between range.begin() and range.end(). Even if there is a significant amount of boiler-plate code around it, the core of the kernel is mostly unchanged, so most of the source serial code is retained.

Still, we can do better than this. By moving the image-processing kernel in a lambda defined inside the call to parallel_for, we can completely avoid the declaration of a new struct:

// ProcessRGBTBB2
// by using a lambda that copies the required information automatically [=], the kernel of the computation is contained
// inside the invocation to parallel_for

void ProcessRGBTBB2(void *SourceImage, void *YImage, int ImageWidth, int ImageHeight, int PixelOffset)
{
    parallel_for( blocked_range<int>( 0, (ImageWidth * ImageHeight)),
        [=](const blocked_range<int>& r) {
            unsigned char *LocalSourceImagePtr = (unsigned char *)SourceImage;
            unsigned char *LocalYImagePtr = (unsigned char *)YImage;
            LocalSourceImagePtr += r.begin() * PixelOffset;
            LocalYImagePtr += r.begin();

            for(size_t i = r.begin(); i != r.end(); ++i)
            {
                int YValue = (LocalSourceImagePtr[0] * Y_RED_SCALE  ) +
                             (LocalSourceImagePtr[1] * Y_GREEN_SCALE) +
                             (LocalSourceImagePtr[2] * Y_BLUE_SCALE );
                LocalSourceImagePtr += PixelOffset;
                YValue += 1 << (SCALING_LOG - 1);
                YValue >>= SCALING_LOG;
                if (YValue > 255)
                    YValue = 255;
                *LocalYImagePtr = (unsigned char)YValue;
                LocalYImagePtr++;
            }
          }
        );
}

While the first parameter to the parallel_for invocation is still a blocked_range defined as before, the second parameter is now the declaration of a lambda, so that the image processing kernel is inside the calling function. This parallel code is remarkably similar to the original serial code, the major difference being that processing is applied to a slice of the input image instead of the whole image, as shown in the blocked_range parameter. Please note that [=] in the declaration of the lambda automatically copies the required I/O parameters used by the lambda, so we do not have to add these variables to a struct and fill them one by one, a boring and error-prone task.

Since the images were defined as one-dimensional arrays, using a 1D blocked_range to work on a slice of the image was a natural match. Still, it is possible to use a 2D blocked_range, unsurprinsingly named blocked_range2d, to work on 2D tiles instead of 1D slices:

// ProcessRGBTBB3
// even if working with images expressed as unidimensional array does not require it, a 2D range instead of a 1D one
// works on a tile of the given images at a time
// the location and dimensions of the tile are contained in the rows and cols of the blocked_range2d

void ProcessRGBTBB3(void *SourceImage, void *YImage, int ImageWidth, int ImageHeight, int PixelOffset)
{
    parallel_for( blocked_range2d<int,int>( 0, ImageHeight, 0, ImageWidth),
        [=](const blocked_range2d<int,int>& r) {
            int StartX = r.cols().begin();
            int StopX = r.cols().end();
            int StartY = r.rows().begin();
            int StopY = r.rows().end();

            for (int y = StartY; y != StopY; y++)
            {
                unsigned char *LocalSourceImagePtr = (unsigned char *)SourceImage + (StartX + y * ImageWidth) * PixelOffset;
                unsigned char *LocalYImagePtr = (unsigned char *)YImage + (StartX + y * ImageWidth);
                for (int x = StartX; x != StopX; x++)
                {
                    int YValue = (LocalSourceImagePtr[0] * Y_RED_SCALE ) +
                             (LocalSourceImagePtr[1] * Y_GREEN_SCALE) +
                             (LocalSourceImagePtr[2] * Y_BLUE_SCALE );
                    LocalSourceImagePtr += PixelOffset;
                    YValue += 1 << (SCALING_LOG - 1);
                    YValue >>= SCALING_LOG;
                    if (YValue > 255)
                        YValue = 255;
                    *LocalYImagePtr = (unsigned char)YValue;
                    LocalYImagePtr++;
                }
            }
          }
        );
}

This time we have a 2D range, and we get the interval in both the horizontal and vertical direction by reading the rows() and cols() of the given blocked_range2d.

Still, we can do better than this, by rewriting the kernel using SIMD instructions (in this case, we will use SSE2 instructions). To keep the code simple and readable, we will assume that ImageWidth * ImageHeight is an integer multiple of 4 (for a generic dimension, we should add a serial loop at the end that computes the missing bytes in the output image). As the input buffer is not guaranteed to be aligned on 16-bytes boundaries, the loads from the input image are unaligned using the _mm_loadu_si128() intrinsics. I have added the corresponding serial C code lines inside the SIMD loop, so that you can match the behaviour of SSE2 instructions to the reference C code, as the SIMD code bears little if any resemblance to the C code. Please note that the SIMD code is not a direct replacement for the serial C code, as each iteration of this loop produces 4 output pixels, while the serial C code produces only 1 output pixel per loop.

// ProcessRGBSIMD
// SSE2 version of Serial code
// both input and output image are unidimensional arrays of ImageWidth * ImageHeight pixels
// assumes that PixelOffset is 4 (RGBA image) and that ImageWidth * ImageHeight * PixelOffset is a multiple of 16
// does not assume that the input image is aligned on 16 bytes

void ProcessRGBSIMD(void *SourceImage, void *YImage, int ImageWidth, int ImageHeight, int PixelOffset)
{
    __m128i RGBScale = _mm_set_epi16(0, Y_BLUE_SCALE, Y_GREEN_SCALE, Y_RED_SCALE, 0, Y_BLUE_SCALE, Y_GREEN_SCALE, Y_RED_SCALE);
    __m128i ShiftScalingAdjust = _mm_set1_epi32(1 << (SCALING_LOG - 1));
    __m128i *SourceImagePtr = (__m128i *)SourceImage;
    int *YImagePtr = (int *)YImage;

    for (int i = (ImageWidth * ImageHeight); i > 0; i -= 4)  // 4 pixels per loop
    {
        __m128i RGBValue = _mm_loadu_si128(SourceImagePtr);
        SourceImagePtr++;
        __m128i LowRGBValue = _mm_unpacklo_epi8(RGBValue, _mm_setzero_si128());
        __m128i HighRGBValue = _mm_unpackhi_epi8(RGBValue, _mm_setzero_si128());
        // int YValue = (SourceImagePtr[0] * Y_RED_SCALE  ) +
        //             (SourceImagePtr[1] * Y_GREEN_SCALE) +
        //             (SourceImagePtr[2] * Y_BLUE_SCALE );
        __m128i LowYValue = _mm_madd_epi16(LowRGBValue, RGBScale);
        __m128i HighYValue = _mm_madd_epi16(HighRGBValue, RGBScale);
        LowYValue = _mm_add_epi32(LowYValue, _mm_slli_epi64(LowYValue, 32));
        HighYValue = _mm_add_epi32(HighYValue, _mm_slli_epi64(HighYValue, 32));
        // YValue += 1 << (SCALING_LOG - 1);
        LowYValue = _mm_add_epi32(LowYValue, ShiftScalingAdjust);
        HighYValue = _mm_add_epi32(HighYValue, ShiftScalingAdjust);
        // YValue >>= SCALING_LOG;
        LowYValue = _mm_srli_epi64(LowYValue, 32 + SCALING_LOG);
        HighYValue = _mm_srli_epi64(HighYValue, 32 + SCALING_LOG);
        __m128i YValue =  _mm_packs_epi32(LowYValue, HighYValue);
        YValue =  _mm_packs_epi32(YValue, _mm_setzero_si128());
        // if (YValue > 255)
        //    YValue = 255;
        YValue =  _mm_packus_epi16(YValue, YValue);
        *YImagePtr = _mm_cvtsi128_si32(YValue);
        YImagePtr++;
    }
}

The final step is merging the SIMD kernel with the TBB multi-thread approach so that we can achieve the highest level of performance:

// ProcessRGBTBBSIMD
// by using a lambda that copies the required information automatically [=], the kernel of the computation is contained
// inside the invocation to parallel_for

void ProcessRGBTBBSIMD(void *SourceImage, void *YImage, int ImageWidth, int ImageHeight, int PixelOffset)
{
    __m128i RGBScale = _mm_set_epi16(0, Y_BLUE_SCALE, Y_GREEN_SCALE, Y_RED_SCALE, 0, Y_BLUE_SCALE, Y_GREEN_SCALE, Y_RED_SCALE);
    __m128i ShiftScalingAdjust = _mm_set1_epi32(1 << (SCALING_LOG - 1));
    __m128i *SourceImagePtr = (__m128i *)SourceImage;
    int *YImagePtr = (int *)YImage;

    parallel_for( blocked_range<int>( 0, (ImageWidth * ImageHeight)),
        [=](const blocked_range<int>& r) {
            unsigned char *LocalSourceImagePtr = (unsigned char *)SourceImage;
            unsigned char *LocalYImagePtr = (unsigned char *)YImage;
            LocalSourceImagePtr += r.begin() * PixelOffset;
            LocalYImagePtr += r.begin();
            __m128i *SourceImagePtr = (__m128i *)LocalSourceImagePtr;
            int *YImagePtr = (int *)LocalYImagePtr;

            for (size_t i = r.begin(); i != r.end(); i+=4)
            {
                __m128i RGBValue = _mm_loadu_si128(SourceImagePtr);
                SourceImagePtr++;
                __m128i LowRGBValue = _mm_unpacklo_epi8(RGBValue, _mm_setzero_si128());
                __m128i HighRGBValue = _mm_unpackhi_epi8(RGBValue, _mm_setzero_si128());
                // int YValue = (SourceImagePtr[0] * Y_RED_SCALE  ) +
                //             (SourceImagePtr[1] * Y_GREEN_SCALE) +
                //             (SourceImagePtr[2] * Y_BLUE_SCALE );
                __m128i LowYValue = _mm_madd_epi16(LowRGBValue, RGBScale);
                __m128i HighYValue = _mm_madd_epi16(HighRGBValue, RGBScale);
                LowYValue = _mm_add_epi32(LowYValue, _mm_slli_epi64(LowYValue, 32));
                HighYValue = _mm_add_epi32(HighYValue, _mm_slli_epi64(HighYValue, 32));
                // YValue += 1 << (SCALING_LOG - 1);
                LowYValue = _mm_add_epi32(LowYValue, ShiftScalingAdjust);
                HighYValue = _mm_add_epi32(HighYValue, ShiftScalingAdjust);
                // YValue >>= SCALING_LOG;
                LowYValue = _mm_srli_epi64(LowYValue, 32 + SCALING_LOG);
                HighYValue = _mm_srli_epi64(HighYValue, 32 + SCALING_LOG);
                __m128i YValue =  _mm_packs_epi32(LowYValue, HighYValue);
                YValue =  _mm_packs_epi32(YValue, _mm_setzero_si128());
                // if (YValue > 255)
                //    YValue = 255;
                YValue =  _mm_packus_epi16(YValue, YValue);
                *YImagePtr = _mm_cvtsi128_si32(YValue);
                YImagePtr++;
            }
          }
        );
}

Now we have arrived to an high-performance solution, but code readibility has been severely reduced, mostly due to the SIMD instructions (even if intrinsics are a bit more readable than plain assembly code).

It is time to check how fast these code fragments run: the following graph shows the results on an Intel i3 2570, a 2 cores / 4 HT threads CPU

TBBBenchmark

The Serial C code is obviously the slowest one, and will represent the baseline performance level. Moving to TBB multi-threaded code, both the code using lambda and struct have nearly equal performance, with a 65% speedup over serial code. Using tiling instead of a slice of the input image improves performance significantly, bringing the speed-up to 94%. The SSE2-based solution is faster even if it uses just a thread, running 119% than the baseline code. Merging multi-threads and SSE2 kernel code brings an impressive result, with a 327% speed-up!

You can download the full Microsoft Visual C++ 2012 project, named TBBDemo, in the Download area, it works in Visual C++ Express too! I really urge you to download and play with the code, as it takes a short time to become familiar with the new style based on lambdas, and it is a lot easier both to read and write.

Update (April 29th, 2013)

Usage a specific SSSE3 instruction named PHADDD (packed horizontal add of dwords) replaces a couple of shift + add instructions, and as these instructions are on the critical path, the performance gain is noticeable, as we will see later. Still, in production-quality code, you should check that SSSE3 is supported on the current CPU, as pre-Core2 processors do not support it.

// ProcessRGBSIMD2
// SSSE3 version of Serial code
// both input and output image are unidimensional arrays of ImageWidth * ImageHeight pixels
// assumes that PixelOffset is 4 (RGBA image) and that ImageWidth * ImageHeight * PixelOffset is a multiple of 16
// does not assume that the input image is aligned on 16 bytes

void ProcessRGBSIMD2(void *SourceImage, void *YImage, int ImageWidth, int ImageHeight, int PixelOffset)
{
    __m128i RGBScale = _mm_set_epi16(0, Y_BLUE_SCALE, Y_GREEN_SCALE, Y_RED_SCALE, 0, Y_BLUE_SCALE, Y_GREEN_SCALE, Y_RED_SCALE);
    __m128i ShiftScalingAdjust = _mm_set1_epi32(1 << (SCALING_LOG - 1));
    __m128i *SourceImagePtr = (__m128i *)SourceImage;
    __m128i ZeroConst = _mm_setzero_si128();
    int *YImagePtr = (int *)YImage;

    for (int i = (ImageWidth * ImageHeight); i > 0; i -= 4)  // 4 pixels per loop
    {
        __m128i RGBValue = _mm_loadu_si128(SourceImagePtr);
        SourceImagePtr++;
        __m128i LowRGBValue = _mm_unpacklo_epi8(RGBValue, ZeroConst);
        __m128i HighRGBValue = _mm_unpackhi_epi8(RGBValue, ZeroConst);
        // int YValue = (SourceImagePtr[0] * Y_RED_SCALE  ) +
        //             (SourceImagePtr[1] * Y_GREEN_SCALE) +
        //             (SourceImagePtr[2] * Y_BLUE_SCALE );
        __m128i LowYValue = _mm_madd_epi16(LowRGBValue, RGBScale);
        __m128i HighYValue = _mm_madd_epi16(HighRGBValue, RGBScale);
        __m128i YValue = _mm_hadd_epi32(LowYValue, HighYValue);
        // YValue += 1 << (SCALING_LOG - 1);
        YValue = _mm_add_epi32(YValue, ShiftScalingAdjust);
        // YValue >>= SCALING_LOG;
        YValue = _mm_srli_epi32(YValue, SCALING_LOG);
        YValue =  _mm_packs_epi32(YValue, YValue);
        // if (YValue > 255)
        //    YValue = 255;
        YValue =  _mm_packus_epi16(YValue, YValue);
        // *YImagePtr = (unsigned char)YValue;
        *YImagePtr = _mm_cvtsi128_si32(YValue);
        YImagePtr++;
    }
}

So, how fast does it run? While the SSE2-optimized code was 2.2 times faster than the serial code, this version using PHADDD achieves 3x speed-up, and adding TBB multi-threading brings the overall speed-up to a respectable 5.4x. A futher 10% improvement can be obtained by unrolling the loop one time, and pairing the instructions of the separate iterations, as there is low pressure on XMM registers and most instructions are on the critical path, so there is a significant number of execution slots available for other instructions.

TBBBenchmark2

 

Source code

 

TBBDemoRoutines.h

// Intel TBB Demo
// by Stefano Tommesani (www.tommesani.com) 2013
// this code is release under the Code Project Open License (CPOL) http://www.codeproject.com/info/cpol10.aspx
// The main points subject to the terms of the License are:
// -   Source Code and Executable Files can be used in commercial applications;
// -   Source Code and Executable Files can be redistributed; and
// -   Source Code can be modified to create derivative works.
// -   No claim of suitability, guarantee, or any warranty whatsoever is provided. The software is provided "as-is".
// -   The Article(s) accompanying the Work may not be distributed or republished without the Author's consent
void ProcessRGBSerial(void *SourceImage, void *YImage, int ImageWidth, int ImageHeight, int PixelOffset);
void ProcessRGBTBB1(void *SourceImage, void *YImage, int ImageWidth, int ImageHeight, int PixelOffset);
void ProcessRGBTBB2(void *SourceImage, void *YImage, int ImageWidth, int ImageHeight, int PixelOffset);
void ProcessRGBTBB3(void *SourceImage, void *YImage, int ImageWidth, int ImageHeight, int PixelOffset);
void ProcessRGBSIMD(void *SourceImage, void *YImage, int ImageWidth, int ImageHeight, int PixelOffset);
void ProcessRGBSIMD2(void *SourceImage, void *YImage, int ImageWidth, int ImageHeight, int PixelOffset);
void ProcessRGBSIMD3(void *SourceImage, void *YImage, int ImageWidth, int ImageHeight, int PixelOffset);
void ProcessRGBTBBSIMD(void *SourceImage, void *YImage, int ImageWidth, int ImageHeight, int PixelOffset);

TBBDemoRoutines.cpp

// Intel TBB Demo
// by Stefano Tommesani (www.tommesani.com) 2013
// this code is release under the Code Project Open License (CPOL) http://www.codeproject.com/info/cpol10.aspx
// The main points subject to the terms of the License are:
// -   Source Code and Executable Files can be used in commercial applications;
// -   Source Code and Executable Files can be redistributed; and
// -   Source Code can be modified to create derivative works.
// -   No claim of suitability, guarantee, or any warranty whatsoever is provided. The software is provided "as-is".
// -   The Article(s) accompanying the Work may not be distributed or republished without the Author's consent

#include <Windows.h>

// Intel TBB library
#include <task_scheduler_init.h>
using namespace tbb;
#include <parallel_for.h>
#include <blocked_range.h>
#include <blocked_range2d.h>

// SIMD intrinsics
#include <emmintrin.h>

// constants for RGB to Y conversion
// Ey = 0.299*Er + 0.587*Eg + 0.114*Eb
const int SCALING_LOG = 15;
const int SCALING_FACTOR = (1 << SCALING_LOG);
const int Y_RED_SCALE = (int)(0.299 * SCALING_FACTOR);
const int Y_GREEN_SCALE = (int)(0.587 * SCALING_FACTOR);
const int Y_BLUE_SCALE = (int)(0.114 * SCALING_FACTOR);

// ProcessRGBSerial
// reference serial code that converts a given RGB image to a luma-only image
// both input and output image are unidimensional arrays of ImageWidth * ImageHeight pixels
// PixelOffset is the distance in bytes between two pixels in the input RGB image (RGB24 -> PixelOffset = 3, RGBA32 -> PixelOffset = 4)

void ProcessRGBSerial(void *SourceImage, void *YImage, int ImageWidth, int ImageHeight, int PixelOffset)
{
    unsigned char *SourceImagePtr = (unsigned char *)SourceImage;
    unsigned char *YImagePtr = (unsigned char *)YImage;

    for (int i = (ImageWidth * ImageHeight); i > 0; i--)
    {
        int YValue = (SourceImagePtr[0] * Y_RED_SCALE  ) +
                     (SourceImagePtr[1] * Y_GREEN_SCALE) +
                     (SourceImagePtr[2] * Y_BLUE_SCALE );
        SourceImagePtr += PixelOffset;
        YValue += 1 << (SCALING_LOG - 1);
        YValue >>= SCALING_LOG;
        if (YValue > 255)
            YValue = 255;
        *YImagePtr = (unsigned char)YValue;
        YImagePtr++;
    }
}

// ProcessRGBTBB1
// first version of multi-threaded code using Intel TBB
// it is necessary to define a struct (RGBToGrayConverter) with a () operator that hosts the kernel of the computation
// all the parameters necessary to the computation are copied inside the struct

struct RGBToGrayConverter {
    unsigned char *SourceImagePtr;
    unsigned char *YImagePtr;
    int PixelOffset;
    void operator( )( const blocked_range<int>& range ) const {
        unsigned char *LocalSourceImagePtr = SourceImagePtr;
        unsigned char *LocalYImagePtr = YImagePtr;
        LocalSourceImagePtr += range.begin() * PixelOffset;
        LocalYImagePtr += range.begin();
        for( int i=range.begin(); i!=range.end( ); ++i )
        {
            int YValue = (LocalSourceImagePtr[0] * Y_RED_SCALE  ) +
                         (LocalSourceImagePtr[1] * Y_GREEN_SCALE) +
                         (LocalSourceImagePtr[2] * Y_BLUE_SCALE );
            LocalSourceImagePtr += PixelOffset;
            YValue += 1 << (SCALING_LOG - 1);
            YValue >>= SCALING_LOG;
            if (YValue > 255)
                YValue = 255;
            *LocalYImagePtr = (unsigned char)YValue;
            LocalYImagePtr++;
        }
    }
};

void ProcessRGBTBB1(void *SourceImage, void *YImage, int ImageWidth, int ImageHeight, int PixelOffset)
{
    RGBToGrayConverter RGBToGrayConverterPtr;
    RGBToGrayConverterPtr.PixelOffset = PixelOffset;
    RGBToGrayConverterPtr.SourceImagePtr = (unsigned char *)SourceImage;
    RGBToGrayConverterPtr.YImagePtr = (unsigned char *)YImage;
    
    parallel_for( blocked_range<int>( 0, (ImageWidth * ImageHeight)), RGBToGrayConverterPtr);
}

// ProcessRGBTBB2
// by using a lambda that copies the required information automatically [=], the kernel of the computation is contained
// inside the invocation to parallel_for

void ProcessRGBTBB2(void *SourceImage, void *YImage, int ImageWidth, int ImageHeight, int PixelOffset)
{
    parallel_for( blocked_range<int>( 0, (ImageWidth * ImageHeight)),
        [=](const blocked_range<int>& r) {
            unsigned char *LocalSourceImagePtr = (unsigned char *)SourceImage;
            unsigned char *LocalYImagePtr = (unsigned char *)YImage;
            LocalSourceImagePtr += r.begin() * PixelOffset;
            LocalYImagePtr += r.begin();

            for(size_t i = r.begin(); i != r.end(); ++i)
            {
                int YValue = (LocalSourceImagePtr[0] * Y_RED_SCALE  ) +
                             (LocalSourceImagePtr[1] * Y_GREEN_SCALE) +
                             (LocalSourceImagePtr[2] * Y_BLUE_SCALE );
                LocalSourceImagePtr += PixelOffset;
                YValue += 1 << (SCALING_LOG - 1);
                YValue >>= SCALING_LOG;
                if (YValue > 255)
                    YValue = 255;
                *LocalYImagePtr = (unsigned char)YValue;
                LocalYImagePtr++;
            }
          }
        );
}

// ProcessRGBTBB3
// even if working with images expressed as unidimensional array does not require it, a 2D range instead of a 1D one
// works on a tile of the given images at a time
// the location and dimensions of the tile are contained in the rows and cols of the blocked_range2d

void ProcessRGBTBB3(void *SourceImage, void *YImage, int ImageWidth, int ImageHeight, int PixelOffset)
{
    parallel_for( blocked_range2d<int,int>( 0, ImageHeight, 0, ImageWidth),
        [=](const blocked_range2d<int,int>& r) {
            int StartX = r.cols().begin();
            int StopX = r.cols().end();
            int StartY = r.rows().begin();
            int StopY = r.rows().end();

            for (int y = StartY; y != StopY; y++)
            {
                unsigned char *LocalSourceImagePtr = (unsigned char *)SourceImage + (StartX + y * ImageWidth) * PixelOffset;
                unsigned char *LocalYImagePtr = (unsigned char *)YImage + (StartX + y * ImageWidth);
                for (int x = StartX; x != StopX; x++)
                {
                    int YValue = (LocalSourceImagePtr[0] * Y_RED_SCALE ) +
                             (LocalSourceImagePtr[1] * Y_GREEN_SCALE) +
                             (LocalSourceImagePtr[2] * Y_BLUE_SCALE );
                    LocalSourceImagePtr += PixelOffset;
                    YValue += 1 << (SCALING_LOG - 1);
                    YValue >>= SCALING_LOG;
                    if (YValue > 255)
                        YValue = 255;
                    *LocalYImagePtr = (unsigned char)YValue;
                    LocalYImagePtr++;
                }
            }
          }
        );
}

// ProcessRGBSIMD
// SSE2 version of Serial code
// both input and output image are unidimensional arrays of ImageWidth * ImageHeight pixels
// assumes that PixelOffset is 4 (RGBA image) and that ImageWidth * ImageHeight * PixelOffset is a multiple of 16
// does not assume that the input image is aligned on 16 bytes

void ProcessRGBSIMD(void *SourceImage, void *YImage, int ImageWidth, int ImageHeight, int PixelOffset)
{
    __m128i RGBScale = _mm_set_epi16(0, Y_BLUE_SCALE, Y_GREEN_SCALE, Y_RED_SCALE, 0, Y_BLUE_SCALE, Y_GREEN_SCALE, Y_RED_SCALE);
    __m128i ShiftScalingAdjust = _mm_set1_epi32(1 << (SCALING_LOG - 1));
    __m128i *SourceImagePtr = (__m128i *)SourceImage;
    int *YImagePtr = (int *)YImage;

    for (int i = (ImageWidth * ImageHeight); i > 0; i -= 4)  // 4 pixels per loop
    {
        __m128i RGBValue = _mm_loadu_si128(SourceImagePtr);
        SourceImagePtr++;
        __m128i LowRGBValue = _mm_unpacklo_epi8(RGBValue, _mm_setzero_si128());
        __m128i HighRGBValue = _mm_unpackhi_epi8(RGBValue, _mm_setzero_si128());
        // int YValue = (SourceImagePtr[0] * Y_RED_SCALE  ) +
        //             (SourceImagePtr[1] * Y_GREEN_SCALE) +
        //             (SourceImagePtr[2] * Y_BLUE_SCALE );
        __m128i LowYValue = _mm_madd_epi16(LowRGBValue, RGBScale);
        __m128i HighYValue = _mm_madd_epi16(HighRGBValue, RGBScale);
        LowYValue = _mm_add_epi32(LowYValue, _mm_slli_epi64(LowYValue, 32));
        HighYValue = _mm_add_epi32(HighYValue, _mm_slli_epi64(HighYValue, 32));
        // YValue += 1 << (SCALING_LOG - 1);
        LowYValue = _mm_add_epi32(LowYValue, ShiftScalingAdjust);
        HighYValue = _mm_add_epi32(HighYValue, ShiftScalingAdjust);
        // YValue >>= SCALING_LOG;
        LowYValue = _mm_srli_epi64(LowYValue, 32 + SCALING_LOG);
        HighYValue = _mm_srli_epi64(HighYValue, 32 + SCALING_LOG);
        __m128i YValue =  _mm_packs_epi32(LowYValue, HighYValue);
        YValue =  _mm_packs_epi32(YValue, _mm_setzero_si128());
        // if (YValue > 255)
        //    YValue = 255;
        YValue =  _mm_packus_epi16(YValue, YValue);
        // *YImagePtr = (unsigned char)YValue;
        *YImagePtr = _mm_cvtsi128_si32(YValue);
        YImagePtr++;
    }
}

// ProcessRGBSIMD2
// SSSE3 version of Serial code
// both input and output image are unidimensional arrays of ImageWidth * ImageHeight pixels
// assumes that PixelOffset is 4 (RGBA image) and that ImageWidth * ImageHeight * PixelOffset is a multiple of 16
// does not assume that the input image is aligned on 16 bytes

void ProcessRGBSIMD2(void *SourceImage, void *YImage, int ImageWidth, int ImageHeight, int PixelOffset)
{
    __m128i RGBScale = _mm_set_epi16(0, Y_BLUE_SCALE, Y_GREEN_SCALE, Y_RED_SCALE, 0, Y_BLUE_SCALE, Y_GREEN_SCALE, Y_RED_SCALE);
    __m128i ShiftScalingAdjust = _mm_set1_epi32(1 << (SCALING_LOG - 1));
    __m128i *SourceImagePtr = (__m128i *)SourceImage;
    __m128i ZeroConst = _mm_setzero_si128();
    int *YImagePtr = (int *)YImage;

    for (int i = (ImageWidth * ImageHeight); i > 0; i -= 4)  // 4 pixels per loop
    {
        __m128i RGBValue = _mm_loadu_si128(SourceImagePtr);
        SourceImagePtr++;
        __m128i LowRGBValue = _mm_unpacklo_epi8(RGBValue, ZeroConst);
        __m128i HighRGBValue = _mm_unpackhi_epi8(RGBValue, ZeroConst);
        // int YValue = (SourceImagePtr[0] * Y_RED_SCALE  ) +
        //             (SourceImagePtr[1] * Y_GREEN_SCALE) +
        //             (SourceImagePtr[2] * Y_BLUE_SCALE );
        __m128i LowYValue = _mm_madd_epi16(LowRGBValue, RGBScale);
        __m128i HighYValue = _mm_madd_epi16(HighRGBValue, RGBScale);
        __m128i YValue = _mm_hadd_epi32(LowYValue, HighYValue);
        // YValue += 1 << (SCALING_LOG - 1);
        YValue = _mm_add_epi32(YValue, ShiftScalingAdjust);
        // YValue >>= SCALING_LOG;
        YValue = _mm_srli_epi32(YValue, SCALING_LOG);
        YValue =  _mm_packs_epi32(YValue, YValue);
        // if (YValue > 255)
        //    YValue = 255;
        YValue =  _mm_packus_epi16(YValue, YValue);
        // *YImagePtr = (unsigned char)YValue;
        *YImagePtr = _mm_cvtsi128_si32(YValue);
        YImagePtr++;
    }
}

// ProcessRGBSIMD3
// reduced precision SSSE3 version of Serial code that does NOT produce results that match serial code's ones.
// both input and output image are unidimensional arrays of ImageWidth * ImageHeight pixels
// assumes that PixelOffset is 4 (RGBA image) and that ImageWidth * ImageHeight * PixelOffset is a multiple of 16
// does not assume that the input image is aligned on 16 bytes

void ProcessRGBSIMD3(void *SourceImage, void *YImage, int ImageWidth, int ImageHeight, int PixelOffset)
{
    const int SCALING_LOG = 7;
    const int SCALING_FACTOR = (1 << SCALING_LOG);
    const int Y_RED_SCALE = (int)(0.299 * SCALING_FACTOR);
    const int Y_GREEN_SCALE = (int)(0.587 * SCALING_FACTOR);
    const int Y_BLUE_SCALE = (int)(0.114 * SCALING_FACTOR);

    __m128i RGBScale = _mm_set_epi8(0, Y_BLUE_SCALE, Y_GREEN_SCALE, Y_RED_SCALE, 0, Y_BLUE_SCALE, Y_GREEN_SCALE, Y_RED_SCALE, 0, Y_BLUE_SCALE, Y_GREEN_SCALE, Y_RED_SCALE, 0, Y_BLUE_SCALE, Y_GREEN_SCALE, Y_RED_SCALE);
    __m128i ShiftScalingAdjust = _mm_set1_epi16(1 << (SCALING_LOG - 1));
    __m128i *SourceImagePtr = (__m128i *)SourceImage;
    __m128i ZeroConst = _mm_setzero_si128();
    int *YImagePtr = (int *)YImage;

    for (int i = (ImageWidth * ImageHeight); i > 0; i -= 4)  // 4 pixels per loop
    {
        __m128i RGBValue = _mm_loadu_si128(SourceImagePtr);
        SourceImagePtr++;
        __m128i MultYValue = _mm_maddubs_epi16(RGBValue, RGBScale);
        // int YValue = (SourceImagePtr[0] * Y_RED_SCALE  ) +
        //             (SourceImagePtr[1] * Y_GREEN_SCALE) +
        //             (SourceImagePtr[2] * Y_BLUE_SCALE );
        __m128i YValue = _mm_hadd_epi16(MultYValue, MultYValue);
        YValue = _mm_add_epi16(YValue, ShiftScalingAdjust);
        YValue = _mm_srli_epi16(YValue, SCALING_LOG);
        // if (YValue > 255)
        //    YValue = 255;
        YValue =  _mm_packus_epi16(YValue, YValue);
        *YImagePtr = _mm_cvtsi128_si32(YValue);
        YImagePtr++;
    }
}

// ProcessRGBTBBSIMD
// by using a lambda that copies the required information automatically [=], the kernel of the computation is contained
// inside the invocation to parallel_for

void ProcessRGBTBBSIMD(void *SourceImage, void *YImage, int ImageWidth, int ImageHeight, int PixelOffset)
{
    __m128i RGBScale = _mm_set_epi16(0, Y_BLUE_SCALE, Y_GREEN_SCALE, Y_RED_SCALE, 0, Y_BLUE_SCALE, Y_GREEN_SCALE, Y_RED_SCALE);
    __m128i ShiftScalingAdjust = _mm_set1_epi32(1 << (SCALING_LOG - 1));
    __m128i *SourceImagePtr = (__m128i *)SourceImage;
    __m128i ZeroConst = _mm_setzero_si128();
    int *YImagePtr = (int *)YImage;
    
    parallel_for( blocked_range<int>( 0, (ImageWidth * ImageHeight)),
        [=](const blocked_range<int>& r) {
            unsigned char *LocalSourceImagePtr = (unsigned char *)SourceImage;
            unsigned char *LocalYImagePtr = (unsigned char *)YImage;
            LocalSourceImagePtr += r.begin() * PixelOffset;
            LocalYImagePtr += r.begin();
            __m128i *SourceImagePtr = (__m128i *)LocalSourceImagePtr;
            int *YImagePtr = (int *)LocalYImagePtr;

            for (int i = r.begin(); i < r.end(); i+=4)
            {
                __m128i RGBValue = _mm_loadu_si128(SourceImagePtr);
                SourceImagePtr++;
                __m128i LowRGBValue = _mm_unpacklo_epi8(RGBValue, ZeroConst);
                __m128i HighRGBValue = _mm_unpackhi_epi8(RGBValue, ZeroConst);
                // int YValue = (SourceImagePtr[0] * Y_RED_SCALE  ) +
                //             (SourceImagePtr[1] * Y_GREEN_SCALE) +
                //             (SourceImagePtr[2] * Y_BLUE_SCALE );
                __m128i LowYValue = _mm_madd_epi16(LowRGBValue, RGBScale);
                __m128i HighYValue = _mm_madd_epi16(HighRGBValue, RGBScale);
                __m128i YValue = _mm_hadd_epi32(LowYValue, HighYValue);
                YValue = _mm_add_epi32(YValue, ShiftScalingAdjust);
                YValue = _mm_srli_epi32(YValue, SCALING_LOG);
                YValue =  _mm_packs_epi32(YValue, ZeroConst);
                // if (YValue > 255)
                //    YValue = 255;
                YValue =  _mm_packus_epi16(YValue, YValue);
                *YImagePtr = _mm_cvtsi128_si32(YValue);
                YImagePtr++;
            }
          }
        );
}

TBBDemoAMP.h

Yes, that's right, this is a GPGPU implementation using C++ AMP! I will talk about GPGPU programming with C++ AMP in a coming article

// Intel TBB Demo
// by Stefano Tommesani (www.tommesani.com) 2013
// this code is release under the Code Project Open License (CPOL) http://www.codeproject.com/info/cpol10.aspx
// The main points subject to the terms of the License are:
// -   Source Code and Executable Files can be used in commercial applications;
// -   Source Code and Executable Files can be redistributed; and
// -   Source Code can be modified to create derivative works.
// -   No claim of suitability, guarantee, or any warranty whatsoever is provided. The software is provided "as-is".
// -   The Article(s) accompanying the Work may not be distributed or republished without the Author's consent

void ProcessRGBAMP(void *SourceImage, void *YImage, int ImageWidth, int ImageHeight, int PixelOffset);

TBBDemoAMP.cpp

// Intel TBB Demo
// by Stefano Tommesani (www.tommesani.com) 2013
// this code is release under the Code Project Open License (CPOL) http://www.codeproject.com/info/cpol10.aspx
// The main points subject to the terms of the License are:
// -   Source Code and Executable Files can be used in commercial applications;
// -   Source Code and Executable Files can be redistributed; and
// -   Source Code can be modified to create derivative works.
// -   No claim of suitability, guarantee, or any warranty whatsoever is provided. The software is provided "as-is".
// -   The Article(s) accompanying the Work may not be distributed or republished without the Author's consent

#include <iostream>
using namespace std;
#include <amp.h>
using namespace concurrency;

#include "TBBBenchmark.h"
#include "TBBDemoRoutines.h"

// constants for RGB to Y conversion
// Ey = 0.299*Er + 0.587*Eg + 0.114*Eb
const int SCALING_LOG = 15;
const int SCALING_FACTOR = (1 << SCALING_LOG);
const int Y_RED_SCALE = (int)(0.299 * SCALING_FACTOR);
const int Y_GREEN_SCALE = (int)(0.587 * SCALING_FACTOR);
const int Y_BLUE_SCALE = (int)(0.114 * SCALING_FACTOR);

void inspect_accelerators()
{
    auto accelerators = accelerator::get_all();
    for_each(begin(accelerators), end(accelerators),[=](accelerator acc){ 
        wcout << "New accelerator: " << acc.description << endl;
        wcout << "is_debug = " << acc.is_debug << endl;
        wcout << "is_emulated = " << acc.is_emulated <<endl;
        wcout << "dedicated_memory = " << acc.dedicated_memory << endl;
        wcout << "device_path = " << acc.device_path << endl;
        wcout << "has_display = " << acc.has_display << endl;                
        wcout << "version = " << (acc.version >> 16) << '.' << (acc.version & 0xFFFF) << endl;
    });
}

// ProcessRGBAMP
// C++ AMP GPGPU version of Serial code
// both input and output image are unidimensional arrays of ImageWidth * ImageHeight pixels
// assumes that PixelOffset is 4 (RGBA image) and that ImageWidth * ImageHeight * PixelOffset is a multiple of 16

void ProcessRGBAMP(void *SourceImage, void *YImage, int ImageWidth, int ImageHeight, int PixelOffset)
{
    array_view<const unsigned int, 1> SourceImageView(ImageWidth * ImageHeight, (unsigned int *)SourceImage);
    array_view<unsigned int, 1> YImageView((ImageWidth * ImageHeight) >> 2, (unsigned int *)YImage);
    YImageView.discard_data();

    parallel_for_each(YImageView.extent, [=] (index<1> idx) restrict(amp) {
        unsigned int YValue = 0;
        for (int index = 0; index < 4; index++)
        {
            unsigned int RGBPixel = SourceImageView[idx * 4 + index];
            unsigned int PartialYValue = ((RGBPixel & 0xFF) * Y_RED_SCALE) + (((RGBPixel >> 8) & 0xFF) * Y_GREEN_SCALE) + (((RGBPixel >> 16) & 0xFF) * Y_BLUE_SCALE);
            PartialYValue += 1 << (SCALING_LOG - 1);
            PartialYValue >>= SCALING_LOG;
            //if (PartialYValue > 255)
            //    PartialYValue = 255;
            PartialYValue = concurrency::direct3d::clamp(PartialYValue, 0, 255);
            YValue |= PartialYValue << (index * 8);
        }
        YImageView[idx] = YValue;
    });

    YImageView.synchronize();
}

TBBDemo.cpp

// Intel TBB Demo
// by Stefano Tommesani (www.tommesani.com) 2013
// this code is release under the Code Project Open License (CPOL) http://www.codeproject.com/info/cpol10.aspx
// The main points subject to the terms of the License are:
// -   Source Code and Executable Files can be used in commercial applications;
// -   Source Code and Executable Files can be redistributed; and
// -   Source Code can be modified to create derivative works.
// -   No claim of suitability, guarantee, or any warranty whatsoever is provided. The software is provided "as-is".
// -   The Article(s) accompanying the Work may not be distributed or republished without the Author's consent

// TBBDemo.cpp : definisce il punto di ingresso dell'applicazione console.
//

#include "stdafx.h"

#include <Windows.h>
#include <iostream>
#include <string>
#include <vector>
using namespace std;
#include "TBBBenchmark.h"
#include "TBBDemoRoutines.h"
#include "TBBDemoAMP.h"

typedef void (*TBBDemoFunction)(void *, void *, int, int, int);
typedef std::pair<TBBDemoFunction, std::string> TBBDemoImplementation;

int _tmain(int argc, _TCHAR* argv[])
{
    const int DEFAULT_IMAGE_WIDTH = 8 * 1024;
    const int DEFAULT_IMAGE_HEIGHT = 8 * 1024;
    const int DEFAULT_IMAGE_SIZE = (DEFAULT_IMAGE_WIDTH * DEFAULT_IMAGE_HEIGHT);
    const int RGBA_PIXEL_SIZE = 4;

    cout << "Intel TBB Demo by Stefano Tommesani (www.tommesani.com)" << endl;

    // create input image
    unsigned char *RGBAImage = new unsigned char[DEFAULT_IMAGE_SIZE * RGBA_PIXEL_SIZE];
    // fill RGBA image with random data
    srand(0x5555);
    for (int j = 0; j < (DEFAULT_IMAGE_SIZE * RGBA_PIXEL_SIZE); j++)
        RGBAImage[j] = rand() % 256;
    // create output image
    unsigned char *GrayImage = new unsigned char[DEFAULT_IMAGE_SIZE];

    // checks for correctness have been moved to unit testing project

    // check performance
    std::vector<TBBDemoImplementation> Implementations;
    Implementations.push_back(TBBDemoImplementation(&ProcessRGBSerial, "Serial"));
    Implementations.push_back(TBBDemoImplementation(&ProcessRGBTBB1, "TBB1"));
    Implementations.push_back(TBBDemoImplementation(&ProcessRGBTBB2, "TBB2"));
    Implementations.push_back(TBBDemoImplementation(&ProcessRGBTBB3, "TBB3"));
    Implementations.push_back(TBBDemoImplementation(&ProcessRGBSIMD, "SIMD1"));
    Implementations.push_back(TBBDemoImplementation(&ProcessRGBSIMD2, "SIMD2"));
    Implementations.push_back(TBBDemoImplementation(&ProcessRGBSIMD3, "SIMD3"));
    Implementations.push_back(TBBDemoImplementation(&ProcessRGBTBBSIMD, "TBB SIMD"));
    Implementations.push_back(TBBDemoImplementation(&ProcessRGBAMP, "AMP GPGPU"));

    const int PERFORMANCE_LOOPS = 5;  //< multiple loops to minimize benchmarking errors
    for (int i = 1; i <= PERFORMANCE_LOOPS; i++)
    {
        cout << "---- Iteration " << i << " ---- " << endl;
        for (auto ImplementationsPtr = Implementations.begin(); ImplementationsPtr != Implementations.end(); ImplementationsPtr++)
        {
            StartBenchmark();
            ImplementationsPtr->first(RGBAImage, GrayImage, DEFAULT_IMAGE_WIDTH, DEFAULT_IMAGE_HEIGHT, RGBA_PIXEL_SIZE);
            StopBenchmark(ImplementationsPtr->second);
        }
    }

    Implementations.clear();
    // free images
    delete[] RGBAImage;
    delete[] GrayImage;
    // wait for user to enter a key before closing
    char WaitForUser;
    cin >> WaitForUser;
    return 0;
}

 

TBBBenchmark.h

// Intel TBB Demo
// by Stefano Tommesani (www.tommesani.com) 2013
// this code is release under the Code Project Open License (CPOL) http://www.codeproject.com/info/cpol10.aspx
// The main points subject to the terms of the License are:
// -   Source Code and Executable Files can be used in commercial applications;
// -   Source Code and Executable Files can be redistributed; and
// -   Source Code can be modified to create derivative works.
// -   No claim of suitability, guarantee, or any warranty whatsoever is provided. The software is provided "as-is".
// -   The Article(s) accompanying the Work may not be distributed or republished without the Author's consent

#include <string>
#include <iostream>

void StartBenchmark();
void StopBenchmark(std::string Description);

 

TBBBenchmark.cpp

// Intel TBB Demo
// by Stefano Tommesani (www.tommesani.com) 2013
// this code is release under the Code Project Open License (CPOL) http://www.codeproject.com/info/cpol10.aspx
// The main points subject to the terms of the License are:
// -   Source Code and Executable Files can be used in commercial applications;
// -   Source Code and Executable Files can be redistributed; and
// -   Source Code can be modified to create derivative works.
// -   No claim of suitability, guarantee, or any warranty whatsoever is provided. The software is provided "as-is".
// -   The Article(s) accompanying the Work may not be distributed or republished without the Author's consent

#include "TBBBenchmark.h"
#include <Windows.h>

// benchmark data
LARGE_INTEGER StartTime, StopTime;

void StartBenchmark()
{
    QueryPerformanceCounter(&StartTime);
}

void StopBenchmark(std::string Description)
{
    QueryPerformanceCounter(&StopTime);
    std::cout << " Elapsed time of " << Description;
    std::cout.width(max(0, 20 - Description.length()));
    std::cout << (int)(StopTime.QuadPart - StartTime.QuadPart) << std::endl;
}
Last Updated on Wednesday, 01 May 2013 14:20  
View Stefano Tommesani's profile on LinkedIn

Latest Articles

Unit-testing file I/O 26 November 2017, 12.09 Testing
Unit-testing file I/O
Two good news: file I/O is unit-testable, and it is surprisingly easy to do. Let's see how it works! A software no-one asked for First, we need a piece of software that deals with files and that has to be unit-tested. The
Fixing Git pull errors in SourceTree 10 April 2017, 01.44 Software
Fixing Git pull errors in SourceTree
If you encounter the following error when pulling a repository in SourceTree: VirtualAlloc pointer is null, Win32 error 487 it is due to to the Cygwin system failing to allocate a 5 MB large chunk of memory for its heap at
Castle on the hill of crappy audio quality 19 March 2017, 01.53 Audio
Castle on the hill of crappy audio quality
As the yearly dynamic range day is close (March 31st), let's have a look at one of the biggest audio massacres of the year, Ed Sheeran's "Castle on the hill". First time I heard the song, I thought my headphones just got
Necessary evil: testing private methods 29 January 2017, 21.41 Testing
Necessary evil: testing private methods
Some might say that testing private methods should be avoided because it means not testing the contract, that is the interface implemented by the class, but the internal implementation of the class itself. Still, not all
I am right and you are wrong 28 December 2016, 14.23 Web
I am right and you are wrong
Have you ever convinced anyone that disagreed with you about a deeply held belief? Better yet, have you changed your mind lately on an important topic after discussing with someone else that did not share your point of

Translate