Background subtraction: ViBe

The ViBe algorithm

In the paper “ViBe: A universal background subtraction algorithm for video sequences”, Olivier Barnich and Marc Van Droogenbroeck introduce several innovative mechanisms for motion detection:

Pixel model and classification process

Let’s denote by v(x) the value in a given Euclidean color space taken by the pixel located at x in the image, and by vi a background sample value with an index i. Each background pixel x is modeled by a collection of N background sample values
M(x) = {v1, v2, . . . , vN}
taken in previous frames. To classify a pixel value v(x) according to its corresponding model M(x), we compare it to the closest values within the set of samples by defining a sphere SR(v(x)) of radius R centered on v(x). The pixel value v(x) is then classified as background if the cardinality of the intersection of this sphere with the collection of model samples M(x) is larger than or equal to a given threshold. The classification of a pixel value v(x) thus involves the computation of up to N Euclidean distances between v(x) and the model samples, and up to N comparisons of those distances against the threshold R.
The accuracy of the ViBe model is determined by two parameters only: the radius R of the sphere and the minimal cardinality. Experiments have shown that a unique radius R of 20 (for monochromatic images) and a cardinality of 2 are appropriate. There is no need to adapt these parameters during the background subtraction nor to change them for different pixel locations. 
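The classification step described above can be sketched as follows; this is an illustrative reimplementation for the monochrome case (the function name, signature and early exit are my own, not the ViBe SDK API):

```cpp
#include <cstdlib>
#include <vector>

// Illustrative sketch of ViBe's per-pixel classification (monochrome case).
// A pixel value is classified as background if at least `minCardinality`
// of its N model samples lie within a radius R of the current value.
bool isBackground(unsigned char value,
                  const std::vector<unsigned char>& samples, // model M(x)
                  int R = 20,             // matching radius for monochromatic images
                  int minCardinality = 2) // minimal cardinality
{
    int matches = 0;
    for (unsigned char s : samples) {
        if (std::abs(static_cast<int>(value) - static_cast<int>(s)) < R) {
            if (++matches >= minCardinality)
                return true; // early exit: no need to scan the remaining samples
        }
    }
    return false; // fewer than minCardinality samples matched: foreground
}
```

Note that the comparison can stop as soon as the minimal cardinality is reached, which keeps the average cost well below N distance computations per pixel.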

Background model initialization from a single frame

Many popular techniques described in the literature need a sequence of several dozen frames to initialize their models; such an approach makes sense from a statistical point of view, as it seems necessary to gather a significant amount of data in order to estimate the temporal distribution of the background pixels. But many applications cannot afford this learning period, as it may be necessary to

  • segment the foreground of a sequence that is even shorter than the typical initialization sequence required by some background subtraction algorithms
  • provide an uninterrupted foreground detection, even in the presence of sudden light changes

A more convenient solution is to provide a technique that will initialize the background model from a single frame, so that the response to sudden illumination changes is straightforward: the existing background model is discarded and a new model is initialized instantaneously. What’s more, being able to provide a reliable foreground segmentation as early on as the second frame of a sequence has obvious benefits for short sequences in video-surveillance. Since there is no temporal information in a single frame, ViBe assumes that neighboring pixels share a similar temporal distribution, so it populates the pixel models with values found in the spatial neighborhood of each pixel. The size of the neighborhood needs to be chosen so that it is large enough to comprise a sufficient number of different samples, while keeping in mind that the statistical correlation between values at different locations decreases as the size of the neighborhood increases.
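The single-frame initialization can be sketched as follows; this is an illustrative reimplementation (names and signature are my own, not the ViBe SDK API), which draws the N samples of each pixel model at random from the 8-connected neighborhood of the pixel in the first frame:

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// Illustrative sketch of ViBe's single-frame model initialization.
// Each pixel model is populated with N values drawn at random from the
// pixel's spatial neighborhood, assuming neighboring pixels share a
// similar temporal distribution.
std::vector<unsigned char> initPixelModel(
        const std::vector<unsigned char>& frame, // row-major grayscale frame
        int width, int height, int x, int y, int N = 20)
{
    std::vector<unsigned char> model;
    model.reserve(N);
    while (static_cast<int>(model.size()) < N) {
        // pick a random offset in the 8-connected neighborhood
        // (or the pixel itself) and clamp it inside the image
        int nx = x + (std::rand() % 3) - 1;
        int ny = y + (std::rand() % 3) - 1;
        nx = std::max(0, std::min(width - 1, nx));
        ny = std::max(0, std::min(height - 1, ny));
        model.push_back(frame[ny * width + nx]);
    }
    return model;
}
```

A 3×3 neighborhood is used here for simplicity; as noted above, the neighborhood size trades off sample diversity against spatial correlation.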
The only drawback is that the presence of a moving object in the first frame will introduce an artifact called a ghost (that is, a set of connected points, detected as in motion but not corresponding to any real moving object). In this particular case, the ghost is caused by the unfortunate initialization of pixel models with samples coming from the moving object. In subsequent frames, the object moves and uncovers the real background, which will be learned progressively through the regular model update process, making the ghost fade over time. ViBe’s update process ensures both a fast model recovery in the presence of a ghost and a slow incorporation of real moving objects into the background model. 

Updating the background model over time

The classification step of ViBe compares the current pixel value vt(x) directly to the samples contained in the background model of the previous frame, Mt−1(x) at time t − 1. But which samples have to be memorized by the model and for how long? The classical approach to the updating of the background history is to discard and replace old values after a number of frames or after a given period of time.
The question of whether or not to include foreground pixel values in the model is always raised for a sample-based background subtraction method: some update mechanism is needed, otherwise the model will not adapt to changing conditions. It comes down to a choice between a conservative and a blind update scheme:

  • a conservative update policy never includes a sample belonging to a foreground region in the background model; unfortunately, it can lead to deadlock situations and everlasting ghosts: a background sample incorrectly classified as foreground prevents its pixel model from ever being updated, which can cause a permanent misclassification
  • a blind update policy is not sensitive to deadlocks: samples are added to the background model whether they have been classified as background or not; its principal drawback is a poor detection of slow-moving targets, which are progressively included in the background model.

The ViBe update method incorporates three important components:

  • a memoryless update policy, which ensures a smooth decaying lifespan for the samples stored in the background pixel models: instead of systematically removing the oldest sample from the pixel model, ViBe chooses the sample to be discarded randomly according to a uniform probability density function.
  • random time subsampling: the random replacement policy allows the pixel model to cover a large (theoretically infinite) time window with a limited number of samples; but in the presence of periodic or pseudo-periodic background motions, the use of fixed subsampling intervals might prevent the background model from properly adapting to these motions. So when a pixel value has been classified as belonging to the background, a random process determines whether this value is used to update the corresponding pixel model.
  • a mechanism that propagates background pixel samples spatially to ensure spatial consistency and to allow the adaptation of the background pixel models that are masked by the foreground: ViBe considers that neighboring background pixels share a similar temporal distribution and that a new background sample of a pixel should also update the models of neighboring pixels. According to this policy, background models hidden by the foreground will be updated with background samples from neighboring pixel locations from time to time. This allows a spatial diffusion of information regarding the background evolution that relies on samples classified exclusively as background. ViBe’s background model is thus able to adapt to a changing illumination and to structural evolutions (added or removed background objects) while relying on a strict conservative update scheme.
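The three update mechanisms above can be combined in a short sketch; this is an illustrative reimplementation (names, types and structure are my own, not the ViBe SDK API), using a subsampling factor phi so that an update happens with probability 1/phi:

```cpp
#include <cstdlib>
#include <vector>

// One pixel model: a bag of N background samples.
struct PixelModel { std::vector<unsigned char> samples; };

// Illustrative sketch of ViBe's update step for a pixel just classified
// as background. It combines random time subsampling (factor phi),
// memoryless replacement of a random sample, and spatial propagation
// into the model of a random 8-connected neighbor.
void updateBackground(std::vector<PixelModel>& models, // row-major grid
                      int width, int height, int x, int y,
                      unsigned char value, int phi = 16)
{
    // random time subsampling: only 1 background value out of phi updates the model
    if (std::rand() % phi == 0) {
        // memoryless update: overwrite a sample chosen uniformly at random,
        // instead of systematically discarding the oldest one
        PixelModel& m = models[y * width + x];
        m.samples[std::rand() % m.samples.size()] = value;
    }
    // spatial propagation, also subsampled: insert the value into the
    // model of a random 8-connected neighbor
    if (std::rand() % phi == 0) {
        int nx = x + (std::rand() % 3) - 1;
        int ny = y + (std::rand() % 3) - 1;
        if (nx >= 0 && nx < width && ny >= 0 && ny < height) {
            PixelModel& n = models[ny * width + nx];
            n.samples[std::rand() % n.samples.size()] = value;
        }
    }
}
```

Because only values classified as background ever reach this function, the scheme remains strictly conservative while still letting background information diffuse under foreground masks.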

It is worth analysing the strengths of the ViBe algorithm:


  • Faster ghost suppression: ViBe’s spatial update mechanism speeds up the inclusion of ghosts in the background model so that the process is faster than the inclusion of real static foreground objects, because the borders of the foreground objects often exhibit colors that differ noticeably from those of the samples stored in the surrounding background pixel models. When a foreground object stops moving, the information propagation technique updates the pixel models located at its borders with samples coming from surrounding background pixels. But these samples are irrelevant: their colors do not match at all those of the borders of the object, so in subsequent frames, the object remains in the foreground, since background samples cannot diffuse inside the foreground object via its borders. By contrast, a ghost area often shares similar colors with the surrounding background, so when background samples from the area surrounding the ghost try to diffuse inside the ghost, they are likely to match the actual color of the image at the locations where they are diffused; as a result, the ghost is progressively eroded until it disappears entirely.
  • Resistance to camera displacements: small displacements are typically due to vibrations or wind and, with many background segmentation techniques, they cause significant numbers of false foreground detections. The spatial consistency of ViBe’s background model brings increased robustness against such small camera movements: since samples are shared between neighboring pixel models, small displacements of the camera introduce very few erroneous foreground detections. ViBe also has the capability of dealing with large displacements of the camera, at the price of a modification of the base algorithm. Since ViBe’s model is purely pixel-based, it can handle moving cameras by allowing pixel models to follow the corresponding physical pixels according to the movements of the camera (the movements of the camera can be estimated either using embedded motion sensors or directly from the video stream using dense optical flow).
  • Resilience to noise: two factors must be credited for ViBe’s resilience to noise:
    • the pixel models of ViBe comprise exclusively observed pixel values, so the pixel models adapt to noise automatically, as they are constructed from noisy pixel values
    • the pure conservative update scheme: by relying on pixel values classified exclusively as background, this model update policy prevents the inclusion of any outlier in the pixel models.
  • Downscaled version and embedded implementation: since ViBe has a low computational cost and relies exclusively on integer computations, it is particularly well suited to an embedded implementation.


Performance analysis

After this long introduction to the ViBe algorithm, it is time to see how it behaves on the test sequence used in this series of articles about background segmentation.

The first test adds a slight blur filter as pre-processing and a 5×5 median filter as post-processing to remove noise from the resulting mask, as with other non-parametric algorithms:


This is the code fragment that receives an OpenCV image, applies the filters and returns the processing results inside an OpenCV mask:

void ViBeBGS::process(const cv::Mat &img_input, cv::Mat &img_output)
{
    if (img_input.empty())
        return;

    if (img_input.channels() == 3)
        cv::cvtColor(img_input, gray_input_image, CV_BGR2GRAY);
    else if (img_input.channels() == 1)
        img_input.copyTo(gray_input_image);

    cv::Mat filtered_input_image;
    cv::GaussianBlur(gray_input_image, filtered_input_image, cv::Size(5,5), 1.5);

    detector->Update(filtered_input_image.cols, filtered_input_image.rows,
                     filtered_input_image.step, ViBe::PixelFormat_Gray8,
                     filtered_input_image.data); // buffer argument restored by assumption; check the SDK signature

    cv::Mat mask(detector->GetComputedMaskHeight(), detector->GetComputedMaskWidth(), CV_8UC1,
                 const_cast<void*>(detector->GetComputedMaskBuffer()), detector->GetComputedMaskStride());
    cv::medianBlur(mask, img_output, 5);
}

The ViBe SDK is extremely easy to use: after creating an instance of the ViBe::ViBeDetector class, the user passes a grayscale image to the Update() method and gets the resulting mask by calling the GetComputedMaskBuffer() method. Moving buffers in and out of OpenCV is effortless, as is adding pre- and post-processing filters based on OpenCV image primitives.

ViBe's authors claim that it has a low computational cost and that it is suitable for embedded implementations. Benchmarking the code above results in a median elapsed time of 2357 microseconds per frame (2735 microseconds per frame on average), way below the computational load of other algorithms offering the same level of performance.
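The per-frame timings reported in this article can be reproduced with a simple harness; the following is a minimal sketch (the FrameTimer class is my own, not part of any library), which collects elapsed microseconds per frame and reports the median and mean:

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Minimal per-frame timing harness: collects elapsed microseconds for
// each processed frame and reports the median and the mean.
struct FrameTimer {
    std::vector<long long> samples_us;

    template <typename F>
    void time(F&& processFrame) {
        auto t0 = std::chrono::steady_clock::now();
        processFrame(); // e.g. a call to ViBeBGS::process on one frame
        auto t1 = std::chrono::steady_clock::now();
        samples_us.push_back(
            std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
    }

    long long median() const {
        std::vector<long long> s = samples_us;
        std::sort(s.begin(), s.end());
        return s[s.size() / 2];
    }

    long long mean() const {
        long long sum = 0;
        for (long long v : samples_us) sum += v;
        return sum / static_cast<long long>(samples_us.size());
    }
};
```

Reporting the median alongside the mean is deliberate: the median is robust to occasional slow frames caused by the operating system scheduler.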

Still, we can remove both the pre- and post-filtering to check whether the claims about resilience against noise hold true, and how much of the computational load is due to the ViBe algorithm alone:


This is the modified code, calling just the ViBe API:

void ViBeBGS::process(const cv::Mat &img_input, cv::Mat &img_output)
{
    if (img_input.empty())
        return;

    if (img_input.channels() == 3)
        cv::cvtColor(img_input, gray_input_image, CV_BGR2GRAY);
    else if (img_input.channels() == 1)
        img_input.copyTo(gray_input_image);

    detector->Update(gray_input_image.cols, gray_input_image.rows,
                     gray_input_image.step, ViBe::PixelFormat_Gray8,
                     gray_input_image.data); // buffer argument restored by assumption; check the SDK signature

    img_output = cv::Mat(detector->GetComputedMaskHeight(), detector->GetComputedMaskWidth(), CV_8UC1,
                         const_cast<void*>(detector->GetComputedMaskBuffer()), detector->GetComputedMaskStride());
}

Without pre- and post-processing, the computational load drops to 1506 microseconds per frame (2010 microseconds on average), slower only than basic methods such as Frame Difference and Medians.


Still, a median filter is really a rough post-processing step, and some blob-oriented processing goes a long way toward reducing false alarms and properly outlining moving objects. The following code fragment adds two more steps of post-processing:

  1. blobs are extracted from the filtered mask; blobs with an area smaller than 20 pixels are dropped, as they are probably due to noise or to misclassified parts of bigger moving objects, and the remaining blobs are redrawn as filled areas to eliminate internal holes
  2. a morphological closing with a 7×7 structuring element is applied to further reduce partially closed holes in the blobs
void ViBeBGS::process(const cv::Mat &img_input, cv::Mat &img_output, cv::Mat &img_bgmodel)
{
    if (img_input.empty())
        return;

    if (img_input.channels() == 3)
        cv::cvtColor(img_input, gray_input_image, CV_BGR2GRAY);
    else if (img_input.channels() == 1)
        img_input.copyTo(gray_input_image);

    cv::Mat filtered_input_image;
    cv::GaussianBlur(gray_input_image, filtered_input_image, cv::Size(5,5), 1.5);

    detector->Update(filtered_input_image.cols, filtered_input_image.rows,
                     filtered_input_image.step, ViBe::PixelFormat_Gray8,
                     filtered_input_image.data); // buffer argument restored by assumption; check the SDK signature

    cv::Mat mask(detector->GetComputedMaskHeight(), detector->GetComputedMaskWidth(), CV_8UC1,
                 const_cast<void*>(detector->GetComputedMaskBuffer()), detector->GetComputedMaskStride());

    // median filtering
    cv::medianBlur(mask, mask, 5);

    // find blobs
    std::vector<std::vector<cv::Point> > v;
    std::vector<cv::Vec4i> hierarchy;
    cv::findContours(mask, v, hierarchy, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    mask = cv::Scalar(0, 0, 0);
    for (size_t i = 0; i < v.size(); ++i)
    {
        // drop smaller blobs
        if (cv::contourArea(v[i]) < 20)
            continue;
        // draw filled blob
        cv::drawContours(mask, v, static_cast<int>(i), cv::Scalar(255, 0, 0),
                         CV_FILLED, 8, hierarchy, 0, cv::Point());
    }

    // morphological closure
    cv::Mat element = cv::getStructuringElement(cv::MORPH_RECT, cv::Size(7, 7));
    cv::morphologyEx(mask, mask, cv::MORPH_CLOSE, element);

    mask.copyTo(img_output);
}

The result of this enhanced post-processing phase is highlighted by the following video: blobs, mostly without internal holes, are more uniform and follow closely the outline of moving objects, and there are fewer false alarms due to waving trees.


To further test the base ViBe algorithm and the additional filtering, I used a subset of video sequences from the website (for a description of the various categories of videos, please see the table below in the ViBe+ description). The following table shows the output of the bare ViBe algorithm in the left column and of the post-processed version in the right column. Since ViBe is a non-parametric algorithm, no per-sequence optimization was applied, and all results were achieved with the standard settings. Various parameters could be tweaked in the post-processing phase to reduce errors, e.g. the size of the morphology element or the minimum size of blobs, but in order to provide a baseline performance evaluation the post-processing code was kept exactly as shown above.


Highway

category Baseline

{mp4}ViBeHighway1|320|240{/mp4} {mp4}ViBeHighway3|320|240{/mp4}


Pedestrians

category Baseline

{mp4}ViBePedestrians1|360|240{/mp4} {mp4}ViBePedestrians3|360|240{/mp4}


Office

category Baseline

{mp4}ViBeOffice1|360|240{/mp4} {mp4}ViBeOffice3|360|240{/mp4}

Canoe

category Dynamic Background

Please refer to the description of ViBe+ below to see how it improves the handling of blinking pixels.

{mp4}ViBeCanoe1|320|240{/mp4} {mp4}ViBeCanoe3|320|240{/mp4}

Fountain 1

category Dynamic Background

{mp4}ViBeFountain011|432|288{/mp4} {mp4}ViBeFountain013|432|288{/mp4}

Fountain 2

category Dynamic Background

Post-processing filters are effective at removing the impulse noise in the output mask caused by the spilling water.

{mp4}ViBeFountain021|432|288{/mp4} {mp4}ViBeFountain023|432|288{/mp4}


Parking

category Intermittent Object Motion

In this sequence the moving objects are quite small, and the default post-processing parameters hide the motion of walking people, as the minimum blob size is too high.

{mp4}ViBeParking1|320|240{/mp4} {mp4}ViBeParking3|320|240{/mp4}


Street light

category Intermittent Object Motion

{mp4}ViBeStreetlight1|320|240{/mp4} {mp4}ViBeStreetlight3|320|240{/mp4}


Sofa

category Intermittent Object Motion

A clear example of how ViBe avoids inserting foreground objects into the background model, so it can quickly recover when these objects are taken away.

{mp4}ViBeSofa1|320|240{/mp4} {mp4}ViBeSofa3|320|240{/mp4}


Backdoor

category Shadow

{mp4}ViBeBackdoor1|320|240{/mp4} {mp4}ViBeBackdoor3|320|240{/mp4}

Bus station

category Shadow

{mp4}ViBeBusstation1|360|240{/mp4} {mp4}ViBeBusstation3|360|240{/mp4}

People in shade

category Shadow

{mp4}ViBePeopleinshade1|380|244{/mp4} {mp4}ViBePeopleinshade3|380|244{/mp4}


How ViBe+ improves on ViBe

ViBe+, described in the paper “Background Subtraction: Experiments and Improvements for ViBe” by M. Van Droogenbroeck and O. Paquot, improves on several aspects of the ViBe algorithm:

  • Distinction between the segmentation mask and the updating mask: the purpose of a background subtraction technique is to produce a binary mask with background and foreground pixels. In a conservative approach, the segmentation mask is used to determine which values are allowed to enter the background model, that is, the segmentation mask plays the role of an updating mask. But this is not a requirement, so ViBe+ processes the segmentation mask and the updating mask differently. The only constraint is that foreground pixels should never be used to update the model.
  • Filtering connected components: to fill holes inside blobs, several area openings are applied on both the segmentation and updating masks:
    • Segmentation mask: remove foreground blobs whose area is smaller than or equal to 10 pixels and fill holes in the foreground whose area is smaller than or equal to 20. Blobs that touch the border are kept regardless of their size.
    • Updating mask: fill holes in the foreground whose area is smaller than or equal to 50. This operation is applied to limit the appearance of erroneous background seeds inside foreground objects.


  • Inhibition of propagation: a new mechanism to inhibit spatial propagation is introduced. The spatial propagation consists in inserting a background value into the model of an 8-connected neighboring pixel chosen at random; this mechanism diffuses values in the background and contributes to suppressing ghosts and static objects over time. However, it is not always desirable to suppress static objects; this might better be decided at the blob level, depending on the application. So the gradient on the inner border of background blobs is computed, and the propagation is inhibited when the gradient (rescaled to the [0, 255] interval) is larger than 50, preventing background values from crossing object borders.
  • Adapted distance measure and thresholding: the distance metric in ViBe+ is inspired by the one described in “Real-time foreground-background segmentation using codebook model” by K. Kim, T. Chalidabhongse, D. Harwood, and L. Davis. The distance for this codebook-based background technique compares the intensities and computes a color distortion; the color distortion in ViBe+ is exactly the colordist() defined by equation (2) in the mentioned paper. This color distortion measure can be interpreted as a brightness-weighted version in the normalized color space. In ViBe+, a required condition for two values to match is that the color distortion is lower than 20. In addition, there is a second condition on the intensity values. Originally, ViBe considers that two intensities are close if their difference is lower than 20. In “Evaluation of background subtraction techniques for video surveillance”, Brutzer et al. suggest that ViBe use a threshold related to the samples in the model for a better handling of camouflaged foreground. Therefore, ViBe+ computes the standard deviation m of the samples of a model and defines the matching threshold as 0.5 × m, bounded to the [20, 40] interval.
  • A heuristic to detect blinking pixels: one of the major difficulties related to the use of sample-based models is the handling of multimodal background distributions, because there is no explicit mechanism to adapt to them. However, ViBe+ detects whether a pixel often switches between the background and the foreground (such a pixel is then called a blinking pixel). For each pixel, ViBe+ stores the previous updating mask (prior to any modification) and a map with the blinking level. This level is determined as follows: if a pixel belongs to the inner border of the background and the current updating label is different from the previous updating label, then the blinking level is increased by 15 (the blinking level being kept within the [0, 150] interval); otherwise, the level is decreased by 1. A pixel is considered as blinking if its level is larger than or equal to 30, and if so, the pixel is removed from the updating mask. In other words, the blinking level can only increase at the frontier of the background mask, but all blinking pixels from the updating mask get suppressed. Note that the detection of blinking pixels is deactivated when the camera is shaking.
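The blinking-pixel heuristic described above can be sketched per pixel as follows; this is an illustrative reimplementation of the published rules (function names and signatures are my own, not the actual ViBe+ code):

```cpp
#include <algorithm>

// Illustrative sketch of ViBe+'s blinking-pixel heuristic for one pixel.
// The blinking level rises by 15 when the updating label flips on the
// inner border of the background, decays by 1 otherwise, and is kept
// within [0, 150].
int updateBlinkingLevel(int level, bool onInnerBorderOfBackground,
                        bool currentUpdatingLabel, bool previousUpdatingLabel)
{
    if (onInnerBorderOfBackground && currentUpdatingLabel != previousUpdatingLabel)
        level += 15;
    else
        level -= 1;
    return std::min(150, std::max(0, level)); // clamp to [0, 150]
}

// A pixel whose level reaches 30 is considered blinking and is removed
// from the updating mask (but not from the segmentation mask).
bool isBlinking(int level) { return level >= 30; }
```

With these constants, a pixel needs at least two label flips on the background border to be flagged, and roughly 30 quiet frames to be fully forgiven, which matches the asymmetric increase/decrease rates described above.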


All these improvements from ViBe to ViBe+ lead to the following results on the public dataset provided on the website. The dataset contains 31 video sequences, grouped in 6 categories: baseline, dynamic background, camera jitter, intermittent object motion, shadow, and thermal.


Baseline This category contains four videos, two indoor and two outdoor. These videos represent a mixture of mild challenges typical of the next 4 categories. Some videos have subtle background motion, others have isolated shadows, some have an abandoned object and others have pedestrians that stop for a short while and then move away. These videos are fairly easy, but not trivial, to process, and are provided mainly as reference.
Dynamic Background There are six videos in this category depicting outdoor scenes with strong (parasitic) background motion. Two videos represent boats on shimmering water, two videos show cars passing next to a fountain, and the last two depict pedestrians, cars and trucks passing in front of a tree shaken by the wind.
Camera Jitter This category contains one indoor and three outdoor videos captured by unstable (e.g., vibrating) cameras. The jitter magnitude varies from one video to another.
Shadows This category consists of two indoor and four outdoor videos exhibiting strong as well as faint shadows. Some shadows are fairly narrow while others occupy most of the scene. Also, some shadows are cast by moving objects while others are cast by trees and buildings.
Intermittent Object Motion This category contains six videos with scenarios known for causing “ghosting” artifacts in the detected motion, i.e., objects move, then stop for a short while, after which they start moving again. Some videos include still objects that suddenly start moving, e.g., a parked vehicle driving away, and also abandoned objects. This category is intended for testing how various algorithms adapt to background changes.
Thermal In this category, five videos (three outdoor and two indoor) have been captured by far-infrared cameras. These videos contain typical thermal artifacts such as heat stamps (e.g., bright spots left on a seat after a person gets up and leaves), heat reflection on floors and windows, and camouflage effects, when a moving object has the same temperature as the surrounding regions.





I would like to thank Prof. M. Van Droogenbroeck for taking the time to explain the details of ViBe and ViBe+, and for letting me access and test the ViBe SDK. ViBe is patented technology and requires a license for commercial usage; for more information, please refer to the ViBe website. The description of the ViBe algorithm is extracted from the following paper: O. Barnich and M. Van Droogenbroeck. ViBe: A universal background subtraction algorithm for video sequences. IEEE Transactions on Image Processing, 20(6):1709-1724, June 2011. The description of the ViBe+ algorithm is extracted from the following paper: M. Van Droogenbroeck and O. Paquot. Background Subtraction: Experiments and Improvements for ViBe. Change Detection Workshop (CDW), Providence, Rhode Island, June 2012.
