Background Subtraction with OpenCV and BGS Libraries

Anastasia Murzova
January 25, 2021

Marking foreground entities plays an important role in the video pre-processing pipeline as the initial phase of computer vision (CV) applications. Examples of such applications include monitoring, tracking and recognition of objects: traffic analysis, people detection, animal tracking and others.

Looking into how these CV systems are built, we can observe that in most cases the initial steps include background subtraction (BS), which provides a relatively rough but fast identification of the objects in the video stream for further, more refined handling. In this post we cover several BS methods that are noteworthy in terms of accuracy and processing time: SuBSENSE and the LSBP-based GSoC method.

The blog post is divided into the following sub-topics:

  1. Basic concepts and approaches of Background Subtraction
  2. Descriptors and Types
  3. The SubSENSE Algorithm
    1. Implementation using BGS library
  4. GSoC Algorithm
    1. Implementation using OpenCV library
  5. Evaluation
  6. Evaluation Pipeline
  7. Results

Background Subtraction: Basic Concepts and Approaches

Background subtraction methods solve the task of foreground extraction by creating a background model. The full BS pipeline may contain the following phases:

  • background generation – processing N frames to provide the background image
  • background modeling – defining the model for background representation
  • background model update – introducing the model update algorithm for handling the changes, which occur over time
  • foreground detection – dividing pixels into sets of background or foreground.
Figure 1: BS basic pipeline

Background subtraction output consists of a binary mask, which separates frame pixels into two sets: foreground and background pixels.

It should be mentioned that BS approaches frequently focus on implementing advanced background models and robust feature representations.
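
To make the binary-mask idea concrete, here is a minimal sketch that produces such a mask with a standard OpenCV background subtractor. MOG2 is used here only as a stand-in for the methods covered later, and "input.mp4" is a placeholder path.

# minimal sketch: binary foreground mask with a stock OpenCV subtractor
# (MOG2 is only a stand-in; "input.mp4" is a placeholder path)
import cv2

cap = cv2.VideoCapture("input.mp4")
subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=False)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # 0 marks background pixels, 255 marks foreground pixels
    fg_mask = subtractor.apply(frame)
    cv2.imshow("Foreground mask", fg_mask)
    if cv2.waitKey(10) == 27:  # Esc to quit
        break

cap.release()
cv2.destroyAllWindows()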


Descriptors

Here we touch upon another important concept: descriptors (features). Descriptors characterize a captured image region in the current video frame so that it can be matched against the known background model. The goal of this comparison is to classify the region as background or foreground. It can be done, for example, with color, texture and edge descriptors.

Obviously, the design of a BS algorithm, including the combination of features, should rely on an initial analysis of the object domain. Possible challenging factors need to be considered: specific illumination, oscillations, movement of objects and others.

For instance, suppose most of the background area is static. Then it can be assumed that the color of those regions stays fixed, and hence we can identify the background by color. However, foreground objects and illumination variations can distort the colors.

Types of Descriptors

Let’s examine the types of features and the specific challenges for each of them. The pixel values of the frames are available during video processing, so the computation of pixel-domain descriptors is widespread in BS algorithms. Popular pixel-domain descriptors are:

  • color: descriptive object features. The components of the RGB color space are tightly coupled and react to illumination changes; there is no separation of brightness and chroma (as in YCrCb – see the sketch after this list). Color features are sensitive to illumination, camouflage and shadows, which can affect the appearance of moving objects. That is why they are usually combined with other features for more robustness.
  • edge: edge features are robust to light variations and good for detecting moving objects. However, they are sensitive to both highly and weakly textured objects.
  • texture: texture features provide spatial information. They are robust to illumination and shadows. For example, texture features are used in the Local Binary Pattern (LBP).
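
As a small illustration of the brightness/chroma separation mentioned in the color bullet, the sketch below converts a frame to YCrCb and simulates an illumination change; the file name and the scaling factor are arbitrary assumptions.

# sketch: a global illumination change mostly affects the Y (luma) channel
# of YCrCb, while Cr/Cb stay comparatively stable ("frame.png" is a placeholder)
import cv2
import numpy as np

frame = cv2.imread("frame.png")
ycrcb = cv2.cvtColor(frame, cv2.COLOR_BGR2YCrCb)

# simulate a global illumination change by brightening the frame
brighter = cv2.convertScaleAbs(frame, alpha=1.3, beta=0)
ycrcb_bright = cv2.cvtColor(brighter, cv2.COLOR_BGR2YCrCb)

# mean absolute change per channel: typically largest for Y
diff = np.abs(ycrcb.astype(int) - ycrcb_bright.astype(int)).mean(axis=(0, 1))
print("mean |dY|, |dCr|, |dCb|:", diff)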

Texture Features

In this subsection we briefly review texture descriptors and their evolution.

  1. Local Binary Pattern (LBP). LBP was introduced in 2005 as “a gray-scale invariant texture primitive statistic” for texture description and became the starting point for the further development of texture descriptors. The LBP operator produces a binary pattern (number) labeling the frame pixels of the specified area by thresholding each neighboring pixel value against the value of the center pixel (see the sketch after this list).
Figure 2. LBP scheme
    There are eight neighboring pixels in the basic LBP, but the number can be extended. The drawback of LBP is the absence of intensity tracking: after a scene change the center pixel intensity may remain greater (or smaller) than the neighboring values, so the change can fail to be detected.
  2. Local Binary Similarity Patterns (LBSP). The LBSP method was introduced in 2013 to address this issue by using absolute-difference thresholding when comparing the center and neighboring pixels. However, LBSP is not spatiotemporal: the feature and intensity information are not updated simultaneously.
Figure 3. LBP and LBSP schemes
  3. Self-balanced sensitivity segmenter (SuBSENSE). SuBSENSE was introduced in 2014. It uses an improved spatiotemporal LBSP in combination with color features.
  4. Background Subtraction using Local SVD Binary Pattern (LSBP). The Local SVD Binary Pattern feature descriptor is robust to illumination variations, shadows and noise. The singular value decomposition (SVD) coefficients used in LSBP capture illumination-invariant local structure.
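
Below is a minimal NumPy sketch of the basic 8-neighbour LBP operator from item 1: each neighbour is thresholded against the center pixel and the resulting bits are packed into an 8-bit code. It is only an illustration, not the implementation used by the libraries discussed later.

# basic 8-neighbour LBP sketch: threshold neighbours against the centre pixel
# and pack the bits into one 8-bit code per pixel (borders wrap around here,
# which is acceptable for illustration purposes)
import numpy as np

def lbp_8(gray):
    g = gray.astype(np.int32)
    codes = np.zeros_like(g)
    # offsets of the 8 neighbours, clockwise from the top-left
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = np.roll(np.roll(g, -dy, axis=0), -dx, axis=1)
        codes |= (neighbour >= g).astype(np.int32) << bit
    return codes.astype(np.uint8)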

In the following chapters we will explore SuBSENSE and GSoC methods in more detail.

SuBSENSE Algorithm

Overview

The below scheme presents the SuBSENSE functioning mechanism:

Figure 4. SuBSENSE pipeline

Suppose there is a video sequence as input. Then I_{t}(x) is the result of the spatial analysis of the t-th frame (at time t), where x is a pixel index. The background model block is a non-parametric statistical model. It produces the background at pixel locations, denoted by B(x), on the basis of N past representations (samples), where N = 50. S_{t}(x) is the segmentation output. It has the following values:

  • S_{t}(x) = 0: the pixel is marked as background if at least 2 samples match the representation of x in the t-th frame (I_{t}(x))
  • S_{t}(x) = 1: the pixel is marked as foreground in the opposite case.

SuBSENSE treats background subtraction as a classification task, where a pixel value is analyzed with respect to its background samples in the feature space. Hence, B(x)={B_{1}(x), B_{2}(x), ..., B_{N}(x)} models pixel x with its N samples. These samples are chosen randomly at background model initialization time. The core of the SuBSENSE analysis is color comparison and LBSP descriptors computed on the color channels. Thus B_{n}(x), where n \in [1, N], includes the following: B_{n}(x)={R_{n}(x), G_{n}(x), B_{n}(x), LBSP_{n}^{R}(x), LBSP_{n}^{G}(x), LBSP_{n}^{B}(x)}.

B(x) and I_{t}(x) are matched through the color values and LBSP-descriptors.

Colors are compared using the L1 distance, whereas descriptors are compared using the Hamming distance. The resulting mask is binary and can be described as:

S_t(x) = \begin{cases} 1, & \mbox{if } \#\{n : dist(I_{t}(x), B_{n}(x)) < D_{max}\} < 2 \\ 0, & \mbox{otherwise } \end{cases}

D_{max} is the maximum-distance threshold. Its value is assigned dynamically in accordance with the model fidelity and the segmentation noise.
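
A simplified per-pixel sketch of this matching rule is given below: a pixel is classified as background if at least 2 of its N background samples match it both in color (L1 distance) and in LBSP descriptor (Hamming distance). The thresholds and the data layout are illustrative assumptions, not the actual SuBSENSE values, which are adapted dynamically.

# simplified sketch of the SuBSENSE-style matching rule (illustrative
# thresholds; the real algorithm adapts them per pixel over time)
import numpy as np

def is_background(pixel_color, pixel_lbsp, samples,
                  color_thr=30, desc_thr=4, min_matches=2):
    # pixel_color: (3,) colour values of I_t(x)
    # pixel_lbsp:  (3,) integer LBSP codes, one per colour channel
    # samples:     list of N tuples (sample_color, sample_lbsp) from B(x)
    matches = 0
    for s_color, s_lbsp in samples:
        # L1 distance on the colour values
        l1 = int(np.abs(np.asarray(pixel_color, int) - np.asarray(s_color, int)).sum())
        # Hamming distance on the binary LBSP descriptors
        hamming = sum(bin(int(a) ^ int(b)).count("1")
                      for a, b in zip(pixel_lbsp, s_lbsp))
        if l1 < color_thr and hamming < desc_thr:
            matches += 1
            if matches >= min_matches:
                return True   # S_t(x) = 0, background
    return False              # S_t(x) = 1, foreground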

Implementation Using BGSLibrary

In this subsection we will experiment with background subtraction using the BGS library API. It's worth noting that the BGS framework was developed as a specialized OpenCV-based C++ project for video foreground-background separation. The BGS library also has wrappers for Python, Java and MATLAB, and it contains a wide range of background subtraction methods, as can be seen, for example, from its Python demo script.


As default input we will use a video sequence with a static background and dynamic foreground objects:

default="space_traffic.mp4"

Specify the --input_video key to set another input video.
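
One possible way to wire up this key with argparse is sketched below; only the --input_video name and the default value come from the text above, the rest is an assumption about how the demo script is organized.

# hypothetical argument parsing for the demo script
import argparse

parser = argparse.ArgumentParser(description="Background subtraction demo")
parser.add_argument("--input_video", default="space_traffic.mp4",
                    help="path to the input video sequence")
args = parser.parse_args()
video_to_process = args.input_video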

1. Video Processing

Upload and process video data with OpenCV VideoCapture:

# OpenCV is needed for video I/O
import cv2

# create VideoCapture object for further video processing
captured_video = cv2.VideoCapture(video_to_process)
# check video capture status
if not captured_video.isOpened():
    print("Unable to open: " + video_to_process)
    exit(0)

2. Model Initialization

Instantiate the model:

# import the BGS library Python wrapper
import pybgs as bgs

# instantiate the background subtractor
background_subtr_method = bgs.SuBSENSE()

3. Obtaining results

Obtain the results (the initial size of frames was 1920×1080):

while True:
    # read video frames
    retval, frame = captured_video.read()

    # check whether the frames have been grabbed
    if not retval:
        break

    # resize video frames
    frame = cv2.resize(frame, (640, 360))

    # pass the frame to the background subtractor
    foreground_mask = background_subtr_method.apply(frame)
    # obtain the background without foreground mask
    img_bgmodel = background_subtr_method.getBackgroundModel()

4. Visualizing results

Visualize results with OpenCV imshow:

while True:
    # ...

    # show the current frame, foreground mask, subtracted result
    cv2.imshow("Initial Frames", frame)
    cv2.imshow("Foreground Masks", foreground_mask)
    cv2.imshow("Subtraction result", img_bgmodel)

    keyboard = cv2.waitKey(10)
    if keyboard == 27:
        break
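
For convenience, the four steps above can be combined into one runnable sketch; it assumes the BGS library is installed and importable as pybgs, and that video_to_process points to an existing video file.

# steps 1-4 combined into a single sketch (assumes pybgs is installed)
import cv2
import pybgs as bgs

video_to_process = "space_traffic.mp4"
captured_video = cv2.VideoCapture(video_to_process)
if not captured_video.isOpened():
    raise SystemExit("Unable to open: " + video_to_process)

background_subtr_method = bgs.SuBSENSE()

while True:
    retval, frame = captured_video.read()
    if not retval:
        break
    frame = cv2.resize(frame, (640, 360))
    foreground_mask = background_subtr_method.apply(frame)
    img_bgmodel = background_subtr_method.getBackgroundModel()

    cv2.imshow("Initial Frames", frame)
    cv2.imshow("Foreground Masks", foreground_mask)
    cv2.imshow("Subtraction result", img_bgmodel)
    if cv2.waitKey(10) == 27:  # Esc to quit
        break

captured_video.release()
cv2.destroyAllWindows()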

The outputs are:

  • initial frame:
Figure 5: Input video frame
  • obtained foreground mask:
Figure 6: Foreground mask obtained with BGS SuBSENSE BS method
  • subtraction result:
Figure 7: Background after subtracted foreground entities

In the above case the masks are fairly accurate and, in general, the foreground objects were captured correctly; however, there are the following defects:

  • the static group of people in the left part of the frame was not detected at all:
Figure 8: BGS SuBSENSE artifact 1
  • almost static objects with moving components were only partially detected:
Figure 9: BGS SuBSENSE artifact 2
  • regions where people are close to each other were merged into one shared mask:
Figure 10: BGS SuBSENSE artifact 3

GSoC Algorithm

Overview

During Google Summer of Code (GSoC) 2017 an advancement of LSBP was contributed: BackgroundSubtractorGSOC. The GSoC BS method was introduced in order to make LSBP faster and more robust. The method relies on RGB color values instead of LSBP descriptors and achieves high performance on the CDnet-2012 dataset.

The GSoC BS implementation doesn't refer to any article, so let's review its main points by exploring its source, bgfg_gsoc.cpp. First, we need to pay attention to the BackgroundSubtractorGSOC instantiation parameters:

Ptr< BackgroundSubtractorGSOC > createBackgroundSubtractorGSOC(
     int mc,
     int nSamples,
     float replaceRate,
     float propagationRate,
     int hitsThreshold,
     float alpha,
     float beta,
     float blinkingSupressionDecay,
     float blinkingSupressionMultiplier,
     float noiseRemovalThresholdFacBG,
     float noiseRemovalThresholdFacFG
)

The parameters have the following meanings:

  • mc: camera motion compensation flag
  • nSamples: number of samples to maintain at each point of the frame.
  • replaceRate: probability of replacing the old sample – how fast the model will be updated.
  • propagationRate: probability of propagating to neighbors.
  • hitsThreshold: how many positives the sample must get before it will be considered as a possible replacement.
  • alpha: scale coefficient for threshold.
  • beta: bias coefficient for threshold.
  • blinkingSupressionDecay: blinking suppression decay factor.
  • blinkingSupressionMultiplier: blinking suppression multiplier.
  • noiseRemovalThresholdFacBG: strength of the noise removal for background.
  • noiseRemovalThresholdFacFG: strength of the noise removal for foreground.
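
The Python binding mirrors this signature, so the subtractor can be instantiated with explicit parameters as sketched below; the values shown approximate the library defaults, so verify them against your opencv-contrib version.

# sketch: explicit GSoC parameters in Python (values approximate the
# documented defaults; check them against your OpenCV version)
import cv2

background_subtr_method = cv2.bgsegm.createBackgroundSubtractorGSOC(
    mc=0,                                 # no camera motion compensation
    nSamples=20,                          # samples kept per pixel
    replaceRate=0.003,                    # how fast old samples are replaced
    propagationRate=0.01,                 # propagation to neighbouring pixels
    hitsThreshold=32,                     # hits before a sample may be replaced
    alpha=0.01,                           # threshold scale coefficient
    beta=0.0022,                          # threshold bias coefficient
    blinkingSupressionDecay=0.1,
    blinkingSupressionMultiplier=0.1,
    noiseRemovalThresholdFacBG=0.0004,
    noiseRemovalThresholdFacFG=0.0008)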

Keeping the above in mind, let's examine the core BS logic implemented in the apply() method. The computation core is launched in apply():

parallel_for_(Range(0, sz.area()), ParallelGSOC(sz, this, frame, learningRate, fgMask));

ParallelGSOC contains the per-pixel comparison operations for neighboring pixels, relying on RGB color features.

Another important point concerns a specific type of frame pixel. Pixels that frequently switch between foreground and background are defined as blinking. The GSoC BS approach applies a special heuristic to detect blinking pixels:

cv::add(blinkingSupression, (fgMask != prevFgMask) / 255, blinkingSupression, cv::noArray(), CV_32F);
blinkingSupression *= blinkingSupressionDecay;
fgMask.copyTo(prevFgMask);
Mat prob = blinkingSupression * (blinkingSupressionMultiplier * (1 - blinkingSupressionDecay) / blinkingSupressionDecay);

for (int i = 0; i < sz.height; ++i)
    for (int j = 0; j < sz.width; ++j)
        if (rng.uniform(0.0f, 1.0f) < prob.at< float >(i, j))
            backgroundModel->replaceOldest(i, j, BackgroundSampleGSOC(frame.at< Point3f >(i, j), 0, currentTime));

Here blinkingSupression acts as a blinking-pixel map accumulated from the XOR of the current and previous masks. The values scaled by the blinking-suppression coefficients are then used as probabilities for randomly replacing the oldest background sample at the corresponding pixels.

The final step is post-processing of the produced mask, which consists of noise removal and a Gaussian blur:

void BackgroundSubtractorGSOCImpl::postprocessing(Mat& fgMask) {
    removeNoise(fgMask, fgMask, size_t(noiseRemovalThresholdFacBG * fgMask.size().area()), 0);
    Mat invFgMask = 255 - fgMask;
    removeNoise(fgMask, invFgMask, size_t(noiseRemovalThresholdFacFG * fgMask.size().area()), 255);

    GaussianBlur(fgMask, fgMask, Size(5, 5), 0);
    fgMask = fgMask > 127;
}

The threshold values for noise removal are produced by multiplying noiseRemovalThresholdFacBG and noiseRemovalThresholdFacFG by the mask area. The mask values are then updated in accordance with the obtained threshold:

for (int i = 0; i < sz.height; ++i)
    for (int j = 0; j < sz.width; ++j)
        if (compArea[labels.at< int >(i, j)] < threshold)
            fgMask.at< uchar >(i, j) = filler;

Implementation Using OpenCV

In this section we will experiment with background subtraction using the corresponding API from the OpenCV library, again on the default "space_traffic.mp4" video.

1. Video Processing

Upload and process video data with OpenCV VideoCapture:

# create VideoCapture object for further video processing
captured_video = cv2.VideoCapture(video_to_process)
# check video capture status
if not captured_video.isOpened():
    print("Unable to open: " + video_to_process)
    exit(0)

2. Model Initialization

Instantiate the model:

# instantiate background subtraction
background_subtr_method = cv2.bgsegm.createBackgroundSubtractorGSOC()

3. Obtaining results

Obtain the results (the initial size of frames was 1920×1080):

while True:
    # read video frames
    retval, frame = captured_video.read()

    # check whether the frames have been grabbed
    if not retval:
        break

    # resize video frames
    frame = cv2.resize(frame, (640, 360))

    # pass the frame to the background subtractor
    foreground_mask = background_subtr_method.apply(frame)
    # obtain the background without foreground mask
    background_img = background_subtr_method.getBackgroundImage()

4. Visualizing results

Visualize results with OpenCV imshow:

while True:
    # ...

    # show the current frame, foreground mask, subtracted result
    cv2.imshow("Initial Frames", frame)
    cv2.imshow("Foreground Masks", foreground_mask)
    cv2.imshow("Subtraction Result", background_img)

    keyboard = cv2.waitKey(10)
    if keyboard == 27:
        break

The outputs are:

  • initial frame:
Figure 11: Input video fragment
  • obtained foreground mask:
Figure 12: Foreground mask obtained with OpenCV BS-GSoC method
  • subtraction result:
Figure 13: Obtained background result

We can see that mostly foreground objects were correctly located. However, there are some artifacts:

  • the foreground masks cover some extra space at the feet of the objects, which corresponds to their shadows:
Figure 14: OpenCV BS-GSoC artifact 1
  • static objects were only partially detected: only their moving components were captured, for example, the moving man's hand in the picture below:
    Figure 15: OpenCV BS-GSoC artifact 2

    or some parts of the non-dynamic people group in the left part of the frame:

Figure 16: OpenCV BS-GSoC artifact 3

It can be noted that the most challenging areas for both algorithms contain static foreground objects or partly moving objects with some dynamic components.

Evaluation

Data Sets

In the current post we will use two datasets from ChangeDetection.NET (CDNET), CDNET-2012 and CDNET-2014, to evaluate the discussed BS methods. The CDNET datasets are frequently used video collections for algorithm evaluation due to the variety of their content: categories, input frames and corresponding ground-truth (GT) images. There are 6 categories in CDNET-2012 and 11 in CDNET-2014. Let's quickly look through them:

Common categories:

  1. baseline: 4 videos with a static background containing moving foreground objects
  2. cameraJitter: 4 videos with a slight camera-oscillation effect
  3. dynamicBackground: 6 videos with a partly moving background and dynamic foreground
  4. intermittentObjectMotion: 6 videos containing a static background with periodically moving foreground entities
  5. shadow: 6 video sequences which contain the shadows of the foreground objects
  6. thermal: 5 videos obtained from a thermal camera

Introduced in CDNET-2014:

  1. badWeather: 4 traffic videos with poor visibility, with images distorted by snowfall
  2. lowFramerate: 4 video sequences with a low frame rate
  3. nightVideos: 6 videos containing low-illumination views
  4. PTZ: 4 video sequences obtained with a pan-tilt-zoom camera (dynamic foreground: rotation, zoom; slight oscillation effect)
  5. turbulence: 4 videos distorted by a slight ripple.

Evaluation Pipeline

To evaluate the algorithms we will use evaluator.py, based on the opencv-contrib evaluation pipeline. To run the script we need to obtain the datasets, in our case CDnet-2012 and CDnet-2014. The path to the data should be specified in the required --dataset_path parameter. The line below initiates the evaluation:

python evaluator.py --dataset_path ./cdnet_12

In the lines below we define the list of algorithms to evaluate (method creator, its title and arguments to pass):

import cv2
import pybgs as bgs

ALGORITHMS_TO_EVALUATE = [
    (cv2.bgsegm.createBackgroundSubtractorGSOC, "GSoC", {}),
    (bgs.SuBSENSE, "SuBSENSE", {}),
]

Iterating over ALGORITHMS_TO_EVALUATE, the specified background subtraction models are instantiated. To compute a foreground mask, the apply(frame) method is called. The mask list accumulates the obtained foreground masks for the subsequent calculation of the algorithm quality metrics. Before we get their values, let's recall the following key concepts:

  • true positives (TP) – properly masked objects
  • true negatives (TN) – properly not masked objects
  • false positives (FP) – improperly masked objects
  • false negatives (FN) – improperly not masked objects

Knowing the TP, TN, FP and FN values we can calculate precision, recall and, finally, the F1-measure and accuracy:

  1. precision – the ratio of true positives in the obtained results: \frac{TP}{TP+FP}
  2. recall – the ratio of true positives found among all the ground truth: \frac{TP}{TP+FN}
  3. F1-measure (FM): 2\times\frac{precision\times recall}{precision+recall}=\frac{2TP}{2TP+FP+FN}
  4. Accuracy: \frac{TP+TN}{TP+TN+FP+FN}
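
A short sketch of computing these metrics from a predicted foreground mask and a ground-truth mask (both binary uint8 images, with 255 marking foreground) could look as follows; it is a stand-alone illustration rather than the exact code of evaluator.py.

# per-frame metrics from a predicted mask and a ground-truth mask
import numpy as np

def bs_metrics(pred_mask, gt_mask):
    pred = pred_mask > 0
    gt = gt_mask > 0
    tp = int(np.sum(pred & gt))    # properly masked foreground pixels
    tn = int(np.sum(~pred & ~gt))  # properly unmasked background pixels
    fp = int(np.sum(pred & ~gt))   # improperly masked pixels
    fn = int(np.sum(~pred & gt))   # improperly unmasked pixels
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy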

Results

The minimal F1 value, 0.084 for LSBP, was obtained on the dynamicBackground video series:

Method      Precision   Recall   F1      Accuracy
LSBP        0.064       0.784    0.084   0.864
GSoC        0.269       0.913    0.289   0.990
SuBSENSE    0.610       0.740    0.528   0.996

The most challenging videos for GSoC were from the PTZ category:

Method      Precision   Recall   F1      Accuracy
LSBP        0.231       0.639    0.216   0.888
GSoC        0.246       0.933    0.265   0.811
SuBSENSE    0.527       0.730    0.485   0.964

SuBSENSE showed the lowest F1 in nightVideos:

Method      Precision   Recall   F1      Accuracy
LSBP        0.467       0.392    0.296   0.977
GSoC        0.294       0.780    0.342   0.947
SuBSENSE    0.462       0.624    0.448   0.975

The average values for all categories presented in the data sets are:

  • CDnet-2012:

Method      Precision   Recall   F1      Accuracy
LSBP        0.491       0.643    0.393   0.930
GSoC        0.705       0.714    0.562   0.972
SuBSENSE    0.824       0.742    0.688   0.982

  • CDnet-2014:

Method      Precision   Recall   F1      Accuracy
LSBP        0.455       0.624    0.362   0.945
GSoC        0.610       0.753    0.522   0.960
SuBSENSE    0.747       0.734    0.644   0.983

The above evaluations illustrate that, among the OpenCV methods, the GSoC BS method exceeds the LSBP scores, while SuBSENSE outperforms both.

References

The following links contain detailed information about the above methods and additional materials for further exploration:

  1. On the Role and the Importance of Features for Background Modeling and Foreground Detection: contains basic information about BS and its methods, detailed information about types of features, and helpful comparative tables
  2. Background Subtraction using Local SVD Binary Pattern: describes the LSBP method
  3. Flexible Background Subtraction With Self-Balanced Local Sensitivity: describes the SuBSENSE method
  4. A regularly updated compilation of BS materials
  5. CDnet datasets
