Image Matting with the State-of-the-Art Method “F, B, Alpha Matting”

Pavel Semkin
September 28, 2020

The foreground is the part of a view or picture that is nearest to you when you look at it (Oxford dictionary). We humans are usually good at distinguishing foreground objects in images from the background. As computer vision algorithms become better and better at solving visual tasks, it is only natural that we want to offload the foreground separation task to machines.

The problem of foreground-background separation is also called image matting, and it arises in a wide spectrum of image-processing tasks. You can find such algorithms in:

  • Film production. Green screens are used to make filmmakers' lives easier by allowing almost automatic foreground segmentation;
  • Video-conferencing apps. Some modern web-conferencing apps (for example, Zoom) let users remove the background from their webcam image and replace it with something else;
  • Photo editing. The major application of image matting.

We know that the foreground can be separated by applying semantic segmentation algorithms (like we did in one of the previous posts). Does this mean that the problem is virtually solved? Not exactly. The devil is always in the details, and in our case quite literally. The main difference between semantic segmentation and image matting is that in the latter we want our output to be extremely precise and continuous. Per-pixel labels identifying whether each particular pixel belongs to the foreground or the background are no longer good enough, since for a lot of natural objects like hair or fur the answer would be something in between. Most semantic segmentation algorithms are trained without any emphasis on edges being that precise, so these models cannot give us the desired result. We need a special solution for our case, hence image matting.

We will return to the more formal problem definition later. Now let’s hop straight into the demo of the SOTA image matting algorithm to see how it works. We are grateful to the authors of the “F, B, Alpha Matting” article for providing the inference code which was used in our demo.

Download Code

Before we go over the explanation, you can download the code from our GitHub repo.


Demonstration

Let’s run the FBA Matting approach on real-life images. To apply the FBA Matting algorithm we first need to generate a trimap (we’ll cover what that is later). In our demo, we’ll use a pre-trained DeepLabV3 to generate a segmentation mask with the probability of each pixel belonging to a foreground class. After that we’ll use a number of dilation operations to mark border pixels and pixels with low foreground probability as unknown. Unfortunately, such an approach may lead to imprecise matting. You can see the difference between a labeled and a generated trimap below.

  • image-example
    Image
  • trimap-example
    Trimap
  • trimap-generated-example
    Generated trimap
Fig.1: The image, the ground-truth trimap and the generated trimap

We’ll discuss the code for trimap generation in the following parts. Overall, our inference with FBA Matting consists of the following steps:

  1. Produce segmentation of a person with DeepLabV3;
  2. Apply dilation to get a trimap;
  3. Use the generated trimap to predict the result with FBA Matting.

Here are the results of applying this approach to an image of a cat:

  • demo-image
    Image
  • demo-trimap
    Trimap
  • demo-fg
    FG
  • demo-bg
    BG
  • demo-alpha
    Alpha
  • demo-swapped-bg
    Green BG
Fig.2: Experimental results

You can try it yourself; just follow the instructions in the repository’s README.


Mathematical Formulation

OK, now we have seen that matting can generate pretty good results, but how does it work? Mathematically, the problem is formulated via the Compositing Equation:

    \[C_i=\alpha_iF_i+(1-\alpha_i)B_i\]

Here C_i is the observed color of pixel i, F_i and B_i are its foreground and background colors, and \alpha_i \in [0, 1] is its opacity. If \alpha_i equals 1 for pixel i, then it’s a pure foreground pixel; if it equals 0, the pixel is pure background.
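
To make the equation concrete, here is a toy NumPy sketch (purely illustrative, not part of the paper’s or the repository’s code) that composites a synthetic foreground over a background using an alpha ramp:

import numpy as np

h, w = 4, 4
F = np.full((h, w, 3), [1.0, 0.5, 0.2])            # constant "foreground" color
B = np.full((h, w, 3), [0.0, 0.3, 0.8])            # constant "background" color
alpha = np.linspace(0, 1, h * w).reshape(h, w, 1)  # smooth opacity ramp

C = alpha * F + (1 - alpha) * B                    # the Compositing Equation
print(C.shape)                                     # (4, 4, 3)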

Solving the Compositing Equation is an ill-posed problem, as we have only 3 equations for 7 unknowns per pixel. In the past few years several deep-learning-based methods have boosted the state of the art in the image matting field. There are a lot of successful approaches such as Deep Image Matting, IndexNet Matting and GCA Matting, to name but a few. The current state of the art is F, B, Alpha Matting, and today we are going to discuss it. But before that we need to figure out what a trimap is.

Trimap problem

We need to remember that the main focus of the matting problem is a very precise separation of the foreground from the background. Thus matting doesn’t actually care much about what type of object is depicted in the image. This problem is decoupled from the actual semantic segmentation, and because of that a lot of matting algorithms require a segmentation mask – or trimap – as an input. Basically, the trimap is a rough segmentation of an image into three region types: certain foreground, unknown, certain background.

  • image-example
    Image
  • trimap-example
    Trimap
Fig. 3: The image and its trimap

The trimap is usually concatenated with the corresponding image, and this 4-channel concatenation is used as the input to the model. However, this requirement imposes severe restrictions on the model’s usage: one has to somehow generate the trimap from the image beforehand. It is often expected that nice, manually created trimaps are available to run a matting algorithm on. Luckily, some authors suggest using a segmentation network to predict the trimap from the source image – just what we did in our demo. We generate the trimap from a segmentation mask produced by a pre-trained DeepLabV3 as follows:

import cv2
import numpy as np


def trimap(probs, size, conf_threshold):
    """
    Creates a trimap based on a simple dilation algorithm.
    Inputs:  probs          - per-pixel probability of being the foreground
             size           - dilation kernel size
             conf_threshold - foreground confidence threshold
    Output:  a trimap
    """
    # Everything with even a small foreground probability goes into the initial mask.
    mask = (probs > 0.05).astype(np.uint8) * 255

    pixels = 2 * size + 1
    kernel = np.ones((pixels, pixels), np.uint8)

    dilation = cv2.dilate(mask, kernel, iterations=1)

    remake = np.zeros_like(mask)
    remake[dilation == 255] = 127  # Mark the dilated region as unknown (value 127).
    remake[probs > conf_threshold] = 255  # Mark confident pixels as definite foreground (value 255).

    return remake
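
For completeness, here is a minimal sketch (our assumption of a typical setup, not the repository’s exact script) of how the foreground probabilities could be obtained with torchvision’s pre-trained DeepLabV3 and fed into the trimap function above. The file name, kernel size and confidence threshold are hypothetical:

import cv2
import numpy as np
import torch
import torchvision

model = torchvision.models.segmentation.deeplabv3_resnet101(pretrained=True).eval()

# Read the image and normalize it with the ImageNet statistics expected by the model.
image = cv2.cvtColor(cv2.imread("image.png"), cv2.COLOR_BGR2RGB) / 255.0
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])
tensor = torch.from_numpy(((image - mean) / std).transpose(2, 0, 1)).float()[None]

with torch.no_grad():
    logits = model(tensor)["out"]  # [1, 21, H, W] class scores
# Class 15 is "person" in the VOC label order used by torchvision;
# pick another index (e.g., 8 for "cat") for other objects.
probs = torch.softmax(logits, dim=1)[0, 15].numpy()

tri = trimap(probs, size=7, conf_threshold=0.95)
cv2.imwrite("trimap.png", tri)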

F, B, Alpha Matting approach

The proposed method uses an encoder-decoder with a U-Net style architecture. However, while most of the previous approaches predict only the alpha matte, the authors also predict F and B directly from this single encoder-decoder. In other words, the model gives a 7-channel output (1 for \alpha, 3 for F and 3 for B). Let’s now cover some tips and tricks the authors found particularly useful.

The encoder architecture is ResNet-50 with some modifications:

  • Three channels are used for the RGB image and the others are used for the trimap, which in turn is encoded with Gaussian blurs of the definite foreground and background masks at three different scales. This encoding is the main difference from the previous approaches, whose authors usually encode the trimap as a single channel with values 1 for F, 0.5 for unknown and 0 for B.
  • The strides in layer3 and layer4 were removed and the dilations in these layers were increased to 2 and 4 respectively. The idea is to avoid lowering the feature map’s spatial resolution and to process the information at the highest scale possible. Here is the encoder implementation:
import torch.nn as nn


class ResnetDilatedBN(nn.Module):
    def __init__(self, orig_resnet, dilate_scale=8):
        super(ResnetDilatedBN, self).__init__()
        from functools import partial

        if dilate_scale == 8:
            orig_resnet.layer3.apply(partial(self._nostride_dilate, dilate=2))
            orig_resnet.layer4.apply(partial(self._nostride_dilate, dilate=4))
        elif dilate_scale == 16:
            orig_resnet.layer4.apply(partial(self._nostride_dilate, dilate=2))

        # take pretrained resnet, except AvgPool and FC
        self.conv1 = orig_resnet.conv1
        self.bn1 = orig_resnet.bn1
        self.relu1 = orig_resnet.relu1
        self.conv2 = orig_resnet.conv2
        self.bn2 = orig_resnet.bn2
        self.relu2 = orig_resnet.relu2
        self.conv3 = orig_resnet.conv3
        self.bn3 = orig_resnet.bn3
        self.relu3 = orig_resnet.relu3
        self.maxpool = orig_resnet.maxpool
        self.layer1 = orig_resnet.layer1
        self.layer2 = orig_resnet.layer2
        self.layer3 = orig_resnet.layer3
        self.layer4 = orig_resnet.layer4

    def _nostride_dilate(self, m, dilate):
        classname = m.__class__.__name__
        if classname.find("Conv") != -1:
            # strided convolutions: remove the stride and dilate instead
            if m.stride == (2, 2):
                m.stride = (1, 1)
                if m.kernel_size == (3, 3):
                    m.dilation = (dilate // 2, dilate // 2)
                    m.padding = (dilate // 2, dilate // 2)
            # other convolutions: just increase the dilation
            else:
                if m.kernel_size == (3, 3):
                    m.dilation = (dilate, dilate)
                    m.padding = (dilate, dilate)

    # The forward method is omitted here for brevity; see the repository code
    # for the full implementation.

Let’s discuss the input trimap generation process. First of all, we read the trimap as a gray-scale image and convert it into a NumPy array of shape (H, W, 2), with the first channel marking the certain background and the second the certain foreground.

def read_trimap(name):
    trimap_im = cv2.imread(name, 0) / 255.0
    h, w = trimap_im.shape
    trimap = np.zeros((h, w, 2))
    trimap[trimap_im == 1, 1] = 1
    trimap[trimap_im == 0, 0] = 1
    return trimap

Then we scale the image and the trimap so that their shapes are divisible by 8.

def scale_input(x: np.ndarray, scale: float, scale_type) -> np.ndarray:
    """ Scales inputs to multiple of 8. """
    h, w = x.shape[:2]
    h1 = int(np.ceil(scale * h / 8) * 8)
    w1 = int(np.ceil(scale * w / 8) * 8)
    x_scale = cv2.resize(x, (w1, h1), interpolation=scale_type)
    return x_scale

After that the trimap is transformed into a 6-channel form.

def dt(a):
    # Distance of each pixel to the nearest non-zero pixel of the mask `a`.
    return cv2.distanceTransform((a * 255).astype(np.uint8), cv2.DIST_L2, 0)

def trimap_transform(trimap):
    h, w = trimap.shape[0], trimap.shape[1]

    clicks = np.zeros((h, w, 6))
    for k in range(2):  # k = 0: background channel, k = 1: foreground channel
        if np.count_nonzero(trimap[:, :, k]) > 0:
            # Squared distance to the certain region, turned into Gaussian maps
            # at three different scales (0.02L, 0.08L, 0.16L).
            dt_mask = -dt(1 - trimap[:, :, k]) ** 2
            L = 320
            clicks[:, :, 3 * k] = np.exp(dt_mask / (2 * ((0.02 * L) ** 2)))
            clicks[:, :, 3 * k + 1] = np.exp(dt_mask / (2 * ((0.08 * L) ** 2)))
            clicks[:, :, 3 * k + 2] = np.exp(dt_mask / (2 * ((0.16 * L) ** 2)))

    return clicks

Then, a Pyramid Pooling layer is applied to the encoder features and the result is passed into the decoder with seven convolutional layers, three bilinear upsampling layers, and skip connections.

The output contains 7 channels for \alpha, F and B. The authors claim that clamping the values of \alpha between 0 and 1 with a hardtanh activation improves performance over using a sigmoid. The F and B logits are passed through a sigmoid.
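
As a rough illustration (the channel ordering here is our assumption, not taken from the repository), the described output activations could look like this:

import torch
import torch.nn.functional as nnf

def split_and_activate(out):
    # out: [N, 7, H, W] raw decoder output
    alpha = nnf.hardtanh(out[:, 0:1], 0, 1)  # clamp alpha to [0, 1]
    fg = torch.sigmoid(out[:, 1:4])          # foreground RGB
    bg = torch.sigmoid(out[:, 4:7])          # background RGB
    return alpha, fg, bg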

Batch Normalization and Group Normalization

After a number of experiments, the authors claimed that, quite counter-intuitively, the best results are achieved with a batch size equal to 1. However, this is impossible with the original ResNet-50, whose Batch Normalization layers require batch sizes larger than one to work well. To overcome the issue, they suggest using Group Normalization (32 channels per group) with Weight Standardization instead.

def norm(dim, bn=False):
    # Group Normalization with 32 channels per group replaces BatchNorm by default.
    if bn is False:
        return nn.GroupNorm(32, dim)
    else:
        return nn.BatchNorm2d(dim)
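
Weight Standardization itself is not shown in the snippet above. As a sketch of the general technique (which may differ in details from the repository code), a convolution with standardized weights can be implemented like this:

import torch
import torch.nn as nn
import torch.nn.functional as nnf

class WSConv2d(nn.Conv2d):
    def forward(self, x):
        w = self.weight
        # Standardize the weights of each output channel to zero mean and unit variance.
        mean = w.mean(dim=[1, 2, 3], keepdim=True)
        std = w.std(dim=[1, 2, 3], keepdim=True) + 1e-5
        return nnf.conv2d(x, (w - mean) / std, self.bias, self.stride,
                          self.padding, self.dilation, self.groups)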

A bouquet of Loss Functions

In most of the previous works, fairly straightforward loss formulations were used. For instance, in Deep Image Matting the final loss was formulated as a combination of an alpha-prediction loss:

    \[\mathcal{L}_{\alpha}^{i} = \sqrt{(\alpha_p^i - \alpha_g^i)^2 + \epsilon^2}, \alpha_p^i, \alpha_g^i \in [0, 1].\]

and a compositional loss:

    \[\mathcal{L}_{c}^{i} = \sqrt{(c_p^i-c_g^i)^2 + \epsilon^2}.\]
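
For illustration, these two losses could be written in PyTorch as follows (a sketch following the formulas above, not the original Deep Image Matting code):

import torch

def alpha_prediction_loss(alpha_pred, alpha_gt, eps=1e-6):
    # Charbonnier-style L1 penalty on the predicted alpha matte.
    return torch.sqrt((alpha_pred - alpha_gt) ** 2 + eps ** 2).mean()

def compositional_loss(alpha_pred, fg_gt, bg_gt, image, eps=1e-6):
    # Penalize the difference between the observed image and the composite
    # built from the predicted alpha and the ground-truth F and B.
    comp = alpha_pred * fg_gt + (1 - alpha_pred) * bg_gt
    return torch.sqrt((comp - image) ** 2 + eps ** 2).mean()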

In “F, B, Alpha Matting” the authors used a linear combination of eight different loss functions:

fba-losses

where \mathcal{L}_1^\alpha is the \mathcal{L}_1 loss on alpha, \mathcal{L}_c^\alpha is the composition loss, \mathcal{L}_g^\alpha is the gradient loss, and \mathcal{L}_{lap}^\alpha is the Laplacian pyramid loss, which is computed over multiple scales s of the Laplacian pyramid \mathcal{L}_{pyr}^s. For F and B the same types of losses were used, as shown in the figure above.

The final loss was formulated as follows:

    \[\mathcal{L}_{FB\alpha} = \mathcal{L}_1^\alpha + \mathcal{L}_c^\alpha + \mathcal{L}_g^\alpha + \mathcal{L}_{lap}^\alpha + 0.25(\mathcal{L}_1^{FB} + \mathcal{L}_{lap}^{FB} + \mathcal{L}_{excl}^{FB}+ \mathcal{L}_{c}^{FB})\]

As mentioned in the paper, it is not enough to compute the loss on the absolute values of \alpha only, as such errors are not always clearly visible. For example, errors in the reproduction of the shapes of hair strands are more noticeable than slight errors in the overall opacity level. In Fig.4 below you can see that a low \mathcal{L}_1 doesn’t guarantee visually attractive predictions. That is why taking gradients into account is quite important, and it is the reason to use gradient-related losses such as \mathcal{L}_g^\alpha and \mathcal{L}_{lap}^\alpha.

visual-errors
Fig.4: Predictions with measured Sum of Absolute Difference
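
As an illustration of such a gradient-related term, an L1 penalty on the spatial gradients of the alpha matte could be written as follows (a sketch of the general idea; the paper’s exact formulation may differ):

import torch

def gradient_loss(alpha_pred, alpha_gt):
    # alpha_*: [N, 1, H, W]; compare horizontal and vertical finite differences.
    dx_p = alpha_pred[:, :, :, 1:] - alpha_pred[:, :, :, :-1]
    dy_p = alpha_pred[:, :, 1:, :] - alpha_pred[:, :, :-1, :]
    dx_g = alpha_gt[:, :, :, 1:] - alpha_gt[:, :, :, :-1]
    dy_g = alpha_gt[:, :, 1:, :] - alpha_gt[:, :, :-1, :]
    return (dx_p - dx_g).abs().mean() + (dy_p - dy_g).abs().mean()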

Fusion Strategy

One can see a problem with all those losses. Since the predictions for \alpha, F and B produced by the model are decoupled, the relationship given by the Compositing Equation is not enforced. To deal with it, the authors proposed a fusion module based on the maximum likelihood estimate of p(\alpha, F, B | \hat{\alpha}, \hat{F}, \hat{B}). They assumed the independence of the individual errors and ignored any spatial dependence between pixels. That allowed them to build the likelihood model like this:

fusion-likelihood

They also assumed a Gaussian distribution for the errors:

errors-distribution-assumption

Finally, they simplified the model by ignoring the gradient and Laplacian losses and replacing \mathcal{L}_1 with \mathcal{L}_2. Starting from \hat{F}^{(0)} = \hat{F}, \hat{B}^{(0)} = \hat{B}, \hat{\alpha}^{(0)} = \hat{\alpha}, they used the following iterative scheme:

likelihood-iteration-scheme

The best results were achieved with \sigma^2_C = \sigma^2_F = \sigma^2_B = 1 and \sigma^2_\alpha = 10.

Overall, the fusion module helps to fuse all three predictions with respect to the Compositing Equation.

Here is the fusion module implementation:

import torch


def fba_fusion(alpha, img, F, B):
    # Update F and B from the current alpha estimate (one step of the
    # maximum-likelihood iteration described above).
    F = alpha * img + (1 - alpha ** 2) * F - alpha * (1 - alpha) * B
    B = (1 - alpha) * img + (2 * alpha - alpha ** 2) * B - alpha * (1 - alpha) * F

    F = torch.clamp(F, 0, 1)
    B = torch.clamp(B, 0, 1)
    # Update alpha as the regularized least-squares solution of the
    # Compositing Equation given the refined F and B.
    la = 0.1
    alpha = (alpha * la + torch.sum((img - B) * (F - B), 1, keepdim=True)) / (
        torch.sum((F - B) * (F - B), 1, keepdim=True) + la
    )
    alpha = torch.clamp(alpha, 0, 1)
    return alpha, F, B

Results

In Fig.5 and Fig.6 below you can find the results of the proposed method compared to existing ones. TTA stands for test-time augmentation; the authors used a combination of rotation, flipping and scaling.

results-comparison
Fig.5: Results compared to the other methods
matting-results
Fig.6: Visualization of the results. CA stands for Context Aware Matting.

Conclusion

In today’s post, we’ve discussed the challenging matting problem, tried the SOTA algorithm on our own images and took a deeper look at how this algorithm works. We’ve tried to get around the requirement of a manually generated trimap by using a semantic segmentation network. It may not be a universal solution, but it still produces decent results, especially if one is ready to play with the parameters.
