Anchor-free object detection is powerful because of its speed and its generalizability to other computer vision tasks. "CenterNet: Objects as Points" is one of the milestones among anchor-free object detection algorithms. In this post, we will discuss the fundamentals of object detection, anchor-free (anchorless) vs. anchor-based object detection, the CenterNet Objects as Points paper, CenterNet pose estimation, and inference with the CenterNet model.
The objective of the blog post is to answer the following questions:
1. What is object detection in Machine Learning/Deep Learning?
2. What is an anchor in object detection?
3. What is anchor-based object detection?
4. What is anchor-free object detection?
5. Is anchor-free better than anchor-based object detection?
People who will benefit most from this article are those who:
- Understand the deep learning classification pipeline and want to learn deep learning-based object detection.
- Have experience in anchor-based object detection and want to explore anchor free object detection.
- Want to dive deep into CenterNet Object as Points algorithm.
- Want to access models from the TensorFlow model zoo and use them in your application.
- What is Object Detection in Machine Learning/Deep Learning?
- What is an Anchor in Object Detection?
- What is Anchor-Based Object Detection?
- What is Anchor-Free Object Detection?
- Is Anchor-Free Better Than Anchor-based Object Detection?
- Components of Deep Learning-Based Object Detection
- How Does CenterNet Work?
- Ground Truth Encoding in CenterNet
- Model Prediction Decoding in CenterNet
- Loss Functions in CenterNet
- CenterNet Pose Estimation
- CenterNet model Inference using TensorFlow
- Conclusion
What is Object Detection in Machine Learning/Deep Learning?
Object detection is a computer vision technique to localize and classify objects in an image or video. Deep learning-based object detection has been highly successful, with many applications such as surveillance and tracking.
What is an Anchor in Object Detection?
Pre-defined bounding boxes that serve as proposals for the ground truth are called anchors in object detection. Let us understand this with an example.
Suppose you have an image of size 300 x 300 as shown above (fig. 1). It is equally divided into 3 x 3 grid cells, and a bounding box of size 80 x 75 is placed at the center of each grid cell. These pre-defined bounding boxes (in yellow) are called anchors. If any of these proposals has enough overlap with a ground truth bounding box, it is assigned that ground truth's class label; otherwise, it is assigned the background class label (no object).
One grid cell may have multiple pre-defined bounding boxes/proposals, as shown in the image below (fig. 2).
What is Anchor-Based Object Detection?
A deep learning-based object detection method that uses pre-defined bounding boxes (anchors) as proposals is known as anchor-based object detection.
Anchors are assigned class labels with a label assignment strategy. For example, in a naive label assignment strategy, if the maximum IoU (Intersection over Union) of an anchor with some ground truth is greater than 0.5, then the anchor will be assigned the ground truth label.
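To make this naive label assignment concrete, here is a minimal NumPy sketch (purely illustrative, not from any specific library) that computes the IoU of a few hypothetical anchors with one ground truth box and assigns labels with a 0.5 threshold.

import numpy as np

def iou(box_a, box_b):
    # IoU of two boxes given as (x_min, y_min, x_max, y_max).
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Hypothetical anchors and one ground truth box with class id 1.
anchors = np.array([[10, 10, 90, 85], [100, 100, 180, 175], [200, 10, 280, 85]])
gt_box, gt_label = np.array([15, 20, 95, 90]), 1

# Naive assignment: IoU > 0.5 -> ground truth label, else background (0).
labels = [gt_label if iou(anchor, gt_box) > 0.5 else 0 for anchor in anchors]
print(labels)  # [1, 0, 0]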
In anchor-based object detection, the model predicts the class labels of the anchor boxes and how each anchor should be adjusted to match the ground truth objects, as shown below (fig. 3).
What is Anchor Free Object Detection?
Anchor free object detection directly predicts the bounding box, but it predicts it with respect to some fixed reference in the image.
For example, consider an image of size 300 x 300 as shown below (fig. 4). It is equally divided into 6 x 6 grid cells. The ground truth center falls in one of the grid cells, and only that grid cell is responsible for predicting the object's width, height, and the deviation of its center from the grid cell center.
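As a small illustration, the NumPy sketch below takes a hypothetical 300 x 300 image divided into a 6 x 6 grid and one made-up ground truth box, finds the grid cell responsible for it, and computes the targets (center deviation, width, height) that cell would learn to predict.

import numpy as np

img_size, grid = 300, 6
cell = img_size / grid  # 50 pixels per grid cell

# Hypothetical ground truth box (x_min, y_min, x_max, y_max).
gt = np.array([60.0, 110.0, 180.0, 230.0])
cx, cy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2  # box center: (120, 170)
col, row = int(cx // cell), int(cy // cell)        # responsible cell: column 2, row 3
cell_cx, cell_cy = (col + 0.5) * cell, (row + 0.5) * cell

# Targets the responsible grid cell would learn to predict.
dx, dy = cx - cell_cx, cy - cell_cy  # deviation of the box center from the cell center
w, h = gt[2] - gt[0], gt[3] - gt[1]  # box width and height
print(f"cell (row {row}, col {col}), center deviation ({dx}, {dy}), size ({w}, {h})")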
Is Anchor Free Better Than Anchor-based Object Detection?
The initial success of the anchor-based method led to more research, which made it more accurate than anchor free object detection. However, recent success in anchor free object detection made it equivalent to the anchor-based method in terms of accuracy.
Following are a few advantages of anchor free methods over anchor-based:
- Finding suitable anchor boxes (in shape and size) is crucial in training an excellent anchor-based object detection model. Finding suitable anchors is a complex problem and may need hyper-parameter tuning.
- Using more anchors results in better accuracy in anchor-based object detection, but this comes at a cost: the model needs a more complex architecture, which leads to slower inference.
- Anchor free object detection is more generalizable. It predicts objects as points that can easily be extended to key-points detection, 3D object detection, etc. However, the anchor-based object detection solution approach is limited to bounding box prediction.
Components of Deep Learning-Based Object Detection
Before discussing how CenterNet object detection works, let's look at how deep learning-based object detection generally works.
Object detection has the following four components (fig. 5):
- Object detection model
- Ground truth encoding
- Loss function
- Model prediction decoding
Object Detection Model
The object detection model is a CNN (Convolutional Neural Network) based architecture that maps input images to low-resolution features. These features carry rich information about the objects in the image (their localization and class) (fig. 6).
However, a plain CNN architecture is less effective for object detection because the outputs of its deeper layers lack localization information. FPN [3] or a similar architecture solves this problem. We will discuss this in a future post.
Ground Truth Encoding in Object Detection
To compare the model output and ground truth label using some loss, both should be of the same shape.
In fig. 6 (above), you can see the model output shape. However, the ground truth labels for an image have the following (or a similar) format: a list of bounding boxes with class labels, e.g., $(x_{min}, y_{min}, x_{max}, y_{max}, class)$ for each object.
So for training, it is essential to convert these labels into the model output format. Converting the ground truth into a specific format is called ground truth encoding.
Loss Function in Object Detection
Object detection model training also needs a loss function like other neural network training.
An object detection problem has two parts:
- Localization of an object (predicting four numbers to represent a bounding box) is a regression problem.
- Predicting the class label of the localized object is a classification problem.
So it needs two different loss functions to solve these two problems. The weighted sum of localization and classification loss is the final loss of the object detection model prediction.
The number of background samples is much higher than the number of object samples, which makes the object detection problem harder. So it needs a classification loss function that can deal with skewed data, for example, focal loss [4].
Object Detection Model Prediction Decoding
The model output needs to be decoded (the reverse of ground truth encoding) into the ground truth format, e.g., $(x_{min}, y_{min}, x_{max}, y_{max}, class)$.
Now that you know the components of deep learning-based object detection, let us see how CenterNet implements them.
How Does CenterNet Work?
Before going further, it is essential to know the timeline of CenterNet. In April 2019, the following two papers appeared:
- Objects as Points by Xingyi Zhou, Dequan Wang, Philipp Krähenbühl [1]
- CenterNet: Keypoint Triplets for Object Detection by Kaiwen Duan, Song Bai, Lingxi Xie, Honggang Qi, Qingming Huang, Qi Tian [2]
In the paper "Objects as Points," the authors use the term CenterNet to refer to their algorithm, and that is the CenterNet we discuss in this post. Let us go through the following aspects to understand how CenterNet works.
- Ground truth encoding
- Model prediction decoding
- Loss functions
- CenterNet Model (backbone architecture)
Ground Truth Encoding in CenterNet
In CenterNet, an object is modeled as the center point of its bounding box. The bounding box size and other object properties are inferred from the keypoint feature at the center (fig. 7).
Ground truth encoding depends on the object detection model output. So, first, let us look at the model output (fig. 8).
The model has three output heads: the keypoint heatmap, the object size, and the local offset. First, it takes an input image of shape $W \times H \times 3$. The output stride is $R$, so the output width and height are $W/R$ and $H/R$, respectively. And what about the depth (channels) of each output head? The depth of the keypoint heatmap is $C$ (the number of object classes), that of the object size head is 2, and that of the local offset head is 2. Let's look at why this is so.
Keypoint Heatmap in CenterNet
It has one channel for each object class, so its depth is the number of object classes $C$ (e.g., 80 for the COCO dataset). The keypoint heatmap is responsible for keypoint encoding and prediction.
How is the CenterNet keypoint heatmap encoded?
Let us assume a point $p \in \mathbb{R}^2$ is the center of a bounding box. The output stride of the CenterNet model is $R$, so the point in heatmap space will be $\tilde{p} = \lfloor p/R \rfloor$.
The ground truth encoding (the keypoint value $Y_{xyc}$ at $(x, y)$ for class $c$) in heatmap space is assigned using a Gaussian kernel as follows:

$$Y_{xyc} = \exp\left(-\frac{(x - \tilde{p}_x)^2 + (y - \tilde{p}_y)^2}{2\sigma_p^2}\right) \qquad (1)$$

Here, $\sigma_p$ is the object size-adaptive standard deviation; it depends on the bounding box shape and size. How do we calculate it? Have a look at the following image (fig. 9).
$$\sigma_p = \frac{2r + 1}{6} \qquad (2)$$

where $r$ is the radius.
Find the radius as follows:
- Assume a circle of radius $r$ on the top-left and the bottom-right corner of a bounding box.
- Take the largest possible radius that satisfies the property: any pair of points, one from the top-left circle and the other from the bottom-right circle, generates a bounding box (dotted green line) with at least 0.7 IoU with the ground truth (solid red line).
Look at heatmap values around the bounding box center in heatmap space (fig. 10).
What if the image has two or more instances of the same class whose Gaussians overlap? In this case, $Y_{xyc}$ takes the element-wise maximum of all of them.
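Below is a minimal NumPy sketch of this heatmap encoding. Equation (1) and the element-wise maximum follow the description above, and $\sigma_p = (2r + 1)/6$ follows the convention of the official implementation; the image size, stride, keypoint, and radius values are made up for illustration.

import numpy as np

def draw_gaussian(heatmap, center, radius):
    # Splat a 2D Gaussian for one keypoint onto a single-class heatmap,
    # keeping the element-wise maximum where Gaussians overlap.
    sigma = (2 * radius + 1) / 6.0               # size-adaptive standard deviation
    h, w = heatmap.shape
    cx, cy = center                              # keypoint in heatmap space
    xs = np.arange(w)[None, :]
    ys = np.arange(h)[:, None]
    gaussian = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))  # eq. (1)
    np.maximum(heatmap, gaussian, out=heatmap)   # overlap rule: element-wise max
    return heatmap

# Hypothetical setup: 512 x 512 input, stride R = 4 -> 128 x 128 heatmap, one class.
R = 4
heatmap = np.zeros((128, 128), dtype=np.float32)
p = np.array([123, 57])                          # box center in image space
p_tilde = p // R                                 # (30, 14) in heatmap space
draw_gaussian(heatmap, p_tilde, radius=6)        # radius would come from the box size and min IoU
print(heatmap[14, 30], heatmap[14, 33])          # peak is 1.0, neighbors decay smoothly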
Local Offset in CenterNet
Mapping to heatmap space, the points lose precision because of the CenterNet model output stride $R$.
For example, suppose a point $p = (123, 57)$ is in the original image space.
In heatmap space (let $R = 4$), the point will be $\tilde{p} = \lfloor p/R \rfloor = (30, 14)$.
If one remaps it to the original image space using only $R$, the point remaps to $\tilde{p} \cdot R = (120, 56)$.
For the $x$-coordinate alone, there is an error of 3 pixels.
To fix such errors, the model also predicts the local offset of the heatmap points. So it has two channels: one for correcting the $x$-coordinate and one for the $y$-coordinate.
The ground truth encoding of the offset for point $\tilde{p}$ is the following:

$$O_{\tilde{p}} = \frac{p}{R} - \tilde{p} \qquad (3)$$

Using the offset, you can obtain the precise point in the original image. Let us review the above example: $O_{\tilde{p}} = (123, 57)/4 - (30, 14) = (0.75, 0.25)$.
Remapping to the original image space gives $(\tilde{p} + O_{\tilde{p}}) \cdot R = (30.75, 14.25) \times 4 = (123, 57)$.
It recovers the original point in the image using the offset value and the heatmap location.
Object Size in CenterNet
It predicts the width and height of the bounding box, so its depth is two.
How do we achieve ground truth encoding for object size?
Let $(x_1^{(k)}, y_1^{(k)}, x_2^{(k)}, y_2^{(k)})$ be a bounding box of object $k$ with class label $c_k$ in the original image space.
So the center point in image space is:

$$p_k = \left(\frac{x_1^{(k)} + x_2^{(k)}}{2},\; \frac{y_1^{(k)} + y_2^{(k)}}{2}\right)$$

Its center point in the model output space (stride $R$) is $\tilde{p}_k = \lfloor p_k / R \rfloor$.
Then, the object size label (encoding) in the model output space, $s_k$, is:

$$s_k = \left(x_2^{(k)} - x_1^{(k)},\; y_2^{(k)} - y_1^{(k)}\right) \qquad (4)$$
Width and height encoding is only needed at bounding box centers. The model will still predict values at other points, but those predictions do not contribute to training.
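Putting the three targets together, here is a minimal NumPy sketch of the ground-truth encoding for a single box (the Gaussian splat from the previous sketch is omitted; only the peak is set). The box values, class id, and tensor shapes are illustrative, and a real pipeline would loop over all boxes in the image.

import numpy as np

# Hypothetical setup: 512 x 512 input, stride R = 4, C = 80 classes.
R, C, H, W = 4, 80, 128, 128
heatmap = np.zeros((H, W, C), dtype=np.float32)
offset = np.zeros((H, W, 2), dtype=np.float32)
size = np.zeros((H, W, 2), dtype=np.float32)

# One ground truth box (x1, y1, x2, y2) with class id 17.
x1, y1, x2, y2, cls = 83.0, 17.0, 163.0, 97.0, 17
p = np.array([(x1 + x2) / 2, (y1 + y2) / 2])  # center in image space: (123, 57)
p_tilde = (p / R).astype(int)                 # center in heatmap space: (30, 14)

heatmap[p_tilde[1], p_tilde[0], cls] = 1.0    # keypoint peak (Gaussian splat omitted)
offset[p_tilde[1], p_tilde[0]] = p / R - p_tilde   # eq. (3): (0.75, 0.25)
size[p_tilde[1], p_tilde[0]] = (x2 - x1, y2 - y1)  # eq. (4): (80, 80)
print(offset[14, 30], size[14, 30])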
Model Prediction Decoding in CenterNet
Decoding is the reverse of encoding. The final prediction is a combination of all three head outputs:
- Heatmap prediction,
- Offset prediction, and
- Size prediction
Heatmap for the bounding box center (keypoint): A point whose value is greater than or equal to that of all eight neighboring points becomes a potential bounding box center (keypoint). Select the top points from each channel (for each class); the paper keeps the top 100 peaks.
Offset for bounding box center correction: Once the potential locations are found from the heatmap, it will be corrected using offset prediction.
For example, if a heatmap point is $(30, 14)$, the offset prediction at that point is $(0.75, 0.25)$, and the output stride is $R = 4$, then the center in image space will be $((30 + 0.75) \times 4,\; (14 + 0.25) \times 4) = (123, 57)$.
Size of bounding box (width and height) prediction: For each selected point on the heatmap, the size prediction gives the width and height of the bounding box.
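The sketch below mirrors this decoding for a single image: a 3 x 3 max-pool picks local peaks, the top-k peaks are kept, the offset refines each center, and the size head gives the box. It is an illustrative re-implementation under the assumptions above, not the authors' code.

import numpy as np
import tensorflow as tf

def decode(heatmap, offset, size, R=4, top_k=100):
    # heatmap: (H, W, C), offset: (H, W, 2), size: (H, W, 2) for one image.
    hm = tf.convert_to_tensor(heatmap[None, ...], tf.float32)
    # A point is a keypoint candidate if it equals the max of its 3 x 3 neighborhood.
    pooled = tf.nn.max_pool2d(hm, ksize=3, strides=1, padding='SAME')
    peaks = tf.where(tf.equal(hm, pooled), hm, tf.zeros_like(hm))[0].numpy()

    # Keep the top_k peaks across all classes.
    flat_ids = np.argsort(peaks.ravel())[::-1][:top_k]
    ys, xs, cs = np.unravel_index(flat_ids, peaks.shape)

    boxes = []
    for y, x, c in zip(ys, xs, cs):
        score = peaks[y, x, c]
        ox, oy = offset[y, x]                # sub-pixel correction
        w, h = size[y, x]                    # predicted width and height
        cx, cy = (x + ox) * R, (y + oy) * R  # keypoint back in image space
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, score, c))
    return boxes

Applied to the encoded tensors from the previous sketch, the first returned box is (83, 17, 163, 97) with score 1.0 for class 17, which recovers the original ground truth.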
Loss Functions in CenterNet
As the CenterNet model has three output heads, it also needs three loss functions.
- Heatmap Loss
- Offset Loss
- Object Size Loss
Let us use the same notation as in the ground truth encoding, and denote the corresponding predictions with a hat (e.g., $\hat{Y}$).
Heatmap Loss:

$$L_k = \frac{-1}{N}\sum_{xyc}\begin{cases}\left(1 - \hat{Y}_{xyc}\right)^{\alpha} \log\left(\hat{Y}_{xyc}\right) & \text{if } Y_{xyc} = 1\\[4pt]\left(1 - Y_{xyc}\right)^{\beta}\left(\hat{Y}_{xyc}\right)^{\alpha}\log\left(1 - \hat{Y}_{xyc}\right) & \text{otherwise}\end{cases} \qquad (5)$$

where $N$ is the number of bounding boxes (keypoints) in the image.
$\alpha$ and $\beta$ are hyper-parameters of the focal loss ($\alpha = 2$ and $\beta = 4$ in the paper).
The first part, where $Y_{xyc} = 1$: $\left(1 - \hat{Y}_{xyc}\right)^{\alpha} \log\left(\hat{Y}_{xyc}\right)$.
It is a focal loss for the positive class.
Focal loss is a modified version of the cross-entropy loss that discounts the cross-entropy by a factor $\left(1 - \hat{Y}_{xyc}\right)^{\alpha}$. You can notice that if $\hat{Y}_{xyc}$ is close to zero, then $\left(1 - \hat{Y}_{xyc}\right)^{\alpha}$ is close to one, so the loss is barely discounted. However, if $\hat{Y}_{xyc}$ is high, then $\left(1 - \hat{Y}_{xyc}\right)^{\alpha}$ becomes very small, and the loss is discounted heavily.
In other words, the loss contribution from a confident prediction is much less compared to a not-confident one.
Hence, focal loss is powerful in dealing with class imbalance.
The second part, where $Y_{xyc} \neq 1$: $\left(1 - Y_{xyc}\right)^{\beta}\left(\hat{Y}_{xyc}\right)^{\alpha}\log\left(1 - \hat{Y}_{xyc}\right)$.
Let's break it into two:
- The focal loss component for the negative class: $\left(\hat{Y}_{xyc}\right)^{\alpha}\log\left(1 - \hat{Y}_{xyc}\right)$
- The multiplying factor: $\left(1 - Y_{xyc}\right)^{\beta}$
The first part is a focal loss for the negative class.
The second, multiplying factor is a discounting factor for the negative-class focal loss. The value of $Y_{xyc}$ is high (close to one) near the bounding box center, so $\left(1 - Y_{xyc}\right)^{\beta}$ is small near the bounding box center and increases as we move away from it. This means that points near the bounding box center are penalized less as negatives than those farther away.
Offset Loss: For the offset, it uses the mean absolute error (MAE).

$$L_{off} = \frac{1}{N}\sum_{p}\left|\hat{O}_{\tilde{p}} - \left(\frac{p}{R} - \tilde{p}\right)\right| \qquad (6)$$

It calculates the loss only at the bounding box centers (keypoints) in output space.
Object size loss: It also uses the mean absolute error (MAE).

$$L_{size} = \frac{1}{N}\sum_{k=1}^{N}\left|\hat{S}_{p_k} - s_k\right| \qquad (7)$$

It calculates the loss only at the bounding box centers (keypoints) in output space; other points do not have width and height labels.
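Here is a compact NumPy sketch of the three losses, following equations (5)-(7). The values $\alpha = 2$ and $\beta = 4$ and the loss weights in the final comment are from the paper; the clipping epsilon and the function names are assumptions made for this sketch.

import numpy as np

def heatmap_focal_loss(Y, Y_hat, alpha=2, beta=4, eps=1e-7):
    # Penalty-reduced pixel-wise focal loss, eq. (5).
    Y_hat = np.clip(Y_hat, eps, 1 - eps)
    pos = Y == 1
    N = max(pos.sum(), 1)  # number of keypoints in the image
    pos_loss = ((1 - Y_hat) ** alpha * np.log(Y_hat))[pos].sum()
    neg_loss = ((1 - Y) ** beta * Y_hat ** alpha * np.log(1 - Y_hat))[~pos].sum()
    return -(pos_loss + neg_loss) / N

def l1_loss_at_keypoints(target, pred, keypoint_mask):
    # MAE computed only at bounding box centers, eqs. (6) and (7).
    N = max(keypoint_mask.sum(), 1)
    return np.abs(pred - target)[keypoint_mask].sum() / N

# Total detection loss with the paper's weights:
# L_det = L_k + 0.1 * L_size + 1.0 * L_off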
CenterNet Model (Backbone Architecture)
CenterNet uses three different backbones in their experiments (fig. 11):
- ResNet
- Deep Layer Aggregation (DLA)
- A Stacked Hourglass
ResNet and DLA are modified: up-convolution layers are added to increase the output resolution, and deformable convolutions are added to make the model more robust to geometric variations of objects.
CenterNet Pose Estimation
Before seeing how CenterNet solves pose estimation, let us restate how it solved the object detection problem.
CenterNet predicts the following to solve the object detection problem:
- Heatmap: Each class has one channel to predict the corresponding class's keypoints. Here, the keypoints are the bounding box centers.
- Offset: Further, offset fixes key points’ precision.
- Object size: It predicts two features, the object’s width and height at its bounding box center.
Have a look at the image below (fig. 12).
Joint Heatmap: Let us treat the joints as different class labels in pose estimation. So the heatmap needs $k$ channels (e.g., 17 for COCO human pose) to predict these classes.
Joint Offset: It improves the precision of joints’ key points.
Joint Locations: It predicts the locations of the joints relative to the object center. To represent $k$ points in 2D space, it needs $2k$ numbers. So it predicts $2k$ features at the object center.
The above explanation is about encoding/prediction. What about decoding?
It gets keypoints from the joint heatmap that exceed a confidence threshold of, say, 0.1.
Keypoints relative to the object center are also predicted as joint locations.
These two predictions are matched using the minimum distance for each joint class. However, the heatmap prediction gives the final keypoint locations; essentially, this exercise maps the detected keypoints to their object center.
Finally, you have the object center and the keypoints belonging to that object center, which completes the decoding.
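A rough NumPy sketch of this grouping step for one detected person: each regressed joint (object center plus predicted displacement) is snapped to the nearest confident heatmap keypoint of the same joint class. The data layout (a per-joint list of (x, y, score) candidates) and the fallback to the regressed location are assumptions made for illustration; the nearest-keypoint matching and the 0.1 threshold follow the description above.

import numpy as np

def snap_joints(center, joint_disp, joint_candidates, thresh=0.1):
    # center: (2,) object center; joint_disp: (k, 2) predicted displacements;
    # joint_candidates: list of k arrays of (x, y, score) heatmap keypoints.
    k = joint_disp.shape[0]
    joints = np.zeros((k, 2))
    for j in range(k):
        regressed = center + joint_disp[j]        # joint location regressed from the center
        cands = joint_candidates[j]
        cands = cands[cands[:, 2] > thresh]       # keep confident heatmap keypoints only
        if len(cands) == 0:
            joints[j] = regressed                 # fall back to the regressed location
        else:
            d = np.linalg.norm(cands[:, :2] - regressed, axis=1)
            joints[j] = cands[np.argmin(d), :2]   # snap to the nearest detected keypoint
    return joints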
CenterNet Results
Speed vs. Accuracy Plot (fig. 13): the speed-accuracy trade-off for real-time detectors. CenterNet outperforms a range of state-of-the-art algorithms.
Below is the speed-accuracy trade-off of different CenterNet models (fig. 14).
Here, AP is Average Precision.
Three test-time augmentation settings are used for the AP calculation:
- Inference with the original image (NA).
- Inference with the original and flipped images; the final prediction is the average of the two predictions (F).
- Inference with the original, flipped, and multi-scale images; the final prediction uses NMS (MS).
The FPS numbers are reported with image size 512 x 512 on a TITAN V GPU.
Sample Prediction
Below is the prediction from the CenterNet model for object detection (1st row) and pose estimation (2nd and 3rd row) (fig. 15).
CenterNet model Inference using TensorFlow
The authors have provided the code on GitHub [6]. However, we will use TensorFlow Hub [7] to download the pre-trained CenterNet models and make inferences.
import cv2
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import urllib
Select CenterNet Models
TensorFlow Hub has the following CenterNet pre-trained models:
- ResNet18
- ResNet50
- ResNet101
- DLA-34
- Hourglass104
Here, we will use ResNet101 and Hourglass104 for inference.
models = {
'Resnet101':'https://tfhub.dev/tensorflow/centernet/resnet101v1_fpn_512x512/1',
'HourGlass104':'https://tfhub.dev/tensorflow/centernet/hourglass_512x512/1'
}
Download Images
Image URLs.
images = [
'https://farm7.staticflickr.com/6073/6032446158_85fa667cd2_z.jpg',
'https://farm9.staticflickr.com/8538/8678472399_886f8eabec_z.jpg',
'https://farm6.staticflickr.com/5485/10028794463_d8cbb38932_z.jpg',
'https://farm4.staticflickr.com/3057/2475401198_0a342a907e_z.jpg'
]
Download images.
# Download images.
for i in range(len(images)):
urllib.request.urlretrieve(images[i], "img{}.jpg".format(i+1))
Read images.
# Read Images.
img1 = cv2.imread('img1.jpg')
img2 = cv2.imread('img2.jpg')
img3 = cv2.imread('img3.jpg')
img4 = cv2.imread('img4.jpg')
Plot images.
def plot_images(img_list, title=None, row=1, column=2,
fig_size=(10, 15)):
plt.figure(figsize=fig_size)
for i, img in enumerate(img_list):
plt.subplot(row, column, i+1)
plt.imshow(img[...,::-1])
plt.axis('off')
plt.title(title[i] if title else 'img{}'.format(i+1))
plt.show()
image_list = [img1, img2, img3, img4]
plot_images(image_list, row=2, column=2, fig_size=(15, 10))
COCO Classes
The TensorFlow models are trained on the COCO dataset, so we need a map from class ID to class name.
category_index = {1: 'person', 2: 'bicycle', 3: 'car', 4: 'motorcycle',
5: 'airplane', 6: 'bus', 7: 'train', 8: 'truck', 9: 'boat',
10: 'traffic light', 11: 'fire hydrant', 13: 'stop sign',
14: 'parking meter', 15: 'bench', 16: 'bird', 17: 'cat',
18: 'dog', 19: 'horse', 20: 'sheep', 21: 'cow',
22: 'elephant', 23: 'bear', 24: 'zebra', 25: 'giraffe',
27: 'backpack', 28: 'umbrella', 31: 'handbag', 32: 'tie',
33: 'suitcase', 34: 'frisbee', 35: 'skis', 36: 'snowboard',
37: 'sports ball', 38: 'kite', 39: 'baseball bat',
40: 'baseball glove', 41: 'skateboard', 42: 'surfboard',
43: 'tennis racket', 44: 'bottle', 46: 'wine glass',
47: 'cup', 48: 'fork', 49: 'knife', 50: 'spoon', 51: 'bowl',
52: 'banana', 53: 'apple', 54: 'sandwich', 55: 'orange',
56: 'broccoli', 57: 'carrot', 58: 'hot dog', 59: 'pizza',
60: 'donut', 61: 'cake', 62: 'chair', 63: 'couch',
64: 'potted plant', 65: 'bed', 67: 'dining table',
70: 'toilet', 72: 'tv', 73: 'laptop', 74: 'mouse',
75: 'remote', 76: 'keyboard', 77: 'cell phone',
78: 'microwave', 79: 'oven', 80: 'toaster', 81: 'sink',
82: 'refrigerator', 84: 'book', 85: 'clock', 86: 'vase',
87: 'scissors', 88: 'teddy bear', 89: 'hair drier',
90: 'toothbrush'}
Class IDs to Color IDs
Let us create color IDs for each class so we can use them to plot bounding boxes of different classes with different colors.
R = np.array(np.arange(0, 256, 63))
G = np.roll(R, 2)
B = np.roll(R, 4)
COLOR_IDS = np.array(np.meshgrid(R, G, B)).T.reshape(-1, 3)
Load Model from TF Hub
To load the model from TF Hub, use the following code.
# ResNet101.
resnet = hub.load(models['Resnet101'])
# Hourglass104.
hourglass = hub.load(models['HourGlass104'])
Run Inference
An inference outcome is a dictionary with the following keys:
- Number of detections
- Detection boxes
- Detection scores
- Detection classes
Let’s make an inference.
# Hourglass104 inference.
result_hourglass = hourglass(np.array([img1]))
Let’s print the keys of the predicted dictionary.
result_hourglass.keys()
dict_keys(['detection_classes', 'detection_scores', 'num_detections', 'detection_boxes'])
Let’s check the shapes of the prediction.
print('num_detections shape\t:{}'.format(result_hourglass['num_detections'].shape))
print('detection_boxes shape\t:{}'.format(result_hourglass['detection_boxes'].shape))
print('detection_scores shape\t:{}'.format(result_hourglass['detection_scores'].shape))
print('detection_classes shape\t:{}'.format(result_hourglass['detection_classes'].shape))
num_detections shape :(1,)
detection_boxes shape :(1, 100, 4)
detection_scores shape :(1, 100)
detection_classes shape :(1, 100)
You can see that the total number of bounding boxes is 100.
Function to Convert Tensors to Numpy
Let’s write a function to convert the TF tensor to NumPy. The function takes the model prediction and returns boxes, scores, and classes.
def to_numpy(prediction):
    # Drop the batch dimension and convert the TF tensors to NumPy arrays.
    bboxes = prediction['detection_boxes'][0].numpy()
    scores = prediction['detection_scores'][0].numpy()
    # Class IDs are integers.
    classes = prediction['detection_classes'][0].numpy().astype(int)
    return bboxes, scores, classes
print_count = 5
bboxes, scores, classes = to_numpy(result_hourglass)
print('detection_boxes:\n{}'.format(bboxes[:print_count]))
print('detection_scores:\n{}'.format(scores[:print_count]))
print('detection_classes:\n{}'.format(classes[:print_count]))
detection_boxes:
[[0.36475214 0.2544147 0.8116531 0.7097043 ]
[0.15214191 0.6827493 0.84660524 0.82337785]
[0.30334035 0.85521215 0.4586329 0.94833416]
[0.36117637 0.19265035 0.48011312 0.2846129 ]
[0.390655 0.6316224 0.4466164 0.7015844 ]]
detection_scores:
[0.95722336 0.9190534 0.575393 0.48818225 0.46633196]
detection_classes:
[22 1 62 15 44]
Function to Filter Confident Predictions
def filter_detections_on_score(boxes, scores, classes, score_thresh=0.3):
ids = np.where(scores >= score_thresh)
return boxes[ids], scores[ids], classes[ids]
score_thresh = 0.30
bboxes, scores, classes = filter_detections_on_score(bboxes, scores, classes,
score_thresh)
print('detection_boxes:\n{}'.format(bboxes))
print('detection_scores:\n{}'.format(scores))
print('detection_classes:\n{}'.format(classes))
detection_boxes:
[[0.36475214 0.2544147 0.8116531 0.7097043 ]
[0.15214191 0.6827493 0.84660524 0.82337785]
[0.30334035 0.85521215 0.4586329 0.94833416]
[0.36117637 0.19265035 0.48011312 0.2846129 ]
[0.390655 0.6316224 0.4466164 0.7015844 ]
[0.3120035 0.02235281 0.52857524 0.1586653 ]
[0.3826255 0.02449112 0.49790186 0.15688817]
[0.2699546 0.7290414 0.30082417 0.75153786]]
detection_scores:
[0.95722336 0.9190534 0.575393 0.48818225 0.46633196 0.36656532
0.36645815 0.31047532]
detection_classes:
[22 1 62 15 44 15 15 32]
Function to Convert Normalized Outputs to Pixel
You can see that the bounding box coordinates are in normalized form. So let's write a function to convert them into pixel coordinates.
def normalize_to_pixels_bboxs(bboxes, img):
img_height, img_width, _ = img.shape
bboxes[:, 0] *= img_height
bboxes[:, 1] *= img_width
bboxes[:, 2] *= img_height
bboxes[:, 3] *= img_width
return bboxes.astype(int)
bboxes = normalize_to_pixels_bboxs(bboxes, img1)
print('detection_boxes:\n{}'.format(bboxes))
detection_boxes:
[[184 162 409 454]
[ 76 436 427 526]
[153 547 231 606]
[182 123 242 182]
[197 404 225 449]
[157 14 266 101]
[193 15 251 100]
[136 466 151 480]]
Function to Annotate Detections
def add_prediction_to_image(img, bboxes, scores, classes, id_class_map=category_index, colors=COLOR_IDS):
img_with_bbox = img.copy()
for box, score, cls in zip(bboxes, scores, classes):
top, left, bottom, right = box
class_name = id_class_map[cls]
# Bounding box annotations.
color = tuple(colors[cls % len(COLOR_IDS)].tolist())[::-1]
img_with_bbox = cv2.rectangle(img_with_bbox, (left, top), (right, bottom), color, thickness=2)
display_txt = '{}: {:.2f}'.format(class_name, score)
((text_width, text_height), _) = cv2.getTextSize(display_txt, cv2.FONT_HERSHEY_SIMPLEX, 1.0, 2)
img_with_bbox = cv2.rectangle(img_with_bbox, (left, top - int(0.9 * text_height)), (left + int(0.4*text_width), top), color, thickness=-1)
img_with_bbox = cv2.putText(img_with_bbox, display_txt, (left, top - int(0.3 * text_height)), cv2.FONT_HERSHEY_SIMPLEX, 0.4, (0, 0, 0), 1)
return img_with_bbox
annotated_img = add_prediction_to_image(img1, bboxes, scores, classes)
plot_images([annotated_img], ['Hourglass 104'],row=1, column=1, fig_size=(10, 10))
Wrap-Up Inference, Annotation, and Plotting
It’s time to put everything together.
The following function accepts an image, a model, and a score threshold. The image is forward passed through the model to get all the bounding box predictions. These are further filtered by the function filter_detections_on_score and annotated by the function add_prediction_to_image. The function returns the image with the predictions annotated.
def infer_and_add_prediction_to_image(img, model, score_thresh=0.3):
prediction = model(np.array([img]))
bboxes, scores, classes = to_numpy(prediction)
bboxes, scores, classes = filter_detections_on_score(bboxes, scores, classes,
score_thresh)
boxes = normalize_to_pixels_bboxs(bboxes, img)
img_with_bboxes = add_prediction_to_image(img, boxes, scores, classes)
return img_with_bboxes
Let’s use the above function with img1 and the ResNet model.
annotated_img = infer_and_add_prediction_to_image(img1, resnet)
plot_images([annotated_img], ['ResNet 101'], row=1, column=1, fig_size=(10, 10))
Let’s write a wrapper function that takes an image, makes inferences on both Hourglass and ResNet models, and plots filtered detection.
def show_hourglass_resnet_inference(img, score_thresh=0.3):
hourglass_infer = infer_and_add_prediction_to_image(img, hourglass)
resnet_infer = infer_and_add_prediction_to_image(img, resnet)
image_list = [hourglass_infer, resnet_infer]
titles = ['Hourglass 104', 'ResNet 101']
plot_images(image_list, titles, row=1, column=2, fig_size=(20, 10))
Results
Let’s use the above function for all four images one by one.
show_hourglass_resnet_inference(img1)
show_hourglass_resnet_inference(img2)
show_hourglass_resnet_inference(img3)
show_hourglass_resnet_inference(img4)
Here, you can notice that the Hourglass 104 model performs better than ResNet 101.
Conclusion
- CenterNet (Objects as Points) is an anchor-free object detector. Because of its generalizability, it can be used for human pose estimation, 3D detection, and much more.
- It selects predicted objects from heatmap peaks, which removes the need for NMS.
- TensorFlow Hub has pre-trained CenterNet models with Hourglass, ResNet, and DLA backbones that can be used for inference and fine-tuning.
Must Read
- YOLOX Object Detector Paper Explanation and Custom Training
- YOLOv6 Object Detection – Paper Explanation and Inference
- YOLOv7 Object Detection Paper Explanation and Inference
- Fine Tuning YOLOv7 on Custom Dataset
- Multiple Object Tracking using FairMOT (based on CenterNet)