Speeding up Dlib’s Facial Landmark Detector

In this tutorial I will explore a few ways to speed up Dlib’s Facial Landmark Detector.

Dlib’s Facial Landmark Detector

Dlib has a very good implementation of a very fast facial landmark detector. I reviewed it in my post titled Facial Landmark Detection.

Subsequently, I wrote a series of posts that utilize Dlib’s facial landmark detector.

There are two example files in Dlib that deal with facial landmark detection:

For Images : dlib/examples/face_landmark_detection_ex.cpp
For Videos : dlib/examples/webcam_face_pose_ex.cpp

The tricks used in this post are included in my version of Dlib in the following files:

For Images : dlib/examples/face_landmark_detection_to_file.cpp
For Videos : dlib/examples/webcam_face_pose_fast.cpp

This post fully explains all the tricks and provides snippets of code. To get access to the above files, and code and images used in all other posts please subscribe to our newsletter.

About the only complaint I have heard from readers of this blog about Dlib’s facial landmark detector is that it is slow. Is it really slow? Yes and no. Out of the box it appears to be slow, but that is not because of a bad implementation of the facial landmark detector. Let’s find the bottlenecks and see how to improve the speed.

How to make Dlib’s Facial Landmark Detector faster?

Dlib’s facial landmark detector implements a paper that can detect landmarks in just 1 millisecond! That is 1000 frames a second. You will never get 1000 fps in practice because you first need to detect the face before doing landmark detection, and that takes a few tens of milliseconds. But you can easily reach 30 fps with the optimizations listed below.
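
To make the arithmetic concrete, here is a small sketch that converts the time spent on one frame into a frame rate. The helper name frames_per_second is hypothetical, and the 30 ms detection time used in the note after the code is an illustrative figure, not a measurement.

```cpp
#include <stdexcept>

// Hypothetical helper: converts the time spent on one frame
// (face detection plus landmark detection, in milliseconds)
// into frames per second.
double frames_per_second(double detection_ms, double landmark_ms)
{
    const double total_ms = detection_ms + landmark_ms;
    if (total_ms <= 0.0)
        throw std::invalid_argument("total time per frame must be positive");
    return 1000.0 / total_ms;
}
```

With landmark detection alone, frames_per_second(0.0, 1.0) gives 1000 fps. Adding an assumed 30 ms of face detection, frames_per_second(30.0, 1.0) gives roughly 32 fps, which is why face detection is the part worth optimizing.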

Compile Dlib in Release Mode with Optimizations turned on

As mentioned in Dlib’s documentation, it is critical to compile Dlib in release mode with the appropriate compiler instructions turned on.
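
One rough way to confirm which mode a binary was built in is to check the NDEBUG macro. The sketch below assumes the common convention that Release configurations define NDEBUG (CMake's default Release flags and Visual Studio's default Release configuration do). The function names are hypothetical, and the check says nothing about which optimization flags were actually used.

```cpp
// True when this translation unit was compiled with NDEBUG defined.
// This is a convention, not a guarantee: a project can define or
// undefine NDEBUG in any configuration.
constexpr bool built_with_ndebug()
{
#ifdef NDEBUG
    return true;
#else
    return false;
#endif
}

// Human-readable label for logging at program start.
const char* build_mode_name()
{
    return built_with_ndebug() ? "release-like (NDEBUG defined)"
                               : "debug-like (NDEBUG not defined)";
}
```

Printing build_mode_name() once at startup is a cheap way to catch an accidental debug build before you start measuring frame rates.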

Using CMake

cd dlib/examples
mkdir build
cd build

# Enable compiler instructions.
# In the example below I have enabled SSE4
# Use the one that is appropriate for you

# SSE2 works for most Intel or AMD chips.
# cmake .. -DUSE_SSE2_INSTRUCTIONS=ON

# SSE4 works for most current machines
cmake .. -DUSE_SSE4_INSTRUCTIONS=ON

# AVX works on processors released after 2011.
# cmake .. -DUSE_AVX_INSTRUCTIONS=ON

# Compile in release mode
cmake --build . --config Release

If you are using an Intel or AMD chip, enable at least SSE2 instructions. AVX is the fastest but requires a CPU released in 2011 or later. SSE4 is the next fastest and is supported by most current machines.
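
If you are unsure which instruction sets your build actually targets, a small check like the one below can help. It is a sketch, not part of Dlib: the function name enabled_simd is hypothetical, and it relies on the predefined macros that GCC and Clang set for the instruction sets enabled at compile time. MSVC defines __AVX__ when /arch:AVX is used but has no SSE4 macro, so an incomplete result there means "unknown" and not "unsupported".

```cpp
#include <string>

// Returns the SIMD instruction sets the compiler was told to target for
// this translation unit, based on predefined compiler macros.
std::string enabled_simd()
{
    std::string result;
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    result += "SSE2 ";
#endif
#if defined(__SSE4_1__)
    result += "SSE4.1 ";
#endif
#if defined(__AVX__)
    result += "AVX ";
#endif
    return result.empty() ? "none detected" : result;
}
```

Note that this reports what the compiler was asked to generate for the file containing the function. It does not test what the CPU supports at run time.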

Using Visual Studio

People often make this mistake while using Visual Studio because by default they are working in debug mode. You can see a detailed explanation and how to fix it here.

Using Qt

Similarly, while using Qt you need to turn on Release mode as shown below.

Speed Up Face Detection

The following steps will help speed up face detection with a small (probably negligible) loss in accuracy.

Resize Frame

Facial landmark detection algorithms usually require the user to provide a bounding box containing a face. The algorithm takes this box as input and returns the landmarks. The time reported for these algorithms covers only landmark detection, not face detection. Landmark detection can run in less than 5 milliseconds, but face detection can take much longer (around 30 milliseconds). The speed of face detection depends on the resolution of the image because, with smaller images, you search over a smaller range of face sizes. The downside is that you will miss smaller faces, but in most of the applications I have listed above we have one person looking at the webcam from arm’s length.

An easy way to speed up face detection is to resize the frame. My webcam records video at 720p (i.e. 1280×720) resolution, and I shrink each dimension to a quarter of that for face detection. The bounding box obtained on the small frame must then be mapped back to full resolution by dividing its coordinates by the resize scale (equivalently, multiplying them by the downsample ratio, 4 in this case). This allows us to do facial landmark detection at full resolution.
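
The coordinate mapping can be sketched without any library types. In the snippet below, Box and FrameSize are hypothetical stand-ins (dlib::rectangle and cv::Size play these roles in the real pipeline), and the helper names are made up for illustration.

```cpp
// Minimal stand-in for a face bounding box.
struct Box { long left, top, right, bottom; };

// Minimal stand-in for a frame size in pixels.
struct FrameSize { int width, height; };

// Size of the frame after shrinking each dimension by `ratio`.
FrameSize downsampled_size(int width, int height, int ratio)
{
    return FrameSize{ width / ratio, height / ratio };
}

// Maps a box found on the small frame back to full-resolution coordinates.
Box to_full_resolution(const Box& small_box, long ratio)
{
    return Box{ small_box.left * ratio, small_box.top * ratio,
                small_box.right * ratio, small_box.bottom * ratio };
}
```

For a 1280×720 frame and a ratio of 4, detection runs on a 320×180 image, which has 16 times fewer pixels. A box with corners (10, 20) and (50, 60) on the small image maps to (40, 80) and (200, 240) on the original.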

Skip frame

Typically webcams record video at 30 fps. In a typical application you are sitting right in front of the webcam and not moving much, so there is no need to detect the face in every frame. We can simply do facial landmark detection based on the facial bounding box obtained a few frames earlier. If you run face detection only every third frame, the detection cost per frame drops to about a third, which speeds up the overall pipeline by almost three times.

Is it possible to do better than reusing the previous location of the face? Yes, we can use Kalman filtering to predict the location of the face in frames where detection is not done, but in a webcam application that is overkill.
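
The saving from skipping frames can be estimated with a short calculation. The sketch below uses hypothetical helper names and assumes illustrative costs (30 ms for detection, 1 ms for landmarks) instead of measured ones. It uses the same rule as the code later in this post: detection runs only on frames where the frame counter is a multiple of the skip interval.

```cpp
#include <stdexcept>

// Number of frames, out of `total_frames`, on which the face detector runs
// when detection happens only if count % detect_every == 0.
int detector_runs(int total_frames, int detect_every)
{
    if (total_frames <= 0 || detect_every <= 0)
        throw std::invalid_argument("arguments must be positive");
    int runs = 0;
    for (int count = 0; count < total_frames; ++count)
        if (count % detect_every == 0)
            ++runs;
    return runs;
}

// Average cost per frame in milliseconds: detection is paid only on the
// frames counted above, landmark detection is paid on every frame.
double average_frame_ms(int total_frames, int detect_every,
                        double detection_ms, double landmark_ms)
{
    const double total_ms =
        detector_runs(total_frames, detect_every) * detection_ms +
        total_frames * landmark_ms;
    return total_ms / total_frames;
}
```

Over 30 frames with detection on every third frame, the detector runs 10 times, and with the assumed costs the average drops from 31 ms per frame to 11 ms per frame, a speedup of about 2.8 times.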

The code for the above optimizations is shown below.

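#include <vector>
#include <opencv2/opencv.hpp>
#include <dlib/opencv.h>
#include <dlib/image_processing.h>
#include <dlib/image_processing/frontal_face_detector.h>
// Custom renderer from the "Custom Face Renderer" section below.
// The header file name is an assumption; use the name you saved it under.
#include "render_face.hpp"

using namespace dlib;

#define FACE_DOWNSAMPLE_RATIO 4
#define SKIP_FRAMES 2

int main()
{
    cv::VideoCapture cap(0);
    if (!cap.isOpened()) return 1;

    cv::Mat im;
    cv::Mat im_small, im_display;

    frontal_face_detector detector = get_frontal_face_detector();
    shape_predictor pose_model;
    deserialize("shape_predictor_68_face_landmarks.dat") >> pose_model;

    int count = 0;
    // Kept outside the loop so that skipped frames reuse the last detections
    std::vector<rectangle> faces;

    while (true)
    {
        // Grab a frame
        cap >> im;
        if (im.empty()) break;

        // Resize image for face detection
        cv::resize(im, im_small, cv::Size(), 1.0/FACE_DOWNSAMPLE_RATIO, 1.0/FACE_DOWNSAMPLE_RATIO);

        // Change to dlib's image format. No memory is copied.
        cv_image<bgr_pixel> cimg_small(im_small);
        cv_image<bgr_pixel> cimg(im);

        // Detect faces on the resized image, once every SKIP_FRAMES frames
        if ( count % SKIP_FRAMES == 0 )
        {
            faces = detector(cimg_small);
        }

        // Find the pose of each face.
        std::vector<full_object_detection> shapes;
        for (unsigned long i = 0; i < faces.size(); ++i)
        {
            // Scale the rectangle back up to the full resolution image.
            rectangle r(
                (long)(faces[i].left() * FACE_DOWNSAMPLE_RATIO),
                (long)(faces[i].top() * FACE_DOWNSAMPLE_RATIO),
                (long)(faces[i].right() * FACE_DOWNSAMPLE_RATIO),
                (long)(faces[i].bottom() * FACE_DOWNSAMPLE_RATIO)
            );

            // Landmark detection on full sized image
            full_object_detection shape = pose_model(cimg, r);
            shapes.push_back(shape);

            // Custom Face Render
            render_face(im, shape);
        }

        // Display at half resolution (see "Optimizing Display" below)
        cv::resize(im, im_display, cv::Size(), 0.5, 0.5);
        cv::imshow("Fast Facial Landmark Detector", im_display);
        if (cv::waitKey(1) == 27) break;   // Esc quits

        // Advance the frame counter used for skipping detection
        count++;
    }
    return 0;
}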

Optimizing Display

When I first tried speeding up the facial landmark detector, I was surprised to find that a third of the time was spent in drawing the landmarks and displaying the frame. I made two optimizations that helped speed things up.

Resize Frame

I resized the image to half resolution for display. This makes a huge difference because when the resolution is changed from 720p to 360p, the actual number of pixels that need to be displayed goes down by a factor of 4.

Custom Face Renderer

Dlib’s face renderer didn’t work very well for me; the frames did not render smoothly. So I wrote my own using OpenCV’s polylines. The code is shown below.

#ifndef BIGVISION_RENDER_FACE_H_
#define BIGVISION_RENDER_FACE_H_
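#include <vector>
#include <dlib/image_processing/frontal_face_detector.h>
#include <opencv2/imgproc/imgproc.hpp>   // cv::polylines
#include <opencv2/highgui/highgui.hpp>

// Draws landmarks start..end (inclusive) as one polyline.
// Declared inline so the header can be included from more than one source file.
inline void draw_polyline(cv::Mat &img, const dlib::full_object_detection& d, const int start, const int end, bool isClosed = false)
{
    std::vector <cv::Point> points;
    for (int i = start; i <= end; ++i)
    {
        points.push_back(cv::Point(d.part(i).x(), d.part(i).y()));
    }
    // Thickness 2, line type 16 (anti-aliased)
    cv::polylines(img, points, isClosed, cv::Scalar(255,0,0), 2, 16);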

}

// Draws all 68 landmarks of one face as a set of polylines.
inline void render_face (cv::Mat &img, const dlib::full_object_detection& d)
{
    DLIB_CASSERT
    (
     d.num_parts() == 68,
     "\n\t Invalid inputs were given to this function. "
     << "\n\t d.num_parts():  " << d.num_parts()
     );

    draw_polyline(img, d, 0, 16);           // Jaw line
    draw_polyline(img, d, 17, 21);          // Left eyebrow
    draw_polyline(img, d, 22, 26);          // Right eyebrow
    draw_polyline(img, d, 27, 30);          // Nose bridge
    draw_polyline(img, d, 30, 35, true);    // Lower nose
    draw_polyline(img, d, 36, 41, true);    // Left eye
    draw_polyline(img, d, 42, 47, true);    // Right Eye
    draw_polyline(img, d, 48, 59, true);    // Outer lip
    draw_polyline(img, d, 60, 67, true);    // Inner lip

}

#endif // BIGVISION_RENDER_FACE_H_

I also tried rendering all the points using a single polyline hoping to see some improvement in speed, but there was no difference in speed at all.

Results

Using the above optimizations I am able to get a speed of 70 fps on videos recorded at 120 fps. On my webcam I get 27-30 fps because we are limited by the recording speed of the webcam. The reported numbers include the time needed to read the frame from camera or video file, face detection, facial landmark detection and display at half resolution.
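
If you want to reproduce this kind of measurement, a simple frame counter built on std::chrono is enough. The class below is a minimal sketch with a hypothetical name (FpsCounter); it is not the code used to produce the numbers above.

```cpp
#include <chrono>

// Hypothetical helper that reports the average frames per second
// since it was constructed.
class FpsCounter
{
    using Clock = std::chrono::steady_clock;

public:
    FpsCounter() : start_(Clock::now()), frames_(0) {}

    // Call once per frame, after the frame has been read, processed
    // and displayed, so that all of those costs are included.
    void tick() { ++frames_; }

    long frames() const { return frames_; }

    // Average frames per second so far; 0 if no time has elapsed yet.
    double fps() const
    {
        const std::chrono::duration<double> elapsed = Clock::now() - start_;
        return elapsed.count() > 0.0 ? frames_ / elapsed.count() : 0.0;
    }

private:
    Clock::time_point start_;
    long frames_;
};
```

Create the counter just before the capture loop, call tick() at the end of every iteration, and print fps() every few hundred frames. Because the timer spans the whole loop body, the result includes frame capture and display as well as detection.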