In this tutorial I will explore a few ways to speed up Dlib’s Facial Landmark Detector.
Dlib’s Facial Landmark Detector
Dlib has a very good implementation of a very fast facial landmark detector. I reviewed it in my post titled Facial Landmark Detection.
Subsequently, I wrote a series of posts that utilize Dlib’s facial landmark detector.
There are two example files in Dlib that deal with facial landmark detection:
- For Images : dlib/examples/face_landmark_detection_ex.cpp
- For Videos : dlib/examples/webcam_face_pose_ex.cpp
The corresponding modified versions that accompany this post are:
- For Images : dlib/examples/face_landmark_detection_to_file.cpp
- For Videos : dlib/examples/webcam_face_pose_fast.cpp
This post fully explains all the tricks and provides snippets of code. To get access to the above files, as well as the code and images used in all other posts, please subscribe to our newsletter.
About the only complaint I have heard from readers of this blog about Dlib’s facial landmark detector is that it is slow. Is it really slow? Yes and no. Out of the box it appears to be slow, but that is not because of a bad implementation of the facial landmark detector. Let’s find the bottlenecks and see how to improve the speed.
How to make Dlib’s Facial Landmark Detector faster?
Dlib’s facial landmark detector implements a paper that can detect landmarks in just 1 millisecond! That is 1000 frames a second. You will never get 1000 fps in practice, because you first need to detect the face before doing landmark detection, and that takes a few tens of milliseconds. But you can easily reach 30 fps with the optimizations listed below.
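To see where the time actually goes, it helps to time each stage of the pipeline separately. Below is a minimal sketch of how that could be done with std::chrono. The names time_ms and fps_from_ms are hypothetical helpers introduced here for illustration; they are not part of Dlib or OpenCV.

```cpp
#include <cassert>
#include <chrono>
#include <utility>

// Hypothetical helper (not part of Dlib or OpenCV): runs `stage` once and
// returns the elapsed wall-clock time in milliseconds.
template <typename Stage>
double time_ms(Stage&& stage)
{
    const auto start = std::chrono::steady_clock::now();
    std::forward<Stage>(stage)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Converts a per-frame cost in milliseconds into frames per second.
inline double fps_from_ms(double ms_per_frame)
{
    assert(ms_per_frame > 0.0);
    return 1000.0 / ms_per_frame;
}
```

Wrapping the face detection call and the landmark detection call in separate lambdas passed to time_ms should show which one dominates. A stage that costs 1 ms per frame corresponds to 1000 fps, while one that costs 40 ms caps the pipeline at 25 fps.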
Compile Dlib in Release Mode with Optimizations turned on
As mentioned in Dlib’s documentation, it is critical to compile Dlib in release mode with the appropriate compiler instructions turned on.
Using CMake
cd dlib/examples
mkdir build
cd build
# Enable compiler instructions.
# In the example below I have enabled SSE4
# Use the one that is appropriate for you
# SSE2 works for most Intel or AMD chips.
# cmake .. -DUSE_SSE2_INSTRUCTIONS=ON
# SSE4 works for most current machines
cmake .. -DUSE_SSE4_INSTRUCTIONS=ON
# AVX works on processors released in 2011 or later.
# cmake .. -DUSE_AVX_INSTRUCTIONS=ON
# Compile in release mode
cmake --build . --config Release
If you are using an Intel or AMD chip, enable at least SSE2 instructions. AVX is the fastest but requires a CPU from at least 2011. SSE4 is the next fastest and is supported by most current machines.
Using Visual Studio
People often make this mistake while using Visual Studio because by default they are working in debug mode. You can see a detailed explanation and how to fix it here.
Using Qt
Similarly, while using Qt you need to turn on Release mode as shown below.
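If you are unsure whether the flags actually reached the compiler, a quick sanity check is to inspect the compiler's predefined macros. The sketch below assumes GCC or Clang, which define __SSE2__, __SSE4_1__ and __AVX__ when the matching instruction set is enabled; other compilers may use different names. The function names are hypothetical, and the check only reflects the flags of the source file it is compiled in.

```cpp
#include <string>

// Reports which instruction sets this translation unit was compiled for,
// based on macros predefined by GCC and Clang (an assumption; other
// compilers may not define the same names).
inline std::string enabled_simd()
{
    std::string s;
#if defined(__SSE2__)
    s += "SSE2 ";
#endif
#if defined(__SSE4_1__)
    s += "SSE4.1 ";
#endif
#if defined(__AVX__)
    s += "AVX ";
#endif
    return s.empty() ? "none detected" : s;
}

// True when NDEBUG is defined, which release configurations typically do.
inline bool built_with_ndebug()
{
#ifdef NDEBUG
    return true;
#else
    return false;
#endif
}
```

Printing both values at program start is a cheap way to catch an accidental debug build before you start measuring frame rates.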
Speed Up Face Detection
The following steps will help speed up face detection with a small (probably negligible) loss in accuracy.
Resize Frame
Facial landmark detection algorithms usually require the user to provide a bounding box containing a face. The algorithm takes this box as input and returns the landmarks. The time reported for these algorithms covers only landmark detection, not face detection. Landmark detection can run in less than 5 milliseconds, but face detection can take much longer (around 30 milliseconds). The speed of face detection depends on the resolution of the image, because in a smaller image you search a smaller range of face sizes. The downside is that you will miss smaller faces, but in most of the applications I have written about we have one person looking at the webcam from arm’s length.
An easy way to speed up face detection is to resize the frame. My webcam records video at 720p (i.e. 1280×720) and I downsample each dimension by a factor of 4 (to 320×180) for face detection. The bounding box obtained on the small image must then be mapped back to the original frame by multiplying its coordinates by the same factor. This allows us to do facial landmark detection at full resolution.
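As a worked example of mapping a box back to the full frame, here is a minimal sketch that uses a plain struct instead of Dlib's rectangle type so that it compiles on its own. Box and to_full_resolution are hypothetical names used only for illustration.

```cpp
#include <cassert>

// Hypothetical stand-in for a detector rectangle (left, top, right, bottom).
struct Box { long left, top, right, bottom; };

// Maps a box found on the downsampled frame back to full-resolution
// coordinates by multiplying each coordinate by the downsample ratio.
inline Box to_full_resolution(const Box& small, long ratio)
{
    assert(ratio >= 1);
    return Box{ small.left * ratio, small.top * ratio,
                small.right * ratio, small.bottom * ratio };
}
```

With a ratio of 4, a face found at (80, 45) to (160, 125) in the 320×180 image maps to (320, 180) to (640, 500) in the 1280×720 frame.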
Skip frame
Typically webcams record video at 30 fps. In a typical application you are sitting right in front of the webcam and not moving much, so there is no need to detect the face in every frame. We can simply do facial landmark detection using the bounding box obtained a few frames earlier. If you run face detection only every third frame, the average cost of face detection per frame drops to about a third.
Is it possible to do better than reusing the previous location of the face? Yes, we can use Kalman filtering to predict the location of the face in frames where detection is not done, but in a webcam application that is overkill.
The snippet of code for the above optimizations is shown below. The lines to look for are the resize before face detection, the SKIP_FRAMES check and the rescaling of the detected rectangle.
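The frame-skipping rule itself is just a counter and a modulo test. The following sketch isolates that logic so the saving is easy to see; should_detect and detections_in are hypothetical names, not functions from Dlib.

```cpp
#include <cassert>

// Returns true on frames where the (expensive) face detector should run.
// With skip_frames == 3, detection runs on frames 0, 3, 6, ...
inline bool should_detect(int frame_index, int skip_frames)
{
    assert(skip_frames >= 1);
    return frame_index % skip_frames == 0;
}

// Counts how many times the detector runs over the first num_frames frames.
inline int detections_in(int num_frames, int skip_frames)
{
    int runs = 0;
    for (int i = 0; i < num_frames; ++i)
    {
        if (should_detect(i, skip_frames)) ++runs;
    }
    return runs;
}
```

At 30 fps with detection every third frame, the detector runs 10 times per second instead of 30. On the remaining frames the landmark detector reuses the most recent bounding box.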
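#include <dlib/opencv.h>
#include <dlib/image_processing.h>
#include <dlib/image_processing/frontal_face_detector.h>
#include <opencv2/opencv.hpp>
#include <vector>
#include "render_face.hpp" // custom renderer from this post; file name assumed

using namespace dlib;

#define FACE_DOWNSAMPLE_RATIO 4
#define SKIP_FRAMES 2

int main()
{
    cv::VideoCapture cap(0);
    if (!cap.isOpened()) return 1;

    cv::Mat im, im_small, im_display;

    frontal_face_detector detector = get_frontal_face_detector();
    shape_predictor pose_model;
    deserialize("shape_predictor_68_face_landmarks.dat") >> pose_model;

    int count = 0;
    std::vector<rectangle> faces;

    while (true)
    {
        // Grab a frame
        cap >> im;
        if (im.empty()) break;

        // Resize image for face detection
        cv::resize(im, im_small, cv::Size(), 1.0/FACE_DOWNSAMPLE_RATIO, 1.0/FACE_DOWNSAMPLE_RATIO);

        // Change to dlib's image format. No memory is copied.
        cv_image<bgr_pixel> cimg_small(im_small);
        cv_image<bgr_pixel> cimg(im);

        // Detect faces on the resized image, but only every SKIP_FRAMES frames.
        // On the other frames the previous detections are reused.
        if ( count % SKIP_FRAMES == 0 )
        {
            faces = detector(cimg_small);
        }
        ++count;

        // Find the pose of each face.
        std::vector<full_object_detection> shapes;
        for (unsigned long i = 0; i < faces.size(); ++i)
        {
            // Resize obtained rectangle for full resolution image.
            rectangle r(
                (long)(faces[i].left() * FACE_DOWNSAMPLE_RATIO),
                (long)(faces[i].top() * FACE_DOWNSAMPLE_RATIO),
                (long)(faces[i].right() * FACE_DOWNSAMPLE_RATIO),
                (long)(faces[i].bottom() * FACE_DOWNSAMPLE_RATIO)
            );
            // Landmark detection on full sized image
            full_object_detection shape = pose_model(cimg, r);
            shapes.push_back(shape);
            // Custom Face Render
            render_face(im, shape);
        }

        // Display at half resolution (window title is arbitrary)
        cv::resize(im, im_display, cv::Size(), 0.5, 0.5);
        cv::imshow("Fast Facial Landmark Detector", im_display);
        if (cv::waitKey(1) == 27) break; // Esc quits
    }
    return 0;
}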
Optimizing Display
When I first tried speeding up the facial landmark detector, I was surprised to find that a third of the time was spent drawing the landmarks and displaying the frame. I did two optimizations that helped speed things up:
Resize Frame
I resized the image to half resolution for display. This makes a huge difference because when the resolution is changed from 720p to 360p, the actual number of pixels that need to be displayed goes down by a factor of 4.
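The factor of 4 follows directly from halving both dimensions. A small compile-time check makes the arithmetic explicit; pixel_count is a hypothetical helper used only for this illustration.

```cpp
// Number of pixels in a frame of the given size.
constexpr long pixel_count(long width, long height)
{
    return width * height;
}

// 720p versus 360p: half the width and half the height.
static_assert(pixel_count(1280, 720) == 921600, "720p frame");
static_assert(pixel_count(640, 360) == 230400, "360p frame");
static_assert(pixel_count(1280, 720) == 4 * pixel_count(640, 360),
              "half resolution has a quarter of the pixels");
```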
Custom Face Renderer
Dlib’s face renderer didn’t work very well for me; the frames did not render smoothly. So I wrote my own using OpenCV’s polylines. The code is shown below.
#ifndef BIGVISION_RENDER_FACE_H_
#define BIGVISION_RENDER_FACE_H_

#include <vector>
#include <dlib/image_processing/frontal_face_detector.h>
#include <opencv2/imgproc/imgproc.hpp> // cv::polylines
#include <opencv2/highgui/highgui.hpp>

// Draws landmarks start..end (inclusive) as a single polyline.
// Both functions are inline so this header can be included from several source files.
inline void draw_polyline(cv::Mat &img, const dlib::full_object_detection& d, const int start, const int end, bool isClosed = false)
{
    std::vector<cv::Point> points;
    for (int i = start; i <= end; ++i)
    {
        points.push_back(cv::Point(d.part(i).x(), d.part(i).y()));
    }
    // Blue in BGR order, thickness 2, line type 16 (anti-aliased)
    cv::polylines(img, points, isClosed, cv::Scalar(255,0,0), 2, 16);
}

// Draws the 68-point face model onto img.
inline void render_face(cv::Mat &img, const dlib::full_object_detection& d)
{
    DLIB_CASSERT
    (
        d.num_parts() == 68,
        "\n\t Invalid inputs were given to this function. "
        << "\n\t d.num_parts(): " << d.num_parts()
    );

    draw_polyline(img, d, 0, 16);           // Jaw line
    draw_polyline(img, d, 17, 21);          // Left eyebrow
    draw_polyline(img, d, 22, 26);          // Right eyebrow
    draw_polyline(img, d, 27, 30);          // Nose bridge
    draw_polyline(img, d, 30, 35, true);    // Lower nose
    draw_polyline(img, d, 36, 41, true);    // Left eye
    draw_polyline(img, d, 42, 47, true);    // Right eye
    draw_polyline(img, d, 48, 59, true);    // Outer lip
    draw_polyline(img, d, 60, 67, true);    // Inner lip
}

#endif // BIGVISION_RENDER_FACE_H_
I also tried rendering all the points using a single polyline hoping to see some improvement in speed, but there was no difference in speed at all.
Results
Using the above optimizations I am able to get a speed of 70 fps on videos recorded at 120 fps. On my webcam I get 27-30 fps because we are limited by the recording speed of the webcam. The reported numbers include the time needed to read the frame from the camera or video file, face detection, facial landmark detection and display at half resolution.
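If you want to reproduce this kind of measurement, one simple approach is to count frames over the whole loop so that reading, detection, landmark fitting and display are all included. The sketch below shows one way to do it. FpsCounter and average_fps are hypothetical helpers and not necessarily how the numbers above were obtained.

```cpp
#include <cassert>
#include <chrono>

// Average frames per second for `frames` frames processed in
// `elapsed_seconds`. Returns 0 when no time has elapsed.
inline double average_fps(long frames, double elapsed_seconds)
{
    assert(frames >= 0);
    return elapsed_seconds > 0.0 ? frames / elapsed_seconds : 0.0;
}

// Hypothetical helper: call tick() once per loop iteration, after the
// frame has been read, processed and displayed.
class FpsCounter
{
public:
    FpsCounter() : start_(std::chrono::steady_clock::now()), frames_(0) {}

    void tick() { ++frames_; }

    long frames() const { return frames_; }

    // Average fps since the counter was constructed.
    double fps() const
    {
        const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start_;
        return average_fps(frames_, elapsed.count());
    }

private:
    std::chrono::steady_clock::time_point start_;
    long frames_;
};
```

For instance, 140 frames processed in 2 seconds gives an average of 70 fps. With a live webcam the result cannot exceed the camera's own frame rate, however fast the processing is.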