In this tutorial we will learn how to estimate the pose of a human head in a photo using OpenCV and Dlib.

In many applications, we need to know how the head is tilted with respect to a camera. In a virtual reality application, for example, one can use the pose of the head to render the right view of the scene. In a driver assistance system, a camera looking at a driver’s face in a vehicle can use head pose estimation to see if the driver is paying attention to the road. And of course one can use head pose based gestures to control a hands-free application / game. For example, yawing your head left to right can signify a NO. But if you are from southern India, it can signify a YES! To understand the full repertoire of head pose based gestures used by my fellow Indians, please partake in the hilarious video below.

My point is that estimating the head pose is useful. Sometimes.

Before proceeding with the tutorial, I want to point out that this post belongs to a series I have written on face processing. Some of the articles below are useful in understanding this post and others complement it.

## What is pose estimation ?

In computer vision the pose of an object refers to its relative orientation and position with respect to a camera. You can change the pose by either moving the object with respect to the camera, or the camera with respect to the object.

The pose estimation problem described in this tutorial is often referred to as the **Perspective-n-Point** problem, or PnP, in computer vision jargon. As we shall see in more detail in the following sections, the goal in this problem is to find the pose of an object when we have a calibrated camera, and we know the locations of **n** 3D points on the object and the corresponding 2D projections in the image.

## How to mathematically represent camera motion ?

A 3D rigid object has only two kinds of motions with respect to a camera.

**Translation**: Moving the camera from its current 3D location $(X, Y, Z)$ to a new 3D location $(X', Y', Z')$ is called translation. As you can see, translation has 3 degrees of freedom: you can move in the X, Y or Z direction. Translation is represented by a vector $\mathbf{t}$ which is equal to $(X' - X, Y' - Y, Z' - Z)$.

**Rotation**: You can also rotate the camera about the $X$, $Y$ and $Z$ axes. A rotation, therefore, also has three degrees of freedom. There are many ways of representing rotation. You can represent it using Euler angles ( roll, pitch and yaw ), a $3 \times 3$ rotation matrix, or a direction of rotation (i.e. axis ) and angle.

So, estimating the pose of a 3D object means finding 6 numbers — three for translation and three for rotation.
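To make the axis-angle representation concrete, here is a minimal NumPy sketch that converts a rotation vector (axis scaled by angle) into a 3×3 rotation matrix using Rodrigues' formula. The function name and the example pose values are made up for illustration; OpenCV provides the same conversion as `cv2.Rodrigues`.

```python
import numpy as np

def axis_angle_to_matrix(rvec):
    """Convert an axis-angle rotation vector to a 3x3 rotation matrix."""
    rvec = np.asarray(rvec, dtype=float)
    theta = np.linalg.norm(rvec)            # rotation angle in radians
    if theta < 1e-12:
        return np.eye(3)                    # no rotation
    kx, ky, kz = rvec / theta               # unit rotation axis
    K = np.array([[0.0, -kz,  ky],
                  [kz,  0.0, -kx],
                  [-ky, kx,  0.0]])         # cross-product matrix of the axis
    # Rodrigues' formula: R = I + sin(theta) K + (1 - cos(theta)) K^2
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

# A pose is six numbers: three for rotation and three for translation.
rotation_vector = np.array([0.0, 0.0, np.pi / 2])    # 90 degrees about the Z axis
translation_vector = np.array([10.0, 0.0, 500.0])    # hypothetical translation

R = axis_angle_to_matrix(rotation_vector)
print(np.round(R, 3))
```

The three numbers of the rotation vector carry the same information as the nine entries of `R`, which is why a pose has only six degrees of freedom.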

## What do you need for pose estimation ?

To calculate the 3D pose of an object in an image you need the following information:

**2D coordinates of a few points**: You need the 2D (x,y) locations of a few points in the image. In the case of a face, you could choose the corners of the eyes, the tip of the nose, corners of the mouth etc. Dlib’s facial landmark detector provides us with many points to choose from. In this tutorial, we will use the tip of the nose, the chin, the left corner of the left eye, the right corner of the right eye, the left corner of the mouth, and the right corner of the mouth.

**3D locations of the same points**: You also need the 3D location of the 2D feature points. You might be thinking that you need a 3D model of the person in the photo to get the 3D locations. Ideally yes, but in practice, you don’t. A generic 3D model will suffice. Where do you get a 3D model of a head from ? Well, you really don’t need a full 3D model. You just need the 3D locations of a few points in some arbitrary reference frame. In this tutorial, we are going to use the following 3D points.

- Tip of the nose : ( 0.0, 0.0, 0.0)
- Chin : ( 0.0, -330.0, -65.0)
- Left corner of the left eye : (-225.0, 170.0, -135.0)
- Right corner of the right eye : ( 225.0, 170.0, -135.0)
- Left corner of the mouth : (-150.0, -150.0, -125.0)
- Right corner of the mouth : (150.0, -150.0, -125.0)

Note that the above points are in some arbitrary reference frame / coordinate system. This is called the **World Coordinates** ( a.k.a. Model Coordinates in OpenCV docs ).

**Intrinsic parameters of the camera**. As mentioned before, in this problem the camera is assumed to be calibrated. In other words, you need to know the focal length of the camera, the optical center in the image and the radial distortion parameters. So you need to calibrate your camera. Of course, for the lazy dudes and dudettes among us, this is too much work. Can I supply a hack ? Of course, I can! We are already in approximation land by not using an accurate 3D model. We can approximate the optical center by the center of the image, approximate the focal length by the width of the image in pixels and assume that radial distortion does not exist. Boom! you did not even have to get up from your couch!

## How do pose estimation algorithms work ?

There are several algorithms for pose estimation. The first known algorithm dates back to 1841. It is beyond the scope of this post to explain the details of these algorithms but here is a general idea.

There are three coordinate systems in play here. The 3D coordinates of the various facial features shown above are in **world coordinates**. If we knew the rotation and translation ( i.e. pose ), we could transform the 3D points in world coordinates to 3D points in **camera coordinates**. The 3D points in camera coordinates can be projected onto the image plane ( i.e. **image coordinate system** ) using the intrinsic parameters of the camera ( focal length, optical center etc. ).

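Let’s dive into the image formation equation to understand how the above coordinate systems work. In the figure above, $o$ is the center of the camera and the plane shown in the figure is the image plane. We are interested in finding out what equations govern the projection $p$ of the 3D point $P$ onto the image plane.

Let’s assume we know the location $(U, V, W)$ of a 3D point $P$ in World Coordinates. If we know the rotation $\mathbf{R}$ ( a 3×3 matrix ) and translation $\mathbf{t}$ ( a 3×1 vector ) of the world coordinates with respect to the camera coordinates, we can calculate the location $(X, Y, Z)$ of the point $P$ in the camera coordinate system using the following equation.

$$\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \mathbf{R} \begin{bmatrix} U \\ V \\ W \end{bmatrix} + \mathbf{t} = \left[ \mathbf{R} \mid \mathbf{t} \right] \begin{bmatrix} U \\ V \\ W \\ 1 \end{bmatrix} \tag{1}$$

In expanded form, the above equation looks like this

$$\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \begin{bmatrix} r_{00} & r_{01} & r_{02} & t_x \\ r_{10} & r_{11} & r_{12} & t_y \\ r_{20} & r_{21} & r_{22} & t_z \end{bmatrix} \begin{bmatrix} U \\ V \\ W \\ 1 \end{bmatrix} \tag{2}$$

If you have ever taken a Linear Algebra class, you will recognize that if we knew a sufficient number of point correspondences ( i.e. $(X, Y, Z)$ and $(U, V, W)$ ), the above is a linear system of equations where the $r_{ij}$ and $(t_x, t_y, t_z)$ are unknowns and you can trivially solve for the unknowns.

As you will see in the next section, we know $(X, Y, Z)$ only up to an unknown scale, and so we do not have a simple linear system.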

### Direct Linear Transform

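We do know many points on the 3D model ( i.e. $(U, V, W)$ ), but we do not know $(X, Y, Z)$. We only know the location of the 2D points ( i.e. $(x, y)$ ). In the absence of radial distortion, the coordinates $(x, y)$ of point $p$ in the image coordinates are given by

$$\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = s \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} \tag{3}$$

where $f_x$ and $f_y$ are the focal lengths in the x and y directions, and $(c_x, c_y)$ is the optical center. Things get slightly more complicated when radial distortion is involved and for the purpose of simplicity I am leaving it out.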
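The chain from world coordinates to camera coordinates (equation 1) to image coordinates (equation 3) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the image size, pose and points below are made-up values, the focal length uses the "image width" approximation from earlier, and lens distortion is ignored.

```python
import numpy as np

def project_points(world_points, R, t, camera_matrix):
    """Project Nx3 world points to Nx2 pixel coordinates (no lens distortion)."""
    world_points = np.asarray(world_points, dtype=float)
    camera_points = world_points @ R.T + t            # equation 1: world -> camera
    homogeneous = camera_points @ camera_matrix.T     # apply focal length and optical center
    # Dividing by the third component (the depth) plays the role of the scale factor.
    return homogeneous[:, :2] / homogeneous[:, 2:3]

# Hypothetical 640x480 camera: focal length ~ image width, optical center ~ image center.
width, height = 640, 480
camera_matrix = np.array([[width, 0.0, width / 2],
                          [0.0, width, height / 2],
                          [0.0, 0.0, 1.0]])

R = np.eye(3)                          # no rotation, for simplicity
t = np.array([0.0, 0.0, 1000.0])       # object 1000 units in front of the camera
model_points = np.array([[0.0, 0.0, 0.0],
                         [100.0, 50.0, 0.0]])

pixels = project_points(model_points, R, t, camera_matrix)
print(pixels)
```

With these values the model origin lands on the image center (320, 240) and the second point lands at (384, 272). In practice you would use `cv2.projectPoints`, which also handles distortion.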

What about that $s$ in the equation ? It is an unknown scale factor. It exists in the equation due to the fact that in any image we do not know the depth. If you join any point $P$ in 3D to the center $o$ of the camera, the point $p$ where the ray intersects the image plane is the image of $P$. Note that all the points along the ray joining the center of the camera and point $P$ produce the same image. In other words, using the above equation, you can only obtain $(X, Y, Z)$ up to a scale $s$.

Now this messes up equation 2 because it is no longer the nice linear equation we know how to solve. Our equation looks more like

$$s \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \begin{bmatrix} r_{00} & r_{01} & r_{02} & t_x \\ r_{10} & r_{11} & r_{12} & t_y \\ r_{20} & r_{21} & r_{22} & t_z \end{bmatrix} \begin{bmatrix} U \\ V \\ W \\ 1 \end{bmatrix} \tag{4}$$

Fortunately, the equation of the above form can be solved using some algebraic wizardry using a method called Direct Linear Transform (DLT). You can use DLT any time you find a problem where the equation is almost linear but is off by an unknown scale.
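For the curious, here is a rough NumPy sketch of the DLT idea, not the code OpenCV uses. It assumes a known camera matrix, at least six non-coplanar points, and noise-free data. The unknown scale is eliminated by cross-multiplying, which leaves a homogeneous linear system in the 12 entries of the rotation-translation matrix that is solved with an SVD. The function name and all test values are made up for illustration.

```python
import numpy as np

def dlt_pose(world_points, image_points, camera_matrix):
    """Estimate R and t from >= 6 non-coplanar 3D-2D correspondences using DLT."""
    world_points = np.asarray(world_points, dtype=float)
    image_points = np.asarray(image_points, dtype=float)
    ones = np.ones((len(world_points), 1))
    # Undo the intrinsics so that s * [xn, yn, 1]^T = [R | t] [U, V, W, 1]^T.
    normalized = np.linalg.solve(camera_matrix, np.hstack([image_points, ones]).T).T
    homogeneous_world = np.hstack([world_points, ones])
    rows = []
    for M, (xn, yn, _) in zip(homogeneous_world, normalized):
        zeros = np.zeros(4)
        rows.append(np.hstack([M, zeros, -xn * M]))   # row1.M - xn * row3.M = 0
        rows.append(np.hstack([zeros, M, -yn * M]))   # row2.M - yn * row3.M = 0
    A = np.array(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    P = Vt[-1].reshape(3, 4)
    # Fix the unknown scale and sign so that the 3x3 part has determinant +1.
    P = P / np.cbrt(np.linalg.det(P[:, :3]))
    # Nothing forced the 3x3 part to be a rotation, so snap it to the nearest one.
    U, _, Wt = np.linalg.svd(P[:, :3])
    return U @ Wt, P[:, 3]

# Synthetic check with made-up values: project points with a known pose, then recover it.
rng = np.random.default_rng(0)
world_points = rng.uniform(-200.0, 200.0, size=(10, 3))
angle = 0.3
R_true = np.array([[np.cos(angle), 0.0, np.sin(angle)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(angle), 0.0, np.cos(angle)]])
t_true = np.array([10.0, -20.0, 1000.0])
camera_matrix = np.array([[640.0, 0.0, 320.0],
                          [0.0, 640.0, 240.0],
                          [0.0, 0.0, 1.0]])
camera_points = world_points @ R_true.T + t_true
homogeneous = camera_points @ camera_matrix.T
image_points = homogeneous[:, :2] / homogeneous[:, 2:3]

R_est, t_est = dlt_pose(world_points, image_points, camera_matrix)
print(np.round(R_est, 4))
print(np.round(t_est, 2))
```

With perfect data the known pose is recovered. With noisy landmarks the 3×3 block is no longer an exact rotation, which is one reason the result is usually treated only as a starting point.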

### Levenberg-Marquardt Optimization

The DLT solution mentioned above is not very accurate for the following reasons. First, rotation has three degrees of freedom but the matrix representation used in the DLT solution has 9 numbers. There is nothing in the DLT solution that forces the estimated 3×3 matrix to be a rotation matrix. More importantly, the DLT solution does not minimize the correct objective function. Ideally, we want to minimize the **reprojection error** that is described below.

As shown in equations 2 and 3, if we knew the right pose ( $\mathbf{R}$ and $\mathbf{t}$ ), we could predict the 2D locations of the 3D facial points on the image by projecting the 3D points onto the 2D image. In other words, if we knew $\mathbf{R}$ and $\mathbf{t}$ we could find the point $p$ in the image for every 3D point $P$.

We also know the 2D facial feature points ( using Dlib or manual clicks ). We can look at the distance between projected 3D points and 2D facial features. When the estimated pose is perfect, the 3D points projected onto the image plane will line up almost perfectly with the 2D facial features. When the pose estimate is incorrect, we can calculate a **re-projection error** measure: the sum of squared distances between the projected 3D points and 2D facial feature points.
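As a small illustration, the re-projection error can be computed like this in NumPy. The 3D points are the generic face model used in this tutorial. The camera matrix, the "true" pose and the wrong pose are made-up values, and lens distortion is ignored.

```python
import numpy as np

def reprojection_error(world_points, image_points, R, t, camera_matrix):
    """Sum of squared pixel distances between projected 3D points and observed 2D points."""
    camera_points = np.asarray(world_points, dtype=float) @ R.T + t
    homogeneous = camera_points @ camera_matrix.T
    projected = homogeneous[:, :2] / homogeneous[:, 2:3]
    return float(np.sum((projected - np.asarray(image_points, dtype=float)) ** 2))

# Generic 3D face model points from this tutorial.
model_points = np.array([
    (0.0, 0.0, 0.0),           # Nose tip
    (0.0, -330.0, -65.0),      # Chin
    (-225.0, 170.0, -135.0),   # Left eye left corner
    (225.0, 170.0, -135.0),    # Right eye right corner
    (-150.0, -150.0, -125.0),  # Left mouth corner
    (150.0, -150.0, -125.0),   # Right mouth corner
])

# Hypothetical camera and pose, used only to synthesize "observed" 2D points.
camera_matrix = np.array([[640.0, 0.0, 320.0],
                          [0.0, 640.0, 240.0],
                          [0.0, 0.0, 1.0]])
R_true = np.diag([1.0, -1.0, -1.0])    # face turned towards the camera
t_true = np.array([0.0, 0.0, 1000.0])
camera_points = model_points @ R_true.T + t_true
homogeneous = camera_points @ camera_matrix.T
observed = homogeneous[:, :2] / homogeneous[:, 2:3]

error_true = reprojection_error(model_points, observed, R_true, t_true, camera_matrix)
error_wrong = reprojection_error(model_points, observed, R_true,
                                 t_true + np.array([20.0, 0.0, 0.0]), camera_matrix)
print(error_true, error_wrong)
```

The error is zero for the pose that generated the observations and grows as the pose moves away from it.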

As mentioned earlier, an approximate estimate of the pose ( $\mathbf{R}$ and $\mathbf{t}$ ) can be found using the DLT solution. A naive way to improve the DLT solution would be to randomly change the pose ( $\mathbf{R}$ and $\mathbf{t}$ ) slightly and check if the reprojection error decreases. If it does, we can accept the new estimate of the pose. We can keep perturbing $\mathbf{R}$ and $\mathbf{t}$ again and again to find better estimates. While this procedure will work, it will be very slow. It turns out there are principled ways to iteratively change the values of $\mathbf{R}$ and $\mathbf{t}$ so that the reprojection error decreases. One such method is called Levenberg-Marquardt optimization. Check out more details on Wikipedia.
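To give a feel for how such a refinement works, below is a bare-bones Levenberg-Marquardt loop in NumPy. It is a teaching sketch, not OpenCV's implementation: the pose is parameterized as a rotation vector plus a translation vector, the Jacobian is approximated with finite differences, and the damping schedule (divide by 10 on success, multiply by 10 on failure) is one simple choice among many. The camera matrix and the poses are made-up values.

```python
import numpy as np

def rodrigues(rvec):
    """Axis-angle rotation vector -> 3x3 rotation matrix."""
    theta = np.linalg.norm(rvec)
    if theta < 1e-12:
        return np.eye(3)
    kx, ky, kz = rvec / theta
    K = np.array([[0.0, -kz, ky], [kz, 0.0, -kx], [-ky, kx, 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def project(pose, world_points, camera_matrix):
    """Project points with pose = (rx, ry, rz, tx, ty, tz); no lens distortion."""
    camera_points = world_points @ rodrigues(pose[:3]).T + pose[3:]
    homogeneous = camera_points @ camera_matrix.T
    return homogeneous[:, :2] / homogeneous[:, 2:3]

def refine_pose(initial_pose, world_points, image_points, camera_matrix, iterations=50):
    """Reduce the reprojection error with a simple Levenberg-Marquardt loop."""
    def residuals(p):
        return (project(p, world_points, camera_matrix) - image_points).ravel()
    pose = np.asarray(initial_pose, dtype=float).copy()
    r = residuals(pose)
    damping, step = 1e-3, 1e-6
    for _ in range(iterations):
        if r @ r < 1e-18:                      # already converged
            break
        J = np.empty((r.size, 6))              # finite-difference Jacobian
        for j in range(6):
            shifted = pose.copy()
            shifted[j] += step
            J[:, j] = (residuals(shifted) - r) / step
        H = J.T @ J
        delta = np.linalg.solve(H + damping * np.diag(np.diag(H)), -J.T @ r)
        r_new = residuals(pose + delta)
        if r_new @ r_new < r @ r:              # accept the step, trust the model more
            pose, r = pose + delta, r_new
            damping *= 0.1
        else:                                  # reject the step, be more cautious
            damping *= 10.0
    return pose

model_points = np.array([(0.0, 0.0, 0.0), (0.0, -330.0, -65.0), (-225.0, 170.0, -135.0),
                         (225.0, 170.0, -135.0), (-150.0, -150.0, -125.0),
                         (150.0, -150.0, -125.0)])
camera_matrix = np.array([[640.0, 0.0, 320.0], [0.0, 640.0, 240.0], [0.0, 0.0, 1.0]])
true_pose = np.array([3.0, 0.1, -0.1, 20.0, -10.0, 1000.0])          # hypothetical pose
image_points = project(true_pose, model_points, camera_matrix)        # synthetic observations
rough_pose = true_pose + np.array([0.05, -0.05, 0.05, 5.0, -5.0, 30.0])  # stand-in for a rough estimate

refined_pose = refine_pose(rough_pose, model_points, image_points, camera_matrix)
initial_error = np.sum((project(rough_pose, model_points, camera_matrix) - image_points) ** 2)
final_error = np.sum((project(refined_pose, model_points, camera_matrix) - image_points) ** 2)
print(initial_error, final_error)
```

Each accepted step lowers the reprojection error, so the loop replaces the random perturbations described above with steps chosen from the local slope of the error.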

## OpenCV solvePnP

In OpenCV the functions **solvePnP** and **solvePnPRansac** can be used to estimate pose.

**solvePnP** implements several algorithms for pose estimation which can be selected using the parameter **flags**. By default it uses the flag **SOLVEPNP_ITERATIVE** which is essentially the DLT solution followed by Levenberg-Marquardt optimization. **SOLVEPNP_P3P** solves for the pose from a minimal set of 3 points ( the OpenCV implementation takes exactly four correspondences, as described under **flags** below ) and it should be used only when using **solvePnPRansac**.

In OpenCV 3, two new methods have been introduced: **SOLVEPNP_DLS** and **SOLVEPNP_UPNP**. The interesting thing about **SOLVEPNP_UPNP** is that it tries to estimate camera internal parameters also.

**C++**: bool **solvePnP**(InputArray objectPoints, InputArray imagePoints, InputArray cameraMatrix, InputArray distCoeffs, OutputArray rvec, OutputArray tvec, bool useExtrinsicGuess=false, int flags=SOLVEPNP_ITERATIVE )

**Python**: **cv2.solvePnP**(objectPoints, imagePoints, cameraMatrix, distCoeffs[, rvec[, tvec[, useExtrinsicGuess[, flags]]]]) → retval, rvec, tvec

**Parameters:**

**objectPoints** – Array of object points in the world coordinate space. I usually pass a vector of N 3D points. You can also pass a Mat of size Nx3 ( or 3xN ) single channel matrix, or Nx1 ( or 1xN ) 3 channel matrix. I would highly recommend using a vector instead.

**imagePoints** – Array of corresponding image points. You should pass a vector of N 2D points. But you may also pass 2xN ( or Nx2 ) 1-channel or 1xN ( or Nx1 ) 2-channel Mat, where N is the number of points.

**cameraMatrix** – Input camera matrix $A = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$. Note that $f_x$, $f_y$ can be approximated by the image width in pixels under certain circumstances, and the $c_x$ and $c_y$ can be the coordinates of the image center.

**distCoeffs** – Input vector of distortion coefficients ($k_1$, $k_2$, $p_1$, $p_2$[, $k_3$[, $k_4$, $k_5$, $k_6$],[$s_1$, $s_2$, $s_3$, $s_4$]]) of 4, 5, 8 or 12 elements. If the vector is NULL/empty, zero distortion coefficients are assumed. Unless you are working with a Go-Pro like camera where the distortion is huge, we can simply set this to NULL. If you are working with a lens with high distortion, I recommend doing a full camera calibration.

**rvec** – Output rotation vector.

**tvec** – Output translation vector.

**useExtrinsicGuess** – Parameter used for SOLVEPNP_ITERATIVE. If true (1), the function uses the provided rvec and tvec values as initial approximations of the rotation and translation vectors, respectively, and further optimizes them.

**flags** –

Method for solving a PnP problem:

**SOLVEPNP_ITERATIVE** Iterative method is based on Levenberg-Marquardt optimization. In this case, the function finds such a pose that minimizes reprojection error, that is the sum of squared distances between the observed projections imagePoints and the projected (using projectPoints() ) objectPoints .

**SOLVEPNP_P3P** Method is based on the paper of X.S. Gao, X.-R. Hou, J. Tang, H.-F. Chang “Complete Solution Classification for the Perspective-Three-Point Problem”. In this case, the function requires exactly four object and image points.

**SOLVEPNP_EPNP** Method has been introduced by F.Moreno-Noguer, V.Lepetit and P.Fua in the paper “EPnP: Efficient Perspective-n-Point Camera Pose Estimation”.

The flags below are only available for **OpenCV 3**

**SOLVEPNP_DLS** Method is based on the paper of Joel A. Hesch and Stergios I. Roumeliotis. “A Direct Least-Squares (DLS) Method for PnP”.

**SOLVEPNP_UPNP** Method is based on the paper of A.Penate-Sanchez, J.Andrade-Cetto, F.Moreno-Noguer. “Exhaustive Linearization for Robust Camera Pose and Focal Length Estimation”. In this case the function also estimates the parameters f_x and f_y assuming that both have the same value. Then the cameraMatrix is updated with the estimated focal length.

## OpenCV solvePnPRansac

**solvePnPRansac** is very similar to **solvePnP** except that it uses Random Sample Consensus ( RANSAC ) for robustly estimating the pose.

Using RANSAC is useful when you suspect that a few data points are extremely noisy. For example, consider the problem of fitting a line to 2D points. This problem can be solved using linear least squares where the distance of all points from the fitted line is minimized. Now consider one bad data point that is wildly off. This one data point can dominate the least squares solution and our estimate of the line would be very wrong. In RANSAC, the parameters are estimated by randomly selecting the minimum number of points required. In a line fitting problem, we randomly select two points from all data and find the line passing through them. Other data points that are close enough to the line are called inliers. Several estimates of the line are obtained by randomly selecting two points, and the line with the maximum number of inliers is chosen as the correct estimate.
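The line fitting example can be written down in a few lines of Python. This is a toy sketch with made-up data: ten points lie exactly on y = 2x + 1 and two are wildly off. The function name, the inlier threshold and the final least-squares refit on the inliers are illustrative choices, not part of any OpenCV API.

```python
import random
import numpy as np

def ransac_line(points, iterations=100, threshold=0.5, seed=0):
    """Fit a line with RANSAC; returns (slope, intercept, inliers)."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iterations):
        # Two points are the minimum needed to define a line.
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        a, b = y2 - y1, x1 - x2                 # line a*x + b*y + c = 0 through both points
        norm = (a * a + b * b) ** 0.5
        if norm == 0:
            continue                            # identical points, no line defined
        c = -(a * x1 + b * y1)
        # Points closer to the line than the threshold are inliers.
        inliers = [(x, y) for x, y in points if abs(a * x + b * y + c) / norm <= threshold]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Extra refinement step (an assumption of this sketch): least squares on the inliers only.
    xs, ys = zip(*best_inliers)
    slope, intercept = np.polyfit(xs, ys, 1)
    return slope, intercept, best_inliers

# Ten points on y = 2x + 1 plus two bad data points.
points = [(float(x), 2.0 * x + 1.0) for x in range(10)] + [(3.0, 40.0), (7.0, -30.0)]

all_x, all_y = zip(*points)
ls_slope, ls_intercept = np.polyfit(all_x, all_y, 1)    # plain least squares on everything
slope, intercept, inliers = ransac_line(points)

print("least squares:", ls_slope, ls_intercept)
print("RANSAC       :", slope, intercept, len(inliers))
```

Plain least squares is dragged far away from the true slope of 2 by the two outliers, while RANSAC ignores them and recovers the line through the ten good points.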

The usage of **solvePnPRansac** is shown below and parameters specific to **solvePnPRansac** are explained.

**C++**: void **solvePnPRansac**(InputArray objectPoints, InputArray imagePoints, InputArray cameraMatrix, InputArray distCoeffs, OutputArray rvec, OutputArray tvec, bool useExtrinsicGuess=false, int iterationsCount=100, float reprojectionError=8.0, int minInliersCount=100, OutputArray inliers=noArray(), int flags=ITERATIVE )

**Python: cv2.solvePnPRansac**(objectPoints, imagePoints, cameraMatrix, distCoeffs[, rvec[, tvec[, useExtrinsicGuess[, iterationsCount[, reprojectionError[, minInliersCount[, inliers[, flags]]]]]]]]) → rvec, tvec, inliers

**iterationsCount** – The number of times the minimum number of points are picked and the parameters estimated.

**reprojectionError** – As mentioned earlier, in RANSAC the points for which the predictions are close enough are called “inliers”. This parameter value is the maximum allowed distance between the observed and computed point projections to consider it an inlier.

**minInliersCount** – Number of inliers. If the algorithm at some stage finds more inliers than minInliersCount, it finishes.

**inliers** – Output vector that contains indices of inliers in objectPoints and imagePoints.

## OpenCV POSIT

OpenCV used to have a pose estimation algorithm called POSIT. It is still present in the C API ( **cvPosit** ), but is not part of the C++ API. POSIT assumes a scaled orthographic camera model and therefore you do not need to supply a focal length estimate. This function is now obsolete and I would recommend using one of the algorithms implemented in **solvePnP**.

## OpenCV Pose Estimation Code : C++ / Python

In this section, I have shared example code in C++ and Python for head pose estimation in a single image. You can download the image headPose.jpg here.

The locations of facial feature points are hard coded, so if you want to use your own image, you will need to change the vector **image_points**.


**C++**

```
#include <iostream>
#include <vector>
#include <opencv2/opencv.hpp>

using namespace std;
using namespace cv;

int main(int argc, char **argv)
{
    // Read input image
    cv::Mat im = cv::imread("headPose.jpg");
    if (im.empty())
    {
        cerr << "Could not read headPose.jpg" << endl;
        return 1;
    }

    // 2D image points. If you change the image, you need to change this vector
    std::vector<cv::Point2d> image_points;
    image_points.push_back( cv::Point2d(359, 391) );    // Nose tip
    image_points.push_back( cv::Point2d(399, 561) );    // Chin
    image_points.push_back( cv::Point2d(337, 297) );    // Left eye left corner
    image_points.push_back( cv::Point2d(513, 301) );    // Right eye right corner
    image_points.push_back( cv::Point2d(345, 465) );    // Left mouth corner
    image_points.push_back( cv::Point2d(453, 469) );    // Right mouth corner

    // 3D model points.
    std::vector<cv::Point3d> model_points;
    model_points.push_back(cv::Point3d(0.0, 0.0, 0.0));          // Nose tip
    model_points.push_back(cv::Point3d(0.0, -330.0, -65.0));     // Chin
    model_points.push_back(cv::Point3d(-225.0, 170.0, -135.0));  // Left eye left corner
    model_points.push_back(cv::Point3d(225.0, 170.0, -135.0));   // Right eye right corner
    model_points.push_back(cv::Point3d(-150.0, -150.0, -125.0)); // Left mouth corner
    model_points.push_back(cv::Point3d(150.0, -150.0, -125.0));  // Right mouth corner

    // Camera internals
    double focal_length = im.cols; // Approximate focal length.
    Point2d center = cv::Point2d(im.cols/2, im.rows/2);
    cv::Mat camera_matrix = (cv::Mat_<double>(3,3) << focal_length, 0, center.x,
                                                      0, focal_length, center.y,
                                                      0, 0, 1);
    cv::Mat dist_coeffs = cv::Mat::zeros(4, 1, cv::DataType<double>::type); // Assuming no lens distortion

    cout << "Camera Matrix " << endl << camera_matrix << endl;

    // Output rotation and translation
    cv::Mat rotation_vector; // Rotation in axis-angle form
    cv::Mat translation_vector;

    // Solve for pose
    cv::solvePnP(model_points, image_points, camera_matrix, dist_coeffs, rotation_vector, translation_vector);

    // Project a 3D point (0, 0, 1000.0) onto the image plane.
    // We use this to draw a line sticking out of the nose
    vector<Point3d> nose_end_point3D;
    vector<Point2d> nose_end_point2D;
    nose_end_point3D.push_back(Point3d(0, 0, 1000.0));
    projectPoints(nose_end_point3D, rotation_vector, translation_vector, camera_matrix, dist_coeffs, nose_end_point2D);

    // Draw the landmarks and the line from the nose tip
    for (size_t i = 0; i < image_points.size(); i++)
    {
        circle(im, image_points[i], 3, Scalar(0,0,255), -1);
    }
    cv::line(im, image_points[0], nose_end_point2D[0], cv::Scalar(255,0,0), 2);

    cout << "Rotation Vector " << endl << rotation_vector << endl;
    cout << "Translation Vector" << endl << translation_vector << endl;
    cout << nose_end_point2D << endl;

    // Display image.
    cv::imshow("Output", im);
    cv::waitKey(0);
    return 0;
}
```

**Python**

```
#!/usr/bin/env python
import cv2
import numpy as np

# Read image
im = cv2.imread("headPose.jpg")
if im is None:
    raise IOError("Could not read headPose.jpg")
size = im.shape

# 2D image points. If you change the image, you need to change this array
image_points = np.array([
    (359, 391),  # Nose tip
    (399, 561),  # Chin
    (337, 297),  # Left eye left corner
    (513, 301),  # Right eye right corner
    (345, 465),  # Left mouth corner
    (453, 469)   # Right mouth corner
], dtype="double")

# 3D model points.
model_points = np.array([
    (0.0, 0.0, 0.0),           # Nose tip
    (0.0, -330.0, -65.0),      # Chin
    (-225.0, 170.0, -135.0),   # Left eye left corner
    (225.0, 170.0, -135.0),    # Right eye right corner
    (-150.0, -150.0, -125.0),  # Left mouth corner
    (150.0, -150.0, -125.0)    # Right mouth corner
])

# Camera internals
focal_length = size[1]
center = (size[1] / 2, size[0] / 2)
camera_matrix = np.array(
    [[focal_length, 0, center[0]],
     [0, focal_length, center[1]],
     [0, 0, 1]], dtype="double"
)
print("Camera Matrix :\n {0}".format(camera_matrix))

dist_coeffs = np.zeros((4, 1))  # Assuming no lens distortion

# cv2.SOLVEPNP_ITERATIVE is the OpenCV 3+ name of the flag (OpenCV 2.4 used cv2.CV_ITERATIVE)
(success, rotation_vector, translation_vector) = cv2.solvePnP(
    model_points, image_points, camera_matrix, dist_coeffs, flags=cv2.SOLVEPNP_ITERATIVE)
print("Rotation Vector:\n {0}".format(rotation_vector))
print("Translation Vector:\n {0}".format(translation_vector))

# Project a 3D point (0, 0, 1000.0) onto the image plane.
# We use this to draw a line sticking out of the nose
(nose_end_point2D, jacobian) = cv2.projectPoints(
    np.array([(0.0, 0.0, 1000.0)]), rotation_vector, translation_vector, camera_matrix, dist_coeffs)

for p in image_points:
    cv2.circle(im, (int(p[0]), int(p[1])), 3, (0, 0, 255), -1)

p1 = (int(image_points[0][0]), int(image_points[0][1]))
p2 = (int(nose_end_point2D[0][0][0]), int(nose_end_point2D[0][0][1]))
cv2.line(im, p1, p2, (255, 0, 0), 2)

# Display image
cv2.imshow("Output", im)
cv2.waitKey(0)
```
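The rotation returned by solvePnP is in axis-angle form, which is not very intuitive when you want to know how much the head is tilted. One option is to convert it to a rotation matrix ( e.g. with `cv2.Rodrigues(rotation_vector)[0]` ) and then read off three angles. The sketch below assumes the decomposition R = Rz · Ry · Rx about the camera axes; which angle you call pitch, yaw or roll depends on the axis convention you adopt, and the formulas break down when the rotation about Y approaches ±90 degrees. The function names and test angles are made up for illustration.

```python
import numpy as np

def rotation_matrix_to_euler(R):
    """Angles (about X, Y, Z, in radians) such that R = Rz @ Ry @ Rx."""
    angle_y = np.arctan2(-R[2, 0], np.hypot(R[0, 0], R[1, 0]))
    angle_x = np.arctan2(R[2, 1], R[2, 2])
    angle_z = np.arctan2(R[1, 0], R[0, 0])
    return angle_x, angle_y, angle_z

def euler_to_rotation_matrix(angle_x, angle_y, angle_z):
    """Build R = Rz @ Ry @ Rx from the three angles (radians)."""
    cx, sx = np.cos(angle_x), np.sin(angle_x)
    cy, sy = np.cos(angle_y), np.sin(angle_y)
    cz, sz = np.cos(angle_z), np.sin(angle_z)
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    Rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    return Rz @ Ry @ Rx

# Round trip with made-up angles to check that the two functions agree.
R = euler_to_rotation_matrix(0.1, -0.2, 0.3)
angles = rotation_matrix_to_euler(R)
print(np.degrees(angles))
```

Because the angles come from a generic 3D model and an approximate camera matrix, treat them as rough indicators of head orientation rather than precise measurements.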

## Real time pose estimation using Dlib

The video included in this post was made using my fork of dlib which is freely available for subscribers of this blog. If you have already subscribed, please check the welcome email for link to my dlib fork and check out this file

**dlib/examples/webcam_head_pose.cpp**


Nitheesh A S says

Great article here! Found a small mistake though. Solvepnp’s P3P method takes not 3, but 4 points, including the origin of the model.

And can you explain further regarding why you recommend using P3P only with ransac?

Satya Mallick says

Thanks. P3P uses the minimum number of points and not all points and therefore the estimates can be noisy. RANSAC provides the robustness against noise by sampling the minimum number of points multiple times and selecting the model that has the maximum number of inliers.

Nitheesh A S says

Thanks for replying. I had tried to use P3P with RANSAC sometime back, but wasn’t able to get good results. Iterative was good compared to using P3P with RANSAC. Maybe the parameters I used were wrong. I experimented with default parameters as well as some custom params. Didnt seem to give a good output. If you were able to make RANSAC work well, can you post them too?

Satya Mallick says

In most cases where the noise is small and the number of 3D to 2D matches is small, the iterative method will work better. People use RANSAC when there is a large amount of noise but they have a large number of matches. Imagine you have a 3D model of an arbitrary scene with a texture map and you are using SIFT to match features. You can get hundreds of 3D to 2D matches in such applications but a lot ( say 30-40% ) of the matches will be incorrect. In such cases the iterative method will fail miserably and RANSAC will do a very reasonable job.

atv says

Hi Satya,

Can i use this to create a 3d mesh on the face, and could i also use this for eye blink detection?

Thanks,

Satya Mallick says

No. This just gives the direction in which the face is looking.

atv says

Hey Satya, thanks for your reply. Any existing code that does such a thing, maybe in dlib?

Satya Mallick says

dlib will allow you to track 68 points on the face which you can triangulate to create a rough 2D mesh. There are a few techniques for calculating 3D mesh (e.g. 3D morphable model), but I don’t know one that is implemented in a library like opencv or dlib.

Supra says

In OpenCV 3.1.0 for raspberry pi 3. I removed this flags=cv2.CV_ITERATIVE.

It will worked If I removed flags=cv2.CV_ITERATIVE.

Thanks, Mallick

Satya Mallick says

That is odd. Unfortunately, I don’t have a way to quickly test. But if someone else also points this out, I will change the code.

Supra says

This is for only Raspberry Pi 3. Not pc. Usually, I used Raspberry pi 3 all times.

Supra says

In Python 2/3, why did u used semi-colons?

U don;t needed semi-colon @ the end of brace brackets

Satya Mallick says

Sorry that was a typo. Semi-colons are not needed. Fixed.

Kamble Tanaji says

Thanks Dr. Satya Mallick !! I get interest to read your all the posts. The posts are very informative and clears each and every detail in minimum words. Still I have not implemented the work you shared. But, once I will implement it, definitely my interest in OpenCV will increase more..

Satya Mallick says

Thanks for the kind words.

Kamble Tanaji says

I want to learn OpenCV by implementing your work. I prefer the Ubuntu platform. Let me know where i get good materials for preliminary stage.

Hamdi says

Dear Mallick, thank you for sharing your knowledge……i tried the code, no compile or run time error, but the algorithm is not detecting any thing and is very very slow….i have enabled SSE2, SSE4 and AVX but no results….when i tried the webcam_face_pose_ex from Dlib it works perfectly…..I appreciate any help from your side, as in your video the algorithm works fine and fast

Hamdi says

The bottleneck is the face detector, requires so much time….resizing and using your customized face rendering didn’t solve the problem……Do you have any hint ? is it possible to use opencv face detector instead ? (my PC is modern with i7 processor)…thanks

Satya Mallick says

Here are a few suggestions to speed up dlib. Hope this helps.

https://learnopencv.com/speeding-up-dlib-facial-landmark-detector/

Hamdi says

the instructions in this link are already implemented in your code (resizing, faster rendering) but no results…..I have used the opencv face detector instead and now its working correctly but at 7 fps only….would you please tell me what was your frame speed including everything (detection and pose estimation)…thank you so much again for your assistance

D Sharpe says

Try commenting out the following line in the example code and run in release configuration.

if ( count % 15 == 0)

ارم الزهراء says

i want to use openCV and Dlib in one python script. i want to detect faces thorough dlib and recognize them using fisher faces algorithm. is it possible?

detection and recognition both are real time.

Satya Mallick says

Yes detection will easily be real time using either Dlib or OpenCV versions. I am not 100% sure if recognition will work in real time, but you can do recognition every nth frame.

ارم الزهراء says

thankyou… i want to ask one more thing i want to align the dataset images for the recognition i am using the code from https://github.com/bytefish/facerecognition_guide/blob/master/src/py/crop_face.py but it is not aligning them properly can you identify the mistake

ارم الزهراء says

i want to save the detected face in dlib by cropping the rectangle do you have any idea how can i crop it. i am using dlib first time and having so many problems. i also want to run the fisherface algorithm on the detected faces but it is giving me type error.

i seriously need help in this issue.

Satya Mallick says

Please see the reply above.

sreejith mohanan says

Thanks for sharing this. I have a few doubts. Firstly what is that rotation vector i get as output from solvePNP, also how can i get a full 3×4 projection matrix which can take my 3d points to 2d from this?

Satya Mallick says

Rotation vector is just a way to represent rotation in axis-angle form. To convert it to matrix form, you can use Rodrigues formula. OpenCV has an implementation here

http://docs.opencv.org/2.4/modules/calib3d/doc/camera_calibration_and_3d_reconstruction.html#void Rodrigues(InputArray src, OutputArray dst, OutputArray jacobian)

3×4 projection matrix is simply rotation and translation concatenated as the fourth column.

sreejith mohanan says

Thanks, moreover is there some way get the full projection matrix, which can transform the 3d model points to the 2d points in the image, which i believe is being used inside the cv2.projectPoints function

jonas says

Thanks for sharing this project. However, for me it is quite noisy. I calculated the pitch, which sometimes jumps by 30 degrees, especially when my face is frontal. Did you try adding points close to the ear? Or are they generally unreliable?

Any idea, how to make this more robust and accurate?

Satya Mallick says

Yes you can try a few points near the ears. Unfortunately, the location of those points as returned by Dlib is not very reliable because they are not as nicely defined as other facial features. You may also try adding Kalman Filtering which will help smooth out noisy fluctuations in pose estimation. E.g. checkout this tutorial

http://docs.opencv.org/trunk/dc/d2c/tutorial_real_time_pose.html

ارم الزهراء says

i want to save the detected face in dlib by cropping the rectangle do you have any idea how can i crop it. i am using dlib first time and having so many problems. i also want to run the fisherface algorithm on the detected faces but it is giving me type error.

i seriously need help in this issue.

Satya Mallick says

In the dlib code for tracking landmarks, you will notice that faces are detected first. It saves the detected rectangles in a variable called “faces” which is a vector of rectangles. You can get the cropped image for a face as follows.

First, convert one face rectangle to OpenCV cv::Rect format using

cv::Rect r(faces[i].left(), faces[i].top(), faces[i].width(), faces[i].height());

You can use the above rectangle to crop out the face from the image im using

Mat imFace = im(r);

Max Kutny says

Hi Satya,

There is a mistake in left eye 3d coords in the text (“Left corner of the left eye : ( 0.0, 0.0, 0.0)”).

In the code they are (-225.0, 170.0, -135.0) which seem to be correct.

Satya Mallick says

Thank you so much. I have fixed the mistake.

Charles Zheng says

hi, Satya, I try to use some other points to calculate pose, could you please tell me where you get these 3d coords?

Charles Zheng says

hi, Max, I try to use some other points to calculate pose, could you please tell me where I can get other landmarks 3d coords?

Liron says

Hello Satya,

I was wondering how i can get the 3D model points in real time (like i can see in your video with the vector that comes from your nose).

Thanks

Satya Mallick says

If you look at the code, I have put a 3D point some distance from the nose in the 3D model. I simply project this point onto this image plane using the estimated rotation and translation.

Minu says

Hey there Mister Satya!

Great job with all the tutorials and explanation. I wanna do the pose calculation by myself from scratch. I understand the method, the only thing that keeps me away is that i dont know how to extract only 6 landmarks, instead of 68. I’ve checked the code so many times, the dlib/opencv indexes too. I really need some help, im stucked… I uploaded the code too. Maybe u can give me a fast advice, i know ur time is precious! Thanks a lot! https://uploads.disquscdn.com/images/1fe9db819b1280342fd63a55b92d4b6486cde5b8e6235979dd752642bcd8f646.png

Zongchang Chen says

Hi, this is really a fantastic blog. But I’m wondering what is the measure of the image coordinate and the world coordinate? Are they pixel and millimeter?

Satya Mallick says

If you look at my version of dlib, you will see the indices of 6 points. I have shared the C++ code below.

    std::vector<cv::Point2d> get_2d_image_points(full_object_detection &d)
    {
        std::vector<cv::Point2d> image_points;
        image_points.push_back( cv::Point2d( d.part(30).x(), d.part(30).y() ) ); // Nose tip
        image_points.push_back( cv::Point2d( d.part(8).x(), d.part(8).y() ) ); // Chin
        image_points.push_back( cv::Point2d( d.part(36).x(), d.part(36).y() ) ); // Left eye left corner
        image_points.push_back( cv::Point2d( d.part(45).x(), d.part(45).y() ) ); // Right eye right corner
        image_points.push_back( cv::Point2d( d.part(48).x(), d.part(48).y() ) ); // Left Mouth corner
        image_points.push_back( cv::Point2d( d.part(54).x(), d.part(54).y() ) ); // Right mouth corner
        return image_points;
    }

广告任务网 says

Looks very good.

Hank White says

I ran the program in Xcode, but it is much slower than the compiled webcam_head_pose.

Satya Mallick says

Are you sure you are compiling in release mode? Check this out

http://dlib.net/faq.html#Whyisdlibslow

Tolga Durak says

Dear Satya, thanks for sharing this post and explaining it. I am interested in developing gaze estimation program. It can estimate the center of pupil. In other words, I have the point of the center of pupil. How can I estimate gaze on computer like head pose estimation ? Thanks a lot.

Satya Mallick says

Hi Tolga,

You will have to detect the center of the pupils first. Dlib’s landmark detector does not detect it, but it is possible to do so by retraining a landmark detector with your own data that contains the center of the eyes. In fact, in a few weeks I plan to release a model with the pupil center.

Satya

Moisés Rodríguez Oceda says

Hi Mr. Satya, thank you for this tutorial. I am also trying to estimate the gaze. I already achieve pupil detection. Now I am trying to determine the gaze pose. I read some articles that uses the similar technique you use in this tutorial, modelling an eye; however I don’t know where to find the reference 3D points values of an adult eye. I would be glad if you could help me with this or recommend me some papers to read. Best Regards, Moises

keizou says

I really thank this article.

I’m so sorry but is there an example of webcam_head_pose in python?

I looked at this and tried to code it in Python, but I couldn’t do it

dlib/examples/webcam_head_pose.cpp

Satya Mallick says

Sorry, I don’t have a python version currently. But if you follow the logic in the C++ code, you will be able to write your own. There are not many lines of code.

Mohammed ElBalkini says

Thanks Satya for this amazing tutorial. I would like to get your advice on how to reduce the jitter in the pose matrix when it is used in augmented reality.

Satya Mallick says

Thanks for the kind words Mohammed.

One option is to smooth out jitter by calculating the moving average of the points over multiple frames ( say plus and minus 2 frames ).

You can also average the rotation / translation directly. Be careful while averaging rotation matrices; it is not straightforward. You may find this discussion helpful

http://stackoverflow.com/questions/12374087/average-of-multiple-quaternions

Satya
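A minimal sketch of the moving-average idea, assuming a live stream where future frames are not available, so it averages the last few frames instead of plus and minus 2 frames. The class name `LandmarkSmoother` and the window size are hypothetical.

```python
from collections import deque


class LandmarkSmoother:
    """Average each 2D landmark over the most recent frames to reduce jitter."""

    def __init__(self, window=5):
        # Old frames fall off automatically once the window is full
        self.history = deque(maxlen=window)

    def update(self, points):
        """Add the landmarks of a new frame and return the smoothed landmarks."""
        self.history.append([tuple(p) for p in points])
        n = len(self.history)
        return [
            (sum(frame[i][0] for frame in self.history) / n,
             sum(frame[i][1] for frame in self.history) / n)
            for i in range(len(points))
        ]


smoother = LandmarkSmoother(window=2)
first = smoother.update([(0, 0)])
second = smoother.update([(2, 4)])
third = smoother.update([(4, 8)])
```

The smoothed points would then be passed to solvePnP in place of the raw detections. A larger window gives a steadier pose at the cost of more lag.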

Mohammed ElBalkini says

Thanks for your prompt reply.

i was thinking of converting the rotation matrix to quaternion, average it and then back to rotation matrix. will this work?
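For reference, a rough sketch of one common approximation for averaging rotations that are close together, such as poses from consecutive frames: align the quaternion signs, sum the components, and renormalize. The function name `average_quaternions` and the (w, x, y, z) ordering are assumptions, and the conversion to and from a rotation matrix is left out.

```python
import math


def average_quaternions(quaternions):
    """Approximate mean of unit quaternions (w, x, y, z) that are close together."""
    reference = quaternions[0]
    total = [0.0, 0.0, 0.0, 0.0]
    for q in quaternions:
        # q and -q encode the same rotation, so flip q into the
        # same hemisphere as the reference before summing
        dot = sum(a * b for a, b in zip(q, reference))
        sign = 1.0 if dot >= 0.0 else -1.0
        for i in range(4):
            total[i] += sign * q[i]
    norm = math.sqrt(sum(c * c for c in total))
    return tuple(c / norm for c in total)


# Identity and a 90 degree rotation about Z average to a 45 degree rotation about Z
identity = (1.0, 0.0, 0.0, 0.0)
quarter_turn = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
mean = average_quaternions([identity, quarter_turn])
```

This shortcut degrades when the rotations being averaged are far apart, so it is best kept to a short window of frames.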

Zongchang Chen says

Hi! This is really a fantastic blog. I’m wondering what is the measure of the image coordinate and the world coordinate respectively? Are they pixel and millimeter?

Satya Mallick says

Thank you!

The image coordinates are in pixels, but the world coordinates are in arbitrary units. You can produce the world coordinates using real measurements in millimeters or inches etc., or they could just be the coordinates in some arbitrary 3D model.

Zongchang Chen says

Wow that was a very fast reply! Thank you for your answer. Can I interpret your answer as saying that the units of the world coordinates do not actually matter in the computation, as long as we use the same units for every point in the 3D model?

Satya Mallick says

Yes that’s right.

Mohammed ElBalkini says

Hi Satya, does the higher number of model points affect the precision of the estimated pose matrix?

Satya Mallick says

Yes, the pose estimate can be made better with more points. Also, if you could have some points on the ears etc. , the pose estimate will be more stable.

hashini hemanjalee says

sir can i know what are the algorithms used here to estimate the pose?

Zongchang Chen says

Hi Satya! Is there any functions in OpenCV or any other libraries that I can use to find the rotation 3×3 matrix R and the translation matrix t when given the intrinsic camera matrix, the 2D image points and their corresponding 3D model points? Or I have to implement the wheel to find the extrinsic camera matrix in this scenario?

Satya Mallick says

Hi Zongchang,

Yes, solvePnP does precisely that :).

Satya

Zongchang Chen says

Hi Satya, thank you for your quick reply again! But when I run this code, the rotation vector rvec returned is actually a 3×1 column vector. I don’t think that is the 3×3 rotation matrix that I actually want.

Satya Mallick says

They are both the same rotation expressed differently. Look up the OpenCV documentation on Rodrigues to convert one form to the other.
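To make the relationship concrete, here is a small standard-library sketch of the Rodrigues formula. The direction of the rotation vector is the axis and its length is the angle in radians. The function names are hypothetical; in OpenCV itself, `cv2.Rodrigues(rvec)` performs this conversion.

```python
import math


def rotation_angle_degrees(rvec):
    """The length of a rotation vector is the rotation angle (in radians)."""
    return math.degrees(math.sqrt(sum(c * c for c in rvec)))


def rodrigues_to_matrix(rvec):
    """Convert an axis-angle rotation vector into a 3x3 rotation matrix."""
    theta = math.sqrt(sum(c * c for c in rvec))
    if theta < 1e-12:
        # No rotation: return the identity matrix
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    kx, ky, kz = (c / theta for c in rvec)  # unit rotation axis
    c, s = math.cos(theta), math.sin(theta)
    v = 1.0 - c
    return [
        [c + kx * kx * v,      kx * ky * v - kz * s, kx * kz * v + ky * s],
        [ky * kx * v + kz * s, c + ky * ky * v,      ky * kz * v - kx * s],
        [kz * kx * v - ky * s, kz * ky * v + kx * s, c + kz * kz * v],
    ]


# A quarter turn about the Z axis
R = rodrigues_to_matrix((0.0, 0.0, math.pi / 2))
```

For the quarter turn above, `R` is approximately `[[0, -1, 0], [1, 0, 0], [0, 0, 1]]`.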

Zongchang Chen says

I see. It really really helps! Thank you so much!

Alexey Ledovskiy says

Hey Satya, I was trying to do just a face recognition using dlib and standard face landmark from their site, it seems like the features and matching are not rotation invariant, I was wondering if you have any ideas how to make the face recognition rotation invariant with dlib?

Satya Mallick says

Hi Alexey, Face Recognition usually means identifying who the person is. Landmark detection can be used as a preprocessing step in face recognition for alignment. Does that make sense ?

Alexey Ledovskiy says

Thanks for replying. I mean face detection phase inside dlib, seems like the landmark detection is not rotation invariant, so when rotate the camera like 90 degrees it doesn’t detects a face. Maybe you have some thoughts where to look to fix that. Thank you.

Chomskyite says

If you’re using the “.dat” file that came with dlib then it’s limited to detecting the 68 facial landmarks that it was trained on. To get it to detect a profile view you’ll have to create a new “.dat” file using several photographs of people with their faces turned 90 degrees to the camera. In the /tools directory you should find imglab which helps you do this.

Siddhant Mehta says

Thank you so much Satya Sir for your wonderful tutorials. They helped me a lot in learning OpenCV and creating my projects. So far I have implemented pose estimation with solvePnP as you explained above. But as far as I understood, the camera is fixed in this scenario. If both camera and target are moving, how is it possible to detect the pose of the camera w.r.t. the target? My camera has 2 degrees of freedom (pitch, yaw). I am thinking of estimating a homography matrix from point matches between poses and somehow adding that to the rvec and tvec. Any suggestions? Thank you, Siddhant Mehta

Satya Mallick says

Hi Siddhant. If the camera moves you get the relative orientation of the object w.r.t the camera. But I guess you are asking how do you recover camera motion. For that you have to look at static parts of the scene, find point correspondences. If the point correspondences come from a plane ( e.g. the floor or one wall ) you can estimate Homography and decompose it into R and t. Otherwise, you need to estimate the Essential Matrix / Fundamental Matrix. BTW if you are doing this to learn, go ahead and implement these yourself. But if you are using it in a real world project check out VisualSFM, Theia, and OpenMVG.

Clement Ng says

Hi Satya and thank you for your tutorial. It is very useful for me. I would have one question to ask about swap face. I would like to do an android application about putting model’s face to a user’s face so that they can see the result for applying the cosmetic in our application. I have watched your tutorial (face swap and face morph ). Which one do you think is more suitable and can I swap their face feature and without change their face size and hair style? Because I saw that the face shape would be changed in the face swap tutorial. Thank you very much.

Satya Mallick says

Thanks. Both of those are not actually good for applying makeup. For makeup the technique is very different and each makeup element is rendered differently. You can try to look at something I did at my previous company ( http://www.taaz.com ).

Clement Ng says

If I am doing a college assignment, which one do you think would be more suitable?

Jon Watte says

This tutorial shows how to unproject 2D points to 3D points, which is a somewhat interesting optimization/fitting problem, but to have a working solution, the important bit is finding where the feature points are in the faces in the input images: corners of eyes, nose tip, mouth, etc. I can’t find any code in your github that actually calls the opencv face detect functions; there are just files with hard-coded point locations as input. How did you generate these input files?

Satya Mallick says

That is done using dlib. This is the file you need.

https://github.com/spmallick/dlib/blob/master/examples/webcam_head_pose.cpp

and here is the compilation instruction

http://dlib.net/compile.html

Damian Allen says

Satya,

Great post. Do you have a suggestion as to how to derive 3D coordinate locations for the landmark features themselves, rather than the entire head?

The approaches I can think of, using a simple mesh of a generic head:

1. Raycast from a given 2D landmark position to the head mesh model and calculate the point position where the ray intersects.

2. Render a position map of the head mesh from the camera POV (i.e. render the XYZ coordinates of the face model into the RG and B channels respectively), then retrieve the pixel value at the location of the landmark.

3. Render a depth matte of the head mesh and use its value paired with the XY screen coordinates of the landmark to derive the world XYZ from these.

Not sure if there’s a simpler approach, or which one of these would be the most efficient, since I haven’t dealt too much with rendering 3D objects via OpenCV.

Satya Mallick says

Thanks Damian,

If you have a 3D triangulated mesh and you have found the head pose using the method I have described, you can project any point on the mesh to the image plane. Conversely, if you want to estimate the 3D location of a 2D point, you can transform the mesh into camera coordinates ( see figure in the article ), and shoot a ray from the camera center through the pixel location and see where it intersects the mesh ( in camera coordinates ). Obviously, it is possible the ray will intersect the mesh multiple times and so you need to choose the point closest to the camera.
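A rough sketch of that ray casting step, assuming the mesh is already in camera coordinates and given as a list of triangles. It uses the Möller–Trumbore ray-triangle test, which is one standard choice and not necessarily what any particular library does. All function names and camera values are hypothetical.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def pixel_ray(u, v, focal_length, center):
    """Direction of the ray from the camera center through pixel (u, v)."""
    return ((u - center[0]) / focal_length, (v - center[1]) / focal_length, 1.0)


def ray_triangle(origin, direction, triangle, eps=1e-9):
    """Distance along the ray to the triangle, or None if the ray misses it."""
    v0, v1, v2 = triangle
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:          # ray is parallel to the triangle plane
        return None
    inv = 1.0 / det
    s = sub(origin, v0)
    u = dot(s, p) * inv         # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv  # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    return t if t > eps else None  # ignore hits behind the camera


def closest_hit(direction, triangles):
    """3D point (camera coordinates) of the nearest mesh intersection, or None."""
    origin = (0.0, 0.0, 0.0)    # the camera center
    hits = [t for t in (ray_triangle(origin, direction, tri) for tri in triangles)
            if t is not None]
    if not hits:
        return None
    t = min(hits)               # the ray may hit the mesh several times
    return tuple(t * d for d in direction)


# Two parallel triangles on the optical axis; the nearer one should win
near = ((-1.0, -1.0, 5.0), (1.0, -1.0, 5.0), (0.0, 1.0, 5.0))
far = ((-1.0, -1.0, 10.0), (1.0, -1.0, 10.0), (0.0, 1.0, 10.0))
ray = pixel_ray(320.0, 240.0, 1000.0, (320.0, 240.0))
hit = closest_hit(ray, [far, near])
```

Testing every triangle is fine for a small generic head mesh; a dense mesh would likely need a spatial index.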

Damian Allen says

Thanks,

Yes, I’m actually already putting a workflow together based on using your pose prediction to inform a more detailed mesh.

One other quick question: if you’re only concerned with landmark detection for a single actor, would it be better to train a model with multiple photos of their head in various orientations and lighting, rather than a variety of faces? If so, should they also be of various facial expressions? Forgive my ignorance of the training model; I’m a few levels of encapsulation away from wanting to understand the fine details of the neural network implementation…

Satya Mallick says

Yes it would be better. In fact I have done something similar for a project. It is very difficult to find the same person under different lighting conditions. You also need to label all those images. So the best trick is to run the standard landmark detector on the person’s face, fix the points that are not accurate, and put these new images in the training set as well. 50% images of this person and 50% of random people will still bias the results toward this person’s face and also have sufficient variety in lighting etc. Hope that helps.

Damian Allen says

Yes indeed, thank you. I actually plan to add custom markers to the face and train those (i.e. dots at landmark positions like cheek bones, corner of mouth and above eyebrows). I imagine at that point using other faces would just confuse the results. And if you don’t mind me asking one more question: in the case of adding custom markers would the shape of the marks need to be unique, or would their proximity to facial features (e.g. a marker just to the side of a mouth corner) be sufficient for the training to see them as unique?

Shaun Campbell says

Hi Satya,

If you mean using the 2d landmark points that come from dlib and are therefore subject to skewing/scaling depending on perspective and head rotation (i.e. the points before and unrelated to pose estimation), wouldn’t this become increasingly more inaccurate with more pose rotation beyond zero, and translation away from the center of the camera image (e.g. if there is significant perspective distortion)?

What would the above method give, that isn’t already achieved by taking the 68 landmark points’ 2D camera-image coordinates, scaling them with respect to the target 3D coordinate system, giving a Z-plane of X Y positions, then translating and rotating this collection of points by the pose estimation matrix?

Or are we talking about estimating the 3D location of a 2D point that has had further transformation to take the perspective of the device camera into account?

Thanks

CO says

Hi Satya, Thank you for very good tutorial about dlib and opencv. I am beginner at c++ and I have some question to ask about webcam_head_pose.cpp as in code. My goal is to draw laser from eyes like Superman so I need to get eyes position from face. Is there anyway to get eyes position from it ? Thank you very much.

Satya Mallick says

Thanks!

You will have to train your own dlib model that contains the center of the eyes. You can also use the points around the eyes to come up with a heuristic for the location of the center of the pupil, but it won’t be very good.

Angad Nayyar says

Hi Satya,

Can we use the information determined from this to get the location of a real-world object from its pixel coordinates?

For example, I use an A4 paper to do the mentioned steps. Can I then use the translational vector, rotation vector, and my knowledge of the dimensions of the paper to get the real world location of a coin next to it?

Satya Mallick says

For 3D you need two cameras. This just gives you the pose. The translation vector here does not correspond to real world. It is w.r.t the coordinate in which the 3D points are defined.

Arsalan Tariq says

Hi,

My project is “Density Estimation of crowd”. Video is captured from a drone camera and i have to count number of heads. And i have to code in opencv python.can someone guide me please?

andres says

nice tutorial!! but is there a way to process using gpu

Anonymous says

Hi. nice tutorial ..But its running slow on my system i.e. 30fps only. Also it detects only within a limited range. Outside that it simply doesn’t detect at all. Is this problem in actual system also or only my problem? How to increase speed further? i have set AVX instruction flag still no effect.

Satya Mallick says

Thanks.

You can try some suggestions here

https://learnopencv.com/speeding-up-dlib-facial-landmark-detector/

Sam Zheng says

I want the computer to know whether the user turns his head left, right, up or down. Based on the pitch and yaw, can you provide some suggestions for letting the computer learn this by itself?

Satya Mallick says

You may find this post useful https://learnopencv.com/rotation-matrix-to-euler-angles/
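Once pitch and yaw are available in degrees, a simple threshold rule is often enough as a starting point. This is only a sketch: the function name, the 15 degree threshold and the sign conventions (positive yaw means right, positive pitch means up) are assumptions that depend on how your coordinate axes are defined.

```python
def head_direction(pitch_deg, yaw_deg, threshold_deg=15.0):
    """Classify the head pose as left, right, up, down or center."""
    # The larger of the two angles decides which direction is reported
    if abs(yaw_deg) >= abs(pitch_deg) and abs(yaw_deg) > threshold_deg:
        return "right" if yaw_deg > 0 else "left"
    if abs(pitch_deg) > threshold_deg:
        return "up" if pitch_deg > 0 else "down"
    return "center"


print(head_direction(0.0, 30.0))
print(head_direction(-25.0, 5.0))
print(head_direction(3.0, -4.0))
```

If the labels come out mirrored for your setup, flip the signs rather than the thresholds.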

Marius Maaland says

Hi Satya. Great site, I’m learning a ton.

I would like to clarify my understanding of the assumptions made, and the preprocessing necessary. Firstly, the 2D image points, i.e. the 2D locations of the nose tip, chin etc., am I correct in assuming that they are the result of a facial landmark detector run beforehand?

Secondly, I did not understand clearly where the 3D model points were taken from, and how I would need to alter them for my own use?

Thanks in advance.

shiva prasad says

Hi, Was your question answered?

even i have similar question.

Satya Mallick says

yes, the 2D points are a result of facial landmark detection.

The 3D points were simply approximated by me. If you have a 3D model of a human head, you can use the points from that model.

SRINIVAS ALUVALA says

Sir can you tell me how you calculated the 3d coordinates.

Mohammad Haghighat says

Does it matter to normalize the template 3D points (defined above in step 2), and scale them to the size of the detected face?

Satya Mallick says

It does not matter. The transformation you calculate has the scale embedded inside.

Shaun Campbell says

I wondered about this too. I’m struggling to make sense of the Z position of the solvePnP detected translation, and how to use that. I get translations with a large depth value e.g. (-300, -200, -2056).

I’m working on iOS, using SceneKit. If I have orthographic projection enabled in my own 3D scene, this Z depth (either applied to the scene’s camera, or a particular 3D object with the pose transform applied to it) won’t affect the perceived size of an object.

Is it better to ignore this Z depth, and influence the scale of an object (such as a clown mask placed over the square frame of the detected face) based on initial distance mapped by face metadata or facial landmark points with respect to camera image size?

Thanks

Hiro says

Hi Satya, I’m using a checkerboard or circles to use solvePnP. In this case, how many pictures do I need to prepare? Only one picture is fine as you did if it includes several points?

Also, I noticed that the latest calibrateCamera in OpenCV3 accepts the object points in the object points’ coordinate frame (= checkerboard coordinate frame), and not necessarily be in the world frame. Is it the same for solvePnP?

Reem Alfaifi says

I want to get the 3D coordinates of the landmark points,

can you please help me?

Satya Mallick says

You cannot have true 3D coordinates because it is a single camera based system.

shiva prasad says

Hi Satya, The way you have presented this topic is so simple and awesome to understand.

My question is, I know the 2D Coordinates on the images(Image points) where feature is located. I know that i can estimate the 3D world Coordinate with Image points and camera parameters. Can i use the calculated 3D world Coordinate and Known Image Points to find the Pose?.

If yes, how accurate will this be?

Satya Mallick says

Yes, you basically need the 3D points, cameraMatrix and the 2D points to find the pose.

SRINIVAS ALUVALA says

Sir,can you tell me how you are extracting the 3d points from 2d.

Ternow Chal says

Hi Satya, how can I estimate the gaze position based on the information we get from the face landmarks?

Pradeepta Ranjan Choudhury says

sir..thanks for this awesome tutorial.but one question how to do it in for video captured live from webcam using python

Ternow Chal says

Hi Satya,

I tried webcam_head_pose example in https://github.com/spmallick/dlib. Unfortunately, I only see the raw images from the webcam without any head pose and face landmarks. What would be the problem?

OpenCV3.3.0

CUDA 8.0

platform Jetson TX1

Jing Yang says

Hi Satya, Thank you very much for your tutorial.

I have a question: what if the images are captured by webcam in real time? How can you get the 2d image points and 3d model points in this case?

Satya Mallick says

Hi Jing,

The 3D model points remain constant. The 2D points can be estimated using Dlib’s Facial Landmark Detector like we do in the tutorial.

Satya

Jing Yang says

Thanks Satya,

I noticed that in your post, the 3D model points are not specialized for a specific person. Could you please tell me what model you use to locate those landmarks? Many thanks.

-Jing

Satya Mallick says

I cooked up those points by eye balling what would be approximate positions of the points in 3D

John Desrosiers says

Hi Satya, in a typical front headshot with the subject basically facing the camera, I see how you can use this to estimate slight tilting sideways and turning of the head left/right. However, how well does this work for estimating forward and backward tilt when you’re using an uncalibrated camera and a generic 3D model?

Working with these landmarks, it would seem to me that there’s too much variation between individuals in nose length, nose vs. mouth position, etc to make a determination. Of course a partial side view would solve this, but that’s not always possible.

Am I missing something? Are there better approaches than this? Or is the ability to do this just one of those things that make us humans special? 😉 Thanks!

Jose Perisadsuara says

Hello Satya, thank you for sharing your knowledge.

Which tool did you use for your custom 2D image landmark annotation? I want to train my model with specific landmarks…

And another question: Are there any functions in opencv to train my custom model and making accurate landmarks predictions?

Thanks

Satya Mallick says

Hi Jose,

We wrote a tool in MATLAB a while back for a client. Unfortunately, I cannot share it for that reason.

Thanks

Satya

VISHANT GARG says

Excellent explanation sir..!!

I Have a doubt sir,values given by rotation vector and translation vector,what they will signify?

As for a rotation of about 120′ in ‘yaw’ i m getting values in range of [-6,6]. It’s not in degree then what is it??

Thank you

Satya Mallick says

Vishant,

There are two coordinate systems in 3D: one attached to the camera with which the picture was taken and another attached to the 3D model. The rotation matrix and translation vector relate the two coordinate systems. In other words, you can apply R and t to a 3D point in model coordinates to find its coordinates in the camera coordinate system.

VISHANT GARG says

Sir,

Thankyou for your crucial time.

Actually,what i m trying to achieve is based on some threshold value of rotational matrix i want to go for face recognition.What i mean is if the value is below or above some threshold then only i will go for recognition like if side pose is there then my face recognition algorithm does not able to extract features correctly and will give wrong result as well as waste my computational time.SO, do u think it is feasible??

Please share your thoughts on same.

Thankyou

John Desrosiers says

Thanks for the tutorial, Satya. This seems to have become Google’s go-to article for face pose estimation.

Looking at the code, I see you’re using a 3D model using the nose as the origin, with +y going upward (Cartesian). Yet, the 2D data uses OpenCV/Dlib’s +y *downward* convention, the vertically-mirrored image of the 3D model.

Could you please explain the reasoning behind the discrepancy between the coordinate systems?

Running your example gives me a rotation vector of roughly [0, 2, 0].

Inverting the sign of the y-coords in the 3D model gives me a rotation vector roughly [0, -1, 0].

Which is correct?

Satya Mallick says

Hi John,

It does not matter how you define your coordinates. The R and t will adjust to whatever system you use. The only check you should do is to apply the R and t to the 3D points, and then project them onto the image ( face ). If the 3D points land near their 2D counterparts, your estimation is correct.
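That check can be turned into a number, the mean reprojection error in pixels. Below is a minimal sketch that assumes a pinhole camera without distortion and a rotation already converted to a 3×3 matrix. The function names and the example values are hypothetical; with OpenCV you would typically get the projected points from `projectPoints` instead.

```python
import math


def project(point, R, t, focal_length, center):
    """Apply R and t to a model point, then project it with a pinhole camera."""
    cam = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    return (focal_length * cam[0] / cam[2] + center[0],
            focal_length * cam[1] / cam[2] + center[1])


def mean_reprojection_error(model_points, image_points, R, t, focal_length, center):
    """Average pixel distance between projected model points and detected 2D points."""
    errors = []
    for model_point, (u_obs, v_obs) in zip(model_points, image_points):
        u, v = project(model_point, R, t, focal_length, center)
        errors.append(math.hypot(u - u_obs, v - v_obs))
    return sum(errors) / len(errors)


# Toy example: identity rotation, model 1000 units in front of the camera
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = (0.0, 0.0, 1000.0)
model_points = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
image_points = [(0.0, 0.0), (13.0, 4.0)]   # second detection is 5 pixels off
error = mean_reprojection_error(model_points, image_points, R, t, 1000.0, (0.0, 0.0))
```

A small error relative to the face size in pixels suggests the pose is consistent with the landmarks. What counts as small depends on image resolution and how well the generic model fits the face.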

[ You can see how I am projecting the point in front of nose as an example ]

Satya

Rajarshi Lahiri says

Hello.This is a great tutorial but can you explain what exactly we are getting in the rotation vector obtained?

Nahid Shafizadeh says

Hello Satya, I am trying to run it with python. How can I find my Reprojection Error?

The output picture looks quite good but I am not sure how to interpret my Euler angles. I am a little bit confused. I thought I would start from the camera frame (X right, Y down and Z to the front). Then, following the Euler convention, I rotate about x, then about y’, then about z”. But the new coordinate system is never what I expected, even though the blue line always points in the right direction.

Thank you

Trần Minh Luận says

Hi Satya, Thank you so much for this tutorial. I learn a lot of things from your blog. Can you help me with head pose estimation? I’m integrating head pose estimation in iOS. It’s work fine, but the euler angles X value when my face is around 90 Degree. I wonder that maybe something wrong with camera matrix in iOS or the coordinates is not correct?

Adhiyaman manickam says

Hi Satya, Nice Presentation. Thanks

I have few questions,

1. Did you use 3D model as a reference for finding the third coordinate of 2D. Or Just assuming the third coordinate in your code.

2. If you using the 3D model as reference, then how do you find third coordinate of 2D.

3. Is there any possibilty to find the translation and rotation before obtaining the third coordinate of 2D.

4. Can you explain in more detail the 2D to 3D relationship which you have derived.

Thanks a ton in advance.

George says

Hello Satya,

I tried to run your headPose.py program and I get the following error:

    Traceback (most recent call last):
      File "/home/pi/headPose.py", line 45, in <module>
        (success, rotation_vector, translation_vector) = cv2.solvePnP(model_points, image_points, camera_matrix, dist_coeffs, flags=cv2.CV_ITERATIVE)
    AttributeError: module 'cv2' has no attribute 'CV_ITERATIVE'

What is the problem and how can I solve it?

Satya Mallick says

They keep changing the names of these constants and breaking backward compatibility in the Python versions. Try cv2.SOLVEPNP_ITERATIVE and let me know if that works. I will update the post accordingly.
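If the same script has to run against both old and new OpenCV builds, one option is to look the constant up by name. This is only a sketch: the helper `iterative_pnp_flag` is hypothetical, and the two `SimpleNamespace` objects are stand-ins for the real `cv2` module so the snippet runs without OpenCV installed.

```python
from types import SimpleNamespace


def iterative_pnp_flag(cv2_module):
    """Return the iterative solvePnP flag under whichever name this build exposes."""
    for name in ("SOLVEPNP_ITERATIVE", "CV_ITERATIVE"):
        if hasattr(cv2_module, name):
            return getattr(cv2_module, name)
    raise AttributeError("no iterative solvePnP flag found in this cv2 module")


# Stand-ins for an old and a new cv2 module, only to demonstrate the lookup
old_cv2 = SimpleNamespace(CV_ITERATIVE=0)
new_cv2 = SimpleNamespace(SOLVEPNP_ITERATIVE=0)
old_flag = iterative_pnp_flag(old_cv2)
new_flag = iterative_pnp_flag(new_cv2)
```

In a real script the call would be `flags=iterative_pnp_flag(cv2)` after `import cv2`.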

George says

That was a very quick answer Thank you. Indeed it worked perfectly.

I also tried the c++ code but this produced a lot of errors. Can you help on this??

(Sorry for the long post, but didn’t know how to upload it)

/tmp/ccwiPEXZ.o: In function `cv::operator<<(std::ostream&, cv::Mat const&)':

headPose.cpp:(.text+0x128): undefined reference to `cv::Formatter::get(int)'

/tmp/ccwiPEXZ.o: In function `main':

headPose.cpp:(.text+0x1f0): undefined reference to `cv::imread(cv::String const&, int)'

headPose.cpp:(.text+0x5f4): undefined reference to `cv::Mat::zeros(int, int, int)'

headPose.cpp:(.text+0x824): undefined reference to `cv::solvePnP(cv::_InputArray const&, cv::_InputArray const&, cv::_InputArray const&, cv::_InputArray const&, cv::_OutputArray const&, cv::_OutputArray const&, bool, int)'

headPose.cpp:(.text+0x964): undefined reference to `cv::noArray()'

headPose.cpp:(.text+0x998): undefined reference to `cv::projectPoints(cv::_InputArray const&, cv::_InputArray const&, cv::_InputArray const&, cv::_InputArray const&, cv::_InputArray const&, cv::_OutputArray const&, cv::_OutputArray const&, double)'

headPose.cpp:(.text+0xab0): undefined reference to `cv::circle(cv::_InputOutputArray const&, cv::Point_, int, cv::Scalar_ const&, int, int, int)’

headPose.cpp:(.text+0xb9c): undefined reference to `cv::line(cv::_InputOutputArray const&, cv::Point_, cv::Point_, cv::Scalar_ const&, int, int, int)’

headPose.cpp:(.text+0xc94): undefined reference to `cv::imshow(cv::String const&, cv::_InputArray const&)’

headPose.cpp:(.text+0xcb4): undefined reference to `cv::waitKey(int)’

/tmp/ccwiPEXZ.o: In function `std::ostream& cv::operator<< (std::ostream&, std::vector<cv::Point_, std::allocator<cv::Point_ > > const&)’:

headPose.cpp:(.text+0xfc8): undefined reference to `cv::Formatter::get(int)’

/tmp/ccwiPEXZ.o: In function `cv::String::String(char const*)’:

headPose.cpp:(.text._ZN2cv6StringC2EPKc[_ZN2cv6StringC5EPKc]+0x58): undefined reference to `cv::String::allocate(unsigned int)’

/tmp/ccwiPEXZ.o: In function `cv::String::~String()’:

headPose.cpp:(.text._ZN2cv6StringD2Ev[_ZN2cv6StringD5Ev]+0x14): undefined reference to `cv::String::deallocate()’

/tmp/ccwiPEXZ.o: In function `cv::String::operator=(cv::String const&)’:

headPose.cpp:(.text._ZN2cv6StringaSERKS0_[_ZN2cv6StringaSERKS0_]+0x30): undefined reference to `cv::String::deallocate()’

/tmp/ccwiPEXZ.o: In function `cv::Mat::Mat(int, int, int, void*, unsigned int)’:

headPose.cpp:(.text._ZN2cv3MatC2EiiiPvj[_ZN2cv3MatC5EiiiPvj]+0x134): undefined reference to `cv::error(int, cv::String const&, char const*, char const*, int)’

headPose.cpp:(.text._ZN2cv3MatC2EiiiPvj[_ZN2cv3MatC5EiiiPvj]+0x21c): undefined reference to `cv::error(int, cv::String const&, char const*, char const*, int)’

/tmp/ccwiPEXZ.o: In function `cv::Mat::~Mat()’:

headPose.cpp:(.text._ZN2cv3MatD2Ev[_ZN2cv3MatD5Ev]+0x3c): undefined reference to `cv::fastFree(void*)’

/tmp/ccwiPEXZ.o: In function `cv::Mat::operator=(cv::Mat const&)’:

headPose.cpp:(.text._ZN2cv3MataSERKS0_[_ZN2cv3MataSERKS0_]+0x140): undefined reference to `cv::Mat::copySize(cv::Mat const&)’

/tmp/ccwiPEXZ.o: In function `cv::Mat::create(int, int, int)’:

headPose.cpp:(.text._ZN2cv3Mat6createEiii[_ZN2cv3Mat6createEiii]+0xc0): undefined reference to `cv::Mat::create(int, int const*, int)’

/tmp/ccwiPEXZ.o: In function `cv::Mat::release()’:

headPose.cpp:(.text._ZN2cv3Mat7releaseEv[_ZN2cv3Mat7releaseEv]+0x68): undefined reference to `cv::Mat::deallocate()’

/tmp/ccwiPEXZ.o: In function `cv::Mat::operator=(cv::Mat&&)’:

headPose.cpp:(.text._ZN2cv3MataSEOS0_[_ZN2cv3MataSEOS0_]+0xf8): undefined reference to `cv::fastFree(void*)’

/tmp/ccwiPEXZ.o: In function `cv::MatConstIterator::MatConstIterator(cv::Mat const*)’:

headPose.cpp:(.text._ZN2cv16MatConstIteratorC2EPKNS_3MatE[_ZN2cv16MatConstIteratorC5EPKNS_3MatE]+0xf8): undefined reference to `cv::MatConstIterator::seek(int const*, bool)’

/tmp/ccwiPEXZ.o: In function `cv::MatConstIterator::operator++()’:

headPose.cpp:(.text._ZN2cv16MatConstIteratorppEv[_ZN2cv16MatConstIteratorppEv]+0x94): undefined reference to `cv::MatConstIterator::seek(int, bool)’

/tmp/ccwiPEXZ.o: In function `cv::Mat::Mat<cv::Point_ >(std::vector<cv::Point_, std::allocator<cv::Point_ > > const&, bool)’:

headPose.cpp:(.text._ZN2cv3MatC2INS_6Point_IdEEEERKSt6vectorIT_SaIS5_EEb[_ZN2cv3MatC5INS_6Point_IdEEEERKSt6vectorIT_SaIS5_EEb]+0x214): undefined reference to `cv::Mat::copyTo(cv::_OutputArray const&) const’

/tmp/ccwiPEXZ.o: In function `cv::Mat_::operator=(cv::Mat const&)’:

headPose.cpp:(.text._ZN2cv4Mat_IdEaSERKNS_3MatE[_ZN2cv4Mat_IdEaSERKNS_3MatE]+0x94): undefined reference to `cv::Mat::reshape(int, int, int const*) const’

headPose.cpp:(.text._ZN2cv4Mat_IdEaSERKNS_3MatE[_ZN2cv4Mat_IdEaSERKNS_3MatE]+0xec): undefined reference to `cv::Mat::convertTo(cv::_OutputArray const&, int, double, double) const’

/tmp/ccwiPEXZ.o: In function `cv::Mat_::operator=(cv::Mat&&)’:

headPose.cpp:(.text._ZN2cv4Mat_IdEaSEONS_3MatE[_ZN2cv4Mat_IdEaSEONS_3MatE]+0x98): undefined reference to `cv::Mat::reshape(int, int, int const*) const’

headPose.cpp:(.text._ZN2cv4Mat_IdEaSEONS_3MatE[_ZN2cv4Mat_IdEaSEONS_3MatE]+0xf0): undefined reference to `cv::Mat::convertTo(cv::_OutputArray const&, int, double, double) const’

collect2: error: ld returned 1 exit status

Satya Mallick says

It looks like you are not linking to the OpenCV library correctly. You can try these instructions

https://learnopencv.com/how-to-compile-opencv-sample-code/

George says

Hello Satya,

I have already done this, but the same problems appear again.

最后の战役 says

Hi Satya, I’m new to programming and also computer vision. Currently I’m doing head pose estimation using C# language. Is it possible for me to use solvePnP in C#? Also, the camera I’m using is the Intel RealSense D435 RGB-D Camera. Thank you

sergman says

I am trying to use your code to estimate camera position/angle in soccer field.

Here is my calibration frame with four points.

https://ibb.co/hSSt1x

World coordinates are in meters

After I run this:

pts3d = np.array([[ 0. , 0, 11], [ -5.5 , 0, 11], [ 0. , 0, 0], [ -16.5 , 0, 0]])

pts2d = np.array([[189, 207], [65, 244], [564, 242], [191, 402]])

(success, rotation_vector, translation_vector) = cv2.solvePnP(pts3d, pts2d, camera_matrix, dist_coeffs, flags=0)

I use rotation vector to extract camera angles:

rotation_matrix = cv2.Rodrigues(rotation_vector)[0]

angles = rotationMatrixToEulerAngles(rotation_matrix)

where rotationMatrixToEulerAngles:

    def rotationMatrixToEulerAngles(R) :
        sy = math.sqrt(R[0,0] * R[0,0] + R[1,0] * R[1,0])
        singular = sy < 1e-6
        if not singular :
            x = math.atan2(R[2,1] , R[2,2])
            y = math.atan2(-R[2,0], sy)
            z = math.atan2(R[1,0], R[0,0])
        else :
            x = math.atan2(-R[1,2], R[1,1])
            y = math.atan2(-R[2,0], sy)
            z = 0
        return np.array([x, y, z])

No matter what focal length I set, the third angle (about the Z axis) comes out at around 40 degrees. That does not make sense, because the actual camera can only rotate about the X and Y axes.
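One way to rule out the conversion itself is a round-trip check: build a rotation matrix from known angles, run it through the function above, and see whether the same angles come back. The sketch below assumes the convention R = Rz(z) · Ry(y) · Rx(x), which is the one this formula inverts; `euler_to_rotation_matrix` is a hypothetical helper written only for this test. Keep in mind that the rotation returned by solvePnP maps world coordinates into camera coordinates, so the extracted angles depend on how the world axes were chosen, not only on how the camera was physically turned.

```python
import math
import numpy as np

def euler_to_rotation_matrix(x, y, z):
    """Hypothetical helper: R = Rz(z) @ Ry(y) @ Rx(x), angles in radians."""
    cx, sx = math.cos(x), math.sin(x)
    cy, sy = math.cos(y), math.sin(y)
    cz, sz = math.cos(z), math.sin(z)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def rotation_matrix_to_euler_angles(R):
    """Same logic as rotationMatrixToEulerAngles in the comment above."""
    sy = math.sqrt(R[0, 0] ** 2 + R[1, 0] ** 2)
    if sy >= 1e-6:
        x = math.atan2(R[2, 1], R[2, 2])
        y = math.atan2(-R[2, 0], sy)
        z = math.atan2(R[1, 0], R[0, 0])
    else:
        # Near-singular case (pitch close to +/-90 degrees): z is fixed to 0.
        x = math.atan2(-R[1, 2], R[1, 1])
        y = math.atan2(-R[2, 0], sy)
        z = 0.0
    return np.array([x, y, z])

# Arbitrary test angles (degrees) chosen for illustration only.
angles_in = np.radians([10.0, -20.0, 40.0])
R = euler_to_rotation_matrix(*angles_in)
angles_out = rotation_matrix_to_euler_angles(R)
print(np.degrees(angles_out))
```

If the printed angles match the inputs, the conversion is consistent with that convention, and an unexpected Z angle is more likely to come from the choice of world axes or from the point correspondences than from the Euler extraction.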

Michel Comap says

Hi Satya! This site is great and very useful for OpenCV beginners like me. I saw your webcam_head_pose.cpp code and I was wondering which OpenCV and dlib versions you used. Thank you.

Satya Mallick says

I can’t remember the exact versions of OpenCV and Dlib, but I think it should work with the latest versions of both (i.e. OpenCV 3.4 + Dlib 19.10).

If you are using OpenCV 3.4, you may also want to try out the native landmark detector:

https://learnopencv.com/facemark-facial-landmark-detection-using-opencv

Shaun Campbell says

Hi Satya,

I’m having trouble working out how to convert the output from solvePnP (either a matrix, or a set of two vectors, translation and rotation) to another 3D coordinate system or projection matrix.

In other words: I’m using iOS SceneKit and I want to place a cone wherever the nose is, and rotate it, based on the solvePnP values. I know that’s quite a basic concept, but I’m obviously missing something – either values that I need to configure as dlib does its 3D calculations, or a way to convert its output to make sense to my own scene’s configuration. I’ve done this kind of thing in projects long ago, but I’m struggling.

(A dlib or OpenCV-based simple line rendering – both the line-from-the-nose that you demonstrate above, and also a simple cube render that I’ve taken from other dlib examples, both render nearly perfectly, so I believe the landmark and pose estimation coordinates are correct.)

I thought that I should perhaps be modifying camera_matrix or dist_coeffs to change the output of the dlib pose estimation, but for one, the 4×4 projection matrix doesn’t obviously fit in the 3×3 camera_matrix.

Do you know what process I should follow once I have solvePnP’s pose rotation and translation, to convert these to another scene, so that they display on screen in the same place as they do in the dlib-based render (i.e. your single line drawn from the nose)? I can imagine I’ll need things like the field of view of the SceneKit camera, and to ensure that the focal length is the same value as what goes into camera_matrix – but I can’t think what the calculation is.

Thanks

Shaun Campbell says

I’ve got a bit further by using ‘projectPoint’ and ‘unprojectPoint’ methods in SceneKit, but there’s still a missing link:

I ‘projectPoint’ with origin of the 3d space (SCNVector3Zero), which yields a vector that is the XY center of the view (333.5, 187.5), but the Z depth is given as 0.94, which I think will be determined by the perspective correction set in the scene’s camera matrix, but I’m not sure.

The Z value of the translation vector coming from the dlib results is much larger – it’s 1000 to 2000 or so, and this, as I expected, changes as I move a detected face closer to/farther from the camera.

So now, I’m just struggling to match these two up. My 3D object in my custom scene moves around much more correctly, but the Z depth is clearly off.

The Z value of the translation yielded by solvePnP is in the thousands, and that’s the value that is so different to the kind of depths I’m used to in a 3D scene, and that’s confusing me a little. I’ve changed my scene’s camera from perspective to orthographic, and I set the orthographic height to the height of my view. I understand how the depth is obtained using the iterative method checking for error (since we don’t know the face’s true depth from a flat image), but it’s really just that I can’t visualise the output of solvePnP with respect to my own scene.

Shaun Campbell says

Hi Satya,

I’m having trouble making sense of how to interpret the depth/Z position of the solvePnP translation.

I have my own 3D scene using iOS’s SceneKit, and I’ve tried to configure that to remove any additional error, e.g. enabling an orthographic projection with a size identical to the image size (or some fraction of this image size, and then I multiply by that fraction).

I understand that the solvePnP function yields the position of the camera with respect to an object’s origin, but I want to detect multiple faces and put objects at the faces’ positions, so I’ll be reversing this process if I can.

However, even before I do that reversal, I’m having trouble lining up a single object (representing a face) and a camera in the SceneKit scene. The XY translation appears to make sense: when I move the face back and forth in the device camera’s viewfinder, the coordinates behave sensibly from edge to edge. The Z depth, however, doesn’t mean much to me. I wondered if it was so large because the camera_matrix has a focal depth. Is this the case? I tried reducing the focal depth, and this made the values increase, and I don’t imagine that increasing values in the camera_matrix arbitrarily is going to be the correct approach.

Finally, I wondered: is this Z depth of solvePnP’s output influenced by scale of the 3D points used from the reference model? If I use points with tip of nose at (0,0,0), eyes at z=-135, mouth at z=-125 and so on, will the depth I get from solvePnP be proportionally large?

Thanks!
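On the scale question: under the pinhole model, a camera-space point is R·X + t and the image point is obtained by dividing by its depth, so multiplying both the 3D model points and the translation by the same factor leaves the 2D projections unchanged. In other words, the translation from solvePnP is expressed in the units of the 3D model points. The sketch below checks this numerically with NumPy only; the model points are the ones used in the tutorial, while the camera matrix, rotation, translation and scale factor are made-up values for illustration.

```python
import numpy as np

def project(points_3d, R, t, K):
    """Pinhole projection without lens distortion: x ~ K (R X + t)."""
    cam = points_3d @ R.T + t          # points in camera coordinates, shape (N, 3)
    uv = cam @ K.T                     # homogeneous pixel coordinates
    return uv[:, :2] / uv[:, 2:3]      # perspective division

# 3D face model points from the tutorial (arbitrary model units).
model_points = np.array([
    [0.0, 0.0, 0.0],          # nose tip
    [0.0, -330.0, -65.0],     # chin
    [-225.0, 170.0, -135.0],  # left eye, left corner
    [225.0, 170.0, -135.0],   # right eye, right corner
    [-150.0, -150.0, -125.0], # left mouth corner
    [150.0, -150.0, -125.0],  # right mouth corner
])

# Assumed values, for illustration only.
K = np.array([[1000.0, 0.0, 320.0],
              [0.0, 1000.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.diag([1.0, -1.0, -1.0])       # face turned toward the camera
t = np.array([0.0, 0.0, 1500.0])     # 1500 model units in front of the camera

scale = 0.001                        # hypothetical change of units
pixels_original = project(model_points, R, t, K)
pixels_rescaled = project(scale * model_points, R, scale * t, K)

print(np.allclose(pixels_original, pixels_rescaled))
```

Since both configurations produce the same image, a Z value in the thousands simply reflects a model whose coordinates are in the hundreds. Rescaling the model points (or dividing the translation by the same factor afterwards) is one way to bring the depth into the range a 3D scene expects.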

John Papakonstantinopoulos says

Hi Satya. Thank you for the tutorial. I would like to ask how I can find the camera position using [R|t]. Specifically, I want to measure the distance between the object and the camera.
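A sketch of the usual relationship, under the convention that solvePnP's output satisfies X_cam = R · X_world + t: the camera centre in world coordinates is C = -Rᵀ · t, and the distance from the camera to the origin of the object's coordinate system is the norm of t, in the units of the 3D model points. The function names below are hypothetical, and R is assumed to have already been obtained from the rotation vector (for example with cv2.Rodrigues).

```python
import numpy as np

def camera_position_in_world(R, t):
    """Camera centre C in world coordinates, from X_cam = R @ X_world + t."""
    R = np.asarray(R, dtype=float)
    t = np.asarray(t, dtype=float).reshape(3)
    return -R.T @ t

def distance_to_object_origin(t):
    """Distance between the camera and the object's origin (model units)."""
    return float(np.linalg.norm(np.asarray(t, dtype=float)))

# Made-up pose for illustration: 90 degree rotation about Y, object 1000 units away.
theta = np.radians(90.0)
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([0.0, 0.0, 1000.0])

C = camera_position_in_world(R, t)
d = distance_to_object_origin(t)
print(C, d)
```

A quick consistency check is that the camera centre maps to the origin of the camera frame, i.e. R · C + t should be numerically zero.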

Shaun Campbell says

Hi Satya,

Whenever I switch from solvePnP to solvePnPRansac, my results become much worse.

I also created sliders on screen to modify iterations, min-inliers, and reprojection-error, to see if I could improve from the visual feedback, but had no luck.

Do you recommend using the default params for the above style of face tracking? Or would you be customising them to suit the scale of the 3D reference points (i.e. a reprojection error more in terms of 100-200 units rather than the default 8.0)?

Ansh David says

Excellent tutorial, thank you. I’m having trouble finding the world coordinates of the facial landmarks in the arbitrary reference frame. Can you point me towards a good resource?

Victor LEPRINCE says

Hello,

I compiled your code without any errors, but when the program launches, the camera window pops up and then freezes. It just keeps loading indefinitely. Any idea where this might come from?

Thanks a lot.

Suman Nepal says

So I have a simple question. How can I tell whether the person is looking left, right or straight from this rotation and translation matrix?
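One possible approach, sketched under several assumptions: in the tutorial's face model the nose tip is the origin and the eyes sit at negative Z, so the model's +Z axis points out of the face. Rotating that axis by R gives the direction the nose points in camera coordinates (OpenCV convention: x to the right of the image, z away from the camera). The horizontal angle of that direction can then be thresholded. The function name, the labels and the 15 degree threshold are all hypothetical, and "left" and "right" here refer to the image, not to the person's own left and right.

```python
import math
import numpy as np

def horizontal_gaze_label(R, threshold_deg=15.0):
    """Classify where the nose points horizontally, in image terms.

    R is the 3x3 rotation matrix (world/model to camera), e.g. from
    cv2.Rodrigues(rotation_vector)[0]. The threshold is an arbitrary choice.
    """
    nose_dir = np.asarray(R, dtype=float) @ np.array([0.0, 0.0, 1.0])
    # A face looking at the camera has nose_dir close to (0, 0, -1),
    # so the angle is measured relative to the -z direction.
    angle = math.degrees(math.atan2(nose_dir[0], -nose_dir[2]))
    if angle > threshold_deg:
        return "image right", angle
    if angle < -threshold_deg:
        return "image left", angle
    return "straight", angle

# Made-up rotations for illustration.
R_frontal = np.diag([1.0, -1.0, -1.0])            # face looking at the camera
a = math.radians(30.0)
R_y = np.array([[math.cos(a), 0.0, math.sin(a)],
                [0.0, 1.0, 0.0],
                [-math.sin(a), 0.0, math.cos(a)]])
R_turned = R_y @ R_frontal                         # head turned by 30 degrees

label_frontal, angle_frontal = horizontal_gaze_label(R_frontal)
label_turned, angle_turned = horizontal_gaze_label(R_turned)
print(label_frontal, label_turned)
```

The translation vector is not needed for this decision, since it only describes where the head is, not which way it is turned. With a real webcam feed the threshold would need tuning, and mirrored video flips the left/right interpretation.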

谢旭 says

Hi, Satya. I am very puzzled. How did you get these 3D points, such as Tip of the nose: (0.0, 0.0, 0.0), Chin: (0.0, -330.0, -65.0), Left corner of the left eye: (-225.0f, 170.0f, -135.0)?

sumaliqing says

Hi, was your question answered?

I have a similar question.

谢旭 says

Sorry, I just saw your comment.

These 3D points are coordinates in an arbitrary world coordinate system.

Kautilya joshi says

I applied for a subscription many times, but I didn’t receive the confirmation mail.

monxarat says

Hi Satya, I want to measure the actual size of the mouth and eyes, and the distance from the mouth to the eyes. How can I do that?

Ruben Alvarez says

Hi, thank you for the very well explained tutorial. I have one question. Let’s say that I want to find the 3D points from a given 2D image. I was thinking of going through the same steps to define a mapping between 2D and 3D points, and then using the transformation matrix to reverse the process. Am I right?

Sreerag A G says

Hello Satya, thank you for sharing your knowledge.

Is it possible to run this application on a GPU? We are using a Jetson TX2. Does the code support CUDA?

oldpie says

Hi Satya, thank you for your tutorials. I rewrote webcam_head_pose.cpp in Python, and it works well. I’m curious: given detected skeleton keypoints (shoulders, hips, nose), is it possible to estimate body orientation? In my opinion, the key is to get a 3D model of the human body, but I can’t find one. Thank you.