In this tutorial we will learn how to estimate the pose of a human head in a photo using OpenCV and Dlib.
In many applications, we need to know how the head is tilted with respect to a camera. In a virtual reality application, for example, one can use the pose of the head to render the right view of the scene. In a driver assistance system, a camera looking at a driver’s face in a vehicle can use head pose estimation to see if the driver is paying attention to the road. And of course one can use head pose based gestures to control a hands-free application / game. For example, yawing your head left to right can signify a NO. But if you are from southern India, it can signify a YES! To understand the full repertoire of head pose based gestures used by my fellow Indians, please watch the hilarious video below.
My point is that estimating the head pose is useful. Sometimes.
Before proceeding with the tutorial, I want to point out that this post belongs to a series I have written on face processing. Some of the articles below are useful in understanding this post and others complement it.
What is pose estimation ?
In computer vision the pose of an object refers to its relative orientation and position with respect to a camera. You can change the pose by either moving the object with respect to the camera, or the camera with respect to the object.
The pose estimation problem described in this tutorial is often referred to as the Perspective-n-Point problem, or PnP, in computer vision jargon. As we shall see in the following sections in more detail, the goal in this problem is to find the pose of an object when we have a calibrated camera and we know the locations of n 3D points on the object and the corresponding 2D projections in the image.
How to mathematically represent camera motion ?
A 3D rigid object has only two kinds of motions with respect to a camera.
- Translation : Moving the camera from its current 3D location \((X, Y, Z)\) to a new 3D location \((X', Y', Z')\) is called translation. As you can see, translation has 3 degrees of freedom — you can move in the X, Y or Z direction. Translation is represented by a vector \(\mathbf{t}\) which is equal to \((X' - X, Y' - Y, Z' - Z)\).
- Rotation : You can also rotate the camera about the \(X\), \(Y\) and \(Z\) axes. A rotation, therefore, also has three degrees of freedom. There are many ways of representing rotation. You can represent it using Euler angles ( roll, pitch and yaw ), a 3×3 rotation matrix, or a direction of rotation ( i.e. axis ) and angle.
So, estimating the pose of a 3D object means finding 6 numbers — three for translation and three for rotation.
What do you need for pose estimation ?
To calculate the 3D pose of an object in an image you need the following information
- 2D coordinates of a few points : You need the 2D (x,y) locations of a few points in the image. In the case of a face, you could choose the corners of the eyes, the tip of the nose, corners of the mouth etc. Dlib’s facial landmark detector provides us with many points to choose from. In this tutorial, we will use the tip of the nose, the chin, the left corner of the left eye, the right corner of the right eye, the left corner of the mouth, and the right corner of the mouth.
- 3D locations of the same points : You also need the 3D location of the 2D feature points. You might be thinking that you need a 3D model of the person in the photo to get the 3D locations. Ideally yes, but in practice, you don’t. A generic 3D model will suffice. Where do you get a 3D model of a head from ? Well, you really don’t need a full 3D model. You just need the 3D locations of a few points in some arbitrary reference frame. In this tutorial, we are going to use the following 3D points.
- Tip of the nose : ( 0.0, 0.0, 0.0)
- Chin : ( 0.0, -330.0, -65.0)
- Left corner of the left eye : (-225.0, 170.0, -135.0)
- Right corner of the right eye : ( 225.0, 170.0, -135.0)
- Left corner of the mouth : (-150.0, -150.0, -125.0)
- Right corner of the mouth : (150.0, -150.0, -125.0)
Note that the above points are in some arbitrary reference frame / coordinate system. This is called the World Coordinates ( a.k.a. Model Coordinates in OpenCV docs ).
- Intrinsic parameters of the camera : As mentioned before, in this problem the camera is assumed to be calibrated. In other words, you need to know the focal length of the camera, the optical center in the image and the radial distortion parameters. So you need to calibrate your camera. Of course, for the lazy dudes and dudettes among us, this is too much work. Can I supply a hack ? Of course, I can! We are already in approximation land by not using an accurate 3D model. We can approximate the optical center by the center of the image, approximate the focal length by the width of the image in pixels, and assume that radial distortion does not exist ( see the sketch right after this list ). Boom! You did not even have to get up from your couch!
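If you want to see this hack in code, here is a minimal sketch ( the same approximation shows up again in the full example later in the post ); it assumes an image `im` has already been loaded with OpenCV:

```python
import numpy as np

# Rough camera internals built from the image size alone, no calibration:
# focal length ~ image width, optical center ~ image center, zero distortion.
h, w = im.shape[:2]
focal_length = w
camera_matrix = np.array([[focal_length, 0, w / 2],
                          [0, focal_length, h / 2],
                          [0, 0, 1]], dtype="double")
dist_coeffs = np.zeros((4, 1))  # assume no lens distortion
```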
How do pose estimation algorithms work ?
There are several algorithms for pose estimation. The first known algorithm dates back to 1841. It is beyond the scope of this post to explain the details of these algorithms but here is a general idea.
There are three coordinate systems in play here. The 3D coordinates of the various facial features shown above are in world coordinates. If we knew the rotation and translation ( i.e. pose ), we could transform the 3D points in world coordinates to 3D points in camera coordinates. The 3D points in camera coordinates can be projected onto the image plane ( i.e. image coordinate system ) using the intrinsic parameters of the camera ( focal length, optical center etc. ).
Let’s dive into the image formation equation to understand how the coordinate systems above work. In the figure above, \(O\) is the center of the camera and the plane shown in the figure is the image plane. We are interested in finding out what equations govern the projection \(p\) of the 3D point \(P\) onto the image plane.

Let’s assume we know the location \((X, Y, Z)\) of a 3D point \(P\) in World Coordinates. If we know the rotation \(\mathbf{R}\) ( a 3×3 matrix ) and translation \(\mathbf{t}\) ( a 3×1 vector ) of the world coordinates with respect to the camera coordinates, we can calculate the location \((X_c, Y_c, Z_c)\) of the point \(P\) in the camera coordinate system using the following equation.

\[
\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = \mathbf{R} \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} + \mathbf{t} \qquad (1)
\]
In expanded form, the above equation looks like this
\[
\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} =
\begin{bmatrix} r_{00} & r_{01} & r_{02} & t_x \\ r_{10} & r_{11} & r_{12} & t_y \\ r_{20} & r_{21} & r_{22} & t_z \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \qquad (2)
\]
If you have ever taken a Linear Algebra class, you will recognize that if we knew a sufficient number of point correspondences ( i.e. \((X_c, Y_c, Z_c)\) and \((X, Y, Z)\) ), the above is a linear system of equations where the \(r_{ij}\) and \(( t_x, t_y, t_z )\) are unknowns and you can trivially solve for the unknowns.
As you will see in the next section, we know \((X_c, Y_c, Z_c)\) only up to an unknown scale, and so we do not have a simple linear system.
Direct Linear Transform
We do know many points on the 3D model ( i.e. \((X, Y, Z)\) ), but we do not know \((X_c, Y_c, Z_c)\). We only know the location of the 2D points ( i.e. \((x, y)\) ). In the absence of radial distortion, the coordinates \((x, y)\) of point \(p\) in the image coordinates are given by
\[
s \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} =
\begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} \qquad (3)
\]
where \(f_x\) and \(f_y\) are the focal lengths in the x and y directions, and \(( c_x, c_y )\) is the optical center. Things get slightly more complicated when radial distortion is involved, and for the purpose of simplicity I am leaving it out.
What about that \(s\) in the equation ? It is an unknown scale factor. It exists in the equation due to the fact that in any image we do not know the depth. If you join any point \(P\) in 3D to the center \(O\) of the camera, the point \(p\), where the ray intersects the image plane, is the image of \(P\). Note that all the points along the ray joining the center of the camera and point \(P\) produce the same image. In other words, using the above equation, you can only obtain \((X_c, Y_c, Z_c)\) up to a scale \(s\).
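Here is a small sketch of what equation (3) does once a point is in camera coordinates: the scale \(s\) is simply the depth \(Z_c\), so dividing by it and applying the focal lengths and optical center gives the pixel coordinates ( the numbers below are made up ):

```python
import numpy as np

def project_point(P_camera, fx, fy, cx, cy):
    # Pinhole projection of a point in camera coordinates to pixel coordinates.
    # The unknown scale s in equation (3) is the depth Z_c.
    Xc, Yc, Zc = P_camera
    return fx * Xc / Zc + cx, fy * Yc / Zc + cy

print(project_point(np.array([10.0, -5.0, 1000.0]), fx=640, fy=640, cx=320, cy=240))
```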
Now this messes up equation 2 because it is no longer the nice linear equation we know how to solve. Our equation looks more like
\[
s \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} =
\begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} r_{00} & r_{01} & r_{02} & t_x \\ r_{10} & r_{11} & r_{12} & t_y \\ r_{20} & r_{21} & r_{22} & t_z \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \qquad (4)
\]
Fortunately, equations of the above form can be solved with some algebraic wizardry using a method called Direct Linear Transform (DLT). You can use DLT any time you find a problem where the equation is almost linear but is off by an unknown scale.
Levenberg-Marquardt Optimization
The DLT solution mentioned above is not very accurate for the following reasons. First, rotation has three degrees of freedom but the matrix representation used in the DLT solution has 9 numbers; there is nothing in the DLT solution that forces the estimated 3×3 matrix to be a rotation matrix. More importantly, the DLT solution does not minimize the correct objective function. Ideally, we want to minimize the reprojection error that is described below.
As shown in equations 2 and 3, if we knew the right pose ( \(\mathbf{R}\) and \(\mathbf{t}\) ), we could predict the 2D locations of the 3D facial points on the image by projecting the 3D points onto the 2D image. In other words, if we knew \(\mathbf{R}\) and \(\mathbf{t}\) we could find the point \(p\) in the image for every 3D point \(P\).
We also know the 2D facial feature points ( using Dlib or manual clicks ). We can look at the distance between projected 3D points and 2D facial features. When the estimated pose is perfect, the 3D points projected onto the image plane will line up almost perfectly with the 2D facial features. When the pose estimate is incorrect, we can calculate a re-projection error measure — the sum of squared distances between the projected 3D points and 2D facial feature points.
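A minimal sketch of how that reprojection error could be computed for a candidate pose; it assumes `model_points`, `image_points`, `camera_matrix` and `dist_coeffs` are defined as in the full example further down:

```python
import cv2
import numpy as np

def reprojection_error(rvec, tvec, model_points, image_points, camera_matrix, dist_coeffs):
    # Project the 3D model points with the candidate pose and compare them
    # to the observed 2D facial feature points.
    projected, _ = cv2.projectPoints(model_points, rvec, tvec, camera_matrix, dist_coeffs)
    projected = projected.reshape(-1, 2)
    return np.sum((projected - image_points) ** 2)
```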
As mentioned earlier, an approximate estimate of the pose ( \(\mathbf{R}\) and \(\mathbf{t}\) ) can be found using the DLT solution. A naive way to improve the DLT solution would be to randomly change the pose ( \(\mathbf{R}\) and \(\mathbf{t}\) ) slightly and check if the reprojection error decreases. If it does, we can accept the new estimate of the pose. We can keep perturbing \(\mathbf{R}\) and \(\mathbf{t}\) again and again to find better estimates. While this procedure will work, it will be very slow. Turns out there are principled ways to iteratively change the values of \(\mathbf{R}\) and \(\mathbf{t}\) so that the reprojection error decreases. One such method is called Levenberg-Marquardt optimization. Check out more details on Wikipedia.
OpenCV solvePnP
In OpenCV the functions solvePnP and solvePnPRansac can be used to estimate pose.
solvePnP implements several algorithms for pose estimation which can be selected using the flags parameter. By default it uses SOLVEPNP_ITERATIVE, which is essentially the DLT solution followed by Levenberg-Marquardt optimization. SOLVEPNP_P3P uses a minimal set of points to calculate the pose ( the function requires exactly four point correspondences: three to compute candidate poses and a fourth to disambiguate between them ), and it should be used only together with solvePnPRansac.
In OpenCV 3, two new methods have been introduced — SOLVEPNP_DLS and SOLVEPNP_UPNP. The interesting thing about SOLVEPNP_UPNP is that it tries to estimate camera internal parameters also.
C++: bool solvePnP(InputArray objectPoints, InputArray imagePoints, InputArray cameraMatrix, InputArray distCoeffs, OutputArray rvec, OutputArray tvec, bool useExtrinsicGuess=false, int flags=SOLVEPNP_ITERATIVE )
Python: cv2.solvePnP(objectPoints, imagePoints, cameraMatrix, distCoeffs[, rvec[, tvec[, useExtrinsicGuess[, flags]]]]) → retval, rvec, tvec
Parameters:
objectPoints – Array of object points in the world coordinate space. I usually pass a vector of N 3D points. You can also pass a Mat of size Nx3 ( or 3xN ) single-channel matrix, or Nx1 ( or 1xN ) 3-channel matrix. I would highly recommend using a vector instead.
imagePoints – Array of corresponding image points. You should pass a vector of N 2D points. But you may also pass 2xN ( or Nx2 ) 1-channel or 1xN ( or Nx1 ) 2-channel Mat, where N is the number of points.
cameraMatrix – Input camera matrix \(A = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}\). Note that \(f_x\) and \(f_y\) can be approximated by the image width in pixels under certain circumstances, and \(c_x\) and \(c_y\) can be the coordinates of the image center.
distCoeffs – Input vector of distortion coefficients \(( k_1, k_2, p_1, p_2 [, k_3 [, k_4, k_5, k_6 ], [ s_1, s_2, s_3, s_4 ]] )\) of 4, 5, 8 or 12 elements. If the vector is NULL/empty, the zero distortion coefficients are assumed. Unless you are working with a GoPro-like camera where the distortion is huge, we can simply set this to NULL. If you are working with a lens with high distortion, I recommend doing a full camera calibration.
rvec – Output rotation vector.
tvec – Output translation vector.
useExtrinsicGuess – Parameter used for SOLVEPNP_ITERATIVE. If true (1), the function uses the provided rvec and tvec values as initial approximations of the rotation and translation vectors, respectively, and further optimizes them.
flags –
Method for solving a PnP problem:
SOLVEPNP_ITERATIVE Iterative method is based on Levenberg-Marquardt optimization. In this case, the function finds such a pose that minimizes reprojection error, that is the sum of squared distances between the observed projections imagePoints and the projected (using projectPoints() ) objectPoints .
SOLVEPNP_P3P Method is based on the paper of X.S. Gao, X.-R. Hou, J. Tang, H.-F. Chang “Complete Solution Classification for the Perspective-Three-Point Problem”. In this case, the function requires exactly four object and image points.
SOLVEPNP_EPNP Method has been introduced by F.Moreno-Noguer, V.Lepetit and P.Fua in the paper “EPnP: Efficient Perspective-n-Point Camera Pose Estimation”.
The flags below are only available for OpenCV 3
SOLVEPNP_DLS Method is based on the paper of Joel A. Hesch and Stergios I. Roumeliotis. “A Direct Least-Squares (DLS) Method for PnP”.
SOLVEPNP_UPNP Method is based on the paper of A.Penate-Sanchez, J.Andrade-Cetto, F.Moreno-Noguer. “Exhaustive Linearization for Robust Camera Pose and Focal Length Estimation”. In this case the function also estimates the parameters f_x and f_y assuming that both have the same value. Then the cameraMatrix is updated with the estimated focal length.
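Switching between these methods is just a matter of passing a different flag; a quick sketch, assuming the same inputs as the example further down ( flag availability depends on your OpenCV version ):

```python
import cv2

# Use EPnP instead of the default iterative method; SOLVEPNP_DLS / SOLVEPNP_UPNP
# can be selected the same way on OpenCV 3 and later.
success, rotation_vector, translation_vector = cv2.solvePnP(
    model_points, image_points, camera_matrix, dist_coeffs, flags=cv2.SOLVEPNP_EPNP)
```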
OpenCV solvePnPRansac
solvePnPRansac is very similar to solvePnP except that it uses Random Sample Consensus ( RANSAC ) for robustly estimating the pose.
Using RANSAC is useful when you suspect that a few data points are extremely noisy. For example, consider the problem of fitting a line to 2D points. This problem can be solved using linear least squares where the distance of all points from the fitted line is minimized. Now consider one bad data point that is wildly off. This one data point can dominate the least squares solution and our estimate of the line would be very wrong. In RANSAC, the parameters are estimated by randomly selecting the minimum number of points required. In a line fitting problem, we randomly select two points from all data and find the line passing through them. Other data points that are close enough to the line are called inliers. Several estimates of the line are obtained by randomly selecting two points, and the line with the maximum number of inliers is chosen as the correct estimate.
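To make the RANSAC idea concrete, here is a toy line-fitting sketch ( for illustration only; it is not how solvePnPRansac is implemented internally ):

```python
import numpy as np

def ransac_line(points, n_iterations=100, inlier_threshold=1.0):
    # points: Nx2 array of (x, y). Repeatedly fit a line to two random points
    # and keep the line supported by the largest number of inliers.
    best_line, best_inliers = None, 0
    for _ in range(n_iterations):
        p1, p2 = points[np.random.choice(len(points), 2, replace=False)]
        # Line through p1 and p2 in the form a*x + b*y + c = 0.
        a, b = p2[1] - p1[1], p1[0] - p2[0]
        c = -(a * p1[0] + b * p1[1])
        norm = np.hypot(a, b)
        if norm == 0:
            continue
        distances = np.abs(points @ np.array([a, b]) + c) / norm
        inliers = np.sum(distances < inlier_threshold)
        if inliers > best_inliers:
            best_line, best_inliers = (a, b, c), inliers
    return best_line, best_inliers
```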
The usage of solvePnPRansac is shown below and parameters specific to solvePnPRansac are explained.
Python: cv2.solvePnPRansac(objectPoints, imagePoints, cameraMatrix, distCoeffs[, rvec[, tvec[, useExtrinsicGuess[, iterationsCount[, reprojectionError[, minInliersCount[, inliers[, flags]]]]]]]]) → rvec, tvec, inliers
iterationsCount – The number of times the minimum number of points are picked and the parameters estimated.
reprojectionError – As mentioned earlier in RANSAC the points for which the predictions are close enough are called “inliers”. This parameter value is the maximum allowed distance between the observed and computed point projections to consider it an inlier.
minInliersCount – Number of inliers. If the algorithm at some stage finds more inliers than minInliersCount , it finishes.
inliers – Output vector that contains indices of inliers in objectPoints and imagePoints .
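Apart from these extra parameters, the call looks just like solvePnP; a short sketch with the inputs from the example below ( note that in OpenCV 3 and later the Python binding also returns a success flag as the first value, while OpenCV 2.4 returns only rvec, tvec and inliers ):

```python
import cv2

# OpenCV 3+ signature shown here.
success, rvec, tvec, inliers = cv2.solvePnPRansac(
    model_points, image_points, camera_matrix, dist_coeffs,
    iterationsCount=100, reprojectionError=8.0)
```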
OpenCV POSIT
OpenCV used to have a pose estimation algorithm called POSIT. It is still present in the C API ( cvPOSIT ), but is not part of the C++ API. POSIT assumes a scaled orthographic camera model and therefore you do not need to supply a focal length estimate. This function is now obsolete and I would recommend using one of the algorithms implemented in solvePnP.
OpenCV Pose Estimation Code : C++ / Python
In this section, I have shared example code in C++ and Python for head pose estimation in a single image. You can download the image headPose.jpg here.
The locations of the facial feature points are hard coded, and if you want to use your own image, you will need to change the vector image_points.
C++
#include <opencv2/opencv.hpp>
using namespace std;
using namespace cv;
int main(int argc, char **argv)
{
// Read input image
cv::Mat im = cv::imread("headPose.jpg");
// 2D image points. If you change the image, you need to change vector
std::vector<cv::Point2d> image_points;
image_points.push_back( cv::Point2d(359, 391) ); // Nose tip
image_points.push_back( cv::Point2d(399, 561) ); // Chin
image_points.push_back( cv::Point2d(337, 297) ); // Left eye left corner
image_points.push_back( cv::Point2d(513, 301) ); // Right eye right corner
image_points.push_back( cv::Point2d(345, 465) ); // Left Mouth corner
image_points.push_back( cv::Point2d(453, 469) ); // Right mouth corner
// 3D model points.
std::vector<cv::Point3d> model_points;
model_points.push_back(cv::Point3d(0.0f, 0.0f, 0.0f)); // Nose tip
model_points.push_back(cv::Point3d(0.0f, -330.0f, -65.0f)); // Chin
model_points.push_back(cv::Point3d(-225.0f, 170.0f, -135.0f)); // Left eye left corner
model_points.push_back(cv::Point3d(225.0f, 170.0f, -135.0f)); // Right eye right corner
model_points.push_back(cv::Point3d(-150.0f, -150.0f, -125.0f)); // Left Mouth corner
model_points.push_back(cv::Point3d(150.0f, -150.0f, -125.0f)); // Right mouth corner
// Camera internals
double focal_length = im.cols; // Approximate focal length.
Point2d center = cv::Point2d(im.cols/2,im.rows/2);
cv::Mat camera_matrix = (cv::Mat_<double>(3,3) << focal_length, 0, center.x, 0 , focal_length, center.y, 0, 0, 1);
cv::Mat dist_coeffs = cv::Mat::zeros(4,1,cv::DataType<double>::type); // Assuming no lens distortion
cout << "Camera Matrix " << endl << camera_matrix << endl ;
// Output rotation and translation
cv::Mat rotation_vector; // Rotation in axis-angle form
cv::Mat translation_vector;
// Solve for pose
cv::solvePnP(model_points, image_points, camera_matrix, dist_coeffs, rotation_vector, translation_vector);
// Project a 3D point (0, 0, 1000.0) onto the image plane.
// We use this to draw a line sticking out of the nose
vector<Point3d> nose_end_point3D;
vector<Point2d> nose_end_point2D;
nose_end_point3D.push_back(Point3d(0,0,1000.0));
projectPoints(nose_end_point3D, rotation_vector, translation_vector, camera_matrix, dist_coeffs, nose_end_point2D);
for(int i=0; i < image_points.size(); i++)
{
circle(im, image_points[i], 3, Scalar(0,0,255), -1);
}
cv::line(im,image_points[0], nose_end_point2D[0], cv::Scalar(255,0,0), 2);
cout << "Rotation Vector " << endl << rotation_vector << endl;
cout << "Translation Vector" << endl << translation_vector << endl;
cout << nose_end_point2D << endl;
// Display image.
cv::imshow("Output", im);
cv::waitKey(0);
}
Python
#!/usr/bin/env python
import cv2
import numpy as np
# Read Image
im = cv2.imread("headPose.jpg")
size = im.shape
#2D image points. If you change the image, you need to change vector
image_points = np.array([
(359, 391), # Nose tip
(399, 561), # Chin
(337, 297), # Left eye left corner
(513, 301), # Right eye right corner
(345, 465), # Left Mouth corner
(453, 469) # Right mouth corner
], dtype="double")
# 3D model points.
model_points = np.array([
(0.0, 0.0, 0.0), # Nose tip
(0.0, -330.0, -65.0), # Chin
(-225.0, 170.0, -135.0), # Left eye left corner
(225.0, 170.0, -135.0), # Right eye right corner
(-150.0, -150.0, -125.0), # Left Mouth corner
(150.0, -150.0, -125.0) # Right mouth corner
])
# Camera internals
focal_length = size[1]
center = (size[1]/2, size[0]/2)
camera_matrix = np.array(
[[focal_length, 0, center[0]],
[0, focal_length, center[1]],
[0, 0, 1]], dtype = "double"
)
print "Camera Matrix :\n {0}".format(camera_matrix)
dist_coeffs = np.zeros((4,1)) # Assuming no lens distortion
(success, rotation_vector, translation_vector) = cv2.solvePnP(model_points, image_points, camera_matrix, dist_coeffs, flags=cv2.CV_ITERATIVE)
print "Rotation Vector:\n {0}".format(rotation_vector)
print "Translation Vector:\n {0}".format(translation_vector)
# Project a 3D point (0, 0, 1000.0) onto the image plane.
# We use this to draw a line sticking out of the nose
(nose_end_point2D, jacobian) = cv2.projectPoints(np.array([(0.0, 0.0, 1000.0)]), rotation_vector, translation_vector, camera_matrix, dist_coeffs)
for p in image_points:
cv2.circle(im, (int(p[0]), int(p[1])), 3, (0,0,255), -1)
p1 = ( int(image_points[0][0]), int(image_points[0][1]))
p2 = ( int(nose_end_point2D[0][0][0]), int(nose_end_point2D[0][0][1]))
cv2.line(im, p1, p2, (255,0,0), 2)
# Display image
cv2.imshow("Output", im)
cv2.waitKey(0)
Real time pose estimation using Dlib
The video included in this post was made using my fork of dlib, which is freely available to subscribers of this blog. If you have already subscribed, please check the welcome email for the link to my dlib fork and check out this file
dlib/examples/webcam_head_pose.cpp
If you have not subscribed yet, please do so in the section below
Great article here! Found a small mistake though. Solvepnp’s P3P method takes not 3, but 4 points, including the origin of the model.
And can you explain further regarding why you recommend using P3P only with ransac?
Thanks. P3P uses the minimum number of points and not all points and therefore the estimates can be noisy. RANSAC provides the robustness against noise by sampling the minimum number of points multiple times and selecting the model that has the maximum number of inliers.
Thanks for replying. I had tried to use P3P with RANSAC sometime back, but wasn’t able to get good results. Iterative was good compared to using P3P with RANSAC. Maybe the parameters I used were wrong. I experimented with default parameters as well as some custom params. Didnt seem to give a good output. If you were able to make RANSAC work well, can you post them too?
In most cases where the noise is small and the number of 3D to 2D matches is small, the iterative method will work better. People use RANSAC when there is a large amount of noise but they have a large number of matches. Imagine you have a 3D model of an arbitrary scene with a texture map and you are using SIFT to match features. You can get hundreds of 3D to 2D matches in such applications, but a lot ( say 30-40% ) of the matches will be incorrect. In such cases the iterative method will fail miserably and RANSAC will do a very reasonable job.
Hi Satya,
Can i use this to create a 3d mesh on the face, and could i also use this for eye blink detection?
Thanks,
No. This just gives the direction in which the face is looking.
Hey Satya, thanks for your reply. Any existing code that does such a thing, maybe in dlib?
dlib will allow you to track 68 points on the face which you can triangulate to create a rough 2D mesh. There are a few techniques for calculating 3D mesh (e.g. 3D morphable model), but I don’t know one that is implemented in a library like opencv or dlib.
In OpenCV 3.1.0 on Raspberry Pi 3, I removed flags=cv2.CV_ITERATIVE.
It worked once I removed flags=cv2.CV_ITERATIVE.
Thanks, Mallick
That is odd. Unfortunately, I don’t have a way to quickly test. But if someone else also points this out, I will change the code.
This is for only Raspberry Pi 3. Not pc. Usually, I used Raspberry pi 3 all times.
In Python 2/3, why did u used semi-colons?
U don;t needed semi-colon @ the end of brace brackets
Sorry that was a typo. Semi-colons are not needed. Fixed.
Thanks Dr. Satya Mallick !! I get interest to read your all the posts. The posts are very informative and clears each and every detail in minimum words. Still I have not implemented the work you shared. But, once I will implement it, definitely my interest in OpenCV will increase more..
Thanks for the kind words.
I want to learn OpenCV by implementing your work. I prefer the Ubuntu platform. Let me know where i get good materials for preliminary stage.
Dear Mallick, thank you for sharing your knowledge……i tried the code, no compile or run time error, but the algorithm is not detecting any thing and is very very slow….i have enabled SSE2, SSE4 and AVX but no results….when i tried the webcam_face_pose_ex from Dlib it works perfectly…..I appreciate any help from your side, as in your video the algorithm works fine and fast
The bottleneck is the face detector, requires so much time….resizing and using your customized face rendering didn’t solve the problem……Do you have any hint ? is it possible to use opencv face detector instead ? (my PC is modern with i7 processor)…thanks
Here are a few suggestions to speed up dlib. Hope this helps.
https://learnopencv.com/speeding-up-dlib-facial-landmark-detector/
the instructions in this link are already implemented in your code (resizing, faster rendering) but no results…..I have used the opencv face detector instead and now its working correctly but at 7 fps only….would you please tell me what was your frame speed including everything (detection and pose estimation)…thank you so much again for your assistance
Try commenting out the following line in the example code and run in release configuration.
if ( count % 15 == 0)
i want to use openCV and Dlib in one python script. i want to detect faces thorough dlib and recognize them using fisher faces algorithm. is it possible?
detection and recognition both are real time.
Yes detection will easily be real time using either Dlib or OpenCV versions. I am not 100% sure if recognition will work in real time, but you can do recognition every nth frame.
thankyou… i want to ask one more thing i want to align the dataset images for the recognition i am using the code from https://github.com/bytefish/facerecognition_guide/blob/master/src/py/crop_face.py but it is not aligning them properly can you identify the mistake
i want to save the detected face in dlib by cropping the rectangle do you have any idea how can i crop it. i am using dlib first time and having so many problems. i also want to run the fisherface algorithm on the detected faces but it is giving me type error.
i seriously need help in this issue.
Please see the reply above.
Thanks for sharing this. I have a few doubts. Firstly what is that rotation vector i get as output from solvePNP, also how can i get a full 3×4 projection matrix which can take my 3d points to 2d from this?
Rotation vector is just a way to represent rotation in axis-angle form. To convert it to matrix form, you can use Rodrigues formula. OpenCV has an implementation here
http://docs.opencv.org/2.4/modules/calib3d/doc/camera_calibration_and_3d_reconstruction.html#void Rodrigues(InputArray src, OutputArray dst, OutputArray jacobian)
3×4 projection matrix is simply rotation and translation concatenated as the fourth column.
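A small sketch of both steps, assuming rvec ( rotation_vector ) and tvec ( translation_vector ) come from solvePnP as in the code above:

```python
import cv2
import numpy as np

R, _ = cv2.Rodrigues(rvec)               # 3x1 rotation vector -> 3x3 rotation matrix
Rt = np.hstack((R, tvec.reshape(3, 1)))  # 3x4 extrinsic matrix [ R | t ]
P = camera_matrix @ Rt                   # 3x4 projection matrix = intrinsics * extrinsics
```

Multiplying a homogeneous 3D point ( X, Y, Z, 1 ) by P and dividing by the last coordinate reproduces what cv2.projectPoints does, ignoring lens distortion.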
Thanks, moreover is there some way to get the full projection matrix, which can transform the 3d model points to the 2d points in the image, which i believe is being used inside the cv2.projectPoints function
Thanks for sharing this project. However, for me it is quite noisy. I calculated the pitch, which sometimes jumps by 30 degrees, especially when my face is frontal. Did you try adding points close to the ear? Or are they generally unreliable?
Any idea, how to make this more robust and accurate?
Yes you can try a few points near the ears. Unfortunately, the location of those points as returned by Dlib is not very reliable because they are not as nicely defined as other facial features. You may also try adding Kalman Filtering which will help smooth out noisy fluctuations in pose estimation. E.g. checkout this tutorial
http://docs.opencv.org/trunk/dc/d2c/tutorial_real_time_pose.html
i want to save the detected face in dlib by cropping the rectangle do
you have any idea how can i crop it. i am using dlib first time and
having so many problems. i also want to run the fisherface algorithm on
the detected faces but it is giving me type error.
i seriously need help in this issue.
In the dlib code for tracking landmarks, you will notice that faces are detected first. It saves the detected rectangles in a variable called “faces” which is a vector of rectangles. You can get the cropped image for a face
To get one face rectangle in OpenCV cv::Rect format using
cv::Rect r(faces[i].left(), faces[i].top(), faces[i].width(), faces[i].height());
You can use the above rectangle to crop out the face from the image im using
Mat imFace = im(r);
Hi Satya,
There is a mistake in left eye 3d coords in the text (“Left corner of the left eye : ( 0.0, 0.0, 0.0)”).
In the code they are (-225.0, 170.0, -135.0) which seem to be correct.
Thank you so much. I have fixed the mistake.
hi, Satya, I try to use some other points to calculate pose, could you please tell me where you get these 3d coords?
hi, Max, I try to use some other points to calculate pose, could you please tell me where I can get other landmarks 3d coords?
Hello Satya,
I was wondering how i can get the 3D model points in real time (like i can see in your video with the vector that comes from your nose).
Thanks
If you look at the code, I have put a 3D point some distance from the nose in the 3D model. I simply project this point onto this image plane using the estimated rotation and translation.
Hey there Mister Satya!
Great job with all the tutorials and explanation. I wanna do the pose calculation by myself from scratch. I understand the method, the only thing that keeps me away is that i dont know how to extract only 6 landmarks, instead of 68. I’ve checked the code so many times, the dlib/opencv indexes too. I really need some help, im stucked… I uploaded the code too. Maybe u can give me a fast advice, i know ur time is precious! Thanks a lot! https://uploads.disquscdn.com/images/1fe9db819b1280342fd63a55b92d4b6486cde5b8e6235979dd752642bcd8f646.png
Hi, this is really a fantastic blog. But I’m wondering what is the measure of the image coordinate and the world coordinate? Are they pixel and millimeter?
If you look at my version of dlib, you will see the indices of 6 points. I have shared the C++ code below.
std::vector get_2d_image_points(full_object_detection &d)
{
std::vector image_points;
image_points.push_back( cv::Point2d( d.part(30).x(), d.part(30).y() ) ); // Nose tip
image_points.push_back( cv::Point2d( d.part(8).x(), d.part(8).y() ) ); // Chin
image_points.push_back( cv::Point2d( d.part(36).x(), d.part(36).y() ) ); // Left eye left corner
image_points.push_back( cv::Point2d( d.part(45).x(), d.part(45).y() ) ); // Right eye right corner
image_points.push_back( cv::Point2d( d.part(48).x(), d.part(48).y() ) ); // Left Mouth corner
image_points.push_back( cv::Point2d( d.part(54).x(), d.part(54).y() ) ); // Right mouth corner
return image_points;
}
很不错的样子
Translated :
Looks very good.
I run the program in Xcode, but it's much slower than the compiled webcam_head_pose.
Are you sure you are compiling release mode ? Check this out
http://dlib.net/faq.html#Whyisdlibslow
Dear Satya, thanks for sharing this post and explaining it. I am interested in developing gaze estimation program. It can estimate the center of pupil. In other words, I have the point of the center of pupil. How can I estimate gaze on computer like head pose estimation ? Thanks a lot.
Hi Tolga,
You will have to detect the center of the pupils first. Dlibs landmark detector does not detect it, but it is possible to do so by retraining a landmark detector with your own data that contains the center of the eyes. In fact, in a few weeks I plan to release a model with the pupil center.
Satya
Hi Mr. Satya, thank you for this tutorial. I am also trying to estimate the gaze. I already achieve pupil detection. Now I am trying to determine the gaze pose. I read some articles that uses the similar technique you use in this tutorial, modelling an eye; however I don’t know where to find the reference 3D points values of an adult eye. I would be glad if you could help me with this or recommend me some papers to read. Best Regards, Moises
I really thank this article.
I’m so sorry but is there an example of webcam_head_pose in python?
I watched this and tried to code it in python but I couldn't do it
dlib/examples/webcam_head_pose.cpp
Sorry, I don’t have a python version currently. But if you follow the logic in the C++ code, you will be able to write your own. There are not many lines of code.
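For anyone who wants a starting point, here is a rough Python sketch of such a loop ( not the exact code from my fork ); it assumes dlib's 68-point model file shape_predictor_68_face_landmarks.dat is available, uses the same six landmark indices ( 30, 8, 36, 45, 48, 54 ) as the C++ example, and approximates the camera internals from the frame size as in the article:

```python
import cv2
import dlib
import numpy as np

# Assumed to be downloaded separately from dlib.net
PREDICTOR_PATH = "shape_predictor_68_face_landmarks.dat"

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor(PREDICTOR_PATH)

# Same generic 3D model points as in the article.
model_points = np.array([
    (0.0, 0.0, 0.0),           # Nose tip (landmark 30)
    (0.0, -330.0, -65.0),      # Chin (landmark 8)
    (-225.0, 170.0, -135.0),   # Left eye left corner (landmark 36)
    (225.0, 170.0, -135.0),    # Right eye right corner (landmark 45)
    (-150.0, -150.0, -125.0),  # Left mouth corner (landmark 48)
    (150.0, -150.0, -125.0)    # Right mouth corner (landmark 54)
])

cap = cv2.VideoCapture(0)
while True:
    ret, frame = cap.read()
    if not ret:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for face in detector(gray, 0):
        shape = predictor(gray, face)
        image_points = np.array([(shape.part(i).x, shape.part(i).y)
                                 for i in (30, 8, 36, 45, 48, 54)], dtype="double")
        # Approximate camera internals from the frame size, as in the article.
        h, w = frame.shape[:2]
        camera_matrix = np.array([[w, 0, w / 2], [0, w, h / 2], [0, 0, 1]], dtype="double")
        dist_coeffs = np.zeros((4, 1))
        ok, rvec, tvec = cv2.solvePnP(model_points, image_points,
                                      camera_matrix, dist_coeffs,
                                      flags=cv2.SOLVEPNP_ITERATIVE)
        # Draw the direction the nose points in, exactly as in the still-image example.
        nose_end, _ = cv2.projectPoints(np.array([(0.0, 0.0, 1000.0)]),
                                        rvec, tvec, camera_matrix, dist_coeffs)
        p1 = (int(image_points[0][0]), int(image_points[0][1]))
        p2 = (int(nose_end[0][0][0]), int(nose_end[0][0][1]))
        cv2.line(frame, p1, p2, (255, 0, 0), 2)
    cv2.imshow("Head Pose", frame)
    if cv2.waitKey(1) & 0xFF == 27:  # Esc to quit
        break
cap.release()
cv2.destroyAllWindows()
```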
Thanks Satya for this amazing tutorial. I would like to get you advice on how to reduce jitter resulted from pose matrix when used in augmented reality.
Thanks for the kind words Mohammed.
One option is to smooth out jitter by calculating the moving average of the points over multiple frames ( say plus and minus 2 frames ).
You can average the rotation / translation directly. Be careful while averaging rotation matrices — it is not straightforward. You may find this discussion helpful
http://stackoverflow.com/questions/12374087/average-of-multiple-quaternions
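A possible sketch of such smoothing, averaging translations directly and rotations via scipy's Rotation class ( using scipy here is my assumption; the thread above discusses quaternion averaging in more depth ):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def smooth_pose(rvecs, tvecs):
    # rvecs, tvecs: lists of recent rotation / translation vectors (e.g. last 5 frames).
    # Translations can be averaged directly; rotations are averaged with Rotation.mean(),
    # which handles the quaternion averaging internally.
    mean_t = np.mean(np.asarray(tvecs).reshape(len(tvecs), 3), axis=0)
    mean_R = Rotation.from_rotvec(np.asarray(rvecs).reshape(len(rvecs), 3)).mean()
    return mean_R.as_rotvec(), mean_t
```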
Satya
Thanks for your prompt reply.
i was thinking of converting the rotation matrix to quaternion, average it and then back to rotation matrix. will this work?
Hi! This is really a fantastic blog. I’m wondering what is the measure of the image coordinate and the world coordinate respectively? Are they pixel and millimeter?
Thank you!
The image coordinates are in pixels, but the world coordinates in are arbitrary units. You can produce the world coordinates using real measurements in millimeter or inches etc, or it could be just the coordinates in some arbitrary 3D model.
Wow that was a very fast reply! Thank you for your answer. Can I interpret your answer as the units of the world coordinates actually does not matter in computation as long as we keep the consistency of the measure of each point in 3D model?
Yes that’s right.
Hi Satya, does the higher number of model points affect the precision of the estimated pose matrix?
Yes, the pose estimate can be made better with more points. Also, if you could have some points on the ears etc. , the pose estimate will be more stable.
sir can i know what are the algorithms used here to estimate the pose?
Hi Satya! Is there any functions in OpenCV or any other libraries that I can use to find the rotation 3×3 matrix R and the translation matrix t when given the intrinsic camera matrix, the 2D image points and their corresponding 3D model points? Or I have to implement the wheel to find the extrinsic camera matrix in this scenario?
Hi Zongchang,
Yes, solvePnP does precisely that :).
Satya
Hi Satya, thank you for your quick reply again! But when I run this code, the rotation vector rvec returned is actually a 3×1 column vector. I don’t think that is the 3×3 rotation matrix that I actually want.
They are both the same rotation expressed differently. Look for openCV documentation on Rodrigues to convert one form to other
I see. It really really helps! Thank you so much!
Hey Satya, I was trying to do just a face recognition using dlib and standard face landmark from their site, it seems like the features and matching are not rotation invariant, I was wondering if you have any ideas how to make the face recognition rotation invariant with dlib?
Hi Alexey, Face Recognition usually means identifying who the person is. Landmark detection can be used as a preprocessing step in face recognition for alignment. Does that make sense ?
Thanks for replying. I mean face detection phase inside dlib, seems like the landmark detection is not rotation invariant, so when rotate the camera like 90 degrees it doesn’t detects a face. Maybe you have some thoughts where to look to fix that. Thank you.
If you’re using the “.dat” file that came with dlib then it’s limited to detecting the 68 facial landmarks that it was trained on. To get it to detect a profile view you’ll have to create a new “.dat” file using several photographs of people with their faces turned 90 degrees to the camera. In the /tools directory you should find imglab which helps you do this.
Thank you so much Satya Sir for your wonderful tutorials. They helped me a lot to learn OpenCV and create my projects. Until now I have implemented pose estimation with solvePnP as you explained above. But as far as I understood, the camera is fixed in this scenario. If both camera and target are moving, then how will it be possible to detect the pose of the camera w.r.t. the target? My camera has 2 degrees of freedom (pitch, yaw). Do you have any suggestions? I am thinking of estimating a homography matrix from point matching between changes of pose and somehow adding that to the rvec and tvec? Any suggestions? Thank you, Siddhant Mehta
Hi Siddhant. If the camera moves you get the relative orientation of the object w.r.t the camera. But I guess you are asking how do you recover camera motion. For that you have to look at static parts of the scene, find point correspondences. If the point correspondences come from a plane ( e.g. the floor or one wall ) you can estimate Homography and decompose it into R and t. Otherwise, you need to estimate the Essential Matrix / Fundamental Matrix. BTW if you are doing this to learn, go ahead and implement these yourself. But if you are using it in a real world project check out VisualSFM, Theia, and OpenMVG.
Hi Satya and thank you for your tutorial. It is very useful for me. I would have one question to ask about swap face. I would like to do an android application about putting model’s face to a user’s face so that they can see the result for applying the cosmetic in our application. I have watched your tutorial (face swap and face morph ). Which one do you think is more suitable and can I swap their face feature and without change their face size and hair style? Because I saw that the face shape would be changed in the face swap tutorial. Thank you very much.
Thanks. Both of those are not actually good for applying makeup. For makeup the technique is very different and each makeup element is rendered differently. You can try to look at something I did at my previous company ( http://www.taaz.com ).
If I am doing a college assignment, which one do you think would be more suitable?
This tutorial shows how to unproject 2D points to 3D points, which is a somewhat interesting optimization/fitting problem, but to have a working solution, the important bit is finding where the feature points are in the faces in the input images — corners of eyes, nose tip, mouth, etc. I can’t find any code in your github that actually calls the opencv face detect functions — there are just files with hard-coded point locations as input. How did you generate these input files?
That is done using dlib. This is the file you need.
https://github.com/spmallick/dlib/blob/master/examples/webcam_head_pose.cpp
and here is the compilation instruction
http://dlib.net/compile.html
Satya,
Great post. Do you have a suggestion as to how to derive 3D coordinate locations for the landmark features themselves, rather than the entire head?
The approaches I can think of, using a simple mesh of a generic head:
1. Raycast from a given 2D landmark position to the head mesh model and calculate the point position where the ray intersects.
2. Render a position map of the head mesh from the camera POV (i.e. render the XYZ coordinates of the face model into the RG and B channels respectively), then retrieve the pixel value at the location of the landmark.
3. Render a depth matte of the head mesh and use its value paired with the XY screen coordinates of the landmark to derive the world XYZ from these.
Not sure it there’s a simpler approach, or which one of these would be the most efficient, since I haven’t dealt too much with rendering 3D objects via OpenCV.
Thanks Damian,
If you have a 3D triangulated mesh and you have found the head pose using the method I have described, you can project any point on the mesh to the image plane. Conversely, if you want to estimate the 3D location of a 2D point, you can transform the mesh into camera coordinates ( see figure in the article ), and shoot a ray from the camera center through the pixel location and see where it intersects the mesh ( in camera coordinates ). Obviously, it is possible the ray will intersect the mesh multiple times and so you need to choose the point closest to the camera.
Thanks,
YEs, I’m actually already putting a workflow together based on using your pose prediction to inform a more detailed mesh.
One other quick question: if you’re only concerned with landmark detection for a single actor, would it be better to train a model with multiple photos of their head in various orientations and lighting, rather than a variety of faces? If so, should they also be of various facial expressions? Forgive my ignorance of the training model; I’m a few levels of encapsulation away from wanting to understand the fine details neural network implementation…
Yes it would be better. In fact I have done something similar for a project. It is very difficult to find the same person under different lighting conditions. You also need to label all those images. So the best trick is to run the standard landmark detector on the person’s face, fix the points that are not accurate, and put these new images in the training set as well. 50% images of this person and 50% of random people will still bias the results toward this person’s face and also have sufficient variety in lighting etc. Hope that helps.
Yes indeed, thank you. I actually plan to add custom markers to the face and train those (i.e. dots at landmark positions like cheek bones, corner of mouth and above eyebrows). I imagine at that point using other faces would just confuse the results. And if you don’t mind me asking one more question: in the case of adding custom markers would the shape of the marks need to be unique, or would their proximity to facial features (e.g. a marker just to the side of a mouth corner) be sufficient for the training to see them as unique?
Hi Satya,
If you mean using the 2d landmark points that come from dlib and are therefore subject to skewing/scaling depending on perspective and head rotation (i.e. the points before and unrelated to pose estimation), wouldn’t this become increasingly more inaccurate with more pose rotation beyond zero, and translation away from the center of the camera image (e.g. if there is significant perspective distortion)?
What would the above method give, that isn’t already achieved by taking the 68 landmark points’ 2D camera-image coordinates, scaling them with respect to the target 3D coordinate system, giving a Z-plane of X Y positions, then translating and rotating this collection of points by the pose estimation matrix?
Or are we talking about estimating the 3D location of a 2D point that has had further transformation to take the perspective of the device camera into account?
Thanks
Hi Satya, Thank you for very good tutorial about dlib and opencv. I am beginner at c++ and I have some question to ask about webcam_head_pose.cpp as in code. My goal is to draw laser from eyes like Superman so I need to get eyes position from face. Is there anyway to get eyes position from it ? Thank you very much.
Thanks!
You will have to train your own dlib model that contains the center of the eyes. You can also use the points around the eyes to come up with a heuristic for the the location of the center of the pupil, but it won’t be very good.
Hi Satya,
Can we use the information determined from this, to get the location of a real world object from it’s pixel co-ordinates?
For example, I use an A4 paper to do the mentioned steps. Can I then use the translational vector, rotation vector, and my knowledge of the dimensions of the paper to get the real world location of a coin next to it?
For 3D you need two cameras. This just gives you the pose. The translation vector here does not correspond to real world. It is w.r.t the coordinate in which the 3D points are defined.
Hi,
My project is “Density Estimation of crowd”. Video is captured from a drone camera and i have to count number of heads. And i have to code in opencv python.can someone guide me please?
nice tutorial!! but is there a way to process using gpu
Hi. nice tutorial ..But its running slow on my system i.e. 30fps only. Also it detects only within a limited range. Outside that it simply doesn’t detect at all. Is this problem in actual system also or only my problem? How to increase speed further? i have set AVX instruction flag still no effect.
Thanks.
You can try some suggestions here
https://learnopencv.com/speeding-up-dlib-facial-landmark-detector/
I want the computer to know whether the user turns his head left, right, up or down. Thus, based on the pitch and yaw, can u provide some suggestions to let the computer learns itselft?
You may find this post useful https://learnopencv.com/rotation-matrix-to-euler-angles/
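As a rough sketch of that idea ( the axis labels and thresholds below are assumptions on my part; the signs depend on how your coordinate systems are set up, so verify them on your own data ):

```python
import cv2
import numpy as np

def head_direction(rvec, yaw_thresh=15.0, pitch_thresh=15.0):
    # Convert the rotation vector from solvePnP to Euler angles (degrees)
    # and classify the head direction with simple thresholds.
    R, _ = cv2.Rodrigues(rvec)
    sy = np.hypot(R[0, 0], R[1, 0])
    pitch = np.degrees(np.arctan2(R[2, 1], R[2, 2]))
    yaw = np.degrees(np.arctan2(-R[2, 0], sy))
    if yaw > yaw_thresh:
        return "left"    # or "right", depending on your axis conventions
    if yaw < -yaw_thresh:
        return "right"
    if pitch > pitch_thresh:
        return "down"
    if pitch < -pitch_thresh:
        return "up"
    return "center"
```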
Hi Satya. Great site, I’m learning a ton.
I would like to clarify my understanding of the assumptions made, and the preprocessing necessary. Firstly, the 2D image points, i.e. the 2D locations of the nose tip, chin etc., am I correct in assuming that they are the result of a facial landmark detector run beforehand?
Secondly, I did not understand clearly where the 3D model points where taken from, and how I would need to alter them for my own use?
Thanks in advance.
Hi, Was your question answered?
even i have similar question.
yes, the 2D points are a result of facial landmark detection.
The 3D points were simply approximated by me. If you have a 3D model of a human head, you can use the points from that model.
Sir can you tell me how you calculated the 3d coordinates.
Does it matter to normalize the template 3D points (defined above in step 2), and scale them to the size of the detected face?
It does not matter. The transformation you calculate has scale embedded inside.
I wondered about this too. I’m struggling to make sense of the Z position of the solvePnP detected translation, and how to use that. I get translations with a large depth value e.g. (-300, -200, -2056).
I’m working on iOS, using SceneKit. If I have orthographic projection enabled in my own 3D scene, this Z depth (either applied to the scene’s camera, or a particular 3D object with the pose transform applied to it) won’t affect the perceived size of an object.
Is it better to ignore this Z depth, and influence the scale of an object (such as a clown mask placed over the square frame of the detected face) based on initial distance mapped by face metadata or facial landmark points with respect to camera image size?
Thanks
Hi Satya, I’m using a checkerboard or circles to use solvePnP. In this case, how many pictures do I need to prepare? Only one picture is fine as you did if it includes several points?
Also, I noticed that the latest calibrateCamera in OpenCV3 accepts the object points in the object points’ coordinate frame (= checkerboard coordinate frame), and not necessarily be in the world frame. Is it the same for solvePnP?
I want to take the 3D coordinates of the landmark points,
can you please help me?
You cannot have true 3D coordinates because it is a single camera based system.
Hi Satya, The way you have presented this topic is so simple and awesome to understand.
My question is, I know the 2D Coordinates on the images(Image points) where feature is located. I know that i can estimate the 3D world Coordinate with Image points and camera parameters. Can i use the calculated 3D world Coordinate and Known Image Points to find the Pose?.
If yes, how accurate will this be?
Yes, you basically need the 3D points, cameraMatrix and the 2D points to find the pose.
Sir,can you tell me how you are extracting the 3d points from 2d.
Hi Satya, how to estimation gaze position based on the information which we get from face landmarks?
Sir, thanks for this awesome tutorial. But one question: how to do it for video captured live from a webcam using python?
Hi Satya,
I tried webcam_head_pose example in https://github.com/spmallick/dlib. Unfortunately, I only see the raw images from the webcam without any head pose and face landmarks. What would be the problem?
OpenCV3.3.0
CUDA 8.0
platform Jetson TX1
Hi Satya, Thank you very much for your tutorial.
I have a question: what if the images are captured by webcam in real time? How can you get the 2d image points and 3d model points in this case?
Hi Jing,
The 3D model points remain constant. The 2D points can be estimated using Dlib’s Facial Landmark Detector like we do in the tutorial.
Satya
Thanks Satya,
I noticed that in your post, the 3D model points are not specialized for a specific person. Could you please tell me what model you use to locate those landmarks? Many thanks.
-Jing
I cooked up those points by eyeballing what would be the approximate positions of the points in 3D
Hi Satya, in a typical front headshot with the subject basically facing the camera, I see how you can use this to estimate slight tliting sideways and turning of the head left/right. However, how well does this work for estimating forward and backward tilt when you’re using an uncalibrated camera and generic 3D model.
Working with these landmarks, it would seem to me that there’s too much variation between individuals in nose length, nose vs. mouth position, etc to make a determination. Of course a partial side view would solve this, but that’s not always possible.
Am I missing something? Are there better approaches than this? Or ss the ability to do this just one of those things that just make us humans special? 😉 Thanks!
Hello Satya, thak you for sharing your knowledge.
Wich tool did you use for your 2d image landmark custom annotation? I want to train my model with specif landmarks…
And another question: Are there any functions in opencv to train my custom model and making accurate landmarks predictions?
Thanks
Hi Jose,
We wrote a tool in MATLAB a while back for a client. Unfortunately, I cannot share it for that reason.
Thanks
Satya
Excellent explanation sir..!!
I Have a doubt sir,values given by rotation vector and translation vector,what they will signify?
As for a rotation of about 120′ in ‘yaw’ i m getting values in range of [-6,6]. It’s not in degree then what is it??
Thank you
Vishant,
There are two coordinate systems in 3D — the one attached to the camera with which the picture was taken and another attached to the 3D model. The rotation matrix and translation vector relate the two coordinate systems. In other words, you can apply the R and t to a 3D point in the model coordinates to find its coordinates in the camera coordinates.
Sir,
Thankyou for your crucial time.
Actually,what i m trying to achieve is based on some threshold value of rotational matrix i want to go for face recognition.What i mean is if the value is below or above some threshold then only i will go for recognition like if side pose is there then my face recognition algorithm does not able to extract features correctly and will give wrong result as well as waste my computational time.SO, do u think it is feasible??
Please share your thoughts on same.
Thankyou
Thanks for the tutorial, Satya. This seems to have become Google’s go-to article for face post estimation.
Looking at the code, I see you’re using a 3D model using the nose as the
origin, with +y going upward (Cartesian). Yet, the 2D data uses Open
CV/DLib’s +y *downward* convention, the vertically-mirrored image of the
3D model.
Could you please explain the reasoning behind the discrepancy between the coordinate systems?
Running your example gives me a rotation vector of roughly [0, 2, 0].
Inverting the sign of the y-coords in the 3D model gives me a rotation vector roughly [0, -1, 0].
Which is correct?
Hi John,
It does not matter how you define your coordinates. The R and t will adjust to whatever system you use. The only check you should do is to apply the R and t to the 3D points, and then project them onto the image ( face ). If the 3D points land near their 2D counterparts, your estimation is correct.
[ You can see how I am projecting the point in front of nose as an example ]
Satya
Hello.This is a great tutorial but can you explain what exactly we are getting in the rotation vector obtained?
Hello Satya, I am trying to run it with python. How can I find my Reprojection Error?
The output picture looks quite good but I am not sure how to interpret my euler angles. I am a little bit confused. I thought I start as the camera (X right, Y down and Z to the front). Then, following the euler convention, I start turning about x, then about y’, then about z”. But the new coordinate system is never how I expected. Even though the blue line always points in the right direction.
Thank you
Hi Satya, Thank you so much for this tutorial. I learn a lot of things from your blog. Can you help me with head pose estimation? I’m integrating head pose estimation in iOS. It’s work fine, but the euler angles X value when my face is around 90 Degree. I wonder that maybe something wrong with camera matrix in iOS or the coordinates is not correct?
Hi Satya, Nice Presentation. Thanks
I have few questions,
1. Did you use 3D model as a reference for finding the third coordinate of 2D. Or Just assuming the third coordinate in your code.
2. If you using the 3D model as reference, then how do you find third coordinate of 2D.
3. Is there any possibilty to find the translation and rotation before obtaining the third coordinate of 2D.
4. Can you expain more detail about 2D to 3D which you have derived.
Thanks a ton in advance.
Hello Satya,
I tried to run your headPose.py program and I get the following error:
Traceback (most recent call last):
File “/home/pi/headPose.py”, line 45, in
(success, rotation_vector, translation_vector) = cv2.solvePnP(model_points, image_points, camera_matrix, dist_coeffs, flags=cv2.CV_ITERATIVE)
AttributeError: module ‘cv2’ has no attribute ‘CV_ITERATIVE’
What is the problem and how can I solve it?
They keep changing the names of these constants and breaking backward compatibility in the Python versions. Try cv2.SOLVEPNP_ITERATIVE and let me know if that works. I will update the post accordingly.
That was a very quick answer Thank you. Indeed it worked perfectly.
I also tried the c++ code but this produced a lot of errors. Can you help on this??
(Sorry for the long post, but didn’t know how to upload it)
/tmp/ccwiPEXZ.o: In function `cv::operator<<(std::ostream&, cv::Mat const&)':
headPose.cpp:(.text+0x128): undefined reference to `cv::Formatter::get(int)'
/tmp/ccwiPEXZ.o: In function `main':
headPose.cpp:(.text+0x1f0): undefined reference to `cv::imread(cv::String const&, int)'
headPose.cpp:(.text+0x5f4): undefined reference to `cv::Mat::zeros(int, int, int)'
headPose.cpp:(.text+0x824): undefined reference to `cv::solvePnP(cv::_InputArray const&, cv::_InputArray const&, cv::_InputArray const&, cv::_InputArray const&, cv::_OutputArray const&, cv::_OutputArray const&, bool, int)'
headPose.cpp:(.text+0x964): undefined reference to `cv::noArray()'
headPose.cpp:(.text+0x998): undefined reference to `cv::projectPoints(cv::_InputArray const&, cv::_InputArray const&, cv::_InputArray const&, cv::_InputArray const&, cv::_InputArray const&, cv::_OutputArray const&, cv::_OutputArray const&, double)'
headPose.cpp:(.text+0xab0): undefined reference to `cv::circle(cv::_InputOutputArray const&, cv::Point_, int, cv::Scalar_ const&, int, int, int)'
headPose.cpp:(.text+0xb9c): undefined reference to `cv::line(cv::_InputOutputArray const&, cv::Point_, cv::Point_, cv::Scalar_ const&, int, int, int)'
headPose.cpp:(.text+0xc94): undefined reference to `cv::imshow(cv::String const&, cv::_InputArray const&)'
headPose.cpp:(.text+0xcb4): undefined reference to `cv::waitKey(int)'
/tmp/ccwiPEXZ.o: In function `std::ostream& cv::operator<< (std::ostream&, std::vector<cv::Point_, std::allocator<cv::Point_ > > const&)':
headPose.cpp:(.text+0xfc8): undefined reference to `cv::Formatter::get(int)'
/tmp/ccwiPEXZ.o: In function `cv::String::String(char const*)':
headPose.cpp:(.text._ZN2cv6StringC2EPKc[_ZN2cv6StringC5EPKc]+0x58): undefined reference to `cv::String::allocate(unsigned int)'
/tmp/ccwiPEXZ.o: In function `cv::String::~String()':
headPose.cpp:(.text._ZN2cv6StringD2Ev[_ZN2cv6StringD5Ev]+0x14): undefined reference to `cv::String::deallocate()'
/tmp/ccwiPEXZ.o: In function `cv::String::operator=(cv::String const&)':
headPose.cpp:(.text._ZN2cv6StringaSERKS0_[_ZN2cv6StringaSERKS0_]+0x30): undefined reference to `cv::String::deallocate()'
/tmp/ccwiPEXZ.o: In function `cv::Mat::Mat(int, int, int, void*, unsigned int)':
headPose.cpp:(.text._ZN2cv3MatC2EiiiPvj[_ZN2cv3MatC5EiiiPvj]+0x134): undefined reference to `cv::error(int, cv::String const&, char const*, char const*, int)'
headPose.cpp:(.text._ZN2cv3MatC2EiiiPvj[_ZN2cv3MatC5EiiiPvj]+0x21c): undefined reference to `cv::error(int, cv::String const&, char const*, char const*, int)'
/tmp/ccwiPEXZ.o: In function `cv::Mat::~Mat()':
headPose.cpp:(.text._ZN2cv3MatD2Ev[_ZN2cv3MatD5Ev]+0x3c): undefined reference to `cv::fastFree(void*)'
/tmp/ccwiPEXZ.o: In function `cv::Mat::operator=(cv::Mat const&)':
headPose.cpp:(.text._ZN2cv3MataSERKS0_[_ZN2cv3MataSERKS0_]+0x140): undefined reference to `cv::Mat::copySize(cv::Mat const&)'
/tmp/ccwiPEXZ.o: In function `cv::Mat::create(int, int, int)':
headPose.cpp:(.text._ZN2cv3Mat6createEiii[_ZN2cv3Mat6createEiii]+0xc0): undefined reference to `cv::Mat::create(int, int const*, int)'
/tmp/ccwiPEXZ.o: In function `cv::Mat::release()':
headPose.cpp:(.text._ZN2cv3Mat7releaseEv[_ZN2cv3Mat7releaseEv]+0x68): undefined reference to `cv::Mat::deallocate()'
/tmp/ccwiPEXZ.o: In function `cv::Mat::operator=(cv::Mat&&)':
headPose.cpp:(.text._ZN2cv3MataSEOS0_[_ZN2cv3MataSEOS0_]+0xf8): undefined reference to `cv::fastFree(void*)'
/tmp/ccwiPEXZ.o: In function `cv::MatConstIterator::MatConstIterator(cv::Mat const*)':
headPose.cpp:(.text._ZN2cv16MatConstIteratorC2EPKNS_3MatE[_ZN2cv16MatConstIteratorC5EPKNS_3MatE]+0xf8): undefined reference to `cv::MatConstIterator::seek(int const*, bool)'
/tmp/ccwiPEXZ.o: In function `cv::MatConstIterator::operator++()':
headPose.cpp:(.text._ZN2cv16MatConstIteratorppEv[_ZN2cv16MatConstIteratorppEv]+0x94): undefined reference to `cv::MatConstIterator::seek(int, bool)'
/tmp/ccwiPEXZ.o: In function `cv::Mat::Mat<cv::Point_ >(std::vector<cv::Point_, std::allocator<cv::Point_ > > const&, bool)':
headPose.cpp:(.text._ZN2cv3MatC2INS_6Point_IdEEEERKSt6vectorIT_SaIS5_EEb[_ZN2cv3MatC5INS_6Point_IdEEEERKSt6vectorIT_SaIS5_EEb]+0x214): undefined reference to `cv::Mat::copyTo(cv::_OutputArray const&) const'
/tmp/ccwiPEXZ.o: In function `cv::Mat_::operator=(cv::Mat const&)':
headPose.cpp:(.text._ZN2cv4Mat_IdEaSERKNS_3MatE[_ZN2cv4Mat_IdEaSERKNS_3MatE]+0x94): undefined reference to `cv::Mat::reshape(int, int, int const*) const'
headPose.cpp:(.text._ZN2cv4Mat_IdEaSERKNS_3MatE[_ZN2cv4Mat_IdEaSERKNS_3MatE]+0xec): undefined reference to `cv::Mat::convertTo(cv::_OutputArray const&, int, double, double) const'
/tmp/ccwiPEXZ.o: In function `cv::Mat_::operator=(cv::Mat&&)':
headPose.cpp:(.text._ZN2cv4Mat_IdEaSEONS_3MatE[_ZN2cv4Mat_IdEaSEONS_3MatE]+0x98): undefined reference to `cv::Mat::reshape(int, int, int const*) const'
headPose.cpp:(.text._ZN2cv4Mat_IdEaSEONS_3MatE[_ZN2cv4Mat_IdEaSEONS_3MatE]+0xf0): undefined reference to `cv::Mat::convertTo(cv::_OutputArray const&, int, double, double) const'
collect2: error: ld returned 1 exit status
It looks like you are not linking to the OpenCV library correctly. You can try these instructions
https://learnopencv.com/how-to-compile-opencv-sample-code/
Hello Satya,
I have already done this, but the same problems appear again.
Hi Satya, I'm new to programming and also to computer vision. Currently I'm doing head pose estimation in C#. Is it possible for me to use solvePnP in C#? Also, the camera I'm using is the Intel RealSense D435 RGB-D camera. Thank you
I am trying to use your code to estimate the camera position/angle on a soccer field.
Here is my calibration frame with four points.
https://ibb.co/hSSt1x
World coordinates are in meters
After I run this:
pts3d = np.array([[0., 0, 11], [-5.5, 0, 11], [0., 0, 0], [-16.5, 0, 0]], dtype=np.float64)
pts2d = np.array([[189, 207], [65, 244], [564, 242], [191, 402]], dtype=np.float64)  # solvePnP expects floating-point points
(success, rotation_vector, translation_vector) = cv2.solvePnP(pts3d, pts2d, camera_matrix, dist_coeffs, flags=cv2.SOLVEPNP_ITERATIVE)
I use the rotation vector to extract the camera angles:
rotation_matrix = cv2.Rodrigues(rotation_vector)[0]
angles = rotationMatrixToEulerAngles(rotation_matrix)
where rotationMatrixToEulerAngles is defined as:
def rotationMatrixToEulerAngles(R):
    sy = math.sqrt(R[0,0] * R[0,0] + R[1,0] * R[1,0])
    singular = sy < 1e-6
    if not singular:
        x = math.atan2(R[2,1], R[2,2])
        y = math.atan2(-R[2,0], sy)
        z = math.atan2(R[1,0], R[0,0])
    else:
        x = math.atan2(-R[1,2], R[1,1])
        y = math.atan2(-R[2,0], sy)
        z = 0
    return np.array([x, y, z])
No matter what focal length I set, the third angle (about the Z axis) comes out around 40 degrees, which does not make sense, because the actual camera can only rotate about the X and Y axes.
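One detail worth checking, sketched below: the rotation returned by solvePnP maps world (field) coordinates into camera coordinates, so to describe the orientation of the camera itself in the world frame you generally want the transpose of that matrix before converting to Euler angles. This is only a sketch; it assumes rotation_vector and translation_vector come from the solvePnP call above and reuses the rotationMatrixToEulerAngles helper.

import cv2
import numpy as np

# R maps points from the world (field) frame into the camera frame.
R = cv2.Rodrigues(rotation_vector)[0]

# The camera's orientation in the world frame is the inverse (transpose) of R,
# and its position in the world frame is -R^T * t.
R_camera_in_world = R.T
camera_position = -R.T @ translation_vector

camera_angles = rotationMatrixToEulerAngles(R_camera_in_world)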
Hi Satya! This site is great and very useful for OpenCV beginners like me. I saw your webcam_head_pose.cpp code and I was wondering which OpenCV and dlib versions you used? Thank you.
I can't remember the exact versions of OpenCV and Dlib, but I think it should work with the latest versions of both (i.e. OpenCV 3.4 + Dlib 19.10).
If you are using OpenCV 3.4, you may also want to try out the native landmark detector:
https://learnopencv.com/facemark-facial-landmark-detection-using-opencv
Hi Satya,
I’m having trouble working out how to convert the output from solvePnP (either a matrix, or a set of two vectors, translation and rotation) to another 3D coordinate system or projection matrix.
In other words: I’m using iOS SceneKit and I want to place a cone wherever the nose is, and rotate it, based on the solvePnP values. I know that’s quite a basic concept, but I’m obviously missing something – either values that I need to configure as dlib does its 3D calculations, or a way to convert its output to make sense to my own scene’s configuration. I’ve done this kind of thing in projects long ago, but I’m struggling.
(A dlib or OpenCV-based simple line rendering – both the line-from-the-nose that you demonstrate above, and also a simple cube render that I’ve taken from other dlib examples, both render nearly perfectly, so I believe the landmark and pose estimation coordinates are correct.)
I thought that I should perhaps be modifying camera_matrix or dist_coeffs to change the output of the dlib pose estimation, but for one, the 4×4 projection matrix doesn’t obviously fit in the 3×3 camera_matrix.
Do you know what process I should follow once I have solvePnP’s pose rotation and translation, to convert these to another scene, so that they display on screen in the same place as they do in the dlib-based render (i.e. your single line drawn from the nose)? I can imagine I’ll need things like the field of view of the SceneKit camera, and to ensure that the focal length is the same value as what goes into camera_matrix – but I can’t think what the calculation is.
Thanks
I’ve got a bit further by using ‘projectPoint’ and ‘unprojectPoint’ methods in SceneKit, but there’s still a missing link:
I ‘projectPoint’ with origin of the 3d space (SCNVector3Zero), which yields a vector that is the XY center of the view (333.5, 187.5), but the Z depth is given as 0.94, which I think will be determined by the perspective correction set in the scene’s camera matrix, but I’m not sure.
The Z value of the translation vector coming from the dlib results is much larger – it’s 1000 to 2000 or so, and this, as I expected, changes as I move a detected face closer to/farther from the camera.
So now, I’m just struggling to match these two up. My 3D object in my custom scene moves around much more correctly, but the Z depth is clearly off.
The Z value of the translation yielded by solvePnP is in the thousands, and that’s the value that is so different to the kind of depths I’m used to in a 3D scene, and that’s confusing me a little. I’ve changed my scene’s camera from perspective to orthographic, and I set the orthographic height to the height of my view. I understand how the depth is obtained using the iterative method checking for error (since we don’t know the face’s true depth from a flat image), but it’s really just that I can’t visualise the output of solvePnP with respect to my own scene.
Hi Satya,
I’m having trouble making sense of how to interpret the depth/Z position of the solvePnP translation.
I have my own 3D scene using iOS’s SceneKit, and I’ve tried to configure that to remove any additional error, e.g. enabling an orthographic projection with a size identical to the image size (or some fraction of this image size, and then I multiply by that fraction).
I understand that the solvePnP function yields the position of the camera with respect to an object’s origin, but I want to detect multiple faces and put objects at the faces’ positions, so I’ll be reversing this process if I can.
However, even before I do that reversal, I'm having trouble lining up a single object (representing a face) and a camera in the SceneKit scene. Even if the XY translation appears to make sense (when I move the face back and forth in the device camera's viewfinder, the coordinates go sensibly from edge to edge), the Z depth doesn't mean much to me. I wondered if it was so large because the camera_matrix contains a focal length; is this the case? I tried reducing the focal length, and this made the values increase, and I don't imagine increasing values in the camera_matrix arbitrarily is going to be the correct approach.
Finally, I wondered: is the Z depth of solvePnP's output influenced by the scale of the 3D points used in the reference model? If I use points with the tip of the nose at (0,0,0), eyes at z=-135, mouth at z=-125 and so on, will the depth I get from solvePnP be proportionally large?
Thanks!
Hi Satya. Thank you for the tutorial. I would like to ask how I can find the camera position using [R|t]. Actually, I want to measure the distance between the object and the camera.
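A minimal sketch of one common way to do this, assuming rotation_vector and translation_vector are the outputs of solvePnP; note that the distances come out in the same units as the 3D model points you fed in.

import cv2
import numpy as np

R = cv2.Rodrigues(rotation_vector)[0]

# translation_vector is the position of the model origin (the nose tip in this post)
# expressed in camera coordinates, so its norm is the camera-to-object distance.
distance = np.linalg.norm(translation_vector)

# The camera centre expressed in the model/world coordinate system.
camera_position = -R.T @ translation_vector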
Hi Satya,
Whenever I switch from solvePnP to solvePnPRansac, my results become much worse.
I also created sliders on screen to modify iterations, min-inliers, and reprojection-error, to see if I could improve from the visual feedback, but had no luck.
Do you recommend using the default params for the above style of face tracking? Or would you be customising them to suit the scale of the 3D reference points (i.e. a reprojection error more in terms of 100-200 units rather than the default 8.0)?
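For what it's worth, the reprojectionError threshold of solvePnPRansac is measured in image pixels, not in the units of the 3D reference points, and with only six landmark correspondences RANSAC has very little to reject, which may be why plain solvePnP behaves better here. A minimal sketch of passing the parameters explicitly, reusing the inputs from the post:

import cv2

# reprojectionError is an image-space (pixel) threshold,
# independent of the scale of the 3D model points.
success, rotation_vector, translation_vector, inliers = cv2.solvePnPRansac(
    model_points, image_points, camera_matrix, dist_coeffs,
    iterationsCount=100,
    reprojectionError=8.0,
    flags=cv2.SOLVEPNP_ITERATIVE)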
Excellent tutorial, thank you. I'm having trouble finding the world coordinates of the arbitrary reference frame for the facial landmarks. Can you point me towards a good resource?
Hello,
I compiled your code without any errors, but when the program launches, the camera window pops up and just freezes; it loads indefinitely. Any idea where this might come from?
Thanks a lot
So I have a simple question. How can I extract whether the person is looking left, right, or straight from the rotation and translation?
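One simple approach, sketched below, is to convert the rotation into Euler angles and threshold the yaw. It assumes rotation_vector and translation_vector are the solvePnP outputs from the post; the sign conventions and thresholds depend on your coordinate system, so treat the numbers as placeholders.

import cv2
import numpy as np

R = cv2.Rodrigues(rotation_vector)[0]

# decomposeProjectionMatrix returns the Euler angles (in degrees) as its last output.
projection = np.hstack((R, translation_vector.reshape(3, 1)))
euler_angles = cv2.decomposeProjectionMatrix(projection)[-1]
pitch, yaw, roll = euler_angles.flatten()

if yaw < -15:
    print("looking left")
elif yaw > 15:
    print("looking right")
else:
    print("looking straight")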
Hi, Satya. I am very puzzled. How did you get these 3D points, such as the tip of the nose: (0.0, 0.0, 0.0), the chin: (0.0, -330.0, -65.0), the left corner of the left eye: (-225.0, 170.0, -135.0)?
Hi, was your question answered?
I have a similar question too.
Sorry, I just saw your comment.
These 3D points are coordinates in an arbitrary world coordinate system.
I applied for the subscription many times, but I didn't receive the confirmation mail.
Hi Satya, I want to measure the actual size of the mouth and eyes, and the distance from the mouth to the eyes. How can I do that?
Hi, thank you for the very well-explained tutorial. I have one question. Let's say that I want to find the 3D points from a given 2D image. I was thinking of going through the steps, defining a mapping between the 2D and 3D points, and then using the transformation matrix to reverse the process. Am I right?
Hello Satya, thank you for sharing your knowledge.
Is it possible to run this application on the GPU? We are using a Jetson TX2. Does the code support CUDA?
Hi Satya, thank you for your tutorials. I rewrote webcam_head_pose.cpp in Python, and it works well. I'm curious: given detected skeleton keypoints (shoulders, hips, nose), is it possible to estimate body orientation? In my opinion, the key is to get a 3D model of the human body, but I can't find one. Thank you.