Can we distinguish one person from another by looking at the face? We can probably list several features such as eye colour, hairstyle, skin tone, the shape of nose and eyebrows, etc. Some combination of such attributes will be unique to a particular person. Sometimes, people have visual similarities, and in rare cases, it becomes difficult to distinguish one from another.

Can we program computers to do this task automatically? If you enumerate all these features which describe personality and try to build a decision tree in an if-else manner, pretty quickly, there will be a combinatorial explosion of a variety of choices. So, we need more general representation instead of a list of specific features.

The blog post is divided into the following sub-topics:

- Feature Extraction with Deep Neural Network
- Background of Face Recognition
- Introduction to ArcFace and Comparison with Softmax Loss
- Metric for Computing Embedding Pair Score
- Performing Inference with ArcFace Model
- Visualizing embeddings with tSNE

During the classical machine learning era, researchers in the computer vision field came up with various ideas on extracting representation from images. For example, using some combinations of traditional feature extractors and descriptors. Fortunately, deep neural networks have the ability to learn such representation from data automatically.

Many practical tasks can be solved by learning good representation. If you look at other machine learning areas such as Natural Language Processing and Recommendation Systems, you may notice another name for this – embeddings.

**Computer Vision**,

**Machine Learning**, and

**AI**! Sign up now and take your skills to the next level!

Embedding is a vector of real numbers which contains all extracted information. It is assumed, that embeddings for semantically similar objects are close in terms of some metric. So, if you have good embeddings, you can compare them with each other or search for the closest. For example, you can learn embeddings for movies and recommend top-5 the most similar ones the user based on the liking. Also, they have some interesting properties. As noted by the authors of the famous word2vec family of models for word embeddings, arithmetic operations can be applied to them.

Nowadays, embeddings have become an integral part of solutions to many problems. Facial recognition is using the same approach. Usually supposed, the similarity of a pair of faces can be directly calculated by computing their embeddings’ similarity. In this case, the face recognition task is trivial: we only need to check if the distance between the two vectors exceeds a predefined threshold. So, our main goal is to understand how to train a model capable of extracting appropriate embeddings from faces.

## Feature Extraction

Deep Neural Networks have widespread use in computer vision as feature extractors. Certain ideas and mechanisms like stacking layers, skip-connections, SE-blocks, etc., subsequently became an essential part of any contemporary deep learning architecture, but the main principle is the same. There is a backbone with convolution layers and some head which uses extracted information to solve the specific task (usually head is fully connected layers like in simple classification case). So, the forward pass should sound familiar.

To train neural networks, we need a loss function for which gradients are calculated during the backward pass, and then model weights are updated with some optimizer. For image classification task usually cross-entropy loss (some people call it softmax loss) is used. And now the tricky part begins. How suitable is such a loss for embeddings training? Can we think of something better? Or probably we need to drastically change the training technique for our face recognition model? To answer these questions, let’s review what machine learning practitioners have come up within the last few years to solve this problem.

## Background of Face Recognition

In our case, Neural Network is a complex function for mapping from the image space to the vector space. The goal is to distinguish in this vector space identity of one person from another. The *intra-class compactness* and *inter-class separability* are two important factors to feature discrimination ability. Embeddings belonging to the same identity are expected to be closer in the representation space; while embeddings of different identities are expected to be scattered. Several loss functions have been designed to get this property from embeddings during the training procedure.

In previous years, the similarity learning approach used to be quite popular. The first example of this type is the Siamese Network with contrastive loss. This paper was published in 2005 under the supervision of Yann LeCun, one of the most influential researchers in the deep learning field. Another example is FaceNet with triplet loss. Both the contrastive loss and triplet losses penalize the distance between two embeddings, such that, the similarity metric will be small for pairs of faces from the same person, and large for pairs from different people. It is worth noting that both losses require a carefully designed pair selection, which is the obvious disadvantage. Thereby, more novel approaches have been proposed in the last couple of years.

Another type of loss uses the angular margin penalty to enforce intra-class compactness and inter-class difference of embeddings on the hypersphere surface. SphereFace introduced the important idea of angular margin in 2017. A year later, an improvement was suggested in CosFace with a cosine margin penalty. At the same time, ArcFace was developed on similar principles. Let’s take a closer look and understand this state-of-the-art method which has received fairly widespread adoption.

## ArcFace Loss

Besides the backbone that extracts features, there is the head for classification with a fully connected layer with trainable weights. The product of normalized weights and normalized features lie in the interval from -1 to 1. We can represent a logit for each class as the cosine of an angle between the feature and the ground truth weight inside a hypersphere with a unit radius. Also, there are two hyperparameters m (the additional angular margin) and s (scaling ratio from a small number to a final logit) that help adjust the distance between classes.

To illustrate geometric intuition behind those formulas, the authors trained their model to discriminate between 8 identities for 2D embeddings both with softmax and ArcFace loss. The learned embeddings are distributed on a hypersphere (in 2D it is just a circle) with a radius of **s**. You can notice that each identity has a separate colour and a centre around which the points are distributed.

With softmax loss embeddings can be roughly separated, there is uncertainty where decision boundaries can be placed. In practice, this means that faces that look similar will be tough to distinguish.

Another issue with using pure softmax loss is, that number of weights in the last fully connected layer increases linearly with the number of classes. So, it is quite problematic to train a neural network capable of distinguishing between millions of personalities.

ArcFace loss does not have this shortage, and the result seems much better. All points are closer to the centre, and there is an evident gap between identities. Consequently, previously mentioned requirements for intra-class compactness and inter-class separability are met.

## Cosine Distance

To measure the similarity between two embeddings extracted from images of the faces, we need some metric. Cosine distance is a way to measure the similarity between two vectors, and it takes a value from 0 to 1. Actually, this metric reflects the orientation of vectors indifferently to their magnitude. If cosine distance is near 0, then vectors have similar orientation and close to each other. If it is almost 1, then vectors differ (in other words, they are orthogonal).

Here is an example how we can calculate such metric in **numpy**.

```
from numpy import dot, sqrt
def cosine_similarity(x, y):
return dot(x, y) / (sqrt(dot(x, x)) * sqrt(dot(y, y)))
```

If you are interested in a more rigorous mathematical definition or want to refresh your linear algebra knowledge, you can read a wiki article about it.

**Download Code**To easily follow along this tutorial, please download code by clicking on the button below. It's FREE!

## Inferencing with ArcFace Model

Besides the identification model itself, face recognition systems usually have other preprocessing steps in a pipeline. Let’s briefly describe them.

First, a face detector needs to be used to detect a face on an image. After that, we can use face alignment for cases that do not satisfy our model’s expected input. Identification is considered a rather challenging problem already, so face alignment is utilized to make the model’s life a bit easier. If a face is transformed into a canonical pose (like the tip of the nose in the centre of the image, etc.), the model can focus on getting important information straight away. Finally, we need to crop and resize the face to a specific size (in our examples, it will be 112×112). In real-world scenarios, an additional anti-spoofing model can be used to avoid deception at the identification step, but we’ll leave it behind the scenes.

Now it is time to take the pre-trained face identification model and check how it works on various examples.

*Weights for an ArcFace model and some part of the code have been taken from https://github.com/ZhaoJ9014/face.evoLVe.PyTorch. Thanks to them for sharing!*

### Different people

Okay, let’s start with two famous actors: Jeff Bridges and Kurt Russell. These guys have some similarities, don’t they?

To evaluate our model, we need to extract embeddings via the backbone from preprocessed images, normalize them, and calculate cosine distance for all pairs.

```
def visualize_similarity(tag, input_size=[112, 112]):
images, embeddings = get_embeddings(
data_root=f"data/{tag}_aligned",
model_root="checkpoint/backbone_ir50_ms1m_epoch120.pth",
input_size=input_size,
)
# calculate cosine similarity matrix
cos_similarity = np.dot(embeddings, embeddings.T)
cos_similarity = cos_similarity.clip(min=0, max=1)
# plot grid from pair distance values in similarity matrix
similarity_grid = plot_similarity_grid(cos_similarity, input_size)
# pad similarity grid with images of faces
horizontal_grid = np.hstack(images)
vertical_grid = np.vstack(images)
zeros = np.zeros((*input_size, 3))
vertical_grid = np.vstack((zeros, vertical_grid))
result = np.vstack((horizontal_grid, similarity_grid))
result = np.hstack((vertical_grid, result))
if not os.path.isdir("images"):
os.mkdir("images")
cv2.imwrite(f"images/{tag}.jpg", result)
```

Through using this helper function, we will receive a final result in one table.

Hmm, not bad! It gives us insight that the predefined threshold can be adjusted to be something like 0.65 or 0.70. In a real use case, such a threshold depends on our risk acceptance and what type of errors we want to minimize, several false positives or false negatives.

Now, let’s try to find similarities between different actresses.

If you look closely, there are patterns in our table. Margot Robbie and Jaime Pressly have a high value of cosine distance, and they really have a similar appearance! Zooey Deschanel and Katy Perry are more similar to each other than to other actresses in this set. So, cosine distance between embeddings indeed reflects our concept of the similarity between people in real life.

### The same person in various poses

Now, let’s see how it works with three different head poses.

The result of the profile image is rather bad. But with the slightly turned face, it seems to work fine. It can be concluded that our model is susceptible to head rotation.

### Corner cases

In order not to fall into the illusion that our model is working in all possible circumstances, let’s try images with a variety of accessories, occlusions, and emotional expressions.

As expected, such corner cases can be too hard for our model. This indicates that models for real-world production face recognition systems must be trained on more diverse data to circumvent such limitations.

## t-SNE

To better understand how embeddings are distributed in space, we can use dimensionality reduction algorithm such as t-SNE. If you aren’t familiar with this method, please, read our previous article about it.

After applying t-SNE, we will receive a vector with two components instead of 512 and we can easily plot our results on a 2D plane. We will take each coordinate as a centre and add an appropriate original face image over it.

As a result, clusters not only have intra-class compactness and inter-class separability as expected from ArcFace embeddings but also match our intuition about the similarity between these people.

## Summary

In this blog post, we looked at the current state of the face recognition task. We learned that good representation in the form of embeddings is key to solving this problem. Several approaches for training such embeddings have been mentioned, including ArcFace, one of the most important at the moment. Also, we looked at several examples to see how it works (what even more important, when it does not work) in practice. Now you can use this knowledge for your own applications!

## Subscribe & Download Code

If you liked this article and would like to download code (C++ and Python) and example images used in this post, please click here. Alternately, sign up to receive a free Computer Vision Resource Guide. In our newsletter, we share OpenCV tutorials and examples written in C++/Python, and Computer Vision and Machine Learning algorithms and news.