YOLOv7 is the new state-of-the-art object detector in the YOLO family. According to the YOLOv7 paper, it is the fastest and most accurate real-time object detector to date, with the best model scoring 56.8% Average Precision (AP), the highest among all known object detectors. The speed ranges from 5 to 160 FPS across the various models (all available in the YOLOv7 GitHub repository). Compared to the baseline models, YOLOv7 reduces the number of parameters by about 40% and computation by about 50%.
This blog post contains a simplified explanation of the YOLOv7 paper along with inference tests. We will go through the YOLOv7 GitHub repository and test inference. We will also see how YOLOv7 compares with other object detectors of the YOLO family.
YOLOv7 established a significant benchmark by taking the performance up a notch. Starting from YOLOv4, new entries in the YOLO family have been appearing one after another in a very short period of time, with each version introducing something new to improve performance. We have already discussed the previous YOLO versions at length. Check out the YOLO series here.
Learning Objectives: ✅ YOLOv7 architecture, What’s new? ✅ Object Detection using YOLOv7 ✅ YOLOv7 models and comparative analysis ✅ YOLOv7 Pose: Human Pose Estimation
- YOLO Architecture in General
- What’s new in YOLOv7?
- YOLOv7 Architecture
- Trainable Bag of Freebies in YOLOv7
- YOLOv7 Experiments and Results
- YOLOv7 Object Detection Inference
- Comparison between YOLOv4, YOLOv5-Large and YOLOv7
- YOLOv7 Pose Estimation
- Summary of YOLOv7
1. YOLO Architecture in General
YOLO architecture is FCNN (Fully Convolutional Neural Network) based. However, Transformer-based versions have recently been added to the YOLO family as well. We will discuss Transformer-based detectors in a separate post. For now, let’s focus on FCNN-based YOLO object detectors.
The YOLO framework has three main components.
- Backbone
- Neck
- Head
The Backbone mainly extracts essential features of an image and feeds them to the Head through the Neck. The Neck collects the feature maps extracted by the Backbone and creates feature pyramids. Finally, the Head consists of output layers that produce the final detections. The following table shows the architectures of YOLOv3, YOLOv4, and YOLOv5.
Table1: Model architecture summary of YOLOv3, YOLOv4, and YOLOv5
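To make the Backbone → Neck → Head flow concrete, here is a deliberately minimal PyTorch sketch of a detector with the three components. It only illustrates the data flow and is not the actual YOLO implementation; the layer sizes, the single-branch neck, and the 85-channel output (4 box values + 1 objectness score + 80 COCO classes) are our own simplifying assumptions.

import torch
import torch.nn as nn

# Minimal, hypothetical sketch of the Backbone -> Neck -> Head data flow.
class TinyDetector(nn.Module):
    def __init__(self, num_outputs=85):  # 4 box + 1 objectness + 80 classes (assumed)
        super().__init__()
        # Backbone: extracts feature maps from the input image.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),
        )
        # Neck: aggregates and refines backbone features (a feature pyramid in real models).
        self.neck = nn.Sequential(nn.Conv2d(64, 64, 1), nn.SiLU())
        # Head: predicts boxes, objectness, and class scores for every grid cell.
        self.head = nn.Conv2d(64, num_outputs, 1)

    def forward(self, x):
        return self.head(self.neck(self.backbone(x)))

preds = TinyDetector()(torch.randn(1, 3, 640, 640))  # shape: (1, 85, 160, 160)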
Apart from architectural modifications, there are several other improvements. Go through the YOLO series for detailed information.
2. What’s New in YOLOv7?
YOLOv7 improves speed and accuracy by introducing several architectural reforms. Like Scaled YOLOv4, the YOLOv7 backbones do not use ImageNet pre-trained weights. Rather, the models are trained entirely on the COCO dataset. The similarity is expected because YOLOv7 was written by the same authors as Scaled YOLOv4. The following major changes have been introduced in the YOLOv7 paper. We will go through them one by one.
- Architectural Reforms
- E-ELAN (Extended Efficient Layer Aggregation Network)
- Model Scaling for Concatenation based Models
- Trainable BoF (Bag of Freebies)
- Planned re-parameterized convolution
- Coarse for auxiliary and Fine for lead loss
3. YOLOv7 Architecture
The architecture is derived from YOLOv4, Scaled YOLOv4, and YOLO-R. Using these models as a base, further experiments were carried out to develop the new and improved YOLOv7.
3.1 E-ELAN (Extended Efficient Layer Aggregation Network) in YOLOv7 Paper
E-ELAN is the computational block in the YOLOv7 backbone. It takes inspiration from previous research on network efficiency. It has been designed by analyzing the following factors that impact speed and accuracy.
- Memory access cost
- I/O channel ratio
- Element wise operation
- Activations
- Gradient path
The proposed E-ELAN uses expand, shuffle, and merge cardinality to continuously enhance the learning ability of the network without destroying the original gradient path.
In simple terms, E-ELAN architecture enables the framework to learn better. It is based on the ELAN computational block. At the time of writing this post, the ELAN paper has not been published. We will update the post by adding details of ELAN when it is available.
Fig: E-ELAN and previous work on maximal layer efficiency
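To give a rough feel for what “expand, shuffle, merge cardinality” can look like in code, here is a hedged PyTorch sketch. It is not the official E-ELAN block; the grouped 3×3 expansion, the channel shuffle, and the 1×1 merge below are our own illustrative choices.

import torch
import torch.nn as nn

# Hypothetical sketch of expand -> shuffle -> merge cardinality (not the official E-ELAN code).
class ExpandShuffleMerge(nn.Module):
    def __init__(self, channels, groups=2, expand_ratio=2):
        super().__init__()
        self.groups = groups
        expanded = channels * expand_ratio
        # Expand cardinality: a grouped 3x3 convolution increases the channel count.
        self.expand = nn.Conv2d(channels, expanded, 3, padding=1, groups=groups)
        # Merge: a 1x1 convolution fuses the concatenated features back to `channels`.
        self.merge = nn.Conv2d(channels + expanded, channels, 1)

    def channel_shuffle(self, x):
        b, c, h, w = x.shape
        # Interleave channels across groups so information mixes between them.
        x = x.view(b, self.groups, c // self.groups, h, w)
        return x.transpose(1, 2).reshape(b, c, h, w)

    def forward(self, x):
        y = self.channel_shuffle(self.expand(x))
        # Concatenate with the untouched input so the original gradient path is preserved.
        return self.merge(torch.cat([x, y], dim=1))

out = ExpandShuffleMerge(64)(torch.randn(1, 64, 32, 32))  # shape: (1, 64, 32, 32)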
3.2 Compound Model Scaling in YOLOv7
Different applications require different models. While some need highly accurate models, some prioritize speed. Model scaling is performed to suit these requirements and make the model fit various computing devices.
While scaling a model size, the following parameters are considered.
- Resolution (size of the input image)
- Width (number of channels)
- Depth (number of layers)
- Stage (number of feature pyramids)
NAS (Network Architecture Search) is a commonly used model scaling method. Researchers use it to iterate through the parameters to find the best scaling factors. However, methods like NAS do parameter-specific scaling, and the scaling factors are independent in this case.
The authors of the YOLOv7 paper show that it can be further optimized with a compound model scaling approach. Here, width and depth are scaled in coherence for concatenation-based models.
Fig: YOLOv7 compound scaling
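The idea can be summarized in a few lines of code. The sketch below is only a hedged illustration of scaling depth and width “in coherence” for a concatenation-based block; the function name and the scaling factors are our own assumptions, not values from the paper.

# Hedged sketch of compound scaling for a concatenation-based block (illustrative only).
def compound_scale(base_depth, base_width, depth_factor, width_factor):
    # Scaling depth changes how many branches get concatenated...
    depth = max(1, round(base_depth * depth_factor))
    # ...so the width of the layers after the concatenation is scaled in coherence.
    width = int(base_width * width_factor)
    concat_out_channels = width * depth  # channels after concatenating `depth` branches
    return depth, width, concat_out_channels

print(compound_scale(4, 64, 1.5, 1.25))  # (6, 80, 480) for a base of 4 layers / 64 channels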
4. Trainable Bag of Freebies in YOLOv7
BoF or Bag of Freebies are methods that increase the performance of a model without increasing training cost. YOLOv7 has introduced the following BoF methods.
4.1 Planned re-parameterized convolution
Re-parameterization is a technique used after training to improve the model. It increases the training time but improves the inference results. There are two types of re-parameterization used to finalize models: model-level and module-level ensembling.
Model-level re-parameterization can be done in the following two ways; a minimal weight-averaging sketch for the first approach is shown after the list.
- Using different training data but the same settings, train multiple models. Then average their weights to obtain the final model.
- Take the average of the weights of models at different epochs.
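The weight-averaging strategy is easy to sketch. The snippet below is a minimal illustration assuming each checkpoint is a plain state_dict of floating-point tensors saved from the same architecture; the file names are placeholders.

import torch

# Minimal sketch of model-level re-parameterization by averaging checkpoint weights.
def average_checkpoints(paths):
    state_dicts = [torch.load(p, map_location="cpu") for p in paths]
    averaged = {}
    for key in state_dicts[0]:
        # Stack the same parameter from every checkpoint and take the element-wise mean.
        averaged[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return averaged

# model.load_state_dict(average_checkpoints(['ckpt_a.pt', 'ckpt_b.pt', 'ckpt_c.pt']))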
Recently, module-level re-parameterization has gained a lot of traction in research. In this method, the model training process is split into multiple modules. The outputs are ensembled to obtain the final model. The authors of the YOLOv7 paper show the best possible ways to perform module-level ensembling (shown below).
Fig: Re-parameterization trials
In the diagram above, the 3×3 convolution layer of the E-ELAN computational block is replaced with the RepConv layer. Experiments were carried out by switching or replacing the positions of RepConv, 3×3 Conv, and the Identity connection. The residual bypass arrow shown above is an identity connection, which is nothing but a 1×1 convolutional layer. We can see the configurations that work and the ones that do not. Check out more about RepConv in the RepVGG paper.
In addition to RepConv, YOLOv7 also applies re-parameterization to Conv-BN (Convolution Batch Normalization), OREPA (Online Convolutional Re-parameterization), and YOLO-R to get the best results. We will discuss the implementation in a separate post.
4.2 Coarse for Auxiliary and Fine for Lead loss
As you already know by now, the YOLO architecture comprises a backbone, a neck, and a head. The head contains the predicted outputs. YOLOv7 does not limit itself to a single head; it uses multiple heads. Interesting, isn’t it?
However, it’s not the first time a multi-headed framework has been introduced. Deep Supervision, a technique used by DL models, relies on multiple heads. In YOLOv7, the head responsible for the final output is called the Lead Head, and the head used to assist training in the middle layers is called the Auxiliary Head.
With the help of an assistant loss, the weights of the auxiliary heads are updated. This allows for Deep Supervision, and the model learns better. These concepts are closely coupled with the Lead Head and the Label Assigner.
Label Assigner is a mechanism that considers the network prediction results together with the ground truth and then assigns soft labels. It’s important to note that the label assigner generates soft and coarse labels instead of generating hard labels.
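As a rough illustration of the idea (our own simplification, not the exact YOLOv7 assigner), a soft objectness label can be derived from how well a prediction already overlaps the ground truth, for example by using the IoU between the predicted and ground-truth boxes instead of a hard 0/1 target.

# Hedged sketch: derive a soft objectness label from prediction/ground-truth overlap.
# Boxes are (x1, y1, x2, y2). Illustrative only, not the exact YOLOv7 label assigner.
def soft_objectness_label(pred_box, gt_box):
    ix1, iy1 = max(pred_box[0], gt_box[0]), max(pred_box[1], gt_box[1])
    ix2, iy2 = min(pred_box[2], gt_box[2]), min(pred_box[3], gt_box[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    area_g = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    iou = inter / (area_p + area_g - inter + 1e-9)
    return iou  # a value in [0, 1] used as the soft target instead of a hard 1

print(soft_objectness_label((10, 10, 50, 50), (12, 12, 48, 52)))  # ~0.82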
4.2.1 Lead Head Guided Label Assigner and Coarse-to-Fine Lead Head Guided Label Assigner
The Lead Head Guided Label Assigner encapsulates the following three concepts.
- Lead Head
- Auxiliary Head
- Soft Label Assigner
The Lead Head in the YOLOv7 network predicts the final results. Soft labels are generated based on these final results. The important part is that the loss is calculated for both the lead head and the auxiliary head based on the same generated soft labels. Ultimately, both heads are trained using the soft labels. This is shown in the left image in the above figure.
One may ask here, “why soft labels?”. The authors have put it quite well in the paper:
“The reason to do this is that the lead head has a relatively strong learning capability. So the soft label generated from it should be more representative of the distribution and correlation between the source data and the target. By letting the shallower auxiliary head directly learn the information that the lead head has learned, the lead head will be more able to focus on learning residual information that has not yet been learned.”
Now, coming to the coarse-to-fine labels shown in the right image of the previous figure: in the above process, two different sets of soft labels are generated.
- A fine label to train the lead head
- A set of coarse labels to train the auxiliary head.
The fine labels are the same as the directly generated soft labels. However, to generate the coarse labels, more grids are treated as positive targets. This is done by relaxing the constraints of the positive sample assignment process.
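To tie the pieces together, here is a hedged sketch of how a combined lead + auxiliary loss could be computed under deep supervision. The function names, the soft-target placeholders, and the 0.25 auxiliary weight are illustrative assumptions, not the exact YOLOv7 implementation.

# Hypothetical sketch of deep supervision with a lead head and an auxiliary head.
# `detection_loss`, `soft_targets_fine`, and `soft_targets_coarse` are placeholders.
def combined_loss(lead_preds, aux_preds, soft_targets_fine, soft_targets_coarse,
                  detection_loss, aux_weight=0.25):
    # The lead head is trained on the fine soft labels...
    lead_loss = detection_loss(lead_preds, soft_targets_fine)
    # ...while the auxiliary head is trained on the coarse soft labels,
    # which treat more grid cells as positive samples.
    aux_loss = detection_loss(aux_preds, soft_targets_coarse)
    # The auxiliary loss is down-weighted so the lead head dominates training.
    return lead_loss + aux_weight * aux_loss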
5. YOLOv7 Experiments and Results
All the YOLOv7 models surpass previous object detectors in speed and accuracy in the range of 5 FPS to 160 FPS. The following figure gives a pretty good idea of the Average Precision (AP) and speed of the YOLOv7 models compared to the others.
It is clear from the above figure that YOLOv7 has no competition in terms of speed and accuracy.
Note: The results discussed further are from the YOLOv7 paper, where all the inference experiments were done on a Tesla V100 GPU. All AP results were obtained on either the COCO validation or test set.
5.1 mAP Comparison: YOLOv7 vs Others
The following table shows the comparison of YOLOv7 models with other baseline object detectors.
Table: YOLOv7 vs Other Baseline Models (Source)
Results shown above are mostly grouped together according to a range of parameters for a particular set of models.
- Starting with the YOLOv7-Tiny model, the smallest in the family with just over 6 million parameters: with a validation AP of 35.2%, it beats the YOLOv4-Tiny models that have a similar number of parameters.
- The normal YOLOv7 model, with almost 37 million parameters, gives 51.2% AP. It easily beats variants of YOLOv4 and YOLOR that have more parameters.
- The larger models in the YOLOv7 family are YOLOv7-X, YOLOv7-E6, YOLOv7-D6, and YOLOv7-E6E. All of these beat the respective YOLOR models that have a similar or smaller number of parameters, giving APs of 52.9%, 55.9%, 56.3%, and 56.8% respectively.
It is not only the YOLOv4 and YOLOR models that YOLOv7 surpasses. Comparing the validation APs of YOLOv5 and YOLOv7 models that have parameters in the same range, it is pretty clear that YOLOv7 also beats all of the YOLOv5 models.
5.2 FPS Comparison: YOLOv7 vs Others
Table 2 in the YOLOv7 paper provides a comprehensive comparison of FPS of YOLOv7 with other models. It also includes the COCO mAP comparison.
It is already established that YOLOv7 has the highest FPS and mAP in the range of 5 FPS to 160 FPS. All the FPS comparisons were done on the Tesla V100 GPU.
Table: YOLOv7 models FPS comparison
The YOLOv7-E6 model (highlighted above) runs at 56 FPS on the V100 GPU with a test AP of 56%. This is both faster and more accurate than the transformer-based SWIN-L Cascade-Mask R-CNN (9.2 FPS with 53.9% AP) and the convolution-based ConvNeXt-XL Cascade-Mask R-CNN (8.6 FPS with 55.2% AP). This matters even more because those two models give lower FPS even on the A100 GPU, which is much more powerful than the V100 GPU.
It is worth noting that none of the YOLOv7 models are meant for mobile devices/mobile CPU (as mentioned in the paper by the authors).
YOLOv7-Tiny, YOLOv7, and YOLOv7-W6 are meant for edge GPU, normal (consumer) GPU, and cloud GPU respectively. The YOLOv7-E6, YOLOv7-D6, and YOLOv7-E6E models are also meant for high-end cloud GPUs only. Nonetheless, all of the YOLOv7 models run at more than 30 FPS on the Tesla V100 GPU, which is faster than real-time.
The experiments above show that the YOLOv7 models outperform the existing object detectors in both speed and accuracy.
6. YOLOv7 Object Detection Inference
Now, let’s get into the exciting part of the blog post, that is, running inference on videos using YOLOv7. We will run the inference using the YOLOv7 and YOLOv7-Tiny models. Along with that, we will also compare the results with that of the YOLOv5 and YOLOv4 models.
Note: All the inference results shown here were run on a machine with a 6 GB GTX 1060 (laptop) GPU, an 8th generation i7 CPU, and 16 GB of RAM.
If you intend to run object detection inference experiments on your own videos, you have to clone the YOLOv7 GitHub repository using the following command.
git clone https://github.com/WongKinYiu/yolov7.git
Then you can use the detect.py script to run inference on the videos of your choice. You will also need to download the yolov7-tiny.pt and yolov7.pt pre-trained models from here.
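For convenience, the weights can also be fetched with wget. The URLs below assume the v0.1 release of the official repository (the same release the pose model is downloaded from later in this post); please verify them on the releases page if they have moved.

wget https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7-tiny.pt
wget https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt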
Here, we will run inference on three videos depicting the following three different scenarios.
- The first video is to test how the YOLOv7 object detection models perform on small and far-off objects.
- The second video has a lot of people depicting a crowded scenario.
- And the third video is one where many YOLO models (v4, v5, and v7) make the same general mistakes while detecting objects.
The YOLOv7 results are shown together for the Tiny and Normal models for all three videos. This will help us compare the results for each video in an easy manner.
Let’s check out the detection inference results on the first video using the YOLOv7-Tiny (top) and YOLOv7 (bottom) models. The following commands were used to run the inference using the Tiny and Normal models respectively.
python detect.py --source ../inference_data/video_1.mp4 --weights yolov7-tiny.pt --name video_tiny_1 --view-img
python detect.py --source ../inference_data/video_1.mp4 --weights yolov7.pt --name video_1 --view-img
YOLOv7-tiny
YOLOv7-Normal
We can see the limitations of the YOLOv7-Tiny model right away. It is not able to detect cars, motorcycles, and persons that are very far off and small. The YOLOv7 model is able to detect these objects better. But that’s not the entire story. Although YOLOv7-Tiny is not performing that well, it is much faster than YOLOv7. While YOLOv7 gave around 19 FPS, YOLOv7-Tiny was able to run at around 42 FPS, which is well above real-time.
Now, let’s check out the results on the second video, which depicts a crowded scenario. We use the same command as above but change the values for the --source and --name flags depending on the video path and name.
YOLOv7 tiny
YOLOv7 Normal
The YOLOv7 model is able to detect persons with less fluctuation and more confidence compared to the YOLOv7-Tiny model. Not only that, YOLOv7-Tiny is also missing out on some of the traffic lights and people who are far away.
Let’s run the inference on a final video now which shows some general failure cases across all the YOLOv7 models.
YOLOv7 tiny
YOLOv7 Normal
We can see some of the general mistakes across both the models:
- Detecting other road signs as stop signs.
- Wrongly detecting the forbidden-road symbols as persons.
As we will see later, the above two mistakes are common across YOLOv4 and YOLOv5.
Even though YOLOv7-Tiny makes more mistakes than the YOLOv7 model, it is much faster. On average, YOLOv7-Tiny runs at more than 40 FPS while the YOLOv7 model runs at slightly above 20 FPS.
7. Comparison Between YOLOv4, YOLOv5-Large, and YOLOv7 Model
The following three videos show the comparison between YOLOv4, YOLOv5-Large, and the YOLOv7 model (top to bottom) on one of the videos. This will give us a proper qualitative idea of how each model performs across various scenarios.
YOLOv4
YOLOv5-Large
YOLOv7
And the following table shows the FPS and run times for different variants of the three models on the three videos.
Table: YOLOv7 vs Other detectors inference speed chart
8. YOLOv7 Pose Estimation
YOLOv7 is the first in the YOLO family to include a human pose estimation model. This is really interesting because there are very few such real-time models out there.
Recently, the official repository was also updated with a pre-trained pose estimation model. We will not cover the internals in this post; a dedicated article on the internals of the YOLOv7 Pose Estimation model will be published later. In this section, we will focus on the application part and observe how it works.
8.1 YOLOv7 Pose prerequisites
First, please ensure that you have cloned the YOLOv7 GitHub repository. Now, execute the following command within the cloned yolov7 directory to download the pre-trained pose estimation model.
wget https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7-w6-pose.pt
8.2 YOLOv7 Pose Code
We will need a custom script to run pose estimation inference using the pre-trained model. Let’s write the code in a new yolov7_keypoint.py script inside the yolov7 directory.
import torch
import cv2
import numpy as np
import time
from torchvision import transforms
from utils.datasets import letterbox
from utils.general import non_max_suppression_kpt
from utils.plots import output_to_keypoint, plot_skeleton_kpts
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Load the pre-trained pose estimation checkpoint onto the selected device.
weights = torch.load('yolov7-w6-pose.pt', map_location=device)
model = weights['model']
model = model.half().to(device)
_ = model.eval()
video_path = '../inference_data/video_4.mp4'
We import all the required modules, load the pre-trained yolov7-w6-pose.pt model, and initialize a video_path variable with the path to the source video. If you intend to run this inference on your own videos, please change the video_path accordingly.
Next, let’s read the video from disk and create the VideoWriter object to save the resulting video on disk.
cap = cv2.VideoCapture(video_path)
if not cap.isOpened():
    print('Error while trying to read video. Please check path again')
# Get the frame width and height.
frame_width = int(cap.get(3))
frame_height = int(cap.get(4))
# Pass the first frame through `letterbox` function to get the resized image,
# to be used for `VideoWriter` dimensions. Resize by larger side.
vid_write_image = letterbox(cap.read()[1], (frame_width), stride=64, auto=True)[0]
resize_height, resize_width = vid_write_image.shape[:2]
save_name = f"{video_path.split('/')[-1].split('.')[0]}"
# Define codec and create VideoWriter object.
out = cv2.VideoWriter(f"{save_name}_keypoint.mp4",
                      cv2.VideoWriter_fourcc(*'mp4v'), 30,
                      (resize_width, resize_height))
frame_count = 0  # To count total frames.
total_fps = 0  # To get the final frames per second.
Finally, we have a while loop running through each frame in the video.
while cap.isOpened():
    # Capture each frame of the video.
    ret, frame = cap.read()
    if ret:
        orig_image = frame
        # Convert the frame from BGR to RGB and resize it using `letterbox`.
        image = cv2.cvtColor(orig_image, cv2.COLOR_BGR2RGB)
        image = letterbox(image, (frame_width), stride=64, auto=True)[0]
        image_ = image.copy()
        # Convert to tensor, add a batch dimension, and move to the device in half precision.
        image = transforms.ToTensor()(image)
        image = torch.tensor(np.array([image.numpy()]))
        image = image.to(device)
        image = image.half()
        # Get the start time.
        start_time = time.time()
        with torch.no_grad():
            output, _ = model(image)
        # Get the end time.
        end_time = time.time()
        # Get the fps.
        fps = 1 / (end_time - start_time)
        # Add fps to total fps.
        total_fps += fps
        # Increment frame count.
        frame_count += 1
        # Apply non-max suppression and convert the predictions to keypoint format.
        output = non_max_suppression_kpt(output, 0.25, 0.65, nc=model.yaml['nc'], nkpt=model.yaml['nkpt'], kpt_label=True)
        output = output_to_keypoint(output)
        # Convert the tensor back to an OpenCV-compatible BGR image for annotation.
        nimg = image[0].permute(1, 2, 0) * 255
        nimg = nimg.cpu().numpy().astype(np.uint8)
        nimg = cv2.cvtColor(nimg, cv2.COLOR_RGB2BGR)
        # Plot the skeleton keypoints for each detected person.
        for idx in range(output.shape[0]):
            plot_skeleton_kpts(nimg, output[idx, 7:].T, 3)
            # Comment/Uncomment the following lines to show bounding boxes around persons.
            xmin, ymin = (output[idx, 2]-output[idx, 4]/2), (output[idx, 3]-output[idx, 5]/2)
            xmax, ymax = (output[idx, 2]+output[idx, 4]/2), (output[idx, 3]+output[idx, 5]/2)
            cv2.rectangle(
                nimg,
                (int(xmin), int(ymin)),
                (int(xmax), int(ymax)),
                color=(255, 0, 0),
                thickness=1,
                lineType=cv2.LINE_AA
            )
        # Write the FPS on the current frame.
        cv2.putText(nimg, f"{fps:.3f} FPS", (15, 30), cv2.FONT_HERSHEY_SIMPLEX,
                    1, (0, 255, 0), 2)
        # Show the annotated frame and write it to disk.
        cv2.imshow('image', nimg)
        out.write(nimg)
        # Press `q` to exit.
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    else:
        break
# Release VideoCapture().
cap.release()
# Close all frames and video windows.
cv2.destroyAllWindows()
# Calculate and print the average FPS.
avg_fps = total_fps / frame_count
print(f"Average FPS: {avg_fps:.3f}")
The above code:
- Loops through each frame.
- Carries out the pose estimation.
- Creates the output frame.
- Writes the FPS on top of the current resulting frame.
- Shows the resulting frame on the screen and writes it to disk as well.
Now, let’s execute the above code for YOLOv7 pose estimation.
python yolov7_keypoint.py
The following is the output.
The results look pretty good given the high FPS that we get. But note that the 60 FPS that we get here is only for the forward pass of the model. If we include the post-processing such as NMS (Non-Max Suppression), thresholding, and annotation, the FPS is likely to decrease. Still, it is impressively high.
This just shows the versatility of the new YOLOv7 object detection model. In the near future, YOLOv7 will also receive an update for instance segmentation. It will be really interesting to see how it performs compared to other instance segmentation models.
9. Summary of YOLOv7
With this, we conclude the introduction to YOLOv7. I hope you enjoyed reading the article. In summary, we have covered the following.
- The general architecture of YOLO consists of Backbone, Neck, and Head.
- Architectural reforms of YOLOv7.
- E-ELAN
- Compound Model Scaling in YOLOv7
- Trainable Bag of Freebies in YOLOv7.
- Re-parameterization in YOLOv7
- Coarse for Auxiliary and Fine for Lead loss
- How to use the YOLOv7 GitHub repository to run object detection inference.
- YOLOv7 surpasses all real-time object detectors in speed and accuracy.
- FPS: 5 – 160
- mAP: 51.4% – 56.8%
- YOLOv7 reduces parameters by about 40% and computation by about 50% while improving performance. It’s a significant feat.
- How to use the YOLOv7 Pose estimation (keypoint detection) model.