YOLOv9, created by Chien-Yao Wang and his team, is a significant development in object detection. This new version introduces Programmable Gradient Information (PGI) and the Generalized Efficient Layer Aggregation Network (GELAN) to address information loss and computational efficiency.
Together, these advancements let YOLOv9 improve on both the accuracy and the efficiency of earlier real-time object detectors.
What is YOLOv9?
YOLOv9 is the latest iteration of the YOLO series by Chien-Yao Wang et al. [1], released on 21 February 2024. It’s an advancement from YOLOv7, both developed by Chien-Yao Wang and colleagues. YOLOv7 made significant steps in optimizing the training process with what’s called a trainable bag-of-freebies, effectively enhancing training efficiency to boost object detection accuracy without adding to the inference cost. However, YOLOv7 didn’t specifically address the problem of information loss during the input data’s feedforward process, a challenge known as the information bottleneck. This issue arises from downscaling operations in the network, which can dilute important input data.
Existing solutions such as reversible architectures, masked modeling, and deep supervision help reduce the information bottleneck, but each has drawbacks in the training or inference process. They are also less effective for smaller model architectures, which are crucial for real-time object detectors like those in the YOLO series.
To overcome these challenges, YOLOv9 introduces two innovative techniques, Programmable Gradient Information (PGI) and the Generalized Efficient Layer Aggregation Network (GELAN), to tackle the information bottleneck problem directly and improve the accuracy and efficiency of object detection.
Components of YOLOv9
The YOLOv9 framework addresses two core challenges in deep-learning-based object detection: information loss and efficiency in the network architecture. The approach rests on four key components: the Information Bottleneck Principle, Reversible Functions, Programmable Gradient Information (PGI), and the Generalized Efficient Layer Aggregation Network (GELAN).
Delving deeper into these components, we will understand the aspects of YOLOv9 from theoretical foundations to practical application, highlighting how it establishes a new benchmark in object detection.
Information Bottleneck Principle
The Information Bottleneck Principle highlights how information loss occurs as data transforms within a neural network. The equation of Information Bottleneck represents the loss of mutual information between the original and transformed data by going through the deep network layers.
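Written out with the symbols defined in this section, the relation can be stated as follows. This form is reconstructed from those definitions and from the YOLOv9 paper [1], so treat the exact notation as an assumption:

```latex
I(X, X) \geq I\big(X, f_{\theta}(X)\big) \geq I\big(X, g_{\phi}(f_{\theta}(X))\big)
```

Each additional transformation can only keep or reduce the mutual information with the original input X; it can never increase it.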
In this equation, I denotes mutual information, while f and g are transformation functions with parameters θ and ϕ, respectively. As data X passes through layers (fθ and gϕ) of a deep neural network, it loses vital information needed for accurate predictions. This loss can lead to unreliable gradients and poor model convergence. A common solution is to increase the model’s size to enhance its data transformation capacity, thereby retaining more information. However, this approach doesn’t address the issue of unreliable gradients in very deep networks. The following section explores how reversible functions offer a more effective solution.
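To make the idea concrete, here is a minimal sketch that measures mutual information on toy discrete data rather than on real network activations. The functions `f` and `g` are hypothetical lossy stand-ins for the layers fθ and gϕ, and `mutual_information` is an illustrative helper, not code from the YOLOv9 repository.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Return I(X; Y) in bits, treating the paired samples as the joint distribution."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    total = 0.0
    for (x, y), c in count_xy.items():
        p_xy = c / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        total += p_xy * math.log2(p_xy / (p_x * p_y))
    return total


def f(x):
    """Hypothetical lossy 'layer': merges neighbouring input values."""
    return x // 2


def g(h):
    """A second hypothetical lossy 'layer' applied to the output of f."""
    return h // 2


xs = list(range(8))                  # X is uniform over 8 symbols
layer1 = [f(x) for x in xs]          # f(X)
layer2 = [g(h) for h in layer1]      # g(f(X))

info_input = mutual_information(xs, xs)
info_layer1 = mutual_information(xs, layer1)
info_layer2 = mutual_information(xs, layer2)
print(info_input, info_layer1, info_layer2)
```

In this toy setup the input carries 3 bits, and each lossy step discards one of them, so the three values decrease in the order the principle predicts.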
Reversible Functions
The theoretical solution to the information bottleneck is the reversible function. Reversible functions in neural networks ensure no loss of information during data transformation. These functions allow for the inversion of data transformations, meaning original input data can be perfectly reconstructed from the network’s outputs.
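In symbols, using the names r and v that this section assigns to the forward and reverse transformations (reconstructed from those definitions and the YOLOv9 paper [1], so the exact notation is an assumption):

```latex
X = v_{\zeta}\big(r_{\psi}(X)\big)
```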
In the above equation, r and v represent the forward and reverse transformations, with ψ and ζ as their parameters, respectively. Utilizing reversible functions enables networks to maintain the entire input information through all the layers, leading to more reliable gradient calculations for model updates. Despite their benefits, reversible functions challenge the traditional understanding of deep networks, especially when addressing complex problems with models that are not inherently deep.
With a reversible function, the mutual information between the input and the output is preserved. The mutual information I between the original input X and its transformed versions stays the same as X passes through the transformation function r and its inverse v.
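Stated with the same symbols (again reconstructed from the surrounding definitions and the YOLOv9 paper [1], so the notation is an assumption), the inequalities of the information bottleneck become equalities:

```latex
I(X, X) = I\big(X, r_{\psi}(X)\big) = I\big(X, v_{\zeta}(r_{\psi}(X))\big)
```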
This theoretical approach is hard to apply to lightweight models. Because they are under-parameterized, they lack the capacity to carry large amounts of raw data through the network without significant information loss, and performance suffers when crucial data cannot be preserved.
Programmable Gradient Information (PGI)
Based on the above analysis, a new training method for deep neural networks is needed, one that both generates reliable gradients for updating the model and suits shallow, lightweight networks. Programmable Gradient Information (PGI) is such a method. It comprises a main branch for inference, an auxiliary reversible branch for reliable gradient calculation, and multi-level auxiliary information that tackles the problems of deep supervision, all without adding extra inference cost.
Within the YOLOv9 framework, PGI acts as an auxiliary supervision framework that addresses the information bottleneck in deep neural networks by making the backpropagation of gradients more precise and efficient. It integrates three components, each serving a distinct but interrelated function within the model’s architecture.
Main Branch:
The main branch is optimized for inference, ensuring the model remains lean and efficient during this critical phase. It’s designed to bypass the need for auxiliary components during inference and maintain high performance without additional computational overhead.
Auxiliary Reversible Branch:
The auxiliary reversible branch generates reliable gradients and so enables precise parameter updates. By using a reversible architecture, it counters the information loss inherent in deep network layers and lets the model learn from more complete data. The branch is designed to be attached during training and removed for inference, so the extra depth and complexity do not slow inference down.
Multi-Level Auxiliary Information:
This component inserts an integration network between the feature-pyramid levels used for auxiliary supervision and the main branch, and uses it to combine the gradients returned by the different prediction heads. It tackles the information loss that occurs in deep supervision, where each level would otherwise be pushed toward only the objects assigned to it, and so helps the model make better predictions for objects of different sizes.
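The split between training and inference described by the three components above can be sketched in a few lines. This is a toy illustration of the control flow only, not PGI itself and not code from the YOLOv9 repository: `ToyDualBranchModel`, `toy_loss`, and the lambda branches are all hypothetical names, and plain numbers stand in for feature maps.

```python
class ToyDualBranchModel:
    """Toy model: the main branch always runs, the auxiliary branch only in training."""

    def __init__(self, main_branch, aux_branch):
        self.main_branch = main_branch
        self.aux_branch = aux_branch
        self.training = True
        self.aux_calls = 0  # counts how often the auxiliary branch is evaluated

    def forward(self, x):
        main_out = self.main_branch(x)
        if not self.training:
            return main_out  # inference path: the auxiliary branch is skipped entirely
        self.aux_calls += 1
        aux_out = self.aux_branch(x)
        return aux_out, main_out


def toy_loss(outputs, target):
    """Sum of squared errors over every branch output (used in training only)."""
    return sum((out - target) ** 2 for out in outputs)


model = ToyDualBranchModel(main_branch=lambda x: 2 * x, aux_branch=lambda x: 2 * x + 1)

train_outputs = model.forward(3)         # both branches contribute during training
train_loss = toy_loss(train_outputs, 6)  # the auxiliary output adds its own loss term

model.training = False
inference_output = model.forward(3)      # only the main branch is evaluated
```

The point of the sketch is that the auxiliary branch shapes the training signal but is never touched once `training` is switched off, which is why it adds no inference cost.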
Generalized Efficient Layer Aggregation Network (GELAN)
The need for an even more refined architecture becomes apparent following the introduction of Programmable Gradient Information (PGI) in YOLOv9. This is where the Generalized Efficient Layer Aggregation Network (GELAN) comes into play. GELAN represents a unique design to fit the PGI framework, enhancing the model’s ability to process and learn from data more effectively. As PGI addresses the challenge of retaining crucial information across deep neural networks, GELAN builds on this foundation by providing a flexible and efficient structure that supports various computational blocks.
The Generalized Efficient Layer Aggregation Network (GELAN) in YOLOv9 combines the best features of CSPNet’s gradient path planning with ELAN’s inference speed optimizations. GELAN represents a versatile architecture that merges these attributes and enhances the YOLO family’s signature real-time inference capability. GELAN is a lightweight framework that prioritizes quick inference times without sacrificing accuracy, extending the application of computational blocks.
# GELAN
import torch
import torch.nn as nn

# Conv, SP, and RepNCSP are defined elsewhere in the YOLOv9 repository (models/common.py).


class SPPELAN(nn.Module):
    # spp-elan
    def __init__(self, c1, c2, c3):  # ch_in, ch_out, hidden channels
        super().__init__()
        self.c = c3
        self.cv1 = Conv(c1, c3, 1, 1)
        self.cv2 = SP(5)
        self.cv3 = SP(5)
        self.cv4 = SP(5)
        self.cv5 = Conv(4*c3, c2, 1, 1)

    def forward(self, x):
        y = [self.cv1(x)]
        # each pooling stage consumes the output of the previous stage
        y.extend(m(y[-1]) for m in [self.cv2, self.cv3, self.cv4])
        return self.cv5(torch.cat(y, 1))


class RepNCSPELAN4(nn.Module):
    # csp-elan
    def __init__(self, c1, c2, c3, c4, c5=1):  # ch_in, ch_out, split channels, block channels, RepNCSP repeats
        super().__init__()
        self.c = c3//2
        self.cv1 = Conv(c1, c3, 1, 1)
        self.cv2 = nn.Sequential(RepNCSP(c3//2, c4, c5), Conv(c4, c4, 3, 1))
        self.cv3 = nn.Sequential(RepNCSP(c4, c4, c5), Conv(c4, c4, 3, 1))
        self.cv4 = Conv(c3+(2*c4), c2, 1, 1)

    def forward(self, x):
        # split the cv1 output into two halves along the channel dimension
        y = list(self.cv1(x).chunk(2, 1))
        y.extend((m(y[-1])) for m in [self.cv2, self.cv3])
        return self.cv4(torch.cat(y, 1))

    def forward_split(self, x):
        # same computation as forward, using split() instead of chunk()
        y = list(self.cv1(x).split((self.c, self.c), 1))
        y.extend(m(y[-1]) for m in [self.cv2, self.cv3])
        return self.cv4(torch.cat(y, 1))
The code above illustrates two critical components of the Generalized Efficient Layer Aggregation Network (GELAN) used in YOLOv9, focusing on enhancing the model’s ability to efficiently process and learn from complex data patterns through innovative layer aggregation techniques.
SPPELAN Module:
This module incorporates Spatial Pyramid Pooling (SPP) within the ELAN structure. It starts with a 1×1 convolutional layer that adjusts the channel dimensions, followed by three pooling layers applied one after another, each taking the previous layer’s output, to capture context at progressively larger scales. The four resulting feature maps are concatenated and passed through another 1×1 convolutional layer to consolidate the features.
RepNCSPELAN4 Module:
This component is an advanced version of CSP-ELAN aimed at further streamlining feature extraction. It splits the output of the initial convolutional layer into two halves along the channel dimension. The second half is then passed through two successive stages, each made of a RepNCSP block followed by a 3×3 convolution. Finally, both halves and the outputs of both stages are concatenated and fused by a 1×1 convolution. Keeping the intermediate outputs in the concatenation supports gradient flow and feature reuse, adding depth without a matching increase in computational cost.
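The channel bookkeeping in the two modules above can be checked with a short sketch that needs no deep-learning library. The helper names are hypothetical, and the input width `c1=1024` is an illustrative value; the remaining arguments follow the `[512, 256]` and `[512, 512, 256, 2]` settings used in the YOLOv9 head configuration, under the assumption that the config arguments map to the constructor parameters after `c1`.

```python
def sppelan_channels(c1, c2, c3):
    """Channel counts through SPPELAN: 1x1 conv, three pooling stages, concat, 1x1 conv."""
    branches = [c3]              # output of cv1
    branches += [c3, c3, c3]     # cv2, cv3, cv4 are pooling layers and keep the channel count
    concat = sum(branches)       # equals 4 * c3, the input width of cv5
    return {"in": c1, "concat": concat, "out": c2}


def repncspelan4_channels(c1, c2, c3, c4):
    """Channel counts through RepNCSPELAN4: split, two sequential stages, concat, 1x1 conv."""
    half = c3 // 2
    branches = [half, half]      # cv1 output chunked into two halves
    branches.append(c4)          # cv2 applied to the second half
    branches.append(c4)          # cv3 applied to the output of cv2
    concat = sum(branches)       # equals c3 + 2 * c4 when c3 is even
    return {"in": c1, "concat": concat, "out": c2}


spp = sppelan_channels(c1=1024, c2=512, c3=256)
elan = repncspelan4_channels(c1=1024, c2=512, c3=512, c4=256)
print(spp)
print(elan)
```

In both cases the concatenated width matches the input width declared for the final 1×1 convolution in the class definitions (`4*c3` and `c3+(2*c4)`), which is a quick way to sanity-check a custom configuration before building the model.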
The GELAN architecture merges CSPNet’s gradient efficiency and ELAN’s speed-oriented architecture into a unified framework that supports a broader range of computational blocks. This flexibility allows YOLOv9 to adapt to varying computational environments and tasks, maintaining high accuracy and speed. You can observe in Figure 6 that it can keep the input information even after 200 layers.
Architecture of YOLOv9
YOLOv9 combines Programmable Gradient Information (PGI) and the Generalized Efficient Layer Aggregation Network (GELAN) into an architecture that improves gradient flow and information retention. This combination addresses the challenges of the information bottleneck and gradient reliability, enabling the model to learn more efficiently and accurately from complex data patterns while limiting information loss.
# YOLOv9 head
head:
  [
    # multi-level auxiliary branch

    # elan-spp block
    [9, 1, SPPELAN, [512, 256]],  # 29

    # up-concat merge
    [-1, 1, nn.Upsample, [None, 2, 'nearest']],
    [[-1, 7], 1, Concat, [1]],  # cat backbone P4

    # csp-elan block
    [-1, 1, RepNCSPELAN4, [512, 512, 256, 2]],  # 32

    # up-concat merge
    [-1, 1, nn.Upsample, [None, 2, 'nearest']],
    [[-1, 5], 1, Concat, [1]],  # cat backbone P3

    # csp-elan block
    [-1, 1, RepNCSPELAN4, [256, 256, 128, 2]],  # 35

    # main branch

    # elan-spp block
    [28, 1, SPPELAN, [512, 256]],  # 36

    # up-concat merge
    [-1, 1, nn.Upsample, [None, 2, 'nearest']],
    [[-1, 25], 1, Concat, [1]],  # cat backbone P4

    # csp-elan block
    [-1, 1, RepNCSPELAN4, [512, 512, 256, 2]],  # 39

    # up-concat merge
    [-1, 1, nn.Upsample, [None, 2, 'nearest']],
    [[-1, 22], 1, Concat, [1]],  # cat backbone P3

    # csp-elan block
    [-1, 1, RepNCSPELAN4, [256, 256, 128, 2]],  # 42 (P3/8-small)

    # avg-conv-down merge
    [-1, 1, ADown, [256]],
    [[-1, 39], 1, Concat, [1]],  # cat head P4

    # csp-elan block
    [-1, 1, RepNCSPELAN4, [512, 512, 256, 2]],  # 45 (P4/16-medium)

    # avg-conv-down merge
    [-1, 1, ADown, [512]],
    [[-1, 36], 1, Concat, [1]],  # cat head P5

    # csp-elan block
    [-1, 1, RepNCSPELAN4, [512, 1024, 512, 2]],  # 48 (P5/32-large)

    # detect
    [[35, 32, 29, 42, 45, 48], 1, DualDDetect, [nc]],  # DualDDetect(A3, A4, A5, P3, P4, P5)
  ]
The above code snippet shows the YOLOv9 head divided into two parts: the main branch and the multi-level auxiliary branch. The auxiliary branch works alongside the main branch and focuses on capturing and retaining gradient information during the training phase. Both branches feed the DualDDetect layer, which receives three auxiliary outputs (A3, A4, A5) and three main outputs (P3, P4, P5). This design lets the auxiliary branch support the main branch by preserving gradient information that the model needs for learning.
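To follow the wiring in the head listing, it helps to resolve each row to an absolute layer index. The sketch below assumes the usual convention for this style of config, where a negative `from` value is relative to the current layer and a non-negative value is an absolute layer index; `resolve_sources` and the constants are hypothetical names, and the rows are copied from the listing with their arguments omitted.

```python
HEAD_START = 29  # index of the first head layer, per the "# 29" comment in the config

# (from, module) pairs copied from the head listing; args omitted for brevity
HEAD_ROWS = [
    (9, "SPPELAN"), (-1, "nn.Upsample"), ([-1, 7], "Concat"), (-1, "RepNCSPELAN4"),
    (-1, "nn.Upsample"), ([-1, 5], "Concat"), (-1, "RepNCSPELAN4"),
    (28, "SPPELAN"), (-1, "nn.Upsample"), ([-1, 25], "Concat"), (-1, "RepNCSPELAN4"),
    (-1, "nn.Upsample"), ([-1, 22], "Concat"), (-1, "RepNCSPELAN4"),
    (-1, "ADown"), ([-1, 39], "Concat"), (-1, "RepNCSPELAN4"),
    (-1, "ADown"), ([-1, 36], "Concat"), (-1, "RepNCSPELAN4"),
    ([35, 32, 29, 42, 45, 48], "DualDDetect"),
]


def resolve_sources(rows, start):
    """Map each layer index to (module, absolute source indices)."""
    resolved = {}
    for offset, (src, module) in enumerate(rows):
        index = start + offset
        sources = src if isinstance(src, list) else [src]
        # negative values are relative to the current layer, others are absolute
        absolute = [index + s if s < 0 else s for s in sources]
        resolved[index] = (module, absolute)
    return resolved


layers = resolve_sources(HEAD_ROWS, HEAD_START)
detect_index = max(layers)
detect_sources = layers[detect_index][1]
# Per the config comment, the first three inputs are auxiliary (A3, A4, A5)
# and the last three belong to the main branch (P3, P4, P5).
aux_sources, main_sources = detect_sources[:3], detect_sources[3:]
print(aux_sources, main_sources)
```

The resolved indices line up with the numbers in the config comments (32, 35, 36, 39, 42, 45, 48), which confirms that the detection layer reads from both branches.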
The above figure shows that YOLOv9 can learn the structure of the object after only a very short warm-up in training. YOLOv9 builds on the YOLOv7 architecture, adding PGI and GELAN. At the time of this writing, a detailed architectural diagram for YOLOv9 has yet to be provided in the paper. We plan to explore the YOLOv9 architecture more thoroughly in a future article.
Range of YOLOv9 Models
The YOLOv9 series introduces five models: YOLOv9-n (nano), YOLOv9-s (small), YOLOv9-m (medium), YOLOv9-c (compact), and YOLOv9-e (extended), each varying in parameter count and performance. These models cater to various requirements, from lightweight to more extensive, performance-intensive applications.
This diagram illustrates how the YOLOv9 models achieve high accuracy on the COCO dataset while utilizing fewer parameters, showcasing their efficiency in balancing model complexity with performance.
YOLOv9 COCO Benchmarks
YOLOv9’s performance on the COCO dataset demonstrates improvements in object detection, offering a balance between efficiency and precision across its variants.
With enhancements in accuracy and reduced computational requirements, YOLOv9 maintains its legacy throughout the YOLO series.
YOLOv9 demonstrates significant progress across its model variants:
Lightweight Models:
YOLOv9-S surpasses the comparable YOLO MS-S, using fewer parameters and less computation while improving AP by 0.4 to 0.6%. This improvement is a step toward making high-accuracy detection accessible with lower resource requirements.
Medium to Large Models:
YOLOv9-M and YOLOv9-E stand out for how they balance model complexity against detection precision. These models significantly reduce parameters and computational demands while also increasing accuracy. This balance is crucial for applications that require high performance without extensive computational resources.
Overall Performance
Demonstrating the architectural optimizations of YOLOv9, the YOLOv9-C model operates with 42% fewer parameters and 21% less computational demand compared to YOLOv7 AF, achieving comparable accuracy. Moreover, the YOLOv9-E model sets a new standard for large-scale models by utilizing 15% fewer parameters and 25% less computational effort than YOLOv8-X, coupled with a significant 1.7% improvement in AP. These models underscore YOLOv9’s design excellence, balancing efficiency with the precision critical for real-time detection tasks.
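Percentages like these are simple relative reductions, which can be recomputed from published model sizes. The sketch below uses the parameter counts quoted in the comparison section of this article (57.3 million for YOLOv9-E and 68.2 million for YOLOv8-X); `relative_reduction` is a hypothetical helper name.

```python
def relative_reduction(baseline, candidate):
    """Percentage by which `candidate` is smaller than `baseline`."""
    return 100.0 * (baseline - candidate) / baseline


yolov8x_params_m = 68.2  # millions of parameters, as quoted in this article
yolov9e_params_m = 57.3  # millions of parameters, as quoted in this article

param_reduction = relative_reduction(yolov8x_params_m, yolov9e_params_m)
print(f"YOLOv9-E uses {param_reduction:.1f}% fewer parameters than YOLOv8-X")
```

With these counts the reduction comes out at roughly 16%, in the same range as the 15% figure quoted above; the exact value depends on which published parameter counts are used.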
Inference using YOLOv9
We’ll use the pre-trained weights from the YOLOv9 GitHub repository for our inference experiments in this article. To run inference, first clone the YOLOv9 repository and move into it. In a notebook, use %cd rather than ! cd, because a directory change made with ! does not persist to the next command:
! git clone https://github.com/WongKinYiu/yolov9.git
%cd yolov9
Then, install the dependencies listed in requirements.txt with the following command:
! pip install -r requirements.txt
After installation, we need to download the model weights from the YOLOv9 GitHub repository. Then, you can run the inference using this command:
! python detect.py --source /path/to/your/video.mp4 --weights './yolov9-c.pt' --device 0 --iou-thres 0.7
You can add more arguments according to your use case; all available options are listed in the detect.py script. After running detect.py, you might face an error like this:
AttributeError: 'list' object has no attribute 'device'
To solve this, update the non_max_suppression function in utils/general.py in the yolov9 directory so that it selects the main-branch prediction from the model output:
prediction = prediction[0][1]
# [0][0] for aux prediction, [0][1] for main prediction, [0] for re-parameterized prediction.
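The indexing in that fix can be illustrated without PyTorch. The sketch below assumes the nesting described in the comment above: a dual-branch model returns a pair whose first element is itself an `[aux, main]` pair, while a re-parameterized model returns the main prediction directly as the first element. `select_main_prediction` is a hypothetical helper, not a function in the YOLOv9 repository, and strings stand in for tensors.

```python
def select_main_prediction(prediction):
    """Pick the object that NMS should process from a possibly nested model output."""
    if isinstance(prediction, (list, tuple)):
        prediction = prediction[0]          # keep only the inference output
        if isinstance(prediction, (list, tuple)):
            prediction = prediction[1]      # dual-branch output: [aux, main] -> main
    return prediction


# Strings stand in for tensors so the sketch runs without PyTorch.
dual_output = (["aux_pred", "main_pred"], ["aux_feats", "main_feats"])
reparam_output = ("main_pred", ["feats"])

picked_dual = select_main_prediction(dual_output)
picked_reparam = select_main_prediction(reparam_output)
picked_plain = select_main_prediction("main_pred")
print(picked_dual, picked_reparam, picked_plain)
```

The original error arises because a list has no `device` attribute; once the main prediction is selected, the object reaching the rest of the function is a single tensor again. Checking the type before indexing, as in this sketch, keeps one code path usable for both kinds of weights.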
As you can see, YOLOv9c performs well even under low light conditions.
YOLOv9e even has almost perfect predictions in blurry conditions.
YOLOv9 vs YOLOv8
We compared the latest YOLOv9 with YOLOv8, the previous version of the YOLO series, using two model pairings.
First, we compared YOLOv9-C (25.3 million parameters) with YOLOv8-M (25.9 million parameters), chosen because their parameter counts are similar. We ran inference on an Nvidia GeForce RTX 3070 Ti Laptop GPU, with the confidence threshold set to 0.25 and the NMS IoU threshold set to 0.7 for both models. We used the same video for inference with both models, and here are the results:
# YOLOv9c
Speed: 0.3ms pre-process, 24.2ms inference, 1.3ms NMS per image at shape (1, 3, 640, 640)
# YOLOv8m
Speed: 1.0ms preprocess, 7.8ms inference, 0.7ms postprocess per image at shape (1, 3, 384, 640)
Then, we compared YOLOv9-E (57.3 million parameters) with YOLOv8-X (68.2 million parameters) on the same GPU, with the same thresholds (confidence 0.25, NMS IoU 0.7) and the same video. Here are the results:
# YOLOv9e
Speed: 0.3ms pre-process, 30.5ms inference, 1.3ms NMS per image at shape (1, 3, 640, 640)
# YOLOv8x
Speed: 0.9ms preprocess, 20.6ms inference, 1.0ms postprocess per image at shape (1, 3, 384, 640)
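We did not use re-parameterized weights for this experiment, which may be one reason why YOLOv8 has a faster inference speed than YOLOv9 in the above experiments. The logs also show different input shapes, (1, 3, 640, 640) for YOLOv9 and (1, 3, 384, 640) for YOLOv8, so YOLOv9 processed more pixels per frame and the timings are not a like-for-like comparison. We also explored YOLOv9 fine-tuning and YOLOv9 instance segmentation in detail in separate articles.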
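The logged timings are easier to compare once they are converted to end-to-end latency and frames per second. The sketch below copies the four log lines from this section; `summarize` is a hypothetical helper, and the figures describe only this one run on this one GPU.

```python
def summarize(pre_ms, inference_ms, post_ms, shape):
    """Return total latency, frames per second, and pixels per frame for one log line."""
    total_ms = pre_ms + inference_ms + post_ms
    _, _, height, width = shape
    return {"total_ms": total_ms, "fps": 1000.0 / total_ms, "pixels": height * width}


runs = {
    "YOLOv9c": summarize(0.3, 24.2, 1.3, (1, 3, 640, 640)),
    "YOLOv8m": summarize(1.0, 7.8, 0.7, (1, 3, 384, 640)),
    "YOLOv9e": summarize(0.3, 30.5, 1.3, (1, 3, 640, 640)),
    "YOLOv8x": summarize(0.9, 20.6, 1.0, (1, 3, 384, 640)),
}

# How many more pixels per frame the YOLOv9 runs processed than the YOLOv8 runs
pixel_ratio = runs["YOLOv9c"]["pixels"] / runs["YOLOv8m"]["pixels"]

for name, stats in runs.items():
    print(f"{name}: {stats['total_ms']:.1f} ms/frame, {stats['fps']:.1f} FPS, {stats['pixels']} pixels")
print(f"pixel ratio (640x640 vs 384x640): {pixel_ratio:.2f}")
```

The pixel ratio shows that the YOLOv9 runs handled about two thirds more pixels per frame than the YOLOv8 runs. For a fairer comparison, rerun both models at the same input shape before drawing conclusions about speed.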
Key Takeaways
- Programmable Gradient Information (PGI): PGI is a technique for generating more reliable gradients using an auxiliary reversible branch that is needed only during training. It is effective for all model sizes, from lightweight to large.
- Generalized Efficient Layer Aggregation Network (GELAN): GELAN, the backbone of YOLOv9, is a new architecture that enhances flexibility by allowing interchangeable computational blocks, provides depth-wise parametrization for efficient resource usage, and ensures stable performance across various configurations with different block types and depths for scalable object detection.
- Enhanced Performance with Less Complexity: Through its architecture, YOLOv9 achieves higher accuracy in object detection while reducing the model’s complexity and computational demands. This is reflected in its results on the COCO dataset, where it improves AP with fewer parameters and less computational overhead than previous versions.
- Versatility Across Different Model Sizes: YOLOv9 is versatile, offering five model variants (YOLOv9-n, YOLOv9-s, YOLOv9-m, YOLOv9-c, and YOLOv9-e) to cater to a range of requirements from lightweight to more extensive, performance-intensive applications. This flexibility ensures that YOLOv9 can be effectively deployed in various environments and applications.
Conclusion
Through techniques and designs like Programmable Gradient Information (PGI) and the Generalized Efficient Layer Aggregation Network (GELAN), YOLOv9 has maintained the YOLO series’ legacy of efficiency and accuracy. We observed that YOLOv9, with its new architecture, uses fewer parameters (Params) and fewer computations (FLOPs) while delivering significant improvements in accuracy (AP). As we delved deeper into the architectural components of YOLOv9, one thing became abundantly clear: it’s not just a model; it’s a game-changer. Welcome to the future of object detection.