
YOLOX Object Detector Paper Explanation and Custom Training

YOLOX object detector is a recent addition to the YOLO family. Read on for a detailed YOLOX paper explanation and to learn how to train YOLOX on a custom dataset.

What is YOLOX?

YOLOX is a single-stage, real-time object detector. It was introduced in the paper YOLOX: Exceeding YOLO Series in 2021. The baseline model of YOLOX is YOLOv3 SPP with a Darknet53 backbone. YOLOX is a very interesting addition to the YOLO family: with a few unique feature additions, it delivers results that are on par with state-of-the-art models. This blog post will take you through a detailed discussion of the YOLOX paper and show how to train YOLOX on a custom dataset.

In summary, you will get a detailed and intuitive explanation of the following.
1. YOLOX and its salient features
2. How does YOLOX work?
3. How to train YOLOX on a custom dataset?
4. How to prepare a dataset for optimal results?
5. Performance comparison of YOLOX models

Table of Contents
  1. What’s New in YOLOX?
  2. What are Anchors in Object Detection?
  3. What are Anchor Free Detectors?
  4. Anchor Free YOLOX
  5. Multi-Positives in YOLOX to Improve the Recall
  6. Introducing the Decoupled Head in YOLOX
  7. simOTA Advanced Label Assignment Strategy
  8. Strong Data Augmentation in YOLOX
  9. Custom Drone Dataset for Training YOLOX
  10. Setting up YOLOX Training Environment
  11. Setting up YOLOX Model Configurations for Training
  12. Training YOLOX on Custom Dataset
  13. YOLOX Inference Tests and Observations

YOLO Master Post – Every Model Explained

Unlock the full story behind all the YOLO models’ evolutionary journey: Dive into our extensive pillar post, where we unravel the evolution from YOLOv1 to YOLO-NAS. This essential guide is packed with insights, comparisons, and a deeper understanding that you won’t find anywhere else.
Don’t miss out on this comprehensive resource, Mastering All YOLO Models, for a richer, more informed perspective on the YOLO series.

What’s New in YOLOX?

Released in July 2021, YOLOX switched to an anchor-free approach, departing from previous YOLO models. It also introduced advanced detection techniques like a decoupled head and the simOTA label assignment strategy. Moreover, strong data augmentations like Mosaic and MixUp are incorporated for robust training. YOLOX began with the YOLOv3 SPP model as its baseline and applied these modifications one after another.

In short, the salient features of YOLOX are:

  • Anchor free design
  • Decoupled head
  • simOTA label assignment strategy
  • Advanced Augmentations: Mixup and Mosaic

What are Anchors in Object Detection?

Anchors are essentially a large set of bounding box presets tiled across an image during detection. The detector is trained to classify whether each anchor box overlaps with a ground-truth box or not. Since the scale and location of the objects are unknown in advance, a large number of anchors of various sizes and aspect ratios are required. The anchor-based design improved upon the sliding-window approach used in early object detection models.

Fig: Anchor boxes of various sizes and aspect ratios tiled across an image
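
To make the scale of the problem concrete, here is a minimal, hypothetical sketch of how a dense anchor grid could be generated for a single feature-map scale. The sizes and aspect ratios below are illustrative assumptions, not values from any particular detector.

import itertools

def generate_anchors(feat_w, feat_h, stride, sizes=(32, 64), ratios=(0.5, 1.0, 2.0)):
    """Tile anchor boxes (cx, cy, w, h) over a feat_w x feat_h grid of cells."""
    anchors = []
    for y, x in itertools.product(range(feat_h), range(feat_w)):
        cx, cy = (x + 0.5) * stride, (y + 0.5) * stride   # cell center in image coordinates
        for size, ratio in itertools.product(sizes, ratios):
            w, h = size * ratio ** 0.5, size / ratio ** 0.5  # w/h equals the aspect ratio
            anchors.append((cx, cy, w, h))
    return anchors

# A 52x52 grid with 6 anchor presets per cell already yields 16,224 boxes at one scale.
print(len(generate_anchors(52, 52, stride=8)))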

With the anchor box approach, the whole image can be processed at once, which makes real-time object detection possible. It was so successful that, to this day, almost all CNN-based object detectors use an anchor-based pipeline to attain optimal detection performance. The same goes for the previous versions of the YOLO family, except YOLOv1.

Drawbacks of Anchor Based Approach

Although the anchor-based approach is widely accepted, it has two drawbacks.

  1. It needs a large set of anchor boxes. For example, there are more than 100k in RetinaNet.
  2. The anchor boxes require a lot of hyperparameters and design tweaks. For example,
    1. Number of anchors 
    2. Size of the anchors
    3. The aspect ratio of the boxes
    4. The number of sections the image should be divided into
    5. IoU threshold to decide labels (positive or negative), etc.

It gets even more complicated with a multi-scale approach. These hyperparameters impact the end result even with the slightest change. On top of that, processing anchors introduces a significant amount of delay.

What are Anchor Free Object Detectors?

Anchor-based detectors use pre-defined boxes of various sizes as proposals for predicting the location of the object in an image. This leads to too many anchor-related hyperparameters and large computation requirements. Anchor-free methods try to localize objects directly, using centers or key points instead of boxes as proposals. As you will see, this makes the approach simpler, more flexible, and more intuitive!

Over the years, anchor-free detectors have evolved to be on par with, or even better than, anchor-based detectors. Some examples of anchor-free detectors are CornerNet, CenterNet, and FCOS.

You might be surprised to know that the popular YOLOv1 detector was also anchor-free!

Anchor free detectors can be categorized into two types:

  • Key-point based
  • Center-based

Keypoint-based detectors locate several predefined (or self-learned) key points on the object. The spatial extent of the object is obtained from this cluster of key points.

Fig: Keypoint-based object detection

Center-based detectors, on the other hand, find positives at the center (or in the center region) of an object, then predict four distances from those positives to the boundary.

Fig: Center-based object detection

We also have a dedicated article on the center-based anchor-free detector CenterNet. Check out that blog post for more insight on anchor-free detectors.

Anchor-based and anchor-free mechanisms essentially differ in how labels (positive, negative) are assigned to a sample. Anchor-free does not mean the removal of bounding boxes. It is just a different design strategy, one where an anchor preset is not required.

Anchor Free YOLOX

YOLOX adopts the center-based approach, which has a per-pixel detection mechanism. In anchor-based detectors, each location on the feature map acts as the center for multiple anchors.

YOLOv3 SPP outputs 3 predictions per location. Each prediction is an 85D vector (4 bounding box coordinates + 1 IoU/objectness score + 80 class scores) with embeddings for the following values.

  1. Class scores
  2. IoU (objectness) score
  3. Bounding box coordinates (center x, center y, width, height)

Based on these scores, anchors are labeled positive or negative.

Fig: Output of the YOLO head per location

On the other hand, YOLOX reduces the predictions at each location (pixel) from 3 to 1. For localization, the prediction contains a single 4D vector encoding the location of the box at each foreground pixel.

T = {left, top, right, bottom}

Here, left, top, right, and bottom are the distances from the location to the four sides of the bounding box.

Fig: Anchor-free detection in YOLOX

The locations inside a GT (ground truth) box, clustered around its center, are considered positive. This concept of center-ness is inspired by previous research like CornerNet, CenterNet, and FCOS. It simplifies the complexity of anchor-based mechanisms to a great extent, and the number of training targets in YOLOX is far smaller than in anchor-based networks.
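
As a quick illustration of this encoding, here is a small sketch (our own, not code from the YOLOX repository) that decodes a foreground location and its four predicted distances into box corners.

def decode_ltrb(x, y, left, top, right, bottom):
    """Convert a location (x, y) and its predicted distances to the
    four sides into (x1, y1, x2, y2) box corners."""
    return (x - left, y - top, x + right, y + bottom)

# A location at (120, 95) predicting distances (30, 20, 34, 52):
print(decode_ltrb(120, 95, 30, 20, 34, 52))  # (90, 75, 154, 147)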

We will discuss FCOS in a separate blog post to understand anchor free design better. Do keep checking our blog for updates.

Introducing the Decoupled Head in YOLOX

Every YOLO architecture consists of three parts: the backbone, the neck, and the head. The backbone extracts features from the input image. These features pass through the neck, where multiscale features are aggregated. Finally, the head uses these feature maps to output localization and classification scores.

Fig: Outline of the YOLO architecture

Shared Head in YOLO

Earlier YOLO networks used a coupled head architecture, where all the regression and classification scores are obtained from the same head.

Fig: Coupled (shared) head architecture of YOLO

The Need for Decoupling the YOLO Head

A coupled, or shared, head is the most commonly used architecture. It seemed fine until researchers at Microsoft pointed out its shortcomings in 2020. The paper Rethinking Classification and Localization for Object Detection showed that there is a conflict between the regression (localization) and classification tasks. It happens due to the misalignment of features, as shown below.

Fig: Spatial misalignment of classification and localization heads

Ideally, a high classification score for a prediction would imply high localization confidence. However, that is not always the case. Due to the spatial misalignment of features, the two scores may not be synchronized, which hurts the training process. The following illustration shows mismatched IoU and class confidence.

Fig: Demonstrative cases of the misalignment between classification confidence and localization accuracy. The yellow bounding boxes denote the ground truth, while the red and green bounding boxes are both detection results.

The arrival of Decoupled Head in YOLOX

YOLOX implements a decoupled head for the classification and regression tasks. It contains a 1 × 1 conv layer to reduce the channel dimension, followed by two parallel branches with two 3 × 3 conv layers each.

Fig: Decoupled head in YOLOX

This approach results in better performance and faster convergence: where the coupled head takes about 300 epochs to converge, YOLOX with the decoupled head can converge in about 200 epochs.
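
Below is a minimal PyTorch sketch of the decoupled-head idea. It assumes a single feature level, a hidden width of 256, and 80 classes; the actual YOLOX head runs one such head per FPN level and differs in details.

import torch.nn as nn

class DecoupledHead(nn.Module):
    def __init__(self, in_channels, num_classes=80, width=256):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, width, 1)      # 1x1 conv reduces channel dimension
        self.cls_branch = nn.Sequential(                  # two 3x3 convs for classification
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU())
        self.reg_branch = nn.Sequential(                  # two 3x3 convs for regression
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU())
        self.cls_pred = nn.Conv2d(width, num_classes, 1)  # per-location class scores
        self.reg_pred = nn.Conv2d(width, 4, 1)            # per-location box values
        self.obj_pred = nn.Conv2d(width, 1, 1)            # per-location objectness / IoU

    def forward(self, x):
        x = self.stem(x)
        cls_feat, reg_feat = self.cls_branch(x), self.reg_branch(x)
        return self.cls_pred(cls_feat), self.reg_pred(reg_feat), self.obj_pred(reg_feat)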

Multi-Positives in YOLOX to Improve the Recall

With anchor-based methods, there is a significant number of positive samples per ground truth. However, after switching to the center-based method, the number of positives reduced drastically. Consequently, recall dropped, creating an imbalance during training.

Confused about how the recall is calculated? We have an excellent article that will help you brush up on the basics of precision and recall: Intersection over Union (IoU).

To fix this, YOLOX experimented with multi-positive center sampling: the 3×3 area around the positive location is also considered positive, which increases the number of positive candidates to some extent. Note that this is still not the final label assignment strategy adopted by YOLOX.
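
A rough, simplified sketch of that sampling rule: the grid cell containing the ground-truth center and its eight neighbors are all treated as positives.

def multi_positive_cells(cx, cy, stride, grid_w, grid_h):
    """Return the 3x3 block of grid cells around the GT center as positive samples."""
    gx, gy = int(cx // stride), int(cy // stride)  # cell containing the GT center
    return [(x, y)
            for y in range(max(0, gy - 1), min(grid_h, gy + 2))
            for x in range(max(0, gx - 1), min(grid_w, gx + 2))]

# A GT centered at (130, 70) on a stride-8, 80x80 grid yields 9 positive cells.
print(len(multi_positive_cells(130, 70, 8, 80, 80)))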

simOTA Advanced Label Assignment Strategy

Label assignment is the step where ‘positive’, ‘negative’, and ‘don’t-care’ labels are assigned to the anchors. Earlier object detectors used hand-crafted label assignment rules. For example, an IoU (Intersection over Union) threshold is used to determine whether an anchor should be labeled positive, and each ground truth (GT) is dealt with independently. However, this method runs into problems with ambiguous anchors. For example, it is difficult to decide which GT box should get the ambiguous anchors in the image below.

Fig: The problem of ambiguous anchors

Ambiguity may lead to biased learning, which is not desirable. Hence, it is very important to assign labels in the most optimal way possible.

Optimizing Label Assignment

The research paper OTA: Optimal Transport Assignment for Object Detection solves the issue by treating label assignment as a transport problem. It increases average precision (AP) significantly.

The transport problem is a special type of linear programming problem. Algorithms like Sinkhorn-Knopp can solve it iteratively, but a huge number of iterations is needed per sample. Due to this sheer number of iterations, OTA increases the training cost by 25%.

You may wonder: why use OTA if it increases training time? You are absolutely right. Although OTA improves accuracy, increased training cost is not a desirable feature. Hence, YOLOX simplified the strategy into simOTA and got the best of it:

“More Accuracy + Less Training Cost”

What is simOTA in YOLOX?

Simplified OTA, or simOTA, is the redesigned Optimal Transport Assignment strategy. The training cost does not increase, but average precision (AP) definitely improves. This is shown with empirical evidence in the paper.

In simOTA, the costly iterative optimization is not performed for every assignment. Instead, a strategy called Dynamic Top K is used to estimate the approximate number of positive anchors for each ground truth, and only the top K candidates are selected as positives. This reduces the number of iterations many-fold.

The number of positive labels per ground truth (GT) varies due to the following factors.

  • Size
  • Scale
  • Occlusion conditions, etc.

However, it is difficult to model a mapping from these factors to the positive anchor number K. Hence, it is estimated from IoU values: the IoU values of the candidate anchors with a ground truth (GT) are summed up to represent that GT’s estimated number of positive anchors.

The intuition is that the number of positive anchors for a certain GT should be positively correlated with the number of anchors that have regressed it well.
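
The sketch below mirrors how this Dynamic Top K estimate is computed in the YOLOX repository (simplified; the actual simOTA code then picks each GT’s K lowest-cost candidates using a combined classification and localization cost).

import torch

def dynamic_k_estimate(pairwise_ious, n_candidate_k=10):
    """pairwise_ious: (num_gt, num_candidates) IoUs between GTs and candidate locations.
    Each GT's k is the clamped integer sum of its top-10 IoUs, so well-regressed
    GTs receive more positive anchors."""
    n_candidate_k = min(n_candidate_k, pairwise_ious.size(1))
    topk_ious, _ = torch.topk(pairwise_ious, n_candidate_k, dim=1)
    return torch.clamp(topk_ious.sum(1).int(), min=1)

ious = torch.tensor([[0.8, 0.7, 0.6, 0.1],
                     [0.3, 0.2, 0.1, 0.0]])
print(dynamic_k_estimate(ious))  # k values 2 and 1: sums 2.2 and 0.6, clamped to at least 1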

Strong Data Augmentation in YOLOX

YOLOX incorporates Mosaic and MixUp augmentations, which improve the performance further.

Mixup Augmentation

MixUp augmentation is the weighted addition of two images. It was initially used for classification tasks but was later introduced to object detection as well.

Fig: MixUp augmentation in YOLOX
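
A minimal MixUp sketch follows (our simplification; YOLOX’s version also merges the box labels of both images and uses its own mixing-ratio scheme).

import numpy as np

def mixup(img1, img2, alpha=1.5):
    """Blend two equal-sized images with a Beta-distributed weight."""
    lam = np.random.beta(alpha, alpha)  # mixing ratio in (0, 1)
    mixed = lam * img1.astype(np.float32) + (1 - lam) * img2.astype(np.float32)
    return mixed.astype(np.uint8)       # boxes from BOTH images remain valid targets

img_a = np.random.randint(0, 255, (640, 640, 3), dtype=np.uint8)
img_b = np.random.randint(0, 255, (640, 640, 3), dtype=np.uint8)
aug = mixup(img_a, img_b)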

Mosaic Augmentation

Mosaic augmentation was first introduced by Glenn Jocher in the Ultralytics YOLOv3 repository. It combines 4 training images into one, like a plain collage, and then crops the result in a certain ratio. Mosaic helps the network learn to detect smaller objects better.

Fig: Mosaic augmentation in YOLOX
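
And a bare-bones Mosaic sketch along the same lines (ours; the real implementation also rescales the images and shifts the box labels to match the crop).

import numpy as np

def mosaic(imgs, out_size=640):
    """Combine 4 images of shape (out_size, out_size, 3) into a collage,
    then take a random out_size x out_size crop."""
    top = np.hstack([imgs[0], imgs[1]])
    bottom = np.hstack([imgs[2], imgs[3]])
    collage = np.vstack([top, bottom])                  # (2*out_size, 2*out_size, 3)
    x0 = np.random.randint(0, out_size)
    y0 = np.random.randint(0, out_size)
    return collage[y0:y0 + out_size, x0:x0 + out_size]  # crop back to out_size

imgs = [np.random.randint(0, 255, (640, 640, 3), dtype=np.uint8) for _ in range(4)]
aug = mosaic(imgs)  # (640, 640, 3)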

Custom Drone Dataset for Training YOLOX

Let’s examine the dataset we used for training YOLOX. We are using the Drone Yolo Detection dataset from Kaggle, which consists of 4014 labeled images.

Fig: Sample images from the custom drone dataset used for training YOLOX

We observed that the out-of-the-box dataset has several issues. Most images are sequential video frames: similar, repeating samples that lack diversity. Missing and poor annotations were also observed in the dataset.

Therefore, we extracted 600 images from the set and annotated them properly in Roboflow, and we added 188 more images of our own. Some examples of the issues and their fixes are shown below. The updated version of the dataset is available on Kaggle.

Dataset Modification

Fig: Image sequence from a video
Fig: Missing annotations

The modified dataset is in PASCAL VOC format. YOLOX requires the folder structure to follow the hierarchy shown below.

Fig: Required dataset directory structure for YOLOX
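
For reference, this is the standard PASCAL VOC hierarchy the repository expects inside datasets/ (the split file names, train.txt and valid.txt here, must match the image_sets entries configured later):

datasets
└── VOCdevkit
    └── VOC2012
        ├── Annotations      # .xml annotation files
        ├── JPEGImages       # image files
        └── ImageSets
            └── Main         # train.txt, valid.txt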

The JPEGImages folder contains the images, and all annotations in .xml format go in the Annotations directory. We have already structured the dataset for you. If you want to use your own custom dataset, use the DataGenerator.ipynb notebook provided in the download code.

Download Code: To easily follow along with this tutorial, please download the code by clicking on the button below. It’s FREE!

Setting up YOLOX Training Environment

YOLOX installation is pretty straightforward: just clone the YOLOX GitHub repository and follow the installation instructions provided in its markdown. However, we did face a few issues while testing on different setups, so we are documenting them here for your convenience. Feel free to comment below if you run into any other issues. We will also add a link to a Colab notebook that is free from dependency issues.

TEST SETUP CONFIGURATIONS

  1. Intel i7 6th Gen, GTX 1080 Ti, Ubuntu 20.04
  2. Ryzen 5 4th Gen, GTX 1650 Notebook, Ubuntu 20.04, and Windows 10
  3. Intel i7 8th Gen, GTX 1060 Notebook 6 GB, Ubuntu 20.04

Create a New Environment Using Conda

conda create --name yolox python=3.10 -y
conda activate yolox

Install Jupyter and ipykernel for the Environment

This is necessary; otherwise, your notebook will use the base kernel.

pip install ipykernel
pip install jupyter
python -m ipykernel install --user --name yolox --display-name yolox

Clone YOLOX GitHub Repository

git clone https://github.com/Megvii-BaseDetection/YOLOX.git
cd YOLOX

Install Dependencies

Check the requirements.txt file in the YOLOX repository you cloned. It lists torch and torchvision, which require CUDA to be set up globally beforehand; without it, pip will install the CPU-only versions. To utilize the GPU, install the CUDA-enabled builds with the conda command below. You may skip this step if you don’t need GPU support.

conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch

We also faced conflicts with onnx while installing. Hence, we will skip its installation, as it is not needed for training; if you need it later, you can install it by relaxing the pinned versions. Now go ahead and comment out the following lines in the requirements.txt file. Don’t forget to save the changes.

# torch>=1.7
# torchvision
# onnx==1.8.1
# onnxruntime==1.8.0
# onnx-simplifier==0.3.5

Install the rest of the dependencies using the following command.

pip install -v -e .

Wget, Stream Editor, and Tensorboard Installation

We will be using wget to download files. You can install it using sudo apt install wget on Linux. If you are on Windows, check out the wget install page: unzip the binaries in any location, preferably a folder on the C drive, then add it to the path in the environment variables. Alternatively, the files can be downloaded manually.

The stream editor, sed, is required for modifying text. Install it using sudo apt install sed. For Windows, check out the sed installation guide for Windows.

We will also install TensorBoard (pip install tensorboard) to monitor training logs in real time.

Setting Up YOLOX Model Configurations for Training

Training is pretty much straightforward, with a single line of command. But there are a few configurations that we need to set up properly first. Let’s take a look at the YOLOX GitHub repository structure.

Fig: YOLOX GitHub repository structure

1. datasets

It contains the dataset on which we want to train our models. YOLOX supports datasets in PASCAL VOC and COCO formats. We need to copy the prepared dataset to this directory.

2. exps

YOLOX requires a configuration script for training and inference. It is a Python script in which the training parameters are defined. This directory contains model-specific default scripts and example scripts, organized in the following manner.

Fig: Structure of the experiment configuration scripts in the exps directory

In YOLOX, training parameters are stored in Python scripts called experiment files. Example scripts are available in the exps/example/yolox_voc/ directory. These scripts contain the following parameters, which we need to modify.

  • Network Width
  • Network Depth
  • Number of Classes
  • Number of Epochs
  • Augmentation info
  • Paths to the training and validation datasets, etc.

The depth and width for the various models are as follows. Check out exps -> default for more information.

Fig: Network depth and width of YOLOX models
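
For quick reference, the depth and width multipliers defined in the default exp files of the YOLOX repository are as follows.

Model        Depth   Width
YOLOX-Nano   0.33    0.25
YOLOX-Tiny   0.33    0.375
YOLOX-S      0.33    0.50
YOLOX-M      0.67    0.75
YOLOX-L      1.00    1.00
YOLOX-X      1.33    1.25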

The rest of the defaults are available in yolox/exp/yolox_base.py. Note that YOLOX has already set the prefix for the train and validation data paths:

datasets/VOCdevkit/VOC + str(year) + /ImageSets/Main.

Hence, we only need to specify the year, 2007 or 2012, in the get_dataloader and get_eval_loader functions respectively.

i.e.,

image_sets=[('2012', 'train')],
image_sets=[('2012', 'valid')],

3. yolox

In this directory, we need to define the classes. Since we are using the PASCAL VOC format, voc_classes.py has to be modified. The path to this file is yolox/data/datasets/voc_classes.py.

By default, it has 20 classes. It should be modified according to the dataset. For example, if a dataset contains the classes bottle, plastic bag, and soda can, it will look like the following.

VOC_CLASSES = ( "bottle", "plastic bag", "soda can", )

Notice the comma “,” at the end. The repository is designed this way: even with a single class, the trailing comma is required, because in Python ("drone") is just a string, while ("drone",) is a one-element tuple.

4. YOLOX Weights File

Make sure to download the model’s .pth weights file before training. It is available in the YOLOX GitHub repo releases.

Training YOLOX on Custom Dataset

Let’s go through the training pipeline of the YOLOX medium model step by step. You can easily switch between models by using the matching configurations.

1. Installations

Clone the YOLOX GitHub repository and follow the installation instructions provided above. Also select the yolox kernel (created above) when opening the notebooks.

!git clone https://github.com/Megvii-BaseDetection/YOLOX.git

2. Define Classes

Modify voc_classes.py to contain the drone label. The following cell magic lets us write a file from a notebook cell.

from IPython.core.magic import register_line_cell_magic

@register_line_cell_magic
def writetemplate(line, cell):
    with open(line, 'w') as f:
        f.write(cell.format(**globals()))

Replace the class labels with your custom labels. In our case, it is drone. Again, don’t forget the comma at the end.

%%writetemplate yolox/data/datasets/voc_classes.py

VOC_CLASSES = (
 "drone",
)

3. Download the Drone Dataset

Download the dataset and unzip it in the datasets directory.

[Update]: Dropbox link has been updated with GitHub raw.

%cd datasets
!wget https://github.com/spmallick/learnopencv/blob/master/YOLOX-Object-Detection-Paper-Explanation-and-Custom-Training/Drone-dataset.zip?raw=true -O VOCdevkit.zip -q --show-progress
!unzip -qq VOCdevkit.zip
!rm VOCdevkit.zip
%cd ..

4. Download Exp Config Files

We have already created model-specific config scripts for you. Go ahead and download them in the exps directory.

# Download experiment config files.
%cd exps
!wget https://github.com/spmallick/learnopencv/blob/master/YOLOX-Object-Detection-Paper-Explanation-and-Custom-Training/ExpConfigs.zip?raw=true -O custom_exps.zip -qq --show-progress
!unzip custom_exps.zip
%cd ..

The parameters in the scripts are as follows.

Depth : 0.67
Width : 0.75
Epochs : 300
Number of Classes : 1
Train Data Path : image_sets=[('2012', 'train')],
Validation Data Path : image_sets=[('2012', 'valid')],
Augmentation

  • Mixup: 1.0
  • Mosaic: 1.0
  • HSV: 1.0
  • FLIP: 0.5

Let’s modify the number of epochs to 25 using the stream editor.

MAX_EPOCH = 25
!sed -i -e 's/self.max_epoch = 300/self.max_epoch = {MAX_EPOCH}/g' "exps/ExpConfigs/yolox_voc_m.py"

We will keep the rest of the parameters as shown above. Feel free to experiment with them. If you want to create an exp.py file from scratch, check out the notebook in the download code.

5. How to Train YOLOX?

Run the following command to start training. Check out the YOLOX GitHub documentation if required.

python ./tools/train.py -f exps/ExpConfigs/yolox_voc_m.py -d 1 -b 8 --fp16 -o -c yolox_m.pth

Arguments:

  • -f: Path to the experiment file.
  • -d: Number of devices (GPUs) to use for training.
  • -c: Path to the pretrained checkpoint weights.
  • -b: Batch size, which depends on your device. If you see a CUDA memory error, reduce the batch size. For example, batch size = 2 works on a GTX 1650 4 GB notebook edition, while batch size = 16 works fine on Google Colab.
  • -o: Occupy GPU memory for faster training. You may need to turn this off if GPU memory is less than 4 GB.
Fig: YOLOX Medium training log for 25 epochs

6. Evaluate YOLOX

Evaluate YOLOX with the following command. As you can see, we are using the checkpoint with the best mAP value. Feel free to experiment with the other checkpoints in the YOLOX_outputs/yolox_voc_m directory.

After training the YOLOX medium model for 25 epochs, we get 83.1% mAP.

MODEL_PATH = "YOLOX_outputs/yolox_voc_m/best_ckpt.pth"
!python tools/eval.py -f exps/ExpConfigs/yolox_voc_m.py -c {MODEL_PATH} -b 8 -d 1 --conf 0.001

7. Download Test Images and Videos

# Download Images
%mkdir inference_media
%cd inference_media
!wget https://www.dropbox.com/s/1dy29ys1fkce8k3/bird-and-drone.png?dl=1 -O bird-and-drone.jpg -qq --show-progress
!wget https://www.dropbox.com/s/i0afm1nqm6iiuji/eagle-capturing-drone.png?dl=1 -O eagle-capturing-drone.jpg -qq --show-progress
!wget https://www.dropbox.com/s/kje4h0avj2scgjj/eagle-vs-drone.png?dl=1 -O eagle-vs-drone.jpg -qq --show-progress
!wget https://www.dropbox.com/s/jhjy3lfl5908vta/drone-vs-birds.jpg?dl=1 -O drone-vs-birds.jpg -qq --show-progress
!wget https://www.dropbox.com/s/u1kqu0yxj07e35e/Drones-1-original.mp4?dl=1 -O Drones-1-original.mp4 -qq --show-progress
%cd ..

8. Image Inference Command

!python tools/demo.py image -f exps/ExpConfigs/yolox_voc_m.py -c {MODEL_PATH} --path ./inference_media/ --conf 0.25 --nms 0.45 --tsize 640 --save_result --device gpu

9. Video Inference Command

!python tools/demo.py video -f exps/ExpConfigs/yolox_voc_m.py -c {MODEL_PATH} --path ./inference_media/Drones-1-original.mp4 --conf 0.25 --nms 0.45 --tsize 640 --save_result --device gpu

10. YOLOX Medium Inference Results

We deliberately chose challenging images from the internet to try to fool the model. With only 25 epochs of training, we cannot expect very good results. Given the number of training samples (750 images), we can say it is doing a pretty good job.

Fig: Drone swarm detected by YOLOX-m

In the following image, the model is able to distinguish between the eagle and the drone.

Fig: Eagle capturing drone detected by YOLOX-m

YOLOX Inference Tests and Observations

We trained the nano, tiny, small, medium, and large models on the same dataset. After training for 300 epochs, the mAP improved significantly. The best accuracy is achieved by the YOLOX large model, at 86%. YOLOX tiny, small, and medium reach comparable mAPs of 76%, 80%, and 84%, respectively.

Note that the graphs have been smoothed for easier comparison.


Fig: Training Log 300 epochs

YOLOX FPS Comparison

Fig: YOLOX FPS comparison on a GTX 1060 6 GB notebook GPU
Fig: YOLOX FPS comparison on CPU

To test the models, we collected real-world data from various sources. First, let’s check how they performed on the validation set. The following images were picked from the validation dataset. You can click on the images to get an enlarged view.

Fig: Validation images
Fig: Inference results from the YOLOX nano model on the validation set
Fig: Inference results from the YOLOX tiny model on the validation set
Fig: Inference results from the YOLOX small model on the validation set
Fig: Inference results from the YOLOX medium model on the validation set
Fig: Inference results from the YOLOX large model on the validation set

Testing the Models on Real-World Data

Now let’s test the trained models on some real-world data. The following images have been selected for testing. Let’s number them 1-4, clockwise, for reference.

Fig: Original Images for Testing

The nano model detects the drones in images 1-3. However, the birds in image 2 are also detected as drones. Taking a closer look, we can observe that the birds are detected as drones with confidence below 50%, so it is possible to filter them out by setting the confidence threshold to 50%. The same cannot be said for image 4.

Fig: Inference result by YOLOX nano on real-world images

The results improve a bit with the tiny model. We can see that the false positive detections in image 2 have reduced. However, the rest of the results haven’t improved much.

Fig: Inference result by YOLOX tiny on real-world images

Switching to the small model brings a drastic improvement, as we can see below. Almost all drones are detected in image 1. In image 2, there are no false positive detections, and a correct detection is observed in image 4.

Fig: Inference result by YOLOX small on real-world images

Contrary to the incremental improvements of the previous models, YOLOX medium shows comparatively poor results. Could it be due to suboptimal hyperparameters? We will investigate further and update our findings.

Fig: Inference result by YOLOX medium on real-world images

The following results are from the YOLOX large model. As expected, it is more accurate and generates superior results.

Fig: Inference result by YOLOX large on real-world images

Conclusion

So that’s all about the YOLOX paper explanation and training YOLOX on a custom dataset. In summary, we discussed the following.

  1. YOLOX is anchor free.
  2. It introduced a decoupled head.
  3. Label assignment strategy in YOLOX is simOTA.
  4. simOTA is simplified OTA (Optimal Transport Assignment) with Dynamic Top K estimation.
  5. YOLOX has implemented mixup and mosaic augmentation.
  6. How to set up the YOLOX training pipeline.
  7. How to prepare datasets for training a YOLOX model.
  8. YOLOX has experiment scripts for configuring model parameters.

I hope you enjoyed reading the article and understood what makes YOLOX fast and accurate. Have any doubts or suggestions? Feel free to comment below.

References

  1. CornerNet
  2. CenterNet
  3. FCOS
  4. Rethinking Classification and Localization for Object Detection
  5. Spatial misalignment of classification and localization heads
  6. Misalignment between classification confidence and localization accuracy
  7. OTA: Optimal Transport Assignment for Object Detection
  8. YOLOX GitHub


