Deep Learning models that run inference on video streams in computer vision applications are mostly used for object detection, image segmentation, and image classification. In many cases, we fail to get a high FPS while carrying out these tasks. But what about real-time applications like traffic monitoring that call for faster inference and a higher FPS? Floating-point 32 (FP32) precision models may not deliver it. In our previous Introduction to Intel OpenVINO Toolkit post, converting the Darknet Tiny YOLOv4 weights to an OpenVINO-optimized FP32 model did give you a boost in FPS. But that is not enough, so we need to explore the Post Training Quantization tools in OpenVINO.
This post is the second in the OpenVINO series, which consists of the following posts:
- Introduction to OpenVINO Toolkit
- Post Training Quantization with OpenVINO Toolkit
- Running OpenVINO Models on Intel Integrated GPU
- Introduction to OpenVINO Deep Learning Workbench
Let’s discover all these tools that optimize trained models for faster inference, and also get an intuition of the whole Post Training Quantization process.
- An Overview of Deep-Learning Model Quantization and Different Precision Formats
- Post-Training Optimization Using OpenVINO Toolkit
- Different Quantization Methods
- Applying DefaultQuantization to the Tiny YOLOv4 FP32 IR Model, Using POT
- Accuracy Checker
- Summary
After going through this post, you will not only have a better understanding of the different quantization methods in OpenVINO, but also learn to use INT8 models over FP32 models for better inference.
Note: Further on, every mention of the FP32 model will mean the OpenVINO optimized FP32 model.
An Overview of Deep Learning Model Quantization and Different Precision Formats
Almost all Deep Learning models are trained using very costly GPU clusters, servers, or cloud services. And such training often takes place with floating-point 32-bit arithmetic to cover a wider range for the model’s weights. But where do these models get deployed? Mostly on the edge, using hardware with low computational capacity compared to the systems they were trained on. Moreover, the FP32 precision format is not meant for models deployed on the edge, so you can’t expect the best performance there. No wonder Deep Learning models are often quantized to 8-bit precision, with their trained weights converted to 8-bit integers instead of the full 32-bit precision.
Check out the ranges for different precision formats in the following image:
In simple words, quantization is the process of converting a Deep Learning model’s weights to a lower precision so that they need less computation. This inherently leads to a jump in the model’s performance in terms of processing speed and throughput, which means a higher FPS when dealing with video streams.
But logically speaking, aren’t we limiting the model’s capability by applying quantization? With FP32 precision, you can surely accommodate a much wider range of weights compared to FP16 or INT8.
From the above figure, it is pretty clear that lowering the precision of the model through quantization can lead to a loss in prediction capability. After quantization, we therefore need to reassess the model’s predictions to ensure we have not lost a significant amount of accuracy. At times, the tradeoff between accuracy and FPS is quite visible, because the wide dynamic range that FP32 offers is no longer available once the model is quantized.
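To build some intuition for where the precision goes, here is a minimal NumPy sketch of symmetric 8-bit quantization of a small weight tensor. The scale computation and the [-127, 127] integer range are illustrative assumptions, not the exact scheme POT uses internally.

import numpy as np

# A toy FP32 weight tensor standing in for one layer's weights.
weights_fp32 = np.array([3.3733, 4.0697, 0.0614, -4.7387, 4.8454], dtype=np.float32)

# Symmetric quantization: map [-max_abs, max_abs] onto the integer range [-127, 127].
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)

# Dequantize to see the values the network effectively works with afterwards.
weights_dequant = weights_int8.astype(np.float32) * scale

print("INT8 weights:    ", weights_int8)
print("Round-trip error:", np.abs(weights_fp32 - weights_dequant))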
Don’t worry. The OpenVINO Toolkit for model quantization also has some calibration tools that can help you quantize models while keeping the model’s prediction capability intact to a large extent. We will cover all these tools in this post, and also show you the gain in performance (in terms of throughput), when quantizing a Deep-Learning model to INT8 precision. This is where the Post-Training Optimization Tool (POT) plays a major role.
Post-Training Optimization Using OpenVINO Toolkit
For Post-Training Quantization of trained Deep-Learning models, we can use the Post Training Optimization Tool (POT) in the Intel OpenVINO Toolkit. This process is specifically called post-training optimization. You need not re-train or even fine-tune the model for such optimization. Furthermore, no training dataset is necessary to apply Post Training Quantization, using POT.
But yes, some things definitely are important for the POT to work on the trained model.
- You need a trained, full-precision Deep Learning model, either FP32 or FP16. And you’ll have to convert this model first to OpenVINO’s IR format.
- Also, a sample subset of data for calibration, preferably the validation split of the dataset on which the original full-precision model was trained. POT uses it to calculate the activation statistics of the neural network. But why calculate these? Well, you need them for activation channel alignment and bias correction while quantizing the model.
The main advantage of using the OpenVINO Post Training Optimization Tool is that it automates the process of model quantization. But there’s more. It also provides two distinct but useful methods:
- DefaultQuantization
- AccuracyAwareQuantization
Before we get into the details of these two methods, let’s understand how exactly POT works. The following image depicts the workflow of POT when quantizing a model through Post Training Optimization. Have a look.
Setting Up POT
By now, you know that POT is a great tool for model quantization. But before you can use it, you need to set up the:
- Model Optimizer
- Accuracy Checker – You need it to carry out AccuracyAwareQuantization, using POT. It evaluates the accuracy of the quantized model against a subset of data.
We have already covered the setting up of Model Optimizer in the first post of the series.
To set up the Accuracy Checker:
- Go to your OpenVINO installation directory and cd into deployment_tools/open_model_zoo/tools/accuracy_checker.
If you have installed OpenVINO with root access, then the path should be
/opt/intel/openvino_2021/deployment_tools/open_model_zoo/tools/accuracy_checker
- Next, run the setup script.
python3 setup.py install
After the setup completes, you should be able to access the Accuracy Checker from the terminal, using the accuracy_check command. To verify, just type the following in the terminal:
accuracy_check -h
If you have completed everything successfully, then you should see a bunch of command-line flags on the terminal just like this:
accuracy_check -h
usage: accuracy_check [-h] [-d DEFINITIONS] -c CONFIG [-m MODELS [MODELS ...]]
[-s SOURCE] [-a ANNOTATIONS]
[--model_attributes MODEL_ATTRIBUTES]
[--input_precision INPUT_PRECISION [INPUT_PRECISION ...]]
[-tf TARGET_FRAMEWORK]
[-td TARGET_DEVICES [TARGET_DEVICES ...]]
[-tt TARGET_TAGS [TARGET_TAGS ...]] [-ss SUBSAMPLE_SIZE]
...
Now, you are ready to set up POT.
- Go to the /opt/intel/openvino_2021/deployment_tools/tools/post_training_optimization_toolkit directory.
cd /opt/intel/openvino_2021/deployment_tools/tools/post_training_optimization_toolkit
- Run the Python setup script.
python3 setup.py install
Access POT from your terminal now using the pot command. To verify, type the following in your terminal:
pot -h
This completes the setup of POT as well.
Different Quantization Methods
Next, let’s discuss the two quantization methods offered by OpenVINO POT, namely, DefaultQuantization and AccuracyAwareQuantization.
We will apply DefaultQuantization to the Tiny YOLOv4 FP32 model that has already been converted to the IR format.
Before applying INT8 quantization to the Tiny YOLOv4 FP32 model, let us first understand the quantization process offered by POT.
Quantization in POT
Understanding the quantization process of POT, in general, will help grasp the concepts of Default and Accuracy Aware Quantization much more easily.
- You will recall that the trained neural-network model is in full-precision format i.e. FP32. The OpenVINO POT helps convert the floating-point number operations to low-bit numbers, such as Int8. This not only reduces the size of the model but also the computational cost to a great extent.
- During quantization, POT inserts an operation called FakeQuantize into the model graph, based on the target hardware. Basically, the model is optimized for the computation device on which inference will be carried out. The Deep Learning Workbench comes into play here, helping us select the target platform. You can learn more about the OpenVINO Deep Learning Workbench and its applications in post 4 of this series.
For now, look at the above image. See how FakeQuantize operations are inserted before the convolution layers during quantization.
But what do the FakeQuantize layers do?
During runtime, these FakeQuantize layers convert the input to the convolutional layer into INT8. For example, if the next convolutional layer has INT8 weights, then the input to that layer is also converted to INT8. Further down the graph, the precision depends on the next operation: if it requires a full-precision format, the data is converted back to full precision at runtime.
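As a rough numerical illustration of this behavior, the snippet below mimics what a FakeQuantize operation does conceptually: clamp the input to a range, snap it to one of a fixed number of levels (256 for 8-bit), and map it back. The ranges and input values here are made up; in a real quantized IR, POT chooses them from the calibration statistics.

import numpy as np

def fake_quantize(x, in_low, in_high, out_low, out_high, levels=256):
    # Clamp, snap to one of `levels` evenly spaced values, then rescale to the output range.
    x = np.clip(x, in_low, in_high)
    q = np.round((x - in_low) / (in_high - in_low) * (levels - 1))
    return q / (levels - 1) * (out_high - out_low) + out_low

activations = np.array([-3.7, -0.2, 0.0, 1.4, 5.9], dtype=np.float32)
print(fake_quantize(activations, in_low=-4.0, in_high=4.0, out_low=-4.0, out_high=4.0))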
To know more about FakeQuantize, you may look at the FakeQuantize guide from OpenVINO.
Here, we will be comparing the layers of the neural network before and after quantization so that you can actually see how the FakeQuantize layers are inserted into the model structure. But before that let’s explore the methods of quantization in OpenVINO POT.
DefaultQuantization
This is probably the fastest way to quantize a model, using OpenVINO POT. The DefaultQuantization algorithm provides fast 8-bit quantization methods, and the resulting models are fairly accurate as well.
Check out the workflow of the DefaultQuantization algorithm of POT in the above diagram.
- We begin by providing the full-precision model to the algorithm, using POT.
- Inside the algorithm, we first apply channel alignment to the trained model. Also called Activation Channel Alignment, this aligns the activation ranges of the convolutional layers to reduce quantization error. Typically, in this process:
- First, we calculate the mean of the activation values.
- Then align them, by clipping the activation values within a certain range.
- Next, the MinMax Quantization method inserts the FakeQuantize layers into the model graph, as discussed above. Even this step is dependent on the target hardware chosen for the quantized model.
- The quantized model then passes through Bias Correction, which makes the model output unbiased. Quantizing a Deep Learning model tends to shift the network’s statistics from the learned distribution. Bias correction helps overcome this, by adding a constant to the bias term of each channel, in every layer of the neural network.
Finally, you get the quantized model.
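To make the MinMax step a little more concrete, here is a simplified sketch of how activation statistics gathered from the calibration subset could translate into a quantization range for one layer. Per-tensor min/max, as shown here, is a simplification; POT's actual statistics collection is more involved.

import numpy as np

def collect_minmax(activation_batches):
    # Track the running min/max of a layer's activations over the calibration subset.
    lo, hi = np.inf, -np.inf
    for batch in activation_batches:
        lo = min(lo, float(batch.min()))
        hi = max(hi, float(batch.max()))
    return lo, hi

# Stand-in activation batches from three calibration images.
batches = [np.random.randn(8, 64) * s for s in (1.0, 1.2, 0.9)]
act_low, act_high = collect_minmax(batches)

# These statistics feed the range of the FakeQuantize node inserted before the layer.
print(f"FakeQuantize range for this layer: [{act_low:.3f}, {act_high:.3f}]")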
The DefaultQuantization algorithm, however, has its drawbacks. Although you get the quantized model fastest, you cannot control its accuracy. Most of the time this might not be an issue, but there will be situations where such control becomes crucial, especially for models like the Tiny-YOLO family, which have already traded accuracy for speed even in their full-precision form.
Worry not, we will see how to tackle this issue with AccuracyAwareQuantization.
You can explore DefaultQuantization in much more detail in the OpenVINO DefaultQuantization guide.
AccuracyAwareQuantization
What makes the AccuracyAwareQuantization algorithm special is that it does not let the resulting quantized model drop below a predefined accuracy range. Opt for this method whenever inference accuracy of the INT8 model is as important as its inference speed.
Simply put, this algorithm helps you get accurate 8-bit quantized models, which though a bit slower are definitely more accurate than models quantized with the DefaultQuantization algorithm.
The above diagram shows the working of the AccuracyAwareQuantization algorithm. Let us go over the steps in brief; a simple sketch of the loop follows the list.
- We input the full-precision model, which passes through the DefaultQuantization algorithm to output a fast, 8-bit quantized model.
- This quantized model then infers on a sample-validation set, which provides an accuracy score.
- If the accuracy level is satisfactory and matches the required threshold, the resulting model is given as output.
- If the accuracy score does not match the criteria, it may require a number of extra steps before reaching the output stage. Okay, so when the accuracy score is low:
- If it happens to be the first iteration, a layer-wise ranking is done to check which layers affect the accuracy the most.
- The quantized layer contributing most to the accuracy drop is completely reverted back to the original full-precision format.
- The model is run against the validation set to obtain a fresh accuracy score and compared with the threshold.
- If the accuracy still does not meet the criteria, you keep iterating. And in every subsequent iteration, convert the next most problematic layer back to full precision. The layer-wise ranking done in the first iteration helps you determine which layer to target next.
- Once the accuracy score reaches the required minimum, the resulting model is provided as output.
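The control flow of this loop can be summarized in a short Python-style sketch. All of the helper functions here (default_quantize, evaluate, rank_layers_by_accuracy_impact, revert_layer_to_fp32) are hypothetical placeholders, not POT APIs; the sketch only mirrors the steps listed above.

def accuracy_aware_quantization(fp32_model, val_data, max_accuracy_drop):
    int8_model = default_quantize(fp32_model)       # fast 8-bit model from DefaultQuantization
    baseline = evaluate(fp32_model, val_data)       # reference accuracy of the FP32 model
    ranking = None

    while evaluate(int8_model, val_data) < baseline - max_accuracy_drop:
        if ranking is None:
            # First failing iteration: rank layers by how much each one hurts accuracy.
            ranking = rank_layers_by_accuracy_impact(int8_model, val_data)
        if not ranking:
            break                                   # nothing left to revert
        worst_layer = ranking.pop(0)
        int8_model = revert_layer_to_fp32(int8_model, worst_layer)

    return int8_model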
Configuration Files Needed for Default Quantization
To apply the DefaultQuantization algorithm to any full-precision Intermediate Representation model, we need a configuration specification file. This is a .json file that, among other things, specifies the type of quantization algorithm.
"compression": {
"algorithms": [
{
"name": "DefaultQuantization", // the name of optimization algorithm
"params": {
...
}
}
]
}
We define the algorithm under the compression key section, as seen above. The name key holds the algorithm name as the value. You will get to know the entire file structure when applying POT to our own Tiny YOLOv4 model.
Also, we will be looking at all the parameters we use in our JSON file. However, if you wish to know all the mandatory and optional parameters for the JSON file, check out the OpenVINO DefaultQuantization guide.
Applying DefaultQuantization to Tiny YOLOv4, Using POT
Now that we’re done with theory and even the configuration steps, let’s waste no time and use the POT for quantization.
Here, we will apply the DefaultQuantization algorithm to a Tiny YOLOv4 model, using POT. We already have a Tiny YOLOv4 FP32 model in the OpenVINO IR format (.bin and .xml files). These IR files will help us generate the INT8 version of the model.
Note: We have already explored how to convert the Tiny YOLOv4 model from the original Darknet weights to the OpenVINO IR format in the previous post of the series. You may go through that post first to know more about the process and obtain the files as well.
After the conversion, we will also carry out inference on both the models and analyze the performance in each case.
Step-by-Step Approach
The steps are pretty straightforward, still, we will go through each step in detail so that you clearly understand each part of the process.
Step 1: Download the MS COCO Validation 2017 Dataset
For quantization, we need a dataset on which the model will do inference to get the model-activation statistics. It is always better to get the validation set of the dataset on which the original model was trained. As the Tiny YOLOv4 model was trained on the MS COCO 2017 dataset, let’s download its validation set.
You can download the validation 2017 dataset, along with the annotations, from the official MS COCO challenge website.
Step 2: Prepare the Configuration JSON File
Next, prepare the configuration JSON file; only after defining your DefaultQuantization configuration can you move forward.
The following block contains all the necessary parameters (the entire file content). In our example, the file name is quantization_spec.json.
{
/* Model parameters */
"model": {
"model_name": "yolo-v4-tiny", // Model name
"model": "fp32/frozen_darknet_yolov4_model.xml", // Path to model (.xml format)
"weights": "fp32/frozen_darknet_yolov4_model.bin" // Path to weights (.bin format)
},
/* Parameters of the engine used for model inference */
"engine": {
"type": "simplified",
"data_source": "../data/mscoco/val2017"
},
/* Optimization hyperparameters */
"compression": {
"target_device": "CPU",
"algorithms": [
{
"name": "DefaultQuantization",
"params": {
"preset": "performance",
"stat_subset_size": 300,
"shuffle_data": false
}
}
]
}
}
Let’s examine some of the important parameters; a quick sanity check for the file paths follows the list:
- “model”: The model parameter has three sub-parameters (keys).
- The first one is the model_name, which in our case is yolo-v4-tiny.
- The second sub parameter is the model, which defines the path to the .xml file (the network topology) for Tiny YOLOv4.
- And the third one is the path to model weights file for Tiny YOLOv4 (.bin file).
- “engine”: This parameter has two keys.
- First is the type key, which is provided as simplified mode.
- Then there’s the data_source key, which defines the path to the validation images used for quantization. In our example, it is the path to the folder that contains the MS COCO validation 2017 images.
- “compression”: The compression parameter defines the optimization hyperparameters.
- The target_device defines the target hardware that we discussed earlier. In this case, it is the CPU.
- The algorithms define the algorithm-specific parameters. We are using the DefaultQuantization algorithm here.
- The preset relates to performance, meaning the model will be optimized for maximum throughput.
- The stat_subset_size defines the number of images that will be used from the validation dataset during optimization to calculate the activation statistics.
- The shuffle_data key is false, implying the dataset will not be shuffled.
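Since POT will fail if any of these paths are wrong relative to your working directory, a quick standard-library check like the one below can save a debugging round. The paths simply mirror the quantization_spec.json shown above; adjust them if your layout differs.

from pathlib import Path

# Paths as written in quantization_spec.json.
paths = {
    "model (.xml)":   Path("fp32/frozen_darknet_yolov4_model.xml"),
    "weights (.bin)": Path("fp32/frozen_darknet_yolov4_model.bin"),
    "data_source":    Path("../data/mscoco/val2017"),
}

for name, p in paths.items():
    status = "OK" if p.exists() else "MISSING"
    print(f"{name:15s} {p} -> {status}")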
Step 3: Execute the POT
At this point, we can apply the DefaultQuantization algorithm to our FP32 model, using the POT (pot command). But first, ensure that you have provided the paths to the model files and the dataset correctly in the configuration file. Also, check that the configuration file is in your current working directory.
Next, type the following command in your terminal to use POT and apply DefaultQuantization.
pot -c quantization_spec.json --output-dir yolov4_int8 -d
The following flags are used in the above command:
- -c: It accepts the path to the configuration JSON file.
- --output-dir: It is the output directory, where you want to store the INT8 optimized model.
- -d: This specifies that the files will be saved without any additional subfolders, having only the algorithm name.
After the process completes successfully, you will find the yolov4_int8 directory, with the optimized subdirectory inside it. This will contain the INT8 optimized Tiny YOLOv4 .bin and .xml files.
Checking the Weight Values of FP32 and INT8 Models
Now that we have successfully applied DefaultQuantization to the FP32 model and obtained the INT8-quantized model, let’s print the weight values of both the models to check if the FP32 weight ranges have converted to the INT8 range.
Here’s the output from one of the layers of the Tiny YOLOv4 FP32 precision model:
(3.3732621669769287, 4.0697245597839355, 0.06139416620135307, 4.738661766052246, 4.845362663269043, 3.4683914184570312, 6.189712047576904, 3.6318254470825195, 5.175132751464844, 4.729865074157715, 1.7794770002365112, 3.1909656524658203, 4.396872043609619, 4.6004252433776855, …, 0.07775003463029861, 4.880298137664795, 2.6044061183929443)
See how all the weights are in floating-point format. These floating-point weights can take values anywhere in the FP32 range, roughly from -3.4e38 to 3.4e38.
Next, take a look at the weights from the same layer of the INT8 quantized Tiny YOLOv4 model.
(-3, 1, -1, -6, 1, 0, -1, -9, -1, 14, -25, -18, -16, -90, -77, -7, -70, -74, 1, 5, 16, -12, -13, 14, -13, -12, -4, 0, -9, -22, 6, -12, -25, 12, 10, -3, -4, -32, -14, -9, -23, -11, -7, 9, -1, 18, …, -5, 2, 1, -5, -2, 15, -14, -13, -5, 12, 9, -8, -4, -9, 8, 3, -47, -90, -6, 2, 3, -4)
This time all the weights are integer values. And because the INT8 weights can range between -127 and 127, we even see values as low as -90.
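If you want to reproduce this kind of dump yourself, one way is to read the weights straight out of the IR files: in the IR models generated here, each Const layer in the .xml carries an offset and size into the .bin file. The sketch below follows that idea; treat it as an illustration of the file format rather than an official API, and adjust the paths and dtypes for the model and layer you want to inspect.

import numpy as np
import xml.etree.ElementTree as ET

xml_path = "yolov4_int8/optimized/yolo-v4-tiny.xml"
bin_path = "yolov4_int8/optimized/yolo-v4-tiny.bin"

root = ET.parse(xml_path).getroot()
blob = np.fromfile(bin_path, dtype=np.uint8)

# Walk the Const layers and print the first few values of each weight blob.
for layer in root.iter("layer"):
    if layer.get("type") != "Const":
        continue
    data = layer.find("data")
    offset, size = int(data.get("offset")), int(data.get("size"))
    elem_type = data.get("element_type")                 # e.g. "f32" or "i8"
    dtype = {"f32": np.float32, "i8": np.int8}.get(elem_type)
    if dtype is None:
        continue                                         # skip types not handled in this sketch
    values = blob[offset:offset + size].view(dtype)
    print(layer.get("name"), elem_type, values[:8])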
Visualizing the Difference Between FP32 and INT8 Models
We discussed above how, after a model has been quantized to the INT8 format, FakeQuantize layers are inserted into the model graph. Now, let's see it. The following diagram clearly brings out the difference between the FP32 and the quantized INT8 models. It shows only small parts of the FP32 and INT8 graphs, visualized using Netron.
Look closely at the above figure:
- The image on the left depicts the original model, and you can also see the convolutional layer in the model graph.
- The image on the right shows the neural network after quantization. Just observe how the graph now has a new FakeQuantize layer before the convolutional layer. During runtime, this FakeQuantize layer will quantize the inputs to INT8 format before they go into the convolutional layer. But as discussed earlier, the resulting output may get converted back to floating-point format again, if the next operation requires it.
Running Inference, Using Tiny YOLOv4 FP32 and Int8 and Comparing Performance
Till now, we have discussed different quantization algorithms, how to set up POT for applying quantization to full-precision models, and also applying the DefaultQuantization algorithm to the Tiny YOLOv4 model.
Okay, so now we are finally ready to carry out inference with both the FP32 and the INT8 Tiny YOLOv4 models and do a comparative study. We will not only analyze how much performance was gained by quantizing the model but also weigh that gain against the decline in inference quality.
We will use a traffic crossing video for the inference experiments. This will let us know how the model performs when there are a lot of objects.
Let’s go over these inference experiments step-by-step, starting with the FP32 Tiny YOLOv4 model and then moving on to the INT8 model.
Note: All the experiments are carried out on a system with an i7 8th Gen CPU @2.3 GHz clock speed and 16 GB RAM.
Step 1: Go to the OpenVINO Object Detection Demos Directory
OpenVINO already provides a script for object-detection experiments. Go to the /opt/intel/openvino_2021/deployment_tools/open_model_zoo/demos/object_detection_demo/python directory.
cd /opt/intel/openvino_2021/deployment_tools/open_model_zoo/demos/object_detection_demo/python
You will find the object_detection_demo.py script in this directory.
Step 2: Ensure that the FP32 Tiny YOLOv4 Models and the Video are Present
Simply copy the Tiny YOLOv4 model files into this directory and keep them in the tiny_yolov4_fp32 folder. Also, copy the above video into the present working directory and name it video_1.mp4.
Step 3: Execute the Command to Run the Inference
Now, open up your terminal in the present working directory and execute the following command:
python object_detection_demo.py --model tiny_yolov4_fp32/frozen_darknet_tiny_yolov4_model.xml -at yolo -i video_1.mp4 -t 0.5 -d CPU -o fp32_output.mp4
Check out the flags that we have used in the above command:
- --model: This is the path to the Tiny YOLOv4 .xml file, i.e., the path to the network topology. Ensure that the .bin file is also present in the same path as the .xml file.
- -at: This flag indicates the model-architecture type. As we are using the YOLOv4 model, in the above command it is yolo.
- -i: The -i flag provides the path to the input image/video file.
- -t: Defines the probability threshold for filtering the detections. Detections with a probability score less than 0.5 will be ignored.
- -d: The -d flag sets the computation device. In this case, we provide the value as CPU, meaning the CPU will be used as the computation device.
- -o: Finally, we use the -o flag to provide the path where the output video, with the detections, will be saved.
Press the q key to exit the program at any time.
If the execution is successful, you will see the following outputs in your terminal:
[ INFO ] Initializing Inference Engine...
[ INFO ] Loading network...
[ INFO ] Reading network from IR...
[ INFO ] Loading network to CPU plugin...
[ INFO ] Starting inference...
To close the application, press 'CTRL+C' here or switch to the output window and press ESC key
Latency: 35.4 ms
FPS: 24.1
As you can see, the average FPS is 24.1, with a latency of 35.4 ms. Here, latency is the end-to-end time spent on a single frame: reading and pre-processing it, running inference, and post-processing the result before the user sees the output. In other words, it combines the pre-processing and post-processing time with the inference time.
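If you ever need to compute these numbers in your own scripts rather than relying on the demo's printout, a minimal timing loop looks like the following. Here, run_pipeline_on_frame is a hypothetical placeholder for your own pre-processing, inference, and post-processing code.

import time
import cv2

def run_pipeline_on_frame(frame):
    # Hypothetical placeholder: preprocess the frame, run inference, post-process detections.
    pass

cap = cv2.VideoCapture("video_1.mp4")
latencies = []

while True:
    ok, frame = cap.read()
    if not ok:
        break
    start = time.perf_counter()
    run_pipeline_on_frame(frame)
    latencies.append(time.perf_counter() - start)

cap.release()
if latencies:
    avg_latency = sum(latencies) / len(latencies)
    print(f"Latency: {avg_latency * 1000:.1f} ms, FPS: {1.0 / avg_latency:.1f}")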
Have a look at the video output that is saved to disk.
As you can see, the detections are pretty decent for a Tiny YOLOv4 model. Almost all the people are getting detected. In some frames, even the traffic light is detected. But what about vehicles? It definitely lacks on this count.
So, using the FP32 Tiny YOLOv4 model for inference gave us around 24 FPS on average.
Using the Int8 Quantized Tiny YOLOv4 Model
Now, let us see how much FPS we gain with the INT8 quantized model, and note whether it sacrifices any inference capability.
Just change the path to the model and the name of the output file in which the result will be saved to disk.
python object_detection_demo.py --model int8/optimized/yolo-v4-tiny.xml -at yolo -i video_1.mp4 -t 0.5 -d CPU -o int8_output.mp4
This time, the model is in the int8/optimized directory, and the resulting file will be saved as int8_output.mp4.
Take a look at this sample output from the terminal.
[ INFO ] Initializing Inference Engine...
[ INFO ] Loading network...
[ INFO ] Reading network from IR...
[ INFO ] Loading network to CPU plugin...
[ INFO ] Starting inference...
To close the application, press 'CTRL+C' here or switch to the output window and press ESC key
Latency: 27.2 ms
FPS: 30.4
We got an average FPS of 30.4, a boost of more than 6 FPS, which is a lot. Also, the latency is 27.2 milliseconds, which is about 8 milliseconds less than that of the FP32 model.
Next, comes the video result.
Here, the difference in detections is clearly visible. Note how the detections fluctuate a bit more than with the FP32 model, especially for people who are far away. Also, it is unable to detect the traffic light in any of the frames. Although we gained more FPS, the quality and number of detections did slide. So, keep this in mind whenever quantizing any FP32 model to INT8.
Few Useful Flags While Executing the Inference Code
Though small in number, these flags can really improve the detection speed during inference; an example command follows the list.
- One such flag is -nireq. It specifies the number of inference requests that will be executed simultaneously by the inference code. This is possible only if the computation device supports parallelization. Usually, increasing this value increases the FPS as well.
- The -nthreads flag can also speed up execution. It specifies the number of threads to be used when executing on a CPU. Usually, the number of threads in a CPU is double the number of cores, so if you have 4 cores, you are likely to have 8 threads.
- The -nstreams flag cannot increase the FPS directly but has its own use. It defines the number of streams to be used for inference on CPU, GPU, or MYRIAD devices. It is an optional flag with an automatic selection mode (the default value is determined automatically for a device). When working with a very powerful multi-core CPU, this can prove handy, as different cores can be used for detection from different streaming devices.
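For example, assuming a 4-core, 8-thread CPU, an invocation like the following could be tried with the INT8 model; the flag values are illustrative, so tune them for your own hardware and compare the reported FPS with the earlier runs.

python object_detection_demo.py --model int8/optimized/yolo-v4-tiny.xml -at yolo -i video_1.mp4 -t 0.5 -d CPU -nireq 4 -nthreads 8 -o int8_output_async.mp4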
Look carefully at the following graphs and note how the results (FPS) change, when using different values and flag combinations. All these experiments are conducted on the 8th Gen i7 CPU, with 16 GB of RAM.
In the above graphs, we can see that in all cases the INT8 model is giving higher FPS and lower latency. And the different flags are also helping us achieve this better performance.
Accuracy Checker
OpenVINO provides one more tool to measure the accuracy of models converted into the IR format. Use the Accuracy Checker tool to:
- Evaluate whether your converted models are performing as expected.
- Measure the accuracy of the original trained models (FP32) from which you obtained the IR files. For example, if you used a TensorFlow model, you can measure the accuracy of the original .pb file.
Note: We will cover the installation of Accuracy Checker in the next subsection. But when comparing the accuracy of the Tiny YOLOv4 FP32 with the INT8 model used above, we will show results that have been calculated using the COCO mAP evaluator. The reason being, the Tiny YOLOv4 model is not yet fully-supported by Intel OpenVINO. So, the Accuracy Checker tool does not work properly on it. Although they aren’t Accuracy Checker results, you will still get a very detailed comparison of the models.
Using the Accuracy Checker can be a bit tricky for beginners, as you need to deal with the configuration files before running the tool.
Let’s first install it, then we can explore how to use it.
Installing Accuracy Checker
- Go into the /opt/intel/openvino_2021/deployment_tools/open_model_zoo/tools/accuracy_checker directory.
cd /opt/intel/openvino_2021/deployment_tools/open_model_zoo/tools/accuracy_checker
- Next, install the Accuracy Checker, using the following command:
python3 setup.py install
To check whether the Accuracy Checker is installed properly or not, type the following command in the terminal:
accuracy_check -h
This should give all the possible flags that the program accepts.
Configuration Files
Now, let’s check out the configuration files you need to set up to make the Accuracy Checker work:
- dataset_definitions.yml: This configuration file holds the paths to all the validation datasets and their annotations on your local system. These paths are used for running predictions with the model on which you use the Accuracy Checker.
Here’s a short example of the file:
datasets:
- name: ms_coco_mask_rcnn
annotation_conversion:
converter: mscoco_mask_rcnn
annotation_file: instances_val2017.json
has_background: True
sort_annotations: True
annotation: mscoco_mask_rcnn.pickle
dataset_meta: mscoco_mask_rcnn.json
data_source: val2017
- name: ms_coco_detection_80_class_without_background
data_source: ../data/val2017
annotation_conversion:
converter: mscoco_detection
annotation_file: ../data/instances_val2017.json
has_background: False
sort_annotations: True
use_full_label_map: False
annotation: mscoco_det_80.pickle
dataset_meta: mscoco_det_80.json
name: ...
You will find that this file contains a lot of other dataset entries as well; all of them are supported by OpenVINO for use with the Accuracy Checker. But the dataset we are interested in is ms_coco_detection_80_class_without_background.
- The second file is the model configuration .yml file. It is generally named after the model, for example, yolo-v3-tiny-tf.yml, yolo-v4-tf.yml, and so on.
This file contains many configuration parameters specific to the model, and only one common parameter, which is the path to the IR format model, i.e., the converted .bin and .xml files.
Want to see what the Tiny-YOLOv3 configuration file looks like? Here’s a snippet:
models:
- name: yolo-v3-tiny-tf
launchers:
- framework: tf
model: yolo-v3-tiny-tf.pb
adapter:
type: yolo_v3
anchors: tiny_yolo_v3
num: 6
coords: 4
classes: 80
threshold: 0.001
anchor_masks: [[3, 4, 5], [1, 2, 3]]
raw_output: True
output_format: HWB
cells: [13, 26]
outputs:
- conv2d_9/BiasAdd
- conv2d_12/BiasAdd
inputs:
- name: 'image_input'
type: INPUT
outputs:
- conv2d_9/BiasAdd
- conv2d_12/BiasAdd
...
global_definitions: /opt/intel/deployment_tools/open_model_zoo/tools/accuracy_checker/dataset_definitions.yml
Did you notice the global_definitions parameter at the end of the file? This is the path to the dataset_definitions.yml file, which, as we discussed above, is used for running predictions. It is a mandatory parameter, as the Accuracy Checker uses this path to locate the validation datasets when running inference on the model to compute the final accuracy score.
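Once both configuration files are in place, a typical Accuracy Checker invocation ties them together with the flags we saw in the help output earlier; the paths below are placeholders for your own layout.

accuracy_check -c yolo-v4-tiny.yml -m <path to the IR files> -s <path to the datasets> -td CPU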
Comparing the Accuracy of Tiny YOLOv4 FP32 With INT8 Models
As discussed earlier, we will calculate the accuracy (mAP) of Tiny YOLOv4 FP32 and INT8 models, using the COCO mAP (Mean Average Precision) evaluator (cocoapi).
Comparing the mAP of the two models will reveal the accuracy drop due to INT8 quantization. The decrease in accuracy was evident even when we carried out inference with the INT8 model.
Follow these steps to calculate the mAP of the model.
Prerequisites to Calculate the mAP for Tiny YOLOv4 Models
- First, install the COCO API for mAP evaluation. For that, you need to clone this repository. Go inside the coco/PythonAPI directory, open your terminal and run make.
cd coco/PythonAPI
make
- You will also need the MS COCO 2017 validation set (images and annotations). We have already discussed how to download the data; please visit the official website to download it.
Keep in mind that the MS COCO validation set contains 5000 images. And all the results shown here have been obtained after running the evaluation on the entire dataset.
All the evaluations were done on a machine with an 8th generation i7 8670H CPU and 16 GB of RAM.
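For reference, the numbers below come from the standard COCO evaluation routine. A minimal version of that evaluation, assuming you have already dumped the model's detections to a COCO-format results file (the detections file name here is hypothetical), looks like this:

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Ground-truth annotations and a COCO-format detections file produced by your inference script.
coco_gt = COCO("instances_val2017.json")
coco_dt = coco_gt.loadRes("tiny_yolov4_int8_detections.json")  # hypothetical file name

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints the AP table shown below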
mAP for Tiny YOLOv4 FP32 and INT8 Models
After running the evaluation, using the COCO API on the Tiny YOLOv4 FP32 model, the results are as follows:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.152
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.275
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.151
We have an average precision of 0.152 at 0.50:0.95 IoU (Intersection Over Union). And the average FPS over the entire dataset (5000 images) came out to be 28.3.
Now, let’s see how INT8 quantization affects the Tiny YOLOv4 model. Check out the results of running the evaluation on the Tiny YOLOv4 INT8 model.
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.142
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.257
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.139
This time the average precision decreased to 0.142. But at the same time, the average FPS over the 5000 images leaped from 28.3 in the FP32 model to 36.6 in INT8.
Even in the video, we had seen the INT8 predictions suffer during inference. It looks like you cannot escape this tradeoff between speed and accuracy when quantizing Deep Learning models. Unless, of course, you use AccuracyAwareQuantization. In an upcoming post in the series, we will show you how to use OpenVINO’s Deep Learning Workbench to mitigate most of these issues to a great extent and achieve both speed and accuracy.
Summary
You have covered a lot about Post Training Quantization of Deep Learning models in this post. Besides understanding the different methods of quantization, you also learned to use the INT8 models over FP32 for better inference. Let’s sum up the key learnings:
- Starting with a brief overview of quantization, you saw how the model’s weights might get affected, when quantizing it from FP32 to INT8.
- Next, you learned about the Post Training Optimization Tool provided by Intel OpenVINO: its advantages, its workflow, how to set it up for full-precision models, and the two quantization methods it offers.
- Following that, you learned to carry out DefaultQuantization of the Tiny YOLOv4 model from FP32 to INT8, compared the detection results of the two by running inference on a sample video, and went over the various flags used while executing the inference scripts.
- Finally, you evaluated the Tiny YOLOv4 FP32 and INT8 models on the MS COCO 2017 validation dataset. The average precision and FPS numbers we obtained made the effect of quantizing a Deep Learning model clear.