PaddleOCR: Reading long documents can be very tiring and time-consuming. You must have seen software or applications where you just click a picture and get the key information from a document. This is done by a technique called Optical Character Recognition (OCR), which has been one of the key research areas in AI in recent years.
Optical Character Recognition is the process of recognizing text from an image by understanding and analyzing its underlying patterns. This blog post will focus on implementing and comparing various OCR algorithms provided by PaddleOCR using just a few lines of code.
1. Introduction to OCR
Optical Character Recognition is the technique of recognizing and converting text into a machine-readable format by analyzing and understanding its underlying patterns. OCR can recognize handwritten text, printed text and text “in the wild”. In short, OCR enables computers to read. But how does OCR work? OCR makes use of deep learning and computer vision techniques: neural networks learn the underlying features of text and predict the corresponding output. Modern OCR models predict the output accurately, and do so in a matter of milliseconds.
OCR is one of the first problems addressed in computer vision and deep learning and has seen tremendous development. It is used for research and development, industrial applications and even personal use. Let's have a look at some real-life uses and applications of OCR.
1.1 Uses
Due to OCR’s tremendous performance and the variety of solutions it enables, it is worth looking at some of the areas where OCR can be used.
- Information retrieval: OCR can be applied to documents, receipts, ID cards, etc. to convert the information into machine-readable text. This makes documents searchable and even editable. OCR can also extract important information from documents and store it digitally.
- Automatic License Plate Recognition: ALPR is one of the fields where OCR is used extensively. In ALPR, OCR is used to extract the license plate numbers of vehicles. ALPR is widely used nowadays in offices, law enforcement, malls and more, all of which have OCR as a core component.
2. OCR Architecture
Optical Character Recognition or OCR uses deep learning and AI to recognize and extract text. It can be very beneficial to look inside OCR and see how it works. There are various ways OCR can perform its task. Some popular algorithms are as follows.
- Tesseract engine: Tesseract is one of the first OCR engines developed. It was first developed in the 1980s and is now maintained by Google. Tesseract works by first finding every line and word and then performing word classification, which gives the final OCR prediction. One of the first OCR systems was built on “LeNet 1”, the first convolutional network that could recognize handwritten digits with good speed and accuracy. Researcher Yann LeCun also posted a video on YouTube where we can see that OCR in action back in 1993. It is fascinating how far the AI field has progressed in such a short span of time.
- CRNN: CRNN is one of the most accurate architectures for recognizing text. CRNN combines CNNs and RNNs, trained with CTC loss, and has proven to be both accurate and fast. The architecture consists of CNNs followed by bi-directional RNNs and a transcription layer.
- Attention-OCR: Another well-known architecture is Attention-OCR. It is essentially a CRNN followed by a seq2seq or attention model which translates features to characters. The attention model here acts as a decoder.
Due to its high accuracy and good speed, CRNN is an optimal choice for OCR. Recent libraries like EasyOCR, keras-ocr and PaddleOCR are based on CRNN and provide easy-to-use pretrained models. We will look at CRNN more in-depth in the coming sections.
2.1 CRNN Architecture
CRNN is a combination of Convolutional and Recurrent neural networks, hence the name Convolutional Recurrent Neural Network (CRNN). The network consists of three parts: CNNs, followed by RNNs and then a transcription layer. CRNN uses CTC, or Connectionist Temporal Classification, loss, which is responsible for the alignment of the predicted sequences. Let's have a look at how CRNN makes OCR happen.
Feature Extraction
The first component is the convolutional neural network (CNN), consisting of convolutional and max-pooling layers. These are responsible for extracting features from the input images and producing feature maps as outputs. To feed the output to the next layer, the feature maps are first converted into a sequence of feature vectors. According to the original paper, “Each feature vector of a feature sequence is generated from left to right on the feature maps by column. This means the i-th feature vector is the concatenation of the i-th columns of all the maps.”
Due to the feature extraction, each column of the feature maps corresponds to a rectangular region of the input image, called a receptive field. Each feature vector in the feature sequence is associated with a receptive field and can be seen as an appearance descriptor for that region. Refer to figure-02 for a clearer understanding. The feature sequence is then passed on to the RNN layers.
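To make this map-to-sequence step concrete, here is a minimal NumPy sketch (the shapes are purely illustrative, not the exact ones a trained CRNN produces):
import numpy as np
# Illustrative CNN output: 512 feature maps of height 1 and width 40
# (CRNN pools the height down so that each column is one time step).
feature_maps = np.random.rand(512, 1, 40).astype(np.float32)
C, H, W = feature_maps.shape
# The i-th feature vector is the concatenation of the i-th columns
# of all the maps, taken from left to right.
feature_sequence = [feature_maps[:, :, i].reshape(-1) for i in range(W)]
print(len(feature_sequence), feature_sequence[0].shape)  # 40 (512,)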
Sequence Labeling
This layer is a Recurrent Neural Network (RNN) built on top of the CNN. In CRNN, two bi-directional LSTMs are used to address the vanishing gradient problem and to allow a deeper network. The recurrent layers predict a label distribution for each feature vector, or frame, in the feature sequence received from the CNN layers. Mathematically, the layers predict a label y_t for each frame x_t in the feature sequence x = x_1, ..., x_T.
Transcription
This layer is responsible for translating the per-frame predictions into a final sequence according to the highest probability. These predictions are used to compute the CTC, or Connectionist Temporal Classification, loss, which lets the model learn and decode the output.
2.2 CTC loss
The output received from the RNN layer is a tensor that contains the probability of each label for each receptive field. But how does this translate to the final output? That's where Connectionist Temporal Classification (CTC) loss comes in. CTC is responsible for training the network as well as for inference, that is, decoding the output tensor. CTC works on the following major principles:
- Text encoding: CTC solves the issue of a character spanning more than one time step. It merges repeated predictions into one, using a blank character “-” to mark boundaries so that genuinely repeated letters are not merged away. For example, in fig-04 the ‘S’ in ‘STATE’ spans three time steps, so the network might predict those time steps as ‘SSS’; CTC merges those outputs into a single ‘S’. A possible encoding of the whole word is SSS-TT-A-TT-EEE, which yields the output ‘STATE’.
- Loss Calculation: For a model to learn, loss needs to be calculated and back-propagated into the network. Here the loss is calculated by adding up all the scores of possible alignments at each time step, that sum is the probability of the output sequence. Finally, the loss is calculated by taking a negative logarithm of the probability, which is then used for back-propagation into the network.
- Decoding: At the time of inference, we need a clear and accurate output. For this, CTC computes the most likely sequence from the output tensor by taking the character with the highest probability at each time step, then decodes it by removing blanks “-” and repeated characters, as the sketch after this list shows.
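Here is a minimal Python sketch of the greedy decoding (and the loss formula) described above; the best path and the probability are made up for illustration:
import numpy as np
# Greedy CTC decoding: '-' is the CTC blank character.
# Suppose taking the argmax per time step gave this best path:
best_path = ['S', 'S', 'S', '-', 'T', 'T', '-', 'A', '-', 'T', 'T', '-', 'E', 'E', 'E']
decoded = []
prev = None
for ch in best_path:
    if ch != prev and ch != '-':  # merge repeats, then drop blanks
        decoded.append(ch)
    prev = ch
print(''.join(decoded))  # STATE
# Loss (training): if the summed probability of all valid alignments
# of the ground truth is p, the CTC loss is -log(p).
p = 0.87  # made-up probability
ctc_loss = -np.log(p)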
So, through these steps, we get the final output. Without further delay, let's see this in action.
3. PaddleOCR
PaddleOCR is an OCR framework or toolkit which provides multilingual, practical OCR tools and helps users apply and train different models in a few lines of code. PaddleOCR offers a series of high-quality pretrained models, covering the three stages that make OCR highly accurate and close to commercial products: text detection, text direction classification and text recognition. It offers various models in its toolkit, including the flagship PP-OCR and the latest algorithms such as SRN, NRTR and more.
PaddleOCR also offers different models based on size.
- Lightweight models – Models which take less memory and are faster but compromise on accuracy.
- Server models (heavyweight) – Models which take more memory and are more accurate but compromise on speed.
PaddleOCR supports more than 80 languages (depending on the OCR algorithm used), while the flagship PP-OCR supports Chinese and English. PP-OCR is one of the best OCR tools available and has three versions so far: PP-OCR, PP-OCRv2 and PP-OCRv3. All of these models are built on CRNN, as seen in the previous section, and are ultra-lightweight. Let's apply it to various types of scenarios.
3.1 Implementation
In this section, we will implement PaddleOCR's PP-OCRv3. The model can be applied in just a few lines of code and runs in a matter of milliseconds. First, let's install the required toolkits and dependencies, which give us access to all the files and scripts needed for the OCR implementation.
!pip install paddlepaddle-gpu
!pip install paddleocr
After the installation, OCR needs to be initialized according to our requirements.
# Importing required functions for inference and visualization.
from paddleocr import PaddleOCR, draw_ocr
import os
import cv2
import matplotlib.pyplot as plt
%matplotlib inline
ocr = PaddleOCR(use_angle_cls=True)
In the above code snippet, we have initialized PP-OCRv3, and the required weights are downloaded automatically. By default, the package provides the entire system: detection, angle classification and recognition. It also accepts several arguments to access only the required functionality.
- lang: The language to recognise. For example, en for English, ch for Chinese, french for French, etc. The OCR recognises English and Chinese by default.
- rec_algorithm: The recognition algorithm to be used. The OCR uses CRNN as its default recognition algorithm.
- det_algorithm: The text detection algorithm to be used. The OCR uses a DB text detector as its default detector.
- use_angle_cls: Specifies whether the angle classifier is to be used; takes a bool as the argument.
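For example, a recognizer set up for English with the angle classifier enabled would look like this (a minimal sketch combining the arguments listed above; ocr_en is just an illustrative name):
# English-language OCR with the angle classifier enabled.
ocr_en = PaddleOCR(lang='en', use_angle_cls=True)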
The OCR is now initialized and can be used in just one line of code.
result = ocr.ocr(img_path)
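Each entry of the returned result is a list holding a bounding box and a (text, confidence) pair, which is the same structure the save_ocr() helper below unpacks. A quick way to inspect it:
# Each entry of result is [bounding box, (recognized text, confidence)].
for box, (text, score) in result:
    print(text, round(score, 2))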
This function also takes some arguments.
- img: This is the first parameter in the ocr function. In this, the image array or the image path is passed to perform OCR.
- det: Takes bool as an argument and specifies whether to use a detector or not.
- rec: Takes bool as argument and specifies whether to use a recognizer or not.
- cls: Takes bool as argument and specifies whether to use an angle classifier or not.
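For instance, to run only the recognizer on an already-cropped word image, the detection and classification stages can simply be switched off (a small sketch using the flags listed above):
# Recognition only: skip detection and angle classification.
result = ocr.ocr(img_path, det=False, cls=False)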
By default, all three det, rec and cls are set to True. Before moving further, we will create a function to extract the predictions, plot them and save them. Let's call the function save_ocr().
def save_ocr(img_path, out_path, result, font):
    # Build the output file path from the input image name.
    save_path = os.path.join(out_path, img_path.split('/')[-1] + 'output')
    image = cv2.imread(img_path)
    # Unpack boxes, texts and confidence scores from the OCR result.
    boxes = [line[0] for line in result]
    txts = [line[1][0] for line in result]
    scores = [line[1][1] for line in result]
    # Draw the detections and recognized texts on the image, then save and show it.
    im_show = draw_ocr(image, boxes, txts, scores, font_path=font)
    cv2.imwrite(save_path, im_show)
    img = cv2.cvtColor(im_show, cv2.COLOR_BGR2RGB)
    plt.imshow(img)
3.2 Inference
As discussed in an earlier section, there are various scenarios where OCR can be applied. Let's look at them one by one. Before that, we will import some important libraries and files. Download the font file from here.
import os
import cv2
import matplotlib.pyplot as plt
%matplotlib inline
# Specifying output path and font path.
out_path = './output_images'
font = './simfang.ttf'
Receipts
Receipts are among the documents where OCR is used extensively and has many commercial uses. It can extract important information like the bill amount, taxes and buyer information. For example, take this image and apply OCR to it.
img_path = './input_images/05-receipt1.jpg'
result = ocr.ocr(img_path)
Here is the output,
save_ocr(img_path, out_path, result, font)
Let’s test our OCR on another image of the receipt.
img_path = './input_images/07-receipt2.png'
result = ocr.ocr(img_path)
Visualizing the output,
save_ocr(img_path, out_path, result, font)
As can be seen, OCR has performed very well on the receipts. It captured almost all the details like the amount, the orders and the order number, in the same order as on the receipt. So, we can say that PP-OCR performs pretty well on receipts and similar documents.
ID-cards
ID cards are mostly used for security and identification purposes. When OCR is applied to ID cards, it can extract information like name, code and branch, which can be used to grant access at electronic gates or to store the information in a database. We will try OCR on the following image.
img_path = './input_images/09-id-card.jpg'
result = ocr.ocr(img_path)
Let’s look at the output
save_ocr(img_path, out_path, result, font)
WOW!! That was fast and very accurate. It detected all the key fields like the boat number, date and ID number, even where the text was at an angle.
Documents
Document recognition has been one of the prominent research areas for OCR. We use documents almost every day in our lives. When OCR is applied to a document, it can retrieve important information and form fields, analyze the layout, store the content digitally and even read old manuscripts. All these tasks can be done with ease. Let's have a look at some images and see how PP-OCR performs.
img_path = './input_images/11-document-1.jpg'
result = ocr.ocr(img_path)
Displaying the output.
save_ocr(img_path, out_path, result, font)
The output is quite accurate in detection as well as recognition. PP-OCR detected all the text fields in the document and the recognizer did an amazing job on them, handling special characters and spaces accurately too. We will run OCR on another similar document.
img_path = './input_images/13-document-2.png'
result = ocr.ocr(img_path)
Let's display the output using save_ocr().
save_ocr(img_path, out_path, result, font)
The detector has missed some of the text in this document image, but whatever it detected is correctly predicted by the recognizer. Looking at the document images, we can say that the detector and recognizer do not handle small text well: when small text is encountered, it is missed or predicted incorrectly. We can also test our pipeline on a handwritten text document. For example, we will try this image.
img_path = './input_images/15-document-3.jpg'
result = ocr.ocr(img_path)
save_ocr(img_path, out_path, result, font)
As you can see, the OCR is not at all accurate here. The detector was pretty good, but the recognizer wasn't. The main reason is likely the data the OCR was trained on: PP-OCR is trained on the MJSynth and SynthText datasets, which consist of synthetic, computer-generated text images rather than real-life ones. The lack of handwritten text in the training data is probably a big factor in the poor performance on these types of images.
License plate
License plates are one of the most popular and important use cases of OCR, and it performs pretty well here. ALPR is used nowadays in various commercial as well as research areas. The recognized license plate can be used to check for violations, for vehicle registration, at toll booths and much more.
img_path = './input_images/17-license-plate.jpg'
result = ocr.ocr(img_path)
Here’s the output
save_ocr(img_path, out_path, result, font)
That was amazing! The bounding box predicted is very tight and even the text recognised is on point. ALPR can be applied to the video feed too along with some more tweaking to improve the accuracy. To know more visit our ALPR blog post.
Road signs
One of the most important scenarios where OCR can be applied is road signs. With the growth of self-driving cars, this application has gained huge importance, for example for reading speed limits, stop signs, etc. We will try it on the following image first.
img_path = './input_images/23-sign-board-1.jpg'
result = ocr.ocr(img_path)
save_ocr(img_path, out_path, result, font)
That was pretty accurate. The OCR was able to recognize all the text, even special characters like brackets. Let's try it on another image.
img_path = './input_images/25-sign-board-2.jpg'
result = ocr.ocr(img_path)
save_ocr(img_path, out_path, result, font)
This was pretty accurate too. The detector worked very well, detecting every text field, and the recognizer also performed fabulously in these scenarios. We can surely say that self-driving cars could rely on this OCR.
Trading cards
Trading cards or collectable cards are very popular nowadays, among kids and adults alike, for playing and trading. Some of them carry a very high monetary value, going up to millions of dollars. So, it is worth trying OCR on these cards.
img_path = './input_images/27-trading-card.jpg'
result = ocr.ocr(img_path)
save_ocr(img_path, out_path, result, font)
The OCR has performed alright here, not too good and not too bad. There are a few things to notice: the OCR predicted some text without spaces where the text is small, similar to a case in the document section, and the detector did not detect some text fields which are also very small.
Curved Text
So far we have only seen text laid out in straight lines, but what if the text is curved? How will the OCR and text detector perform? That is what we test in this section.
img_path = './input_images/19-curved-text-1.jpg'
result = ocr.ocr(img_path)
Displaying the output
save_ocr(img_path, out_path, result, font)
We will try it on another image too.
img_path = './input_images/21-curved-text-2.jpg'
result = ocr.ocr(img_path)
save_ocr(img_path, out_path, result, font)
Well, as you can see, the results are very poor. The detector wasn't able to detect the curved text fields. The reason lies in the training data: the data the default detector is trained on contains only straight-line text. Moreover, the network would have to predict curved bounding boxes, which is not possible with the current architecture. Because of this, when curved text is encountered, the detector cannot detect it. PaddleOCR offers a text detector called SAST, specially created and trained for curved text, but an end-to-end recognition pipeline with SAST is not offered by PaddleOCR as of now. For more information about SAST visit here.
From the above experiments, we can conclude that PP-OCR is a very fast and highly accurate OCR system with a capable text detector. But it fails in some cases, such as handwritten, curved and small text, which is left undetected or recognized inaccurately. To solve these problems, the OCR and detector can be fine-tuned on more datasets, which can increase accuracy and improve performance across different scenarios.
4. PaddleOCR models comparison
PaddleOCR offers various models in its toolkit and is very easy to apply, and it is always good practice to compare models on speed as well as accuracy. In this section, we will compare four models provided by PaddleOCR: SRN, PP-OCRv2, PP-OCRv3 and NRTR. The comparison will be performed on the COCO-text dataset, a scene-text dataset based on MSCOCO, using a Tesla K80 GPU on Google Colab. The models will be evaluated with a string-similarity metric called the Levenshtein distance: the number of single-character edits (insertions, deletions or substitutions) required to turn one string into another, so lower is better.
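As a quick illustration, using the python-Levenshtein package that we install later in this section:
from Levenshtein import distance
# Turning 'kitten' into 'sitting' takes 3 single-character edits
# (k->s, e->i, insert g), so the distance is 3.
print(distance('kitten', 'sitting'))  # 3
Before jumping into the comparison, let's have an overview of the dataset.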
4.1 COCO-text Dataset
The COCO-text dataset, which was part of ICDAR 2017, is based on the MSCOCO dataset of complex everyday scenes. It contains 173,589 labelled text regions in 63,686 images and can be used for training and evaluating text detection and text recognition.
As mentioned, the dataset contains 63,686 images, of which 10,000 are assigned to the validation set and 10,000 to the test set. COCO-text offers 3 different challenges and types of data.
- Text Localization
- Cropped Word Recognition
- End-to-End Recognition
For our task, we will download the dataset for Cropped Word Recognition. The ground truth labels are contained in a single text file, where each line holds an image name followed by its label. For example,
img1,label1
img2,label2
..
..
The validation set contains around 10,000 images, but we will extract a random sample of 500 images from it. Download the dataset by registering here (go to the downloads section, register using your email and download the COCO-text dataset using the link under the Cropped words dataset section with the name Cropped word train and validation images and their annotations).
Let's proceed with the comparison: preprocess the data, extract 500 random images and edit the ground truth file accordingly. After downloading the data, unzip the val_words folder and val_words_gt.txt into a folder named COCO-text. We will now extract 500 random images from the validation set into a different folder called COCO_test, created under COCO-text.
%cd ./COCO-text/val_words
# Moving 500 random images from val_words to a new folder COCO_test.
!mkdir -p ../COCO_test
!shuf -n 500 -e * | xargs -i mv {} ../COCO_test
The ground truth file also needs to be modified according to the extracted images, so that labels of unused images are not kept. The whole preprocessed dataset can also be downloaded from here.
# Go through each line of the original GT file, check whether the image
# exists in the extracted test folder and, if so, keep its label.
%cd ../
import os

with open('./val_words_gt.txt') as f:
    for line in f:
        if os.path.isfile(os.path.join('./COCO_test', line.split(',')[0] + '.jpg')):
            with open('./gt-test.txt', 'a') as new_file:
                new_file.writelines(line)
Let's display some of the images from our dataset. For this, we will create a helper function called disp() to display images along with their labels.
# disp() reads images via matplotlib.image, imported here as img.
import matplotlib.image as img

def disp(pth, gt_annot = '', gt = False, out = False, num = 10):
    img_arr = []
    annot_arr = []
    for fimg in sorted(os.listdir(pth)):
        if fimg.endswith('.jpg') or fimg.endswith('.png'):
            # Appending the image array into a list.
            demo = img.imread(pth + fimg)
            img_arr.append(demo)
            # Appending OCR outputs into a list (OCR outputs are stored
            # in a separate text file for every image).
            if out:
                with open(pth + ''.join(fimg.split())[:-8] + '.txt') as f:
                    pred = f.read()
                annot_arr.append(pred.lower())
            # Appending ground truth annotations into a list (ground truths
            # of all images are stored in a single text file).
            if gt:
                with open(gt_annot) as f:
                    for line in f:
                        if line.split(',')[0] == fimg.split('.')[0]:
                            annot_arr.append(line.split(',')[1].lower())
                            break
            if len(img_arr) == num:
                break
    # Displaying the images along with their labels.
    _, axs = plt.subplots(2, 5, figsize=(25, 14))
    axs = axs.flatten()
    for cent, ax, val in zip(img_arr, axs, annot_arr):
        ax.imshow(cent)
        ax.set_title(val, fontsize=25)
    plt.show()
pth = './COCO_test/'
annot = './gt-test.txt'
disp(pth, annot, gt = True, num = 10)
Here are some of the images from the COCO-text dataset.
4.2 PP-OCR
Time to test the flagship PP-OCR on the COCO-text dataset. PP-OCR is a series of high-quality pretrained OCR models which provide an end-to-end text recognition pipeline. We will compare both PP-OCRv3 and PP-OCRv2, the ultra-lightweight models supporting English and Chinese. Without further delay, let's clone the original repository and define the required functions.
# Cloning PaddleOCR repository.
# Current directory should be the root directory.
%cd ../
!git clone https://github.com/PaddlePaddle/PaddleOCR.git
Now, required dependencies and libraries will be installed.
%cd PaddleOCR
!pip install -r requirements.txt
!python -m pip install paddlepaddle-gpu
!pip install python-Levenshtein
We will create a function called rec(), which will be responsible for performing OCR by taking several parameters and outputting OCR results along with a confidence score.
# Importing required libraries.
import cv2
import os
import numpy as np
import sys
import re
import time
import traceback
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as img
# Importing functions and methods for OCR.
from tools.infer.predict_rec import *
import tools.infer.utility as utility
from ppocr.postprocess import build_post_process
from ppocr.utils.logging import get_logger
from ppocr.utils.utility import get_image_file_list, check_and_read_gif

# Logger used by rec() below.
logger = get_logger()
def rec(args, out_path, input, rec_model_dir, rec_image_shape = "3, 32, 320",
        rec_char_type = "ch", rec_algorithm = "CRNN", show = True, save = True):
    # Assigning values to args as the code is not running from the console.
    args.rec_model_dir = rec_model_dir
    args.rec_image_shape = rec_image_shape
    args.rec_char_type = rec_char_type
    args.rec_algorithm = rec_algorithm
    # Initializing some helper variables.
    t1 = 0
    t2 = 0
    tot = []
    os.chdir('./PaddleOCR')
    # Passing algorithm-specific values to the args variables.
    if rec_algorithm == "SRN":
        args.rec_char_dict_path = './ppocr/utils/ic15_dict.txt'
        args.use_space_char = False
    if rec_algorithm == 'NRTR':
        args.rec_char_dict_path = './ppocr/utils/EN_symbol_dict.txt'
        args.rec_image_shape = "1,32,100"
    # Initializing the recognizer.
    image_file_list = get_image_file_list(input)
    text_recognizer = TextRecognizer(args)
    valid_image_file_list = []
    img_list = []
    # Warming up the GPU to run it at its full capacity.
    if args.warmup:
        image = np.random.uniform(0, 255, [32, 320, 3]).astype(np.uint8)
        for i in range(10):
            res = text_recognizer([image])
    # Reading and appending all image arrays to a list.
    for image_file in image_file_list:
        image, flag = check_and_read_gif(image_file)
        if not flag:
            image = cv2.imread(image_file)
        if image is None:
            logger.info("error in loading image:{}".format(image_file))
            continue
        valid_image_file_list.append(image_file)
        img_list.append(image)
    # Applying OCR to the images.
    t1 = time.time()
    try:
        rec_res, _ = text_recognizer(img_list)
    except Exception as E:
        logger.info(traceback.format_exc())
        logger.info(E)
        exit()
    # Calculating the elapsed time, saving per-image predictions and printing the info.
    t2 = time.time()
    fps = str(t2 - t1)
    for ino in range(len(img_list)):
        logger.info("Predicts of {}:{}".format(valid_image_file_list[ino], rec_res[ino]))
        if save:
            cv2.imwrite(os.path.join(out_path, valid_image_file_list[ino].split('/')[-1].split('.')[0] + '_rec' + '.jpg'), img_list[ino])
            with open(os.path.join(out_path, valid_image_file_list[ino].split('/')[-1].split('.')[0] + '.txt'), 'w') as f:
                f.write(str(rec_res[ino]))
    logger.info("Time taken to recognize all images : {}".format(fps))
    print(len(image_file_list))
    logger.info("Average fps : {}".format(1 / (float(fps) / len(image_file_list))))
    # Displaying the output according to the parameters set.
    if show:
        plt.figure(figsize=(25, 14))
        plt.imshow(image)
        plt.show()
We will also need a function for calculating the metric, that is, the Levenshtein distance between the output and the ground truth text. Let's call that function score_calc().
def score_calc(pth, annot):
    # Importing the distance metric.
    from Levenshtein import distance
    score_all = []
    # Looping through the output text files and storing the OCR output in a variable.
    for out_file in os.listdir(pth):
        if out_file.endswith('.txt'):
            with open(os.path.join(pth, out_file), 'rb') as f:
                out = f.read()
            # Cleaning the OCR output text.
            try:
                out = str(out).split(',')[1].split(',')[0].replace("'", '').lower()
            except:
                print('OCR output does not exist')
            # Opening the ground truth file and calculating the length-normalized
            # distance between the ground truth and the OCR output.
            with open(annot) as f:
                for line in f:
                    if line.split(',')[0] == out_file.split('.')[0]:
                        gt = line.split(',')[1].lower()
                        score = distance(str(out), str(gt)) / len(gt)
                        score_all.append(score)
                        break
    # Printing the average score.
    print("final score:", sum(score_all) / len(score_all))
Time to put all the work to use.
PP-OCRv3
PaddleOCR has recently launched a new version of its flagship PP-OCR, version 3. PP-OCRv3 claims to be 11% more accurate than its previous version, PP-OCRv2, for English. It is an ultra-lightweight model with a size of nearly 17 MB. Now we will download and extract the weights as below.
# Downloading and extracting weights.
!wget https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_rec_infer.tar
!tar xf ch_PP-OCRv3_rec_infer.tar
# Running the OCR and storing the output in a folder specified under out_path in the root directory.
%cd ../
out_path = '../pp-ocrv3_output'
rec_model_dir = './ch_PP-OCRv3_rec_infer'
input = '../COCO-text/COCO_test'
sys.argv = ['']
rec(utility.parse_args(), out_path, input, rec_model_dir, show = False)
The speed is very impressive: the pipeline ran at 489 FPS, recognizing all 500 images in roughly a second. The output images and predictions are saved at the path specified under out_path, with the prediction for each output image stored in a corresponding text file. It is always a good idea to display some of the predictions and check them ourselves, so we will use the disp() function defined earlier.
%cd ../
pth = './pp-ocrv3_output/'
disp(pth, out = True, num = 10)
Accuracy is also as important as the speed. Let’s see how the model performed in terms of accuracy.
result_path = './pp-ocrv3_output'
gt = './COCO-text/gt-test.txt'
score_calc(result_path, gt)
From the above code snippet, we get the metric for judging how PP-OCRv3 performs on the COCO-text dataset. PP-OCRv3 performed really well, with an average score of 1.5, and that at very high speed. In the following sections, we will see how the other models perform.
PP-OCRv2
PP-OCRv2 is also a very accurate model, though in theory not as good as the latest version 3. In the following code snippets, we will compute the score for PP-OCRv2 and see how it performs.
# Downloading and extracting weights.
!wget https://paddleocr.bj.bcebos.com/dygraph_v2.1/chinese/ch_PP-OCRv2_rec_infer.tar
!tar xf ch_PP-OCRv2_rec_infer.tar
# Running the OCR and storing the output in a folder specified under out_path in the root directory.
%cd ../
out_path = '../lightweight_paddle_output'
rec_model_dir = './ch_PP-OCRv2_rec_infer'
input = '../COCO-text/COCO_test'
sys.argv = ['']
rec(utility.parse_args(), out_path, input, rec_model_dir, show = False)
WOW!! That was fast. The whole dataset was processed within 1.13s at an average of 440.8 FPS.
%cd ../
pth = './lightweight_paddle_output/'
disp(pth, out = True, num = 10)
The results look really good on the above set of images. To get an exact measure of performance, we will use score_calc() to calculate the metric as explained earlier.
result_path = './lightweight_paddle_output'
gt = './COCO-text/gt-test.txt'
score_calc(result_path, gt)
Finally, we have the metrics. The above code snippet outputs a score of 1.84, which means the average Levenshtein distance over the 500 images is 1.84. PP-OCRv2 is behind PP-OCRv3 in both speed and accuracy. PaddleOCR also provides various other recognition algorithms; let's see if any of them can outperform PP-OCR.
4.3 SRN
SRN is another model supported by PaddleOCR. It stands for Semantic Reasoning Network and aims to overcome the shortcomings of RNN-like structures. SRN is a huge model, with a size of more than 200 MB, and claims to be quite accurate, at the cost of speed. Let's test it on the dataset and see how it performs.
%cd ./PaddleOCR
# Downloading and extracting weights.
!wget https://paddleocr.bj.bcebos.com/dygraph_v2.0/en/rec_r50_vd_srn_train.tar
!tar xf rec_r50_vd_srn_train.tar
# Converting saved model to inference model.
!python tools/export_model.py -c configs/rec/rec_r50_fpn_srn.yml -o Global.pretrained_model=./rec_r50_vd_srn_train/best_accuracy Global.save_inference_dir=./inference/srn
Run the recognizer on the dataset and store the output in a folder called srn_output. SRN is trained on a different image size, so the input images need to be resized to '1, 64, 256'. Therefore, we need to pass some arguments to get the required functionality of SRN.
We need to pass:
- rec_image_shape = '1, 64, 256'
- rec_char_type = 'en'
- rec_algorithm = 'SRN'
# Running the OCR.
%cd ../
out_path = '../srn_output'
image_dir = '../COCO-text/COCO_test'
rec_model_dir = './inference/srn'
sys.argv = ['']
rec(utility.parse_args(), out_path, image_dir, rec_model_dir, rec_image_shape = '1, 64, 256', rec_char_type = 'en',rec_algorithm = 'SRN', show = False)
The output seems pretty accurate, but it ran rather slowly, at around 54 FPS. Let's display some of the images and look at the results.
%cd ../
pth = './srn_output/'
disp(pth, out = True, num = 10)
Woah! The outputs look good, with very high confidence scores. Time to see what the metrics say.
result_path = './srn_output'
gt = './COCO-text/gt-test.txt'
score_calc(result_path, gt)
The score comes out as 1.83, very close to the previously seen PP-OCRv2. So PP-OCRv3 is still the best among the algorithms tested so far.
4.4 NRTR
NRTR is one of the most accurate models supported by PaddleOCR. NRTR stands for No-Recurrence Sequence-to-Sequence Text Recognizer. According to its paper, NRTR follows the encoder-decoder approach, where the encoder uses stacked self-attention to extract image features and the decoder applies stacked self-attention to recognize text based on the encoder output. Time to see it in action.
# Downloading, extracting and converting training model into inference model
%cd ./PaddleOCR
!wget https://paddleocr.bj.bcebos.com/dygraph_v2.0/en/rec_mtb_nrtr_train.tar
!tar xf rec_mtb_nrtr_train.tar
!python tools/export_model.py -c configs/rec/rec_mtb_nrtr.yml -o Global.pretrained_model=./rec_mtb_nrtr_train/best_accuracy Global.save_inference_dir=./inference/nrtr
Similar to SRN, NRTR is trained on a different image size, so we need to pass it as an argument. Here, the image size is 1,32,100.
# Running the OCR.
%cd ../
out_path = '../nrtr_output'
image_dir = '../COCO-text/COCO_test'
rec_model_dir = './inference/nrtr'
sys.argv = ['']
rec(utility.parse_args(), out_path, image_dir, rec_model_dir, rec_char_type = 'en', rec_image_shape = "1,32,100", rec_algorithm = 'NRTR', show = False)
Well, that was slow! The average FPS is around 27. We will now look at some of the results and extract the metrics.
%cd ../
pth = './nrtr_output/'
disp(pth, out = True, num = 10)
result_path = './nrtr_output'
gt = './COCO-text/gt-test.txt'
score_calc(result_path, gt)
The distance metric for NRTR is 1.5, which is very good and comparable to PP-OCRv3, but its speed is the lowest of all. That accuracy comes at the cost of speed, which is hardly worth it when PP-OCRv3 does the job as well at a much higher speed.
4.5 Results
From the above experiments, we can conclude that PP-OCR is a very powerful algorithm in terms of both speed and accuracy. PP-OCRv3 performed best among all the algorithms implemented, leading in speed by a large margin while matching the best accuracy. Despite being lightweight models, PP-OCRv2 and v3 performed comparably to, or even better than, large models like SRN and NRTR. The whole experiment is summarized in the table below.
Table 01: OCR comparison on 500 COCO-text images (Tesla K80)

| Model | Avg. Levenshtein distance | Avg. FPS |
|---|---|---|
| PP-OCRv3 | 1.5 | 489 |
| PP-OCRv2 | 1.84 | 440.8 |
| SRN | 1.83 | ~54 |
| NRTR | 1.5 | ~27 |
5. Conclusion
In this blog post, we saw the power PaddleOCR holds. From its flagship PP-OCR to the latest advanced algorithms, PaddleOCR performed marvelously. There are certain scenarios where PP-OCR did not perform well, like small, curved and handwritten text; these problems can be fixed if appropriate data is used for training. What are the next steps? Try more scenarios on your own, explore the source code, and train PP-OCR on some datasets.
We also tested some of the models PaddleOCR offers. In conclusion, PP-OCRv3 is a very powerful algorithm that gives results comparable to NRTR at a much higher speed. SRN is a heavyweight algorithm but does not perform better than PP-OCR. PaddleOCR offers many other algorithms like SAR, RARE and more. You can try them on your own and comment your findings below.
6. References
- PaddleOCR: https://github.com/PaddlePaddle/PaddleOCR
- CRNN: https://arxiv.org/abs/1507.05717
- COCO-text: https://rrc.cvc.uab.es/?ch=5
- 00-OCR-feature-image: Freepik.com
- 06-Receipt-1: Photo by Alpha is licensed under CC BY 2.0
- 08-Receipt-2: Photo by Chris Messina is licensed under CC BY 2.0
- 10-id-card.jpg: Photo by Jenni Konrad is licensed under CC BY 2.0
- 14-document-2: Guillaume Jaume, Hazim Kemal Ekenel, Jean-Philippe Thiran, “FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents”, ICDAR-OST, 2019.
- 24-sign-board-1: Photo by Paul Keller is licensed under CC BY 2.0
- 28-trading-card: Photo by CLF is licensed under CC BY 2.0