In this article, we will learn about deep learning based OCR and how to recognize text in images using an open-source tool called Tesseract along with OpenCV. The process of extracting text from images is called Optical Character Recognition (OCR), or sometimes simply text recognition.
Tesseract was developed as proprietary software by Hewlett-Packard Labs. In 2005, it was open-sourced by HP in collaboration with the University of Nevada, Las Vegas. Since 2006 it has been actively developed by Google and many open-source contributors.
Tesseract matured with version 3.x, when it started supporting many image formats and gradually added support for many scripts (languages). Tesseract 3.x is based on traditional computer vision algorithms. In the past few years, Deep Learning based methods have surpassed traditional machine learning techniques by a huge margin in terms of accuracy in many areas of Computer Vision, and handwriting recognition is one of the prominent examples. So it was just a matter of time before Tesseract too had a Deep Learning based recognition engine.
In version 4, Tesseract implements a Long Short-Term Memory (LSTM) based recognition engine. LSTM is a kind of Recurrent Neural Network (RNN).
Version 4 of Tesseract also has the legacy OCR engine of Tesseract 3, but the LSTM engine is the default, and we use it exclusively in this post.
The Tesseract library ships with a handy command line tool called tesseract. We can use this tool to perform OCR on images, with the output stored in a text file. If we want to integrate Tesseract into our C++ or Python code, we will use Tesseract’s API. The usage is covered in Section 2, but let us first start with the installation instructions.
1. How to Install Tesseract on Ubuntu and macOS
We will install:
- Tesseract library (libtesseract)
- Command line Tesseract tool (tesseract-ocr)
- Python wrapper for tesseract (pytesseract)
Later in the tutorial, we will discuss how to install language and script files for languages other than English.
1.1. Install Tesseract 4.0 on Ubuntu 18.04
Tesseract 4 is included with Ubuntu 18.04, so we can install it directly using Ubuntu’s package manager.
sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
sudo pip install pytesseract
1.2. Install Tesseract 4.0 on Ubuntu 14.04, 16.04, 17.04, 17.10
Due to certain dependencies, only Tesseract 3 is available from official release channels for Ubuntu versions older than 18.04.
Luckily, the Ubuntu PPA alex-p/tesseract-ocr maintains Tesseract 4 for Ubuntu versions 14.04, 16.04, 17.04, and 17.10. We add this PPA to our Ubuntu machine and install Tesseract. If you have an Ubuntu version other than these, you will have to compile Tesseract from source.
sudo add-apt-repository ppa:alex-p/tesseract-ocr
sudo apt-get update
sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
sudo pip install pytesseract
1.3. Install Tesseract 4.0 on macOS
We will use Homebrew to install Tesseract on macOS. By default, Homebrew installs Tesseract 3, but we can nudge it to install the latest version from the Tesseract git repo using the following command.
# If you have tesseract 3 installed, unlink first by uncommenting the line below
# brew unlink tesseract
brew install tesseract --HEAD
pip install pytesseract
1.4. Checking Tesseract Version
To check that everything went right in the previous steps, try the following on the command line:
tesseract --version
You should see output similar to the following:
leptonica-1.76.0
libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.8
Found AVX2
Found AVX
Found SSE
2. Tesseract Basic Usage
As mentioned earlier, we can use the command line utility or the Tesseract API to integrate it into our C++ and Python applications. For basic usage, we specify the following:
1. Input filename: We use image.jpg in the examples below.
2. OCR language: The language in our basic examples is set to English (eng). On the command line and in pytesseract, it is specified using the -l option.
3. OCR Engine Mode (oem): Tesseract 4 has two OCR engines: 1) the legacy Tesseract engine and 2) the LSTM engine. There are four modes of operation, chosen using the --oem option:
0: Legacy engine only.
1: Neural nets LSTM engine only.
2: Legacy + LSTM engines.
3: Default, based on what is available.
4. Page Segmentation Mode (psm): PSM can be very useful when you have additional information about the structure of the text. We will cover some of these modes in a follow-up tutorial; in this tutorial, we will stick to psm = 3 (i.e. PSM_AUTO). Note: when the PSM is not specified, it defaults to 3 in the command line and Python versions, but to 6 in the C++ API. If you are not getting the same results from the command line version and the C++ API, explicitly set the PSM. A short sketch comparing a few modes follows this list.
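As a quick, minimal sketch (using the image.jpg from our examples, assumed to be in the working directory), the pytesseract snippet below runs the same image through a few page segmentation modes so you can compare the output:
import cv2
import pytesseract

im = cv2.imread('image.jpg')

# Compare a few page segmentation modes on the same image:
# 3 = fully automatic page segmentation (the default used in this post)
# 6 = assume a single uniform block of text
# 7 = treat the image as a single text line
for psm in (3, 6, 7):
    config = '-l eng --oem 1 --psm {}'.format(psm)
    print('--- psm {} ---'.format(psm))
    print(pytesseract.image_to_string(im, config=config))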
2.1. Command Line Usage
The examples below show how to perform OCR using the Tesseract command line tool. The language is chosen to be English, and the OCR engine mode is set to 1 (i.e. LSTM only).
# Output to terminal
tesseract image.jpg stdout -l eng --oem 1 --psm 3
# Output to output.txt
tesseract image.jpg output -l eng --oem 1 --psm 3
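As a quick sanity check before running the examples, you can also ask the tesseract binary which language packs it can see:
# List the languages available to this Tesseract installation
tesseract --list-langs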
2.2. Using pytesseract
In Python, we use the pytesseract module. It is a wrapper around the command line tool, with the command line options specified using the config argument. Basic usage requires us to first read the image using OpenCV and pass it to the image_to_string function of the pytesseract module along with the language (eng).
import cv2
import sys
import pytesseract

if __name__ == '__main__':

    if len(sys.argv) < 2:
        print('Usage: python ocr_simple.py image.jpg')
        sys.exit(1)

    # Read image path from command line
    imPath = sys.argv[1]

    # Uncomment the line below to provide the path to tesseract manually
    # pytesseract.pytesseract.tesseract_cmd = '/usr/bin/tesseract'

    # Define config parameters.
    # '-l eng'  for using the English language
    # '--oem 1' sets the OCR Engine Mode to LSTM only.
    config = ('-l eng --oem 1 --psm 3')

    # Read image from disk
    im = cv2.imread(imPath, cv2.IMREAD_COLOR)

    # Run tesseract OCR on image
    text = pytesseract.image_to_string(im, config=config)

    # Print recognized text
    print(text)
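Assuming the script above is saved as ocr_simple.py, you can run it from the terminal by passing the path of an image:
python ocr_simple.py image.jpg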
2.3. Using the C++ API
In the C++ version, we first need to include tesseract/baseapi.h and leptonica/allheaders.h. We then create a pointer to an instance of the TessBaseAPI class. We initialize the language to English (eng) and the OCR engine to tesseract::OEM_LSTM_ONLY (this is equivalent to the command line option --oem 1). Finally, we use OpenCV to read in the image and pass it to the OCR engine using its SetImage method. The output text is read out using GetUTF8Text().
#include <string>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>
#include <opencv2/opencv.hpp>

using namespace std;
using namespace cv;

int main(int argc, char* argv[])
{
    string outText;
    string imPath = argv[1];

    // Create Tesseract object
    tesseract::TessBaseAPI *ocr = new tesseract::TessBaseAPI();

    // Initialize OCR engine to use English (eng) and the LSTM OCR engine.
    ocr->Init(NULL, "eng", tesseract::OEM_LSTM_ONLY);

    // Set Page segmentation mode to PSM_AUTO (3)
    ocr->SetPageSegMode(tesseract::PSM_AUTO);

    // Open input image using OpenCV
    Mat im = cv::imread(imPath, IMREAD_COLOR);

    // Set image data
    ocr->SetImage(im.data, im.cols, im.rows, 3, im.step);

    // Run Tesseract OCR on image
    outText = string(ocr->GetUTF8Text());

    // Print recognized text
    cout << outText << endl;

    // Destroy used object and release memory
    ocr->End();
    delete ocr;

    return EXIT_SUCCESS;
}
You can compile the C++ code by running the following command on the terminal:
g++ -O3 -std=c++11 ocr_simple.cpp `pkg-config --cflags --libs tesseract opencv` -o ocr_simple
Now you can use it by passing the path of an image:
./ocr_simple image.jpg
2.4. Language Pack Error
You may encounter an error that says:
Error opening data file tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
It means the language pack (tessdata/eng.traineddata) is not on the right path. You can solve this in one of two ways.
Option 1: Make sure the file is in the expected path (e.g. on Linux the path is /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata).
Option 2: Create a directory tessdata, download eng.traineddata (e.g. from the tesseract-ocr/tessdata GitHub repository), and save it as tessdata/eng.traineddata. Then you can direct Tesseract to look for the language pack in this directory using
tesseract image.jpg stdout --tessdata-dir tessdata -l eng --oem 1 --psm 3
Similarly, you will need to change the config line of the Python code to
config = ('--tessdata-dir "tessdata" -l eng --oem 1 --psm 3')
and the Init call of the C++ code to
ocr->Init("tessdata", "eng", tesseract::OEM_LSTM_ONLY);
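A third option, suggested by the error message itself, is to set the TESSDATA_PREFIX environment variable. As a rough sketch (the path below is a placeholder; depending on your Tesseract version, this should be the tessdata directory itself or its parent):
# Point Tesseract at the directory containing eng.traineddata
export TESSDATA_PREFIX=/path/to/tessdata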
3. Use Cases
Tesseract is a general purpose OCR engine, but it works best when we have clean black text on a solid white background in a standard font. It also works well when the text is approximately horizontal and the text height is at least 20 pixels. If the text has a surrounding border, it may be misinterpreted as random text.
For example, the results would be great if you scanned a book with a high-quality scanner. But if you took a picture of a passport with a complex guilloche pattern in the background, text recognition may not work as well. In such cases, there are several tricks that we need to employ to make reading such text possible. We will discuss those advanced tricks in our next post.
Let’s look at these relatively easy examples.
3.1 Documents (book pages, letters)
Let’s take an example of a photo of a book page.
When we process this image using tesseract, it produces the following output:
1.1 What is computer vision? As humans, we perceive the three-dimensional structure of the world around us with apparent
ease. Think of how vivid the three-dimensional percept is when you look at a vase of flowers
sitting on the table next to you. You can tell the shape and translucency of each petal through
the subtle patterns of light and Shading that play across its surface and effortlessly segment
each flower from the background of the scene (Figure 1.1). Looking at a framed group por-
trait, you can easily count (and name) all of the people in the picture and even guess at their
emotions from their facial appearance. Perceptual psychologists have spent decades trying to
understand how the visual system works and, even though they can devise optical illusions!
to tease apart some of its principles (Figure 1.3), a complete solution to this puzzle remains
elusive (Marr 1982; Palmer 1999; Livingstone 2008).
Even though there is a slight slant in the text, Tesseract does a reasonable job with very few mistakes.
3.2 Receipts
The text structure in book pages is very well defined, i.e. words and sentences are equally spaced and there is very little variation in font sizes, which is not the case for bill receipts. A slightly more difficult example is a receipt with a non-uniform text layout and multiple fonts. Let’s see how well Tesseract performs on scanned receipts.
Store #056663515
DEL MAR HTS,RD
SAN DIEGO, CA 92130
(858) 792-7040Register #4 Transaction #571140
Cashier #56661020 8/20/17 5:45PMwellnesst+ with Plenti
Plenti Card#: 31XXXXXXXXXX4553
1 G2 RETRACT BOLD BLK 2PK 1.99 T
SALE 1/1.99, Reg 1/4.69
Discount 2.70-
1 Items Subtotal 1.99
Tax .15
Total 2.14
*xMASTER* 2.14
MASTER card * #XXXXXXXXXXXX548S
Apo #AA APPROVAL AUTO
Ref # 05639E
Entry Method: Chip
3.3 Street Signs
If you get lucky, you can also get this simple code to read simple street signs.
SKATEBOARDING
BICYCLE RIDING
ROLLER BLADING
SCOOTER RIDING
®
Note that it mistakes the screw for a symbol (®).
Let’s look at a slightly more difficult example. You can see some background clutter, and the text is surrounded by a rectangle.
Tesseract does not do a very good job with dark boundaries and often mistakes them for text.
| THIS PROPERTY
} ISPROTECTEDBY ||
| VIDEO SURVEILLANCE
However, if we help Tesseract a bit by cropping out the text region, it gives perfect output.
THIS PROPERTY
IS PROTECTED BY
VIDEO SURVEILLANCE
The above example illustrates why we need text detection before text recognition. A text detection algorithm outputs a bounding box around text areas which can then be fed into a text recognition engine like Tesseract for high-quality output. We will cover this in a future post.
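That said, pytesseract can already give you word-level bounding boxes through its image_to_data function, which is often enough for simple localization. Here is a minimal sketch, assuming an image.jpg on disk:
import cv2
import pytesseract
from pytesseract import Output

im = cv2.imread('image.jpg')

# image_to_data returns word-level boxes and confidences
data = pytesseract.image_to_data(im, config='-l eng --oem 1 --psm 3',
                                 output_type=Output.DICT)

# Draw a green rectangle around every confidently recognized word
for i, word in enumerate(data['text']):
    if word.strip() and float(data['conf'][i]) > 0:
        x, y = data['left'][i], data['top'][i]
        w, h = data['width'][i], data['height'][i]
        cv2.rectangle(im, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite('boxes.jpg', im)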
Dear Vaibhaw,
Thank you for this article. Are there limitations to how pytesseract will detect text according to font? For example, the above article handled images with sans-serif type fonts. Could it detect serif type fonts such as Times New Roman? At the extreme, could Tesseract detect decorative fonts such as baroque fonts? Examples: blackletter fonts https://en.wikipedia.org/wiki/Blackletter, http://www.1001fonts.com/old-english-fonts.html. Similarly, could Tesseract detect handwriting?
Thank you,
Anthony of Sydney
The phone I am using (Smartisan) actually includes this function for designers.
Vaibhaw,
Thanks for the article; it will be very useful in the future as part of a project. I have this running in two different environments on Windows 10. Environment 1 uses tesserocr and Environment 2 uses pytesseract, giving me a bit more knowledge about environment building and packages.
This is great information, thanks for putting this together! I have a slightly more complex use case that may serve as a great example to tie in some features/functionality for your future posts. Based on the attached picture (as an example), I (a beginner at this) am putting together an example project for a friend that would 1. extract each tile (1-24) and 2. extract the text from each tile. The text would be used to link the beer info with an online description. Any help/advice would be appreciated! https://uploads.disquscdn.com/images/9ead355a45fb235924336ea901666f3048b904fdcfe1f950d06843e35c395277.jpg
It looks like you have control over the setup. If so, you should put AR markers on the four corners of the large board. You can then detect these markers
https://docs.opencv.org/3.1.0/d5/dae/tutorial_aruco_detection.html
and then rectify / align it. You can then find the locations of the boxes, crop them, and perform OCR. I know it requires more detail, but it is tough to put it all in one comment.
Very nicely put article. Thanks for the effort!
Can you comment on how it works with images containing tabular data? Especially interesting for me is the use case where the table itself is a fixed template printed on paper. The cells are handwritten and need to be transformed to text.
Example: a feedback form containing a table with columns ranging from very good to very bad, and rows containing the feedback items. The feedback form in this case will remain exactly the same; only the handwritten feedback written by different participants needs to be converted to text.
I have been trying to accomplish image-to-text for tabular data with cloud APIs like Google Vision and Amazon Rekognition, but to no avail. These APIs are too generic to handle specifically formatted data. Tesseract seems the right option in this case; I just need to figure out how to do the transfer learning for this, i.e. make use of the already available information in the tabular template to improve the image-to-text conversion. Any leads in that direction are highly appreciated.
I am guessing you are either taking a picture of the filled form or scanning it. In either case, you first need to align it to the template. See this post
https://learnopencv.com/image-alignment-feature-based-using-opencv-c-python/
After alignment, you can find bounding boxes for the regions where you expect text and crop out those regions.
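For the cropping step, something like this hypothetical sketch works, where the field coordinates are assumptions you would measure once from the blank template:
import cv2
import pytesseract

# Field location measured from the blank template (hypothetical values)
x, y, w, h = 100, 200, 400, 50

aligned = cv2.imread('aligned_form.jpg')
field = aligned[y:y + h, x:x + w]

# psm 7 treats the crop as a single text line
print(pytesseract.image_to_string(field, config='-l eng --oem 1 --psm 7'))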
The final step is the hardest: handwriting recognition is not trivial. You can train Tesseract with handwritten text, but I don’t think it will perform very well. You can check out this link
https://github.com/hugrubsan/Offline-Handwriting-Recognition-with-TensorFlow
but I have not tried it myself, so I don’t know how well it works.
I created a Python env for Tesseract. The Python code works very well with the Tesseract engine. However, when I try to compile the C++ code I get this:
//usr/lib/x86_64-linux-gnu/liblept.so.5: undefined reference to `...@LIBTIFF_4.0'
(the same undefined reference error is repeated for many more libtiff symbols)
Does anyone have any ideas? If all my website chasing is correct, there is no libtiff 4.0, only 5.0?
Thanks,
Doug
Hi, can you help me with converting an image (handwritten English) to text? What modification should I make to the code?
text = pytesseract.image_to_string(im, config=config) works for printed text. The code executes well, but I am getting this output: https://uploads.disquscdn.com/images/5edf3d9e4f1f8204c890c2360df7454c648c2f909aa0855d94f3941e612d9414.png
My input was https://uploads.disquscdn.com/images/8b5f3bb01d07864eaa0c6f873ac805ac730d25e1eb52c2008be7522672eee4bb.jpg
Handwriting recognition is very much in the research domain. You can train Tesseract with handwritten data
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
But I am afraid you will not see very good results. You can also try,
https://github.com/hugrubsan/Offline-Handwriting-Recognition-with-TensorFlow
Unfortunately, I have never directly worked on handwriting recognition, so I cannot give a very informed opinion.
Hi, great work. I’m interested in reading some data; what kind of filtering or image processing would you recommend? https://uploads.disquscdn.com/images/4dc1f082cf05754944af962664f3c59df49f38b0713d6e0ed0ac526d07ae25e4.jpg
AOCR in TensorFlow should be good for your use case.
If you are able to find the dark bounding boxes, even Tesseract will do a decent job in this case.
Hey I need to read text from image which is at an angle. Like the number plate in the attached image. Any idea on that? OCR is not working in this case. https://uploads.disquscdn.com/images/c91ee77c54b170e0ff461db5eda874c86a61f0444655bc81bfbba0b14ba19c0c.jpg
https://hackernoon.com/latest-deep-learning-ocr-with-keras-and-supervisely-in-15-minutes-34aecd630ed8
This explains exactly how to recognize plate numbers.
Hi, I need to detect digits along with their positions, as in the attached image.
I tried Tesseract, but it has a hard time detecting single digits. Any ideas or suggestions?
https://uploads.disquscdn.com/images/d2628ecb085cceb344304f25f9e1aa738a7ee82a71877b357faa2a6edaf474a1.jpg
You need to first detect text.
The example you shared is relatively simple, so I expect the MSER or ERStat text detection included with OpenCV to work fine. Check out these links.
https://docs.opencv.org/3.1.0/da/d56/group__text__detect.html
https://github.com/opencv/opencv_contrib/blob/master/modules/text/samples/textdetection.py
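For a rough idea of the MSER route, here is a minimal OpenCV sketch (the image path is an assumption, and real pipelines add grouping and filtering on top of this):
import cv2

im = cv2.imread('digits.jpg')
gray = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)

# MSER finds stable regions that often correspond to characters
mser = cv2.MSER_create()
regions, boxes = mser.detectRegions(gray)
for (x, y, w, h) in boxes:
    cv2.rectangle(im, (x, y), (x + w, y + h), (0, 255, 0), 1)

cv2.imwrite('regions.jpg', im)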
Thanks for the hands-on. Which open source library/model would you recommend for recognizing various types of grocery receipts, as each store has different text conventions, locations, etc.?
Do we expect to train Tesseract on each retailer’s receipt type?
Grocery receipts are easy, with a white background and black text, so Tesseract may just work out of the box. However, you will need additional logic on top of it to figure out what the text information means: which parts to ignore and which to keep, etc.
Thanks for the quick response. So there is no need for training or tuning the model with Tesseract for grocery receipts? Is there any other library or model that uses Tesseract and has tuned it better for images such as receipts?
Tesseract is heavily tuned for grocery lists, book pages, etc., where the text is black and the background is white. All you need to do is make sure the receipt is properly rectified (i.e. the text is horizontal).
Thanks Satya. Now all I need to find is an image transformation tool that can extract the image and pre-process it for Tesseract. Any tools that can do this preprocessing?
One person in my course did this project.
https://www.youtube.com/watch?v=qKKHgX2MDp4
If you want, I can put you in touch with Behdad.
Thanks for the great article. I need some help with training Tesseract for the Urdu language; I am facing some issues.
Hi Mohammad,
Unfortunately, I have never dealt with Urdu text detection. But try the language pack and see if you get good results. If you don’t, try to binarize the image yourself before passing it to Tesseract.
Satya
https://uploads.disquscdn.com/images/ac0c48c7695d9c37cff7fc04920dcbd984e0177b9f7f5c376d62219f746656be.png
I read the entire article with all the comments. It is a very straightforward article with concepts and implementation hand-in-hand, and you explain it very well. I plan to read through all the relevant articles on your site. I would like to ask a question about your example 3.3. You said that Tesseract makes a mistake with the screw. Why doesn’t it make a mistake with the two wheels of the bike in the first street sign, where those wheels look like an O? Does it also check the vicinity of each character and how big the bounding box is?
Also, may I draw your attention to an interesting use case that I am currently working on? The first attached figure is from a research article where the authors used some heuristics to create colored bounding boxes around the text. How can I detect and localize all the text in this image?
You also said that bounding boxes are a problem for Tesseract, but that we can train Tesseract to detect text inside them too. How can we extract all the text in the second figure?
Many thanks for your response Sir.
What’s your question, Saurabh? Do you want to recognize text in the circled areas?
I appreciate your response, Satya. My question is how I can recognize and localize all the text in the attached image. Is Tesseract the best available solution, or are there others based on CNNs? What heuristics can be employed in this use case? Thanks.
https://uploads.disquscdn.com/images/56454f275979a992edeb64d516a43f4b32f5637944b4bccf2465902c1f6660b0.png
You need to first detect text for this. OpenCV has some built in text detectors. Try these links
https://docs.opencv.org/3.1.0/da/d56/group__text__detect.html
https://github.com/opencv/opencv_contrib/blob/master/modules/text/samples/textdetection.py
Satya, thanks for the links and the approach. Why detect text first through OpenCV? On the contrary, I actually followed your commands and passed the entire image to Tesseract 4.0 with --oem 3, and it gives me almost the entire text in the image, even the vertical text. I guess, with the hOCR output, I can get bounding box coordinates and use them to bound the relevant text. What do you say?
Also, can you advise any resources or approaches for symbol detection/localization in the image? I trained with YOLOv2 without resizing the training images, and it gives good accuracy for single-class classification but fails completely for multi-class classification. Any guesses? Once again, many thanks for sharing your work.
Hey, if I want to detect text on electronic chips, how can I do that? TensorFlow or OpenCV? https://uploads.disquscdn.com/images/960dc4fd20bdfc905cc96165ade4112ea356f9c7bb163bd52ed9dbe6169410f9.png
I would say:
1) Capture a good resolution image of this board.
2) Follow this article by Satya with all default parameters, which he explained very well.
3) See how much text you get correct and how many false positives there are. Your text is very legible, and I guess Tesseract should have no problem except with some text within boundaries. Also, Satya says it works best with black text on a white background, so you will need some trial and error.
4) Read all the comments here, especially the one about grocery receipts.
5) Get in touch with Satya or follow his other articles for tuning Tesseract to read more text.
Do you want to read the text on the top of the chips too?
Hi,
Great article. In section 3 you mentioned that you would discuss tricks to make text readable (e.g. from passports). Did you cover that?
Regards
Hi Sega,
We have not covered that yet, but there are a few basic ways you can help Tesseract:
1. Aligning the image so the text is horizontal.
2. Binarizing the image (see the sketch after this list).
3. Specifying the language and, when appropriate, specifying the character set.
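For the binarization trick, a minimal Otsu thresholding sketch with OpenCV (the image path is an assumption):
import cv2
import pytesseract

im = cv2.imread('image.jpg')
gray = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)

# Otsu's method picks the global threshold automatically
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

print(pytesseract.image_to_string(binary, config='-l eng --oem 1 --psm 3'))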
Hi, I am using Tesseract 4, OpenCV 3.4.2, and Ubuntu 16.04. Sometimes my computer reboots. Has anyone seen similar behavior?
This happens to me on 2 different computers.
I have not seen this behavior on my Mac or Linux box.
Hi, I have a question regarding the OCR engine modes in Tesseract 4. Can you give guidance on which one to choose when, and which performs best? Are there some performance tests somewhere I can check out? What does it mean when they combine the legacy + LSTM engines in mode 2?
Thanks a lot! Great post 🙂
Hi, is there any way to train Tesseract to look for specific text only?
I am searching for the same; if you find out, please write to me at [email protected]
Hi @spmallick:disqus, thanks for the great article. Is there any example showing how to extract metadata based on a fixed template form (like the passport example) using Tesseract? The input could be either an image or a PDF. Thanks in advance.
Thank you, thank you, thank you!! I’ve been meaning to start a project for which I needed OCR, but since my programming experience is rather slim I was afraid it’d be too much for me.
I’m eager to see the text detection post.
How can we extract tables from an image? I have a few scanned images which contain tabular data inside a PDF; basically, it is a sandwich kind of PDF where you have both image and text inside the PDF.
Thank you