
In this post, we will learn how to use models pre-trained on large datasets like ILSVRC, and also how to use them for a task different from the one they were trained on. We will cover the following topics in the next three posts:
- Image classification using different pre-trained models ( this post )
- Training a classifier for a different task, using the features extracted by the above-mentioned models – This is also referred to as Transfer Learning.
- Training a classifier for a different task, by modifying the weights of the above models – This is called Fine-tuning.
What is ImageNet
ImageNet is a project which aims to provide a large image database for research purposes. It contains more than 14 million images belonging to more than 20,000 classes ( or synsets ). It also provides bounding box annotations for around 1 million images, which can be used in Object Localization tasks. Note that ImageNet only provides URLs of the images; you need to download the images yourself.
What is ILSVRC
The ImageNet Large Scale Visual Recognition Challenge ( ILSVRC ) is an annual competition, organized by the ImageNet team since 2010, in which research teams evaluate their computer vision algorithms on various visual recognition tasks such as Object Classification and Object Localization. The training data is a subset of ImageNet with 1.2 million images belonging to 1000 classes. Deep Learning came into the limelight in 2012, when Alex Krizhevsky and his team won the competition by a whopping margin of 11%. ILSVRC and ImageNet are sometimes used interchangeably.
Why use pre-trained models?
Allow me a little digression.
Imagine two people, Mr. Couch Potato and Mr. Athlete. They sign up for soccer training at the same time. Neither of them has ever played soccer, and skills like dribbling, passing, kicking etc. are new to both of them.
Mr. Couch Potato does not move much; Mr. Athlete does. That is the core difference between the two, even before training has started. As you can imagine, the skills Mr. Athlete has already developed ( e.g. stamina, speed and even sporting instincts ) are going to be very useful for learning soccer, even though he has never trained for it.
Mr. Athlete benefits from his pre-training.
The same holds true for using pre-trained models in Neural Networks. A pre-trained model is trained on a different task than the task at hand but provides a very useful starting point because the features learned while training on the old task are useful for the new task.
We have seen earlier that we can create and train small convolutional neural networks ( CNNs ) to classify digits ( using MNIST ) or different objects ( using CIFAR10 ). These small networks fall short when there are many classes and the objects vary in size, shape, appearance etc., because the model lacks the complexity required to capture such large variations in the data.
Even though it is theoretically possible to model any function using just a single hidden layer, the number of neurons required to do so would be very large, making the network difficult to train. Thus, we use deep networks with many hidden layers, which learn different features at different layers, as we saw in the previous post on CNNs.
Deep networks have a large number of unknown parameters ( in the millions ). The task of training a network is to find the optimum parameters using the training data. From linear algebra, we know that to solve for three unknown parameters, we need three equations ( data points ). If we know only two equations, the system is underdetermined: we cannot pin down all three unknowns exactly and have to settle for approximate values.
Similarly, to find all the unknown parameters of a deep network accurately, we would need a lot of data ( in the millions ). If we have too little data, we get only approximate values for most of the parameters, which we don't want. The moral of the story is:
For Deep Networks – More data -> Better learning.
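The equations analogy can be made concrete with a tiny sketch ( not from the original post, assuming only Numpy ): with two equations and three unknowns, the solver cannot single out one answer.

import numpy as np

# Two equations, three unknowns: an underdetermined system A @ x = b.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
b = np.array([7.0, 8.0])

# lstsq returns only one of infinitely many solutions (the minimum-norm one).
x, residuals, rank, _ = np.linalg.lstsq(A, b, rcond=None)
print('rank:', rank)        # 2, i.e. fewer independent equations than unknowns
print('one solution:', x)   # just one member of a whole family of solutions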
The problem is that it is difficult to get such huge labeled datasets for training the network.
Another problem related to deep networks is that even if you get the data, it takes a large amount of time to train the network ( hundreds of hours ). Thus, it takes a lot of time, money and effort to train a deep network successfully.
Fortunately, we can leverage models already trained on very large amounts of data for difficult tasks with thousands of classes. Many research groups share the models they have trained for competitions like ILSVRC. These models have been trained on millions of images, for hundreds of hours, on powerful GPUs. Most often, we use them as a starting point for our training process, instead of training our own model from scratch.
Enough of background, let’s see how to use pre-trained models for image classification in Keras.
Pre-trained models present in Keras
The winners of ILSVRC have been very generous in releasing their models to the open-source community. There are many models, such as AlexNet, VGGNet, Inception, ResNet, Xception and many more, which we can choose from for our own task. Apart from the ILSVRC winners, many research groups also share models they have trained for similar tasks, e.g., MobileNet, SqueezeNet etc.
These networks are trained for classifying images into one of 1000 categories or classes.
Keras comes bundled with many models. A trained model has two parts – Model Architecture and Model Weights. The weights are large files and thus are not bundled with Keras. However, the weights file is downloaded automatically ( one time ) if you specify that you want to load the weights trained on ImageNet data. Keras includes the following models ( as of version 2.1.2 ):
- VGG16,
- InceptionV3,
- ResNet50,
- MobileNet,
- Xception,
- InceptionResNetV2
Loading a Model in Keras
We can load the models in Keras using the following code:
import numpy as np

# import the models for further classification experiments
from tensorflow.keras.applications import (
    vgg16,
    resnet50,
    mobilenet,
    inception_v3
)

# init the models
vgg_model = vgg16.VGG16(weights='imagenet')
inception_model = inception_v3.InceptionV3(weights='imagenet')
resnet_model = resnet50.ResNet50(weights='imagenet')
mobilenet_model = mobilenet.MobileNet(weights='imagenet')
In the above code, we first import the Python module containing the respective models. Then we load the model architecture and the ImageNet weights for the networks. If you don't want to initialize the network with the ImageNet weights, replace 'imagenet' with None.
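For example, a quick sketch ( the variable name untrained_vgg is ours, not from the original post ) to inspect the loaded network, or to load the same architecture with randomly initialized weights:

# print every layer of the loaded VGG16 network and its parameter count
vgg_model.summary()

# the same architecture, but with randomly initialized weights
untrained_vgg = vgg16.VGG16(weights=None)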
Loading and pre-processing an image
We can load the image using any library, such as OpenCV, PIL, skimage etc. Keras also provides an image module with functions to load images and perform the basic pre-processing required before feeding them to the network for prediction. We will use the Keras functions for loading and pre-processing the image. Specifically, we perform the following steps on an input image:
- Load the image. This is done using the load_img() function. Keras uses the PIL format for loading images. Thus, the image is in width x height x channels format.
- Convert the image from PIL format to Numpy format ( height x width x channels ) using the img_to_array() function.
- Convert the image / images into batch format. The networks accept a 4-dimensional Tensor as input, of the form ( batchsize, height, width, channels ). This is done using the expand_dims() function in Numpy.
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.image import load_img
from tensorflow.keras.preprocessing.image import img_to_array
from tensorflow.keras.applications.imagenet_utils import decode_predictions

# assign the image path for the classification experiments
filename = 'images/cat.jpg'

# load an image in PIL format
original = load_img(filename, target_size=(224, 224))
print('PIL image size', original.size)
plt.imshow(original)
plt.show()

# convert the PIL image to a numpy array
# In PIL - image is in (width, height, channel)
# In Numpy - image is in (height, width, channel)
numpy_image = img_to_array(original)
plt.imshow(np.uint8(numpy_image))
plt.show()
print('numpy array size', numpy_image.shape)

# Convert the image / images into batch format
# expand_dims will add an extra dimension to the data at a particular axis
# We want the input matrix to the network to be of the form (batchsize, height, width, channels)
# Thus we add the extra dimension to the axis 0.
image_batch = np.expand_dims(numpy_image, axis=0)
print('image batch size', image_batch.shape)
plt.imshow(np.uint8(image_batch[0]))
Output:
PIL image size (224, 224)
numpy array size (224, 224, 3)
image batch size (1, 224, 224, 3)

Predicting the Object Class
Once we have the image in the right format, we can feed it to the network and get the predictions. The image obtained in the previous step should be normalized by subtracting the mean of the ImageNet data, because the network was trained on images pre-processed this way. We follow these steps to get the classification results:
- Preprocess the input by subtracting the mean value from each channel of the images in the batch. The mean is an array of three elements, obtained by averaging the R, G, B pixels of all the ImageNet images. For ImageNet, the values are [ 103.939, 116.779, 123.68 ]. This is done using the preprocess_input() function.
- Get the classification result, which is a Tensor of dimension ( batch size x 1000 ). This is done with the model.predict() function.
- Convert the result to human-readable labels. The vector obtained above has too many values to make any sense on its own. Keras provides the decode_predictions() function, which takes the classification results, sorts them according to the confidence of the prediction and returns the class names ( instead of class numbers ). We can also specify how many results we want, using the top argument of the function. The output shows the class ID, class name and the confidence of the prediction.
# prepare the image for the VGG model
processed_image = vgg16.preprocess_input(image_batch.copy())

# get the predicted probabilities for each class
predictions = vgg_model.predict(processed_image)
# print predictions

# convert the probabilities to class labels
# we will get the top 5 predictions, which is the default
label_vgg = decode_predictions(predictions)

# print VGG16 predictions
for prediction_id in range(len(label_vgg[0])):
    print(label_vgg[0][prediction_id])
Output:
('n02123597', 'Siamese_cat', 0.30934194)
('n01877812', 'wallaby', 0.08034124)
('n02326432', 'hare', 0.07509843)
('n02325366', 'wood_rabbit', 0.0505307)
('n03223299', 'doormat', 0.048173614)
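Each Keras application module also ships its own preprocess_input() and decode_predictions() functions, so the same image can be run through all four models loaded earlier. The following is a minimal sketch ( not part of the original post's code ) of how such a comparison can be scripted; note that InceptionV3 expects a 299 x 299 input, while the other three networks expect 224 x 224.

# run the same image through all 4 models and print each top-1 prediction
model_specs = [
    ('VGG16',       vgg_model,       vgg16,        224),
    ('ResNet50',    resnet_model,    resnet50,     224),
    ('MobileNet',   mobilenet_model, mobilenet,    224),
    ('InceptionV3', inception_model, inception_v3, 299),
]

for name, model, module, size in model_specs:
    # reload the image at the input size this network expects
    img = load_img(filename, target_size=(size, size))
    batch = np.expand_dims(img_to_array(img), axis=0)
    processed = module.preprocess_input(batch)
    preds = model.predict(processed)
    # top=1 keeps only the most confident prediction
    print(name, module.decode_predictions(preds, top=1)[0][0])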
Comparison of Results from Various Models
Let us see what the different models say for a few images. Giving a cat image as input and running it on the 4 models, we get the following output.

Giving a dog as an input, this is the output:

With an orange:

For a tomato we get:

For a watermelon:

Well, it looks like models trained on ILSVRC data do not recognize tomatoes and watermelons, since these are not among the 1000 ILSVRC classes. In the next two posts, we will see how to train a classifier, using these same models with our own data, to recognize any other set of objects not present in the ILSVRC dataset. Stay tuned!