Training a Custom Object Detector with TensorFlow and Using it with OpenCV DNN module

Training a Custom Object Detector with TensorFlow and Using it with OpenCV DNN module

Main Image

This is a really descriptive and interesting tutorial, let me highlight what you will learn in this tutorial.

  1. A Crystal Clear step by step tutorial on training a custom object detector.
  2. A method to download videos and create a custom dataset out of that.
  3. How to use the custom trained network inside the OpenCV DNN module so you can get rid of the TensorFlow framework.

Plus here are two things you will receive from the provided source code:

  1. A Jupyter Notebook that automatically downloads and installs all the required things for you so you don’t have to step outside of that notebook.
  2. A Colab version of the notebook that runs out of the box, just run the cells and train your own network.

I will stress this again that all of the steps are explained in a neat and digestible way. I’ve you ever plan to do Object Detection then this is one tutorial you don’t want to miss.

As mentioned, by downloading the Source Code you will get 2 versions of the notebook: a local version and a colab version.

So first we’re going to see a complete end to end pipeline for training a custom object detector on our data and then we will use it in the OpenCV DNN module so we can get rid of the heavy Tensorflow framework for deployment. We have already discussed the advantages of using the final trained model in OpenCV instead of Tensorflow in my previous post.

Today’s post is the 3rd tutorial in our 3 part Deep Learning with OpenCV series. All three posts are titled as:

  1. Deep Learning with OpenCV DNN Module, A Comprehensive Guide
  2. Training a Custom Image Classifier with OpenCV, Converting to ONNX, and using it in OpenCV DNN module.
  3. Training a Custom Object Detector with Tensorflow and using it with OpenCV DNN (This Post)

Now to follow along and to learn the full pipeline of training a custom object detector with TensorFlow you don’t need to read the previous two tutorials but when we move to the last part of this tutorial and use the model in OpenCV DNN then those tutorials would help.

What is Tensorflow Object Detection (TFOD) API:

To train our custom Object Detector we will be using TensorFlow Object Detection API (TFOD API). This API is a framework built on top of TensorFlow that makes it easy for you to train your own custom models.

The workflow generally goes like this :

You take a pre-trained model from this model zoo and then fine-tune the model for your own task.
Fine-tuning is a transfer learning method that allows you to utilize features of the model which it learned from a different task to your own task. Because of this, you won’t require thousands of images to train the network, only a few hundred will suffice.
If you’re someone who prefers PyTorch instead of Tensorflow then you may want to look at Detectron 2

For this Tutorial I will be using TensorFlow Object Detection API version 1, If you want to know why we are using version 1 instead of the recently released version 2, then you can read below optional explanation.

Why we’re using TFOD API Version 1? (OPTIONAL READ)


TFOD v2 comes with a lot of improvements, the new API contains some new State of The ART (SoTA) models, some pretty good changes including New binaries for train/eval/export that are eager mode compatible. You can check out this release blog from the TFOD API developers.

But the thing is because TF 2 no longer supports sessions so you can’t easily export your model to frozen_inference_graph, furthermore TensorFlow depreciates the use of frozen_graphs and promotes saved_model format for future use cases.

For TensorFlow, this is the right move as the saved_model format is an excellent format.

So what’s the issue?

The problem is that OpenCV only works with frozen_inference_graphs and does not support saved_model format yet, so for this reason if your end goal is to deploy it in OpenCV then you should use TFOD API v1. Although you can still generate frozen_graphs, those graphs produce errors with OpenCV most of the time, we’ve tried limited experiments with TF2 so feel free to carry out your experiments but do share if you find something useful.

Now One great thing about this situation is that the Tensorflow team decided to keep the whole pipeline and code of TFOD API 2 almost identical to TFOD API 1 so learning how to use TFOD v1 will also teach you how to use TFOD API v2.

Now Let’s start with the code

Code For TF Object Detection Pipeline:

Download Source Code For This Tutorial

Download Source Code 

Make sure to download the source code, which also contains the support folder with some helper files that you will need.

Here’s the hierarchy of the source code folder:

Here’s a Description of what these folder & files are:

  • Custom_Object_Detection.ipynb: This is the main notebook which contains all the code.
  • Colab Notebook Link: This text file contains the link for the colab version of the notebook.
  • This file will create tf records from the images and labels.
  • fronzen_graph_inference.pb: This is the model we trained, you can try to run this on test images.
  • graph_ours.pbtxt: This is the graph file we generated for OpenCV, you’ll learn to generate your own.
  • This file creates the above graph.pbtxt file for OpenCV.
  • This is a helper file used by the file.
  • labels: These are .xml labels for each image
  • test_images: These are some sample test images to do inference on.

Note: There are some other folder and files which you will generate along the way, I will explain their use later.

Now Even though I make it really easy but still if you don’t want to worry about environment setup, installation, then you can use the colab version of the notebook that comes with the source code.

The Colab version doesn’t require any Configuration, It’s all set to go. Just run the cells in order. You should also be able to use the Colab GPU to speed up the training process.

The full code can be broken down into the following parts

  • Part 1: Environment Setup
  • Part 2: Installation & TFOD API Setup
  • Part 3: Data Collection & Annotation
  • Part 4: Downloading Model & Configuring it
  • Part 5: Training and Exporting Inference Graph.
  • Part 6: Generating .pbtxt and using the trained model with just OpenCV.

Part 1: Environment Setup:

First let’s Make sure you have correctly set up your environment.

Since we are going to install tensorflow version 1.15.0 so we should use a virtual environment, you can either install virtualenv or anaconda distribution.. I’m using Anaconda. I will start by creating virtual environment.

Open up the command prompt and do conda create --name tfod1 python==3.7

Now you can move into that environment by activating it:

conda activate tfod1

Make sure there is a (tfod1) at the beginning of each line in your cmd. This means you’re using that environment. Now anything you install will be in that environment and won’t affect your base/root environment.

The first thing You want to do install a jupyter notebook in that environment. Otherwise, your environment will use the jupyter notebook of the base environment, so do:

pip install jupyter notebook

Now you should go into the directory/folder which I provided you and contains this notebook and open up the command prompt.

First, activate the environment tfod1 environment and then launch the jupyter notebook by typing jupyter notebook and hit enter.

This will launch the jupyter notebook in your newly created environment. You can now Open up Custom_Object_Detection Notebook.

Make sure your Notebook is Opened up in the Correct environment


Part 2: Installation & TFOD API Setup: 

You can install all the required libraries by running this cell

If you want to install Tensorflow-GPU for version 1 then you can take a look at my tutorial for that here

Note: You would need to change the Cuda Toolkit version and CuDNN version in the above tutorial, since you’ll be installing for TF version 1 instead of version 2. You can look up the exact version requirements here

Another Library you will need is pycocotools

Alternatively You can also use this command to install in windows:

pip install git+^&subdirectory=PythonAPI

Alternatively you can also use this command to install in Linux and osx:

pip install pycocotools

Note: Make sure you have Cython installed first by doing: pip install Cython

Import Required Libraries

This will also confirm if your installations were successful or not.

This should be Version 1.15.0, DETECTED VERSION: 1.15.0

Clone Tensorflow Object Detection Model Repository

You need to clone the TF Object Detection API repository, you can either download the zip file and extract it or if you have git installed then you can git clone it.

Option 1: Download with git:

You can run git clone if you have git installed, this is going to take a while, its 600 MB+, have a coffee or something.

Option 2: Download zip and extract all: (Only do this if you don’t have git)

You can download the zip by clicking here, after downloading make sure to extract the contents of this zip inside the directory containing this notebook. I’ve already provided you the code that automatically downloads and unzips the repo in this directory.

The models we’ll be using are in the research directory of the above repo. The research directory contains a collection of research model implementations in TensorFlow 1 or 2 by researchers. There are a total of 4 directories in the above repo, you can learn more about them here.

Install Tensorflow Object Deteciton API & Compile Protos

Download Protobuff Compiler:

TFOD contains some files .proto format, I’ll explain more about this format in a later step, for now you need to download the protobuf compiler from here, make sure to download the correct one based on your system. For e.g. I downloaded for my 64 bit windows. For linux and osx there are different files.

After downloading unzip the proto folder, go to its bin directory, and copy the proto.exe file. Now paste this proto.exe inside the models/research directory.

The below script does all of this, but you can choose to do it manually if you want. Make sure to change the URL if you’re using a system other than 64-bit windows.

Now you can install the object detection API and compile the protos:
Below two operations must be performed in this directory, otherwise it won’t work, especially the proto command.

Note: Since I already had installed pycocotools so after running this line cp object_detection/packages/tf1/ . I edited the file to get rid of pycocotools package inside the REQUIRED_PACKAGES list then I saved the file and ran the python -m pip install . command. I did this because I was facing issues installing pycocotools this way which is why I installed the pycocotools-windows package, you probably won’t need do this.

If you wanted to install TFOD API version 2 instead of version 1 then you can just replace tf1 with tf2 in the cp object_detection/packages/tf1/ . command.

You can Check your installation of TFOD API by running

Part 3: Data Collection & Annotation:

Now for this tutorial I’m going to train a detector to detect the faces of Tom & Jerry. I didn’t wanted to use the common animal datasets etc. So I went with this.

While I was writing the above sentence I just realized I’m still using a Cat, mouse dataset albeit an animated one so I guess its still a unique dataset.

In this tutorial, I’m not only going to show you how to annotate the data but also show you one approach on how to go about collecting data for a new problem.

So What I’ll be doing is that I’m going to download a video of Tom & Jerry from Youtube and then split the frames of the video to create my dataset and then annotate each of those frames with bounding boxes. Now instead of downloading my Tom & Jerry video you can use any other video and try to detect your own classes.

Alternatively you can also generate training data from other methods including getting images from Google Images.

To prepare the Data we need to perform these 5 steps:

  • Step 1: Download Youtube Video.
  • Step 2: Split Video Frames and store it.
  • Step 3: Annotate Images with labelImg.
  • Step 4: Create a label Map file.
  • Step 5: Generate TFRecords.

Step 1: Download Youtube Video:

11,311,502.0 Bytes [100.00%] received. Rate: [7788 KB/s]. ETA: [0 secs]

For more options on how you can download the video take a look at the documentation here

Step 2: Split Video Frames and store it:

Now we’re going to split the video frames and store them in a folder. Since most videos have a high FPS (30-60 frames/sec) and we don’t exactly need this many frames for two reasons:

  1. If you take a 30 FPS video then for each second of the video you will get 30 images and most of those images won’t be different from each other, there will be a lot of repetition of information.
  2. We’re already going to use Transfer Learning with TFOD API, the benefit of this is that we won’t be needing a lot of images and this is good since we don’t want to annotate thousands of images.

So we can do two two things we can skip frames and save every nth frame or we can save a frame every nth second of the video. I’m going with the latter approach, although both are valid approaches.

Done Splitting Video, Total Images saved: 165

You can go to the directory where the images are saved and manually go through each image and delete the ones where Tom & Jerry are not visible or hardly visible. Although this is not a strict requirement since you can easily skip these images in the annotation step.

Step 3: Annotate Images with labelImg

You can watch this video below to understand how to use labelImg to annotate images and export annotations. You can also take a look at the github repo here.

For the current Tom & Jerry problem I am providing you with a labels folder which already contains the .xml annotation file for each image. If you want to try a different dataset then go ahead, make sure to put the labels of that dataset in the labels folder

Note: We are not splitting the images into train and validation folder right now because we’ll be doing that automatically at tfrecord creation step. Although it would still be a good idea to separate 10% of the data for proper testing/evaluation of the final trained detector, but since my purpose is to make this tutorial as simple as possible so I won’t be doing that today, I already have test folder with 4-5 images which I will evaluate on.

Step 4: Create a label Map file

TensorFlow requires a label map file, which maps each of the class labels to an integer values. This label map is used in training and detection process. This file should be saved in training directory which also contains the labels folder

Step 5: Generate TFrecords

What are TFrecords?

Tfrecords are just protocol buffers, they help make the data reading/processing process computationally efficient. The only downside they have is that they are not human readable.

What are protocol Buffers?

A protocol buffer is a type of serialized structured data. It is more efficient than JSON, XML, pickle, and text storage formats. Google created this Protobuf (protocol buffer) format in 2008 because of their efficiency, Since then they have been widely used by Google and the community. To read the protobuf files (.proto files) you will first need to compile them by a protobuf compiler. So now you probably understand why we needed to compile those proto files at the beginning.

Here’s a nice tutorial by Naveen that explains how you can create a tfrecord for different data types and Here’s a more detailed explanation of protocol buffers with an example.

The script I’ll be using to convert images/labels to tfrecords is taken from the TensorFlow’s pet example but I’ve modified the script so now it accepts the following 5 arguments:

  1. Directory of images
  2. Directory of labels
  3. % of Split of Training data
  4. Path to label_map.pbtxt file
  5. Path to output tfrecord files

And it returns a train.record and val.record. So it splits the training data into training/validation sets. For this data I’m using a training set of 70% and validation is 30%.

Done Writing, Saved: training\\tfrecords\train.record Done Writing, Saved: training\\tfrecords\val.record

You can ignore these warnings, we already know that we’re using an older 1.15 version of TFOD API which contains some depreciated functions.

Most of the tfrecord scripts available online will first tell you to convert your xml files to csv and then you will use another script to split the data into training and validation folder and then another script to convert to tfrecords. The Script above is doing all of this.

Part 4: Downloading Model & Configuring it:

You can now go to the Model Zoo, select a model, and download its zip. Now unzip the contents of that folder and put inside a directory named pretrained_model. The below script does this automatically for a Faster-RCNN-Inception model which is already trained on the COCO dataset. You can change the model name to download a different model.

Model Downloaded

Modify pipline.config file:

After downloading you will have a number of files present in the pretrained_model folder, I will explain about them later but for now, let’s take a look at the pipeline.config file.

Pipeline.config defines how the whole training process will take place, what optimizers, loss, learning_rate, batch_size will be used. Most of these params are already set by default, its up to you if you want to change them or not but there are some paths in the pipeline.config file that you will need to change so that this model can trained on our data.

So open up pipeline.config with a text editor like Notepad ++ and change these 4 paths:

  • Change: PATH_TO_BE_CONFIGURED/model.ckpt  to  pretrained_model/model.ckpt
  • Change: PATH_TO_BE_CONFIGURED/mscoco_train.record  to  training/tfrecords/train.record
  • Change: PATH_TO_BE_CONFIGURED/mscoco_val.record   to  training/tfrecords/val.record
  • Change: PATH_TO_BE_CONFIGURED/mscoco_label_map.pbtxt  to  training/label_map.pbtxt
  • Change: num_classes: 90  to  num_classes: 2

If you’re lazy like me then no prob, below script does all this

Notice the correction I did by replacing step: 0 with step: 1, unfortunately for different models sometimes there are some corrections required but you can easily understand what exactly needs to be changed by pasting the error generated during training on google. Click on github issues for that error and you’ll find a solution for that.

Note: These issues seems to be mostly present in TFOD API Version 1

Changing Important Params in Pipeline.config File:

Additionally I’ve also changed the batch size of the model, just like batch_size there are lots of important parameters that you would want to tune. I would strongly recommend that you try to change the values according to your problem. Almost always the default values are not optimal for your custom use case. I should tell you that to tune most of these values you need some prior knowledge, make sure to atleast change the batch_size according to your system’s memory and learning_rate of the model.

Part 5 Training and Exporting Inference Graph: 

You can start training the model by calling the script from the Object_detection folder, we are giving it the following arguments.

  • num_train_steps: These are the number of times your model weights will be updated using a batch of data.
  • pipeline_config_path: This is the path to your pipeline.config file.
  • model_dir: Path to the output directory where the final checkpoint files will be saved.

Now you can run below cell to start training but I would recommend that you run this cell in the command line, you can just paste this line:

Note: When you start training you will see a lot of warnings, just ignore them as TFOD 1 contains a lot of depreciated functions.

Once you start training, the network will take some time to initialize and then the training will start, after every few minutes, you will see a report of loss values and a global loss. The Network is learning if the loss is going down. If you’re not familiar with the Object detection Jargon Like IOU etc, then just make note of the final global loss after each report.

You ideally want to set the num_train_steps to tens of thousands of steps, you can always end training by pressing CTRL + C on the command prompt if the loss has decreased sufficiently. If training is taking place in jupyter notebook then you can end it by pressing the Stop button on top.

After training has ended or you’ve stopped it, there would be some new files in the pre_trained folder. Among all these files we will only need the checkpoint (ckpt) files.

If you’re training for 1000s of steps (which is most likely the case) then I would strongly recommend that you don’t use your CPU but utilize a GPU. If you don’t have one then its best to use Google Colab’s GPU. I’m already providing you a ready to run colab Notebook.

Note: There’s another script for training called, this is an older script where you can see the loss value for each step, if you want to use that sicpt then you can find it at models / research / object_detection / legacy /

You can run this script by doing:

The best way to monitor training is to use Tensorboard, I will discuss about this another time

Export Frozen Inference Graph:

Now we will use the script to create a frozen_inference_graph from the checkpoint files.

Why are we doing this?

After training our model it is stored in checkpoint format and a saved_model format but in OpenCV we need the model to be in a frozen_inference_graph format. So we need to generate the frozen_inference_graph using the checkpoint files.

What are these checkpoint files?

After Every few minutes of training, tensorflow outputs some checkpoint (ckpt) files. The number on those files represent how many train steps they have gone through. So during the frozen_inference_graph creation we only take the latest checkpoint file (i.e. the file with the highest number) because this is the one which has gone through the most training steps.

Now every time a checkpoint file is saved, its split into 3 parts.

For the initial step these files are:

  • This file contains the value of each single variable, its pretty large.
  • This file contains metadata for each tensor. e.g. checksum, auxiliary data etc.
  • model.ckpt-000.meta: This file stores the graph structure of the model

If you take a look at the fine_tuned_model folder wihch will be created after running the above command then you’ll find that it contains the same files you got when you downloaded the pre_trained model. This is the final folder.

Now Your trained model is in 3 different formats, the saved_model format, the frozen_inference_graph format and the checkpoint file format. For OpenCV we only need the frozen inference graph format.

The checkpoint format is ideal for retraining purposes and getting to know other sorts of information about the model, for production and serving the model you will need to use is either the frozen_inference_graph or saved_model format. Its worth mentioning that both these files contain the extension .pb

In TF 2, frozen_inference_graph is depreciated and TF 2 encourages to use the saved_model format, as said previously unfortunately we can’t use the saved_model format with OpenCV yet.

Run Inference on Trained Model (Bonus Step):

You can optionally choose to run inference using tensorflow sessions, I’m not going to explain much here as Tf sessions are depreciated and our final goal is to actually use this model in OpenCV DNN.

Part 6: Generating .pbtxt and using the trained model with just OpenCV 

6 a) Export Graph.pbxt with frozen inference graph:

We can use the above generated frozen graph inside the OpenCV DNN module to do detection but most of the time we need another file called a graph.pbtxt file. This file contains a description of the network architecture, it is required by OpenCV to rewire some network layers for Optimization purposes.

This graph.pbtxt can be generated by using one of the 4 scripts provided by OpenCV. These scripts are:


They can be downloaded here, you will also find more information regarding them on that page.

Now since the Detection architecture we’re using is Faster-RCNN (you can tell by looking at the name of the downloaded model) so we will use to generate the pbtxt file. For .pbtxt generation you will need the frozen_inference_graph.pb file and the pipeline.config file.

Note: When you’re done with training then you will also see a graph.pbtxt file inside the pretrained folder, this graph.pbtxt is different from the one generated by OpenCV’s .pbtxt generator scripts. One major difference is that the OpenCV’s graph.pbtxt do not contain the model weights but only contains the graph description, so they will be much smaller in size.

Number of classes: 2
Scales: [0.25, 0.5, 1.0, 2.0] Aspect ratios: [0.5, 1.0, 2.0]
Width stride: 16.000000
Height stride: 16.000000
Features stride: 16.000000

For model architectures that are not one of the above 4, then for those, you will need to convert TensorFlow’s .pbtxt file to OpenCV’s version. You can find more on how to do that here. But we warned this conversion is not a smooth process and there are a lot of low-level issues that come up.

6 b) Using the Frozen inference graph along with Pbtxt file in OpenCV:

Now that we have generated the graph.pbtxt file with OpenCV’s tf_text_graph function we can pass this file to cv2.dnn.readNetFromTensorflow() to initialize the network. All of our work is done now Make sure you’re familiar with with OpenCV’s DNN module, if not you can read my previous post on it.

Now we will create following two functions:

Initialization Function: This function will intialize the network using the .pb and .pbtxt file, it will also set the class labels.

Main Function: This function will contain all the rest of the code from preprocessing to postprocessing, it will also have the option to either return the image or display it with matplotlib

This is our Main function, the comments will explain what’s going on

Note: When you do net.forward() you get an output of shape (1,1,100,7). Since we’re predicting on a single image instead of a batch of images so you will get (1,1) at the start now the remaining (100,7) means that there are 100 detections for that image and each image contains 7 properties/variables.

There will be 100 detections for each image, this was set in the pipeline.config, you can choose to change that.

So here are what these 7 properties correspond to:

  1. This is the index of image for a single image its 0
  2. This is the index of the target CLASS
  3. This is the score/confidence of that CLASS

Remaining 4 values are x1,y1,x2,y2. These are used to draw the bounding box of that CLASS object

  1. x1
  2. y1
  3. x2
  4. y2

Initialize the network

You will just need to call this once to initialize the network

Predict On Images

Now you can use the main funciton to predict on different images, The images we will predict on are placed inside a folder namded test_images. These images were not in the training dataset.

What’s Next?

computer vision

If you want to go forward from here and learn more advanced things and go into more detail, understand theory and code of different algorithms then be sure to check out our Computer Vision & Image Processing with Python Course (Urdu/Hindi). In this course, I go into a lot of detail regarding vision fundamentals and cover a plethora of algorithms and techniques to help you master Computer Vision.

If you want to start a career in Computer Vision & Artificial Intelligence then this course is for you. One of the best things about this course is that the video lectures are in Urdu/Hindi Language without any compromise on quality, so there is a personal/local touch to it.


Limitations: Our Final detector has a decent accuracy but it’s not that robust because of 4 reasons:

  1. Transfer Learning works best when the dataset you’re training on shares some features with the original dataset it was trained on, most of the models are trained on ImageNet, COCO, PASCAL VOC datasets. Which is filled with animals and other real-world images. Now our dataset is a dataset of Cartoon images, which is drastically different from real-world images. We can solve this problem by including more images and training more layers of the model.

  2. Animations of cartoon characters are not consistent, they change a lot in different movies. So if you train the model on these pictures and then try to detect random google images of tom and jerry then you won’t get good accuracy. We can solve this problem by including images of these characters from different movies so the model learns the features that are the same throughout the movies.

  3. The images generated from the sample video created an imbalanced dataset, There are more Jerry Images than Tom images, there are ways to handle this scenario but try to get a decent balance of images for both classes to get the best results.

  4. The annotation is poor, Yeah so the annotation I did was just for the sake of making this tutorial, in reality, you want to set a clear outline and standard about how you’ll be annotating, are you going to annotate the whole head, are ears included, is the neck part of it.. so you need answer all these questions ahead of time.

I will stress this again that if you’re not planning to use OpenCV for the final deployment then use TFOD API version 2, it’s a lot more cleaner. However, if the final objective is to use OpenCV at the end then you could get away with TF 2 but its a lot of trouble.

Even with TFOD API v1, you can’t be sure that your custom trained model will always be loaded in OpenCV correctly, there are times when you would need to manually edit the graph.pbtxt file so that you can use the model in OpenCV. If this happens and you’re sure you have done everything correctly then your best bet is to raise an issue here.

Hopefully, OpenCV will catch up and start supporting TF 2 saved_model format but its gonna take time. If you enjoyed this tutorial then please feel free to comment and I’ll gladly answer you.

Training a Custom Image Classifier with Tensorflow, Converting to ONNX and using it in OpenCV DNN module

Training a Custom Image Classifier with Tensorflow, Converting to ONNX and using it in OpenCV DNN module

In the previous tutorial we learned how the DNN module in OpenCV works, we went into a lot of details regarding different aspects of the module including where to get the models, how to configure them, etc. 

This Tutorial will build on top of the previous one so if you haven’t read the previous post then you can read that here. 

Today’s post is the second tutorial in our brand new 3 part Deep Learning with OpenCV series. All three posts are titled as:

  1. Deep Learning with OpenCV DNN Module, A Comprehensive Guide
  2. Training a Custom Image Classifier with OpenCV, Converting to ONNX, and using it in OpenCV DNN module.
  3. Using a Custom Trained Object Detector with OpenCV DNN Module.

In this post, we will train a custom image classifier with Tensorflow’s Keras API. So if you want to learn how to get started creating a Convolutional Neural Network using Tensorflow, then this post is for you, and not only that but afterward, we will also convert our trained .h5 model to ONNX format and then use it with OpenCV DNN module.

Converting your model to onnx will give you more than 3x reduction in model size.

This whole process shows you how to train models in Tensorflow and then deploy it directly in OpenCV.

What’s the advantage of using the trained model in OpenCV vs using it in Tensorflow ?

So here are some points you may want to consider.

  • By using OpenCV’s DNN module, the final code is a lot compact and simpler.
  • Someone who’s not familiar with the training framework like TensorFlow can also use this model.
  • There are cases where using OpenCV’s DNN module will give you faster inference results for the CPU. See these results in LearnOpenCV by Satya.
  • Besides supporting CUDA based NVIDIA’s GPU, OpenCV’s DNN module also supports OpenCL based Intel GPUs.
  • Most Importantly by getting rid of the training framework (Tensorflow) not only makes the code simpler but it ultimately gets rid of a whole framework, this means you don’t have to build your final application with a heavy framework like TensorFlow. This is a huge advantage when you’re trying to deploy on a resource-constrained edge device, e.g. a Raspberry pie

So this way you’re getting the best of both worlds, a framework like Tensorflow for training and OpenCV DNN for faster inference during deployment.

This tutorial can be split into 3 parts.

  1. Training a Custom Image Classifier in OpenCV with Tensorflow
  2. Converting Our Classifier to ONNX format.
  3. Using the ONNX model directly in the OpenCV DNN module.

Let’s start with the Code

Download Code for this post

Download Code for this post

Part 1: Training a Custom Image Classifier with Tensorflow:

For this tutorial you need OpenCV and Tensorflow 2.2

So you should do:

pip install opencv-contrib-python==
Or install from Source, Make sure to change the version)

pip install tensorflow
(Or install tensorflow-gpu from source)

Note: The reason I’m asking you to install version 4.0 instead of the latest 4.3 version of OpenCV is because later on we’ll be using a function called readNetFromONNX() now with our model this function was giving an error in 4.3 and 4.2, possibly due to some bug in those versions. This does not mean that you can’t use custom models with those versions but that for my specific case there was an issue. Converting models only takes 2-3 lines of code but sometimes you get ambiguous errors which are hard to diagnose, but it can be done.

Hopefully, the conversion process will get better in the future.

One thing you can do is create a custom environment (with Anaconda or virtualenv) in which you can install version 4.0 without affecting your root environment and if you’re using google colab for this tutorial then you don’t need to worry about that.

You can go ahead and download the source code from the download code section. After downloading the zip folder, unzip it and you will have the following directory structure.

You can start by importing the libraries:

Let’s see how you would go about training a basic Convolutional Network in Tensorflow. I assume you know some basics of deep learning. Also in this tutorial, I will be teaching how to construct and train a classifier using a real-world dataset, not a toy one, I will not go in-depth and explain the theory behind neural networks. If you want to start learning deep learning then you can take a look at Andrew Ng’s Deep Learning specialization, although this specialization is basic and covers mostly foundational things now if your end goal is to specialize in computer Vision then I would strongly recommend that you first learn Image Processing and Classical Computer Vision techniques from my 3 month comprehensive course here.

The Dataset we’re going to use here is a dataset of 5 different flowers, namely rose, tulips, sunflower, daisy and dandelion. I avoided the usual cats and dogs dataset.

You can download the dataset from a url, you just have to run this cell

After downloading the dataset you’ll have to unzip it, you can also do this manually.

After extracting you can check the folder named flower_photos in your current directory which will contain these 5 subfolders.

You can check the number of images in each class using the code below.

Found 699 images of sunflowers
Found 898 images of dandelion
Found 633 images of daisy
Found 799 images of tulips
Found 641 images of roses
[‘daisy’, ‘dandelion’, ‘roses’, ‘sunflowers’, ‘tulips’]

Generate Images:

Now it’s time to load up the data, now since the data is approx 218 MB, we can actually load it in RAM but most real datasets are large several GBs in size, and will fit in your RAM. In those scenarios, you use data generators to fetch batches of data and feed it to the neural network during training, so today we’ll also be using a data generator to load the data.

Before we can pass the images to a deep learning model, we need to do some preprocessing, like resize the image in the required shape, convert them to floating-point tensors, rescale the pixel values from 0-255 to 0-1 range as this helps in training.

Fortunately, all of this can be done by the ImageDataGenerator class in tf.keras. Not only that but the ImageDataGenerator Class can also perform data augmentation. Data augmentation means that the generator takes your image and performs random transformations like randomly rotating, zooming, translating, and performing other such operations to the image. This is really effective when you don’t have much data as this increases your dataset size on the fly and your dataset contains more variation which helps in generalization.

As you’ve already seen that each flower class has less than 1000 examples, so in our case data augmentation will help a lot. It will expand our dataset.

When training a Neural Network, we normally use 2 datasets, a training dataset and a validation dataset. The neural network tunes its parameters using the training dataset and the validation dataset is used for the evaluation of the Network’s performance.

Found 2939 images belonging to 5 classes.
Found 731 images belonging to 5 classes.

Note: Usually when using an ImageDataGenerator to read from a directory with data augmentation we usually have two folders for each class because data augmentation is done only to the training dataset, not the validation set as this set is only used for evaluation. So I’ve actually created two data generators instances for the same directory with a validation split of 20% and used a constant random seed on both generators so there is no data overlap.

I’ve rarely seen people split with augmentation this way but this approach actually works and saves us the time of splitting data between two directories.

Visualize Images:

It’s always a good idea to see what images look like in your dataset, so here’s a function that will plot new images from the dataset each time you run it.

Alright, now we’ll use the above function to first display few of the original images using the validation generator.

Now we will generate some Augmented images using the train generator. Notice how images are rotated, zoomed etc.

Create the Model

Since we’re using Tensorflow 2 (TF2) and in TF2 the most popular way to go about creating neural networks is by using the Keras API. Previously Keras used to be a separate framework (it still is) but not so long ago because of Keras’ popularity in the community it was included in TensorFlow as the default high-level API. This abstraction allows developers to use TensorFlow’s low-level functionality with high-level Keras code. 

This way you can design powerful neural networks in just a few lines of code. E.g. take a look at how we have created an effective Convolutional Networks.

A typical neural network has a bunch of layers, in a Convolutional network, you’ll see convolutional layers. These layers are created with the Conv2d function. Take a look at the first layer:

      Conv2D(16, 3, padding=’same’, activation=’relu’, input_shape =(IMG_HEIGHT, IMG_WIDTH ,3))

The number 16 refers to the number of filters in that layer, normally we increase the number of filters as you add more layers. You should notice that I double the number of filters in each subsequent convolutional layer i.e. 16, 32, 64 … , this is common practice. In the first layer, you also specify a fix input shape that the model will accept, which we have already set as 200x200

Another thing you’ll see is that typically a convolutional layer is followed by a pooling layer. So the Conv layer outputs a number of feature maps and the pooling layer reduces the spatial size (width and height) of these feature maps which effectively reduces the number of parameters in the network thus reducing computation.

So you’ll commonly a convolutional layer followed by a pooling layer, this is normally repeated several times, at each stage the size is reduced and the no of filters is increased. We are using a MaxPooling layer there are other pooling types too e.g. AveragePooling.

The Dropout layer randomly drops x% percentage of parameters from the network, this allows the network to learn robust features. In the network above I’m using dropout twice and so in those stages I’m dropping 10% of the parameters. The whole purpose of the Dropout layer is to reduce overfitting.

Now before we add the final layer we need to flatten the output in a single-dimensional vector, this can be done by the flatten layer but a better method is using the  GlobalAveragePooling2D Layer, which flattens the output while reducing the parameters.

Finally, before our last layer, we also use a Dense layer (A fully connected layer) with 1024 units. The final layer contains the number of units equal to the number of classes. The activation function here is softmax as I want the network to produce class probabilities at the end.

Compile the model

Before we can start training the network we need to compile it, this is the step where we define our loss function, optimizer, and metrics.

For this example, we are using the ADAM optimizer and a categorical cross-entropy loss function as we’re dealing with a multi-class classification problem. The only metric we care about right now is the accuracy of the model.

Model summary

By using the built-in method called summary() we can see the whole architecture of the model that we just created. You can see the total parameter count and the number of params in each layer.

Notice how the number of params are 0 in all layers except the Conv and Dense layers, this is because these are the only two types of layers here which are actually involved in learning.

Training the Model:

You can start training the model using the method but first specify the number of epochs, and the steps per epoch. 

Epoch: A single epoch means 1 pass of the whole data meaning an epoch is considered done when the model goes over all the images in the training data and uses it for gradient calculation and optimizations. So this number decides how many times the model will go over your whole data.

Steps per epoch: A single step means the model goes over a single batch of the data, so steps per epoch tells, after how many steps should an epoch be considered done. This should be set to dataset_size / batch_size which is the number of steps required to go over the whole data once.

Let’s train our model for 60 epochs.


You can see in the last epoch that our validation loss is low and accuracy is high so our model has successfully converged, we can further verify this by plotting the loss and accuracy graphs.

After you’re done training it’s a good practice to plot accuracy and loss graphs.

The model has slightly overfitted at the end but that is okay considering the number of images we used and our model’s capacity.

You can test out the trained model on a single test image using this code. Make sure to carry out the same preprocessing steps you used before training for e.g. since we trained on normalized images in range 0-1, we will need to divide any new image with 255 before passing it to the model for prediction.

Predicted Flowers is : roses, 85.61%

Notice that we are converting our model from BGR to RGB color format. This is because TensorFlow has trained the model using images in RGB format whereas OpenCV reads images in BGR format, so we have to reverse channels before we can perform prediction.

Finally when you’re satisfied with the model you save it in .h5 format using function.

Part 2: Converting Our Classifier to ONNX format

Now that we have trained our model, it’s time to convert it to ONNX format.

What is ONNX ?

ONNX stands for Open neural network exchange. ONNX is an industry-standard format for changing model frameworks, this means you can train a model in PyTorch or any other common frameworks and then convert to onnx and then convert back to TensorFlow or any other framework. 

So ONNX allows developers to move models between different frameworks such as CNTK, Caffe2, Tensorflow, PyTorch etc.

So why are we converting to ONNX ?

Remember our goal is to use the above custom trained model in DNN module but the issue is the DNN module does not support using the .h5 Keras model directly. So we have to convert our .h5 model to a .onnx model after doing this we will be able to take the onnx model and plug it into the DNN module.

Note: Even if you saved the model in saved_model format then you still can’t use it directly 

You need to use keras2onnx module to perform the conversion so you should  go ahead and install keras2onnx module.

pip install keras2onnx

You also need to install onnx so that you can save .onnx models to disk.

pip install onnx

After installing keras2onnx, you can use its convert_keras function to convert the model, we will also serialize the model to disk using keras2onnx.save_model  so we can use it later.

tf executing eager_mode: True
tf.keras model eager_mode: False
The ONNX operator number change on the optimization: 57 -> 25

Now we’re ready to use this model in the DNN module. Check how your ~7.5 MB .h5 model now has reduced to ~2.5 MB .onnx model, a 3x reduction in size. Make sure to check out  keras2onnx repo for more details.

Note: You can even use this model with just ONNX using onnxruntime module which itself is pretty powerful considering the support of multiple hardware accelerations.

Using the ONNX model in the OpenCV DNN module:

Now we will take this ONNX model and use it directly in our DNN module.

Let’s use this as a test image.

Here’s the code to test the ONNX model on the image.

Here’s the result of a few images which I took from google, I’m using my custom function classify_flower() to classify these images. You can find this function’s code inside the downloaded Notebook.

If you want to learn about doing image classification using the DNN module in detail then make to read the previous post,  Deep learning with OpenCV DNN module. Where I have explained each step in detail.

What’s Next?

computer vision

If you want to go forward from here and learn more advanced things and go into more detail, understand theory and code of different algorithms then be sure to check out our Computer Vision & Image Processing with Python Course (Urdu/Hindi). In this course, I go into a lot of detail regarding vision fundamentals and cover a plethora of algorithms and techniques to help you master Computer Vision.

The 3 month course contains:

✔ 125 Video Lectures
✔ Discussion Forums
✔ Quizzes
✔ 100+ High Quality Jupyter notebooks
✔ Practice Assignments
✔Certificate of Completion

If you want to start a career in Computer Vision & Artificial Intelligence then this course is for you. One of the best things about this course is that the video lectures are in Urdu/Hindi Language without any compromise on quality, so there is a personal/local touch to it.


In today’s post we first learned how to train an image classifier with tf.keras, after that we learned how to convert our trained .h5 model to .onnx model.

Finally we learned to use this onnx model using OpenCV’s DNN module.

Although the model we converted today was quite basic but this same pipeline can be used for converting complex models too.

A word of Caution: I personally have faced some issues while converting some types of models so the whole process is not foolproof yet but it’s still pretty good. Make sure to look at keras2onnx repo and this excellent repo of ONNX conversion tutorials.

Deep Learning with OpenCV DNN Module, A Comprehensive Guide

Deep Learning with OpenCV DNN Module, A Comprehensive Guide

In this tutorial we will go over OpenCV’s DNN module in detail, I plan to cover various important details of the DNN module that is never discussed, things that usually trip of people like, selecting preprocessing params correctly and designing pre and postprocessing pipelines for different models.

This post is the first of 3 in our brand new Deep Learning with OpenCV series. All three posts are titled as:

  1. Deep Learning with OpenCV DNN Module, A Comprehensive Guide
  2. Training a Custom Image Classifier with Tensorflow, Converting to ONNX and using it in OpenCV DNN module
  3. Using a Custom Trained Object Detector with OpenCV DNN Module

This post can be split into 3 sections.

  1. Introduction to OpenCV’s DNN module.
  2. Using a Caffe DenseNet121 model for classification.
  3. Important Details regarding the DNN module, e.g. where to get models, how to configure them, etc.

If you’re just interested in the image classification part then you can skip to the second section or you can even read this great classification with DNN module post by Adrian. However, if you’re interested in getting to know the DNN module in all its glory then keep reading.

Introduction to OpenCV’s DNN module

First let me start by introducing the DNN module for all those people who are new to it, so as you can probably guess, the DNN module stands for Deep Neural Network module. This is the module in OpenCV which is responsible for all things deep learning related.

It was introduced in OpenCV version 3 and now in version 4.3 it has evolved a lot. This module lets you use pre trained neural networks from popular frameworks like tensorflow, pytorch  etc and use those models directly in OpenCV.

This means you can train models using a popular framework like Tensorflow and then do inference/prediction with just OpenCV.

So what are the benefits here?

Here are some advantages you might want to consider when using OpenCV for inference.

  • By using OpenCV’s DNN module for inference the final code is a lot compact and simpler.
  • Someone who’s not familiar with the training framework can also use the model.
  • Beside supporting CUDA based NVIDIA’s GPU, OpenCV’s DNN module also supports OpenCL based Intel GPUs.
  • Most Importantly by getting rid of the training framework not only makes the code simpler but it ultimately gets rid of a whole framework, this means you don’t have to build your final application with a heavy framework like TensorFlow. This is a huge advantage when you’re trying to deploy on a resource-constrained edge device, e.g. a Raspberry pie.

One thing that might put you off is the fact that OpenCV can’t be used for training deep learning networks. This might sound like a bummer but fret not, for training neural networks you shouldn’t use OpenCV there are other specialized libraries like Tensorflow, PyTorch etc for that task.

So which frameworks can you use to train Neural Networks:

These are the frameworks that are currently supported with the DNN module.

Now there are many interesting pre-trained models already available in the OpenCV Model Zoo that you can use, to keep things simple for this tutorial, I will be using an image classification network to do classification.

I have also made a tutorial on doing Super-Resolution with DNN module and Facial expression recognition that you can look at after going through this post.

Details regarding other types of models are discussed in the 3rd section. By the way, I actually go over 13-14 different types of models in our Computer Vision and Image processing Course. These contain notebooks tutorials and video walk-throughs.

Image Classification pipeline with OpenCV DNN

Now we will be using a DenseNet121 model, which is a caffe model trained on 1000 classes of ImageNet. The model is from the paper Densely Connected Convolutional Networks by Gap Huang et al.

Generally there are 4 steps you need to perform when doing deep learning with DNN module.

  1. Read the image and the target classes.
  2. Initialize the DNN module with an architecture and model parameters.
  3. Perform the forward pass on the image with the module
  4. Post-process the results.

The pre and post processing steps are different for different tasks.

Let’s start with the code

Download Code for this post

Download Code for this post

You can go ahead and download the source code from the download code section. After downloading the zip folder, unzip it and you will have the following directory structure.

Now run the Image Classification with DenseNet121.ipynb notebook, and start executing the cells.

Import Libraries

First we will import the required libraries.

Loading Class Labels

Now we’ll start by loading class names, In this notebook, we are going to classify among 1000 classes defined in ImageNet.

All these classes are in the text file named synset_words.txt. In this text file, each class is in on a new line with its unique id, Also each class has multiple labels for e.g look at the first 3 lines in the text file:

  • ‘n01440764 tench, Tinca tinca’
  • ‘n01443537 goldfish, Carassius auratus’
  • ‘n01484850 great white shark, white shark

So for each line, we have the Class ID, then there are multiple class names, they all are valid names for that class and we’ll just use the first one. So in order to do that we’ll have to extract the second word from each line and create a new list, this will be our labels list.

Number of Classes 1000 [‘n01440764 tench, Tinca tinca’, ‘n01443537 goldfish, Carassius auratus’, ‘n01484850 great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias’, ‘n01491361 tiger shark, Galeocerdo cuvieri’, ‘n01494475 hammerhead, hammerhead shark’]

Extract the Label

Here we will extract the labels (2nd element from each line) and create a labels list.

[‘tench’, ‘goldfish’, ‘great white shark’, ‘tiger shark’, ‘hammerhead’, ‘electric ray’, ‘stingray’, ‘cock’, ‘hen’, ‘ostrich’, ‘brambling’, ‘goldfinch’, ‘house finch’, ‘junco’, ‘indigo bunting’, ‘robin’, ‘bulbul’, ‘jay’, ‘magpie’, ‘chickadee’]

Initializing the DNN Module

Now before we can use the DNN Module we must initialize it using one of the following functions.

  • Caffe Modles: cv2.dnn.readNetFromCaffe
  • Tensorflow Models: cv2.dnn.readNetFromTensorFlow
  • Pytorch Models: cv2.dnn.readNetFromTorch

As you can see the function you use depends upon Original Architecture the model was trained on.

Since we’ll be using a DenseNet121 which was trained using Caffe so our function will be:

retval = cv2.dnn.readNetFromCaffe( prototxt[, caffeModel] )


  • prototxt: Path to the .prototxt file, this is the text description of the architecture of the model.
  • caffeModel: path to the .caffemodel file, this is your actual trained neural network model, it contains all the weights/parameters of the model. This is usually several MBs in size.

Note: If you load the model and proto file via readNetFromTensorFlow then the order of architecture and model inputs are reversed.

Read An Image

Let’s read an example image and display it with matplotlib imshow

Pre-processing the image

Now before you pass an image in the network you need to preprocess it, this means resizing the image to the size it was trained on, for many networks, this is 224×224, in pre-processing step you also do other things like Normalize the image (make the range of intensity values between 0-1) and mean subtraction, etc. These are all the steps the authors did on the images that were used during model training.

Fortunately, In OpenCV you have a function called cv2.dnn.blobFromImage() which most of the time takes care of all the pre-processing for you.

blob = cv2.dnn.blobFromImage(image[, scalefactor[, size[, mean[, swapRB[, crop]]]]])


  • Image Input image.
  • Scalefactor Used to normalize the image. This value is multiplied by the image, value of 1 means no scaling is done.
  • Size The size to which the image will be resized to, this depends upon the each model.
  • Mean These are mean R,G,B Channel values from the whole dataset and these are subtracted from the image’s R,G,B respectively, this gives illumination invariance to the model.
  • swapRB Boolean flag (false by default) this indicates weather swap first and last channels in 3-channel image is necessary.
  • crop flag which indicates whether the image will be cropped after resize or not. If crop is true, input image is resized so one side after resize is equal to the corresponding dimension in size and another one is equal or larger. Then, a crop from the center is performed. If crop is false, direct resize without cropping and preserving aspect ratio is performed.

So After this function we get a 4d blob, this is what we’ll pass to the network.

(1, 3, 224, 224)

Note: There is also blobFromImages() which does the same thing but with multiple images.

Input the Blob Image to the Network 

Here you’re setting up the blob image as the input to the network.

Forward Pass 

Here the actual computation will take place, Most of the time in your whole pipeline will be taken here. Here your image will go through all the model parameters and in the end, you will get the output of the classifier.

Wall time: 166 ms

Total Number of Predictions are: 1000

array([[[-2.0572357 ]], [[-0.18754716]], [[-3.314731 ]], [[-6.196114 ]]], dtype=float32)

Apply Softmax Function to get Probabilities

By looking at the output, you can tell that the model has returned a set of scores for each class but we need Probabilities between 0-1 for each class. We can get them by applying a softmax function on the scores.

array([5.7877337e-06, 3.7540856e-05, 1.6458317e-06, 9.2260699e-08], dtype=float32)

The Maximum probability is the confidence of our target class.


The index Containing the maximum confidence/probability is the index of our target class.


By putting the index from above into our labels list we can get the name of our target class.


As we have successfully performed the classification, now we will just annotate the image with the information we have.

Creating Functions 

Now that we have understood step by step how to create the pipeline for classification using OpenCV’s DNN module, we’ll now create functions that do all the above in a single step. In short we will be creating following two functions.

Initialization Function: This function will contain parts of the network that will be set once, like loading the model.

Main Function: This function will contain all the rest of the code from preprocessing to postprocessing, it will also have the option to either return the image or display it with matplotlib.

Initialization Function

This method will be run once and it will initialize the network with the required files.

Main Method

returndata is set to True when we want to perform classification on video.

Initialize the Classifier

Calling our initializer to initialize the network.

Using our Classifier Function

Now we can call our classifier function and test on multiple images.

Real time Image Classification

If you want to this classifier in real time then here is the code for that.

Important Details Regarding the DNN module 

Let’s discuss some interesting details and some tips to fully utilize the DNN module.

Where to get the pre-trained Models:

Earlier I mentioned that you can get other pre-trained models, so where are they? 

The best place to get pre-trained models is here. This page is a wiki for Deep learning with OpenCV, you will find models that have been tested by the OpenCV team.

There are a variety of models present here, for things like Classification, Pose Detection, Colorization, Segmentation, Face recognition, text detection, style transfer, and more. You can take models from any of the above 5 frameworks.

Just click on the models to go to their repo and download them from there. Note: The models listed on the page above are only the tested models, in theory, you can almost take any pre-trained model and use it in OpenCV. 

A faster and easier way to download models is to go here. Now, this is a python script that will let you download not only the most commonly used models but also some State of the Art ones like Yolo v4 etc. You can download this script and then run from the command line. Alternatively, if you’re in a rush and just one specific model then you can take the downloadable URL of any model and download it.

After downloading the model, you will need a couple of more things before you can actually use the model in the OpenCV dnn module.

You’re now probably familiar with those things, so yeah you will need the model configuration file like the prototxt file we just used with our Caffe model above. You will also need class labels, now for classification problems, models are usually trained on the ImageNet dataset so we needed synset_word.txt file, for Object detection you will find models trained on COCO or Pascal VOC dataset. And similarly, other tasks may require other files.

So where are all these files present ?

You will find most of these configuration files present here and the class names here. If the configuration file you’re looking for is not present in the above links then I would recommend that you look at the GitHub repo of the model, the files would be present there. Otherwise, you have to create it yourself. (More on this later)

After getting the configuration files, the only thing you need is the pre-processing parameters that go in blobFromImage. E.g. the mean subtraction values, scaling params etc. 

You can get that information from here. Now, this script only contains parameter details for a few popular models. 

So how do you get the details for other models ?

For that you would need to go to the repo of the model and look in the ReadMe section, the authors usually put that information there. 

For e.g. If I visit the github repo of the Human Pose Estimation model using this link which I got from the model downloading script.

By scrolling down the readme I can find these details here:

Note: These details are not always present in the Readme and sometimes you have to do quite some digging before you can find these parameters.

What to do if there is no GitHub repo link with the model, for e.g. this shuffleNet model does not have a GitHub link, in that case, I can see that the framework is ONNX.

So now I will visit the ONNX model zoo repo and find that model. 

After clicking on the model I will find its readme and then its preprocessing steps.

Notice that this model contains some preprocessing steps that are not supported by blobfromImage function. So this could happen and at times you would need to write custom preprocessing steps without using blobfromImage function, for e.g. in our Super Resolution post, I had to write a custom pre-processing pipeline for the network.

How to use our own Custom Trained Networks

Now that we have learned to use different models, you might wonder exactly how can we use our own custom trained models. So the thing is you can’t directly plug a trained network in a DNN module but you need to perform some operations to get a configuration file, which is why we needed a prototxt file along with the model.

Fortunately, In the next two blog posts, I plan to cover exactly this topic and show you how to use a custom trained classifier and a custom trained Detection network.

For now, you can take a look at this page which briefly describes how you can use models trained with Tensorflow Object Detection API in OpenCV.

One thing to note is that not all networks are supported by the DNN module, this is because DNN module supports some 30+ layer types, these layer names can be found at the wiki here. So if a model contains layers that are not among the supported layers then it won’t run, this is not a major issue as most common layers used in deep learning models are supported. 

Also OpenCV provides a way for you to define your own custom layers.

Using GPU’s and Faster Backends to speed up OpenCV DNN Module

By default OpenCV’s DNN module runs on the default C++ implementation which itself is pretty fast but OpenCV further allows you to change this backend to increase the speed even more.

Option 1: Use NVIDIA GPU with CUDA backend in the DNN module:

If you have an Nvidia GPU present then great, you can use that with the DNN module, you can follow my OpenCV source installation guide to configure your NVIDIA GPU for OpenCV and learn how to use it. This will make your networks run several times faster.

Option 2: Use OpenCL based INTEL GPU’s:

If you have an OpenCL based GPU then you can use that as a backend, although this increases speed but in my experience, I’ve seen speed gains only in 32 bit systems. To use the OpenCL as a backend you can see the last section of my OpenCV source installation section linked above.

Option 3: Use Halide Backend:

As described on this post from, for some time in the past using the halide backend increased the speed but then OpenCV engineer’s optimized the default C++ implementation so much that the default implementation actually got faster. So I don’t see a reason to use this backend now, Still here’s how you configure halide as a backend.

Option 4: Use Intel’s Deep Learning Inference Engine backend:

Intel’s Deep Learning Inference Engine backend is part of OpenVINO toolkit, OpenVINO stands for Open Visual Inferencing and Neural Network Optimization. OpenVINO is designed by Intel to speed up inference with neural networks, especially for tasks like classification, detection, etc. OpenVINO speeds up by optimizing the model in a hardware-agnostic way. You can learn to install OpenVINO here and here’s a nice tutorial for it.

What’s Next?

computer vision

If you want to go forward from here and learn more advanced things and go into more detail, understand theory and code of different algorithms then be sure to check out our Computer Vision & Image Processing with Python Course (Urdu/Hindi). In this course, I go into a lot of detail regarding vision fundamentals and cover a plethora of algorithms and techniques to help you master Computer Vision.

The 3 month course contains:

✔ 125 Video Lectures
✔ Discussion Forums
✔ Quizzes
✔ 100+ High Quality Jupyter notebooks
✔ Practice Assignments
✔Certificate of Completion

If you want to start a career in Computer Vision & Artificial Intelligence then this course is for you. One of the best things about this course is that the video lectures are in Urdu/Hindi Language without any compromise on quality, so there is a personal/local touch to it.


In today’s tutorial, we went over a number of things regarding OpenCV’s DNN module. From using pre-trained models to Optimizing for faster inference speed.

We also learned to perform a classification pipeline using densenet121.

This post should serve as an excellent guide for anyone trying to get started in Deep learning using OpenCV’s DNN module.

Finally, OpenCV’s DNN repo contains an example python scripts to run common networks like classification, text, object detection, and more. You can start utilizing the DNN module by using these scripts and here are a few DNN Tutorials by OpenCV.

The main contributor for the DNN module in OpenCV is Dmitry Kurtaev and formerly it was Aleksandr Rybnikov, so big thanks to them and the rest of the contributors for making such a great module.

I hope you enjoyed today’s tutorial, feel free to comment and ask questions.

A Crash Course with Dlib Library, 101 to Mastery

A Crash Course with Dlib Library, 101 to Mastery

Main Image

This tutorial will serve as a crash course to dlib library. Dlib is another powerful computer vision library out there. It is not as extensive as OpenCV but still, there is a lot you can do with it.

This crash course assumes you’re somewhat familiar with OpenCV, if not then I’ve also published a crash course on OpenCV too. Make sure to download Dlib Resource Guide above which includes all important links in this post.

Side Note: I missed publishing a tutorial last week as I tested covoid positive and was ill, still not 100% but getting better 🙂

Dlib is created and maintained by Davis King, It’s a C++ toolkit containing machine learning & Computer Vision algorithms for a number of important tasks including, Facial Landmark detection, Deep Metric Learning, Object tracking and more. It also has a python API.

Note: It’s worth noting that the main power of dlib is in numerical optimization but today I’m only going to focus on applications, you can look at optimization examples here.

Its a popular library which is used by people in both industry and academia in a wide range of domains including robotics, embedded devices and other areas.

I plan to cover most of the prominent features and algorithms present in dlib so this blog post alone can give you the best overview of dlib and its functionality. Now, this is a big statement, If I had to explain most of dlib features in a single place then I would probably be writing a book or making a course on it but rather I plan to explain it all in this post.

So how am I going to accomplish that?

So here’s the thing I’m not going to write and explain the code for each algorithm with dlib, because I don’t want to write several thousand’s of words worth of a blog post and also because almost all of the features of dlib have been explained pretty well in several posts on the internet.

So if everything is out there then why the heck am I trying to make a crash course out of it ?

So here’s the real added value of this crash course:

In this post, I will connect all the best and the most important tutorials on different aspects of dlib out there in a nice hierarchical order. This will not only serve as a golden Dlib 101 to Mastery post for people just starting out with dlib but will also serve as a well-structured reference guide for dlib users.

The post is split into various sections, in each section, I will briefly explain a useful algorithm or technique present in dlib. If that explanation intrigues you and you feel that you need to explore that particular algorithm further then in each section I provide links to high-quality tutorials that goes in-depth about that function, the links would mostly be from Pyimagesearch, LearnOpenCV as these are golden sites when it comes to Computer Vision Tutorials. 

When learning some topic, ideally we prefer these two things:

  • A Collection of all the useful material regarding the topic presented at one place in a nice and neat hierarchical order.
  • Each material presented and delivered in a high-quality format preferably by an author who knows how to teach it the right way.

In this post, I’ve made sure both of these points are true, all the information is presented in a nice order and the posts that I link to will be of high quality. Other than that I will also try to include other extra resources where I feel necessary. 

Now let’s get started

Download Resource Guide for this post

Download Resource Guide for this post

Here’s the outline for this crash course:


The easiest way to install dlib is to do:

pip install dlib

This will only work if you have Visual Studio (i.e. you need a C++ compiler) and CMake installed as dlib will build and compile first before installing. If you don’t have these then you can use my OpenCV’s source installation tutorial to install these two things.

If you don’t want to bother installing these then here’s what you can do, if you have a python version greater then 3.6 then create a virtual environment for python 3.6 using Anaconda or virtualenv.

After creating a python 3.6 environment you can do:

pip install dlib==19.8.1

This will let you directly install pre-built binaries of dlib but this currently only works with python 3.6 and below.

Extra Resources:

Installing dlib in Mac, Raspi & Ubuntu.

Face Detection:

Now that we have installed dlib, let’s start with face detection.

Why face detection ?

Well, most of the interesting use cases in dlib for computer vision is with faces, like facial landmark detection, face recognition, etc so before we can detect facial landmarks, we need to detect faces in the image.

Dlib not only comes with a face detector but it actually comes with 2 of them. If you’re a computer vision practitioner then you would most likely be familiar with the old Haar cascade based face detector. Although this face detector is a lot popular, it’s almost 2 decades old and not very effective when it comes to different orientations of the faces.

Dlib comes with 2 face detection algorithms that are way more effective than the haar cascade based detectors.

These 2 detectors are:

HOG (histogram of oriented gradients) based detection: This detector uses HOG and Support vector machines, its slower than haar cascades but its more accurate and able to handle different orientations
CNN Based Detector: This is a really accurate deep learning based detector but its extremely slow on a CPU, you should only use this if you’ve compiled dlib with GPU.

You can learn more about these detectors here. Other than that I published a library called bleedfacedetector which lets you use these 2 detectors using just a few lines of the same code, and the library also has 2 other face detectors including the haar cascade one. You can look at bleedfacedetector here.

Extra Resources:

Here’s a tutorial on different Face detection methods including the dlib ones.

Facial Landmark Detection:

Now that we have learned how to detect faces in images, we will now learn the most common use case of dlib library which is facial landmark detection, with this method you will be able to detect key landmarks/features of the face like eyes, lips, etc.

The detection of these features will allow you to do a lot of things like track the movement of eyes, lips to determine the facial expression of a person, control a virtual Avatar with your facial expressions, understand 3d facial pose of a person, virtual makeover, face swapping, morphing, etc.

Remember those smart Snapchat overlays which trigger based on the facial movement, like that tongue that pops out when you open your mouth, well you can also make that using facial landmarks.

So its suffice to say that Facial landmark detection has a lot of interesting applications.

The landmark detector in dlib is based on the paper “One Millisecond Face Alignment with an Ensemble of Regression Trees”, its robust enough to correctly detect landmarks in different facial orientations and expressions. And it easily runs in real-time.

The detector returns 68 important landmarks, these can be seen in below image.

The 68 specific human face landmarks | Download Scientific Diagram

You can read a detailed tutorial on Facial Landmark detection here.

After reading the above tutorial the next step is to learn to manipulate the ROI of these landmarks so, you can modify or extract the individual features like the eyes, nose lips, etc. You can learn that by reading this Tutorial.

After you have gone through both of the above tutorials then you’re ready for running the landmark detector in real time but if you’re still confused about the exact process then take a look at this tutorial

Extra Resources:

Here’s another great tutorial on Facial Landmark Detection.

Facial Landmark Detection Applications (Blink, yawn, smile detection & Snapchat filters):

After you’re fully comfortable working with facial landmarks that’s when the fun starts. Now you’re ready to make some exciting applications, you can start by making a blink detection system by going through the tutorial here. 

The main idea for a blink detection system is really simple, you just look at 2 vertical landmark points of the eyes and take the distance between these points, if the distance is too small (below some threshold) then that means the eyes are closed.

Of course, for a robust estimate, you won’t just settle for the distance between two points but rather you will take a smart average of several distances. One smart approach is to calculate a metric called Eye aspect ratio (EAR) for each eye. This metric was introduced in a paper called “ Real-Time Eye Blink Detection using Facial Landmarks

This will allow you to utilize all 6 x,y landmark points of the eyes returned by dlib, and this way you can accurately tell if there was a blink or not.

Here’s the equation to calculate the EAR.

The full implementation details are explained in the tutorial linked above.

You can also easily extend the above method to create a drowsiness detector that alerts drivers if they feel drowsy, this can be done by monitoring how long the eyes are closed for. This is a really simple extension of the above and have real-world applications and could be used to save lives. Here’s a tutorial that explains how to build a step by step drowsiness detection system.

Interestingly you can take the same blink detection approach above and apply it to lips instead of the eyes, and create a smile detector. Yeah, the only thing you would need to change would be the x,y point coordinates (replace eye points with lip points), the EAR equation (use trial and error or intuition to change this), and the threshold.

Few years back I created this smile camera application with only a few lines of code, it takes a picture when you smile. You can easily create that by modifying the above tutorial.

What more can you create with this ?

How about a yawn detector, or a detector that tells if the user’s mouth is opened or not. You can do this by slightly modifying the above approach, you will be using the same lips x,y landmark points, the only difference would be how you’re calculating the distance between points.

Here’s a cool application I built a while back, its the infamous google dino game that’s controlled by me opening and closing the mouth.

The only drawback of the above application is that I can’t munch food while playing this game.

Taking the same concepts above you can create interesting snapchat overlay triggers. 

Here’s an eye bulge and fire throw filter I created that triggers when I glare or open my mouth.

Similarly you can create lots of cool things using the facial landmarks.

Facial Alignment & Filter Orientation Correction:

Doing a bit of math with the facial landmarks will allow you to do facial alignment correction. Facial alignment allows you to correctly orient a rotated face.

Why is facial alignment important?

One of the most important use case for facial alignment is in face recognition, there are many classical face recognition algorithms that will perform better if the face is oriented correctly before performing inference on them.

Here’s a full tutorial on facial Alignment.

One other useful thing concerning facial alignment is that you can actually extract the angle of the rotated face, this is pretty useful when you’re working with an augmented reality filter application as this will allow you to rotate the filters according to the orientation of the face.

Here’s an application I built that does that. 

Head Pose Estimation:

A problem similar to facial alignment correction could be head pose estimation. In this technique instead of determining the 2d head rotation, you will learn to extract the full 3d head pose orientation. This is particularly useful when you’re working with an augmented reality application like overlaying a 3d mask on the face. You will only be able to correctly render the 3d object on the face if you know the face’s 3d orientation.

Here’s a great tutorial that teaches you head pose estimation in great detail.

Single & Multi-Object Tracking with Dlib:

Landmark detection is not all dlib has to offer, there are other useful techniques like a correlation tracking algorithm for Object Tracking that comes packed with dlib.

The tracker is based on Danelljan et al’s 2014 paper, Accurate Scale Estimation for Robust Visual Tracking

This tracker works well with changes in translation and scale and it works in real time.

Object Detection VS  Object Tracking:

If you’re just starting out in your computer vision journey and have some confusion regarding object detection vs tracking then understand that in Object Detection, you try to find an instance of the target object in the whole image. And you perform this detection in each frame of the video. There can be multiple instances of the same object and you’ll detect all of them with no differentiation between those object instances.

What I’m trying to say above is that a single image or frame of a video can contain multiple objects of the same class for e.g. multiple cats can be present on the same image and the object detector will see it as the same thing CAT with no difference between the individual cats throughout the video.

Whereas an Object Tracking algorithm will track each cat separately in each frame and will recognize each cat by a unique ID throughout the video. 

You can read this tutorial that goes over Dlib correlation tracker.

After reading the above tutorial you can go ahead and read this tutorial for using the correlation tracker to track multiple objects.

Face Swapping, Averaging & Morphing:

Here’s a series of cool facial manipulations you can do by utilizing facial landmarks and some other techniques.

Face Morphing:

What you see in the above video is called facial morphing. I’m sure you have seen such effects in other apps and movies. This effect is a lot more than a simple image pixel blending or transition.

To have a morph effect like the above, you need to do image alignment, establish pixel correspondences using facial landmark detection and more.

Here’s a nice tutorial that teaches you face morphing step by step.

By understanding and utilizing facial morphing techniques you can even do morphing between dissimilar objects like a face to a lion.

Face Swapping:

After you’ve understood face morphing then another really interesting you can do is face swapping, where you take a source face and put it over a destination face. Like putting Modi’s face over Musharaf’s above.

The techniques underlying face swapping is pretty similar to the one used in face morphing so there is not much new here.

The way this swapping is done makes the results look real and freakishly weird. See how everything from lightning to skin tone is matched.

Here’s a full tutorial on face swapping.

Tip: If you want to make the above code work in real-time then you would need to replace the seamless cloning function with some other faster cloning method, the results won’t be as good but it’ll work in real-time.

Alternative Tutorial:
Switching eds with python

Note: It should be noted this technique although gives excellent results but the state of the art in face swapping is achieved by deep learning based methods (deepfakes, FaceApp etc).

Face Averaging:

Average face of: Aiman Khan, Ayeza Khan, Mahira Khan, Mehwish Hayat, Saba Qamar & Syra Yousuf 

Similar to above methods there’s also Face averaging where you smartly average several faces together utilizing facial landmarks.

The face image you see above is the average face I created using 6 different Pakistani female celebrities.

Personally speaking out of all the applications here I find face averaging the least useful or fun. But Satya has written a really interesting Tutorial on face averaging here that is worth a read.

Face Recognition:

It should not come as a surprise that dlib also has a face recognition pipeline, not only that but the Face recognition implementation is really robust one and is a modified version of  ResNet-34, based on the paper “ Deep Residual Learning for Image Recognition paper by He et al.”, it has an accuracy of 99.38% on the Labeled Faces in the Wild (LFW) dataset. This dataset contains ~3 million images.

The model was trained using deep metric learning and for each face, it learned to output a 128-dimensional vector. This vector encodes all the important information about the face. This vector is also called a face embedding.

First, you will store some face embeddings of target faces and then you will test on different new face images. Meaning you will extract embedding from test images and compare it with the saved embeddings of the target faces.

If two vectors are similar (i.e. the euclidean distance between them is small) then it’s said to be a match. This way you can make thousands of matches pretty fast. The approach is really accurate and works in real-time.

Dlib’s Implementation of face recognition can be found here. But I would recommend that you use the face_recognition library to do face recognition.This library uses dlib internally and makes the code a lot simpler.

You can follow this nice tutorial on doing face recognition with face_recognition library.

Extra resources:

An Excellent Guide on face recognition by Adam Geitgey.

Face Clustering:

Image Credit: Dlib Blog

Consider this, you went to a museum with a number of friends, all of them asked you to take their pictures behind several monuments/statues such that each of your friend had several images of them taken by you. 

Now after the trip, all your friends ask for their pictures, now you don’t want to send each of them your whole folder. So what can you do here?

Fortunately, face clustering can help you out here, this method will allow you to make clusters of images of each unique individual.

Consider another use case: You want to quickly build a face recognition dataset for 10 office people that reside in a single room. Instead of taking manual face samples of each person, you instead record a short video of everyone together in the room, you then use a face detector to extract all the faces in each frame, and then you can use a face clustering algorithm to sort all those faces into clusters/folders. Later on, you just need to name these folders and your dataset is ready.

Clustering is a useful unsupervised problem and has many more use cases.
Face clustering is built on top of face recognition so once you’ve understood the recognition part this is easy.

You can follow this tutorial to perform face clustering.

Training a Custom Landmark Predictor:

Just like the Dlib’s Facial Landmark detector, you can train your own custom landmark detector. This detector is also called a shape predictor. Now you aren’t restricted to only facial landmarks but you can go ahead and train a landmark detector for almost anything, body joints of a person, some key points of a particular object, etc. 

As long as you can get sufficient annotated data for the key points, you can use dlib to train a landmark detector on it.

Here’s a tutorial that teaches you how to train a custom Landmark detector.

After going through the above tutorial, you may want to learn how to further optimize your trained model in terms of model size, accuracy, and speed. 

So there are multiple Hyperparameters that you can tune to get better performance, here’s a tutorial that lets you automate the tuning process, also take a look a this too.

Extra Resources:

Here’s another tutorial on training a shape predictor.

Training a Custom Object Detector:

Just like a custom landmark detector, you can train a custom Object detector with dlib. Dlib uses Histogram of Oriented Gradients (HOG) as features and a Support Vector Machine (SVM) Classifier. This combined with sliding windows and image pyramids, you’ve got yourself an Object detector. The only limitation is that you can train it to detect a single object at a time.

The Object detection approach in dlib is based on the same series of steps used in the sliding window based object detector first published by Dalal and Triggs in 2005 in the Histograms of Oriented Gradients for Human Detection.

HOG + SVM based detector are the strongest non Deep learning based approach for object detection, Here’s a hand detector I built using this approach a few years back. 

I didn’t even annotated nor collected training data for my hands but instead made a sliding window application that automatically collected my hand pictures as it moved on the screen and I placed my hands in the bounding box.

Afterward, I took this hand detector created a  Video car game controller, so now I was steering the Video game car with my hands literally. To be honest, that wasn’t a pleasant experience, my hand was sore afterwards. Making something cool is not hard but it would take a whole lot effort to make a practical VR or AR-based application. 

Here’s Dlib Code for Training an Object Detector and here’s a blog post that teaches you how to do that.

Extra Resources:
Here’s another Tutorial on training the detector.

Dlib Optimizations For Faster & Better Performance:

Here’s a bunch of techniques and tutorials that will help you get the most out of dlib’s landmark detection.

Using A Faster Landmark Detector:

Beside’s the 68 point landmark detector, dlib also has 5 point landmark detector that is 10 times smaller and faster (about 10%) than the 68 point one. If you need more speed and the 5 landmark points as visualized above is all you need then you should opt for this detector. Also from what I’ve seen its also somewhat more efficient than the 68 point detector.

Here’s a tutorial that explains how to use this faster landmark detector.

Speeding Up the Detection Pipeline:

There are a bunch of tips and techniques that you can use to get a faster detection speed, now a landmark detector itself is really fast, the rest of the pipeline takes up a lot of time. Some tricks you can do to increase speed are:

Skip Frames:

If you’re reading from a high fps camera then it won’t hurt to perform detection on every other frame, this will effectively double your speed.

Reduce image Size: 

If you’re using Hog + Sliding window based detection or a haar cascade + Sliding window based one then the face detection speed depends upon the size of the image. So one smart thing you can do is reduce the image size before face detection and then rescale the detected coordinates for the original image later.

Both of the above techniques and some others are explained in this tutorial.

Tip: The biggest bottleneck you’ll face in the landmark detection pipeline is the HOG based face detector in dlib which is pretty slow. You can replace this with haar cascades or the SSD based face detector for faster performance.

What’s Next?

computer vision

If you want to go forward from here and learn more advanced things and go into more detail, understand theory and code of different algorithms then be sure to check out our Computer Vision & Image Processing with Python Course (Urdu/Hindi). In this course, I go into a lot of detail regarding vision fundamentals and cover a plethora of algorithms and techniques to help you master Computer Vision.

The 3 month course contains:

✔ 125 Video Lectures
✔ Discussion Forums
✔ Quizzes
✔ 100+ High Quality Jupyter notebooks
✔ Practice Assignments
✔Certificate of Completion

If you want to start a career in Computer Vision & Artificial Intelligence then this course is for you. One of the best things about this course is that the video lectures are in Urdu/Hindi Language without any compromise on quality, so there is a personal/local touch to it.


Let’s wrap up, in this tutorial we went over a number of algorithms and techniques in dlib.

We started with installation, moved on to face detection and landmark prediction, and learned to build a number of applications using landmark detection. We also looked at other techniques like correlation tracking and facial recognition.

We also learned that you can train your own landmark detectors and object detectors with dlib.

At the end we learned some nice optimizations that we can do with our landmark predictor. 

Extra Resources:

Final Tip: I know most of you won’t be able to go over all the tutorials linked here in a single day so I would recommend that you save and bookmark this page and tackle a single problem at a time. Only when you’ve understood a certain technique move on to the next.

It goes without saying that Dlib is a must learn tool for serious computer vision practitioners out there.

I hope you enjoyed this tutorial and found it useful. If you have any questions feel free to ask them in the comments and I’ll happily address it.

Super Resolution, Going from 3x to 8x Resolution in OpenCV

Super Resolution, Going from 3x to 8x Resolution in OpenCV

A few weeks ago I published a tutorial on doing Super-resolution with OpenCV using the DNN module.

I would recommend that you go over that tutorial before reading this one but you can still easily follow along with this tutorial. For those of you who don’t know what Super-resolution is then here is an explanation.

Super Resolution can be defined as the class of Algorithms that upscales an image without losing quality, meaning you take a  low-resolution image like an image of size 224×224 and upscale it to a high-resolution version like 1792×1792 (An 8x resolution)  without any loss in quality. How cool is that?

Anyways that is Super resolution, so how is this different from the normal resizing you do?

When you normally resize or upscale an image you use Nearest Neighbor Interpolation. This just means you expand the pixels of the original image and then fill the gaps by copying the values of the nearest neighboring pixels.

The result is a pixelated version of the image.

There are better interpolation methods for resizing like bilinear or bicubic interpolation which take weighted average of neighboring pixels instead of just copying them.

Still the results are blurry and not great.

The super resolution methods enhance/enlarge the image without the loss of quality, Again, for more details on the theory of super resolution methods, I would recommend that you read my Super Resolution with OpenCV Tutorial.

In the above tutorial I describe several architectural improvements that happened with SR Networks over the years.

But unfortunately in that tutorial, I only showed you guys a single SR model which was good but it only did a 3x resolution. It was also from a 2016 paper Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network” 

That all changes now, in this tutorial we will work with multiple models, even those that will do 8x resolution.

Today, we won’t be using the DNN module, we could do that but for the super resolution problem OpenCV comes with a special module called dnn_superres which is designed to use 4 different powerful super resolution networks. One of the best things about this module is that It does the required pre and post processing internally, so with only a few lines of code you can do super resolution.

The 4 models we are going to use are:

  • EDSR: Enhanced Deep Residual Network from the paper Enhanced Deep Residual Networks for Single Image Super-Resolution (CVPR 2017) by Bee Lim et al.

  • ESPCN: Efficient Subpixel Convolutional Network from the paper Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network (CVPR 2016) by Wenzhe Shi et al.

  • FSRCNN: Fast Super-Resolution Convolutional Neural Networks from the paper Accelerating the Super-Resolution Convolutional Neural Network (ECCV 2016) by Chao Dong et al.

  • LapSRN: Laplacian Pyramid Super-Resolution Network from the paper Deep Laplacian pyramid networks for fast and accurate super-resolution (CVPR 2017) by Wei-Sheng Lai et al.

Here are the papers for the models and some extra resources.

Make sure to download the zip folder from the download code section above. As you can see by clicking the Download models link that each model has different versions like 3x, 4x etc. This means that the model can perform 3x resolution, 4x resolution of the image, and so on. The download zip that I provide contains only a single version of each of the 4 models above.

You can feel free to test out other models by downloading them. These models should be present in your working directory if you want to use them with the dnn_superres module.

Now the inclusion of this super easy to use dnn_superres module is the result of the work of 2 developers Xavier Weber and Fanny Monori. They developed this module as part of their GSOC (Google summer of code) project. GSOC 2019 also made NVIDIA GPU support possible. 

It’s always amazing to see how a summer project for students by google brings forward some great developers making awesome contributions to the largest Computer Vision library out there.

The dnn_superes module in OpenCV was included in version 4.1.2  for C++ but the python wrappers were added in 4.3 version about a month back, so you have to make sure that you have OpenCV version 4.3 installed. And of course, since this module is included in the contrib module so make sure you have also installed OpenCV contrib package.


Note: You can’t install OpenCV 4.3 version by doing pip install as the latest version here open-contrib-python from pip is still version

So the pypi version of OpenCV is maintained by just one guy named: Olli-Pekka Heinisuo by username: skvark and he updates the pypi OpenCV package in his free time. Currently, he’s facing a compiling issue which is why 4.3 version has not come out as of 7-15-2020. But from what I have read, he will be building the .whl files for 4.3 version soon, it may be out this month. If that happens then I’ll update this post.

So right now the only way you will be able to use this module is if you have installed OpenCV 4.3 from Source. If you haven’t done that then you can easily follow my installation tutorial.

I should also take this moment to highlight the fact you should not always rely on OpenCV’s pypi package, no doubt skvark has been doing a tremendous job maintaining OpenCV’s pypi repo but this issue tells you that you can’t rely on  a single developer’s free time to update the library for production use cases, learn to install the Official library from source. Still, pip install opencv-contrib-python is a huge blessing for people starting out or in early stages of learning OpenCV, so hats off to skvark.

As you might have noticed among the 4 models above we have already learned to use ESPCNN in the previous tutorial, we will use it again but this time with the dnn_superres module.

Super Resolution with dnn_superres Code

Download Code for this post

Download Code for this post

Directory Hierarchy

After downloading the zip folder, unzip it and you will have the following directory structure.

This is how our directory structure looks like, it has a Jupyter notebook, a media folder with images and the model folder containing all 4 models.

You can now run the notebook Super_Resolution_with_dnn_superres.ipynb and start executing each cell as follows.

Import Libraries

Start by Importing the required libraries.

Initialize the Super Resolution Object

First you have to create the dnn_superres constructor by the following command.

Read Image

We will start by reading and displaying a sample image. We will be running the EDSR model (with 4x scale) to upscale this image.

Extracting Model Name & Scale

In the next few steps, will be using a setModel() function in which we will pass the model’s name and its scale. We could manually do that but all this information is already present in the model’s pathname so we just need to extract the model’s name and scale using simple text processing.

model name: edsr
model scale: 4

Reading the model

Finally we will read the model, this is where all the required weights of the model gets loaded. This is equivalent to DNN module’s readnet function

Setting Model Name & Scale

Here we are setting the name and scale of the model which we extracted above.

Why do we need to do that ?

So remember when I said that this module does not require us to do preprocessing or postprocessing because it does that internally. So in order to initiate the correct pre and post-processing pipelines, the module needs to know which model we will be using and what version meaning what scale 2x, 3x, 4x etc.

Running the Network

This is where all the magic happens. In this line a forward pass of the network is performed along with required pre and post-processing. We are also making note of the time taken as this information will tell us if the model can be run in real-time or not.

As you can see it takes a lot of time, in fact, EDSR is the most expensive model out of the four in terms of computation.

It should be noted that larger your input image’s resolution is the more time its going to take in this step.

Wall time: 45.1 s

Check the Shapes

We’re also checking the shapes of the original image and the super resolution image. As you can see the model upscaled the image by 4 times.

Shape of Original Image: (262, 347, 3) , Shape of Super Resolution Image: (1200, 1200, 3)

Comparing the Original Image & Result

Finally we will display the original image along with its super resolution version. Observe the difference in Quality.

Save the High Resolution Image

Although you can see the improvement in quality but still you can’t observe the true difference with matplotlib so its recommended that you save the SR image in disk and then look at it.

Creating Functions

Now that we have seen a step by step implementation of the whole pipeline, we’ll create the 2 following python functions so we can use different models on different images by just calling a function and passing some parameters.

Initialization Function: This function will contain parts of the network that will be set once, like loading the model.

Main Function: This function will contain the rest of the code. it will also have the option to either return the image or display it with matplotlib. We can also use this function to process a real-time video.

Initialization Function

Main Function

Set returndata = True when you just want the image. This is usually done when I’m working with videos. I’ve also added a few more optional variables to the method.

print_shape: This variable decides if you want to print out the shape of the model’s output.

name: This is the name by which you will save the image in disk.

save_img: This variable decides if you want to save the images in disk or not.

Now that we have created the initialization function and a main function, lets use all 4 models on different examples

The function above displays the original image along with the SR Image.

Initialize Enhanced Deep Residual Network (EDSR, 4x Resolution)

Run the network

Shape of Original Image: (221, 283, 3) , Shape of Super Resolution Image: (884, 1132, 3)
Wall time: 43.1 s

Initialize Efficient Subpixel Convolutional Network (ESPCN, 4x Resolution)

Run the network

Shape of Original Image: (256, 256, 3) , Shape of Super Resolution Image: (1024, 1024, 3)
Wall time: 295 ms

Initialize Fast Super-Resolution Convolutional Neural Networks (FSRCNN, 3x Resolution)

Run the network

Shape of Original Image: (232, 270, 3) , Shape of Super Resolution Image: (696, 810, 3)
Wall time: 253 ms

Initialize Laplacian Pyramid Super-Resolution Network (LapSRN, 8x Resolution)

Run the network

Shape of Original Image: (302, 357, 3) , Shape of Super Resolution Image: (2416, 2856, 3)
Wall time: 26 s

Applying Super Resolution on Video

Lastly, I’m also providing the code to run Super-resolution on Videos. Although the example video I’ve used sucks, but that’s the only video I tested on primarily because I’m only interested in doing super resolution on images as this is where most of my use cases lie. Feel free to test out different models for real-time feed.

Tip: You might also want to save the High res video in disk using the VideoWriter Class.


Here’s a chart for benchmarks using a 768×512 image with 4x resolution on an Intel i7-9700K CPU for all models.

The benchmark shows PSNR (Peak signal to noise ratio) and SSIM (structural similarity index measure) scores, these are the scores which measure how good the supre res network’s output is.

The best performing model is EDSR but it has the slowest inference time, the rest of the models can work in real time.

For detailed benchmarks you can see this page.  Also make sure to check Official OpenCV contrib page on dnn_superres module

If you thought upscaling to 8x resolution was cool then take a guess on the scaling ability of the current state of the Art algorithm in super-resolution?

So believe it or not the state of the art in SR can actually do a 64x resolution…yes 64x, that wasn’t a typo.

In fact, the model that does 64x was published just last month, here’s the paper for that model, here’s the GitHub repo and here is a ready to run colab notebook to test out the code. Also here’s a video demo of it. It’s pretty rare that such good stuff is easily accessible for programmers just a month after publication so make sure to check it out.

The model is far too complex to explain in this post but the authors took a totally different approach, instead of using supervised learning they used self-supervised learning. (This seems to be on the rise).

What’s Next?

computer vision

If you want to go forward from here and learn more advanced things and go into more detail, understand theory and code of different algorithms then be sure to check out our Computer Vision & Image Processing with Python Course (Urdu/Hindi). In this course, I go into a lot of detail regarding vision fundamentals and cover a plethora of algorithms and techniques to help you master Computer Vision.

The 3 month course contains:

✔ 125 Video Lectures
✔ Discussion Forums
✔ Quizzes
✔ 100+ High Quality Jupyter notebooks
✔ Practice Assignments
✔Certificate of Completion

If you want to start a career in Computer Vision & Artificial Intelligence then this course is for you. One of the best things about this course is that the video lectures are in Urdu/Hindi Language without any compromise on quality, so there is a personal/local touch to it.


In today’s tutorial we learned to use 4 different architectures to do Super resolution going from 3x to 8x resolution. 

Since the library handles preprocessing and postprocessing, so the code for all the models was almost the same and pretty short.

As I mentioned earlier, I only showed you results of a single version of each model, you should go ahead and try other versions of each model.

These models have been trained using DIV2K  BSDS and General100 datasets which contains images of diverse objects but the best results from a super-resolution model is obtained by training them for a domain-specific task, for e.g if you want the SR model to perform best on pedestrians then your dataset should consist mostly of pedestrian images. The best part about training SR networks is that you don’t need to spend hours doing manual annotation, you can just resize them and you’re all set.

Also I would raise a concern regarding these models that we must be careful using SR networks, for e.g. consider this scenario:

 You caught an image of a thief stealing your mail on your low res front door cam, the image looks blurry and you can’t make out who’s in the image.

Now you being a Computer Vision enthusiast thought of running a super res network to get a clearer picture.
After running the network, you get a much clearer image and you can almost swear that it’s Joe from the next block.

The same Joe that you thought was a friend of yours.

The same Joe that made different poses to help you create a pedestrian datasets for that SR network you’re using right now.

How could Joe do this?

Now you feel betrayed but yet you feel really Smart, you solved a crime with AI right?

You Start STORMING to Joe’s house to confront him with PROOF.

Now hold on! … like really hold on.

Don’t do that, seriously don’t do that.

Why did I go on a rant like that?

Well to be honest back when I initially learned about SR networks that’s almost exactly what I thought I would do. Solve Crimes by AI by doing just that (I know it was a ridiculous idea). But soon I realize that SR networks only learn to hallucinate data based on learned data, they can’t visualize a face with 100% accuracy that they’ve never seen. It’s still pretty useful but you have to use this technology carefully.

I hope you enjoyed this tutorial, feel free to comment below and I’ll gladly reply.