Training a Custom Image Classifier with Tensorflow, Converting to ONNX and using it in OpenCV DNN module

In the previous tutorial we learned how the DNN module in OpenCV works. We went into a lot of detail on different aspects of the module, including where to get the models and how to configure them.

This tutorial builds on the previous one, so if you haven’t read that post yet, you can read it here.

Today’s post is the second tutorial in our brand new 3 part Deep Learning with OpenCV series. All three posts are titled as:

  1. Deep Learning with OpenCV DNN Module, A Comprehensive Guide
  2. Training a Custom Image Classifier with Tensorflow, Converting to ONNX and using it in OpenCV DNN module
  3. Using a Custom Trained Object Detector with OpenCV DNN Module.

In this post, we will train a custom image classifier with Tensorflow’s Keras API. So if you want to learn how to get started creating a Convolutional Neural Network using Tensorflow, this post is for you. Afterward, we will also convert our trained .h5 model to ONNX format and then use it with the OpenCV DNN module.

In this example, converting the model to ONNX also shrinks it to roughly a third of its size (about 7.5 MB down to about 2.5 MB).

This whole process shows you how to train models in Tensorflow and then deploy them directly in OpenCV.

What’s the advantage of using the trained model in OpenCV vs using it in Tensorflow?

Here are some points you may want to consider.

  • By using OpenCV’s DNN module, the final code is a lot more compact and simpler.
  • Someone who’s not familiar with a training framework like TensorFlow can also use this model.
  • There are cases where using OpenCV’s DNN module will give you faster inference results on the CPU. See these results in LearnOpenCV by Satya.
  • Besides supporting CUDA-based NVIDIA GPUs, OpenCV’s DNN module also supports OpenCL-based Intel GPUs.
  • Most importantly, getting rid of the training framework (Tensorflow) not only makes the code simpler, it also removes a whole dependency: you don’t have to build your final application with a heavy framework like TensorFlow. This is a huge advantage when you’re trying to deploy on a resource-constrained edge device, e.g. a Raspberry Pi.

So this way you’re getting the best of both worlds, a framework like Tensorflow for training and OpenCV DNN for faster inference during deployment.

This tutorial can be split into 3 parts.

  1. Training a Custom Image Classifier with Tensorflow.
  2. Converting Our Classifier to ONNX format.
  3. Using the ONNX model directly in the OpenCV DNN module.

Let’s start with the Code

Download Code for this post

Part 1: Training a Custom Image Classifier with Tensorflow:

For this tutorial you need OpenCV 4.0.0.21 and Tensorflow 2.2.

So you should do:

pip install opencv-contrib-python==4.0.0.21
(Or install from source; make sure to change the version.)

pip install tensorflow==2.2.0
(Or install tensorflow-gpu from source)

Note: The reason I’m asking you to install version 4.0 instead of the latest 4.3 version of OpenCV is that later on we’ll be using a function called readNetFromONNX(). With our model, this function was giving an error in 4.3 and 4.2, possibly due to some bug in those versions. This does not mean that you can’t use custom models with those versions, only that there was an issue in my specific case. Converting models only takes 2-3 lines of code, but sometimes you get ambiguous errors that are hard to diagnose. It can still be done.

Hopefully, the conversion process will get better in the future.

One thing you can do is create a custom environment (with Anaconda or virtualenv) in which you can install version 4.0 without affecting your root environment. If you’re using Google Colab for this tutorial, you don’t need to worry about that.

You can go ahead and download the source code from the download code section. After downloading the zip folder, unzip it and you will have the following directory structure.

You can start by importing the libraries:

Let’s see how you would go about training a basic Convolutional Network in Tensorflow. I assume you know some basics of deep learning. In this tutorial, I will show how to construct and train a classifier using a real-world dataset, not a toy one, but I will not go in-depth and explain the theory behind neural networks. If you want to start learning deep learning, you can take a look at Andrew Ng’s Deep Learning specialization, although it is basic and covers mostly foundational things. If your end goal is to specialize in Computer Vision, I would strongly recommend that you first learn Image Processing and Classical Computer Vision techniques from my 3 month comprehensive course here.

The dataset we’re going to use here contains 5 different flowers, namely roses, tulips, sunflowers, daisy and dandelion. I avoided the usual cats and dogs dataset.

You can download the dataset from a URL by running this cell.

After downloading the dataset you’ll have to unzip it. You can also do this manually.

After extracting you can check the folder named flower_photos in your current directory which will contain these 5 subfolders.

You can check the number of images in each class using the code below.
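As a rough sketch of such a count, the snippet below walks the class subfolders and tallies the image files in each one. It assumes the archive was extracted to a folder named flower_photos as described above; the helper names count_images and report, and the set of file extensions, are my own choices for illustration.

```python
from pathlib import Path

# Assumption: images use one of these extensions; adjust for your dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}

def count_images(data_dir):
    """Return {class_name: image_count} for every subfolder of data_dir."""
    counts = {}
    for class_dir in sorted(Path(data_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1 for f in class_dir.iterdir()
                if f.suffix.lower() in IMAGE_EXTENSIONS
            )
    return counts

def report(data_dir="flower_photos"):
    """Print one line per class, then the sorted list of class names."""
    counts = count_images(data_dir)
    for name, n in counts.items():
        print(f"Found {n} images of {name}")
    print(sorted(counts))
```

Calling report() from the directory that contains flower_photos prints one line per class. Files that sit directly in the top-level folder (a license file, for instance) are ignored because only subfolders are counted.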

Found 699 images of sunflowers
Found 898 images of dandelion
Found 633 images of daisy
Found 799 images of tulips
Found 641 images of roses
[‘daisy’, ‘dandelion’, ‘roses’, ‘sunflowers’, ‘tulips’]

Generate Images:

Now it’s time to load the data. Since the dataset is only approx 218 MB, we could actually load it all into RAM, but most real datasets are several GBs in size and will not fit in your RAM. In those scenarios, you use data generators to fetch batches of data and feed them to the neural network during training, so today we’ll also be using a data generator to load the data.

Before we can pass the images to a deep learning model, we need to do some preprocessing: resize the images to the required shape, convert them to floating-point tensors, and rescale the pixel values from the 0-255 range to the 0-1 range, as this helps in training.

Fortunately, all of this can be done by the ImageDataGenerator class in tf.keras. Not only that, but the ImageDataGenerator class can also perform data augmentation. Data augmentation means that the generator takes your image and applies random transformations such as rotating, zooming, and translating it. This is really effective when you don’t have much data, because every epoch sees slightly different versions of your images, and that extra variation helps in generalization.

As you’ve already seen, each flower class has less than 1000 examples, so in our case data augmentation will help a lot.

When training a Neural Network, we normally use 2 datasets, a training dataset and a validation dataset. The neural network tunes its parameters using the training dataset and the validation dataset is used for the evaluation of the Network’s performance.

Found 2939 images belonging to 5 classes.
Found 731 images belonging to 5 classes.

Note: Usually, when using an ImageDataGenerator to read from a directory with data augmentation, we keep two separate directories (training and validation), each with one folder per class, because data augmentation is applied only to the training set, not the validation set, which is used purely for evaluation. Here I’ve instead created two data generator instances for the same directory with a validation split of 20% and used the same constant random seed on both generators so there is no data overlap.

I’ve rarely seen people split with augmentation this way, but this approach works and saves us the time of splitting data between two directories.
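Here is a minimal sketch of that two-generator setup, not the exact code from the notebook. The augmentation values, batch size and seed are assumptions I picked for illustration; only the 20% validation split, the 0-1 rescaling and the 200x200 image size come from this post. The helper names generator_kwargs and make_generators are hypothetical.

```python
# Assumed values for illustration; tune them for your own data.
IMG_HEIGHT, IMG_WIDTH = 200, 200
BATCH_SIZE = 32
SEED = 42

def generator_kwargs(augment):
    """Arguments for ImageDataGenerator; augmentation only when requested."""
    kwargs = {"rescale": 1.0 / 255, "validation_split": 0.2}
    if augment:
        kwargs.update(rotation_range=30, zoom_range=0.2,
                      width_shift_range=0.1, height_shift_range=0.1,
                      horizontal_flip=True)
    return kwargs

def make_generators(data_dir="flower_photos"):
    """Build a training and a validation generator over the same directory."""
    # Imported here so generator_kwargs() can be used without TensorFlow.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    flow_args = dict(directory=data_dir,
                     target_size=(IMG_HEIGHT, IMG_WIDTH),
                     batch_size=BATCH_SIZE,
                     class_mode="categorical",
                     seed=SEED)
    # Same split and seed on both, but only the training one augments.
    train_gen = ImageDataGenerator(**generator_kwargs(augment=True)) \
        .flow_from_directory(subset="training", **flow_args)
    val_gen = ImageDataGenerator(**generator_kwargs(augment=False)) \
        .flow_from_directory(subset="validation", **flow_args)
    return train_gen, val_gen
```

Keeping the shared settings in one place makes it harder to give the two generators different split fractions by accident, which would let images leak between the two sets.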

Visualize Images:

It’s always a good idea to see what images look like in your dataset, so here’s a function that will plot new images from the dataset each time you run it.



Alright, now we’ll use the above function to first display a few of the original images using the validation generator.



Now we will generate some Augmented images using the train generator. Notice how images are rotated, zoomed etc.

Create the Model

We’re using Tensorflow 2 (TF2), and in TF2 the most popular way to create neural networks is the Keras API. Keras used to be a separate framework (it still is), but because of its popularity in the community it was included in TensorFlow as the default high-level API. This abstraction allows developers to use TensorFlow’s low-level functionality with high-level Keras code.

This way you can design powerful neural networks in just a few lines of code. E.g. take a look at how we have created an effective Convolutional Network.


A typical neural network has a bunch of layers, and in a Convolutional network you’ll see convolutional layers. These layers are created with the Conv2D class. Take a look at the first layer:

      Conv2D(16, 3, padding='same', activation='relu', input_shape=(IMG_HEIGHT, IMG_WIDTH, 3))

The number 16 refers to the number of filters in that layer; normally we increase the number of filters as we add more layers. You should notice that I double the number of filters in each subsequent convolutional layer, i.e. 16, 32, 64 …, which is common practice. In the first layer, you also specify a fixed input shape that the model will accept, which we have already set as 200x200.

Another thing you’ll see is that a convolutional layer is typically followed by a pooling layer. The Conv layer outputs a number of feature maps and the pooling layer reduces the spatial size (width and height) of these feature maps, which reduces the computation in the rest of the network.

So you’ll commonly see a convolutional layer followed by a pooling layer, and this is normally repeated several times. At each stage the spatial size is reduced and the number of filters is increased. We are using a MaxPooling layer, but there are other pooling types too, e.g. AveragePooling.

The Dropout layer randomly drops x% of the activations passing through it during training (it is inactive at inference time), which pushes the network to learn robust features. In the network above I’m using dropout twice, dropping 10% each time. The whole purpose of the Dropout layer is to reduce overfitting.

Now before we add the final layer we need to flatten the output into a single-dimensional vector. This can be done with the Flatten layer, but a better method is the GlobalAveragePooling2D layer, which reduces each feature map to a single value and so needs far fewer parameters in the layers that follow.

Finally, before our last layer, we also use a Dense layer (A fully connected layer) with 1024 units. The final layer contains the number of units equal to the number of classes. The activation function here is softmax as I want the network to produce class probabilities at the end.
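To make the description above concrete, here is a sketch of a network with that general shape. It is not the exact architecture from the notebook: the number of conv blocks, where the two dropout layers sit, and the relu activation on the 1024-unit Dense layer are all assumptions on my part, and the helper names conv_filters and build_model are hypothetical.

```python
NUM_CLASSES = 5
IMG_HEIGHT, IMG_WIDTH = 200, 200

def conv_filters(num_blocks, base=16):
    """Filter counts that double with each block, e.g. [16, 32, 64]."""
    return [base * 2 ** i for i in range(num_blocks)]

def build_model(num_blocks=3):
    """Conv/pool blocks, then global pooling and two Dense layers."""
    # Imported here so conv_filters() works without TensorFlow installed.
    import tensorflow as tf
    from tensorflow.keras import layers

    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(IMG_HEIGHT, IMG_WIDTH, 3)))
    for i, filters in enumerate(conv_filters(num_blocks)):
        model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
        model.add(layers.MaxPooling2D())
        # Assumption: dropout after the last two blocks. The post only
        # says dropout is used twice at 10%.
        if i >= num_blocks - 2:
            model.add(layers.Dropout(0.1))
    model.add(layers.GlobalAveragePooling2D())
    model.add(layers.Dense(1024, activation="relu"))  # activation assumed
    model.add(layers.Dense(NUM_CLASSES, activation="softmax"))
    return model
```

Building the conv blocks in a loop keeps the doubling pattern visible and makes it easy to try a deeper or shallower network by changing num_blocks.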

Compile the model

Before we can start training the network we need to compile it, this is the step where we define our loss function, optimizer, and metrics.

For this example, we are using the Adam optimizer and a categorical cross-entropy loss function, as we’re dealing with a multi-class classification problem. The only metric we care about right now is the accuracy of the model.

Model summary

By using the built-in method called summary() we can see the whole architecture of the model that we just created. You can see the total parameter count and the number of params in each layer.

Notice how the number of params is 0 in all layers except the Conv and Dense layers. This is because these are the only two types of layers here that have weights to learn.
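If you want to sanity-check the numbers in the summary, the parameter counts of these two layer types follow from simple arithmetic. The helpers below are my own illustration (the names are hypothetical): a Conv2D layer has one weight per kernel cell per input channel plus one bias, for each filter, and a Dense layer has one weight per input plus one bias, for each unit.

```python
def conv2d_params(kernel_size, in_channels, filters):
    """Weights plus one bias per filter for a square-kernel Conv2D layer."""
    return (kernel_size * kernel_size * in_channels + 1) * filters

def dense_params(in_units, out_units):
    """Weights plus one bias per output unit for a Dense layer."""
    return (in_units + 1) * out_units

# First layer of this post: 3x3 kernel, RGB input, 16 filters.
print(conv2d_params(3, 3, 16))   # 448
# Final layer: 1024 inputs from the Dense layer before it, 5 classes.
print(dense_params(1024, 5))     # 5125
```

Pooling, dropout and global pooling layers only rearrange or mask values, so they contribute nothing to the total.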

Training the Model:

You can start training the model using the model.fit() method but first specify the number of epochs, and the steps per epoch. 

Epoch: A single epoch means one pass over the whole dataset. An epoch is considered done when the model has gone over all the images in the training data and used them for gradient calculation and optimization. So this number decides how many times the model will go over your whole data.

Steps per epoch: A single step means the model goes over a single batch of the data, so steps per epoch tells the model after how many steps an epoch should be considered done. This should be set to dataset_size / batch_size, which is the number of steps required to go over the whole data once.
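As a quick worked example of that formula, using the image counts reported by the two generators above. The batch size of 32 is an assumption for illustration, and I round up so the final partial batch is still counted; rounding down with // instead would simply skip it.

```python
import math

def steps_per_epoch(num_images, batch_size):
    """Number of batches needed to see every image once."""
    return math.ceil(num_images / batch_size)

# 2939 training and 731 validation images; batch size 32 is assumed.
print(steps_per_epoch(2939, 32))  # 92
print(steps_per_epoch(731, 32))   # 23
```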

Let’s train our model for 60 epochs.

…………………………………..
…………………………………..

You can see in the last epoch that our validation loss is low and accuracy is high, so our model has successfully converged. We can further verify this by plotting the loss and accuracy graphs, which is good practice after any training run.

The model has slightly overfitted at the end but that is okay considering the number of images we used and our model’s capacity.

You can test out the trained model on a single test image using this code. Make sure to carry out the same preprocessing steps you used before training, e.g. since we trained on normalized images in the 0-1 range, we need to divide any new image by 255 before passing it to the model for prediction.

Predicted Flowers is : roses, 85.61%

Notice that we are converting our image from BGR to RGB color format. This is because the model was trained in TensorFlow on images in RGB format, whereas OpenCV reads images in BGR format, so we have to reverse the channels before we can perform prediction.

Finally, when you’re satisfied with the model, you can save it in .h5 format using the model.save function.

Part 2: Converting Our Classifier to ONNX format

Now that we have trained our model, it’s time to convert it to ONNX format.

What is ONNX ?

ONNX stands for Open Neural Network Exchange. It is an industry-standard format for exchanging models between frameworks: you can train a model in PyTorch or any other common framework, convert it to ONNX, and then convert it to TensorFlow or any other framework.

So ONNX allows developers to move models between different frameworks such as CNTK, Caffe2, Tensorflow, PyTorch etc.

So why are we converting to ONNX ?

Remember, our goal is to use the above custom trained model in the DNN module, but the DNN module does not support the .h5 Keras model directly. So we have to convert our .h5 model to a .onnx model; after that we can plug the ONNX model into the DNN module.

Note: Even if you saved the model in the SavedModel format, you still can’t use it directly.

You need the keras2onnx module to perform the conversion, so go ahead and install it.

pip install keras2onnx

You also need to install onnx so that you can save .onnx models to disk.

pip install onnx

After installing keras2onnx, you can use its convert_keras function to convert the model. We will also serialize the model to disk using keras2onnx.save_model so we can use it later.
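A minimal sketch of that conversion could look like the following. The two keras2onnx calls are the ones named above; the helper names onnx_path_for and convert_h5_to_onnx, and the idea of writing the .onnx file next to the .h5 file, are my own assumptions rather than what the notebook does.

```python
from pathlib import Path

def onnx_path_for(h5_path):
    """Same file name as the Keras model, but with an .onnx extension."""
    return str(Path(h5_path).with_suffix(".onnx"))

def convert_h5_to_onnx(h5_path):
    """Load a saved Keras model, convert it, and write the ONNX file."""
    # Imported here so onnx_path_for() works without these packages.
    import keras2onnx
    from tensorflow.keras.models import load_model

    model = load_model(h5_path)
    onnx_model = keras2onnx.convert_keras(model, model.name)
    out_path = onnx_path_for(h5_path)
    keras2onnx.save_model(onnx_model, out_path)
    return out_path
```

If you still have the trained model in memory you can skip load_model and pass the model object to convert_keras directly.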

tf executing eager_mode: True
tf.keras model eager_mode: False
The ONNX operator number change on the optimization: 57 -> 25

Now we’re ready to use this model in the DNN module. Notice how your ~7.5 MB .h5 model has been reduced to a ~2.5 MB .onnx model, a 3x reduction in size. Make sure to check out the keras2onnx repo for more details.

Note: You can even use this model with just ONNX using onnxruntime module which itself is pretty powerful considering the support of multiple hardware accelerations.

Part 3: Using the ONNX model in the OpenCV DNN module:

Now we will take this ONNX model and use it directly in our DNN module.

Let’s use this as a test image.

Here’s the code to test the ONNX model on the image.
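The sketch below shows one way this test could be written; treat it as an outline rather than the notebook’s code. It assumes the exported model still expects input in the (batch, height, width, channels) layout that Keras used, scaled to 0-1 and in RGB order. If your export expects channels first, building the input with cv2.dnn.blobFromImage would be the usual route. The function names and the file paths you pass in are hypothetical.

```python
# Class order as printed when counting the images earlier in this post.
CLASS_NAMES = ["daisy", "dandelion", "roses", "sunflowers", "tulips"]
IMG_HEIGHT, IMG_WIDTH = 200, 200

def top_prediction(probabilities, labels):
    """Return (label, probability) for the highest-scoring class."""
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    return labels[best], float(probabilities[best])

def classify_with_dnn(image_path, onnx_path):
    """Run one image through the ONNX model using only OpenCV and NumPy."""
    # Imported here so top_prediction() works without OpenCV installed.
    import cv2
    import numpy as np

    net = cv2.dnn.readNetFromONNX(onnx_path)
    image = cv2.imread(image_path)
    image = cv2.resize(image, (IMG_WIDTH, IMG_HEIGHT))
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # model was trained on RGB
    # Assumption: the network takes a channels-last batch scaled to 0-1.
    batch = np.array([image]).astype("float32") / 255.0
    net.setInput(batch)
    probabilities = net.forward()[0]  # softmax output, one value per class
    return top_prediction(probabilities, CLASS_NAMES)
```

Because the network already ends in a softmax layer, the output can be read as class probabilities without any further post-processing.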

Here’s the result of a few images which I took from google, I’m using my custom function classify_flower() to classify these images. You can find this function’s code inside the downloaded Notebook.

If you want to learn about doing image classification using the DNN module in detail, then make sure to read the previous post, Deep learning with OpenCV DNN module, where I have explained each step in detail.

What’s Next?

If you want to go forward from here and learn more advanced things and go into more detail, understand theory and code of different algorithms then be sure to check out our Computer Vision & Image Processing with Python Course (Urdu/Hindi). In this course, I go into a lot of detail regarding vision fundamentals and cover a plethora of algorithms and techniques to help you master Computer Vision.

The 3 month course contains:

✔ 125 Video Lectures
✔ Discussion Forums
✔ Quizzes
✔ 100+ High Quality Jupyter notebooks
✔ Practice Assignments
✔ Certificate of Completion

If you want to start a career in Computer Vision & Artificial Intelligence then this course is for you. One of the best things about this course is that the video lectures are in Urdu/Hindi Language without any compromise on quality, so there is a personal/local touch to it.

Summary:

In today’s post we first learned how to train an image classifier with tf.keras, after that we learned how to convert our trained .h5 model to .onnx model.

Finally we learned to use this onnx model using OpenCV’s DNN module.

Although the model we converted today was quite basic, this same pipeline can be used for converting complex models too.

A word of Caution: I personally have faced some issues while converting some types of models so the whole process is not foolproof yet but it’s still pretty good. Make sure to look at keras2onnx repo and this excellent repo of ONNX conversion tutorials.




Deep Learning with OpenCV DNN Module, A Comprehensive Guide

In this tutorial we will go over OpenCV’s DNN module in detail. I plan to cover various important details of the DNN module that are never discussed, things that usually trip people up, like selecting preprocessing params correctly and designing pre- and postprocessing pipelines for different models.

This post is the first of 3 in our brand new Deep Learning with OpenCV series. All three posts are titled as:

  1. Deep Learning with OpenCV DNN Module, A Comprehensive Guide
  2. Training a Custom Image Classifier with Tensorflow, Converting to ONNX and using it in OpenCV DNN module
  3. Using a Custom Trained Object Detector with OpenCV DNN Module

This post can be split into 3 sections.

  1. Introduction to OpenCV’s DNN module.
  2. Using a Caffe DenseNet121 model for classification.
  3. Important Details regarding the DNN module, e.g. where to get models, how to configure them, etc.

If you’re just interested in the image classification part then you can skip to the second section or you can even read this great classification with DNN module post by Adrian. However, if you’re interested in getting to know the DNN module in all its glory then keep reading.

Introduction to OpenCV’s DNN module

First let me start by introducing the DNN module for all those people who are new to it. As you can probably guess, DNN stands for Deep Neural Network. This is the module in OpenCV which is responsible for all things deep learning related.

It was introduced in OpenCV version 3 and now in version 4.3 it has evolved a lot. This module lets you take pre-trained neural networks from popular frameworks like Tensorflow, PyTorch, etc. and use those models directly in OpenCV.

This means you can train models using a popular framework like Tensorflow and then do inference/prediction with just OpenCV.

So what are the benefits here?

Here are some advantages you might want to consider when using OpenCV for inference.

  • By using OpenCV’s DNN module for inference, the final code is a lot more compact and simpler.
  • Someone who’s not familiar with the training framework can also use the model.
  • Besides supporting CUDA-based NVIDIA GPUs, OpenCV’s DNN module also supports OpenCL-based Intel GPUs.
  • Most importantly, getting rid of the training framework not only makes the code simpler, it also removes a whole dependency: you don’t have to build your final application with a heavy framework like TensorFlow. This is a huge advantage when you’re trying to deploy on a resource-constrained edge device, e.g. a Raspberry Pi.

One thing that might put you off is the fact that OpenCV can’t be used for training deep learning networks. This might sound like a bummer, but fret not: for training neural networks you shouldn’t use OpenCV anyway, as there are specialized libraries like Tensorflow, PyTorch, etc. for that task.

So which frameworks can you use to train Neural Networks:

These are the frameworks that are currently supported with the DNN module.

Now there are many interesting pre-trained models already available in the OpenCV Model Zoo that you can use. To keep things simple for this tutorial, I will be using an image classification network.

I have also made a tutorial on doing Super-Resolution with DNN module and Facial expression recognition that you can look at after going through this post.

Details regarding other types of models are discussed in the 3rd section. By the way, I actually go over 13-14 different types of models in our Computer Vision and Image processing Course. These contain notebooks tutorials and video walk-throughs.

Image Classification pipeline with OpenCV DNN

Now we will be using a DenseNet121 model, which is a Caffe model trained on the 1000 classes of ImageNet. The model is from the paper Densely Connected Convolutional Networks by Gao Huang et al.

Generally there are 4 steps you need to perform when doing deep learning with DNN module.

  1. Read the image and the target classes.
  2. Initialize the DNN module with an architecture and model parameters.
  3. Perform the forward pass on the image with the module.
  4. Post-process the results.

The pre and post processing steps are different for different tasks.

Let’s start with the code

Download Code for this post

You can go ahead and download the source code from the download code section. After downloading the zip folder, unzip it and you will have the following directory structure.

Now run the Image Classification with DenseNet121.ipynb notebook, and start executing the cells.

Import Libraries

First we will import the required libraries.

Loading Class Labels

Now we’ll start by loading the class names. In this notebook, we are going to classify among the 1000 classes defined in ImageNet.

All these classes are in the text file named synset_words.txt. In this text file, each class is on a new line with its unique ID. Each class also has multiple labels, e.g. look at the first 3 lines in the text file:

  • ‘n01440764 tench, Tinca tinca’
  • ‘n01443537 goldfish, Carassius auratus’
  • ‘n01484850 great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias’

So for each line, we have the class ID followed by multiple comma-separated class names. They are all valid names for that class and we’ll just use the first one. To do that, we’ll extract the first name after the ID from each line (everything up to the first comma) and create a new list; this will be our labels list.

Number of Classes 1000
[‘n01440764 tench, Tinca tinca’, ‘n01443537 goldfish, Carassius auratus’, ‘n01484850 great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias’, ‘n01491361 tiger shark, Galeocerdo cuvieri’, ‘n01494475 hammerhead, hammerhead shark’]

Extract the Label

Here we will extract the labels (the first name after the class ID on each line) and create a labels list.
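One way to do this extraction is sketched below, assuming each line has the "ID name, name, ..." layout shown above. The helper names extract_label and load_labels are hypothetical.

```python
def extract_label(line):
    """Drop the class ID, then keep the text before the first comma."""
    names = line.strip().split(" ", 1)[1]
    return names.split(",")[0].strip()

def load_labels(path="synset_words.txt"):
    """Read the synset file and return one label per non-empty line."""
    with open(path) as f:
        return [extract_label(line) for line in f if line.strip()]

rows = [
    "n01440764 tench, Tinca tinca",
    "n01443537 goldfish, Carassius auratus",
    "n01484850 great white shark, white shark",
]
print([extract_label(r) for r in rows])
# ['tench', 'goldfish', 'great white shark']
```

Splitting on the first space only, and then on the comma, keeps multi-word names such as "great white shark" intact, which taking just the second word of the line would not.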

[‘tench’, ‘goldfish’, ‘great white shark’, ‘tiger shark’, ‘hammerhead’, ‘electric ray’, ‘stingray’, ‘cock’, ‘hen’, ‘ostrich’, ‘brambling’, ‘goldfinch’, ‘house finch’, ‘junco’, ‘indigo bunting’, ‘robin’, ‘bulbul’, ‘jay’, ‘magpie’, ‘chickadee’]

Initializing the DNN Module

Now before we can use the DNN Module we must initialize it using one of the following functions.

  • Caffe Models: cv2.dnn.readNetFromCaffe
  • Tensorflow Models: cv2.dnn.readNetFromTensorflow
  • Torch (Torch7) Models: cv2.dnn.readNetFromTorch (PyTorch models are usually exported to ONNX and loaded with cv2.dnn.readNetFromONNX)

As you can see, the function you use depends on the framework the model was originally trained in.

Since we’ll be using a DenseNet121 which was trained using Caffe, our function will be:

retval = cv2.dnn.readNetFromCaffe( prototxt[, caffeModel] )

Params:

  • prototxt: Path to the .prototxt file, this is the text description of the architecture of the model.
  • caffeModel: path to the .caffemodel file, this is your actual trained neural network model, it contains all the weights/parameters of the model. This is usually several MBs in size.

Note: If you load the model and proto file via readNetFromTensorflow, the order of the architecture and model arguments is reversed.

Read An Image

Let’s read an example image and display it with matplotlib’s imshow.

Pre-processing the image

Before you pass an image to the network you need to preprocess it. This means resizing the image to the size the network was trained on (for many networks this is 224×224). In the pre-processing step you also do other things like normalizing the image (bringing the intensity values into the 0-1 range) and mean subtraction. These are all steps the authors applied to the images used during model training.

Fortunately, in OpenCV you have a function called cv2.dnn.blobFromImage() which most of the time takes care of all the pre-processing for you.

blob = cv2.dnn.blobFromImage(image[, scalefactor[, size[, mean[, swapRB[, crop]]]]])

Params:

  • image: Input image.
  • scalefactor: Used to normalize the image. The image is multiplied by this value; a value of 1 means no scaling is done.
  • size: The size to which the image will be resized; this depends on the model.
  • mean: These are the mean R, G, B channel values from the whole dataset. They are subtracted from the image’s R, G, B channels respectively, which helps make the model less sensitive to illumination changes.
  • swapRB: Boolean flag (false by default) which indicates whether the first and last channels of a 3-channel image should be swapped.
  • crop: Flag which indicates whether the image will be cropped after resize or not. If crop is true, the input image is resized so that one side equals the corresponding dimension in size and the other is equal or larger; then a crop from the center is performed. If crop is false, the image is resized directly, without cropping and without preserving the aspect ratio.

After this function we get a 4D blob; this is what we’ll pass to the network.

(1, 3, 224, 224)

Note: There is also blobFromImages() which does the same thing but with multiple images.
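To see what the scalefactor, mean and swapRB parameters do to the pixel values, here is a small pure-Python sketch of the per-pixel arithmetic: the mean is subtracted first and the result is then multiplied by the scale factor. This is only an illustration of the arithmetic, not OpenCV’s implementation, and the name preprocess_pixel is hypothetical. Resizing, cropping and the reordering into a (batch, channels, height, width) blob are left out.

```python
def preprocess_pixel(bgr, scalefactor=1.0, mean=(0, 0, 0), swap_rb=False):
    """Apply mean subtraction and scaling to one (B, G, R) pixel.

    With swap_rb=True the pixel is reordered to (R, G, B) and the mean is
    expected in (R, G, B) order as well.
    """
    channels = tuple(reversed(bgr)) if swap_rb else tuple(bgr)
    return tuple((c - m) * scalefactor for c, m in zip(channels, mean))

# Plain scaling, no mean subtraction, channels left in BGR order.
print(preprocess_pixel((2, 4, 6), scalefactor=0.5))
# (1.0, 2.0, 3.0)

# Swap to RGB, subtract an RGB mean, then scale.
print(preprocess_pixel((10, 20, 30), scalefactor=0.5,
                       mean=(30, 20, 10), swap_rb=True))
# (0.0, 0.0, 0.0)
```

Getting the order wrong (scaling before subtracting the mean) gives different values, which is one of the easier ways to end up with poor predictions from an otherwise correct model.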

Input the Blob Image to the Network 

Here you’re setting up the blob image as the input to the network.

Forward Pass 

This is where the actual computation takes place, and most of the time in your whole pipeline will be spent here. Your image passes through all the layers of the model, and at the end you get the output of the classifier.

Wall time: 166 ms

Total Number of Predictions are: 1000

array([[[-2.0572357 ]], [[-0.18754716]], [[-3.314731 ]], [[-6.196114 ]]], dtype=float32)

Apply Softmax Function to get Probabilities

By looking at the output, you can tell that the model has returned a set of scores for each class but we need Probabilities between 0-1 for each class. We can get them by applying a softmax function on the scores.
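For reference, a softmax can be written in a few lines. The version below is a plain-Python sketch that works on a flat list of scores, so it assumes you have already flattened the network output (e.g. with reshape(-1)); on a NumPy array you would do the same steps with np.exp and a sum. Subtracting the maximum score first does not change the result but avoids overflow for large scores.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs.index(max(probs)))  # 0, the class with the highest score
```

Softmax keeps the ranking of the classes, so the index of the largest probability is the same as the index of the largest raw score.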

array([5.7877337e-06, 3.7540856e-05, 1.6458317e-06, 9.2260699e-08], dtype=float32)

The maximum probability is the confidence of our target class.

0.59984004

The index containing the maximum confidence/probability is the index of our target class.

331

By putting the index from above into our labels list we can get the name of our target class.

hare

As we have successfully performed the classification, now we will just annotate the image with the information we have.



Creating Functions 

Now that we have understood step by step how to create the pipeline for classification using OpenCV’s DNN module, we’ll create functions that do all of the above in a single step. In short, we will be creating the following two functions.

Initialization Function: This function will contain parts of the network that will be set once, like loading the model.

Main Function: This function will contain all the rest of the code from preprocessing to postprocessing, it will also have the option to either return the image or display it with matplotlib.



Initialization Function

This method will be run once and it will initialize the network with the required files.

Main Method

returndata is set to True when we want to perform classification on video.

Initialize the Classifier

Calling our initializer to initialize the network.

Using our Classifier Function

Now we can call our classifier function and test on multiple images.



Real time Image Classification

If you want to use this classifier in real time, here is the code for that.

Important Details Regarding the DNN module 

Let’s discuss some interesting details and some tips to fully utilize the DNN module.

Where to get the pre-trained Models:

Earlier I mentioned that you can get other pre-trained models, so where are they? 

The best place to get pre-trained models is here. This page is a wiki for Deep learning with OpenCV, you will find models that have been tested by the OpenCV team.

There are a variety of models present here, for things like Classification, Pose Detection, Colorization, Segmentation, Face recognition, text detection, style transfer, and more. You can take models from any of the above 5 frameworks.

Just click on the models to go to their repo and download them from there. Note: The models listed on the page above are only the tested models; in theory, you can take almost any pre-trained model and use it in OpenCV.

A faster and easier way to download models is to go here. This is a Python script that will let you download not only the most commonly used models but also some state-of-the-art ones like YOLO v4. You can download this script and then run it from the command line. Alternatively, if you’re in a rush and just want one specific model, you can take the downloadable URL of any model and download it.

After downloading the model, you will need a couple more things before you can actually use it in the OpenCV DNN module.

You’re probably familiar with those things by now: you will need the model configuration file, like the prototxt file we just used with our Caffe model above. You will also need the class labels. For classification problems, models are usually trained on the ImageNet dataset, which is why we needed the synset_words.txt file; for object detection you will find models trained on the COCO or Pascal VOC datasets. Similarly, other tasks may require other files.

So where are all these files present ?

You will find most of these configuration files present here and the class names here. If the configuration file you’re looking for is not present in the above links then I would recommend that you look at the GitHub repo of the model, the files would be present there. Otherwise, you have to create it yourself. (More on this later)

After getting the configuration files, the only thing you need is the pre-processing parameters that go in blobFromImage. E.g. the mean subtraction values, scaling params etc. 

You can get that information from here. Now, this script only contains parameter details for a few popular models. 

So how do you get the details for other models ?

For that you would need to go to the repo of the model and look in the ReadMe section, the authors usually put that information there. 

For example, I can visit the GitHub repo of the Human Pose Estimation model using this link, which I got from the model downloading script.

By scrolling down the readme I can find these details here:

Note: These details are not always present in the Readme and sometimes you have to do quite some digging before you can find these parameters.

What do you do if there is no GitHub repo link with the model? For example, this ShuffleNet model does not have a GitHub link, but I can see that its framework is ONNX.

So now I will visit the ONNX model zoo repo and find that model. 

After clicking on the model I will find its readme and then its preprocessing steps.

Notice that this model requires some preprocessing steps that are not supported by the blobFromImage function. This can happen, and at times you will need to write custom preprocessing steps without using blobFromImage. For example, in our Super Resolution post, I had to write a custom pre-processing pipeline for the network.
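As an illustration of what such a custom pipeline can look like, the sketch below divides each channel by its own standard deviation, something a single scale factor cannot express. The mean and standard deviation values are the commonly used ImageNet statistics and are an assumption here, so always check the model’s readme for the real ones. The name preprocess_custom is hypothetical.

```python
import numpy as np

# Commonly used ImageNet statistics (assumed, verify against the model's readme)
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess_custom(image_rgb):
    """image_rgb: HxWx3 uint8 RGB image, already resized to the input size."""
    img = image_rgb.astype(np.float32) / 255.0       # scale to [0, 1]
    img = (img - IMAGENET_MEAN) / IMAGENET_STD       # per-channel normalization
    img = np.transpose(img, (2, 0, 1))[np.newaxis]   # HWC -> NCHW
    return np.ascontiguousarray(img)

example = preprocess_custom(np.zeros((4, 4, 3), dtype=np.uint8))
print(example.shape)
```

The resulting array can then be passed to net.setInput in place of a blob created by blobFromImage.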

How to use our own Custom Trained Networks

Now that we have learned to use different models, you might wonder exactly how we can use our own custom trained models. The thing is, you can’t directly plug a trained network into the DNN module; you first need to perform some operations to get a configuration file, which is why we needed a prototxt file along with the model.

Fortunately, in the next two blog posts, I plan to cover exactly this topic and show you how to use a custom trained classifier and a custom trained detection network.

For now, you can take a look at this page which briefly describes how you can use models trained with Tensorflow Object Detection API in OpenCV.

One thing to note is that not all networks are supported by the DNN module. This is because the DNN module supports some 30+ layer types, whose names can be found at the wiki here. If a model contains layers that are not among the supported layers then it won’t run. This is not a major issue, as most common layers used in deep learning models are supported.

Also OpenCV provides a way for you to define your own custom layers.

Using GPUs and Faster Backends to speed up OpenCV DNN Module

By default OpenCV’s DNN module runs on the default C++ implementation which itself is pretty fast but OpenCV further allows you to change this backend to increase the speed even more.

Option 1: Use NVIDIA GPU with CUDA backend in the DNN module:

If you have an NVIDIA GPU then great, you can use it with the DNN module. You can follow my OpenCV source installation guide to configure your NVIDIA GPU for OpenCV and learn how to use it. This will make your networks run several times faster.

Option 2: Use OpenCL based Intel GPUs:

If you have an OpenCL based GPU then you can use that as a backend. Although this can increase speed, in my experience I’ve seen speed gains only on 32 bit systems. To use OpenCL as a backend, see the last section of my OpenCV source installation guide linked above.

Option 3: Use Halide Backend:

As described in this post from learnOpenCV.com, for some time in the past using the Halide backend increased the speed, but then OpenCV’s engineers optimized the default C++ implementation so much that it actually became faster. So I don’t see a reason to use this backend now. Still, here’s how you configure Halide as a backend.

Option 4: Use Intel’s Deep Learning Inference Engine backend:

Intel’s Deep Learning Inference Engine backend is part of OpenVINO toolkit, OpenVINO stands for Open Visual Inferencing and Neural Network Optimization. OpenVINO is designed by Intel to speed up inference with neural networks, especially for tasks like classification, detection, etc. OpenVINO speeds up by optimizing the model in a hardware-agnostic way. You can learn to install OpenVINO here and here’s a nice tutorial for it.

What’s Next?

If you want to go forward from here and learn more advanced things and go into more detail, understand theory and code of different algorithms then be sure to check out our Computer Vision & Image Processing with Python Course (Urdu/Hindi). In this course, I go into a lot of detail regarding vision fundamentals and cover a plethora of algorithms and techniques to help you master Computer Vision.

The 3 month course contains:

✔ 125 Video Lectures
✔ Discussion Forums
✔ Quizzes
✔ 100+ High Quality Jupyter notebooks
✔ Practice Assignments
✔ Certificate of Completion

If you want to start a career in Computer Vision & Artificial Intelligence then this course is for you. One of the best things about this course is that the video lectures are in Urdu/Hindi Language without any compromise on quality, so there is a personal/local touch to it.

Summary:

In today’s tutorial, we went over a number of things regarding OpenCV’s DNN module, from using pre-trained models to optimizing for faster inference speed.

We also learned to perform a classification pipeline using densenet121.

This post should serve as an excellent guide for anyone trying to get started in Deep learning using OpenCV’s DNN module.

Finally, OpenCV’s DNN repo contains example Python scripts to run common networks for classification, text, object detection, and more. You can start utilizing the DNN module by using these scripts, and here are a few DNN Tutorials by OpenCV.

The main contributor for the DNN module in OpenCV is Dmitry Kurtaev and formerly it was Aleksandr Rybnikov, so big thanks to them and the rest of the contributors for making such a great module.

I hope you enjoyed today’s tutorial, feel free to comment and ask questions.




Emotion / Facial Expression Recognition with OpenCV.

A few weeks ago we learned how to do Super-Resolution using OpenCV’s DNN module. In today’s post we will perform Facial Expression Recognition, AKA Emotion Recognition, using the DNN module. The term emotion recognition is technically incorrect for this problem (I will explain why), but for the remainder of this post I’ll be using both terms, since emotion recognition is shorter and also good for SEO: people still search for emotion recognition while looking for facial expression recognition xD.

The post is structured in the following way:

  • First I will define Emotion Recognition & its importance.
  • Then I will discuss different approaches to tackle this problem.
  • Finally, we will Implement an Emotion Recognition pipeline using OpenCV’s DNN module. 

Emotion Recognition Or Facial Expression Recognition

Now let me start by clarifying what I meant when I said this problem is incorrectly called emotion recognition. By saying that you’re doing emotion recognition, you’re implying that you’re actually finding the emotion of a person, whereas the typical AI-based emotion recognition systems you’ll find around, including the one that we’re going to build, look only at a single image of a person’s face to determine the emotion of that person. In reality, our expression may at times exhibit what we feel, but not always. People may smile for a picture, or someone may have a face that inherently looks gloomy & sad, but that doesn’t represent the person’s emotion.

So if we were to build a system that actually recognizes the emotions of a person then we would need to do more than look at a single face image. We would also consider the body language of the person through a series of frames, so the network would be a combination of an LSTM & a CNN. For a more robust system, we may also incorporate a voice tone recognition AI, as the tone of voice and speech patterns tell a lot about a person’s feelings.

Watch this part of the interview of Lisa Feldman Barrett, who debunks these so-called emotion recognition systems.

Since today we’ll only be looking at a single face image, it’s better to call our task Facial Expression Recognition rather than Emotion Recognition.

Facial Expression Recognition Applications:

Monitoring facial expressions of several people over a period of time provides great insights if used carefully, so for this reason we can use this technology in the following applications.

1: Smart Music players that play music according to your mood:

Think about it, you come home after having a really bad day, you lie down on the bed looking really sad & gloomy and then suddenly just the right music plays to lift up your mood.

2: Student Mood Monitoring System:

A system that cleverly averages the expressions of multiple students over a period of time can estimate how a particular topic or teacher is impacting students: does the topic being taught stress out the students, or is a particular session from a teacher a joyful experience for them?

3: Smart Advertisement Banners:

Think about smart advertisement banners that have a camera attached to them. When a commercial airs, the banner checks the real-time facial expressions of the people watching that ad and informs the advertiser whether the ad had the desired effect or not. Similarly, companies can get feedback on whether customers liked their products or not without even asking them.

Also, check out this video in which the performance of a new Ice Cream flavor is tested on people using their expressions.

These are just some of the applications off the top of my head; if you start thinking about it you can come up with more use cases. One thing to remember is that you have to be really careful about how you use this technology. Use it as an assistive tool and do not completely rely on it. For example, don’t deploy it at an airport and start interrogating every traveler whose face triggers an Angry expression on the system for a couple of frames.

Facial Expression Recognition Approaches:

So let’s talk about the ways we could go about recognizing someone’s facial expressions. We will look at some classical approaches first then move on to deep learning.

Haar Cascades based Recognition:

Perhaps the oldest method that could work is Haar Cascades. These Haar Cascades, also called the Viola-Jones classifier, are an older object detection technique proposed by Paul Viola and Michael Jones in 2001. It is a machine learning-based approach where a cascade is trained from a lot of positive and negative images and is then used to detect objects in images.

The most popular use of these cascades is as a face detector which is still used today, although there are better methods available. 

Now instead of using face detection, we could train a cascade to detect expressions. Since you can only train a single class with a cascade, you’ll need multiple cascades. A better way to go about it is to first perform face detection and then look for different features inside the face ROI, like detecting a smile with this smile detection cascade. You can also train a frown detector and so on.

Truth be told, this method is so weak that I wouldn’t even try experimenting with it in this day and age, but since people have used it in the past, I’m just putting it out there.

Fisher, Eigen & LBPH based Recognition:

OpenCV’s face module (part of the opencv-contrib package) has 3 different face recognition algorithms: Eigenfaces face recognizer, Fisherfaces face recognizer, and Local Binary Patterns Histograms (LBPH) face recognizer.

If you’re wondering why I am mentioning face recognition algorithms in a facial expression recognition post, understand this: these algorithms can extract some really interesting features like principal components and local histograms, which you can then feed into an ML classifier like an SVM. So in theory, you can repurpose them for emotion recognition, only this time the target classes are not the identities of people but facial expressions. This will work best if you have a few classes, ideally 2-3. I haven’t seen many people work on emotion recognition like this, but take a look at this post in which a guy uses Fisherfaces for facial expression recognition.

Again I would mention this is not a robust approach, but would work better than the previous one.

Histogram of Oriented Gradients based Recognition (HOG):

Similar to the above approach, instead of using the face module to extract features you can extract HOG features of faces; HOG based features are really effective. After extracting HOG features you can train an SVM or any other machine learning classifier on top of them.

Custom Features with Landmark Detection:

One of the easiest and most effective ways to create an emotion recognition system is to use a landmark detector like the one in dlib, which allows you to detect 68 important landmarks on the face.

By using this detector you can extract facial features like eyes, eyebrows, mouth, etc. Now you can take custom measurements of these features like measuring the distance between the lip ends to detect if the person is smiling or not. Similarly, you can measure if the eyes are wide open or not, indicating surprise or shock.

Now there are two ways to go about it, either you can send these custom measurements to an ML classifier and let it learn to predict emotions based on these measurements or you use your own heuristics to determine when to call it happy, sad etc based on the measurements.

I do think the former approach is more effective than the latter. But if you’re just determining a singular emotion like if a person is smiling or not then it’s easier to use heuristics.
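As a rough sketch of the heuristic route, the snippet below measures the distance between the lip corners and normalizes it by the face width, using the usual index layout of the 68-point landmark model (0 and 16 for the jaw ends, 48 and 54 for the mouth corners). The function names and the 0.45 threshold are made up for illustration; you would need to tune the threshold on your own data.

```python
import numpy as np

# Indices in the usual 68-point landmark layout (0-based)
LEFT_JAW, RIGHT_JAW = 0, 16
LEFT_MOUTH, RIGHT_MOUTH = 48, 54

def mouth_width_ratio(landmarks):
    """landmarks: array-like of shape (68, 2) holding (x, y) points."""
    pts = np.asarray(landmarks, dtype=np.float32)
    mouth_width = np.linalg.norm(pts[RIGHT_MOUTH] - pts[LEFT_MOUTH])
    face_width = np.linalg.norm(pts[RIGHT_JAW] - pts[LEFT_JAW])
    return float(mouth_width / face_width)

def is_smiling(landmarks, threshold=0.45):
    # The threshold is a hypothetical value, not a calibrated one
    return mouth_width_ratio(landmarks) > threshold
```

Dividing by the face width keeps the measurement independent of how close the person is to the camera. For the ML route you would collect several such ratios into a feature vector and train a classifier on them instead of hard-coding thresholds.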

Deep Learning based Recognizer:

It should not come as a surprise that the state-of-the-art approach to detecting emotions is a deep learning-based approach. Let me explain how you would create a simple yet effective emotion recognizer. You would simply train a Convolutional Neural Network (CNN) on different facial expression images (ideally thousands of images for each class/emotion), and after training, show it new samples. If done right, it would perform better than all the approaches I’ve mentioned above.

Now that we have discussed different approaches, let’s move on to the coding part of the blog. 

Facial Expression Recognition in OpenCV

We will be using a deep learning classifier that will be loaded into the OpenCV DNN module. The authors trained this model using the Microsoft Cognitive Toolkit (formerly CNTK) and then converted it to the ONNX (Open Neural Network Exchange) format.

ONNX format allows developers to move models between different frameworks such as CNTK, Caffe2, Tensorflow, PyTorch etc.

There is also a javascript version of this model (version 1.2) with a live demo which you can check out here. In this post we will be using version 1.3 which has a better performance.

You can also look at the original source code used to train this model here, the authors explained the architectural details of their model in their research paper titled Training Deep Networks for Facial Expression Recognition with Crowd-Sourced Label Distribution.

In the paper, the authors demonstrate training a deep CNN using 4 different approaches: majority voting, multi-label learning, probabilistic label drawing, and cross-entropy loss. The model that we are going to use today was trained using cross-entropy loss, which according to the authors’ conclusion was one of the best performing models.

The model was trained on the FER+ dataset. FER was the standard dataset for the emotion recognition task, but in FER+ each image has been labeled by 10 crowd-sourced taggers, which provides better quality ground truth labels for still image emotion than the original FER labels.

More information about the ONNX version of the model can be found here.

The input to our emotion recognition model is a grayscale image of 64×64 resolution. The output is the probabilities of 8 emotion classes: neutral, happiness, surprise, sadness, anger, disgust, fear, and contempt.

Here’s the architecture of the model.

Here are the steps we would need to perform:

  1. Initialize the DNN module.
  2. Read the image.
  3. Detect faces in the image.
  4. Pre-process all the faces.
  5. Run a forward pass on all the faces.
  6. Get the predicted emotion scores and convert them to probabilities.
  7. Finally get the emotion corresponding to the highest probability

Make sure you have the following Libraries Installed.

  • OpenCV (preferably version 4.0 or above)
  • Numpy
  • Matplotlib
  • bleedfacedetector

Bleedfacedetector is my face detection library which can detect faces using 4 different algorithms. You can read more about the library here.

You can install it by doing:

pip install bleedfacedetector

Before installing bleedfacedetector make sure you have OpenCV & Dlib installed.

pip install opencv-contrib-python

To install dlib you can do:

pip install dlib
OR
pip install dlib==19.8.1

Download Code for this post

Download Resource Guide for this post

Directory Hierarchy

You can go ahead and download the source code from the download code section. After downloading the zip folder, unzip it and you will have the following directory structure.

You can now run the Jupyter notebook Facial Expression Recognition.ipynb and start executing each cell as follows.

Import Libraries



Initialize DNN Module

To use Models in ONNX format, you just have to use cv2.dnn.readNetFromONNX(model) and pass the model inside this function.



Read Image

This is our image on which we are going to perform emotion recognition.

  • Line 2: We’re reading the image from disk.
  • Line 5-6: We’re setting the figure size and showing the image with matplotlib. [:,:,::-1] reverses the image channels so we can show OpenCV BGR images properly in matplotlib.



Define the available classes / labels

Now we will create a list of all 8 available emotions that we need to detect.



Detect faces in the image

The next step is to detect all the faces in the image. Since our target image only contains a single face, we will extract the first face we find.

Line 4: We’re using an SSD based face detector with 20% filter confidence to detect faces, you can easily swap this detector with any other detector inside bleedfacedetector by just changing this line.

Line 7: We’re extracting the x,y,w,h coordinates from the first face we found in the list of faces.

Line 10-13: We’re padding the face by a value of 3. This expands the face ROI boundaries so the model looks at a larger face image when predicting. I’ve seen this improve results in a lot of cases, although it is not required.
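The padding step can be sketched roughly as below. This is not the exact code from the notebook; pad_face_box is a hypothetical helper name, and the clamping to the image borders is my own addition so the crop never goes out of bounds.

```python
def pad_face_box(x, y, w, h, padding, image_shape):
    """Expand an (x, y, w, h) box by `padding` pixels, clamped to the image."""
    img_h, img_w = image_shape[:2]
    x1 = max(x - padding, 0)
    y1 = max(y - padding, 0)
    x2 = min(x + w + padding, img_w)
    y2 = min(y + h + padding, img_h)
    return x1, y1, x2, y2

# Example with a made-up detection on a 100x100 image
x1, y1, x2, y2 = pad_face_box(10, 20, 30, 40, padding=3, image_shape=(100, 100, 3))
print(x1, y1, x2, y2)
```

The padded face is then simply image[y1:y2, x1:x2].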


Padded Vs Non Padded Face

Here you can see what the final face ROI looks like when it’s padded and when it’s not padded.

Pre-Processing Image

Before you pass the image to a neural network you perform some image processing to get the image in the right format. So the first thing we need to do is convert the face from BGR to Grayscale then we’ll resize the image to be of size 64x64. This is the size that our network requires. After that we’ll reshape the face image into (1, 1, 64, 64), this is the final format which the network will accept.

Line 2: Convert the padded face into GrayScale image
Line 5: Resize the GrayScale image into 64 x 64
Line 8: Finally we are reshaping the image into the required format for our model


Input the preprocessed Image to the Network



Forward Pass

Most of the computations take place in this step. This is where the image goes through the whole neural network.



Check the output

As you can see, the model outputs scores for each emotion class.

Shape of Output: (1, 8)
[[ 0.59999390 -0.05662632 7.5.22 -3.5109.508 -0.33268.593 -3.967.581.5 9.2001578 -3.1812003 ]]



Apply Softmax function to get probabilities:

We will convert the model scores to class probabilities between 0-1 by applying a Softmax function on it.

[9.1010029e-04 4.7197891e-04 9.6490067e-01 1.491846e-05
3.5819356e-04 9.4487186e-06 3.3313509e-02 2.1165248e-05]


Get Predicted emotion

Predicted Emotion is: Surprise
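The last two steps, softmax followed by picking the most probable class, can be sketched as follows. The scores here are made-up values of the same shape as the model output, not the real output shown above, and the label order follows the class order listed earlier in this post.

```python
import numpy as np

EMOTIONS = ["Neutral", "Happiness", "Surprise", "Sadness",
            "Anger", "Disgust", "Fear", "Contempt"]

def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    scores = np.asarray(scores, dtype=np.float64).ravel()
    exp = np.exp(scores - scores.max())   # subtract max for numerical stability
    return exp / exp.sum()

# Hypothetical scores with shape (1, 8)
scores = np.array([[0.5, -0.1, 7.0, -3.0, -0.5, -4.0, 4.0, -3.0]])
probs = softmax(scores)
predicted = EMOTIONS[int(np.argmax(probs))]
print("Predicted Emotion is:", predicted)
```

Subtracting the maximum score before exponentiating does not change the result, it just avoids overflow for large scores.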


Display Final Result

We already have the correct prediction from the last step, but to make it cleaner we will display the final image with the predicted emotion. We will also draw a bounding box over the detected face.

Creating Functions

Now that we have seen a step by step implementation of the network, we’ll create the following 2 Python functions.

Initialization Function: This function will contain parts of the network that will be set once, like loading the model.

Main Function: This function will contain all the rest of the code from preprocessing to postprocessing, it will also have the option to either return the image or display it with matplotlib.

Furthermore, the Main Function will be able to predict the emotions of multiple people in a single image, as we will be doing all the operations in a loop.

Initialization Function

Main Function

Set returndata = True when you just want the image. I usually do this when working with videos.



Initialize the Emotion Recognition

Call the initialization function once.



Calling the main function

Now pass in any image to the main function 

Real time emotion recognition on Video:

You can also take the main function that we created above and put it inside a loop, and it will start detecting facial expressions on a video. The code below detects emotions using your webcam in real time. Make sure to set returndata = True.

Conclusion:

Here’s the confusion matrix of the model from the author’s paper. As you can see this model is not good at predicting Disgust, Fear & Contempt classes.

You can try running the model on different images and you’ll agree with the above matrix that the last three classes are pretty difficult to predict. It’s also hard for us humans to differentiate between this many emotions based on just facial expressions; a lot of micro-expressions overlap between these classes, so it’s understandable why the algorithm would have a hard time differentiating between 8 different emotional expressions.

Improvement Suggestions:

Still, if you really want to detect some expressions that the model seems to fail on, then the best way to go about it is to train the model yourself on your own data. Ethnicity & color can make a lot of difference. Also, try removing some emotion classes so the model can focus only on those that you care about.

You can also try changing the padding value, this seems to help in some cases.

If you’re working on a live video feed then try to average the results of several frames instead of giving a new result on every new frame. 
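One simple way to do this averaging is to keep the probabilities of the last few frames in a fixed-size buffer and report the emotion with the highest mean probability. The class name ProbabilitySmoother and the window size are hypothetical choices for illustration.

```python
from collections import deque

import numpy as np

class ProbabilitySmoother:
    """Averages class probabilities over the most recent `window` frames."""

    def __init__(self, window=10):
        self.history = deque(maxlen=window)   # old frames drop out automatically

    def update(self, probs):
        self.history.append(np.asarray(probs, dtype=np.float64))
        return np.mean(list(self.history), axis=0)

# Tiny demo with two classes and a window of 2 frames
smoother = ProbabilitySmoother(window=2)
smoother.update([1.0, 0.0])
averaged = smoother.update([0.0, 1.0])
print(averaged)
```

In a video loop you would call update with the softmax output of every frame and take the argmax of the returned average, which makes the displayed label flicker a lot less.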

Summary:

In this tutorial we first learned about the Emotion Recognition problem, why it’s important, and what are the different approaches we could take to develop such systems.

Then we learned to perform emotion recognition using OpenCV’s DNN module. After that, we went over some ways on how to improve our results.

I hope you enjoyed this tutorial. If you have any questions regarding this post then please feel free to comment below and I’ll gladly answer them.




Super Resolution with OpenCV

Have you seen those sci-fi movies in which the detective tells the techie to zoom in on an image of the suspect and run an enhancement program, and suddenly that part of the image is magically enhanced to a higher resolution instead of being pixelated?

Feel free to take a look at a compilation of those exact scenes below.


It’s also absurd how many times they all get a reflection of something in the video. Anyway, the point is that in the past few years we have made that aspect of sci-fi a reality. Today, with deep learning methods, we can actually enhance many low-resolution images to a high-resolution version, sometimes even as high as 8x resolution. This means you can take a 224×224 image and make it 1792×1792 without any loss in quality. This technique is called Super Resolution.

In this tutorial you will learn how to perform Super-Resolution with just OpenCV, specifically, we’ll be using OpenCV’s DNN module so you won’t be using any external frameworks like Pytorch or Tensorflow.

Before we start with the code I want to briefly discuss the amazing progress of Super-Resolution Algorithms. You can feel free to jump right into the code. But I would recommend giving the theory below a quick read even if you don’t understand all of it.

Technically speaking, Super Resolution can be defined as the class of algorithms that upscale an image without losing quality. How would you upscale an image without it? Well, you could simply resize the image and make it larger.

The simplest way to resize an image is Nearest Neighbor Interpolation. This just means you expand the pixels of the original image and then fill the gaps by copying the values of the nearest neighboring pixels.
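For an integer scale factor, nearest neighbor upscaling boils down to repeating every pixel, which a couple of NumPy calls can show. The function name is my own; in practice you would just call cv2.resize with interpolation=cv2.INTER_NEAREST.

```python
import numpy as np

def nearest_neighbor_upscale(image, factor):
    """Upscale by an integer factor by repeating each pixel along both axes."""
    return np.repeat(np.repeat(image, factor, axis=0), factor, axis=1)

small = np.array([[1, 2],
                  [3, 4]], dtype=np.uint8)
big = nearest_neighbor_upscale(small, 2)
print(big)
# [[1 1 2 2]
#  [1 1 2 2]
#  [3 3 4 4]
#  [3 3 4 4]]
```

Every output pixel is an exact copy of an input pixel, which is why the result looks blocky.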

Figure 1: Nearest Neighbor interpolation.

Of course the results would be terrible. You can do better by taking a weighted average of neighboring pixels instead of just copying them. This is essentially what Bilinear and Bicubic interpolation do.

SRCNN:

Still, the results above are blurred and you can easily tell that it’s not the original version. So can we make this upscaled version look like the original with some fancy algorithms? Well, the short answer is no. No smart function or algorithm will be able to replace the missing information. The best we can do is approximate and fill the gaps based on the neighboring pixels.

But fret not, Neural Networks come to the rescue. These algorithms can actually look at thousands of samples and remember the patterns so at the end of the day you don’t have to approximate the missing information, you can hallucinate based on the past seen data. 

Let me simplify this: what you can do is train a neural network by showing it samples of high res images along with their low res versions. In fact, that is what the SRCNN (Image Super-Resolution Using Deep Convolutional Networks) paper in 2015 by Chao Dong et al. did.

They simply input Low res (downscaled version) of images and made the model output a Higher resolution version and then compared it with the original High res version. The metric they were Optimizing was Peak signal to noise ratio (PSNR) score.

Figure 4: PSNR Equation

SRResNet:

SRResNet, which builds on the sub-pixel convolution idea from Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network (2016) by Wenzhe Shi et al., improved upon the previous SRCNN at two levels. First, it used residual blocks (convolution layers with skip connections) instead of normal convolution layers. Why? Well, the success of architectures like ResNets popularized the idea that residual blocks are more powerful than simple convolutional layers, as the skip connections make it possible to train much deeper networks.

Second, it shifted the upsampling step to the middle of the network. In any super-res architecture there has to be an upsampling step. If you’re using bicubic interpolation to upsample, you can only use it at the start or the end of the network; you can’t use it in the middle because it’s a fixed mathematical operation that is not learnable. If you want to have upsampling in between the layers, then you can go for transposed convolution layers. One problem though: transposed convolutions insert zeros to upscale the image, which wastes computation on empty values and can introduce artifacts. The way SRResNet got around that was to use sub-pixel convolutions to upscale. Without explaining what this layer is in detail, you just have to understand that this technique of upscaling is a learnable operation, so it improved results. This model was also optimizing the PSNR score.
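To give a feel for the rearrangement part of a sub-pixel convolution (often called pixel shuffle), here is a small NumPy sketch. A normal convolution first produces C·r² channels at low resolution, and this step then moves those channels into an r times larger spatial grid. The function name pixel_shuffle and the channel-first (C·r², H, W) layout are assumptions for illustration; the learnable part is the convolution that comes before it, which is not shown.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (C*r*r, H, W) feature maps into (C, H*r, W*r)."""
    c_in, h, w = x.shape
    c = c_in // (r * r)
    x = x.reshape(c, r, r, h, w)       # split channels into an r x r block
    x = x.transpose(0, 3, 1, 4, 2)     # -> (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)  # interleave blocks into space

# Four 1x1 channels become one 2x2 image
features = np.arange(4).reshape(4, 1, 1)
upscaled = pixel_shuffle(features, 2)
print(upscaled[0])
# [[0 1]
#  [2 3]]
```

No pixel values are invented here; the network learns what to put in those extra channels, which is what makes the upscaling learnable.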

Now Consider the PSNR score again: 

Figure 4: PSNR Equation

As you can see, the model will have a high PSNR score if the MSE (mean squared error) is low. Now this approach works well, but the problem here is that even with high PSNR scores the images do not necessarily look good to the human eye.
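For reference, the standard definition of PSNR is 10·log10(MAX² / MSE), where MAX is the largest possible pixel value (255 for 8-bit images). A minimal NumPy sketch of it, with a function name of my own choosing, looks like this:

```python
import numpy as np

def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized images."""
    ref = np.asarray(reference, dtype=np.float64)
    tst = np.asarray(test, dtype=np.float64)
    mse = np.mean((ref - tst) ** 2)
    if mse == 0:
        return float("inf")            # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Two images that differ by exactly 1 at every pixel, so MSE = 1
a = np.zeros((4, 4), dtype=np.uint8)
b = np.ones((4, 4), dtype=np.uint8)
score = psnr(a, b)
print(round(score, 2))
```

Since MSE sits in the denominator, a lower MSE always gives a higher PSNR, which is exactly why optimizing one is the same as optimizing the other.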

So the human perception of image quality is not exactly correlated with PSNR scores. Minimizing MSE would produce images that may be numerically closer to the original but may not necessarily look pleasing to the eye.

Consider all these images below that have almost equal MSE when compared to the reference image, even though we can clearly see that the image on the top is way closer to reference image than the bottom one.

Perceptual Loss:

MSE only cares about pixel-wise intensity differences, not the actual contents of the image. This problem can be solved by using a better metric, something called a perceptual loss (Perceptual Losses for Real-Time Style Transfer and Super-Resolution in 2016 by Justin Johnson et al.). It’s a kind of loss that correlates well with our perception of image quality. It works by passing the output of the model and the actual target image through a pre-trained model like a VGG variant, then computing the difference between the resulting feature maps and trying to minimize that. The layers of the pre-trained model that generate those feature maps stay frozen during the training of the super-res network. This perceptual loss is also called the content loss in style transfer networks.

Figure 7: Perceptual Loss.

EnhanceNet (Single Image Super-Resolution Through Automated Texture Synthesis) by Mehdi S. M. Sajjadi et al, 2017 is a great network that effectively implements this loss.


SRGAN:

Figure 8: SRGAN Architecture.

If we think about the Super Resolution problem for a moment, we can agree that we don’t care if the output image matches the original one exactly as long as it looks good. So why not use GANs (Generative Adversarial Networks) to generate realistic upscaled versions of the image? That’s what SRGAN (Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network) by Christian Ledig et al., 2017 did. Like all GANs, SRGAN had a generator that tries to generate realistic-looking upscaled versions of the original images, and it also had a discriminator that tries to tell whether a given image is the original high res version or a generated upscaled version. During training they both get better over time, and the generator learns to produce better looking upscaled versions of the image.

In addition to this, SRGAN also implemented a perceptual loss function. So this network, with the combination of generative loss & perceptual loss along with sub-pixel convolutions, produces really high-quality upscaled images.

ESRGAN:

An interesting variant of SRGAN is ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks by Xintao Wang et al., 2018. The paper made lots of simple yet interesting changes to the above network, like removing the batch norm layers, doing residual scaling, modifying the discriminator loss, and comparing feature maps before the activation function, and so the performance improved.

The network also computed a weighted average of two models, a GAN model and an MSE trained model, this way the Output looked real and also closely resembled the original image.
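This weighted average is taken over the parameters of the two networks, layer by layer. A minimal sketch, assuming both models share the same architecture and their weights are stored as dictionaries of NumPy arrays (the function and variable names are hypothetical), could look like this:

```python
import numpy as np

def interpolate_weights(mse_weights, gan_weights, alpha):
    """Blend two weight sets: alpha=0 gives the MSE model, alpha=1 the GAN model."""
    return {
        name: (1.0 - alpha) * mse_weights[name] + alpha * gan_weights[name]
        for name in mse_weights
    }

# Toy weights for a single layer
mse_model = {"conv1": np.array([0.0, 2.0])}
gan_model = {"conv1": np.array([4.0, 6.0])}
blended = interpolate_weights(mse_model, gan_model, alpha=0.25)
print(blended["conv1"])
```

Sliding alpha between 0 and 1 lets you trade off between the smoother MSE result and the sharper but more hallucinated GAN result without retraining anything.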

This network got pretty popular in the gaming community, where people used it to upscale old game graphics.

Figure 10: Nearest Neighbor and ESRGAN in Gaming.

Other Areas Of Super Resolution:

Domain/Task Specific Super Resolution: 

Needless to say, if you train a Super Res network on a certain type of data, it will perform really well on that type. So people have trained really powerful Super Res networks on domain-specific problems. For example, by training only on faces and utilizing face priors, you get a network (like Pixel Recursive Super Resolution) which can generate plausible high res face images from a very low res image. Of course, these types of networks can’t be used for CSI-style use cases, as the details are totally made up by the algorithm.

Progressive Networks:

There are also progressive networks that break the training into steps so that you can reach a really high resolution. For example, Progressive Face Super-Resolution via Attention to Facial Landmark (2019) can improve the resolution by 8x.

Multi-Image Super-Resolution:

All the methods discussed above belong to the “Single Image Super-Resolution” category. Most of the interesting papers in SR are in this category, but there is another area called “Multi-Image Super-Resolution”, in which you have multiple images of the same scene with the camera slightly shifted, by a few subpixels, in each image. You then use the extra information from all those individual images to construct a high res version. In fact, Google’s Pixel 3 uses a Multi-Image SR algorithm that uses the slight shifts of handheld motion to produce its SR Zoom effects.

Note: Super resolution is a really popular subject and you’ll see a good number of research papers published each year in this area. There are other interesting papers that I have not discussed but the papers I have mentioned in essence capture the evolution of Super res networks.

Super Resolution with OpenCV Code




Now let’s start with the code. We are going to use OpenCV’s DNN module, which was introduced in OpenCV version 3 and has evolved a lot by version 4.2. This module lets you take pre-trained neural networks from popular frameworks like TensorFlow and PyTorch, or models exported to the ONNX format, and use them directly in OpenCV.

The Super Res model we’ll be using comes from “Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network” by Wenzhe Shi et al, 2016. Although it uses neither a Perceptual loss nor a generative loss, it’s still a really fast implementation because it uses Sub-Pixel Convolutions for upscaling. This model will upscale your image by 3x.

The model is in ONNX format (Open Neural Network Exchange). This is an industry-standard format for moving models between frameworks: you can train a model in PyTorch or another common framework, convert it to ONNX, and then convert it to TensorFlow or any other framework.
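
To make the sub-pixel idea concrete, here is a minimal NumPy sketch of the rearrangement step that turns r×r low-resolution feature maps into one image that is r times larger. It assumes the common depth-to-space channel ordering (the one PyTorch's PixelShuffle uses); the `pixel_shuffle` helper is written for illustration and is not taken from the model itself.

```python
import numpy as np

def pixel_shuffle(features, r):
    """Rearrange (C*r*r, H, W) feature maps into (C, H*r, W*r).

    Output pixel (h*r + i, w*r + j) of channel c is read from input
    channel c*r*r + i*r + j at position (h, w).
    """
    c_r2, h, w = features.shape
    c = c_r2 // (r * r)
    x = features.reshape(c, r, r, h, w)   # axes: (c, i, j, h, w)
    x = x.transpose(0, 3, 1, 4, 2)        # axes: (c, h, i, w, j)
    return x.reshape(c, h * r, w * r)

# Nine 224x224 feature maps become a single 672x672 map for r = 3.
features = np.random.rand(9, 224, 224).astype(np.float32)
upscaled = pixel_shuffle(features, r=3)
print(upscaled.shape)
```

Because the convolutions all run at the low resolution and only this cheap rearrangement produces the large image, the approach stays fast.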

OpenCV DNN module allows you to use models that are in ONNX format by using cv2.dnn.readNetFromONNX(). You can get a list of ONNX models from the ONNX Model Zoo.

Figure 14: ONNX model Conversion.

Here are the steps we would need to perform:

  1.  Initialize the DNN module.
  2.  Read & Pre-process the image.
  3.  Set the preprocessed image as input and do a forward pass with the model.
  4.  Post-process the results to get the final image

Make sure you have the following Libraries Installed.

  • OpenCV (preferably version 4.0 or above)
  • Numpy
  • Matplotlib

Download Code for this post

Download Resource Guide for this post

Directory Hierarchy

Make sure to download the zip folder containing the source code, images, & model from above. After downloading, extract the folder and run the Jupyter notebook kernel from there.

This is what our directory structure looks like: it has a Jupyter notebook, a media folder with images, and the model folder.

Import Libraries

Start by Importing the required libraries.

Initialize DNN Module

To use Models in ONNX format, you just have to use cv2.dnn.readNetFromONNX(model) and pass the model inside this function.

Read Image

This is our image on which we are going to perform super-resolution.

  • Line 2: We’re reading the image from disk.
  • Line 5-7: We’re setting the figure size and showing the image with matplotlib. [:,:,::-1] reverses the image channels so that OpenCV BGR images display properly in matplotlib.

Preprocessing the image

Before you pass an image to a neural network, you perform some image processing to get it into the right format. So the first thing we will do is resize the image to 224×224, which is the size our network requires. After that we’ll convert the image from BGR (OpenCV’s default channel order) to the YCbCr color format. Let’s take a look at the components of this format.

  • Y:  This is called the luma component. This channel encodes the brightness intensity of the image; you can think of it as the grayscale version of the image.
  • Cb: This is the blue-difference channel.
  • Cr: This is the red-difference channel.

So why are we doing this? If we tried to upscale an RGB (or BGR in the case of OpenCV) color image directly, we would need to train a network that learns to upscale each individual channel. Instead we can do something smarter: convert the image to a color format where you only need to manipulate the main intensity channel. The network we’re using has learned to upscale the Y channel. After it does that, all we do is upscale the Cb and Cr channels (these are responsible for color) using bicubic interpolation and merge them with the Y channel. This cuts the network’s work to 1/3. After this we do some formatting of the Y channel and finally normalize it by dividing by 255.0.

  • Line 2-5: We’re making a copy of the image and resizing it to the size the network requires.
  • Line 8-11: Changing the color channel to YCbCr & splitting it to individual channels, so we can just work with the Y channel.
  • Line 14-17: Formatting Y to the format that is acceptable by the network.
  • Line 20: Converting to float and normalizing the image as it was done in the original implementation.

Input the Blob Image to the Network 

Forward Pass

Most of the computation takes place in this step; on my PC it took 90 ms for a single pass. This is the step where the image goes through the whole neural network.

Post-processing

After the network outputs the results, you need to post-process them. Mostly you reverse what you did in the preprocessing step.

  • Line 2-5: We’re reversing what we did in the preprocessing step.
  • Line 8: Clipping to stay in the uint8 range and avoid artifacts in the final image from rounding.
  • Line 11-12: Resizing the color channels according to the ‘Y’ channel.
  • Line 15-18: Merging all channels and converting the image back to BGR.

Display Final Result

  • Line 5-7: We’re displaying both the bicubic and super-resolution version of the image in subplots.

Creating Functions

Now that we have seen a step-by-step implementation of the network, we’ll create the following 2 Python functions.

Initialization Function: This function will contain parts of the network that will be set once, like loading the model.
Main Function: This function will contain all the rest of the code from preprocessing to postprocessing, it will also have the option to either return the enhanced image or display it with matplotlib.

Initialization Function

Main Function

Set returndata = True when you just want the enhanced image. I usually do this when working with videos.

Initialize the Super Resolution

Call the initialization function once.

Calling the main function

Now pass in any image to the main function and you’ll see a comparison of both its Bicubic and super-resolution version.

Granted, the results are not that astonishing: it’s only doing 3x, and there are models that can do 8x or more. But it’s a starting point and it’s really fast, easily under 100 ms on a CPU. It’s also better than Bicubic interpolation, and most importantly you can use it directly in OpenCV. In the future I may consider writing a tutorial on other Super Resolution networks, but for that I may have to use PyTorch or TensorFlow.

Applications:

Super Resolution has some great applications.

Facial Recognition:  Super res algorithms can help in improving the accuracy of those Surveillance systems which have to perform facial recognition on low-resolution cameras. You can try this experiment yourself and see if this network helps you to improve performance on your facial recognition systems.

Satellite Imagery: Satellite images can be zoomed in further with less visible loss in quality and with no extra hardware, just by applying super-resolution to them.

Medical: It can also prove really effective in medical imaging. Especially those images for which you are limited by the available lenses.

Data Compression/Bandwidth saving: Imagine the bandwidth savings if you can download low res images and then view in high resolution using a mobile version of a super-resolution algorithm.

In fact, if you think about it, this technique has applications across many industries.

Note: There is still no generic Super-resolution algorithm that does well in all problem domains. If you take a Super Res network that was trained on a dataset of house pictures and test it on animals, it will do poorly. So almost all Super Res networks have their weaknesses. The best approach is to train a Super Res network on your own problem and then use it. The best part is that generating training pairs is literally as easy as downsizing an image.

What’s Next?

If you want to go forward from here and learn more advanced things and go into more detail, understand theory and code of different algorithms then be sure to check out our Computer Vision & Image Processing with Python Course (Urdu/Hindi). In this course, I go into a lot of detail regarding vision fundamentals and cover a plethora of algorithms and techniques to help you master Computer Vision.

If you want to start a career in Computer Vision & Artificial Intelligence then this course is for you. One of the best things about this course is that the video lectures are in Urdu/Hindi Language without any compromise on quality, so there is a personal/local touch to it.

Summary: 

So let’s quickly summarize what we did here. First, we discussed what Super-resolution is, then we went over a series of Super Res networks and saw how each algorithm improved over the previous one. We also discussed other areas of Super-Resolution like multi-image Super-resolution and domain-specific Super Res networks.

After that, we walked through a step-by-step pipeline for doing inference with a Super Res network inside the OpenCV DNN module.

I hope you enjoyed this tutorial. If you have any questions regarding this post then please feel free to comment below and I’ll gladly answer them.

Check out the Bleed AI Premium Subscribers membership. This gives you access to Graded Quizzes, Premium Colab Notebooks, Priority Support, Course discounts & Practice Assignments. You can sign up for the membership here. It’s free.