Emotion / Facial Expression Recognition with OpenCV.

A few weeks ago we learned how to do Super-Resolution using OpenCV’s DNN module; in today’s post we will perform Facial Expression Recognition, AKA Emotion Recognition, using the same module. Although the term emotion recognition is technically incorrect for this problem (I will explain why), I’ll be using both terms for the remainder of this post, since emotion recognition is shorter and also good for SEO, as people still search for emotion recognition while really looking for facial expression recognition xD.

The post is structured in the following way:

  • First, I will define Emotion Recognition and explain its importance.
  • Then I will discuss different approaches to tackle this problem.
  • Finally, we will implement an Emotion Recognition pipeline using OpenCV’s DNN module.

Emotion Recognition Or Facial Expression Recognition

Now let me start by clarifying what I meant when I said this problem is incorrectly called Emotion Recognition. By saying that you’re doing emotion recognition, you’re implying that you’re actually finding the emotion of a person, whereas a typical AI-based emotion recognition system, including the one we’re going to build, only looks at a single image of a person’s face to determine that person’s emotion. In reality, our expression may at times exhibit what we feel, but not always. People may smile for a picture, or someone may have a face that inherently looks gloomy and sad, but that doesn’t represent the person’s emotion.

So if we were to build a system that actually recognizes the emotions of a person, we would need to do more than look at a single face image. We would also consider the person’s body language across a series of frames, so the network would be a combination of an LSTM and a CNN. For an even more robust system, we might also incorporate voice tone recognition, since the tone of a voice and speech patterns tell a lot about a person’s feelings.

Watch this part of an interview with Lisa Feldman Barrett, who debunks these so-called emotion recognition systems.

Since today we’ll only be looking at a single face image, it’s better to call our task Facial Expression Recognition rather than Emotion Recognition.

Facial Expression Recognition Applications:

Monitoring the facial expressions of several people over a period of time can provide great insights if used carefully, which is why this technology lends itself to applications like the following.

1: Smart Music players that play music according to your mood:

Think about it, you come home after having a really bad day, you lie down on the bed looking really sad & gloomy and then suddenly just the right music plays to lift up your mood.

2: Student Mood Monitoring System:

A system that cleverly averages the expressions of multiple students over a period of time can estimate how a particular topic or teacher is impacting students: does the topic being taught stress the students out, or is a particular teacher’s session a joyful experience for them?

3: Smart Advertisement Banners:

Think about smart advertisement banners with a camera attached: when a commercial airs, the banner checks the real-time facial expressions of people watching the ad and informs the advertiser whether the ad had the desired effect. Similarly, companies could get feedback on whether customers liked their products without even asking them.

Also, check out this video in which the performance of a new Ice Cream flavor is tested on people using their expressions.

These are just some of the applications off the top of my head; if you start thinking about it, you can come up with more use cases. One thing to remember is that you have to be really careful about how you use this technology. Use it as an assistive tool and do not rely on it completely. For example, don’t deploy it at an airport and start interrogating every Black passenger who happens to trigger an ‘Angry’ expression on the system for a couple of frames.

Facial Expression Recognition Approaches:

So let’s talk about the ways we could go about recognizing someone’s facial expressions. We will look at some classical approaches first then move on to deep learning.

Haar Cascades based Recognition:

Perhaps the oldest method that could work is Haar Cascades. Haar Cascades, also called the Viola-Jones classifier, are a dated object detection technique introduced by Paul Viola and Michael Jones in 2001. It is a machine learning-based approach where a cascade is trained on a lot of positive and negative images and then used to detect objects in new images.

The most popular use of these cascades is as a face detector which is still used today, although there are better methods available. 

Now, instead of using face detection, we could train a cascade to detect expressions. Since you can only train a single class per cascade, you’ll need multiple cascades. A better way to go about it is to first perform face detection and then look for different features inside the face ROI, like detecting a smile with this smile detection cascade. You can also train a frown detector and so on.
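A rough, illustrative sketch of that idea using the cascades that ship with OpenCV (the scaleFactor/minNeighbors values are illustrative, not tuned):

import cv2

# Load the stock face and smile cascades that come with opencv-python.
face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
smile_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_smile.xml")

img = cv2.imread("face.jpg")                      # any test image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

for (x, y, w, h) in face_cascade.detectMultiScale(gray, 1.3, 5):
    roi = gray[y:y + h, x:x + w]                  # look for a smile only inside the face ROI
    smiles = smile_cascade.detectMultiScale(roi, scaleFactor=1.7, minNeighbors=20)
    print("Smiling" if len(smiles) > 0 else "Not smiling")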

Truth be told, this method is so weak that I wouldn’t even bother experimenting with it in this day and age, but since people have used it in the past, I’m mentioning it here.

Fisher, Eigen & LBPH based Recognition:

OpenCV’s built-in face module (cv2.face) has 3 different face recognition algorithms: the Eigenfaces face recognizer, the Fisherfaces face recognizer, and the Local Binary Patterns Histograms (LBPH) face recognizer.

If you’re wondering why I’m mentioning face recognition algorithms in a facial expression recognition post, understand this: these algorithms can extract some really interesting features, like principal components and local histograms, which you can then feed into an ML classifier such as an SVM. So in theory, you can repurpose them for emotion recognition, only this time the target classes are not the identities of people but facial expressions. This works best if you have only a few classes, ideally 2-3. I haven’t seen many people work on emotion recognition like this, but take a look at this post in which someone uses Fisherfaces for facial expression recognition.

Again, I would mention that this is not a robust approach, but it would work better than the previous one.

Histogram of Oriented Gradients (HOG) based Recognition:

Similar to the above approach, instead of using the face module to extract features, you can extract HOG features from faces; HOG-based features are really effective. After extracting HOG features, you can train an SVM or any other machine learning classifier on top of them.
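A minimal sketch of that idea using only OpenCV (the HOG window/block/cell sizes and the linear kernel are illustrative choices, not tuned values):

import cv2
import numpy as np

# HOG descriptor over 64x64 face crops: (winSize, blockSize, blockStride, cellSize, nbins).
hog = cv2.HOGDescriptor((64, 64), (16, 16), (8, 8), (8, 8), 9)

def hog_features(face_bgr):
    gray = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, (64, 64))
    return hog.compute(gray).flatten()

def train_expression_svm(face_crops, labels):
    # face_crops: list of BGR face images, labels: integer expression ids.
    samples = np.float32([hog_features(f) for f in face_crops])
    svm = cv2.ml.SVM_create()
    svm.setKernel(cv2.ml.SVM_LINEAR)
    svm.train(samples, cv2.ml.ROW_SAMPLE, np.int32(labels))
    return svm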

Custom Features with Landmark Detection:

One of the easiest and most effective ways to create an emotion recognition system is to use a landmark detector, like the one in dlib, which allows you to detect 68 important landmarks on the face.

By using this detector you can extract facial features like eyes, eyebrows, mouth, etc. Now you can take custom measurements of these features like measuring the distance between the lip ends to detect if the person is smiling or not. Similarly, you can measure if the eyes are wide open or not, indicating surprise or shock.

Now there are two ways to go about it: either you send these custom measurements to an ML classifier and let it learn to predict emotions based on them, or you use your own heuristics to decide when to call an expression happy, sad, etc. based on the measurements.

I do think the former approach is more effective than the latter, but if you’re just determining a single expression, like whether a person is smiling or not, then it’s easier to use heuristics, as in the sketch below.
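An illustrative heuristic using dlib’s 68-point predictor (the .dat file must be downloaded separately; the 0.45 threshold is a made-up value for demonstration):

import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

img = cv2.imread("face.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

for rect in detector(gray):
    shape = predictor(gray, rect)
    mouth_width = shape.part(54).x - shape.part(48).x   # distance between lip corners
    face_width = rect.right() - rect.left()
    # Crude heuristic: a wide mouth relative to the face suggests a smile.
    smiling = mouth_width / face_width > 0.45
    print("Smiling" if smiling else "Not smiling")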

Deep Learning based Recognizer:

It should not come as a surprise that the state-of-the-art approach to detecting emotions is a deep learning-based one. Let me explain how you would create a simple yet effective emotion recognition system. You would train a Convolutional Neural Network (CNN) on different facial expression images (ideally thousands of images for each class/emotion), and after training, show it new samples; if done right, it would perform better than all the approaches I’ve mentioned above.

Now that we have discussed different approaches, let’s move on to the coding part of the blog. 

Facial Expression Recognition in OpenCV

We will be using a deep learning classifier loaded into the OpenCV DNN module. The authors trained this model using the Microsoft Cognitive Toolkit (formerly CNTK) and then converted it to the ONNX (Open Neural Network Exchange) format.

The ONNX format allows developers to move models between frameworks such as CNTK, Caffe2, TensorFlow, PyTorch, etc.

There is also a JavaScript version of this model (version 1.2) with a live demo which you can check out here. In this post we will be using version 1.3, which performs better.

You can also look at the original source code used to train this model here; the authors explain the architectural details of their model in their research paper titled Training Deep Networks for Facial Expression Recognition with Crowd-Sourced Label Distribution.

In the paper, the authors demonstrate training a deep CNN using 4 different approaches: majority voting, multi-label learning, probabilistic label drawing, and cross-entropy loss. The model that we are going to use today was trained using cross-entropy loss, which according to the authors’ conclusion was one of the best-performing approaches.

The model was trained on the FER+ dataset. FER was the standard dataset for the emotion recognition task, but in FER+ each image has been labeled by 10 crowd-sourced taggers, which provides better-quality ground truth labels for still-image emotion than the original FER labels.

More information about the ONNX version of the model can be found here.

The input to our emotion recognition model is a grayscale image of 64×64 resolution. The output is the probabilities of 8 emotion classes: neutral, happiness, surprise, sadness, anger, disgust, fear, and contempt.

Here’s the architecture of the model.

Here are the steps we would need to perform:

  1. Initialize the DNN module.
  2. Read the image.
  3. Detect faces in the image.
  4. Pre-process all the faces.
  5. Run a forward pass on all the faces.
  6. Get the predicted emotion scores and convert them to probabilities.
  7. Finally, get the emotion corresponding to the highest probability.

Make sure you have the following Libraries Installed.

  • OpenCV (preferably version 4.0 or above)
  • Numpy
  • Matplotlib
  • bleedfacedetector

Bleedfacedetector is my face detection library which can detect faces using 4 different algorithms. You can read more about the library here.

You can install it by doing:

pip install bleedfacedetector

Before installing bleedfacedetector make sure you have OpenCV & Dlib installed.

pip install opencv-contrib-python

To install dlib you can do:

pip install dlib
OR
pip install dlib==19.8.1

Directory Hierarchy

You can go ahead and download the source code from the download code section. After downloading the zip folder, unzip it and you will have the following directory structure.

You can now run the Jupyter notebook Facial Expression Recognition.ipynb and start executing each cell as follows.

Import Libraries
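The original notebook cell isn’t reproduced here; a minimal set of imports consistent with the rest of the walkthrough would look like this (assuming the package is imported simply as bleedfacedetector):

import cv2
import numpy as np
import matplotlib.pyplot as plt
import bleedfacedetector as fd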



Initialize DNN Module

To use models in ONNX format, you just have to call cv2.dnn.readNetFromONNX(model) and pass the model path to this function.
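A minimal sketch (the file name and path are assumptions; point them at the ONNX file from the downloaded folder):

# Load the FER+ emotion model into the DNN module.
model = "model/emotion-ferplus-8.onnx"
net = cv2.dnn.readNetFromONNX(model)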



Read Image

This is our image on which we are going to perform emotion recognition.

  • Line 2: We’re reading the image from disk.
  • Line 5-6: We’re setting the figure size and showing the image with matplotlib; [:,:,::-1] reverses the image channels so OpenCV’s BGR images display properly in matplotlib (a sketch of this cell follows).
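A cell consistent with those line references might look like this (the image path is an assumption):

image = cv2.imread("media/image.jpg")

plt.figure(figsize=(10, 10))
plt.imshow(image[:, :, ::-1])   # BGR -> RGB for matplotlib
plt.axis("off")
plt.show()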



Define the available classes / labels

Now we will create a list of all 8 available emotions that we need to detect.
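For reference, a list in the model’s documented output order would be:

emotions = ['Neutral', 'Happiness', 'Surprise', 'Sadness', 'Anger', 'Disgust', 'Fear', 'Contempt']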



Detect faces in the image

The next step is to detect all the faces in the image. Since our target image only contains a single face, we will extract the first face we find.

Line 4: We’re using an SSD-based face detector with 20% filter confidence to detect faces; you can easily swap this detector with any other detector inside bleedfacedetector by changing just this line.

Line 7: We’re extracting the x, y, w, h coordinates of the first face found in the list of faces.

Line 10-13: We’re padding the face by a value of 3, which expands the face ROI boundaries so the model looks at a slightly larger face image when predicting. I’ve seen this improve results in a lot of cases, although it is not required. A sketch of this cell follows.
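A sketch consistent with those line references, assuming bleedfacedetector exposes an ssd_detect(image, conf=...) helper that returns (x, y, w, h) boxes:

# Detect faces and grab the first one.
faces = fd.ssd_detect(image, conf=0.2)

x, y, w, h = faces[0]

# Pad the face ROI slightly before classification.
padding = 3
face = image[y - padding: y + h + padding, x - padding: x + w + padding]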


Padded Vs Non Padded Face

Here you can see what the final face ROI looks like when it’s padded and when it’s not padded.

Pre-Processing Image

Before you pass an image to a neural network, you perform some processing to get it into the right format. The first thing we need to do is convert the face from BGR to grayscale, then resize the image to 64x64, which is the size our network requires. After that, we reshape the face image into (1, 1, 64, 64), the final format the network will accept.

Line 2: Convert the padded face into a grayscale image.
Line 5: Resize the grayscale image to 64 x 64.
Line 8: Finally, reshape the image into the required format for our model. A sketch of this cell follows.
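A sketch of those three steps (the float32 cast is my own addition so the DNN module happily accepts the blob):

gray = cv2.cvtColor(face, cv2.COLOR_BGR2GRAY)

resized_face = cv2.resize(gray, (64, 64))

processed_face = resized_face.reshape(1, 1, 64, 64).astype(np.float32)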


Input the preprocessed Image to the Network
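This is a single call, continuing from the previous cells:

net.setInput(processed_face)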



Forward Pass

Most of the computation takes place in this step; this is where the image goes through the whole neural network.
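The pass itself is one line:

output = net.forward()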



Check the output

As you can see, the model outputs scores for each emotion class.
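Inspecting the output is just a couple of print statements, which produce something like the values shown below:

print("Shape of Output:", output.shape)
print(output)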

Shape of Output: (1, 8)
[[ 0.59999390 -0.05662632  7.5662196  -3.5109508  -0.33268593 -3.9675815   4.2001578  -3.1812003 ]]



Apply Softmax function to get probabilities:

We will convert the model’s scores to class probabilities between 0 and 1 by applying a softmax function to them.
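A numerically stable softmax sketch, whose result is shown below:

scores = output[0]
prob = np.exp(scores - scores.max())   # subtract the max for numerical stability
prob = prob / prob.sum()
print(prob)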

[9.1010029e-04 4.7197891e-04 9.6490067e-01 1.491846e-05
3.5819356e-04 9.4487186e-06 3.3313509e-02 2.1165248e-05]


Get Predicted emotion
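This is just an argmax over the probabilities, roughly:

predicted_emotion = emotions[prob.argmax()]
print("Predicted Emotion is: " + predicted_emotion)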

Predicted Emotion is: Surprise


Display Final Result

We already have the correct prediction from the last step, but to make it cleaner we will display the final image with the predicted emotion and also draw a bounding box over the detected face.
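A sketch of this cell, continuing from the detection step (the colors, font, and text position are illustrative choices):

cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.putText(image, predicted_emotion, (x, y - 10),
            cv2.FONT_HERSHEY_SIMPLEX, 0.9, (0, 255, 0), 2)

plt.figure(figsize=(10, 10))
plt.imshow(image[:, :, ::-1])
plt.axis("off")
plt.show()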

Creating Functions

Now that we have seen a step-by-step implementation of the pipeline, we’ll create the following 2 Python functions.

Initialization Function: This function will contain parts of the network that will be set once, like loading the model.

Main Function: This function will contain all the rest of the code from preprocessing to postprocessing, it will also have the option to either return the image or display it with matplotlib.

Furthermore, the Main Function will be able to predict the emotions of multiple people in a single image, as we will be doing all the operations in a loop.

Initialization Function
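A sketch of such a function, with an assumed name and model path:

def init_emotion(model_path="model/emotion-ferplus-8.onnx"):
    # Load the ONNX model once and define the label list.
    global net, emotions
    emotions = ['Neutral', 'Happiness', 'Surprise', 'Sadness',
                'Anger', 'Disgust', 'Fear', 'Contempt']
    net = cv2.dnn.readNetFromONNX(model_path)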

Main Function

Set returndata = True when you just want the image. I usually do this when working with videos.
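A sketch of the main function consistent with the step-by-step code above (the name, signature, and the bleedfacedetector call are assumptions):

def emotion(image, returndata=False, conf=0.2, padding=3):
    img = image.copy()
    faces = fd.ssd_detect(img, conf=conf)

    for (x, y, w, h) in faces:
        # Pad the face ROI, pre-process it, and run a forward pass.
        face = img[max(0, y - padding): y + h + padding,
                   max(0, x - padding): x + w + padding]
        gray = cv2.resize(cv2.cvtColor(face, cv2.COLOR_BGR2GRAY), (64, 64))
        net.setInput(gray.reshape(1, 1, 64, 64).astype(np.float32))
        scores = net.forward()[0]

        # Softmax and pick the most likely expression.
        prob = np.exp(scores - scores.max())
        prob = prob / prob.sum()
        label = emotions[prob.argmax()]

        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(img, label, (x, y - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.9, (0, 255, 0), 2)

    if returndata:
        return img

    plt.figure(figsize=(10, 10))
    plt.imshow(img[:, :, ::-1])
    plt.axis("off")
    plt.show()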



Initialize the Emotion Recognition

Call the initialization function once.
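For example:

init_emotion()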



Calling the main function

Now pass any image to the main function.
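For example (any image path works):

image = cv2.imread("media/image.jpg")
emotion(image)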

Real time emotion recognition on Video:

You can also take the main function we created above and put it inside a loop, and it will start detecting facial expressions in a video. The code below detects emotions using your webcam in real time. Make sure to set returndata = True.
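A minimal webcam loop along those lines:

cap = cv2.VideoCapture(0)

while True:
    ret, frame = cap.read()
    if not ret:
        break

    result = emotion(frame, returndata=True)
    cv2.imshow("Facial Expression Recognition", result)

    # Press 'q' to quit.
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()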

Conclusion:

Here’s the confusion matrix of the model from the authors’ paper. As you can see, this model is not good at predicting the Disgust, Fear, and Contempt classes.

If you try running the model on different images, you’ll likely agree with the matrix above that the last three classes are pretty difficult to predict, particularly because it’s also hard for us to differentiate between this many emotions based on facial expression alone. A lot of micro-expressions overlap between these classes, so it’s understandable why the algorithm has a hard time differentiating between 8 different emotional expressions.

Improvement Suggestions:

Still, if you really want to detect some expressions that the model seems to fail on, the best way to go about it is to train the model yourself on your own data. Ethnicity and skin tone can make a lot of difference. Also, try removing some emotion classes so the model can focus only on the ones you care about.

You can also try changing the padding value; this seems to help in some cases.

If you’re working on a live video feed then try to average the results of several frames instead of giving a new result on every new frame. 

What’s Next?


If you want to go further from here, learn more advanced topics, and understand the theory and code of different algorithms, then be sure to check out our Computer Vision & Image Processing with Python Course (Urdu/Hindi). In this course, I go into a lot of detail on vision fundamentals and cover a plethora of algorithms and techniques to help you master Computer Vision.

If you want to start a career in Computer Vision & Artificial Intelligence then this course is for you. One of the best things about this course is that the video lectures are in Urdu/Hindi Language without any compromise on quality, so there is a personal/local touch to it.

Summary:

In this tutorial we first learned about the Emotion Recognition problem, why it’s important, and the different approaches we could take to develop such systems.

Then we learned to perform emotion recognition using OpenCV’s DNN module. After that, we went over some ways to improve our results.

I hope you enjoyed this tutorial. If you have any questions regarding this post then please feel free to comment below and I’ll gladly answer them.


Check out Bleed AI Premium Subscriptions. This will give you access to graded quizzes, premium Colab notebooks, priority support, course discounts, and practice assignments. You can sign up for the membership here. It’s free.




Super Resolution with OpenCV

Have you seen those sci-fi movies in which the detective tells the techie to zoom in on an image of the suspect and run an enhancement program, and suddenly that part of the image is magically enhanced to a higher resolution instead of being pixelated?

Feel free to take a look at a compilation of those exact scenes below.


It’s also absurd how many times they manage to find a reflection of something in the video. Anyway, the point is that in the past few years we have made that aspect of sci-fi a reality. Today, with deep learning methods, we can actually enhance many low-resolution images to a high-resolution version, sometimes by as much as 8x. This means you can take a 224×224 image and make it 1792×1792 without any loss in quality. This technique is called Super Resolution.

In this tutorial you will learn how to perform Super-Resolution with just OpenCV. Specifically, we’ll be using OpenCV’s DNN module, so you won’t need any external frameworks like PyTorch or TensorFlow.

Before we start with the code I want to briefly discuss the amazing progress of Super-Resolution Algorithms. You can feel free to jump right into the code. But I would recommend giving the theory below a quick read even if you don’t understand all of it.

Technically speaking, Super Resolution can be defined as the class of algorithms that upscale an image without losing quality. How would you upscale an image without it? Well, you could say you can just resize the image and make it larger.

So when you typically resize an image, you use Nearest Neighbor Interpolation. This just means you expand the pixels of the original image and then fill the gaps by copying the values of the nearest neighboring pixels.

Figure 1: Nearest Neighbor interpolation.

Of course the results would be terrible. You can do better by taking a weighted average of neighboring pixels instead of just copying them, which is essentially what bilinear or bicubic interpolation does; a quick comparison in OpenCV is sketched below.
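If you want to see the difference yourself, OpenCV exposes these interpolation modes directly:

import cv2

img = cv2.imread("image.jpg")    # any test image

nearest = cv2.resize(img, None, fx=3, fy=3, interpolation=cv2.INTER_NEAREST)
bilinear = cv2.resize(img, None, fx=3, fy=3, interpolation=cv2.INTER_LINEAR)
bicubic = cv2.resize(img, None, fx=3, fy=3, interpolation=cv2.INTER_CUBIC)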

SRCNN:

Still, the results above are blurred and you can easily tell it’s not the original version. So can we make this upscaled version look like the original with some fancy algorithm? Well, the short answer is no. No smart function or algorithm will be able to replace the missing information. The best we can do is approximate and fill the gaps based on the neighboring pixels.

But fret not, neural networks come to the rescue. These algorithms can look at thousands of samples and remember the patterns, so at the end of the day you don’t have to approximate the missing information; you can hallucinate it based on previously seen data.

Let me simplify this: what you can do is train a neural network by showing it samples of high-res images along with their low-res versions. In fact, that is what the SRCNN (Image Super-Resolution Using Deep Convolutional Networks) paper by Chao Dong et al. (2015) did.

They simply fed in low-res (downscaled) versions of images, made the model output a higher-resolution version, and then compared it with the original high-res version. The metric they were optimizing was the Peak Signal-to-Noise Ratio (PSNR) score.

Figure 4: PSNR Equation
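The figure isn’t reproduced here, but the score it shows is defined as PSNR = 10 * log10( MAX_I^2 / MSE ), where MAX_I is the maximum possible pixel value (255 for 8-bit images) and MSE is the mean squared error between the reconstructed image and the reference.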

SRResNet:

SRResNet (Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network, Wenzhe Shi et al., 2016) improved upon the previous SRCNN on two levels. First, it used residual blocks (convolution layers with skip connections) instead of normal convolution layers. Why? Well, the success of architectures like ResNet popularized the idea that residual blocks are more powerful than simple convolutional layers, as they allow adding more layers without overfitting.

Second, it shifted the upsampling step to the middle of the network. In any super-res architecture there has to be an upsampling step. If you're using bicubic interpolation inside the network to upsample, you can only use it at the start or the end, not in the middle, because it's a fixed mathematical operation; it's not learnable. If you want upsampling in between the layers, you can use transposed convolution layers to upsample the image. One problem though: transposed convolutions add zeros to upscale the image, and you don't have any gradient information to tune this upscaling process. The way SRResNet got around that was by using sub-pixel convolutions to upscale. Without going into what this layer is, you just have to understand that this upscaling technique is a learnable operation, which improved results. This model also optimized the PSNR score.

Now Consider the PSNR score again: 

Figure 4: PSNR Equation

As you can see, the model will have a high PSNR score if the MSE (mean squared error) is low. Now this approach works well, but the problem here is that even with high PSNR scores the images do not necessarily look good to the human eye.

So image fidelity, or the human perception of image quality, is not exactly correlated with PSNR scores. Minimizing MSE produces images that may look more like the original but may not necessarily look pleasing to the eye.

Consider the images below, which all have almost equal MSE when compared to the reference image, even though we can clearly see that the image on the top is way closer to the reference image than the bottom one.

Perceptual Loss:

MSE only cares about pixel-wise intensity differences, not the actual contents of the image. This problem can be solved by using a better metric: something called a perceptual loss (Perceptual Losses for Real-Time Style Transfer and Super-Resolution, Justin Johnson et al., 2016). It’s a loss that correlates well with our perception of image quality. It works by passing both the output of the model and the actual target image through a pre-trained model such as a VGG variant, computing the difference between the resulting feature maps, and trying to minimize that difference. The layers of the pre-trained model that generate those feature maps for the loss calculation stay frozen during the training of the super-res network. This perceptual loss is also called the content loss in style transfer networks.

Figure 7: Perceptual Loss.

EnhanceNet (Single Image Super-Resolution Through Automated Texture Synthesis) by Mehdi S. M. Sajjadi et al, 2017 is a great network that effectively implements this loss.
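To make the idea concrete, here is a schematic sketch of a perceptual (content) loss, assuming PyTorch and torchvision are available (the choice of VGG16 and the cut-off layer are illustrative):

import torch
import torchvision

# Frozen VGG16 feature extractor, cut at an intermediate conv block.
vgg = torchvision.models.vgg16(pretrained=True).features[:16].eval()
for p in vgg.parameters():
    p.requires_grad = False

def perceptual_loss(sr_output, hr_target):
    # Compare feature maps instead of raw pixels.
    return torch.nn.functional.mse_loss(vgg(sr_output), vgg(hr_target))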


SRGAN:

Figure 8: SRGAN Architecture.

If we think about the Super Resolution problem for a moment, we can agree that we don’t care whether the output image matches the original exactly, as long as it looks good. So why not use GANs (Generative Adversarial Networks) to generate realistic upscaled versions of the image? That’s what SRGAN (Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network, Christian Ledig et al., 2017) did. Like all GANs, SRGAN had a generator that tries to produce realistic-looking upscaled versions of the original images, and a discriminator that tries to tell whether a given image is the original high-res version or a generated upscaled one. During training they both get better over time, and the generator learns to produce better-looking upscaled versions of the image.

In addition to this, SRGAN also used a perceptual loss function. With the combination of adversarial loss and perceptual loss, along with sub-pixel convolutions, this network produces really high-quality upscaled images.

ESRGAN:

An interesting variant of SRGAN is ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks by Xintao Wang et al., 2018. The paper made lots of simple yet interesting changes to the above network, like removing the batch norm layers, using residual scaling, modifying the discriminator loss, and comparing feature maps before the activation function, and performance improved as a result.

The network also computed a weighted average of two models, a GAN-trained model and an MSE-trained model; this way the output looked realistic and also closely resembled the original image.

This network got pretty popular in the gaming community, where people used it to upscale old gaming graphics.

Figure 10: Nearest Neighbor and ESRGAN in Gaming.

Other Areas Of Super Resolution:

Domain/Task Specific Super Resolution: 

Needless to say, if you train super-res on a certain type of data, it will perform really well on that type. So people have trained really powerful super-res networks on domain-specific problems, like training only on faces: by utilizing face priors, you get a network (like this one: Pixel Recursive Super Resolution) that can generate plausible high-res face images from a very low-res image. Of course, these types of networks can’t be used for CSI-style use cases, as the details are totally made up by the algorithm.

Progressive Networks:

There are also progressive networks that break the training into steps so you can achieve a really high resolution; for example, Progressive Face Super-Resolution via Attention to Facial Landmarks (2019) can improve the resolution by 8x.

Multi-Image Super-Resolution:

All the methods discussed above belong to the “Single Image Super-Resolution” category, and most of the interesting papers in SR fall into it, but there is another area called “Multi-Image Super-Resolution” in which you have multiple images of the same scene with the camera slightly shifted, by some sub-pixels, in each image. You then use that extra information from all those individual images to construct a high-res version. In fact, Google’s Pixel 3 uses a multi-image SR algorithm that exploits those slight shifts from handheld motion to produce its impressive SR zoom effects.

Note: Super resolution is a really popular subject and you’ll see a good number of research papers published each year in this area. There are other interesting papers that I have not discussed but the papers I have mentioned in essence capture the evolution of Super res networks.

Super Resolution with OpenCV Code




Now let’s start with the code. We are going to be using OpenCV’s DNN module, which was introduced in OpenCV version 3 and has evolved a lot by version 4.2. This module lets you load pre-trained neural networks from popular frameworks like TensorFlow, PyTorch, ONNX, etc. and use those models directly in OpenCV.

The super-res model we’ll be using comes from “Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network” by Wenzhe Shi et al., 2016. Although it uses neither a perceptual loss nor an adversarial loss, it’s still a really fast implementation because it uses sub-pixel convolutions for upscaling. This model will enhance your image by 3x.

The model is in the ONNX (Open Neural Network Exchange) format. This is an industry-standard format for exchanging models between frameworks, which means you can train a model in PyTorch or another common framework, convert it to ONNX, and then convert it to TensorFlow or any other framework.

The OpenCV DNN module allows you to use models in ONNX format via cv2.dnn.readNetFromONNX(). You can get a list of ONNX models from the ONNX Model Zoo.

Figure 14: ONNX model Conversion.

Here are the steps we would need to perform:

  1. Initialize the DNN module.
  2. Read & pre-process the image.
  3. Set the preprocessed image as input and do a forward pass with the model.
  4. Post-process the results to get the final image.

Make sure you have the following Libraries Installed.

  • OpenCV (preferably version 4.0 or above)
  • Numpy
  • Matplotlib

Directory Hierarchy

Make sure to download the zip folder containing the source code, images, & model from above. After downloading, extract the folder and run the Jupyter notebook kernel from there.

This is what our directory structure looks like: it has a Jupyter notebook, a media folder with images, and the model folder.

Import Libraries

Start by importing the required libraries.
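A minimal import cell would be:

import cv2
import numpy as np
import matplotlib.pyplot as plt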

Initialize DNN Module

To use models in ONNX format, you just have to call cv2.dnn.readNetFromONNX(model) and pass the model path to this function.
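A sketch (the model file name is an assumption; point it at the ONNX file inside your model folder):

model = "model/super_resolution.onnx"
net = cv2.dnn.readNetFromONNX(model)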

Read Image

This is our image on which we are going to perform super-resolution.

  • Line 2: We’re reading the image from disk.
  • Line 5-7: We’re setting the figure size and showing the image with matplotlib; [:,:,::-1] reverses the image channels so OpenCV’s BGR images display properly in matplotlib (a sketch of this cell follows).
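A cell consistent with those line references might look like this (the image path is an assumption):

image = cv2.imread("media/image.jpg")

plt.figure(figsize=(10, 10))
plt.imshow(image[:, :, ::-1])   # BGR -> RGB for matplotlib
plt.axis("off")
plt.show()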

Preprocessing the image

Before you pass an image to a neural network, you perform some processing to get it into the right format. The first thing we will do is resize the image to 224×224, which is the size our network requires. After that, we’ll convert the image from RGB to the YCbCr color format. Take a look at the components of this format.

  • Y: This is called the luma component. This channel encodes the brightness intensity of the image; you can think of it as the grayscale version of the image.
  • Cb: This is the blue-difference channel.
  • Cr: This is the red-difference channel.

So why are we doing this? If we tried to upscale an RGB (or BGR, in the case of OpenCV) color image, we would need a network that learns to upscale each individual channel. Instead of doing that, we can do something smarter: change the image to a color format where we only have to manipulate the main intensity channel. So the network we’re using has learned to upscale only the Y channel. After it does that, all we do is upscale the Cb and Cr channels (these are responsible for color) using bicubic interpolation and merge them with the upscaled Y channel. This cuts our work to a third. After this, we do some formatting of the Y channel and finally normalize it by dividing by 255.0.

  • Line 2-5: We’re making a copy of the image and resizing it to the size the network requires.
  • Line 8-11: Changing the color format to YCbCr and splitting it into individual channels, so we can work with just the Y channel.
  • Line 14-17: Formatting Y into the shape the network accepts.
  • Line 20: Converting to float and normalizing the image, as was done in the original implementation. A sketch of this cell follows.
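A sketch consistent with those line references:

# Copy and resize to the 224x224 input size the network expects.
img_copy = image.copy()
resized = cv2.resize(img_copy, (224, 224))

# OpenCV's constant is YCrCb (same channels, different order); split into
# Y, Cr, Cb and keep only the luma channel Y for the network.
ycrcb = cv2.cvtColor(resized, cv2.COLOR_BGR2YCrCb)
y, cr, cb = cv2.split(ycrcb)

# Reshape Y into the (1, 1, 224, 224) blob format.
blob = y.reshape(1, 1, 224, 224)

# Convert to float and normalize, as in the original implementation.
blob = blob.astype(np.float32) / 255.0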

Input the Blob Image to the Network 
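This is a single call, continuing from the previous cell:

net.setInput(blob)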

Forward Pass

Most of the computation takes place in this step; on my PC it took 90 ms for a single pass. This is where the image goes through the whole neural network.
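The pass itself is one line:

output = net.forward()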

Post-processing

After the network outputs the results, you need to post-process them. Mostly, you reverse what you did in the preprocessing step.

  • Line 2-5: We’re reversing what we did in the preprocessing step.
  • Line 8: Clipping to stay in the uint8 range and avoid artifacts in the final image from rounding.
  • Line 11-12: Resizing the color channels to match the upscaled ‘Y’ channel.
  • Line 15-18: Merging all channels and converting the image back to BGR. A sketch of this cell follows.
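A sketch consistent with those line references, continuing from the previous cells:

# Drop the batch/channel axes, scale back to 0-255 and clip to avoid artifacts.
upscaled_y = output[0, 0] * 255.0
upscaled_y = np.clip(upscaled_y, 0, 255).astype(np.uint8)

# Resize the color channels to the upscaled Y channel's size (672x672 for 3x).
h, w = upscaled_y.shape
upscaled_cr = cv2.resize(cr, (w, h), interpolation=cv2.INTER_CUBIC)
upscaled_cb = cv2.resize(cb, (w, h), interpolation=cv2.INTER_CUBIC)

# Merge the channels and convert back to BGR.
result = cv2.cvtColor(cv2.merge([upscaled_y, upscaled_cr, upscaled_cb]),
                      cv2.COLOR_YCrCb2BGR)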

Display Final Result

  • Line 5-7: We’re displaying both the bicubic and super-resolution versions of the image in subplots, as sketched below.
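Something along these lines, continuing from the previous cells:

bicubic = cv2.resize(resized, (w, h), interpolation=cv2.INTER_CUBIC)

plt.figure(figsize=(18, 9))
plt.subplot(1, 2, 1); plt.imshow(bicubic[:, :, ::-1]); plt.title("Bicubic")
plt.subplot(1, 2, 2); plt.imshow(result[:, :, ::-1]); plt.title("Super Resolution")
plt.show()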

Creating Functions

Now that we have seen the step-by-step implementation, we’ll create the following 2 Python functions.

Initialization Function: This function will contain parts of the network that will be set once, like loading the model.
Main Function: This function will contain all the rest of the code from preprocessing to postprocessing, it will also have the option to either return the enhanced image or display it with matplotlib.

Initialization Function
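A sketch, with an assumed function name and model path:

def init_super_res(model_path="model/super_resolution.onnx"):
    # Load the ONNX model once; the file name is an assumption.
    global sr_net
    sr_net = cv2.dnn.readNetFromONNX(model_path)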

Main Function

Set returndata = True when you just want the enhanced image. I usually do this when working with videos.
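A sketch consistent with the step-by-step code above (the name and signature are assumptions):

def super_res(image, returndata=False):
    # Pre-process: resize, convert to YCrCb and feed the normalized Y channel.
    resized = cv2.resize(image, (224, 224))
    y, cr, cb = cv2.split(cv2.cvtColor(resized, cv2.COLOR_BGR2YCrCb))
    sr_net.setInput(y.reshape(1, 1, 224, 224).astype(np.float32) / 255.0)

    # Forward pass and post-processing.
    out = sr_net.forward()[0, 0] * 255.0
    out = np.clip(out, 0, 255).astype(np.uint8)
    h, w = out.shape
    cr = cv2.resize(cr, (w, h), interpolation=cv2.INTER_CUBIC)
    cb = cv2.resize(cb, (w, h), interpolation=cv2.INTER_CUBIC)
    result = cv2.cvtColor(cv2.merge([out, cr, cb]), cv2.COLOR_YCrCb2BGR)

    if returndata:
        return result

    # Otherwise show a bicubic vs. super-resolution comparison.
    bicubic = cv2.resize(resized, (w, h), interpolation=cv2.INTER_CUBIC)
    plt.figure(figsize=(18, 9))
    plt.subplot(1, 2, 1); plt.imshow(bicubic[:, :, ::-1]); plt.title("Bicubic")
    plt.subplot(1, 2, 2); plt.imshow(result[:, :, ::-1]); plt.title("Super Resolution")
    plt.show()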

Initialize the Super Resolution

Call the initialization function once.
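For example:

init_super_res()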

Calling the main function

Now pass any image to the main function and you’ll see a comparison of its bicubic and super-resolution versions.
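For example (any image path works):

image = cv2.imread("media/image.jpg")
super_res(image)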

Granted, the results are not that astonishing; it’s only doing 3x, and there are models that can do 8x or more. But it’s a starting point, it’s really fast (easily under 100 ms on a CPU), it’s better than bicubic interpolation, and most importantly you can use it directly in OpenCV. In the future I may consider writing a tutorial on other Super Resolution networks, but for that I may have to use PyTorch or TensorFlow.

Applications:

Super Resolution has some great applications.

Facial Recognition: Super-res algorithms can help improve the accuracy of surveillance systems that have to perform facial recognition on low-resolution cameras. You can try this experiment yourself and see if this network helps improve the performance of your facial recognition system.

Satellite Imagery: All those satellite images can be further zoomed in without any loss in quality and with no extra hardware by just implementing super-resolution on them.

Medical: It can also prove really effective in medical imaging. Especially those images for which you are limited by the available lenses.

Data Compression/Bandwidth saving: Imagine the bandwidth savings if you can download low res images and then view in high resolution using a mobile version of a super-resolution algorithm.

In Fact if you think about it, this technique has limitless applications across many industries.

Note: There is still no generic super-resolution algorithm that does well in all problem domains. Meaning, if you take a super-res network that was trained on a dataset of house pictures and test it on animals, it will do poorly. Almost all super-res networks have their weaknesses. The best thing is to train a super-res network on your own problem and then use it. The best part is that generating labels for any image is literally as easy as downsizing the image.

What’s Next?


If you want to go further from here, learn more advanced topics, and understand the theory and code of different algorithms, then be sure to check out our Computer Vision & Image Processing with Python Course (Urdu/Hindi). In this course, I go into a lot of detail on vision fundamentals and cover a plethora of algorithms and techniques to help you master Computer Vision.

If you want to start a career in Computer Vision & Artificial Intelligence then this course is for you. One of the best things about this course is that the video lectures are in Urdu/Hindi Language without any compromise on quality, so there is a personal/local touch to it.

Summary: 

So let’s quickly summarize what we did here. First, we discussed what Super-Resolution is, then we went over a series of super-res networks and saw how each algorithm improved over the previous one. We also discussed other areas of Super-Resolution, like multi-image super-resolution and domain-specific super-res networks.

After that, we went step by step through a pipeline for doing inference with a super-res network inside the OpenCV DNN module.

I hope you enjoyed this tutorial. If you have any questions regarding this post then please feel free to comment below and I’ll gladly answer them.

Check out Bleed AI Premium Subscriptions. This will give you access to graded quizzes, premium Colab notebooks, priority support, course discounts, and practice assignments. You can sign up for the membership here. It’s free.