(LearnOpenCV) Classification with Localization: Convert any Keras Classifier to a Detector

(LearnOpenCV) Classification with Localization: Convert any Keras Classifier to a Detector

Image classification is used to solve several Computer Vision problems; right from medical diagnoses, to surveillance systems, on to monitoring agricultural farms.  There are innumerable possibilities to explore using Image Classification.

If you have completed the basic courses on Computer Vision, you are familiar with the tasks and routines involved in Image Classification tasks. Want to know more? Check out https://opencv.org/courses/.

Image Classification tasks follow a standard flow – where you pass an image to a deep learning model and it outcomes the class or the label of the object present.

(LearnOpenCV) Training a Custom Object Detector with DLIB & Making Gesture Controlled Applications

(LearnOpenCV) Training a Custom Object Detector with DLIB & Making Gesture Controlled Applications

In this article, you will learn how to build python-based gesture-controlled applications using AI. We will guide you all the way with step-by-step instructions. I’m sure you will have loads of fun and learn many useful concepts following the tutorial.

Specifically, you will learn the following:

  • How to train a custom Hand Detector with Dlib.
  • How to cleverly automate the data collection & annotation step with image processing so we don’t have to label anything.
  • How to convert normal PC applications like Games and Video Players to be controlled via hand gestures.

Here’s a demo of what we’ll be building in this Tutorial:

Training a Object Detector with Tensorflow Object Detection API

Training a Object Detector with Tensorflow Object Detection API

Tensorflow Object Detection API

This is a really descriptive and interesting tutorial, let me highlight what you will learn in this tutorial about Tensorflow Object Detection API.

  1. A Crystal Clear step by step tutorial on training a custom object detector.
  2. A method to download videos and create a custom dataset out of that.
  3. How to use the custom trained network inside the OpenCV DNN module so you can get rid of the TensorFlow framework.

Plus here are two things you will receive from the provided source code:

  1. A Jupyter Notebook that automatically downloads and installs all the required things for you so you don’t have to step outside of that notebook.
  2. A Colab version of the notebook that runs out of the box, just run the cells and train your own network.

I will stress this again that all of the steps are explained in a neat and digestible way. I’ve you ever plan to do Object Detection then this is one tutorial you don’t want to miss.

As mentioned, by downloading the Source Code you will get 2 versions of the notebook: a local version and a colab version.

So first we’re going to see a complete end to end pipeline for training a custom object detector on our data and then we will use it in the OpenCV DNN module so we can get rid of the heavy Tensorflow framework for deployment. We have already discussed the advantages of using the final trained model in OpenCV instead of Tensorflow in my previous post.

Today’s post is the 3rd tutorial in our 3 part Deep Learning with OpenCV series. All three posts are titled as:

  1. Deep Learning with OpenCV DNN Module, A Comprehensive Guide
  2. Training a Custom Image Classifier with OpenCV, Converting to ONNX, and using it in OpenCV DNN module.
  3. Training a Custom Object Detector with Tensorflow and using it with OpenCV DNN (This Post)

Now to follow along and to learn the full pipeline of training a custom object detector with TensorFlow you don’t need to read the previous two tutorials but when we move to the last part of this tutorial and use the model in OpenCV DNN then those tutorials would help.

What is Tensorflow Object Detection API (TFOD) :

To train our custom Object Detector we will be using TensorFlow API (TFOD API). The Tensorflow Object Detection API is a framework built on top of TensorFlow that makes it easy for you to train your own custom models.

The workflow generally goes like this :

You take a pre-trained model from this model zoo and then fine-tune the model for your own task.
Fine-tuning is a transfer learning method that allows you to utilize features of the model which it learned from a different task to your own task. Because of this, you won’t require thousands of images to train the network, only a few hundred will suffice.
If you’re someone who prefers PyTorch instead of Tensorflow then you may want to look at Detectron 2

For this Tutorial I will be using TensorFlow Object Detection API version 1, If you want to know why we are using version 1 instead of the recently released version 2, then you can read below optional explanation.

Tensorflow Object Detection API 1

Why we’re using Tensorflow Object Detection API Version 1? (OPTIONAL READ)


Tensorflow Object Detection API v2 comes with a lot of improvements, the new API contains some new State of The ART (SoTA) models, some pretty good changes including New binaries for train/eval/export that are eager mode compatible. You can check out this release blog from the Tensorflow Object Detection API developers.

But the thing is because TF 2 no longer supports sessions so you can’t easily export your model to frozen_inference_graph, furthermore TensorFlow depreciates the use of frozen_graphs and promotes saved_model format for future use cases.

For TensorFlow, this is the right move as the saved_model format is an excellent format.

So what’s the issue?

The problem is that OpenCV only works with frozen_inference_graphs and does not support saved_model format yet, so for this reason if your end goal is to deploy it in OpenCV then you should use Tensorflow Object Detection API v1. Although you can still generate frozen_graphs, those graphs produce errors with OpenCV most of the time, we’ve tried limited experiments with TF2 so feel free to carry out your experiments but do share if you find something useful.

Now One great thing about this situation is that the Tensorflow team decided to keep the whole pipeline and code of Tensorflow Object Detection API 2 almost identical to Tensorflow Object Detection API 1 so learning how to use Tensorflow Object Detection API v1 will also teach you how to use Tensorflow Object Detection API v2.

Now Let’s start with the code

Code For TF Object Detection Pipeline:

Download Source Code For This Tutorial

Download Source Code 

Make sure to download the source code, which also contains the support folder with some helper files that you will need.

Here’s the hierarchy of the source code folder:

Here’s a Description of what these folder & files are:

  • Custom_Object_Detection.ipynb: This is the main notebook which contains all the code.
  • Colab Notebook Link: This text file contains the link for the colab version of the notebook.
  • Create_tf_record.py: This file will create tf records from the images and labels.
  • fronzen_graph_inference.pb: This is the model we trained, you can try to run this on test images.
  • graph_ours.pbtxt: This is the graph file we generated for OpenCV, you’ll learn to generate your own.
  • tf_text_graph_faster_rcnn.py: This file creates the above graph.pbtxt file for OpenCV.
  • tf_text_graph_common.py: This is a helper file used by the faster_rcnnn.py file.
  • labels: These are .xml labels for each image
  • test_images: These are some sample test images to do inference on.

Note: There are some other folder and files which you will generate along the way, I will explain their use later.

Now Even though I make it really easy but still if you don’t want to worry about environment setup, installation, then you can use the colab version of the notebook that comes with the source code.

The Colab version doesn’t require any Configuration, It’s all set to go. Just run the cells in order. You should also be able to use the Colab GPU to speed up the training process.

The full code can be broken down into the following parts

  • Part 1: Environment Setup
  • Part 2: Installation & TFOD API Setup
  • Part 3: Data Collection & Annotation
  • Part 4: Downloading Model & Configuring it
  • Part 5: Training and Exporting Inference Graph.
  • Part 6: Generating .pbtxt and using the trained model with just OpenCV.

Part 1: Environment Setup:

First let’s Make sure you have correctly set up your environment.

Since we are going to install tensorflow version 1.15.0 so we should use a virtual environment, you can either install virtualenv or anaconda distribution.. I’m using Anaconda. I will start by creating virtual environment.

Open up the command prompt and do conda create --name tfod1 python==3.7

Tensorflow Object Detection API installation

Now you can move into that environment by activating it:

conda activate tfod1

Make sure there is a (tfod1) at the beginning of each line in your cmd. This means you’re using that environment. Now anything you install will be in that environment and won’t affect your base/root environment.

Tensorflow Object Detection API installation

The first thing You want to do install a jupyter notebook in that environment. Otherwise, your environment will use the jupyter notebook of the base environment, so do:

pip install jupyter notebook

Now you should go into the directory/folder which I provided you and contains this notebook and open up the command prompt.

First, activate the environment tfod1 environment and then launch the jupyter notebook by typing jupyter notebook and hit enter.

This will launch the jupyter notebook in your newly created environment. You can now Open up Custom_Object_Detection Notebook.

Make sure your Notebook is Opened up in the Correct environment


Part 2: Installation & Tensorflow Object Detection API Setup: 

You can install all the required libraries by running this cell

If you want to install Tensorflow-GPU for version 1 then you can take a look at my tutorial for that here

Note: You would need to change the Cuda Toolkit version and CuDNN version in the above tutorial, since you’ll be installing for TF version 1 instead of version 2. You can look up the exact version requirements here

Another Library you will need is pycocotools

Alternatively You can also use this command to install in windows:

pip install git+https://github.com/philferriere/cocoapi.git#egg=pycocotools^&subdirectory=PythonAPI

Alternatively you can also use this command to install in Linux and osx:

pip install pycocotools

Note: Make sure you have Cython installed first by doing: pip install Cython

Import Required Libraries

This will also confirm if your installations were successful or not.

This should be Version 1.15.0, DETECTED VERSION: 1.15.0

Clone Tensorflow Object Detection API Model Repository

You need to clone the TF Object Detection API repository, you can either download the zip file and extract it or if you have git installed then you can git clone it.

Option 1: Download with git:

You can run git clone if you have git installed, this is going to take a while, its 600 MB+, have a coffee or something.

Option 2: Download zip and extract all: (Only do this if you don’t have git)

You can download the zip by clicking here, after downloading make sure to extract the contents of this zip inside the directory containing this notebook. I’ve already provided you the code that automatically downloads and unzips the repo in this directory.

The models we’ll be using are in the research directory of the above repo. The research directory contains a collection of research model implementations in TensorFlow 1 or 2 by researchers. There are a total of 4 directories in the above repo, you can learn more about them here.

Install Tensorflow Object Deteciton API & Compile Protos

Download Protobuff Compiler:

TFOD contains some files .proto format, I’ll explain more about this format in a later step, for now you need to download the protobuf compiler from here, make sure to download the correct one based on your system. For e.g. I downloaded protoc-3.12.4-win64.zip for my 64 bit windows. For linux and osx there are different files.

After downloading unzip the proto folder, go to its bin directory, and copy the proto.exe file. Now paste this proto.exe inside the models/research directory.

The below script does all of this, but you can choose to do it manually if you want. Make sure to change the URL if you’re using a system other than 64-bit windows.

Now you can install the object detection API and compile the protos:
Below two operations must be performed in this directory, otherwise it won’t work, especially the proto command.

Note: Since I already had installed pycocotools so after running this line cp object_detection/packages/tf1/setup.py . I edited the setup.py file to get rid of pycocotools package inside the REQUIRED_PACKAGES list then I saved the setup.py file and ran the python -m pip install . command. I did this because I was facing issues installing pycocotools this way which is why I installed the pycocotools-windows package, you probably won’t need do this.

If you wanted to install TFOD API version 2 instead of version 1 then you can just replace tf1 with tf2 in the cp object_detection/packages/tf1/setup.py . command.

You can Check your installation of TFOD API by running model_builder_tf1_test.py

Part 3: Data Collection & Annotation:

Now for this tutorial I’m going to train a detector to detect the faces of Tom & Jerry. I didn’t wanted to use the common animal datasets etc. So I went with this.

While I was writing the above sentence I just realized I’m still using a Cat, mouse dataset albeit an animated one so I guess its still a unique dataset.

In this tutorial, I’m not only going to show you how to annotate the data but also show you one approach on how to go about collecting data for a new problem.

So What I’ll be doing is that I’m going to download a video of Tom & Jerry from Youtube and then split the frames of the video to create my dataset and then annotate each of those frames with bounding boxes. Now instead of downloading my Tom & Jerry video you can use any other video and try to detect your own classes.

Alternatively you can also generate training data from other methods including getting images from Google Images.

To prepare the Data we need to perform these 5 steps:

  • Step 1: Download Youtube Video.
  • Step 2: Split Video Frames and store it.
  • Step 3: Annotate Images with labelImg.
  • Step 4: Create a label Map file.
  • Step 5: Generate TFRecords.

Step 1: Download Youtube Video:

11,311,502.0 Bytes [100.00%] received. Rate: [7788 KB/s]. ETA: [0 secs]

For more options on how you can download the video take a look at the documentation here

Step 2: Split Video Frames and store it:

Now we’re going to split the video frames and store them in a folder. Since most videos have a high FPS (30-60 frames/sec) and we don’t exactly need this many frames for two reasons:

  1. If you take a 30 FPS video then for each second of the video you will get 30 images and most of those images won’t be different from each other, there will be a lot of repetition of information.
  2. We’re already going to use Transfer Learning with TFOD API, the benefit of this is that we won’t be needing a lot of images and this is good since we don’t want to annotate thousands of images.

So we can do two two things we can skip frames and save every nth frame or we can save a frame every nth second of the video. I’m going with the latter approach, although both are valid approaches.

Done Splitting Video, Total Images saved: 165

You can go to the directory where the images are saved and manually go through each image and delete the ones where Tom & Jerry are not visible or hardly visible. Although this is not a strict requirement since you can easily skip these images in the annotation step.

Step 3: Annotate Images with labelImg

You can watch this video below to understand how to use labelImg to annotate images and export annotations. You can also take a look at the github repo here.

For the current Tom & Jerry problem I am providing you with a labels folder which already contains the .xml annotation file for each image. If you want to try a different dataset then go ahead, make sure to put the labels of that dataset in the labels folder

Note: We are not splitting the images into train and validation folder right now because we’ll be doing that automatically at tfrecord creation step. Although it would still be a good idea to separate 10% of the data for proper testing/evaluation of the final trained detector, but since my purpose is to make this tutorial as simple as possible so I won’t be doing that today, I already have test folder with 4-5 images which I will evaluate on.

Step 4: Create a label Map file

TensorFlow requires a label map file, which maps each of the class labels to an integer values. This label map is used in training and detection process. This file should be saved in training directory which also contains the labels folder

Step 5: Generate TFrecords

What are TFrecords?

Tfrecords are just protocol buffers, they help make the data reading/processing process computationally efficient. The only downside they have is that they are not human readable.

What are protocol Buffers?

A protocol buffer is a type of serialized structured data. It is more efficient than JSON, XML, pickle, and text storage formats. Google created this Protobuf (protocol buffer) format in 2008 because of their efficiency, Since then they have been widely used by Google and the community. To read the protobuf files (.proto files) you will first need to compile them by a protobuf compiler. So now you probably understand why we needed to compile those proto files at the beginning.

Here’s a nice tutorial by Naveen that explains how you can create a tfrecord for different data types and Here’s a more detailed explanation of protocol buffers with an example.

The create_tf_record.py script I’ll be using to convert images/labels to tfrecords is taken from the TensorFlow’s pet example but I’ve modified the script so now it accepts the following 5 arguments:

  1. Directory of images
  2. Directory of labels
  3. % of Split of Training data
  4. Path to label_map.pbtxt file
  5. Path to output tfrecord files

And it returns a train.record and val.record. So it splits the training data into training/validation sets. For this data I’m using a training set of 70% and validation is 30%.

Done Writing, Saved: training\\tfrecords\train.record Done Writing, Saved: training\\tfrecords\val.record

You can ignore these warnings, we already know that we’re using an older 1.15 version of TFOD API which contains some depreciated functions.

Most of the tfrecord scripts available online will first tell you to convert your xml files to csv and then you will use another script to split the data into training and validation folder and then another script to convert to tfrecords. The Script above is doing all of this.

Part 4: Downloading Model & Configuring it:

You can now go to the Model Zoo, select a model, and download its zip. Now unzip the contents of that folder and put inside a directory named pretrained_model. The below script does this automatically for a Faster-RCNN-Inception model which is already trained on the COCO dataset. You can change the model name to download a different model.

Model Downloaded

Modify pipline.config file:

After downloading you will have a number of files present in the pretrained_model folder, I will explain about them later but for now, let’s take a look at the pipeline.config file.

Pipeline.config defines how the whole training process will take place, what optimizers, loss, learning_rate, batch_size will be used. Most of these params are already set by default, its up to you if you want to change them or not but there are some paths in the pipeline.config file that you will need to change so that this model can trained on our data.

So open up pipeline.config with a text editor like Notepad ++ and change these 4 paths:

  • Change: PATH_TO_BE_CONFIGURED/model.ckpt  to  pretrained_model/model.ckpt
  • Change: PATH_TO_BE_CONFIGURED/mscoco_train.record  to  training/tfrecords/train.record
  • Change: PATH_TO_BE_CONFIGURED/mscoco_val.record   to  training/tfrecords/val.record
  • Change: PATH_TO_BE_CONFIGURED/mscoco_label_map.pbtxt  to  training/label_map.pbtxt
  • Change: num_classes: 90  to  num_classes: 2

If you’re lazy like me then no prob, below script does all this

Notice the correction I did by replacing step: 0 with step: 1, unfortunately for different models sometimes there are some corrections required but you can easily understand what exactly needs to be changed by pasting the error generated during training on google. Click on github issues for that error and you’ll find a solution for that.

Note: These issues seems to be mostly present in TFOD API Version 1

Changing Important Params in Pipeline.config File:

Additionally I’ve also changed the batch size of the model, just like batch_size there are lots of important parameters that you would want to tune. I would strongly recommend that you try to change the values according to your problem. Almost always the default values are not optimal for your custom use case. I should tell you that to tune most of these values you need some prior knowledge, make sure to atleast change the batch_size according to your system’s memory and learning_rate of the model.

Part 5 Training and Exporting Inference Graph: 

You can start training the model by calling the model_main.py script from the Object_detection folder, we are giving it the following arguments.

  • num_train_steps: These are the number of times your model weights will be updated using a batch of data.
  • pipeline_config_path: This is the path to your pipeline.config file.
  • model_dir: Path to the output directory where the final checkpoint files will be saved.

Now you can run below cell to start training but I would recommend that you run this cell in the command line, you can just paste this line:

Note: When you start training you will see a lot of warnings, just ignore them as TFOD 1 contains a lot of depreciated functions.

Once you start training, the network will take some time to initialize and then the training will start, after every few minutes, you will see a report of loss values and a global loss. The Network is learning if the loss is going down. If you’re not familiar with the Object detection Jargon Like IOU etc, then just make note of the final global loss after each report.

You ideally want to set the num_train_steps to tens of thousands of steps, you can always end training by pressing CTRL + C on the command prompt if the loss has decreased sufficiently. If training is taking place in jupyter notebook then you can end it by pressing the Stop button on top.

After training has ended or you’ve stopped it, there would be some new files in the pre_trained folder. Among all these files we will only need the checkpoint (ckpt) files.

If you’re training for 1000s of steps (which is most likely the case) then I would strongly recommend that you don’t use your CPU but utilize a GPU. If you don’t have one then its best to use Google Colab’s GPU. I’m already providing you a ready to run colab Notebook.

Note: There’s another script for training called train.py, this is an older script where you can see the loss value for each step, if you want to use that sicpt then you can find it at models / research / object_detection / legacy / train.py

You can run this script by doing:

The best way to monitor training is to use Tensorboard, I will discuss about this another time

Export Frozen Inference Graph:

Now we will use the export_inference_graph.py script to create a frozen_inference_graph from the checkpoint files.

Why are we doing this?

After training our model it is stored in checkpoint format and a saved_model format but in OpenCV we need the model to be in a frozen_inference_graph format. So we need to generate the frozen_inference_graph using the checkpoint files.

What are these checkpoint files?

After Every few minutes of training, tensorflow outputs some checkpoint (ckpt) files. The number on those files represent how many train steps they have gone through. So during the frozen_inference_graph creation we only take the latest checkpoint file (i.e. the file with the highest number) because this is the one which has gone through the most training steps.

Now every time a checkpoint file is saved, its split into 3 parts.

For the initial step these files are:

  • model.ckpt-000.data: This file contains the value of each single variable, its pretty large.
  • model.ckpt-000.info: This file contains metadata for each tensor. e.g. checksum, auxiliary data etc.
  • model.ckpt-000.meta: This file stores the graph structure of the model

If you take a look at the fine_tuned_model folder wihch will be created after running the above command then you’ll find that it contains the same files you got when you downloaded the pre_trained model. This is the final folder.

Now Your trained model is in 3 different formats, the saved_model format, the frozen_inference_graph format and the checkpoint file format. For OpenCV we only need the frozen inference graph format.

The checkpoint format is ideal for retraining purposes and getting to know other sorts of information about the model, for production and serving the model you will need to use is either the frozen_inference_graph or saved_model format. Its worth mentioning that both these files contain the extension .pb

In TF 2, frozen_inference_graph is depreciated and TF 2 encourages to use the saved_model format, as said previously unfortunately we can’t use the saved_model format with OpenCV yet.

Run Inference on Trained Model (Bonus Step):

You can optionally choose to run inference using tensorflow sessions, I’m not going to explain much here as Tf sessions are depreciated and our final goal is to actually use this model in OpenCV DNN.

Tom and Jerry, Tensorflow Object Detection API

Part 6: Generating .pbtxt and using the trained model with just OpenCV 

6 a) Export Graph.pbxt with frozen inference graph:

We can use the above generated frozen graph inside the OpenCV DNN module to do detection but most of the time we need another file called a graph.pbtxt file. This file contains a description of the network architecture, it is required by OpenCV to rewire some network layers for Optimization purposes.

This graph.pbtxt can be generated by using one of the 4 scripts provided by OpenCV. These scripts are:

  • tf_text_graph_ssd.py
  • tf_text_graph_faster_rcnn.py
  • tf_text_graph_mask_rcnn.py
  • tf_text_graph_efficientdet.py

They can be downloaded here, you will also find more information regarding them on that page.

Now since the Detection architecture we’re using is Faster-RCNN (you can tell by looking at the name of the downloaded model) so we will use tf_text_graph_faster_rcnn.py to generate the pbtxt file. For .pbtxt generation you will need the frozen_inference_graph.pb file and the pipeline.config file.

Note: When you’re done with training then you will also see a graph.pbtxt file inside the pretrained folder, this graph.pbtxt is different from the one generated by OpenCV’s .pbtxt generator scripts. One major difference is that the OpenCV’s graph.pbtxt do not contain the model weights but only contains the graph description, so they will be much smaller in size.

Number of classes: 2
Scales: [0.25, 0.5, 1.0, 2.0] Aspect ratios: [0.5, 1.0, 2.0]
Width stride: 16.000000
Height stride: 16.000000
Features stride: 16.000000

For model architectures that are not one of the above 4, then for those, you will need to convert TensorFlow’s .pbtxt file to OpenCV’s version. You can find more on how to do that here. But we warned this conversion is not a smooth process and there are a lot of low-level issues that come up.

6 b) Using the Frozen inference graph along with Pbtxt file in OpenCV:

Now that we have generated the graph.pbtxt file with OpenCV’s tf_text_graph function we can pass this file to cv2.dnn.readNetFromTensorflow() to initialize the network. All of our work is done now Make sure you’re familiar with with OpenCV’s DNN module, if not you can read my previous post on it.

Now we will create following two functions:

Initialization Function: This function will intialize the network using the .pb and .pbtxt file, it will also set the class labels.

Main Function: This function will contain all the rest of the code from preprocessing to postprocessing, it will also have the option to either return the image or display it with matplotlib

This is our Main function, the comments will explain what’s going on

Note: When you do net.forward() you get an output of shape (1,1,100,7). Since we’re predicting on a single image instead of a batch of images so you will get (1,1) at the start now the remaining (100,7) means that there are 100 detections for that image and each image contains 7 properties/variables.

There will be 100 detections for each image, this was set in the pipeline.config, you can choose to change that.

So here are what these 7 properties correspond to:

  1. This is the index of image for a single image its 0
  2. This is the index of the target CLASS
  3. This is the score/confidence of that CLASS

Remaining 4 values are x1,y1,x2,y2. These are used to draw the bounding box of that CLASS object

  1. x1
  2. y1
  3. x2
  4. y2

Initialize the network

You will just need to call this once to initialize the network

Predict On Images

Now you can use the main funciton to predict on different images, The images we will predict on are placed inside a folder namded test_images. These images were not in the training dataset.

jerry img, Tensorflow Object Detection API
tom img, Tensorflow Object Detection API
Jerry 2 img, Tensorflow Object Detection API
Tom 2 img, Tensorflow Object Detection API
Tom and Jerry 2, Tensorflow Object Detection API

What’s Next?

computer vision

If you want to go forward from here and learn more advanced things and go into more detail, understand theory and code of different algorithms then be sure to check out our Computer Vision & Image Processing with Python Course (Urdu/Hindi). In this course, I go into a lot of detail regarding vision fundamentals and cover a plethora of algorithms and techniques to help you master Computer Vision.

If you want to start a career in Computer Vision & Artificial Intelligence then this course is for you. One of the best things about this course is that the video lectures are in Urdu/Hindi Language without any compromise on quality, so there is a personal/local touch to it.


Limitations: Our Final detector has a decent accuracy but it’s not that robust because of 4 reasons:

  1. Transfer Learning works best when the dataset you’re training on shares some features with the original dataset it was trained on, most of the models are trained on ImageNet, COCO, PASCAL VOC datasets. Which is filled with animals and other real-world images. Now our dataset is a dataset of Cartoon images, which is drastically different from real-world images. We can solve this problem by including more images and training more layers of the model.

  2. Animations of cartoon characters are not consistent, they change a lot in different movies. So if you train the model on these pictures and then try to detect random google images of tom and jerry then you won’t get good accuracy. We can solve this problem by including images of these characters from different movies so the model learns the features that are the same throughout the movies.

  3. The images generated from the sample video created an imbalanced dataset, There are more Jerry Images than Tom images, there are ways to handle this scenario but try to get a decent balance of images for both classes to get the best results.

  4. The annotation is poor, Yeah so the annotation I did was just for the sake of making this tutorial, in reality, you want to set a clear outline and standard about how you’ll be annotating, are you going to annotate the whole head, are ears included, is the neck part of it.. so you need answer all these questions ahead of time.

I will stress this again that if you’re not planning to use OpenCV for the final deployment then use TFOD API version 2, it’s a lot more cleaner. However, if the final objective is to use OpenCV at the end then you could get away with TF 2 but its a lot of trouble.

Even with TFOD API v1, you can’t be sure that your custom trained model will always be loaded in OpenCV correctly, there are times when you would need to manually edit the graph.pbtxt file so that you can use the model in OpenCV. If this happens and you’re sure you have done everything correctly then your best bet is to raise an issue here.

Hopefully, OpenCV will catch up and start supporting TF 2 saved_model format but its gonna take time. If you enjoyed this tutorial then please feel free to comment and I’ll gladly answer you.

(LearnOpenCV) Playing Rock, Paper, Scissors with AI

(LearnOpenCV) Playing Rock, Paper, Scissors with AI

Let’s play rock, paper scissors.

You think of your move and I’ll make mine below this line in 1…2…and 3.

I choose ROCK.

Well? …who won. It doesn’t matter cause you probably glanced at the word “ROCK” before thinking about a move or maybe you didn’t pay any heed to my feeble attempt at playing rock, paper, scissor with you in a blog post.

So why am I making some miserable attempts trying to play this game in text with you?

Let’s just say, a couple of months down the road in lockdown you just run out of fun ideas. To be honest I desperately need to socialize and do something fun. 

Ideally, I would love to play games with some good friends, …or just friends…or anyone who is willing to play.

Now I’m tired of video games. I want to go for something old fashioned, like something involving other intelligent beings, ideally a human. But because of the lockdown, we’re a bit short on those for close proximity activities. So what’s the next best thing?

AI of course. So yeah why not build an AI that would play with me whenever I want.

Now I don’t want to make a dumb AI bot that predicts randomly between rock, paper, and scissor, but rather I also don’t want to use any keyboard inputs or mouse. Just want to play the old fashioned way.

Building a Smart Intruder Detection System with OpenCV and your Phone

Building a Smart Intruder Detection System with OpenCV and your Phone

Did you know that you can actually stream a Live Video wirelessly from your phone’s camera to OpenCV’s cv2.VideoCapture() function in your PC and do all sorts of image processing on the spot like build an intruder detection system?

Cool huh?

In today’s post not only we will do just that but we will also build a robust Intruder Detection surveillance system on top of that, this will record video samples whenever someone enters your room and will also send you alert messages via Twilio API.

This post will serve as your building blocks for making a smart intruder detection system with computer vision. Although I’m making this tutorial for a home surveillance experiment, you can easily take this setup and swap the mobile camera with multiple IP Cams to create a much larger system.

Today’s tutorial can be split into 4 parts:

  1. Accessing the Live stream from your phone to OpenCV.
  2. Learning how to use the Twilio API to send Alert messages.
  3. Building a Motion Detector with Background Subtraction and Contour detection.
  4. Making the Final Application

So most of the people have used the cv2.videocapture() function to read from a webcam or a video recording from a disk but only a few people know how easy it is to stream a video from a URL, in most cases this URL is from an IP camera. 

By the way with cv2.VideoCapture() you can also read a sequence of images, so yeah a GIF can be read by this.

So let me list out all 4 ways to use VideoCapture() class depending upon what you pass inside the function.

1. Using Live camera feed: You pass in an integer number i.e. 0,1,2 etc e.g. cap = cv2.VideoCapture(0), now you will be able to use your webcam live stream. The number depends upon how many USB cams you attach and on which port.

2. Playing a saved Video on Disk: You pass in the path to the video file e.g. cap = cv2.VideoCapture(Path_To_video).

3. Live Streaming from URL using Ip camera or similar: You can stream from a URL e.g. cap = cv2.VideoCapture( protocol://host:port/video) Note: that each video stream or IP camera feed has its own URL scheme.  

4. Read a sequence of Images: You can also read sequences of images, e.g. GIF.

Part 1: Accessing the Live stream from your phone to OpenCV For The Intruder Detection System:

For those of you who have an Android phone can go ahead and install this IP Camera application from playstore. 

For people that want to try a different application or those of you who want to try on their iPhone I would say that although you can follow along with this tutorial by installing a similar IP camera application on your phones but one issue that you could face is that the URL Scheme for each application would be different so you would need to figure that out, some application makes it really simple like the one I’m showing you today. 

You can also use the same code I’m sharing here to work with an actual IP Camera, again the only difference will be the URL scheme, different IP Cameras have different URL schemes. For our IP Camera, the URL Scheme is: protocol://host:port/video

After installing the IP Camera application, open it and scroll all the way down and click start server.

After starting the server the application will start streaming the video to the highlighted URL:

If you paste this URL in the browser of your computer then you would see this:

Note: Your computer and mobile must be connected to the same Network

Click on the Browser or the Flash button and you’ll see a live stream of your video feed:

Below the live feed, you’ll see many options on how to stream your video, you can try changing these options and see effects take place in real-time.

Some important properties to focus on are the video Quality, FPS, and the resolution of the video. All these things determine the latency of the video. You can also change front/back cameras.

Try copying the image Address of the frame:

If you try pasting the address in a new tab then you will only see the video stream. So this is the address that will go inside the VideoCapture function.

Image Address:

So the URL scheme in our case is : protocol://host:port/video, where protocol is “http” ,  host is: “”  and port is: “8080”

All you have to do is paste the above address inside the VideoCapture function and you’re all set.

Download Code for this post

Download Code for this post

Here’s the Full Code:

As you can see I’m able to stream video from my phone.

Now there are some options you may want to consider, for e.g you may want to change the resolution, in my case I have set the resolution to be 640x480. Since I’m not using the web interface so I have used the app to set these settings.

There are also other useful settings that you may want to do, like settings up a password and a username so your stream is protected. Setting up a password would, of course, change the URL to something like:

cv2.VideoCapture( protocol://username:password@host:port/video)

I’ve also enabled background mode so even when I’m out of the app or my phone screen is closed the camera is recording secretly, now this is super stealth mode.

Finally here are some other URL Schemes to read this IP Camera stream, with these URLs you can even load audio and images from the stream:

  • http://19412.168.3.:8080/video is the MJPEG URL.
  • fetches the latest frame.
  • is the audio stream in Wav format.
  • is the audio stream in AAC format (if supported by hardware).

Part 2: Learning how to use the Twilio API to send Alert messages for the Intruder Detection System:

What is Twilio?

Twilio is an online service that allows us to programmatically make and receive phone calls, send and receive SMS, MMS and even Whatsapp messages, using its web  APIs.

Today we’ll just be using it to send an SMS, you won’t need to purchase anything since you get some free credits after you have signed up here.

So go ahead and sign up, after signing up go to the console interface and grab these two keys and your trial Number:


After getting these keys you would need to insert them in the credentials.txt file provided in the source code folder. You can download the folder from above.

Make sure to replace the INSERT_YOUR_ACCOUNT_SID with your ACCOUNT SID and also replace INSERT_YOUR_AUTH_TOKEN with your AUTH TOKEN.

There are also two other things you need to insert in the text file, this is your trail Number given to by the Twilio API and your personal number where you will receive the messages.

So replace PERSONAL_NUMBER with your number and TRIAL_NUMBER with the Twilio number, make sure to include the country code for your personal number. 

Note: in the trail account the personal number can’t be any random number but its verified number. After you have created the account you can add verified numbers here.

Now you’re ready to use the twilio api, you first have to install the API by doing:

pip install twilio

Now just run this code to send a message:

Check your phone you would have received a message. Later on we’ll properly fill up the body text.

Part 3: Building a Motion Detector with Background Subtraction and Contour detection:

Now in OpenCV, there are multiple ways to detect and track a moving object, but we’re going to go for a simple background subtraction method. 

What are Background Subtraction methods?

Basically these kinds of methods separate the background from the foreground in a video so for e.g. if a person walks in an empty room then the background subtraction algorithm would know there’s disturbance by subtracting the previously stored image of the room (without the person ) and the current image (with the person). 

So background subtraction can be used as effective motion detectors and even object counters like a people counter, how many people went in or out of a shop.

Now what I’ve described above is a very basic approach to background subtraction, In OpenCV, you would find a number of complex algorithms that use background subtraction to detect motion, In my Computer Vision & Image Processing Course I have talked about background subtraction in detail. I have taught how to construct your own custom background subtraction methods and how to use the built-in OpenCV ones. So make sure to check out the course if you want to study computer vision in depth.

For this tutorial, I will be using a Gaussian Mixture-based Background / Foreground Segmentation Algorithm. It is based on two papers by Z.Zivkovic, “Improved adaptive Gaussian mixture model for background subtraction” in 2004 and “Efficient Adaptive Density Estimation per Image Pixel for the Task of Background Subtraction” in 2006

Here’s the code to apply background subtraction:

The cv2.createBackgroundSubtractorMOG2() takes in 3 arguments:

detectsSadows: Now this algorithm will also be able to detect shadows, if we pass in detectShadows=True argument in the constructor.  The ability to detect and get rid of shadows will give us smooth and robust results. Enabling shadow detection slightly decreases speed.

history: This is the number of frames that is used to create the background model, increase this number if your target object often stops or pauses for a moment.

varThreshold: This threshold will help you filter out noise present in the frame, increase this number if there are lots of white spots in the frame. Although we will also use morphological operations like erosion to get rid of the noise.

Now after we have our background subtraction done then we can further refine the results by getting rid of the noise and enlarging our target object.

We can refine our results by using morphological operations like erosion and dilation. After we have cleaned our image then we can apply contour detection to detect those moving big white blobs  (people) and then draw bounding boxes over those blobs.

If you don’t know about Morphological Operations or Contour Detection then you should go over this Computer Vision Crash course post, I published a few weeks back.

So in summary 4 major steps are being performed above:

  • Step 1: We’re Extracting moving objects with Background Subtraction and getting rid of the shadows
  • Step 2: Applying morphological operations to improve the background subtraction mask
  • Step 3: Then we’re detecting Contours and making sure you’re not detecting noise by filtering small contours
  • Step 4: Finally we’re computing a bounding box over the max contour, drawing the box, and displaying the image.

Part 4: Creating the Final Intruder Detection System Application:

Finally, we will combine all the things above, we will also use the cv2.VideoWriter() class to save the images as a video in our disk. We will alert the user via Twilio API whenever there is someone in the room.

Here are the final results:

This is the function that detects if someone is present in the frame or not.

This function uses twilio to send messages.

Explanation of the Final Application Code:

The function is_person_present()  is called on each frame and it tells us if a person is present in the current frame or not, if it is then we append True to a deque list of length 15, now if the detection has occurred 15 times consecutively we then change the Room occupied status to True. The reason we don’t change the Occupied status to True on the first detection is to avoid our system being triggered by false positives. As soon as the room status is true the VideoWriter is initialized and the video starts recording.

Now when the person is not detected anymore then we wait for 7 seconds before turning the room status to False, this is because the person may disappear from view for a moment and then reappear or we may miss detecting the person for a few seconds. 

Now when the person disappears and the 7-second timer ends then we make the room status to False, we release the VideoWriter in order to save the video and then send an alert message via send_message() function to the user.

Also I have designed the code in a way that our patience timer (7 second timer) is not affected by False positives.

Here’s a high level explanation of the demo:

See how I have placed my mobile, while the screen is closed it’s actually recording and sending live feed to my PC.  No one would suspect that you have the perfect intruder detection system setup in the room.


Right now your IP Camera has a dynamic IP so you may be interested in learning how to make your device have a static IP address so you don’t have to change the address each time you launch your IP Camera.

Another limitation you have right now is that you can only use this setup when your device and your PC are connected to the same network/WIFI so you may want to learn how to get this setup to run globally.

Both of these issues can be solved by some configuration, All the instructions for that are in a manual which you can get by downloading the source code from above for the intruder detection system.

What’s Next?

computer vision

If you want to go forward from here and learn more advanced things and go into more detail, understand theory and code of different algorithms then be sure to check out our Computer Vision & Image Processing with Python Course (Urdu/Hindi). In this course, I go into a lot of detail regarding vision fundamentals and cover a plethora of algorithms and techniques to help you master Computer Vision.

The 3 month course contains:

✔ 125 Video Lectures
✔ Discussion Forums
✔ Quizzes
✔ 100+ High Quality Jupyter notebooks
✔ Practice Assignments
✔Certificate of Completion

If you want to start a career in Computer Vision & Artificial Intelligence then this course is for you. One of the best things about this course is that the video lectures are in Urdu/Hindi Language without any compromise on quality, so there is a personal/local touch to it.


In this tutorial you learned how to turn your phone into a smart IP Camera, you learned how to work with URL video feeds in general.

After that we went over how to create a background subtraction based motion detector. 

We also learned how to connect the twilio api to our system to enable alert messages. Right now we are sending alert messages every time there is motion so you may want to change this and make the api send you a single message each day containing a summary of all movements that happened in the room throughout the day.

Finally we created a complete application where we also saved the recording snippets of people moving about in the room.

This post was just a basic template for a surveillance system, you can actually take this and make more enhancements to it, for e.g. for each person coming in the room you can check with facial recognition if it’s actually an intruder or a family member. Similarly there are lots of other things you can do with this.

If you enjoyed this tutorial then I would love to hear your opinion on it, please feel free to comment and ask questions, I’ll gladly answer them.