Human Activity Recognition using TensorFlow (CNN + LSTM)

Human Activity Recognition using TensorFlow (CNN + LSTM)

Convolutional Neural Networks (CNN) are great for image data and Long-Short Term Memory (LSTM) networks are great when working with sequence data but when you combine both of them, you get the best of both worlds and you solve difficult computer vision problems like video classification.

In this tutorial, we’ll learn to implement human action recognition on videos using a Convolutional Neural Network combined with a Long-Short Term Memory Network. We’ll actually be using two different architectures and approaches in TensorFlow to do this. In the end, we’ll take the best-performing model and perform predictions with it on youtube videos.

Before I start with the code, let me cover some theories on video classification and different approaches that are available for it.

Image Classification

You may already be familiar with an image classification problem, where, you simply pass an image to the classifier (either a trained Deep Neural Network (CNN or an MLP) or a classical classifier) and get the class predictions out of it.

But what if you have a video? What will happen then? 

Before we talk about how to go about dealing with videos, let’s just dissect what videos are exactly.

But First What Exactly Videos are?

Well, so it’s no secret that a video is just a sequence of multiple still images (aka. frames) that are updated really fast creating the appearance of a motion. Consider the video (converted into .gif format) below of a cat jumping on a bookshelf, it is just a combination of 15 different still images that are being updated one after the other.

Now that we understand what videos are, let’s take a look at a number of approaches that we can use to do video classification.

Approach 1: Single-Frame Classification

The simplest and most basic way of classifying actions in a video can be using an image classifier on each frame of the video and classify action in each frame independently. So if we implement this approach for a video of a person doing a backflip, we will get the following results.

The classifier predicts Falling in some frames instead of Backflipping because this approach ignores the temporal relation of the frames sequence. And even if a person looks at those frames independently he may think that the person is Falling.

Now a simple way to get a final prediction for the video is to consider the most frequent one which can work in simple scenarios but is Falling in our case and is not correct. So another way to go about this is to take an average of the probabilities of predictions and get a more robust final prediction.

You should also check another Video Classification and Human Activity Recognition tutorial I had published a while back, in which I had discussed a number of other approaches too and implemented this one using a single-frame CNN with moving averages and it had worked fine for a relatively simpler problem. 

But as mentioned before, this approach is not effective, because it does not take into account the temporal aspect of the data.

Approach 2: Late Fusion

Another slightly different approach is late fusion, in which after performing predictions on each frame independently, the classification results are passed to a fusion layer that merges all the information and makes the prediction. This approach also leverages the temporal information of the data.

This approach does give decent results but is still not powerful enough. Now before moving to the next approach let’s discuss what Convolutional Neural Networks are. So that you get an idea of what that black box named image classifier was, that I was using in the images. 

Convolutional Neural Network (CNN)

A Convolutional Neural Network (CNN or ConvNet) is a type of deep neural network that is specifically designed to work with image data and excels when it comes to analyzing the images and making predictions on them.

It works with kernels (called filters) that go over the image and generates feature maps (that represent whether a certain feature is present at a location in the image or not) and initially it generates few feature maps and as we go deeper in the network the number of feature maps is increased and the size of maps is decreased using pooling operations without losing critical information.

Image Source

Each layer of a ConvNet learns features of increasing complexity which means, for example, the first layer may learn to detect edges and corners, while the last layer may learn to recognize humans in different postures.

Now let’s get back to discussing other approaches for video classification.

Approach 3: Early Fusion  

Another approach of video classification is early fusion, in which all the information is merged at the beginning of the network, unlike late fusion which merges the information in the end. This is a powerful approach but still has its own limitations.

Approach 4: Using 3D CNN’s (aka. Slow Fusion)

Another option is to use a 3D Convolutional Network, where the temporal and spatial information are merged slowly throughout the whole network that is why it’s called Slow Fusion. But a disadvantage of this approach is that it is computationally really expensive so it is pretty slow.

Approach 5: Using Pose Detection and LSTM

Another method is to use a pose detection network on the video to get the landmark coordinates of the person for each frame in the video. And then feed the landmarks to an LSTM Network to predict the activity of the person. 

There are already a lot of efficient pose detectors out there that can be used for this approach. But a disadvantage of using this approach is that you discard all the information other than the landmarks, like the environment information can be very useful, for example for playing football action category the stadium and uniform info can help the model a lot in predicting the action accurately.

Before going to the approach that we will implement in this tutorial, let’s briefly discuss what are Long Short Term Memory (LSTM) networks, as we will be using them in the approach.

Long Short Term Memory (LSTM) 

An LSTM network is specifically designed to work with a data sequence as it takes into consideration all of the previous inputs while generating an output. LSTMs are actually a type of neural network called Recurrent Neural Network, but RNNs are not known to be effective for dealing with the Long term dependencies in the input sequence because of a problem called the Vanishing gradient problem.

LSTMs were developed to overcome the vanishing gradient and so an LSTM cell can remember context for long input sequences.

Many-to-one LSTM network

This makes an LSTM more capable of solving problems involving sequential data such as time series prediction, speech recognition, language translation, or music composition. But for now, we will only explore the role of LSTMs in developing better action recognition models.

Now let’s move on towards the approach we will implement in this tutorial to build an Action Recognizer. We will use a Convolution Neural Network (CNN) + Long Short Term Memory (LSTM) Network to perform Action Recognition while utilizing the Spatial-temporal aspect of the videos.

Approach 6: CNN + LSTM

We will be using a CNN to extract spatial features at a given time step in the input sequence (video) and then an LSTM to identify temporal relations between frames.

The two architectures that we will be using to use CNN along with LSTM are:

  1. ConvLSTM 
  2. LRCN

Both of these approaches can be used using TensorFlow. This tutorial also has a video version as well, that you can go and watch for a more detailed overview of the code.

Now let’s jump into the code.

Download The Files


  • Step 1: Download and Visualize the Data with its Labels
  • Step 2: Preprocess the Dataset
  • Step 3: Split the Data into Train and Test Set
  • Step 4: Implement the ConvLSTM Approach
    • Step 4.1: Construct the Model
    • Step 4.2: Compile & Train the Model
    • Step 4.3: Plot Model’s Loss & Accuracy Curves
  • Step 5: implement the LRCN Approach
    • Step 5.1: Construct the Model
    • Step 5.2: Compile & Train the Model
    • Step 5.3: Plot Model’s Loss & Accuracy Curves
  • Step 6: Test the Best Performing Model on YouTube videos

Alright, so without further ado, let’s get started.

Import the Libraries

We will start by installing and importing the required libraries.

And will set NumpyPython, and Tensorflow seeds to get consistent results on every execution.

Step 1: Download and Visualize the Data with its Labels

In the first step, we will download and visualize the data along with labels to get an idea about what we will be dealing with. We will be using the UCF50 – Action Recognition Dataset, consisting of realistic videos taken from youtube which differentiates this data set from most of the other available action recognition data sets as they are not realistic and are staged by actors. The Dataset contains:

  • 50 Action Categories
  • 25 Groups of Videos per Action Category
  • 133 Average Videos per Action Category
  • 199 Average Number of Frames per Video
  • 320 Average Frames Width per Video
  • 240 Average Frames Height per Video
  • 26 Average Frames Per Seconds per Video

Let’s download and extract the dataset.

For visualization, we will pick 20 random categories from the dataset and a random video from each selected category and will visualize the first frame of the selected videos with their associated labels written. This way we’ll be able to visualize a subset ( 20 random videos ) of the dataset.CodeText

Step 2: Preprocess the Dataset

Next, we will perform some preprocessing on the dataset. First, we will read the video files from the dataset and resize the frames of the videos to a fixed width and height, to reduce the computations and normalized the data to range [0-1] by dividing the pixel values with 255, which makes convergence faster while training the network.

But first, let’s initialize some constants.

Note: The IMAGE_HEIGHTIMAGE_WIDTH and SEQUENCE_LENGTH constants can be increased for better results, although increasing the sequence length is only effective to a certain point, and increasing the values will result in the process being more computationally expensive.

Create a Function to Extract, Resize & Normalize Frames

We will create a function frames_extraction() that will create a list containing the resized and normalized frames of a video whose path is passed to it as an argument. The function will read the video file frame by frame, although not all frames are added to the list as we will only need an evenly distributed sequence length of frames.

Create a Function for Dataset Creation

Now we will create a function create_dataset() that will iterate through all the classes specified in the CLASSES_LIST constant and will call the function frame_extraction() on every video file of the selected classes and return the frames (features), class index ( labels), and video file path (video_files_paths).

Now we will utilize the function create_dataset() created above to extract the data of the selected classes and create the required dataset.

Extracting Data of Class: WalkingWithDog

Extracting Data of Class: TaiChi

Extracting Data of Class: Swing

Extracting Data of Class: HorseRace

Now we will convert labels (class indexes) into one-hot encoded vectors.

Step 3: Split the Data into Train and Test Set

As of now, we have the required features (a NumPy array containing all the extracted frames of the videos) and one_hot_encoded_labels (also a Numpy array containing all class labels in one hot encoded format). So now, we will split our data to create training and testing sets. We will also shuffle the dataset before the split to avoid any bias and get splits representing the overall distribution of the data.

Step 4: Implement the ConvLSTM Approach

In this step, we will implement the first approach by using a combination of ConvLSTM cells. A ConvLSTM cell is a variant of an LSTM network that contains convolutions operations in the network. it is an LSTM with convolution embedded in the architecture, which makes it capable of identifying spatial features of the data while keeping into account the temporal relation.

For video classification, this approach effectively captures the spatial relation in the individual frames and the temporal relation across the different frames. As a result of this convolution structure, the ConvLSTM is capable of taking in 3-dimensional input (width, height, num_of_channels) whereas a simple LSTM only takes in 1-dimensional input hence an LSTM is incompatible for modeling Spatio-temporal data on its own.

You can read the paper Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting by Xingjian Shi (NIPS 2015), to learn more about this architecture.

Step 4.1: Construct the Model

To construct the model, we will use Keras ConvLSTM2D recurrent layers. The ConvLSTM2D layer also takes in the number of filters and kernel size required for applying the convolutional operations. The output of the layers is flattened in the end and is fed to the Dense layer with softmax activation which outputs the probability of each action category.

We will also use MaxPooling3D layers to reduce the dimensions of the frames and avoid unnecessary computations and Dropout layers to prevent overfitting the model on the data. The architecture is a simple one and has a small number of trainable parameters. This is because we are only dealing with a small subset of the dataset which does not require a large-scale model.

Now we will utilize the function create_convlstm_model() created above, to construct the required convlstm model.

Check Model’s Structure:

Now we will use the plot_model() function, to check the structure of the constructed model, this is helpful while constructing a complex network and making that the network is created correctly.

Step 4.2: Compile & Train the Model

Next, we will add an early stopping callback to prevent overfitting and start the training after compiling the model.


Evaluate the Trained Model

After training, we will evaluate the model on the test set.

4/4 [==============================] – 14s 3s/step – loss: 0.8976 – accuracy: 0.8033

Save the Model

Now we will save the model to avoid training it from scratch every time we need the model.

Step 4.3: Plot Model’s Loss & Accuracy Curves

Now we will create a function plot_metric() to visualize the training and validation metrics. We already have separate metrics from our training and validation steps so now we just have to visualize them.

Now we will utilize the function plot_metric() created above, to visualize and understand the metrics.

Step 5: Implement the LRCN Approach

In this step, we will implement the LRCN Approach by combining Convolution and LSTM layers in a single model. Another similar approach can be to use a CNN model and LSTM model trained separately. The CNN model can be used to extract spatial features from the frames in the video, and for this purpose, a pre-trained model can be used, that can be fine-tuned for the problem. And the LSTM model can then use the features extracted by CNN, to predict the action being performed in the video.

But here, we will implement another approach known as the Long-term Recurrent Convolutional Network (LRCN), which combines CNN and LSTM layers in a single model. The Convolutional layers are used for spatial feature extraction from the frames, and the extracted spatial features are fed to LSTM layer(s) at each time-steps for temporal sequence modeling. This way the network learns spatiotemporal features directly in an end-to-end training, resulting in a robust model.

You can read the paper Long-term Recurrent Convolutional Networks for Visual Recognition and Description by Jeff Donahue (CVPR 2015), to learn more about this architecture.
We will also use TimeDistributed wrapper layer, which allows applying the same layer to every frame of the video independently. So it makes a layer (around which it is wrapped) capable of taking input of shape (no_of_frames, width, height, num_of_channels) if originally the layer’s input shape was (width, height, num_of_channels) which is very beneficial as it allows to input the whole video into the model in a single shot.

Step 5.1: Construct the Model

To implement our LRCN architecture, we will use time-distributed Conv2D layers which will be followed by MaxPooling2D and Dropout layers. The feature extracted from the Conv2D layers will be then flattened using the Flatten layer and will be fed to a LSTM layer. The Dense layer with softmax activation will then use the output from the LSTM layer to predict the action being performed.

Now we will utilize the function create_LRCN_model() created above to construct the required LRCN model.

Check Model’s Structure:

Now we will use the plot_model() function to check the structure of the constructed LRCN model. As we had checked for the previous model.

Step 5.2: Compile & Train the Model

After checking the structure, we will compile and start training the model.

Evaluating the trained Model

As done for the previous one, we will evaluate the LRCN model on the test set.

4/4 [==============================] – 2s 418ms/step – loss: 0.2242 – accuracy: 0.9262

Save the Model

After that, we will save the model for future uses using the same technique we had used for the previous model.

Step 5.3: Plot Model’s Loss & Accuracy Curves

Now we will utilize the function plot_metric() we had created above to visualize the training and validation metrics of this model.

Step 6: Test the Best Performing Model on YouTube videos

From the results, it seems that the LRCN model performed significantly well for a small number of classes. so in this step, we will put the LRCN model to test on some youtube videos.

Create a Function to Download YouTube Videos:

We will create a function download_youtube_videos() to download the YouTube videos first using pafy library. The library only requires a URL to a video to download it along with its associated metadata like the title of the video.

Download a Test Video:

Now we will utilize the function download_youtube_videos() created above to download a youtube video on which the LRCN model will be tested.

Create a Function To Perform Action Recognition on Videos

Next, we will create a function predict_on_video() that will simply read a video frame by frame from the path passed in as an argument and will perform action recognition on video and save the results.

Perform Action Recognition on the Test Video

Now we will utilize the function predict_on_video() created above to perform action recognition on the test video we had downloaded using the function download_youtube_videos() and display the output video with the predicted action overlayed on it.

100%|██████████| 867/867 [00:02<00:00, 306.08it/s]

Create a Function To Perform a Single Prediction on Videos

Now let’s create a function that will perform a single prediction for the complete videos. We will extract evenly distributed N (SEQUENCE_LENGTH) frames from the entire video and pass them to the LRCN model. This approach is really useful when you are working with videos containing only one activity as it saves unnecessary computations and time in that scenario.

Perform Single Prediction on a Test Video

Now we will utilize the function predict_single_action() created above to perform a single prediction on a complete youtube test video that we will download using the function download_youtube_videos(), we had created above.

Action Predicted: TaiChi

Confidence: 0.94

Join My Upcoming Computer Vision For Building Cutting Edge Applications Course

A Course that goes beyond basic applications and teaches you how to create some next-level apps that utilize physics, deep learning (LSTM + CNN) + classical image processing, hand and body gestures to do a variety of very interesting things.


In this tutorial, we discussed a number of approaches to perform video classification and learned about the importance of the temporal aspect of data to gain higher accuracy in video classification and implemented two CNN + LSTM architectures in TensorFlow to perform Human Action Recognition on videos by utilizing the temporal as well as spatial information of the data. 

We also learned to perform pre-processing on videos using the OpenCV library to create an image dataset, we also looked into getting youtube videos using just their URLs with the help of the Pafy library for testing our model.

Now let’s discuss a limitation in our application that you should know about, our action recognizer cannot work on multiple people performing different activities. There should be only one person in the frame to correctly recognize the activity of that person by our action recognizer, this is because the data was in this manner, on which we had trained our model. 

You can use some different dataset that has been annotated for more than one person’s activity and also provides the bounding box coordinates of the person along with the activity he is performing, to overcome this limitation.

Or a hacky way is to crop out each person and perform activity recognition separately on each person but this will be computationally very expensive.

That is all for this lesson, if you enjoyed this lesson, let me know in the comments and you can also support me and the Bleed AI team on Patreon here.
If you need 1 on 1 Coaching in AI/computer vision regarding your project or your career then you reach out to me personally here.

Making a Career in Computer Vision for FREE

Making a Career in Computer Vision for FREE

I have released a free 10-day email course on making a career in computer vision, you can join the course here.

Who am I?

Alright before I start this post, you might be wondering who am I to teach you, or what merit do I have to give advice regarding career options in AI.  So here’s a quick intro of me:

I’m Taha Anwar. An Applied  Computer Vision scientist, besides running Bleed AI, I’ve also worked for the official OpenCV.org team and led teams to develop high-end technical content for their premium courses.

I’ve also published a number of technical tutorials (blog posts, videos) applications at Bleed AI and at LearnOpenCV, given talks at prominent universities and international events on computer vision. Also published a useful computer vision python module at  PyPI. This year I’ve also started a youtube channel to reach more people.

Why Create this Course?

So I’ve been working in this field for a number of years and during my time I’ve taught and helped a lot of people from University Grads to engineers and researchers. So I created this course in order to help people interested in computer vision reach their desired outcomes, whether it’s landing a job, becoming a researcher, building projects as a hobby, or whatever it might be, this course will help you get that and it’ll show you an ideal path from start to finish to master the computer vision career roadmap.

It doesn’t matter what your background level is, the course is designed to cover an audience of all experience levels.

 Here’s what you’ll learn inside this FREE course each day. 

  • Day 1 | The Ideal Learning Technique: On the first day, you will learn about the best approach from the two main approaches ( Top-Down, and Bottom-Up ) to learn and master computer vision easily and efficiently.
  • Day 2 | Building the Required Background: On the second day, I’ll show you exactly how you can build the Mathematical & Computer Science background for computer vision and share some short high-quality, and really easy to go through free courses to help you learn the prerequisites.
  • Day 3 | Learning High-Level Artificial Intelligence: From day 3, the exciting stuff will start as you will dive into learning AI. I will share some high-level resources about a broad overview of the field and will tell you why it is important before getting involved in specifics.
  • Day 4 | Learning Image Processing and Classical Computer Vision: On the fourth day, I will share some personally evaluated high-quality resources on Image Processing, and Classical Computer Vision and will explain why these techniques should be learned first before jumping into Deep learning.
  • Day 5 | Learning the Theory behind AI/ML & Start Building Models: On the fifth day, finally, It will be time to go deeper and learn the theoretical foundations behind AI/ML algorithms and also start training algorithms using a high-level library to kill the dryness. I will share the right resources to help you get through.
  • Day 6 | Learning Deep Learning Theory &  Start Building DL Models: On the sixth day, we will step up the game and get into deep learning with a solid plan on how and what to learn.
  • Day 7 | Learning Model Deployment & Start Building Computer Vision Projects: On the seventh day, we will move towards productionizing models and I’ll also discuss how to start working on your own computer vision projects to build up your portfolio.
  • Day 8 | Learning to Read Computer Vision Papers: On the eighth day, I will share the best tips, suggestions, and practices to get comfortable with reading the papers and will also discuss its significance over other resources.
  • Day 9 | Picking a Path; Research, Development, or Domain Expertise: On the ninth day, I’ll show you the final step you need in your journey in order to implement your knowledge, move forward in career and start making money.
  • Day 10 | FINAL Lesson, the Journey ENDS with Bonus Tips: On the last day, I will  guide you further and provide you with some bonus tips to keep in mind. I will also share a few final learning resources.

Here’s a video summarizing the entire Course.

5 Easy & Effective Face Detection Algorithms in Python

5 Easy & Effective Face Detection Algorithms in Python

Watch Video Here

In this post, you’ll learn in-depth about the five of the most easiest and effective face detection options available in python, along with the pros and cons of each one of them. You will become capable of obtaining the required balance in accuracy, speed, and efficiency in any given scenario.

The face detection methods we will be covering are:

Face Detection is one of the most common and simplest vision techniques out there, as the name implies, it detects (i.e., locates) the faces in the images and is the first and essential step for almost every face application like Face Recognition, Facial Landmarks Detection, Face Gesture Recognition, and Augmented Reality (AR) Filters, etc.

Other than these, one of its most common applications, that you must have used, is your mobile camera which detects your face and adjusts the camera focus automatically in real-time.

Also, for what it’s worth Tony Stark’s EDITH (Even Dead I’m The Hero) glasses, inherited by Peter Parker in the Spider-Man Far From Home movie, also uses Face Detection as an initial step to perform its functionalities. Cool 😊 … right?  

Yeah I know .. I know, I needed to add a marvel reference into it, the whole post get’s cooler.

Face detection also serves as a ground for a lot of exciting face applications for e.g.  You can even appoint Mr. Beans as the President 😂 using Deepfake.

But for now, let’s just go back to Face Detection.

The idea behind face detection is to make the computer capable of identifying what human face exactly is and detecting the features that are associated with the faces in images/videos which might not always be easy because of changing facial expression, orientation, lighting conditions, and occlusions due to face masks, glasses, etc.

But with enough training data covering all the possible scenarios, you can create a very robust face detector.

And people throughout the years have done just that, they have designed various algorithms for facial detection and in this post, we’ll explore 5 such algorithms.

As this is the most common and widely used technique, there are a lot of face detectors out there.

But which Algorithm is the best?

If you’re looking for a single solution then it’s a hard answer as each of the algorithms that we’re going to cover has its own pros and cons, take a look at the demos at the end for some comparison, and make sure to read the summary for the final verdict.

This tutorial also has a video version, that you can go and watch, the video version covers more details, regarding how each approach works.

Alright, so without further ado, let’s dive in.

Import the Libraries

We will first import the required libraries.

Algorithm 1: OpenCV Haar Cascade Face Detection

This face detector was introduced in 2001 and remained the state-of-the-art face detection algorithm for many years. Other than just this face detector, OpenCV provides some other detectors (like eye, and smile, etc) too, which use the same haar cascade technique.

Load the OpenCV Haar Cascade Face Detector

To perform the face detection using this algorithm, first, we will have to load the pre-trained Haar cascade face detection model around 900 KBs from the disk, stored in a .xml file format, using the function CascadeClassifier().

Create a Haar Cascade Face Detection Function

Now we will create a function haarCascadeDetectFaces() that will perform haar cascade face detection using the function cv2.CascadeClassifier.detectMultiScale() on an image/frame and will visualize the resultant image along with the original image (when working with images) or return the resultant image along with the output of the model (when working with videos) depending upon the passed arguments.

Function Syntax:

results = cv2.CascadeClassifier.detectMultiScale(image, scaleFactor, minNeighbors, minSize, maxSize)


  • image – It is the input grayscale image containing the faces.
  • scaleFactor (optional) – It is the image size that is reduced at each image scale. Its default value is 1.1 which means a decrease of 10%.
  • minNeighbors (optional) – It is the number of minimum neighbors each predicted face should have, to retain. Otherwise, the prediction is ignored. Its default value is 3.
  • minSize (optional) – It is the minimum possible face size, the faces smaller than that size are ignored.
  • maxSize (optional) – It is the maximum possible face size, the faces larger than that are ignored. If maxSize == minSize then only the faces of a particular size are detected.


  • results – It is an array of bounding boxes coordinates (i.e., x1, y1, bbox_width, bbox_height) where each bounding box encloses the detected face, the boxes may be partially outside the original image.

Note: When the value of the minNeighbors parameter is decreased, false positives are increased, and when the value of scaleFactor is decreased the large faces in the image become smaller and detectable by the algorithm at the cost of speed.

So the algorithm can detection very large and very small faces too by appropriately utilizing the scaleFactor argument.

Now we will utilize the function haarCascadeDetectFaces() created above to perform face detection on a few sample images and display the results.

The time taken by the algorithm to perform detection is pretty impressive, so yeah, it can work in real-time on a CPU.

A major drawback of this algorithm is that it does not work on non-frontal and occluded faces.

And also gives a lot of false positives But that can be controlled by increasing the value of the minNeighbors argument in the function cv2.CascadeClassifier.detectMultiScale().

Algorithm 2: Dlib HoG Face Detection

This face detector is based on HoG (Histogram of Oriented Gradients), and SVM (Support Vector Machine) and is significantly more accurate than the previous one. The technique used in this one is not invariant to changes in face angle, so it uses five different HOG filters that are for:

  1. Frontal face
  2. Right side turned face
  3. Left side turned face
  4. Frontal face but rotated right
  5. Frontal face but rotated left

So it can work on slightly non-frontal and rotated faces as well.

Load the Dlib HoG Face Detector

Now we will use the dlib.get_frontal_face_detector() function to load the pre-trained HoG face detector and we will not need to pass the path of the model file for this one as the model is included in the dlib library.

Create a HoG Face Detection Function

Now we will create a function hogDetectFaces() that will perform HoG face detection by inputting the image/frame into the loaded hog_face_detector and will visualize the resultant image along with the original image or return the resultant image along with the output of HoG face detector depending upon the passed arguments.

Function Syntax:

results = hog_face_detector(image, upsample)


  • image – It is the input image containing the faces in RGB format.
  • upsample (optional) – It is the number of times to upsample an image before performing face detection.


  • results – It is an array of rectangle objects containing the (x, y) coordinates of the corners of the bounding boxes enclosing the faces in the input image.

Note: The model is trained to detect a minimum face size of 80×80, so to detect small faces in the images, you will have to upsample the images that increase the resolution of the input images, thus increases the face size at the cost of computation speed of the detection process.

Now we will utilize the function hogDetectFaces() created above to perform HoG face detection on a few sample images and display the results.

So this too can work in real-time on a CPU. You can also resize the images before passing them to the model, as the smaller the images are, the faster the detection process will be. But this also increases the probability of faces smaller than 80×80 in the images.

As you can see, it works on slightly rotated faces but will fail on extremely rotated and non-frontal ones and the bounding box often excludes some parts of the face like the chin and forehead.

And also works on small occlusions but will fail on massive ones.

As mentioned above, it cannot detect faces smaller than 80x80. Now, if you want, you can increase the upsample argument value of the loaded hog_face_detector in the function hogDetectFaces() created above, to detect the face in the image above, but that will also tremendously increase the time taken by the face detection process.

Algorithm 3: OpenCV Deep Learning based Face Detection

This one is based on a deep learning approach and uses ResNet-10 Architecture to detect multiple faces in a single pass (Single Shot Detector SSD) of the image through the network (model). It has been included in OpenCV since August 2017, with the official release of version 3.3, still, it is not as popular as the OpenCV Haar Cascade Face Detector but surely is highly more accurate.

Load the OpenCV Deep Learning based Face Detector

Now to load the face detector, OpenCV provides us with two options, one of them is in the Caffe framework’s format and takes around 5.10 MBs in memory and the other one is in the TensorFlow framework’s format and acquires only 2.7 MBs in memory.

To load the first one from the disk, we can use the cv2.dnn.readNetFromCaffe() function and to load the other one we will have to use the cv2.dnn.readNetFromTensorflow() function with appropriate arguments.

Create an OpenCV Deep Learning based Face Detection Function

Now we will create a function cvDnnDetectFaces() that will perform Deep Learning-based face detection using OpenCV. First, we will pre-process the image/frame using the cv2.dnn.blobFromImage() function and then we will set the pre-processed image as an input to the network by using the function opencv_dnn_model.setInput().

And after that, pass the input image into the network by using the opencv_dnn_model.forward() function to get an array containing the bounding boxes coordinates normalized to ([0.0, 1.0]) and the detection confidence of each faces in the image.

After performing the detection, the function will also visualize the resultant image along with the original image or return the resultant image along with the output of the dnn face detector depending upon the passed arguments.

Note: Higher the face detection confidence score is, the more certain the model is about the detection.

Now we will utilize the function cvDnnDetectFaces() created above to perform OpenCV deep learning-based face detection on a few sample images and display the results.

So it is highly more accurate than both of the above and works great even under massive occlusions and on non-frontal faces. And the reason for its significantly higher speed is that it can detect faces across various scales, allowing us to resize the images to a smaller size which decreases computations.

Also, the bounding box encloses the whole face, unlike the HoG Face Detector, making it easier to crop regions of interest (i.e., faces) from the images.CodeText

Also, the bounding box encloses the whole face, unlike the HoG Face Detector, making it easier to crop regions of interest (i.e., faces) from the images.CodeText

So even the faces with masks are detectable with this one.

Algorithm 4: Dlib Deep Learning based Face Detection

This detector is also based on a Deep learning (Convolution Neural Network) approach and uses Maximum-Margin Object Detection (MMOD) method to detect faces in images. This one is also trained for a minimum face size of 80×80 and provides the option of upsampling the images. This one is very slow on a CPU but can be used on an NVIDIA GPU and outperforms the other detectors in speed on the GPU.

Load the Dlib Deep Learning based Face Detector

Now first, we will use the dlib.cnn_face_detection_model_v1() function to load the pre-trained maximum-margin cnn face detector around 700 KBs from the disk, stored in a .dat file format.

Create a Dlib Deep Learning based Face Detection Function

Now we will create a function dlibDnnDetectFaces() in which we will perform deep Learning-based face detection using dlib by inputting the image/frame and the number of times to upsample the image to the loaded cnn_face_detector as we had done for the HoG face detection.

The only difference is that we are loading a different model, and it will return a list of objects, where each object will be a wrapper around a rectangle object (containing the bounding box coordinates) and a detection confidence score. As our every other function, this one will also visualize the results or return them depending upon the passed arguments.

Now we will utilize the function dlibDnnDetectFaces() created above to perform dlib deep learning-based face detection on a few sample images and display the results.

Interesting! this one is also far more accurate and robust than the first two and is also capable of detecting faces under occlusion. But as you can see, the time taken by the detection process is very high, so this detector cannot work in real-time on a CPU.

Also, the varying face orientations and lighting do not stop it from detecting faces accurately.

Similar to the HoG face detector, the bounding box for this one is also small and does not enclose the whole face.

Algorithm 5: Mediapipe Deep Learning based Face Detection

The last one is also based on Deep learning approach and uses BlazeFace that is a very lightweight and highly accurate face detector inspired and modified from Single Shot MultiBox Detector (SSD) & MobileNetv2. The detector provided by Mediapipe is capable of running at a speed of 200-1000+ FPS on flagship devices.

Load the Mediapipe Face Detector

To load the model, we first have to initialize the face detection class using the mp.solutions.face_detection syntax and then we will have to call the function mp.solutions.face_detection.FaceDetection() with the arguments explained below:

  • model_selection – It is an integer index ( i.e., 0 or 1 ). When set to 0, a short-range model is selected that works best for faces within 2 meters from the camera, and when set to 1, a full-range model is selected that works best for faces within 5 meters. Its default value is 0.
  • min_detection_confidence – It is the minimum detection confidence between ([0.0, 1.0]) required to consider the face-detection model’s prediction successful. Its default value is 0.5 ( i.e., 50% ) which means that all the detections with prediction confidence less than 0.5 are ignored by default.

We will also have to initialize the mp.solutions.drawing_utils class which is used to visualize the detection results on the images/frames.

Create a Mediapipe Deep Learning based Face Detection Function

Now we will create a function mpDnnDetectFaces() in which we will use the mediapipe face detector to perform the detection on an image/frame by passing it into the loaded model by using the function mp_face_detector.process() and get a list of a bounding box and six key points for each face in the image. The six key points are on the:

  1. Right Eye
  2. Left Eye
  3. Nose Tip
  4. Mouth Center
  5. Right Ear Tragion
  6. Left Ear Tragion

The bounding boxes are composed of xmin and width (both normalized to [0.0, 1.0] by the image width) and ymin and height (both normalized to [0.0, 1.0] by the image height). Each key point is composed of x and y, which are normalized to [0.0, 1.0] by the image width and height respectively. The function will work on images and videos as well as this one will also display or return the results depending upon passed arguments.

Now we will utilize the function mpDnnDetectFaces() created above to perform face detection using Mediapipe’s detector on a few sample images and display the results.

You can get an idea of its super-realtime performance from the time taken by the detection process. After all, this is what differentiates this detector from all the others.

It can detect the non-frontal and occluded faces but fails to accurately detect the key points in such scenerios.

The size of the bounding box returned by this detector is also quite appropriate.

By using the short-range model, one can easily ignore the faces in the background, which is normally required in most of the applications out there, like face gesture recognition.

Face Detection on Real-Time Webcam Feed

We have compared the face detection algorithms on the images and discussed the pros and cons of each of them, but now the real test begins, as we will test the algorithms on a real-time webcam feed. First, we will select the algorithm we want to use as one of them will be used at a time. We have designed the code below to switch between different face detection algorithms in real-time, by pressing the key s.

We will utilize the functions created above to perform face detection on the real-time webcam feed using the selected algorithm and will also calculate and display the number of frames being updated in one second to get an idea of whether the algorithms can work in real-time on a CPU or not.

Output Video

As expected! all of them can work in real-time on a CPU except for the Dlib Deep Learning-based Face Detector.

Join My Upcoming Computer Vision For Building Cutting Edge Applications Course

A Course that goes beyond basic applications and teaches you how to create some next-level apps that utilize physics, deep learning (LSTM + CNN) + classical image processing, hand and body gestures to do a variety of very interesting things.


In this tutorial, you have learned about the five most popular and effective face detectors along with the best tips, and suggestions. You have become capable of acquiring the required balance in accuracy, speed, and efficiency in any given scenario. Now to summarize; 

If you have a low-end device or an embedded device like the Raspberry Pi and are expecting faces under substantial occlusion and with various sizes, orientations, and angles then I will recommend you to go for the Mediapipe Face Detector, as it is the fastest one and also pretty accurate. In fact, this one has the best trade-off between speed and accuracy and also gives a few facial landmarks (key points).

Otherwise, if you have some environmental restrictions and cannot use the Mediapipe face detector, then the next best option will be OpenCV DNN Face Detector as this one is also pretty accurate but has higher latency.

For applications in which the face size can be controlled (> 80×80), and you want to skip the people (small faces) that are far away from the camera, the Dlib  HoG Face Detector can be used but surely is not the best option and for flag-ship devices with NVIDIA GPU in the same scenario, Dlib DNN Face Detector can be a good alternative to the HoG Face Detector, but try to use it on a CPU.

And If you are only working with frontal faces and want to skip all the non-frontal and rotated faces, then the Haar Cascade detector can be an option but remember you will have to manually tune the parameters to get rid of false positives.

So generally, you should just go with the Mediapipe Face Detector for super real-time speed and high accuracy.

That is all for this lesson, if you enjoyed this lesson, let me know in the comments and you can also support me and the Bleed AI team on Patreon here.

If you need 1 on 1 Coaching in AI/computer vision regarding your project or your career then you reach out to me personally here.

Further Resources

Bleed Face DetectorIt is a python package that allows using 4 different face detectors (OpenCV Haar Cascade, Dlib HoG, OpenCV Deep Learning-based, and Dlib Deep Learning-based) by just changing a single line of code.

Real-Time Fingers Counter & Hand Gesture Recognizer with Mediapipe and Python

Real-Time Fingers Counter & Hand Gesture Recognizer with Mediapipe and Python

Watch Video Here

In the last Week’s tutorial, we had learned to perform real-time hands 3D landmarks detection, hands classification (i.e., either left or right), extraction of bounding box coordinates from the landmarks, and the utilization of the depth (z-coordinates) of the hands to create a customized landmarks annotation.

Yup, that was a whole lot, and we’re not coming slow in this tutorial too, 😉

In this week’s tutorial, we’ll learn to utilize the landmarks to count and fingers (that are up) in images and videos and create a real-time hand counter. We will also create a hand finger recognition and visualization application that will display the exact fingers that are up.  This will work for both hands.

Then based on the status (i.e., up/down) of the fingers, we will build a Hand Gesture Recognizer that will be capable of identifying multiple gestures. 

Below are the results on a few sample images but this will also work on camera feed in real-time and on recorded videos as well. 

You will not need any expensive GPU, your CPU will suffice as the whole code is highly optimized.

And that is not all, in the end, on top of all this, we will build a Selfie-Capturing System that will be controlled using hand gestures to enhance the user experience. So we will be able to capture images and also turn an image filter on/off without even touching our device. The image below shows a visual of what this system will be capable of.

Well 🤔, maybe not exactly that but somewhat similar.

Excited yet? I know I am! Before diving into the implementation, let me tell you that as a child, I always was fascinated with the concept of automating the interaction between people and machines and that was one of the reasons I got into programming. 

To be more precise, I wanted to control my computer with my mind, yes I know how this sounds but I was just a kid back then. Controlling computers via mind with high fidelity is not feasible yet but hey Elon, is working on it.. So there’s still hope.

But for now, why don’t we utilize the options we have. I have published some other tutorials too on controlling different applications using hand body gestures.

So I can tell you that using hand gestures to interact with a system is a much better option than using some other part like the mouth since hands are capable of making multiple shapes and gestures without much effort.

Also during these crucial times of covid-19, it is very unsafe to touch the devices installed at public places like ATMs. So upgrading these to make them operable via gestures can tremendously reduce infection risk..

Tony Stark, the boy Genius can be seen in movies to control stuff with his hand gestures, so why let him have all the fun when we can join the party too. 

You can also use the techniques you’ll learn in this tutorial to control any other Human-Computer Interaction based application.

The tutorial is divided into small steps with every step explained in detail in the simplest manner possible. 


  1. Step 1: Perform Hands Landmarks Detection
  2. Step 2: Build the Fingers Counter
  3. Step 3: Visualize the Counted Fingers
  4. Step 4: Build the Hand Gesture Recognizer
  5. Step 5: Build a Selfie-Capturing System controlled by Hand Gestures

Now if you want you can go and watch the youtube tutorial before going further, for detailed insights and intuitions, although this blog post alone can suffice.

Alright, so without further ado, let’s get started.

Import the Libraries

First, we will import the required libraries.

Initialize the Hands Landmarks Detection Model

After that, we will need to initialize the mp.solutions.hands class and then set up the mp.solutions.hands.Hands() function with appropriate arguments and also initialize mp.solutions.drawing_utils class that is required to visualize the detected landmarks. We will be working with images and videos as well, so we will have to set up the mp.solutions.hands.Hands() function two times.

Once with the argument static_image_mode set to True to use with images and the second time static_image_mode set to False to use with videos. This speeds up the landmarks detection process, and the intuition behind this was explained in detail in the previous post.

Step 1: Perform Hands Landmarks Detection

In the step, we will create a function detectHandsLandmarks() that will take an image/frame as input and will perform the landmarks detection on the hands in the image/frame using the solution provided by Mediapipe and will get twenty-one 3D landmarks for each hand in the image. The function will display or return the results depending upon the passed arguments.

The function is quite similar to the one in the previous post, so if you had read the post, you can skip this step. I could have imported it from a separate .py file, but I didn’t, as I wanted to make this tutorial with the minimal number of prerequisites possible.

Now let’s test the function detectHandsLandmarks() created above to perform hands landmarks detection on a sample image and display the results.

Great! got the required landmarks so the function is working accurately.

Step 2: Build the Fingers Counter

Now in this step, we will create a function countFingers() that will take in the results of the landmarks detection returned by the function detectHandsLandmarks() and will utilize the landmarks to count the number of fingers up of each hand in the image/frame and will return the count and the status of each finger in the image as well.

How will it work?

To check the status of each finger (i.e., either it is up or not), we will compare the y-coordinates of the FINGER_TIP landmark and FINGER_PIP landmark of each finger. Whenever the finger will be up, the y-coordinate of the FINGER_TIP landmark will have a lower value than the FINGER_PIP landmark.

But for the thumbs, the scenario will be a little different as we will have to compare the x-coordinates of the THUMB_TIP landmark and THUMB_MCP landmark and the condition will vary depending upon whether the hand is left or right.

For the right hand, whenever the thumb will be open, the x-coordinate of the THUMB_TIP landmark will have a lower value than the THUMB_MCP landmark, and for the left hand, the x-coordinate of the THUMB_TIP landmark will have a greater value than the THUMB_MCP landmark.

Note: You have to face the palm of your hand towards the camera.

Now we will utilize the function countFingers() created above on a real-time webcam feed to count the number of fingers in the frame.

Output Video

Astonishing! the fingers are being counted very fast.

Step 3: Visualize the Counted Fingers

Now that we have built the finger counter, in this step, we will visualize the status (up or down) of each finger in the image/frame in a very appealing way. We will draw left and right handprints on the image and will change the color of the handprints in real-time depending upon the output (i.e., status (up or down) of each finger) from the function countFingers().

  • The hand print will be Red if that particular hand (i.e., either right or left) is not present in the image/frame.
  • The hand print will be Green if the hand is present in the image/frame.
  • The fingers of the hand print, that are up, will be highlighted by with the Orange color and the fingers that are down, will remain Green.

To accomplish this, we will create a function annotate() that will take in the output of the function countFingers() and will utilize it to simply overlay the required hands and fingers prints on the image/frame in the required color.

We have the .png images of the hands and fingers prints in the required colors (red, green, and orange) with transparent backgrounds, so we will only need to select the appropriate images depending upon the hands and fingers statuses and overlay them on the image/frame. You will also get these images with the code when you will download them.

Now we will use the function annotate() created above on a webcam feed in real-time to visualize the results of the fingers counter.

Output Video

Woah! that was Cool, the results are delightful.

Step 4: Build the Hand Gesture Recognizer

We will create a function recognizeGestures() in this step, that will use the status (i.e., up or down) of the fingers outputted by the function countFingers() to determine the gesture of the hands in the image. The function will be capable of identifying the following hand gestures:

  • V Hand Gesture ✌️ (i.e., only the index and middle finger up)
  • SPIDERMAN Hand Gesture 🤟 (i.e., the thumb, index, and pinky finger up)
  • HIGH-FIVE Hand Gesture ✋ (i.e., all the five fingers up)

For the sake of simplicity, we are only limiting this to three hand gestures. But if you want, you can easily extend this function to make it capable of identifying more gestures just by adding more conditional statements.