
Real-Time 3D Hands Landmarks Detection & Hands Classification with Mediapipe and Python

By Taha Anwar and Rizwan Naeem

On August 16, 2021


In this tutorial, we’ll learn how to do real-time 3D hands landmarks detection using the Mediapipe library in Python. After that, we’ll learn to perform hands type classification (i.e., is it a left or right hand) and then draw bounding boxes around the hands by retrieving the required coordinates from the detected landmarks.

Below are the results on a few sample images, and this will work on real-time camera feed or recorded videos as well.

And last but not least, we will also create a customized landmarks annotation for the hands utilizing the depth (distance from the camera) of each hand, as Mediapipe’s solution provides three-dimensional landmarks.

The annotation provided by Mediapipe allows users to annotate the images and videos with a single line of code but it does not utilize the depth and draws fixed-sized lines and circles on the hands. 

But in our customized annotation, the thickness of the lines (connections) and circles (landmarks) for each hand will vary in real-time based on the depth of that hand: as the distance from the camera (depth) decreases, the size of the annotation increases.

The motivation behind this was that the size of the hand in the image/frame increases when the hand is moved closer to the camera, so using a fixed size annotation for a hand that varies in size was hard to digest for me. You can see the comparison of Mediapipe’s annotation and our customized annotation below.

The code for all this is very easy and is explained in the simplest manner possible. 

Now before diving further, you can go and watch the YouTube tutorial for a detailed explanation, although this blog post alone can also serve the purpose.

This post can be split into 4 parts:

Part 1 (a): Introduction to Hands Recognition & Landmarks Detection

Part 1 (b): Mediapipe’s Hands Landmarks Detection Implementation

Part 2: Using Hands Landmarks Detection on images and videos

Part 3: Hands Classification (i.e., Left or Right)

Part 4 (a): Draw Bounding Boxes around the Hands

Part 4 (b): Draw Customized Landmarks Annotation

Part 1 (a): Introduction to Hands Recognition & Landmarks Detection

Hands Recognition & Landmarks Detection belongs to the keypoint estimation category of computer vision that I talked about in detail in a post I wrote a few weeks ago, Real-Time 3D Pose Detection & Pose Classification with Mediapipe and Python, so make sure to read that.

In this one, we will learn to localize twenty-one key landmarks on the hand(s), e.g., the wrist, thumb, and fingertips. See the image below:

It is an important and popular pre-processing task in many vision applications, as it allows us to localize and recognize the shape and motion of hands, which opens the door to a ton of applications like:

  • Augmented Reality Applications that overlay Digital Content and Information over Reality like a Virtual Painter. 
  • Sign Language Recognition. 
  • Hand Gesture Controlled Drones and Robots or any other Physical Appliances.

Using landmark detection is also a great way to interact with any Human-Computer Interaction (HCI) based system as the human hand is capable of making multiple gestures without much effort.

Some other types of keypoint detection problems are facial landmark detection, pose detection, etc.

I have already made a tutorial on pose detection and will explain facial landmark detection in an upcoming tutorial.

Part 1 (b): Mediapipe’s Hands Landmarks Detection Implementation

Here’s a brief introduction to Mediapipe:

 “Mediapipe is a cross-platform/open-source tool that allows you to run a variety of machine learning models in real-time. It’s designed primarily for facilitating the use of ML in streaming media & It was built by Google”

All the models in Mediapipe have state-of-the-art performance, are highly optimized, and are used in a lot of well-known products out there.

It provides a robust solution capable of predicting twenty-one 3D landmarks on a hand in real-time with high accuracy and speed, even on low-end devices, e.g., phones and laptops, which makes it stand out from the other solutions out there.

Now you may be thinking: what makes it so fast?

Actually, they have combined a computationally costly object detector with an object tracker that requires a lot less processing. So for the videos, a tracker is used instead of invoking the object detector at every frame which makes this solution so fast and brilliant. 

The detector is only invoked as needed, that is in the very first frame or when the tracker loses track of any of the hands. The detector localizes the hand in the image/frame and outputs the bounding box coordinates around the hand.

Then the region of interest (ROI) is cropped from the image/frame using the bounding box coordinates and after that, the cropped image/frame is used by the hand landmark model to predict the landmarks within the ROI.

A palm detector is used because detecting a palm or a fist (i.e., rigid objects) is a comparatively simpler task than detecting a whole hand. Palms can be modeled with square bounding boxes, which reduces the number of anchors (predefined bounding boxes of a certain height and width). Also, as palms are smaller objects, the possibility of self-occlusion (being hidden by themselves), like during handshakes, is reduced. Hands, comparatively, lack easily distinguishable patterns and are often occluded.

The image below shows the twenty-one hand landmarks, that this solution detects, along with their indexes.

They manually annotated around 30K real-world images with 3D coordinates by using image depth maps, and also rendered a high-quality synthetic hand model over various backgrounds and mapped it to the corresponding 3D coordinates.

So they kind of automated the whole annotation process. The image below contains the aligned cropped hand images and the rendered synthetic hand images with ground truth annotation.

Alright, now that we have learned the required basic theory and implementation details, let’s dive into the code.

Download Code Files From Here


Import the Libraries

We will start by importing the required libraries.

Part 2: Using Hands Landmarks Detection on images and videos

Initialize the Hands Landmarks Detection Model

To use Mediapipe’s hands solution, we first have to initialize the hands class using the mp.solutions.hands syntax and then call the function mp.solutions.hands.Hands() with the arguments explained below:

  • static_image_mode – It is a boolean value; if set to False, the solution treats the input images as a video stream. It will try to detect hands in the first input images, and upon a successful detection, further localizes the hand landmarks. In subsequent images, once all max_num_hands hands are detected and the corresponding hand landmarks are localized, it simply tracks those landmarks without invoking another detection until it loses track of any of the hands. This reduces latency and is ideal for processing video frames. If set to True, hand detection runs on every input image, which is ideal for processing a batch of static, possibly unrelated, images. Its default value is False.
  • max_num_hands – It is the maximum number of hands to detect. Its default value is 2.
  • min_detection_confidence – It is the minimum detection confidence ([0.0, 1.0]) required to consider the palm-detection model’s prediction correct. Its default value is 0.5 which means that all the detections with prediction confidence less than 50% are ignored by default.
  • min_tracking_confidence – It is the minimum tracking confidence ([0.0, 1.0]) required to consider the landmark-tracking model’s tracked hands landmarks valid. If the confidence is less than this argument value then the detector is invoked again in the next frame/image, so increasing its value increases the robustness, but also increases the latency. Its default value is 0.5.

Then we will also need to initialize the mp.solutions.drawing_utils class that is very useful to visualize the landmarks on the images/frames.

Read an Image

Now we will use the function cv2.imread() to read a sample image and then display it using the matplotlib library.

Perform Hands Landmarks Detection

Now we will pass the image to the hands landmarks detection machine learning pipeline by using the function mp.solutions.hands.Hands().process(). But first, we will have to convert the image from BGR to RGB format using the function cv2.cvtColor(), as OpenCV reads images in BGR format and the ML pipeline expects the input images to be in RGB color format.

The machine learning pipeline outputs a list of twenty-one landmarks for each of the prominent hands in the image. Each landmark has:

  • x – It is the landmark x-coordinate normalized to [0.0, 1.0] by the image width.
  • y – It is the landmark y-coordinate normalized to [0.0, 1.0] by the image height.
  • z – It is the landmark z-coordinate normalized to roughly the same scale as x. It represents the landmark depth with the wrist as the origin, so the smaller the value, the closer the landmark is to the camera.

To get more intuition, we will display the first two landmarks of each hand. The ML pipeline outputs an object with an attribute multi_hand_landmarks that contains the found landmark coordinates of each hand as an element of a list.

Note: The z-coordinate is just the relative distance of the landmark from the wrist, and this distance increases and decreases depending upon the distance from the camera, which is why it represents the depth of each landmark point.

Output

HAND NUMBER: 1

WRIST:
x: 0.37476441264152527
y: 0.6743423938751221
z: -6.916032725712284e-05

THUMB_CMC:
x: 0.4167462885379791
y: 0.6456755995750427
z: 0.009653960354626179

HAND NUMBER: 2

WRIST:
x: 0.5507286787033081
y: 0.8074262142181396
z: -2.4604334612376988e-05

THUMB_CMC:
x: 0.510724663734436
y: 0.8015302419662476
z: 0.002759072231128812

As you can see, the landmarks are normalized to specific scales, so now we will convert them back to their original scale by using the width and height of the sample image and display them.
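The conversion itself is simple scaling; here is a pure-Python sketch using the WRIST values printed above (the 1280x853 image size is our assumption inferred from the printed output; note that z, like x, is scaled by the image width):

```python
def denormalize(x, y, z, image_width, image_height):
    '''Scale normalized Mediapipe landmark values to pixel units.
    x and z are normalized by the image width, y by the image height.'''
    return x * image_width, y * image_height, z * image_width

# WRIST of hand number 1 from the output above, on a 1280x853 image.
wx, wy, wz = denormalize(0.37476441264152527, 0.6743423938751221,
                         -6.916032725712284e-05, 1280, 853)
print(f'x: {wx}\ny: {wy}\nz: {wz}')
```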

Output

HAND NUMBER: 1

WRIST:
x: 479.69844818115234
y: 575.2140619754791
z: -0.08852521888911724

THUMB_CMC:
x: 533.4352493286133
y: 550.7612864375114
z: 12.357069253921509

HAND NUMBER: 2

WRIST:
x: 704.9327087402344
y: 688.7345607280731
z: -0.031493548303842545

THUMB_CMC:
x: 653.7275695800781
y: 683.7052963972092
z: 3.531612455844879

Now we will draw the detected landmarks on a copy of the sample image using the function mp.solutions.drawing_utils.draw_landmarks() from the mp.solutions.drawing_utils class we initialized earlier, and display the resultant image.

Part 3: Hands Classification (i.e., Left or Right)

Create a Hands Landmarks Detection Function

Now we will put all this together to create a function that will perform hands landmarks detection on an image and will either visualize the resultant image along with the original image, or return the resultant image along with the output of the model, depending upon the passed arguments.

Now we will utilize the function created above to perform hands landmarks detection on a few sample images and display the results.

Hands Landmarks Detection on Real-Time Webcam Feed

The results on the images were excellent, but now the real test begins: we will try the function on a real-time webcam feed. We will also calculate and display the number of frames being updated per second to get an idea of whether this solution can work in real-time on a CPU, as that is the thing that differentiates it from the other solutions out there.

Output

Woah! That was impressive; not only was it fast, but the results were quite accurate too.

Create a Hand Type Classification Function

Now we will create a function that will perform hands type classification (i.e., is it a left or right hand). The output (object) of the hands landmarks detector model contains another attribute multi_handedness that contains a score (the probability of the predicted label being correct) and a label (either "Left" or "Right") for each detected hand.

While determining the label, it is assumed that the input image is mirrored (i.e., flipped horizontally). So classification was already performed during the hands landmarks detection, and now we only need to access the information stored in the attribute multi_handedness.
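A hedged sketch of such a helper (the function name getHandType is our own; since it only reads the multi_handedness attribute, it is shown here as a pure function over the pipeline's results object):

```python
def getHandType(results):
    '''Return {hand_index: (label, score)} for each detected hand, read
    from the multi_handedness attribute of the pipeline output. The label
    assumes the input image was mirrored (flipped horizontally).'''
    hands_status = {}
    if results.multi_handedness:
        for hand_no, hand_info in enumerate(results.multi_handedness):
            classification = hand_info.classification[0]
            hands_status[hand_no] = (classification.label,
                                     classification.score)
    return hands_status

# Usage: hands_status = getHandType(hands.process(rgb_image))
```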

Now we will utilize the function created above to perform hand type classification on a few sample images and display the results.

Cool! It worked perfectly on each of the sample images.

Part 4 (a): Draw Bounding Boxes around the Hands

Create a Function to Draw Bounding Boxes

Now we will create a function that will draw bounding boxes around the hands and write their classified types near them. We will first convert the normalized landmarks back to their original scale by using the width and height of the image. We will then get the bounding box coordinates ((x1,y1), (x2, y2)) for each hand.

  • Top Left Coordinate:
    • x1 – the smallest x-coordinate in the list of the found landmarks of the hand.
    • y1 – the smallest y-coordinate in the list of the found landmarks of the hand.
  • Bottom Right Coordinate:
    • x2 – the largest x-coordinate in the list of the found landmarks of the hand.
    • y2 – the largest y-coordinate in the list of the found landmarks of the hand.

Then we will draw the bounding boxes around the hands using the found coordinates and the specified padding and write the classified types of each hand near them using the bounding box coordinates. And after that, we will either display the resultant image or return it depending upon the passed arguments.
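The bounding-box computation itself can be sketched as a pure function (the name, the padding default, and the drawing colors in the comments are our own assumptions):

```python
def getBoundingBox(hand_landmarks, image_width, image_height, padding=10):
    '''Compute a padded pixel-space bounding box ((x1, y1), (x2, y2))
    from a hand's normalized landmarks.'''
    xs = [landmark.x * image_width for landmark in hand_landmarks.landmark]
    ys = [landmark.y * image_height for landmark in hand_landmarks.landmark]
    x1, y1 = int(min(xs)) - padding, int(min(ys)) - padding  # top-left
    x2, y2 = int(max(xs)) + padding, int(max(ys)) + padding  # bottom-right
    return (x1, y1), (x2, y2)

# Drawing and labeling on a BGR image would then look like:
# cv2.rectangle(image, (x1, y1), (x2, y2), (155, 0, 255), 3)
# cv2.putText(image, hand_type, (x1, y1 - 10),
#             cv2.FONT_HERSHEY_COMPLEX, 0.7, (20, 255, 155), 2)
```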

Hands Bounding boxes & Hands Types Detection on Real-Time Webcam Feed

Now we will utilize the function created above to perform hand type classification and draw bounding boxes around the hands on a real-time webcam feed.

Output

Great! The classification, along with localization, works pretty accurately on a real-time webcam feed too.

Part 4 (b): Draw Customized Landmarks Annotation

Create a Function to Draw Customized Landmarks Annotation

Now we will create a function that will draw customized landmarks of the hands. What we are doing differently is that we are utilizing the depth (z-coordinate) values to increase and decrease the size of the lines and circles, whereas Mediapipe’s annotation uses fixed sizes. As we have learned, the z-coordinate represents the landmark depth, so the smaller the value, the closer the landmark is to the camera.

We calculate the average depth of all the landmarks of a hand, and as the average depth of a hand decreases, we increase the thickness of that hand’s annotation circles and lines. This means the closer the hand is to the camera, the bigger the annotation will be, adjusting the annotation size to the size of the hand.

Mediapipe’s Annotation vs Our Customized Annotation on Real-Time Webcam Feed

Now we will utilize the function created above to draw the customized annotation on a real-time webcam feed and stack it with the results of Mediapipe’s annotation to visualize the difference.

Output

As expected, the results were remarkable! The thickness of the annotation circles and lines of each hand increased with the decrease in distance, so the hack that Mediapipe uses to calculate depth works pretty well and is computationally very reasonable too.

Join My Upcoming Computer Vision For Building Cutting Edge Applications Course

A course that goes beyond basic applications and teaches you how to create next-level apps that utilize physics, deep learning (LSTM + CNN), classical image processing, and hand and body gestures to do a variety of very interesting things.

Summary:

In this tutorial, we have learned about a very popular and useful computer vision problem called hand landmarks detection. First, we covered what exactly this is, along with its applications, and then we moved to the implementation details that Mediapipe has used to provide the solution. 

Also, we learned how it used a detection/tracker pipeline to provide the speed for which it stands out. After that, we performed 3D hands landmarks detection using Mediapipe’s solution on images and a real-time webcam feed. 

Then we learned to classify hands as left or right and draw bounding boxes around them and after that, we learned to draw customized landmarks annotation utilizing the z-coordinate (depth) values of the hands.

Now, a drawback of using this hand landmarks detection system is that you have to provide the maximum number of possible hands in the image/frame beforehand, and the computationally expensive detector is invoked on every frame until the number of detected hands equals the provided maximum.

Another limitation is that the z-coordinate is just the relative distance of the landmark from the wrist, and that distance can also vary depending upon the posture of the hand i.e., whether the hand is closed or wide open. So it does not truly represent the depth, but still, it’s a great hack to calculate the depth from 2D images without using a depth camera.

That will be all for today, if you have enjoyed it, tell me in the comments and you can also support me and the Bleed AI team on Patreon here.
If you need 1 on 1 Coaching in AI/computer vision regarding your project or your career then you can reach out to me personally here.
