fbpx

Facial Landmark Detection with Mediapipe & Creating Animated Snapchat filters

By Taha Anwar and Rizwan Naeem

On October 29, 2021

Watch Video Here

In this tutorial, we’ll learn to perform real-time multi-face detection followed by 3D face landmarks detection using the Mediapipe library in python on 2D images/videos, without using any dedicated depth sensor. After that, we will learn to build a facial expression recognizer that tells you if the person’s eyes or mouth are open or closed

Below you can see the facial expression recognizer in action, on a few sample images:


And then, in the end, we see how we can combine what we’ve learned to create animated Snapchat-like 2d filters and overlay them over the faces in images and videos. The filters will trigger in real-time for videos based on the facial expressions of the person. Below you can see results on a sample video.

Everything that we will build will work on the images, camera feed in real-time, and recorded videos as well, and the code is very neatly structured and is explained in the simplest manner possible.

This tutorial also has a video version that you can go and watch for a detailed explanation, although this blog post alone can also suffice.

This post can be split into 4 parts:

Part 1 (a): Introduction to Face Landmarks Detection

Part 1 (b): Mediapipe’s Face Landmarks Detection Implementation

Part 2: Face Landmarks Detection on images and videos

Part 3: Face Expression Recognition

Part 4: Snapchat Filter Controlled by Facial Expressions

Part 1 (a): Introduction to Face Landmarks Detection

Facial landmark detection/estimation is the process of detecting and tracking face key landmarks (that represent important regions of the face e.g, the center of the eye, and the tip of the nose, etc) in images and videos. It allows you to localize the face features and identify the shape and orientation of the face.

It also fits into the key point estimation category that I had explained in detail a few weeks ago in Real-Time 3D Pose Detection & Pose Classification with Mediapipe and Python post, make sure to check that one out too. 

In this tutorial, we will learn to detect four hundred sixty-eight facial landmarks. Below are the results of the landmarks detector we will use.

It is a must-learn task for every vision practitioner as it is used as a pre-processing task in many vision applications like

Some other types of keypoint estimation tasks are hand landmark detection, pose detection, etc.

I have already made tutorials (Hands Landmarks Detection, Pose Detection) on both of them.

Part 1 (b): Mediapipe’s Face Landmarks Detection Implementation

If Here’s a brief introduction to Mediapipe;

 “Mediapipe is a cross-platform/open-source tool that allows you to run a variety of machine learning models in real-time. It’s designed primarily for facilitating the use of ML in streaming media & It was built by Google”

All the solutions provided by Mediapipe are state-of-the-art in terms of speed and accuracy and are used in a lot of well-known applications.

The facial landmarks detection solution provided by Mediapipe is capable of detecting 3D 468 facial landmarks from a 2D image/video and is pretty fast and highly accurate as well and even works fine for occluded faces in varying lighting conditions and with faces of various orientations, and sizes in real-time, even on low-end devices like mobile phones, and Raspberry Pi, etc.

The landmarks detector’s remarkable speed distinguishes it from the other solutions out there anThe landmarks detector’s remarkable speed distinguishes it from the other solutions out there and the reason which makes this solution so fast is that they are using a 2 step detection approach where they have combined a face detector with a comparatively less computationally expensive tracker

So that for the videos, the tracker can be used instead of invoking the face detector at every frame. Let’s dive further into more details

The machine learning pipeline of the Mediapipe’s solution contains two different models that work together:

  1. A face detector that operates on the full image and locates the faces in the image.
  2. A face landmarks detector that operates only on those face locations and predicts the 3D facial landmarks. 

So the landmarks detector gets an accurately cropped face ROI which makes it capable of precisely working on scaled, rotated, and translated faces without needing data augmentation techniques.

In addition, the faces can also be located based on the face landmarks identified in the previous frame, so the face detector is only invoked as needed, that is in the very first frame or when the tracker loses track of any of the faces.  

They have utilized transfer learning and used both synthetic rendered and annotated real-world data to get a model capable of predicting 3D landmark coordinates. Another approach could be to train a model to predict a 2D heatmap for each landmark but will increase the computational cost as there are so many points.

Alright now we have gone through the required basic theory and implementation details of the solution provided by Mediapipe, so without further ado,  let’s get started with the code.


Download Code:

Part 2: Face Landmarks Detection on images and videos

Import the Libraries

Let’s start by importing the required libraries.

As mentioned Mediapipe’s face landmarks detection solution internally uses a face detector to get the required Regions of Interest (faces) from the image. So before going to the facial landmarks detection, let’s briefly discuss that face detector first, as Mediapipe also allows to separately use it.

Face Detection

The mediapipe’s face detection solution is based on BlazeFace face detector that uses a very lightweight and highly accurate feature extraction network, that is inspired and modified from MobileNetV1/V2 and used a detection method similar to Single Shot MultiBox Detector (SSD). It is capable of running at a speed of 200-1000+ FPS on flagship devices. For more info, you can check the resources here.

Initialize the Mediapipe Face Detection Model

To use the Mediapipe’s Face Detection solution, we will first have to initialize the face detection class using the syntax mp.solutions.face_detection, and then we will have to call the function mp.solutions.face_detection.FaceDetection() with the arguments explained below:

  • model_selection – It is an integer index ( i.e., 0 or 1 ). When set to 0, a short-range model is selected that works best for faces within 2 meters from the camera, and when set to 1, a full-range model is selected that works best for faces within 5 meters. Its default value is 0.
  • min_detection_confidence – It is the minimum detection confidence between ([0.0, 1.0]) required to consider the face-detection model’s prediction successful. Its default value is 0.5 ( i.e., 50% ) which means that all the detections with prediction confidence less than 0.5 are ignored by default.

We will also have to initialize the drawing class using the syntax mp.solutions.drawing_utils which is used to visualize the detection results on the images/frames.

Read an Image

Now we will use the function cv2.imread() to read a sample image and then display the image using the matplotlib library, after converting it into RGB from BGR format.

Perform Face Detection

Now to perform the detection on the sample image, we will have to pass the image (in RGB format) into the loaded model by using the function mp.solutions.face_detection.FaceDetection().process() and we will get an object that will have an attribute detections that contains a list of a bounding box and six key points for each face in the image. The six key points are on the:

  1. Right Eye
  2. Left Eye
  3. Nose Tip
  4. Mouth Center
  5. Right Ear Tragion
  6. Left Ear Tragion

After performing the detection, we will display the bounding box coordinates and only the first two key points of each detected face in the image, so that you get more intuition about the format of the output.

FACE NUMBER: 1

—————————–

FACE CONFIDENCE: 0.98

FACE BOUNDING BOX:

xmin: 0.39702364802360535

ymin: 0.2762746810913086

width: 0.16100731492042542

height: 0.24132275581359863

RIGHT_EYE:

x: 0.4368540048599243

y: 0.3198586106300354

LEFT_EYE:

x: 0.5112437605857849

y: 0.3565130829811096

Note: The bounding boxes are composed of xmin and width (both normalized to [0.0, 1.0] by the image width) and ymin and height (both normalized to [0.0, 1.0] by the image height). Each keypoint is composed of x and y, which are normalized to [0.0, 1.0] by the image width and height respectively.

Now we will draw the detected bounding box(es) and the key points on a copy of the sample image using the function mp.solutions.drawing_utils.draw_detection() from the class mp.solutions.drawing_utils, we had initialized earlier and will display the resultant image using the matplotlib library.

Note: Although, the detector quite accurately detects the faces, but fails to precisely detect facial key points (landmarks) in some scenarios (e.g. for non-frontal, rotated, or occluded faces) so that is why we will need the Mediapipe’s face landmarks detection solution for creating the Snapchat filter that is our main goal.

Face Landmarks Detection

Now, let’s move to the facial landmarks detection, we will start by initializing the face landmarks detection model.

Initialize the Mediapipe Face Landmarks Detection Model

To initialize the Mediapipe’s face landmarks detection model, we will have to initialize the face mesh class using the syntax mp.solutions.face_mesh and then we will have to call the function mp.solutions.face_mesh.FaceMesh() with the arguments explained below:

  • static_image_mode – It is a boolean value that is if set to False, the solution treats the input images as a video stream. It will try to detect faces in the first input images, and upon a successful detection further localizes the face landmarks. In subsequent images, once all max_num_faces faces are detected and the corresponding face landmarks are localized, it simply tracks those landmarks without invoking another detection until it loses track of any of the faces. This reduces latency and is ideal for processing video frames. If set to True, face detection runs on every input image, ideal for processing a batch of static, possibly unrelated, images. Its default value is False.
  • max_num_faces – It is the maximum number of faces to detect. Its default value is 1.
  • min_detection_confidence – It is the minimum detection confidence ([0.0, 1.0]) required to consider the face-detection model’s prediction correct. Its default value is 0.5 which means that all the detections with prediction confidence less than 50% are ignored by default.
  • min_tracking_confidence – It is the minimum tracking confidence ([0.0, 1.0]) from the landmark-tracking model for the face landmarks to be considered tracked successfully, or otherwise face detection will be invoked automatically on the next input image, so increasing its value increases the robustness, but also increases the latency. It is ignored if static_image_mode is True, where face detection simply runs on every image. Its default value is 0.5.

After that, we will initialize the mp.solutions.drawing_styles class that will allow us to get different provided drawing styles of the landmarks on the images/frames.

Perform Face Landmarks Detection

Now to perform the landmarks detection, we will pass the image (in RGB format) to the face landmarks detection machine learning pipeline by using the function mp.solutions.face_mesh.FaceMesh().process() and get a list of four hundred sixty-eight facial landmarks for each detected face in the image. Each landmark will have:

  • x – It is the landmark x-coordinate normalized to [0.0, 1.0] by the image width.
  • y – It is the landmark y-coordinate normalized to [0.0, 1.0] by the image height.
  • z – It is the landmark z-coordinate normalized to roughly the same scale as x. It represents the landmark depth with the center of the head being the origin, and the smaller the value is, the closer the landmark is to the camera.

We will display only two landmarks of each eye to get an intuition about the format of output, the ml pipeline outputs an object that has an attribute multi_face_landmarks that contains the found landmarks coordinates of each face as an element of a list.

FACE NUMBER: 1

—————————–

LEFT EYE LANDMARKS:

x: 0.49975821375846863
y: 0.3340317904949188
z: -0.0035526191350072622

x: 0.505615234375
y: 0.33464953303337097
z: -0.005253124982118607

RIGHT EYE LANDMARKS:

x: 0.4383838176727295
y: 0.2998684346675873
z: -0.0014895268250256777

x: 0.430422842502594
y: 0.30033284425735474
z: 0.006082724779844284

Note: The z-coordinate is just the relative distance of the landmark from the center of the head, and this distance increases and decreases depending upon the distance from the camera so that is why it represents the depth of each landmark point.

Now we will draw the detected landmarks on a copy of the sample image using the function mp.solutions.drawing_utils.draw_landmarks() from the class mp.solutions.drawing_utils, we had initialized earlier and will display the resultant image. The function mp.solutions.drawing_utils.draw_landmarks() can take the following arguments.

  • image – It is the image in RGB format on which the landmarks are to be drawn.
  • landmark_list – It is the normalized landmark list that is to be drawn on the image.
  • connections – It is the list of landmark index tuples that specifies how landmarks to be connected in the drawing. The provided options are; mp_face_mesh.FACEMESH_FACE_OVAL, mp_face_mesh.FACEMESH_LEFT_EYE, mp_face_mesh.FACEMESH_LEFT_EYEBROW, mp_face_mesh.FACEMESH_LIPS, mp_face_mesh.FACEMESH_RIGHT_EYE, mp_face_mesh.FACEMESH_RIGHT_EYEBROW, mp_face_mesh.FACEMESH_TESSELATION, mp_face_mesh.FACEMESH_CONTOURS.
  • landmark_drawing_spec – It specifies the landmarks’ drawing settings such as color, line thickness, and circle radius. It can be set equal to the mp.solutions.drawing_utils.DrawingSpec(color, thickness, circle_radius)) object.
  • connection_drawing_spec – It specifies the connections’ drawing settings such as color and line thickness. It can be either a mp.solutions.drawing_utils.DrawingSpec object or a function from the class mp.solutions.drawing_styles, the currently provided options for face mesh are; get_default_face_mesh_contours_style() ,get_default_face_mesh_tesselation_style().

Create a Face Landmarks Detection Function

Now we will put all this together to create a function detectFacialLandmarks() that will perform face landmarks detection on an image and will visualize the resultant image along with the original image or return the resultant image along with the output of the model depending upon the passed arguments.

Now we will utilize the function detectFacialLandmarks() created above to perform face landmarks detection on a few sample images and display the results.

Face Landmarks Detection on Real-Time Webcam Feed

The results on the images were remarkable, but now we will try the function on a real-time webcam feed. We will also calculate and display the number of frames being updated in one second to get an idea of whether this solution can work in real-time on a CPU or not.

Output

Impressive! the solution is fast as well as accurate.

Face Expression Recognition

Now that we have the detected landmarks, we will use them to recognize the facial expressions of people in the images/videos using the classical techniques. Our recognizor will be capable of identifying the following facial expressions:

  • Eyes Opened or Closed 😳 (can be used to check drowsiness, wink or shock expression)
  • Mouth Opened or Closed 😱 (can be used to check yawning)

For the sake of simplicity, we are only limiting this to two expressions. But if you want, you can easily extend this application to make it capable of identifying more facial expressions just by adding more conditional statements or maybe merging these two conditions. Like for example, eyes and mouth both wide open can represent surprise expression.

Create a Function to Calculate Size of a Face Part

First, we will create a function getSize() that will utilize detected landmarks to calculate the size of a face part. All we will need is to figure out a way to isolate the landmarks of the face part and luckily that can easily be done using the frozenset objects (attributes of the mp.solutions.face_mesh class), which contain the required indexes.

  • mp_face_mesh.FACEMESH_FACE_OVAL contains indexes of face outline.
  • mp_face_mesh.FACEMESH_LIPS contains indexes of lips.
  • mp_face_mesh.FACEMESH_LEFT_EYE contains indexes of left eye.
  • mp_face_mesh.FACEMESH_RIGHT_EYE contains indexes of right eye.
  • mp_face_mesh.FACEMESH_LEFT_EYEBROW contains indexes of left eyebrow.
  • mp_face_mesh.FACEMESH_RIGHT_EYEBROW contains indexes of right eyebrow.

After retrieving the landmarks of the face part, we will simply pass it to the function cv2.boundingRect() to get the width and height of the face part. The function cv2.boundingRect(landmarks) returns the coordinates (x1, y1, width, height) of a bounding box enclosing the object (face part), given the landmarks but we will only need the height and width of the bounding box.

Now we will create a function isOpen() that will utilize the getSize() function we had created above to check whether a face part (e.g. mouth or an eye) of a person is opened or closed.

Hint: The height of an opened mouth or eye will be greater than the height of a closed mouth or eye.

Now we will utilize the function isOpen() created above to check the mouth and eyes status on a few sample images and display the results.

As expected, the results are fascinating!

Snapchat Filter Controlled by Facial Expressions

Now that we have the face expression recognizer, let’s start building a Snapchat filter on top of it, that will be triggered based on the facial expressions of the person in real-time.

Currently, our face expression recognizer can check whether the eyes and mouth are open 😯 or not 😌 so to get the most out of it, we can overlay scalable eyes 👀 images on top of the eyes of the user when his eyes are open and a video of fire 🔥 coming out of the mouth of the user when the mouth is open.

Create a Function to Overlay the Image Filters

Now we will create a function overlay() that will apply the filters on top of the eyes and mouth of a person in images/videos utilizing the facial landmarks to locate the face parts and will also resize the filter images according to the size of the face part on which the filter images will be overlayed.

Snapchat Filter on Real-Time Webcam Feed

Now we will utilize the function overlay() created above to apply filters based on the facial expressions, that we will recognize utilizing the function isOpen() on a real-time webcam feed.

Output

Cool! I am impressed by the results now if you want, you can extend the application and add more filters like glasses, nose, and ears, etc. and use some other facial expressions to trigger those filters.

Join My Upcoming Computer Vision For Building Cutting Edge Applications Course

A Course that goes beyond basic applications and teaches you how to create some next-level apps that utilize physics, deep learning (LSTM + CNN) + classical image processing, hand and body gestures to do a variety of very interesting things.

Summary:

Today, in this tutorial, we learned about a very common computer vision task called Face landmarks detection. First, we covered what exactly it is, along with its applications, and then we moved to the implementation details of the solution provided by Mediapipe and how it uses a 2-step (detection + tracking) pipeline to speed up the process.

After that, we performed multi-face detection and 3D face landmarks detection using Mediapipe’s solutions on images and real-time webcam feed. 

Then we learned to recognize the facial expressions in the images/videos utilizing the face landmarks and after that, we learned to apply face filters, which were dynamically controlled by the facial expressions in the images/videos.

Alright here are a few limitations of our application that you should know about, the face expression recognizer we created is really basic to recognize dedicated expressions like shock, surprise. For that, you should train a DL model on top of these landmarks.

Another current limitation is that the face filters are not currently being rotated with the rotations of the faces in the images/videos. This can be overcome simply by calculating the face angle and rotating the filter images with the face angle. I am planning to cover this and a lot more in my upcoming course mentioned above.

Let me know in the comments, you can also reach out to me personally for a 1 on 1 Coaching/consultation session in AI/computer vision regarding your project or your career.

Ready to seriously dive into State of the Art AI & Computer Vision?
Then Sign up for these premium Courses by Bleed AI

4 Comments

  1. mahwiz

    great work

    Reply
    • Taha Anwar

      Thanks Mahwiz, 🙂

      Reply
  2. Jibran Abdul Jabbar

    Excellent Work

    Reply
    • Taha Anwar

      Thanks Jibran 🙂

      Reply

Submit a Comment

Your email address will not be published. Required fields are marked *