
MediaPipe – The Ultimate Guide to Video Processing


Ever wondered what runs behind “OK Google?” Well, that’s MediaPipe. If you have just started with MediaPipe and this is one of the first articles you are going through, congratulations, you found the right place. This article will cover the basics of MediaPipe, the difference between Solutions, the Framework and some super interesting real-world Applications!


The official MediaPipe documentation states that inference is real-time and that it takes just a few lines of code to create a perception pipeline. Is it that simple? Does MediaPipe provide real-time output for any data we throw at it? Follow along with the article to find out.

What is MediaPipe?

MediaPipe is a Framework for building machine learning pipelines for processing time-series data like video, audio, etc. This cross-platform Framework works on Desktop/Server, Android, iOS, and embedded devices like Raspberry Pi and Jetson Nano.

A Brief History of MediaPipe

Since 2012, Google has used it internally in several products and services. It was initially developed for real-time analysis of video and audio on YouTube. Gradually it got integrated into many more products; the following are some.

  1. Perception system in NestCam
  2. Object detection by Google Lens
  3. Augmented Reality Ads
  4. Google Photos
  5. Google Home
  6. Gmail
  7. Cloud Vision API, etc. 

MediaPipe powers revolutionary products and services we use daily. Unlike power-hungry machine learning Frameworks, MediaPipe requires minimal resources. It is so tiny and efficient that even embedded IoT devices can run it. In 2019, MediaPipe opened up a new world of opportunity for researchers and developers following its public release.

MediaPipe Toolkit

MediaPipe Toolkit comprises the Framework and the Solutions.  The following diagram shows the components of the MediaPipe Toolkit.

FIGURE 1: MediaPipe Toolkit

2.1 Framework

The Framework, written in C++, Java, and Obj-C, consists of the following APIs.

  1. Calculator API (C++).
  2. Graph construction API (Protobuf).
  3. Graph Execution API (C++, Java, Obj-C).

Graphs

The MediaPipe perception pipeline is called a Graph. Let us take the example of the first solution, Hands. We feed in a stream of images, and the graph outputs the same images with hand landmarks rendered on them.


The flow chart below represents the MP (Abbr. MediaPipe) hand solution graph.

FIGURE 2: MediaPipe Hands Solution Graph

In computer science jargon, a graph consists of nodes connected by edges. Inside the MediaPipe Graph, the nodes are called Calculators, and the edges are called Streams. Every stream carries a sequence of Packets with ascending timestamps.
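To make the packet-and-stream idea concrete, here is a toy, plain-Python model (not actual MediaPipe code) of a stream that enforces ascending timestamps:

```python
# Toy model of a MediaPipe-style stream: packets must arrive with
# strictly ascending timestamps, or the stream rejects them.
from dataclasses import dataclass, field

@dataclass
class Packet:
    timestamp: int   # e.g., milliseconds since stream start
    payload: object  # e.g., a video frame or audio segment

@dataclass
class Stream:
    packets: list = field(default_factory=list)

    def push(self, packet):
        if self.packets and packet.timestamp <= self.packets[-1].timestamp:
            raise ValueError("packet timestamps must be ascending")
        self.packets.append(packet)

s = Stream()
s.push(Packet(0, "frame0"))
s.push(Packet(33, "frame1"))  # ~30 FPS: one frame every 33 ms
```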

In the image above, we have represented Calculators with rectangular blocks and Streams using arrows. 

MediaPipe Calculators

These are specific computation units written in C++ with assigned tasks to process. Packets of data (video frames or audio segments) enter and leave through a calculator's ports. When initializing a calculator, it declares the packet payload type that will traverse the port. Every time a graph runs, the Framework invokes the Open, Process, and Close methods on its calculators. Open initializes the calculator; Process runs repeatedly each time a packet enters; Close runs once after the entire graph run.
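The lifecycle above can be mimicked in a few lines of plain Python (a toy model for intuition, not the real C++ Calculator API):

```python
# Toy sketch of the Open / Process / Close calculator lifecycle.
class Calculator:
    def open(self):
        # Called once before the graph run: initialize state.
        self.frames_seen = 0

    def process(self, packet):
        # Called repeatedly, once per incoming packet.
        self.frames_seen += 1
        return packet  # pass-through by default

    def close(self):
        # Called once after the whole graph run.
        return self.frames_seen

calc = Calculator()
calc.open()
for frame in ["f0", "f1", "f2"]:
    calc.process(frame)
total = calc.close()  # 3 frames processed
```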

As an example, consider the first calculator shown in the graph above. The calculator, ImageTransform, takes an image at its input port and returns a transformed image at its output port. The second calculator, ImageToTensor, takes an image as input and outputs a tensor.

Calculator Types in MediaPipe

All the calculators shown above are built into MediaPipe. We can group them into four categories.

  1. Pre-processing calculators are a family of image- and media-processing calculators. ImageTransform and ImageToTensor in the graph above fall in this category.
  2. Inference calculators allow native integration with TensorFlow and TensorFlow Lite for ML inference.
  3. Post-processing calculators perform ML post-processing tasks such as detection, segmentation, and classification. TensorToLandmark is a post-processing calculator.
  4. Utility calculators are a family of calculators performing final tasks such as image annotation.

The calculator APIs allow you to write your custom calculator. We will cover it in a future post.

2.2 MediaPipe Solutions

Solutions are open-source, pre-built examples based on a specific pre-trained TensorFlow or TFLite model. You can find the solution-specific models in the MediaPipe documentation. MediaPipe Solutions are built on top of the Framework. Currently, it provides the sixteen solutions listed below.

  1. Face Detection
  2. Face Mesh
  3. Iris
  4. Hands
  5. Pose
  6. Holistic
  7. Selfie Segmentation
  8. Hair Segmentation
  9. Object Detection
  10. Box Tracking
  11. Instant Motion Tracking
  12. Objectron
  13. KNIFT
  14. AutoFlip
  15. MediaSequence
  16. YouTube 8M

The solutions are available in C++, Python, JavaScript, Android, iOS, and Coral. As of now, most of the solutions are available in C++ (KNIFT and Instant Motion Tracking being the exceptions), followed by Android, with Python not far behind.

The other wrapper languages, too, are growing fast, with very active development. As you can see, even though the MediaPipe Framework is cross-platform, the same does not hold for the solutions. MediaPipe is currently at version 0.10.9 (as of 2024), and we can expect the solutions to gain broader support with future releases.


Synchronization and Performance Optimization

MediaPipe supports multimodal graphs. To speed up processing, different calculators run in separate threads, and many built-in calculators offer GPU acceleration for performance optimization. Time-series data must be synchronized properly; otherwise, the system breaks. The graph ensures the flow is handled correctly according to the packets' timestamps, and the Framework takes care of synchronization, context sharing, and interoperability with CPU calculators.

MediaPipe Dependency

MediaPipe depends on OpenCV for video and FFmpeg for audio data handling. It also has other dependencies such as OpenGL/Metal, TensorFlow, and Eigen [1].

Getting Started with MediaPipe

We recommend you have a basic knowledge of OpenCV before starting with MediaPipe. Check out this simplified series of posts on Getting started with OpenCV.

MediaPipe Python solutions are the easiest for beginners because of the simplicity of the setup process and the popularity of the Python programming language. The modularity of the MediaPipe Framework enables customization. But before plunging into customization, we recommend getting comfortable with the various pre-built solutions. Understand the APIs associated with them, and then tweak the outputs to create your own exciting applications.

MediaPipe Visualizer

MediaPipe Visualizer provides an easy way to try all the solutions. 

In MediaPipe, the protobuf (.pbtxt) text file defines a graph. The MediaPipe Visualizer welcome page greets you with a protobuf file containing a blank graph unit. It has various pre-built graphs of solutions that you can load from the New button at the top right. 
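For reference, a minimal `.pbtxt` graph definition looks like the following sketch, which wires a single node using MediaPipe's built-in PassThroughCalculator (illustrative only; real solution graphs chain many calculators):

```
# A minimal MediaPipe graph: one calculator that forwards packets unchanged.
input_stream: "in"
output_stream: "out"
node {
  calculator: "PassThroughCalculator"
  input_stream: "in"
  output_stream: "out"
}
```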

The visualizer works within the browser! Let’s give it a try.

The following screenshot shows an in-browser hand detection example.

FIGURE 3: MediaPipe In-browser Hand Detection

The desktop version of the pre-built MediaPipe is where you will have fun tweaking and creating your applications. You can install MediaPipe with a single command. 

pip install mediapipe

MediaPipe solutions come with easy-to-follow documentation [2], and we leave it to the reader to try them out.

Applications of MediaPipe

Today, there are multiple interesting real-world applications of MediaPipe in computer vision. In this section, we will explore a few of these applications and develop a comprehensive understanding of them:

Human Pose Estimation

Human pose estimation is a pivotal area in computer vision, focusing on detecting and analyzing human body postures and movements. It uses advanced algorithms to interpret complex poses from images or videos, enabling real-time analysis and interaction. MediaPipe facilitates building robust body posture analysis systems, which is essential for applications like ergonomic assessments and sports training. By integrating MediaPipe, developers can create AI fitness trainers offering personalized workout guidance and performance feedback.

In pose estimation, human body posture analysis primarily focuses on accurately identifying and assessing human postures. It can analyze body alignments and movements from visual data, providing insights into body language and physical form. This technology is pivotal in areas like ergonomics, sports coaching, and physical therapy, where precise posture evaluation is crucial. It enables the monitoring and correction of body postures in real-time, aiding in injury prevention and enhancing performance.
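A typical building block of posture evaluation is the joint angle between three landmarks (say, hip, knee, and ankle from a pose model). A small, self-contained helper could look like this (the function name is ours; it works on any (x, y) points, including MediaPipe Pose landmark coordinates):

```python
# Joint angle at point b, formed by segments b->a and b->c, in degrees.
# Illustrative posture-analysis helper; not part of the MediaPipe API.
import math

def joint_angle(a, b, c):
    """Angle ABC in degrees, points given as (x, y) tuples."""
    ang = math.degrees(
        math.atan2(c[1] - b[1], c[0] - b[0]) - math.atan2(a[1] - b[1], a[0] - b[0])
    )
    ang = abs(ang)
    return 360 - ang if ang > 180 else ang

print(joint_angle((0, 0), (0, 1), (0, 2)))  # straight limb: ~180 degrees
print(joint_angle((1, 0), (0, 0), (0, 1)))  # right-angle bend: ~90 degrees
```

Comparing such angles against thresholds (e.g., a knee angle below ~90° for a deep squat) is how posture checks are commonly implemented.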

FIGURE 4: Human Body Posture Analysis

Another interesting application of pose estimation is using MediaPipe as an AI fitness trainer for those perfect workouts! This specific Streamlit application can track postures such as bending forward or backward, lowering one's hips, knees going over toes, and deep squats.

FIGURE 5: AI Fitness Tracker

Driver Drowsiness Detection

Recent statistics highlight the significant impact of driver drowsiness on road safety. According to the National Highway Traffic Safety Administration (NHTSA), each year, drowsy driving leads to approximately 100,000 police-reported crashes and results in over 1,500 deaths. This alarming figure underscores the critical need for effective driver drowsiness detection systems.

In 2021, there were 684 fatalities in crashes involving a drowsy driver, accounting for 1.6% of all motor vehicle traffic crash fatalities. This statistic further emphasizes the role of driver fatigue in road accidents and fatalities. Given these figures, the development and implementation of driver drowsiness detection technologies are not only beneficial but potentially life-saving, offering a proactive solution to a prevalent and often overlooked road safety issue.

FIGURE 6: Driver Drowsiness Detection

Clearly, there is a need for an integrated system that detects driver drowsiness and alerts drivers accordingly. MediaPipe provides pre-built methods such as the Face Mesh pipeline, which can be combined with the simple yet robust Eye Aspect Ratio (EAR) technique to accurately detect drowsiness in drivers.
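The EAR technique itself is a one-line formula over six eye landmarks p1..p6 (in the classic Soukupová and Čech formulation): the ratio of the two vertical eye openings to the horizontal span. A sketch, assuming the six points have already been picked out of the Face Mesh output:

```python
# Eye Aspect Ratio: (|p2-p6| + |p3-p5|) / (2 * |p1-p4|).
# Drops toward 0 as the eye closes, which is what triggers a drowsiness alert.
import math

def ear(p1, p2, p3, p4, p5, p6):
    d = math.dist  # Euclidean distance between two points
    return (d(p2, p6) + d(p3, p5)) / (2.0 * d(p1, p4))

# A wide-open toy eye: vertical gaps comparable to the horizontal span.
open_eye = [(0, 0), (1, 1), (2, 1), (3, 0), (2, -1), (1, -1)]
print(round(ear(*open_eye), 2))  # 0.67
```

In practice, the EAR is averaged over both eyes and over a few frames, and an alert fires when it stays below a calibrated threshold.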

Video Call Experience Enhancements

During the COVID-19 pandemic, video calling applications experienced unprecedented growth, becoming essential tools for maintaining personal and professional connections. With lockdowns and social distancing measures in place, these apps facilitated remote work, online education, and virtual social gatherings. For instance, Zoom, a leading video conferencing app, reported a surge from 10 million to 300 million daily meeting participants in early 2020.

Introducing interactive features into such applications can greatly improve the user experience. One such feature is gesture control. What if you could control certain parameters within the application just by using hand gestures? Using a combination of MediaPipe's hand detection, face detection, and selfie segmentation methods, gesture control for your favorite video conferencing application is just a few lines of Python code away!
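One common gesture primitive is counting raised fingers from the 21-point hand landmarks: in image coordinates (where y grows downward), a finger counts as raised when its tip is above its PIP joint. A hedged sketch (the landmark indices follow MediaPipe's hand model; the mapping to actions is entirely ours):

```python
# Count raised fingers from 21 (x, y) hand landmarks in image coordinates.
# Fingertip indices 8/12/16/20 and PIP-joint indices 6/10/14/18 follow
# MediaPipe's 21-point hand landmark model; thumb logic is omitted for brevity.
TIPS = [8, 12, 16, 20]
PIPS = [6, 10, 14, 18]

def count_raised_fingers(landmarks):
    """landmarks: list of 21 (x, y) tuples; y grows downward."""
    return sum(
        1 for tip, pip in zip(TIPS, PIPS) if landmarks[tip][1] < landmarks[pip][1]
    )

# An app could then map, say, 2 raised fingers to "volume up" and a closed
# fist (0 raised) to "mute" -- that mapping is up to you.
```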

FIGURE 7: Gesture Control

Apple took the world by storm by introducing the revolutionary CenterStage feature in macOS. We all know that a few folks like to move around their rooms while giving a presentation from home, right? The problem is that a camera fixed in one location cannot account for the presenter's movements. CenterStage magically tracks the presenter's position and adjusts the camera stream accordingly. Sounds crazy, right? But have you ever wondered how such a system works, and how to implement it? It turns out that you can implement CenterStage for Zoom calls using MediaPipe and Python.
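The core of such a system is simple: detect the face's bounding box each frame (e.g., with MediaPipe Face Detection), then move a fixed-size crop window toward the face center with exponential smoothing so the virtual camera pans without jitter. A sketch of just the smoothing step (the function name and the 0.2 factor are our assumptions):

```python
# CenterStage-style framing: the crop window's center chases the detected
# face center with exponential smoothing, so panning is gradual, not jumpy.
def follow(crop_center, face_center, alpha=0.2):
    cx, cy = crop_center
    fx, fy = face_center
    return (cx + alpha * (fx - cx), cy + alpha * (fy - cy))

center = (320.0, 240.0)
for face in [(400.0, 240.0)] * 5:  # face moved right; pan over 5 frames
    center = follow(center, face)
# center.x drifts from 320 toward 400 a little more each frame
```

Cropping the frame around `center` and resizing back to the output resolution completes the effect.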

FIGURE 8: Center Stage – Dynamic Frame Adjustment

Designing Snapchat and Instagram Filters

At some point in our lives, we have all used Snapchat to share images with our friends. Applications such as Snapchat and Instagram become more fun to use thanks to their interactive filters. But have you ever wondered how these filters work behind the scenes?

FIGURE 9: AR Filters

It so happens that you can create Snapchat- and Instagram-style filters using MediaPipe. These applications use face detection to detect the presence and location of faces in the input camera stream; feature keypoints are then detected within each face to localize its different parts. Finally, the filter is applied on top of the face in real time.
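The last step, compositing the filter graphic onto the frame at a detected keypoint, is a plain alpha blend. A self-contained NumPy sketch (the function name is ours; it assumes the graphic fits inside the frame at the given position):

```python
# Alpha-blend a filter graphic onto a frame at top-left (x, y).
# Illustrative compositing helper, not part of the MediaPipe API.
import numpy as np

def overlay(frame, graphic, alpha_mask, x, y):
    """frame: HxWx3 image; graphic: hxwx3; alpha_mask: hxw in [0, 1]."""
    h, w = graphic.shape[:2]
    roi = frame[y:y + h, x:x + w].astype(float)
    a = alpha_mask[..., None]  # broadcast alpha over the color channels
    frame[y:y + h, x:x + w] = (a * graphic + (1 - a) * roi).astype(frame.dtype)
    return frame
```

In a real filter, `(x, y)` would come from a face keypoint (say, the nose tip from Face Mesh), and the alpha mask from the graphic's transparency channel.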

Do the above-mentioned applications seem fascinating to you? Head back to the Getting Started section, install MediaPipe, and try them out.

Is it Genuinely Real-Time?

MediaPipe Solutions do not always work in real-time. 

The solutions are built on top of MediaPipe Framework that provides Calculator API (C++), Graph construction API (Protobuf), and Graph Execution API (C++, Java, Obj-C). With the APIs, we can build our graph and write custom calculators. 

The Toolkit is excellent, but its performance depends on the underlying hardware. The following example shows a side-by-side inference comparison of HD (1280×720) and Ultra HD (3840×2160) video.
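You can reproduce such a comparison yourself by timing the per-frame inference loop. A minimal sketch, where `process_frame` stands in for any solution's `process()` call:

```python
# Measure throughput (FPS) of a per-frame processing function.
import time

def measure_fps(process_frame, frames):
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")

# Dummy workload standing in for real inference on 50 frames.
fps = measure_fps(lambda f: sum(f), [list(range(1000))] * 50)
```

Running the same loop on HD and Ultra HD input makes the resolution-dependent slowdown visible in numbers.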

FIGURE 10: MediaPipe Pose – FPS Comparison

You can try building MediaPipe solutions from source and will likely see a jump in performance. However, you may still not achieve real-time inference.

Warning: The MediaPipe Framework uses the Bazel [3] build system. Therefore, to harness the full potential of MediaPipe, you need to be reasonably comfortable with C++ and Bazel. The MediaPipe documentation is also NOT easy to follow, since the project is under active development.

Conclusion

MediaPipe solutions are straightforward, and you can cover them in a day or two. 

On the other hand, the learning curve can be pretty steep for the C++ MediaPipe Framework. Don’t worry; we will get there by taking baby steps.

Overall, it is a beautiful, fast-growing library that delivers promising results. Implementing MediaPipe in projects nullifies most of the hassles we usually face while working on an ML project. No need to worry about synchronization and cumbersome setups. It allows you to focus on the actual development part. 

Questions? Please ask in the comments section.

References

[1] OpenGL/Metal, TensorFlow, Eigen

[2] MediaPipe documentation

[3] Bazel build


