Ever wondered what runs behind “OK Google”? Well, that’s MediaPipe. If you have just started with MediaPipe and this is one of the first articles you are going through, congratulations, you have found the right place. This article covers the basics of MediaPipe and the difference between MediaPipe Solutions and the MediaPipe Framework.
The official documentation states that inference is real-time and that it takes just a few lines of code to create a perception pipeline. Is it that simple? Does MediaPipe provide real-time output for any data we throw at it? Follow along to find out.
- What is MediaPipe
- MediaPipe Toolkit
- Synchronization and performance optimization
- Dependencies
- Getting started with MediaPipe
- Is it genuinely real-time?
What is MediaPipe?
MediaPipe is a Framework for building machine learning pipelines for processing time-series data like video, audio, etc. This cross-platform Framework works on Desktop/Server, Android, iOS, and embedded devices like Raspberry Pi and Jetson Nano.
A Brief History of MediaPipe
Since 2012, Google has used it internally in several products and services. It was initially developed for real-time analysis of video and audio on YouTube. Gradually, it got integrated into many more products; the following are some of them.
- Perception system in NestCam
- Object detection by Google Lens
- Augmented Reality Ads
- Google Photos
- Google Home
- Gmail
- Cloud Vision API, etc.
MediaPipe powers revolutionary products and services we use daily. Unlike power-hungry machine learning frameworks, MediaPipe requires minimal resources. It is so tiny and efficient that even embedded IoT devices can run it. With its public release in 2019, MediaPipe opened up a whole new world of opportunity for researchers and developers.
MediaPipe Toolkit
MediaPipe Toolkit comprises the Framework and the Solutions. The following diagram shows the components of the MediaPipe Toolkit.
2.1 Framework
The Framework is written in C++, Java, and Obj-C and consists of the following APIs.
- Calculator API (C++).
- Graph construction API (Protobuf).
- Graph Execution API (C++, Java, Obj-C).
Graphs
The MediaPipe perception pipeline is called a Graph. Let us take the Hands solution as an example. We feed a stream of images as input, and the solution outputs the same images with hand landmarks rendered on them.
The flowchart below represents the MediaPipe (MP) Hands solution graph.
In computer-science jargon, a graph consists of Nodes connected by Edges. Inside a MediaPipe Graph, the nodes are called Calculators, and the edges are called Streams. Every stream carries a sequence of Packets with ascending timestamps.
In the image above, we have represented Calculators with rectangular blocks and Streams using arrows.
MediaPipe Calculators
Calculators are specific computation units, written in C++, each assigned a particular processing task. Packets of data (a video frame or an audio segment, for example) enter and leave a calculator through its ports. When a calculator is initialized, it declares the type of packet payload that will traverse each port. Every time a graph runs, the Framework calls the Open, Process, and Close methods of the calculators. Open initializes the calculator, Process runs repeatedly for every incoming packet, and Close is called once the entire graph run is complete.
As an example, consider the first calculator shown in the graph above. The calculator, ImageTransform, takes an image at its input port and returns a transformed image at its output port. The second calculator, ImageToTensor, takes an image as input and outputs a tensor.
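The real Calculator API is C++, and we will cover it later in this series. Purely to make the Open / Process / Close contract concrete, here is a toy Python sketch of an ImageToTensor-like unit. The class and method names are our own illustration, not the MediaPipe API.

```python
# Toy illustration of the calculator lifecycle described above.
# NOTE: this is NOT the MediaPipe Calculator API (which is C++);
# it only mimics the Open -> Process -> Close contract.
import numpy as np


class ToyImageToTensor:
    """Receives image packets on its input port, emits tensor packets."""

    def open(self):
        # Called once before the graph starts: prepare any state here.
        self.scale = 1.0 / 255.0

    def process(self, image_packet: np.ndarray) -> np.ndarray:
        # Called for every packet arriving at the input port.
        return image_packet.astype(np.float32) * self.scale

    def close(self):
        # Called once after the whole graph run finishes.
        pass


# A toy "graph run": each frame is a packet with an ascending timestamp.
calc = ToyImageToTensor()
calc.open()
for timestamp, frame in enumerate(np.zeros((3, 4, 4, 3), dtype=np.uint8)):
    tensor = calc.process(frame)  # one Process call per packet
calc.close()
```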
Calculator Types in MediaPipe
All the calculators shown above are built into MediaPipe. We can group them into four categories.
- Pre-processing calculators are a family of image- and media-processing calculators. ImageTransform and ImageToTensor in the graph above fall into this category.
- Inference calculators allow native integration with TensorFlow and TensorFlow Lite for ML inference.
- Post-processing calculators handle ML post-processing tasks such as detection, segmentation, and classification. TensorToLandmark in the graph above is a post-processing calculator.
- Utility calculators are a family of calculators performing final tasks such as image annotation.
The Calculator API allows you to write your own custom calculators. We will cover that in a future post.
2.2 MediaPipe Solutions
Solutions are open-source, pre-built examples based on specific pre-trained TensorFlow or TFLite models. You can check the solution-specific models here. MediaPipe Solutions are built on top of the Framework. Currently, it provides sixteen solutions, as listed below.
- Face Detection
- Face Mesh
- Iris
- Hands
- Pose
- Holistic
- Selfie Segmentation
- Hair Segmentation
- Object Detection
- Box Tracking
- Instant Motion Tracking
- Objectron
- KNIFT
- AutoFlip
- MediaSequence
- YouTube 8M
The solutions are available in C++, Python, JavaScript, Android, iOS, and Coral. As of now, most of the solutions are available in C++ (all except KNIFT and Instant Motion Tracking), followed by Android, with Python not too far behind.
The other wrapper languages are also growing fast and are under very active development. As you can see, even though the MediaPipe Framework is cross-platform, the same does not yet hold for all the solutions. MediaPipe is currently at alpha version 0.7, so we can expect the solutions to get broader support with the beta releases. Following are some of the solutions provided by MediaPipe.
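To give a concrete feel for how a Solution is consumed, here is a minimal Python sketch using the Face Detection solution on a single image. Treat it as a hedged illustration: the file names are placeholders, and default parameters may vary slightly across mediapipe releases.

```python
import cv2
import mediapipe as mp

mp_face_detection = mp.solutions.face_detection
mp_drawing = mp.solutions.drawing_utils

# 'person.jpg' is a placeholder path; use any image containing a face.
image = cv2.imread('person.jpg')

with mp_face_detection.FaceDetection(min_detection_confidence=0.5) as face_detection:
    # MediaPipe expects RGB input; OpenCV loads images as BGR.
    results = face_detection.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

if results.detections:
    for detection in results.detections:
        mp_drawing.draw_detection(image, detection)

cv2.imwrite('person_annotated.jpg', image)
```

Every Python solution follows this same pattern: create the solution object, call process() on an RGB frame, and read the returned results.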
Synchronization and Performance Optimization
MediaPipe supports multimodal graphs. To speed up processing, different calculators run in separate threads, and for performance optimization, many built-in calculators come with GPU-acceleration options. Time-series data must be synchronized properly; otherwise, the system breaks. The graph ensures that the flow is handled correctly according to the timestamps of the packets. The Framework handles synchronization, context sharing, and inter-operation with CPU calculators.
MediaPipe Dependencies
MediaPipe depends on OpenCV for video and FFmpeg for audio data handling. It also has other dependencies, such as OpenGL/Metal, TensorFlow, and Eigen [1].
Getting Started with MediaPipe
We recommend you have a basic knowledge of OpenCV before starting with MediaPipe. Check out this simplified series of posts on Getting started with OpenCV.
MediaPipe's Python solutions are the easiest for beginners because of the simple setup process and the popularity of the Python programming language. The modularity of the MediaPipe Framework enables customization. But before plunging into customization, we recommend getting comfortable with the various pre-built solutions: understand the APIs associated with them, and then tweak the outputs to create your own exciting applications.
MediaPipe Visualizer
MediaPipe Visualizer provides an easy way to try all the solutions.
In MediaPipe, a graph is defined by a protobuf text file (.pbtxt). The MediaPipe Visualizer welcome page greets you with a protobuf file containing a blank graph unit. It also has various pre-built solution graphs that you can load from the New button at the top right.
The visualizer works within the browser! Let’s give it a try.
The following screenshot shows an in-browser hand detection example.
The pre-built desktop version of MediaPipe is where you will have fun tweaking and creating your own applications. You can install the Python package with a single command.
pip install mediapipe
MediaPipe solutions come with easy-to-follow documentation[2], and we leave it to the reader to try them out.
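To give you a head start, here is a minimal, hedged sketch of the Hands solution running on a webcam feed. The window title, key handling, and confidence thresholds are our own choices, not anything prescribed by MediaPipe.

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_drawing = mp.solutions.drawing_utils

cap = cv2.VideoCapture(0)  # default webcam
with mp_hands.Hands(max_num_hands=2,
                    min_detection_confidence=0.5,
                    min_tracking_confidence=0.5) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB frames; OpenCV captures BGR.
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        # Draw the detected hand landmarks (if any) back onto the BGR frame.
        if results.multi_hand_landmarks:
            for hand_landmarks in results.multi_hand_landmarks:
                mp_drawing.draw_landmarks(frame, hand_landmarks,
                                          mp_hands.HAND_CONNECTIONS)
        cv2.imshow('MediaPipe Hands', frame)
        if cv2.waitKey(1) & 0xFF == 27:  # press Esc to quit
            break

cap.release()
cv2.destroyAllWindows()
```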
6. Is it genuinely Real-Time?
MediaPipe Solutions do not always work in real-time.
The solutions are built on top of MediaPipe Framework that provides Calculator API (C++), Graph construction API (Protobuf), and Graph Execution API (C++, Java, Obj-C). With the APIs, we can build our graph and write custom calculators.
The Toolkit is excellent, but its performance depends on the underlying hardware. The following example shows a side-by-side inference comparison of HD (1280×720) and Ultra HD (3840×2160) video.
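If you want to reproduce such a comparison on your own machine, a rough timing sketch like the one below is enough. The video file names are placeholders, and the numbers you get will depend entirely on your hardware.

```python
import time

import cv2
import mediapipe as mp


def rough_fps(video_path, max_frames=300):
    """Measure approximate Hands-inference throughput on a video file."""
    cap = cv2.VideoCapture(video_path)
    hands = mp.solutions.hands.Hands()
    frames, start = 0, time.time()
    while frames < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        frames += 1
    cap.release()
    hands.close()
    return frames / (time.time() - start)


# Placeholder file names: any HD and Ultra HD clips will do.
print('HD (1280x720)       :', rough_fps('video_hd.mp4'), 'FPS')
print('Ultra HD (3840x2160):', rough_fps('video_uhd.mp4'), 'FPS')
```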
You can try building the MediaPipe solutions from source and will certainly see a jump in performance. However, you may still not achieve real-time inference.
Warning: The MediaPipe Framework supports Bazel [3] builds. Therefore, to harness the full potential of MediaPipe, one needs to be reasonably comfortable with C++ and Bazel. The documentation is also NOT easy to follow, since the project is still under active development.
Conclusion
MediaPipe solutions are straightforward, and you can cover them in a day or two.
On the other hand, the learning curve can be pretty steep for the C++ MediaPipe Framework. Don’t worry; we will get there by taking baby steps.
Overall, it is a beautiful, fast-growing library that delivers promising results. Using MediaPipe in a project removes most of the hassles we usually face while working on an ML project: no need to worry about synchronization or cumbersome setups. It lets you focus on the actual development.
In the upcoming posts, we will show how to build interesting Augmented Reality filters using the MediaPipe Face solution. Later in this series, we will cover customizing calculators of pre-built MediaPipe solutions and building custom graphs.
More on MediaPipe
Hang on, the journey doesn’t end here. After months of development, we have some new and exciting blog posts for you!
1. Building a Poor Body Posture Detection and Alert System using MediaPipe
2. Creating Snapchat/Instagram filters using MediaPipe
3. Gesture Control in Zoom Call using MediaPipe
4. Center Stage for Zoom Calls using MediaPipe
5. Drowsy Driver Detection using MediaPipe
6. Comparing YOLOv7 and MediaPipe Pose Estimation models
Never Stop Learning!
Questions? Please ask in the comments section.
References
[1] OpenGL/Metal, TensorFlow, Eigen
[2] MediaPipe Solutions documentation
[3] Bazel build