Center Stage for Zoom Calls Using MediaPipe

Center Stage uses a wide-angle camera and custom object detection to track the person in the frame and adjust the framing as their distance from the camera changes. In this post, Center Stage is implemented for Zoom calls using MediaPipe, OpenCV, VidStab, and pyvirtualcam.

Today we are going to walk you through the implementation of Center Stage as seen in Apple iPads, iMacs, and MacBooks. We will be using the following:

✅ MediaPipe to track the person

✅ OpenCV for computer vision functions

✅ VidStab for video stabilization

In our last post we saw how we can use hand gestures for controlling zoom calls using OpenCV and MediaPipe. In this post, we will implement a version of Apple’s Center Stage technology.

Center Stage uses Machine Learning and tracks the person using its ultrawide camera. Sounds too hard? Let us learn how to implement it, shall we?

  1. System Setup for Center Stage
    1. Components of the System
    2. MediaPipe for motion detection
    3. Custom code for finding Region Of Interest
    4. Video Stabilization using VidStab
  2. MediaPipe
    1. Face Detection
  3. Detecting and Tracking the Region of Interest (ROI)
    1. Near Camera
    2. Away from Camera
  4. Video Stabilization using VidStab
  5. Code for implementing Center Stage
    1. Setting Up the Development Environment
    2. Setting Up the virtual camera and Code
  6. Results
  7. Conclusion

1. System Setup for Center Stage

Implementing Center Stage requires a wide-angle camera, but not everyone has one lying around in their cupboard. The code in this project can run on any laptop or computer with an ordinary webcam.


1.1 Components of the System

We need four things to make this work. 

  • Location of the face 
  • ROI(Region Of Interest)
  • Stabilized Output
  • Integration with Zoom

The first two make sense right away, but the third might seem unnecessary. In practice, stabilization is the most important piece.

The location of the face is necessary to make tracking possible. We need the value of a pixel on the face to track it.

The ROI must be cropped out of the webcam frame and should be at least half the size of the original frame.

Stabilization ensures that the frame transitions smoothly while following the person; otherwise, the result looks choppy and defeats the purpose of Center Stage.

Integration with Zoom can be done easily using pyvirtualcam. To learn more, check out our previous blog post.

1.2 MediaPipe for Motion Detection

For the location of the face, we will use MediaPipe. MediaPipe offers a face_detection solution for face tracking, which returns the locations of key points on the face.

1.3 Custom Code for Finding Region Of Interest

For ROI we will be writing custom code that moves the frame whenever motion is detected and also works when the person is away from the camera.

1.4 Video Stabilization using VidStab

Stabilization is essential for dealing with choppy frames. Whenever the ROI moves, the motion is not going to be smooth on its own, so we need something to stabilize it. Let us jump into how we can implement this.

2. MediaPipe

MediaPipe is a Framework for building machine learning pipelines for processing time-series data like video, audio, etc. To learn more about it, check out our blog post on Introduction to Mediapipe.

2.1 Face Detection

Face detection by MediaPipe

Face detection returns a proto message that contains a bounding box and 6 key points; each key point has an index that can be used to fetch its location from the array.

This is what the output looks like.

[label_id: 0
score: 0.919319748878479
location_data {
  format: RELATIVE_BOUNDING_BOX
  relative_bounding_box {
    xmin: 0.37421131134033203
    ymin: 0.30910003185272217
    width: 0.32962566614151
    height: 0.4394763112068176
  }
  relative_keypoints {
    x: 0.4729466438293457
    y: 0.437053918838501
  }
  relative_keypoints {
    x: 0.6066942811012268
    y: 0.432010293006897
  }
  relative_keypoints {
    x: 0.543822705745697
    y: 0.5491837859153748
  }
  relative_keypoints {
    x: 0.5436693429946899
    y: 0.63396817445755
  }
  relative_keypoints {
    x: 0.397892028093338
    y: 0.46969300508499146
  }
  relative_keypoints {
    x: 0.6772758960723877
    y: 0.4609552025794983
  }
}
]

We are only interested in relative_keypoints.

  • right eye-0
  • left eye-1
  • nose tip-2
  • mouth center-3
  • right ear tragion-4
  • left ear tragion-5

We need three of these key points:

  • nose tip-2
  • right ear tragion-4
  • left ear tragion-5

The Nose tip will be used to get the location of the face. Right ear tragion and left ear tragion will be used for zooming purposes. Let us use these three coordinates for fetching the ROI.

mp_face_detection = mp.solutions.face_detection
with mp_face_detection.FaceDetection(
        model_selection=1, min_detection_confidence=0.5) as face_detection:

    img.flags.writeable = False
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    results = face_detection.process(img)

    img = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)

    # Set the default values to None
    coordinates = None
    zoom_transition = None
    if results.detections:
        for detection in results.detections:
            height, width, channels = img.shape

            # Fetch coordinates of the nose, right ear and left ear
            nose = detection.location_data.relative_keypoints[2]
            right_ear = detection.location_data.relative_keypoints[4]
            left_ear = detection.location_data.relative_keypoints[5]

            # Get pixel x-coordinates for the right ear and left ear
            right_ear_x = int(right_ear.x * width)
            left_ear_x = int(left_ear.x * width)

            # Fetch pixel coordinates for the nose and use them as the center
            center_x = int(nose.x * width)
            center_y = int(nose.y * height)
            coordinates = [center_x, center_y]
Note that we need to multiply the coordinates by the width and height of the image because relative_keypoints returns values between 0 and 1.
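As a quick illustration, here is a minimal, self-contained sketch of this conversion. It is not part of the final program; the file name person.jpg is just a placeholder for any photo containing a face.

import cv2
import mediapipe as mp

mp_face_detection = mp.solutions.face_detection

# Placeholder input image; replace with any photo that contains a face
img = cv2.imread('person.jpg')
height, width, _ = img.shape

with mp_face_detection.FaceDetection(
        model_selection=1, min_detection_confidence=0.5) as face_detection:
    results = face_detection.process(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))

if results.detections:
    keypoints = results.detections[0].location_data.relative_keypoints
    nose, right_ear, left_ear = keypoints[2], keypoints[4], keypoints[5]
    # Convert the normalized [0, 1] values to pixel coordinates
    print('Nose (px):     ', int(nose.x * width), int(nose.y * height))
    print('Right ear (px):', int(right_ear.x * width), int(right_ear.y * height))
    print('Left ear (px): ', int(left_ear.x * width), int(left_ear.y * height))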

3. Detecting and Tracking the Region of Interest (ROI) 

Take a look at this cute cat picture 


Source: stocksnap.io 

Let us bring our focus only to the cat. Hence, we need to crop it out.


Region Of Interest or ROI 

In our case, the region of interest is the person in the frame.

We need an algorithm that will automatically

  • fetch us the ROI 
  • zoom in if the person is far from the camera

3.1 Near Camera

If the person is near the camera we fetch the coordinates of the nose, the left ear, and the right ear.

Woman looking left, zoomed in

Source: Unsplash.com

We track the nose coordinates and move the camera frame whenever they change. For this image, the detection gives the following coordinates.

Nose tip:
x: 0.38050377368927
y: 0.2459903359413147

Left ear:
x: 0.4521227777004242
y: 0.22984343767166138

Right ear:
x: 0.4040358364582062
y: 0.2189599871635437

When the person is near the camera, the face fills a large part of the frame, so the detected key points are spread far apart.

3.2 Away from Camera

If the person is away from the camera, we fetch the coordinates of the left ear and the right ear and check the distance between them. If it falls below a certain threshold, we conclude that the person is far from the camera.

Woman looking left, zoomed out
Nose tip:
x: 0.7117148637771606
y: 0.3911730945110321

Left ear:
x: 0.7758564352989197
y: 0.40252548456192017

Right ear:
x: 0.722640335559845
y: 0.38772353529930115

When the person is far from the camera, the distance between the left-ear and right-ear key points shrinks. We can use this information to set a threshold of roughly 0.08–0.1 x width (the code below uses 120 pixels): if the pixel distance between the ears is less than the threshold, the person is far from the camera and we need to zoom in; if it is more than the threshold, we need to zoom out.

# Check the distance between the left ear and the right ear;
# if it is less than 120 pixels, the person is far away, so zoom in
if (left_ear_x - right_ear_x) < 120:
    zoom_transition = 'transition'

Here is the code.

def zoom_at(image, coord=None, zoom_type=None):
    """
    Args:
        image: frame captured by camera
        coord: coordinates of face(nose)
        zoom_type:Is it a transition or normal zoom
    Returns:
        Image with cropped image
    """
    global gb_zoom
    # If zoom_type is 'transition', zoom in by 0.1 per frame until the zoom level reaches 3.0
    if zoom_type == 'transition' and gb_zoom < 3.0:
        gb_zoom = gb_zoom + 0.1

    # Otherwise, zoom back out by 0.1 per frame until the default level of 1.4 is reached
    if gb_zoom > 1.4 and zoom_type is None:
        gb_zoom = max(1.4, gb_zoom - 0.1)

    zoom = gb_zoom
    # If coordinates to zoom around are not specified, default to center of the frame
    cy, cx = [i / 2 for i in image.shape[:-1]] if coord is None else coord[::-1]

    # Scaling the image using getRotationMatrix2D to appropriate zoom
    rot_mat = cv2.getRotationMatrix2D((cx, cy), 0, zoom)

    # Use warpAffine to make sure that  lines remain parallel
    result = cv2.warpAffine(image, rot_mat, image.shape[1::-1], flags=cv2.INTER_LINEAR)
    return result

We have a global variable gb_zoom that holds the current zoom level, with a default of 1.4. If the subject moves away, a transition is detected and a slow zoom-in occurs.

If the subject comes closer after being away, a smooth zoom-out occurs. If the coordinates, i.e. the position of the nose, are not provided, we default to the center of the image.

Here is where the cropping magic happens: getRotationMatrix2D builds a rotation matrix (with zero rotation) that scales around the given point, and warpAffine applies it to the image.
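To see why this zooms in around the point, here is a quick standalone sketch (not part of the main code) that prints the matrix getRotationMatrix2D produces; the center (640, 360) and zoom of 2.0 are just assumed values for illustration.

import cv2

# With angle = 0, getRotationMatrix2D reduces to a pure scale around (cx, cy):
# [[zoom,    0, (1 - zoom) * cx],
#  [   0, zoom, (1 - zoom) * cy]]
cx, cy, zoom = 640, 360, 2.0  # assumed values for illustration
rot_mat = cv2.getRotationMatrix2D((cx, cy), 0, zoom)
print(rot_mat)
# -> [[2. 0. -640.], [0. 2. -360.]]
# The point (640, 360) maps to itself, so the frame is scaled up around it,
# which is exactly the crop-and-zoom effect we want.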

4. Video Stabilization using VidStab

Ostrich cam follow: unstabilized (left) vs. stabilized (right)

The above videos clearly demonstrate the role of stabilization. The video on the left has very sharp transitions and moves a lot, while the one on the right is smooth and stable. Note that the black strips in the second video can be removed with cropping. We will be using VidStab to do the stabilization for us.

stabilizer = VidStab()
img = stabilizer.stabilize_frame(input_frame=img, smoothing_window=2)
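As a standalone illustration, here is a minimal sketch of frame-by-frame stabilization with VidStab on a live feed. It assumes your webcam is at index 0; press Esc to quit.

import cv2
from vidstab import VidStab

stabilizer = VidStab()
cap = cv2.VideoCapture(0)  # assumed default webcam

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # A small smoothing window keeps latency low for live video
    stabilized = stabilizer.stabilize_frame(input_frame=frame, smoothing_window=2)
    cv2.imshow('stabilized', stabilized)
    if cv2.waitKey(1) & 0xFF == 27:  # Esc
        break

cap.release()
cv2.destroyAllWindows()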

5. Code for implementing Center Stage

5.1 Setting Up the Development Environment

Create a new folder and in it a file called requirements.txt.

Add the following contents to this file.

absl-py==1.1.0
attrs==21.4.0
cycler==0.11.0
fonttools==4.34.3
kiwisolver==1.4.3
matplotlib==3.5.2
mediapipe==0.8.10.1
numpy==1.23.0
opencv-contrib-python==4.6.0.66
packaging==21.3
Pillow==9.2.0
protobuf==3.20.1
pyparsing==3.0.9
python-dateutil==2.8.2
pyvirtualcam
six==1.16.0
vidstab

Now, run the following commands.

python3 -m venv <foldername>
source <foldername>/bin/activate
pip3 install -r requirements.txt

5.2 Setting Up the virtual camera and Code

Virtual Camera Setup

For Linux: install v4l2loopback and load the kernel module with the command below.

Note: If you are on Windows or Mac, follow the pyvirtualcam documentation instead.

sudo modprobe v4l2loopback
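Before wiring everything together, you can optionally verify that the virtual camera works with a small sketch like the one below. It just sends a solid gray test frame for a few seconds; the resolution and fps here are arbitrary choices.

import numpy as np
import pyvirtualcam

with pyvirtualcam.Camera(width=1280, height=720, fps=30) as cam:
    print('Virtual camera device:', cam.device)
    frame = np.full((720, 1280, 3), 128, np.uint8)  # solid gray RGB frame
    for _ in range(90):  # roughly 3 seconds at 30 fps
        cam.send(frame)
        cam.sleep_until_next_frame()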

Folder Structure

Create an empty Python file, in line with the structure given below.

.
├── main.py
├── requirements.txt

Now for the fun stuff, let’s get to some coding!

import cv2
import mediapipe as mp
from vidstab import VidStab
from pyvirtualcam import PixelFormat
import pyvirtualcam
import platform

# global variables
gb_zoom = 1.4


def zoom_at(image, coord=None, zoom_type=None):
    """
    Args:
        image: frame captured by camera
        coord: coordinates of face(nose)
        zoom_type:Is it a transition or normal zoom
    Returns:
        Image with cropped image
    """
    global gb_zoom
    # If zoom_type is 'transition', zoom in by 0.1 per frame until the zoom level reaches 3.0
    if zoom_type == 'transition' and gb_zoom < 3.0:
        gb_zoom = gb_zoom + 0.1

    # Otherwise, zoom back out by 0.1 per frame until the default level of 1.4 is reached
    if gb_zoom > 1.4 and zoom_type is None:
        gb_zoom = max(1.4, gb_zoom - 0.1)

    zoom = gb_zoom
    # If coordinates to zoom around are not specified, default to center of the frame.
    cy, cx = [i / 2 for i in image.shape[:-1]] if coord is None else coord[::-1]

    # Scaling the image using getRotationMatrix2D to appropriate zoom.
    rot_mat = cv2.getRotationMatrix2D((cx, cy), 0, zoom)

    # Use warpAffine to make sure that  lines remain parallel
    result = cv2.warpAffine(image, rot_mat, image.shape[1::-1], flags=cv2.INTER_LINEAR)
    return result


def frame_manipulate(img):
    """
    Args:
        image: frame captured by camera
    Returns:
        Image with manipulated output
    """
    # Mediapipe face set up
    mp_face_detection = mp.solutions.face_detection
    with mp_face_detection.FaceDetection(
            model_selection=1, min_detection_confidence=0.5) as face_detection:

        img.flags.writeable = False
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        results = face_detection.process(img)

        img = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)

        # Set the default values to None
        coordinates = None
        zoom_transition = None
        if results.detections:
            for detection in results.detections:
                height, width, channels = img.shape

                # Fetch coordinates of nose, right ear and left ear
                nose = detection.location_data.relative_keypoints[2]
                right_ear = detection.location_data.relative_keypoints[4]
                left_ear = detection.location_data.relative_keypoints[5]

                #  get coordinates for right ear and left ear
                right_ear_x = int(right_ear.x * width)
                left_ear_x = int(left_ear.x * width)

                # Fetch coordinates for the nose and set as center
                center_x = int(nose.x * width)
                center_y = int(nose.y * height)
                coordinates = [center_x, center_y]
                # Check the distance between the left ear and the right ear;
                # if it is less than 120 pixels, the person is far away, so zoom in
                if (left_ear_x - right_ear_x) < 120:
                    zoom_transition = 'transition'

        # Perform zoom on the image
        img = zoom_at(img, coord=coordinates, zoom_type=zoom_transition)

    return img


def main():
    # Video Stabilizer
    device_val = None
    stabilizer = VidStab()

    # For webcam input:
    cap = cv2.VideoCapture(0)
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)  # set the capture resolution
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)
    cap.set(cv2.CAP_PROP_FPS, 120)

    # Check OS
    os = platform.system()
    if os == "Linux":
        device_val = "/dev/video2"

    # Start virtual camera
    with pyvirtualcam.Camera(1280, 720, 120, device=device_val, fmt=PixelFormat.BGR) as cam:
        print('Virtual camera device: ' + cam.device)

        while True:
            success, img = cap.read()
            if not success:
                continue
            img = frame_manipulate(img)
            # Stabilize the image to make sure that the changes with Zoom are very smooth
            img = stabilizer.stabilize_frame(input_frame=img,
                                            smoothing_window=2, border_size=-20)
            # Resize the image to make sure it does not crash pyvirtualcam
            img = cv2.resize(img, (1280, 720),
                            interpolation=cv2.INTER_CUBIC)

            cam.send(img)
            cam.sleep_until_next_frame()


if __name__ == '__main__':
    main()

6. Results

7. Conclusion

Hurray! You have successfully implemented Center Stage without a wide-angle camera.

Note: This is just a very basic implementation of Center Stage. What Apple has done depends on a lot of dedicated hardware; this is my approach to making it hardware independent.

Bonus Tip: Besides Zoom, this program also works on any other software that can detect and use virtual cameras.

More on Mediapipe

Hang on, the journey doesn’t end here. After months of development, we have some new and exciting blog posts for you!!!

1. Building a Poor Body Posture Detection and Alert System using MediaPipe
2. Creating Snapchat/Instagram filters using Mediapipe
3. Gesture Control in Zoom Call using Mediapipe
4. Drowsy Driver Detection using Mediapipe
5. Comparing Yolov7 and Mediapipe Pose Estimation models

Never Stop Learning!!!

