Depth Pro is an excellent foundational, zero-shot metric depth estimator from Apple ML that excels at creating high-resolution, sharp metric depth maps in mere seconds.
Imagine reviving those photos of a favorite childhood picnic stored in your gallery. Well, why just imagine? With the image editing options in Google Photos, you can transform a still image and bring it to life with lifelike motion edits of your choice and cherish those moments. The best part is that you don’t necessarily need the camera metadata from the original device the photos were shot with. That’s the power of depth data inferred using modern monocular depth estimation.

The field of depth estimation has evolved significantly over the years. Think back to the good old days of Microsoft Kinect on the Xbox, which brought an immersive and surreal gaming experience. Kinect relied on dedicated depth-sensing cameras and had a pool of computer vision algorithms running behind the scenes to bring depth perception to life. While such specialized hardware set the stage for depth applications, deep learning based monocular depth estimators such as Depth Pro are a simpler yet promising alternative.
The topics discussed in this article are outlined as follows:
- Quick refresher on monocular metric depth and its attributes
- Depth Pro Paper Explanation and Architectural details
- Inference with Depth Pro
- Comparing Depth Pro with models like DepthAnything and DepthCrafter
- Applications of Depth Data in Photo Editing
Individuals working at the intersection of computer vision and creative digital tools will find the application section to be highly engaging.
- A Primer to Monocular Depth Estimation
- Attributes of a Robust Monocular Depth Estimation Model
- Pointers from Depth Pro Paper
- DepthPro: Model Architecture
- Training strategy of Depth Pro By Apple ML
- Evaluation metrics and Benchmarks
- Code Walkthrough of Depth Pro Inference
- Depth Pro: Inference Results
- Estimating Metric Depth with Depth Pro
- Testing Depth Pro on Different Edge Cases
- Limitations of Depth Pro
- Comparison 1: Image Inference – Depth Pro v/s Depth Anything V2 v/s Marigold
- Comparison 2: Video Inference – Depth Pro v/s Depth Crafter
- Applications of Depth Map
- Key Takeaways
- Conclusion
If you’re just getting started with depth and stereo vision, we recommend checking out our series of articles on Spatial AI:
- Stereo Depth and Stereo Cameras
- Oak-D Stereo Camera
- Relative Depth estimation with DepthAnything
- Sapiens: Human Vision Models
A Primer to Monocular Depth Estimation
Typically, obtaining depth requires expensive LiDAR systems or specialized stereo camera setups that capture multiple views of the same scene. However, with SOTA monocular depth estimators, you can infer depth from a single image and achieve results on par with these hardware setups.
While classical heuristic approaches like Shape from X are still noteworthy, they rely on strong assumptions about the lighting or geometry of the scene, limiting their practicality for in-the-wild scenarios. That’s where deep learning based monocular depth estimation came into play and shines.
Although monocular depth is deduced from a single 2D image, it represents 3D information about the visual scene, much like how humans perceive the world through cues and patterns. While there were earlier monocular depth models, the field started to pivot with Intel’s MiDaS and Depth Anything, which brought sophisticated advancements.
We know that semantic segmentation is a pixel-wise classification task; similarly, monocular depth estimation is often a pixel-wise regression task. It assigns a value of varying intensity to each pixel, helping to discern the objects and the background on either an absolute or a relative scale.
1. Metric/Absolute Depth: Metric depth represents real-world depth for each pixel, where the pixel’s value directly corresponds to the physical distance (in meters or centimeters) from the camera sensor to the object point for that pixel. e.g., UniDepth, Depth Pro
2. Relative Depth: Usually normalized between 0 and 1, it indicates which pixels in the image are closer and which are farther away, distinguishing the foreground and background planes without referring to real-world units of measurement. The results are sufficient for applications like image editing (a small code sketch after this list illustrates the difference). e.g., ZoeDepth, Depth Anything V2, Marigold
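To make the distinction concrete, here is a tiny NumPy sketch (our own illustration, not taken from any of the models above) that converts a metric depth map in meters into a relative, normalized depth map:

import numpy as np

# Toy metric depth map in meters (the kind of output a metric model would produce).
metric_depth = np.array([[0.8, 1.2],
                         [3.5, 7.0]], dtype=np.float32)

# Relative depth: normalize to [0, 1]; ordering is preserved, physical units are lost.
relative_depth = (metric_depth - metric_depth.min()) / (metric_depth.max() - metric_depth.min())

print(relative_depth)  # 0 = nearest pixel, 1 = farthest pixel, valid only within this image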
Attributes of a Robust Monocular Depth Estimation Model
- Generalization across diverse scenes: The model should perform consistently across various environments, whether it’s indoor or outdoor scenes. Training a single depth estimation model on all datasets typically deteriorates performance due to scaling differences between indoor and outdoor scenes especially in metric depth.
- Structural integrity and geometric features: The model should preserve the structural integrity and geometric features of the objects in the visual scene, leveraging cues like lighting, gradients, and textures.
- Independence from camera intrinsics: The model shouldn’t rely on camera specifics, which increases its usability in various applications.
- Sharp boundaries and artifact prevention: The model should handle occlusion boundaries, prevent artifacts like flying pixels, and produce sharp boundaries.
- Fast and accurate: It should be computationally efficient, fast, and accurate.
- Transparent and Reflective Surfaces: Typically, monocular depth estimation models struggle with transparent and reflective surfaces such as mirrors, water, and glass, as these exhibit complex lighting and refractive properties. The model should accurately understand the scene and deduce the inherent characteristics of the object, providing precise depth information rather than mispredicting due to external reflections.
- Adaptability to varying focal lengths: Scenes captured with the same camera at varying focal lengths will have different perceptions of background distance. A good model should account for this variation and maintain consistent depth maps.

Source: @tonygale – Sony Alpha Universe Community
Pointers from Depth Pro Paper
Depth Pro is blazing fast and produces sharp, high-resolution depth maps. The paper reports that on a V100 GPU it takes just 0.3 seconds to estimate the depth of an image. It natively produces a 1536×1536 depth map (~2.35 MP), which is then upsampled to match the size of the input image.
What makes Depth Pro stand out is its ability to produce high-fidelity boundary delineation, even for very thin structures like hair strands or fur, with high precision and recall. In contrast, Depth Anything V2 produces lower-fidelity, smoothed depth maps.
In the above example, we see that Marigold is also good at extracting the fur boundaries; although it yields finer details than Depth Anything V2, its predictions are noisy and grainy, looking like a pencil sketch. Diffusion-based depth models like Marigold generalize exceptionally well because their rich latent space carries a lot of structural knowledge about the image, but they remain slow due to the multiple denoising iterations. In contrast, Depth Pro does not require any diffusion prior supervision or sophisticated multi-step task-specific modules to produce sharp boundaries.
Unlike most monocular depth estimation models, which are typically in-domain and overfit to a specific dataset constrained to either indoor or outdoor settings, Depth Pro consistently performs exceptionally well on in-the-wild images in dynamic environments, making it a preferred choice as a zero-shot depth estimator.
The success of Depth Pro lies in its thoughtful design principles. It employs an efficient multi-scale vision transformer and a training protocol that combines real and synthetic datasets with sharp depth maps, resulting in unparalleled precision in boundary tracing.
The authors hint that Depth Pro’s design choices are specifically tailored for novel-view synthesis. In general, for novel view synthesis, a good monocular depth estimation model should work zero-shot and produce metric depth to faithfully reproduce shapes and scenes. This allows it to determine the distances of objects within the scene without requiring camera intrinsics.
Depth Pro: Model Architecture
We know that Transformers excel at capturing global context thanks to their attention mechanism. However, scaling Vision Transformers to high input resolutions is computationally intensive due to their quadratic complexity, O(n²).
As a workaround, instead of modifying an existing pretrained vision transformer and retraining the entire network, the authors cleverly overcome this limitation by processing images at multiple scales (1536², 768², and 384²) and extracting their features with a plain ViT encoder. Finally, these features from multiple scales are fused to obtain global context while still preserving low-level features.
Multi Scale Vision Transformers
Depth Pro architecture primarily involves four main components:
- Image Encoder
- Patch Encoder
- DPT Decoder
- Focal length Head
Both the image encoders, for depth and for focal length, are ViT-L DINOv2 backbones; the authors report that they performed exceptionally well compared to all other backbones from the timm library.
The patch encoder processes images as patches at different scales and fuses the learned representations across scales, making them scale-invariant. Its weights are shared across scales to predict a single high-res dense map. Depth Pro operates at a fixed input resolution of 1536², and the output is upscaled to match the input image dimensions.
The input image is resized to 1536² and downsampled to 768² and 384². Note that these dimensions are multiples of 384. But why specifically 384? Because the authors leveraged a pretrained DINOv2 backbone from the timm library, which expects an input size of 384, allowing the backbone to be used without any modifications.
- High resolution: 1536 x 1536
- Mid resolution: 768 x 768
- Low resolution: 384 x 384
Then the high-res and mid-res image variants are split into overlapping patches (to avoid seams) with a patch size of 384².
For simplification, let’s first consider dividing the high-res and mid-res images without overlap:
High Res: 1536²/384² = 4 x 4 patches,
Mid Res : 768² / 384² = 2 x 2 patches.
Now, if overlap is accounted for, we get:
- High Res: 5×5 overlapping patches @ 384²
- Mid Res: 3×3 overlapping patches @ 384²
- Low res: 1×1 patch @ 384²
Then the patches from the high-, mid-, and low-res variants are concatenated and flattened into a 1D vector.
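The patch counts above can be sanity-checked with a few lines of arithmetic. The stride values below are our own illustration of how overlapping 384² patches could tile each scale; they are not taken from the official implementation:

def patches_per_axis(image_size: int, patch_size: int = 384, stride: int = 384) -> int:
    """Number of patches along one axis for a given stride (stride < patch_size means overlap)."""
    return (image_size - patch_size) // stride + 1

# Example strides that reproduce the 5x5 / 3x3 / 1x1 grids mentioned above.
for size, stride in [(1536, 288), (768, 192), (384, 384)]:
    n = patches_per_axis(size, stride=stride)
    print(f"{size}x{size} -> {n}x{n} overlapping 384x384 patches")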
Patch Encoder
DepthPro(
  (encoder): DepthProEncoder(
    (patch_encoder): VisionTransformer(
      (patch_embed): PatchEmbed(
        (proj): Conv2d(3, 1024, kernel_size=(16, 16), stride=(16, 16))
This 1D input image vector is passed to the ViT-based patch encoder, which uses a patch size of 16×16. It encodes patches at the three scales, and their outputs are merged, producing feature maps at five levels:
- Features 1 @ 96² and Features 2 @ 96² are additional latent encodings hooked from the patch encoder’s high-res patches, with a feature map size of 96².
- Features 3 @ 96² comes from the high-res patches, with a feature map size of 96² (1536 / 16 = 96).
- Features 4 @ 48² comes from the mid-res patches, with a feature map size of 48² (768 / 16 = 48).
- Features 5 @ 24² comes from the low-res patches, with a feature map size of 24² (384 / 16 = 24).
This multi-scale image representation helps Depth Pro to learn fine-grained features or local context in the given image.
(image_encoder): VisionTransformer(
  (patch_embed): PatchEmbed(
    (proj): Conv2d(3, 1024, kernel_size=(16, 16), stride=(16, 16))
- Features 6 @ 24²: Additionally, the low-res version of the input image (384²), without any patch splitting, is passed directly to the second (image) encoder, acting like an anchor that provides the overall global context of the image.
After encoding, these features are upsampled with specific scale factors to align their resolutions for final fusion via the DPT decoder, using ConvTranspose2d layers (see the short sketch after the list):
- Feature 1: Scaled by a factor of 8
- Feature 2: Scaled by a factor of 4
- Features 3, 4, 5, 6: Scaled by a factor of 2
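As a rough illustration of those scale factors, the sketch below upsamples dummy feature maps with ConvTranspose2d layers; the channel count of 256 is an assumption made for this example, not the exact value used in Depth Pro:

import torch
import torch.nn as nn

# (batch, channels, height, width) dummy features at the sizes listed above.
features = [
    ("Feature 1", torch.randn(1, 256, 96, 96), 8),
    ("Feature 2", torch.randn(1, 256, 96, 96), 4),
    ("Feature 3", torch.randn(1, 256, 96, 96), 2),
    ("Feature 4", torch.randn(1, 256, 48, 48), 2),
    ("Feature 5", torch.randn(1, 256, 24, 24), 2),
    ("Feature 6", torch.randn(1, 256, 24, 24), 2),
]

for name, feat, scale in features:
    # kernel_size == stride grows the spatial dimensions exactly by `scale`.
    up = nn.ConvTranspose2d(256, 256, kernel_size=scale, stride=scale)
    print(name, "->", tuple(up(feat).shape))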
Used in most depth models, like Depth Anything and MiDaS, DPT (Dense Prediction Transformer) is a decoder that leverages Vision Transformer features for fine-grained pixel-wise depth prediction. Fusion blocks in DPT use residual convolutions to merge features and enhance resolution.
(decoder): MultiresConvDecoder(
  (convs): ModuleList(. . .)
  (fusions): ModuleList(
    (0): FeatureFusionBlock2d(
      . . .)
(head): Sequential(
  (0): Conv2d(256, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): ConvTranspose2d(128, 128, kernel_size=(2, 2), stride=(2, 2))
  (2): Conv2d(128, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (3): ReLU(inplace=True)
  (4): Conv2d(32, 1, kernel_size=(1, 1), stride=(1, 1))
  (5): ReLU())
Depth Prediction Network
Depth Pro outputs an inverse depth map by default, which is preferred for visualization purposes.
The model’s raw prediction is the canonical inverse depth C, which is later converted to metric units using the predicted focal length and the image width, as shown below.
Raw Depth: Closer pixels appear dark and farther pixels appear white.
Inverse Depth: Closer pixels appear white, farther pixels appear dark
Canonical Inverse Depth: Normalized or standardized inverse depth, made consistent and interpretable by fixing the scale based on:
- Min and max focal distance (absolute scale).
- Min and max intensity values (relative scale).
Dense metric depth: To compute dense metric depth, we scale the canonical inverse depth using the predicted focal length f_px (in pixels, corresponding to the horizontal FOV) and the width w of the image.
This is formulated as:
D_m = f_px / (w × C)
where D_m is the dense metric depth, f_px is the predicted focal length in pixels, w is the image width in pixels, and C is the canonical inverse depth.
In most cases, the horizontal and vertical axes of a camera share the same focal length (f_px = f_py), as modern cameras use square pixels.
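As a quick numeric sketch of this conversion (with made-up values for C and f_px, purely for illustration):

import numpy as np

width = 1536                       # image width w in pixels
f_px = 1000.0                      # predicted focal length in pixels (made-up value)
canonical_inverse_depth = np.array([[0.5, 1.0],
                                    [2.0, 4.0]], dtype=np.float32)  # C (made-up values)

# Dense metric depth: D_m = f_px / (w * C)
metric_depth = f_px / (width * canonical_inverse_depth)
print(metric_depth)  # larger C (i.e., closer in inverse depth) yields a smaller metric depth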
Focal Length Estimation
Focal length is the distance between the principal plane of the lens and the image plane (camera sensor) when the lens is focused on a subject at infinity. It is an important piece of camera EXIF data that must be known to scale depth, since it determines how 2D pixels translate to 3D distances of objects.
Monocular depth estimators usually rely on the training dataset’s focal length to accurately project 2D pixels into 3D point clouds. For instance, the Depth Anything V2 pipeline uses a fixed focal length value of 470.4 (in pixels) to convert its metric depth into point clouds.
Earlier zero-shot metric depth models required camera intrinsics to be known to accurately produce depth maps. However, recent methods like UniDepth introduce a network with two separate modules for predicting depth and a camera embedding, where the camera embedding is conditioned on the depth map by concatenating the features; this improves depth quality while staying independent of the training dataset’s camera intrinsics.
Depth Pro takes inspiration from UniDepth: the focal length is obtained from a subnetwork that estimates it for the input image, making it possible to predict metric depth without needing any source-camera-specific details. The FOVNetwork returns a single value for the focal length in pixels.
Why is the focal length a separate head?
In the Depth Pro pipeline, depth and focal length estimation are treated as separate tasks. If they were trained together, it would require balancing the two networks and optimizing them simultaneously. Therefore, using two networks enables a decoupled objective, allowing the depth and focal length networks to be trained independently on two completely different datasets without much reduction in performance metrics.
During inference, however, if a focal length for the input image is found in the EXIF data, it is preferred over the focal length estimated by the model’s FOV head, as it is more reliable than an estimated value, making it the obvious choice. The FOVNetwork uses a small convolutional head to predict the horizontal field of view.
(fov): FOVNetwork(
  (encoder): Sequential(
    (0): VisionTransformer(
      . . .
    )
  (head): Sequential(
    (0): Conv2d(128, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (1): ReLU(inplace=True)
    (2): Conv2d(64, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (3): ReLU(inplace=True)
    (4): Conv2d(32, 1, kernel_size=(6, 6), stride=(1, 1)))))
Training strategy
Depth Pro is trained with a two-stage approach on a large mix of real and synthetic images to achieve high accuracy and sharp boundary delineation:
Stage 1: Training on Labelled Datasets
The model is initially trained on labeled datasets captured with LiDAR or stereo cameras that are already available on the internet. This lets the model learn real-world distributions for better generalization.
Stage 2: Fine-Tuning with Synthetic Datasets
To further sharpen the outputs, the model is fine-tuned on synthetic datasets, which have accurate, sharp depth maps created with VFX tools or rendering engines. A set of carefully chosen loss functions is used in conjunction with a Scale-and-Shift-Invariant (SSI) loss with the objective of minimizing the MAE. This improves precision in boundary regions and makes the model robust to the scale and shift of in-the-wild images.
Note: The SSI loss helps disregard the scale and shift variability of each sample in the training set. A loss function is used that is independent of the scale and shift of the data. This is done by transforming the depth values of each sample into disparity space and normalizing them between 0 and 1.
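Below is a minimal sketch of a scale-and-shift-invariant style loss on a single sample, following the idea described in the note above; the exact normalization and loss weighting used by Depth Pro may differ:

import torch

def ssi_mae_loss(pred_disp: torch.Tensor, gt_disp: torch.Tensor) -> torch.Tensor:
    """MAE after removing per-sample scale and shift via min-max normalization to [0, 1]."""
    def normalize(d: torch.Tensor) -> torch.Tensor:
        return (d - d.min()) / (d.max() - d.min() + 1e-8)
    return torch.mean(torch.abs(normalize(pred_disp) - normalize(gt_disp)))

# Example usage with random disparity maps.
pred = torch.rand(1, 384, 384)
gt = torch.rand(1, 384, 384)
print(ssi_mae_loss(pred, gt))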
Evaluation metrics and Benchmarks
Depth Pro is a 504M-parameter model, which lets it easily fit on a laptop GPU with 6 GiB of VRAM. In comparison, counterparts like Metric3D and Marigold require more VRAM and take more time for the same high-res output while delivering lower fidelity. Depth Pro strikes a balance between speed and performance.
Depth Pro significantly outperforms existing monocular depth estimation models, achieving the best average accuracy in zero-shot metric depth and the highest F1 score and boundary recall (R) in zero-shot boundary accuracy.
From Table 1, we can see that Depth Pro operates on a completely different level, with an average rank of 2.5 (lower is better), indicating its aggregated performance across all the datasets.
The paper also proposes a new evaluation metric specifically to assess sharp boundaries in depth maps, as existing benchmarks do not take this into account. To prepare a benchmark dataset for sharp-boundary evaluation, existing binary segmentation and matting techniques were applied to create the ground-truth sharp-boundary depth maps, since it is less time consuming to manually inspect a problematic segmentation mask than to annotate one from scratch. The pixels at the edges of the binary segmentation mask are treated as object boundaries, which also addresses the blurry-edge problem commonly observed in monocular depth estimators.
As discussed, in zero-shot boundary accuracy, Depth Pro aces nearly all the datasets; its threshold accuracy (δ1: higher is better) is highlighted in green, indicating that for any task where boundary precision matters, Depth Pro is your go-to model choice.
Now it’s time for some hands-on coding. Let’s go through the inference pipeline of Depth Pro and test the model’s aforementioned claims and actual performance when stressed under various conditions.
Code Walkthrough of Depth Pro Inference
To get started with Depth Pro locally, simply clone the repository and install the required dependencies.
!git clone https://github.com/apple/ml-depth-pro.git
cd ml-depth-pro
#setup
!pip install -e .
You will need to download the pretrained checkpoint using the following bash command, which will place the model at `ml-depth-pro/checkpoints/depth-pro.pt`.
source get_pretrained_models.sh
Image Inference – Usage
To run Depth Pro via the CLI:
!depth-pro-run -i image.jpg -o output_dir
The input can be an image or a directory containing multiple images; the output_dir will store the resulting inverse depth maps.
The following code is mostly adapted from ml-depth-pro/cli/run.py, with a few add-on snippets to calculate surface normals from the raw depth.
Import Dependencies
import logging
from pathlib import Path
import cv2
import numpy as np
import PIL.Image
import torch
from matplotlib import pyplot as plt
from tqdm import tqdm
from depth_pro import create_model_and_transforms, load_rgb
We will load the model in half-precision and move it to “cuda”.
def run(args):
    """Run Depth Pro on a sample image."""
    # Load model.
    model, transform = create_model_and_transforms(
        device=get_torch_device(),
        precision=torch.half,
    )
    model.eval()

    image_paths = [args.image_path]
The input image is loaded as a PIL image using load_rgb, which returns f_px if it is obtained from the image metadata.
for image_path in tqdm(image_paths):
    # Load image and focal length from exif info (if found).
    try:
        LOGGER.info(f"Loading image {image_path} ...")
        image, _, f_px = load_rgb(image_path)
    except Exception as e:
        LOGGER.error(str(e))
        continue
The input image is resized to 1536×1536, and basic transforms like ToTensor and Normalize are applied. The estimated depth is then resized to the original image dimensions.
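For reference, a roughly equivalent preprocessing pipeline might look like the sketch below; this is an assumption for illustration, not the exact transform returned by create_model_and_transforms (the normalization constants are assumed):

import torch
from torchvision import transforms

# Illustrative stand-in for `transform`: to tensor, normalize roughly to [-1, 1], cast to half.
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
    transforms.ConvertImageDtype(torch.half),
])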
# Run prediction. If `f_px` is provided, it is used to estimate the final metric depth,
# otherwise the model estimates `f_px` to compute the depth metricness.
prediction = model.infer(transform(image), f_px=f_px)
All inferences were carried out on an RTX 4050 laptop GPU with 6 GiB of VRAM and an i5 CPU with 16 GB of RAM. Each inference took less than 15 seconds per image and occupied about 5 GB of VRAM at half precision.
Depth Pro: Inference Results
1. Raw Depth
From the DPT decoder head, the raw depth is extracted and clipped to the [0.1 m, 250 m] range.
# Extract the depth.
depth = prediction["depth"].detach().cpu().numpy().squeeze()

# Clip the raw depth to the [0.1 m, 250 m] range before normalizing for visualization.
max_depth_vizu = min(depth.max(), 250.0)
min_depth_vizu = max(depth.min(), 0.1)
depth_clipped = np.clip(depth, min_depth_vizu, max_depth_vizu)
depth_normalized = (depth_clipped - min_depth_vizu) / (max_depth_vizu - min_depth_vizu)
grayscale_depth = (depth_normalized * 255).astype(np.uint8)
2. Inverse Depth
For better visualization, we can take the inverse of the raw depth, which appears much more intuitive since we are mostly concerned with foreground subjects rather than the background. The inverse depth values are normalized (after clipping the depth to the [0.1 m, 250 m] range) and scaled back to the usual 8-bit grayscale format.
inverse_depth = 1 / depth
# Visualize inverse depth instead of depth, clipped to the [0.1m;250m] range for better visualization.
max_invdepth_vizu = inverse_depth.max()
min_invdepth_vizu = inverse_depth.min()
inverse_depth_normalized = (inverse_depth - min_invdepth_vizu) / (
    max_invdepth_vizu - min_invdepth_vizu
)
inverse_depth_grayscale = (inverse_depth_normalized * 255).astype(np.uint8)
3. Color Inverse Depth
The single-channel inverse-depth grayscale map can be color-mapped with a cmap such as “inferno”, or other common options like “viridis” and “turbo”.
# Save as a color-mapped jpg image.
cmap = plt.get_cmap("inferno")
color_depth = (cmap(inverse_depth_normalized)[..., :3] * 255).astype(
    np.uint8
)
PIL.Image.fromarray(color_depth).save(
    inverse_cmap_output_file, format="JPEG", quality=90
)
4. Surface Normal
Surface normals describe the orientation of a surface in 3D space. In the visualization, from the camera’s viewpoint, blue represents normals pointing toward the camera, green indicates right-facing normals, and pink shades represent left-facing normals. They provide local geometric cues that are not immediately apparent in a depth map or segmentation mask.
Interestingly, surface normals can be estimated from the raw depth by computing horizontal and vertical gradients using the Sobel operator (here with a 7×7 kernel). An 8-bit RGB image is then created by normalizing and scaling these normals for visualization.
#***************** SURFACE NORMAL ***********************
kernel_size = 7
grad_x = cv2.Sobel(depth.astype(np.float32), cv2.CV_32F, 1, 0, ksize=kernel_size)
grad_y = cv2.Sobel(depth.astype(np.float32), cv2.CV_32F, 0, 1, ksize=kernel_size)
z = np.full(grad_x.shape, 1)

# Normal vector per pixel: (-dz/dx, -dz/dy, 1), normalized to unit length.
normals = np.dstack((-grad_x, -grad_y, z))
normals_mag = np.linalg.norm(normals, axis=2, keepdims=True)
with np.errstate(divide="ignore", invalid="ignore"):
    normals_normalized = normals / (normals_mag + 1e-5)
normals_normalized = np.nan_to_num(normals_normalized, nan=-1, posinf=-1, neginf=-1)

# Map from [-1, 1] to [0, 255] for an 8-bit RGB visualization.
normal_from_depth = ((normals_normalized + 1) / 2 * 255).astype(np.uint8)
5. Focal length in pixels
As discussed earlier, if the focal length in millimeters is available in the image EXIF data, it is converted to a focal length in pixels and chosen over the estimated f_px.
if f_px is not None:
    LOGGER.debug(f"Focal length (from exif): {f_px:0.2f}")
elif prediction["focallength_px"] is not None:
    focallength_px = prediction["focallength_px"].detach().cpu().item()
    print(f"Estimated focal length: {focallength_px}")
    LOGGER.info(f"Estimated focal length: {focallength_px}")
When the image focal length of 140 mm is converted to pixels, we expect:
Focal length in pixels = (focal length in mm × image width in pixels) / sensor width in mm
The Canon EOS 5D Mark IV has a sensor width of 36 mm.
Thus, focal length in pixels = 140 × 3360 / 36 ≈ 13066 pixels
Estimated focal length returned by the Depth Pro FOVNetwork: ~9729.33 pixels
Absolute Difference = |Estimated Focal Length (pixels) − Calculated Focal Length (pixels)| = |9729.33 − 13066| = 3336.67 pixels
Percentage Difference = (Absolute Difference / Calculated Focal Length (pixels)) × 100% = (3336.67 / 13066) × 100% ≈ 25.54%
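The same arithmetic in a few lines of Python, using the values from the example above:

focal_length_mm = 140.0      # from EXIF
image_width_px = 3360
sensor_width_mm = 36.0       # Canon EOS 5D Mark IV full-frame sensor width
estimated_f_px = 9729.327    # value returned by Depth Pro's FOVNetwork for this image

calculated_f_px = focal_length_mm * image_width_px / sensor_width_mm  # ~13066.7 px
abs_diff = abs(estimated_f_px - calculated_f_px)                      # ~3337 px
pct_diff = 100.0 * abs_diff / calculated_f_px                         # ~25.5 %
print(calculated_f_px, abs_diff, pct_diff)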
Clearly, there is a discrepancy between the focal length in pixels expected from the EXIF data and the value estimated by the Depth Pro FOV head.
6. Sharp Depth Maps
Estimating Metric Depth with Depth Pro
Interactive OpenCV Window
For the interactive OpenCV window demo scripts, hit the “Download Code” button.
Let’s test the robustness of Depth Pro in estimating the physical distance of each point on an object from the camera.
Case 1: Actual Distance: 67 cm; Estimated Distance: ~ 68 to 71 cm
To do this, we placed a box at a distance of 67 cm from the table and captured it using a fixed mobile camera. The distance between the object and the table was measured with a measuring tape for reference.
We observed that the model estimated a range of around 68-71 cm when hovering over the cardboard box, which is impressive. However, in some cases we had failures, possibly due to how the image was captured. One will need to do extensive testing before relying on Depth Pro for metric depth in any real-world use case.
Case 2: Actual Distance: 100 cm ; Estimated Distance: ~110cm
The actual distance of the lowest plane (the internal bottom of the box) is 100 cm, but Depth Pro’s depth map estimates the points on that plane to be around 110 cm.
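A minimal sketch (our own, not the downloadable demo script) of the kind of interactive readout used here: show the image in an OpenCV window and print the metric depth under the mouse cursor. It assumes `rgb_image` (the input image as a BGR uint8 array) and `depth` (the metric depth map from Depth Pro, in meters) are already available from the inference step above.

import cv2

def on_mouse(event, x, y, flags, param):
    # Print the metric depth (converted to cm) at the current cursor position.
    if event == cv2.EVENT_MOUSEMOVE:
        print(f"Depth at ({x}, {y}): {depth[y, x] * 100:.1f} cm")

cv2.namedWindow("Metric Depth", cv2.WINDOW_NORMAL)
cv2.setMouseCallback("Metric Depth", on_mouse)
cv2.imshow("Metric Depth", rgb_image)
cv2.waitKey(0)
cv2.destroyAllWindows()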
Testing Depth Pro on Different Edge Cases
1. Glass Subjects
2. Inside Water
3. Water Reflections
Depth Pro performs exceptionally well at distinguishing water reflections from real subjects, not mistaking the reflections of the boy or the elephants in the water for actual subjects.
4. Illusion

courtesy: Patrik Proško
Limitations of Depth Pro
Though Depth Pro is impressive, it too has failure scenarios, such as blurred objects and foggy conditions, where it fails to produce meaningful results for subjects beyond the haze. While the structures of these subjects become slightly visible when surface normals are computed, they are not apparent to the naked eye in the inverse or raw depth maps. It also sometimes infers reflections in a mirror as real humans, which isn’t desirable.
1. Blurred Subjects
2. Foggy Scene
3. Mirror Reflections
4. Graffiti Illusions
Comparison 1: Image Inference
Depth Pro v/s Depth Anything V2 v/s Marigold
Compared to Depth Anything V2, Depth Pro’s outputs are high-res and sharp, with crisp details and no washed-out pixels.
Sample 1
Sample 2
Comparison 2: Video Inference
Depth Pro v/s DepthCrafter
Depth Pro doesn’t natively support video depth, so we split the video into frames and pass the directory containing all frames to Depth Pro. On the other hand, Tencent’s DepthCrafter is a SOTA video depth model that employs an image-to-video diffusion model internally. From the following comparisons, we can clearly see that DepthCrafter produces consistent depth for these high-motion, dynamic videos, whereas Depth Pro struggles in these scenarios, producing flickering or washed-out depth maps. This emphasizes the importance of accounting for temporal information when generating coherent depth maps.
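Since Depth Pro only accepts images, we extract the frames first. A minimal sketch of that step (the file names and paths below are placeholders):

import os
import cv2

video_path = "input_video.mp4"   # placeholder path
frames_dir = "video_frames"
os.makedirs(frames_dir, exist_ok=True)

cap = cv2.VideoCapture(video_path)
idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    cv2.imwrite(os.path.join(frames_dir, f"frame_{idx:05d}.jpg"), frame)
    idx += 1
cap.release()

# The frames directory can then be passed to the CLI, e.g.:
#   depth-pro-run -i video_frames -o output_dir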
Sample 1
Sample 2
Applications of Depth Maps
Have you ever wondered how Meta’s 3D photo sharing feature on its social platforms works? It uses image depth data to create those effects. Similarly, portrait mode on Pixel or iPhone cameras saves depth data, enabling realistic 3D styles and effects. While depth maps have numerous applications in autonomous driving, medical imaging, gaming/XR, 3D reconstruction, and robotics, in this article we’ll focus on some of their uses in photo and video editing at a high level.
Application 1: Parallax Effect with Adobe Software
Photoshop, by default, has a relative-depth-based depth blur model named bottlenet among its Neural Filters, which comes in handy when adding movement to static images.
Instead of using the default plugin’s output, we imported Depth Pro’s raw depth map into the project and applied it to the input to create a parallax effect in After Effects.
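To keep precision when bringing the depth map into After Effects, we can export it as a 16-bit PNG rather than an 8-bit JPEG. This helper is our own addition, not part of the official pipeline:

import cv2
import numpy as np

def save_depth_16bit(depth: np.ndarray, path: str = "depth_16bit.png") -> None:
    """Normalize a raw depth map to [0, 1] and save it as a 16-bit grayscale PNG."""
    d = depth.astype(np.float32)
    d = (d - d.min()) / (d.max() - d.min() + 1e-8)
    cv2.imwrite(path, (d * 65535).astype(np.uint16))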
Application 2: Simulating a Depth of Field Effect from a Static Image
To simulate the focal properties of a real camera, all you need is the depth map. Depth of field creates sharp focus on the focal plane at a specific distance while keeping all other areas blurred; it refers to the distance between the nearest and farthest objects that appear acceptably sharp in an image.
import cv2
import numpy as np

# rgb_image and depth_map_normalized (depth normalized to [0, 1]) are assumed to be
# available from the Depth Pro outputs above; new_width and new_height set the display size.
rgb_image = cv2.resize(rgb_image, (new_width, new_height))
depth_map_normalized = cv2.resize(depth_map_normalized, (new_width, new_height))

# Function to apply depth of field effect
def apply_dof(focal_depth):
    focal_range = 0.1  # Range around focal depth to remain sharp
    # Create smooth focus weights
    sharpness_weights = np.exp(-((depth_map_normalized - focal_depth) ** 2) / (2 * focal_range ** 2))
    sharpness_weights = sharpness_weights.astype(np.float32)
    # Apply Gaussian blur to the background
    blurred_image = cv2.GaussianBlur(rgb_image, (51, 51), 0)
    # Blend the original image and blurred image using sharpness weights
    sharpness_weights_3d = np.expand_dims(sharpness_weights, axis=2)  # Add a channel for blending
    dof_image = sharpness_weights_3d * rgb_image + (1 - sharpness_weights_3d) * blurred_image
    dof_image = np.clip(dof_image, 0, 255).astype(np.uint8)
    return dof_image

# Callback function for the trackbar
def on_trackbar(val):
    # Convert slider value (0-100) to focal depth (0.0-1.0)
    focal_depth = val / 100.0
    dof_image = apply_dof(focal_depth)
    cv2.imshow("Depth of Field Effect", dof_image)

# Create a window and resize it to fit the screen
cv2.namedWindow("Depth of Field Effect", cv2.WINDOW_NORMAL)
cv2.resizeWindow("Depth of Field Effect", new_width, new_height)

# Create a trackbar (slider) at the top of the window
cv2.createTrackbar("Focal Plane", "Depth of Field Effect", 50, 100, on_trackbar)  # Default at middle (50)

# Show initial DOF effect
initial_dof_image = apply_dof(0.5)  # Start with focal depth at 0.5
cv2.imshow("Depth of Field Effect", initial_dof_image)
cv2.waitKey(0)
cv2.destroyAllWindows()
Application 3: Depth Blur / Portrait Mode
One real application where depth data is extremely useful is in creating cool portrait effects on mobile cameras, which have a fixed focal length, unlike DSLR lenses with varying focal lengths.
# Normalize depth map to range [0, 1]
depth_map_normalized = cv2.normalize(depth_map.astype(np.float32), None, 0, 1, cv2.NORM_MINMAX)
# Convert normalized depth map back to uint8 (0-255 range)
depth_map_uint8 = (depth_map_normalized * 255).astype(np.uint8)
# Automatically infer focus range
min_depth = np.min(depth_map_normalized)
focus_margin = 0.22  # fraction of the normalized depth range kept in focus
focus_near = int(min_depth * 255)
focus_far = int((min_depth + focus_margin) * 255)
# Debug: Print focus range
print(f"Focus range: {focus_near} to {focus_far}")
# Create a binary mask for the focus region
focus_mask = cv2.inRange(depth_map_uint8, focus_near, focus_far)
# Apply Gaussian blur to the entire image
blurred_image = cv2.GaussianBlur(rgb_image, (51, 51), 0)
# Convert focus mask to 3 channels for blending
focus_mask_color = cv2.merge([focus_mask, focus_mask, focus_mask])
# Blend images: Keep original where mask is white, blur otherwise
result = np.where(focus_mask_color == 255, rgb_image, blurred_image)
# Display the result
cv2.imshow("Depth Blur Effect", result)
cv2.waitKey(0)
cv2.destroyAllWindows()
Application 4: 3D Point Cloud Projection from Depth Map
Finally, we will explore a common application of creating 3D point clouds from 2D images using depth maps.
Metric depth has an edge over relative depth estimates when projecting pixels into 3D space. Unlike relative depth models, metric depth allows proper scaling using the focal length in pixels (equivalently, the field of view together with the image width).
The following code block generates a point cloud and reprojects it into 3D using the Open3D visualization toolkit.
import os

import numpy as np
import open3d as o3d

# Note: these are the lens focal lengths in mm; for a correct reprojection the focal
# length should ideally be expressed in pixels (see the discussion after this block).
focal_length_x = 140  # In mm
focal_length_y = 140  # In mm
x,y = np.meshgrid(np.arange(width), np.arange(height))
x = (x - width /2) / focal_length_x
y = (y - height / 2) / focal_length_y
z = np.array(depth_raw)
points = np.stack((np.multiply(x,z), np.multiply(y, z), z), axis = -1).reshape(-1, 3)
colors = np.array(color_img).reshape(-1, 3) / 255.0
# Create the point cloud and save it to the output directory
out_dir = "Applications/vis_point_cloud"
os.makedirs(out_dir, exist_ok=True)
pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(points)
pcd.colors = o3d.utility.Vector3dVector(colors)
o3d.io.write_point_cloud(f'{os.path.join(out_dir, "satya.ply")}', pcd)
# Load the saved point cloud back in to visualize it with Open3D
opencv_pcd_path = "Applications/vis_point_cloud/satya.ply"
pcd = o3d.io.read_point_cloud(opencv_pcd_path)
# Flip it, otherwise the pointcloud will be upside down
pcd.transform([[1, 0, 0, 0], [0, -1, 0, 0], [0, 0, -1, 0], [0, 0, 0, 1]])
o3d.visualization.draw_geometries([pcd])
The outputs aren’t great, and we may need to fix the camera intrinsics in the reprojection matrix. However, using the image metadata, we found the lens to be a 70-200 mm zoom shot at 140 mm with f/4.5. Using that focal length and performing the reprojection (Q matrix) transform, this is the best point cloud we were able to obtain.
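One likely improvement, sketched below, is to reuse the focal length in pixels predicted by Depth Pro instead of the millimeter value when reprojecting; this is our suggestion, not the article’s original pipeline, and it assumes the `prediction`, `depth_raw`, `width`, and `height` variables from the code above.

import numpy as np

# Use the pixel-space focal length estimated by Depth Pro for both axes.
f_px = prediction["focallength_px"].detach().cpu().item()

x, y = np.meshgrid(np.arange(width), np.arange(height))
x = (x - width / 2) / f_px
y = (y - height / 2) / f_px
z = np.array(depth_raw)
points = np.stack((x * z, y * z, z), axis=-1).reshape(-1, 3)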
Key Takeaways
- Depth Pro is fast and produces high-resolution depth maps thanks to an effective training strategy. Compared to other monocular depth estimators, Depth Pro’s output is crisp and features sharp boundaries.
- We have shown a cherry-picked demo of inferring distance from the depth map with two cases (a cardboard box). However, we faced discrepancies in getting the expected distance when we tried other examples. Let us know in the comments if you have a pipeline and logic that handles this correctly.
- We strongly believe Depth Pro is another feather in the cap of the monocular depth field, following the legacy of MiDaS and Depth Anything. However, it is limited by the absence of an official fine-tuning pipeline and by Apple’s license, which includes special clauses that constrain broader adoption.
- Despite its advancements, Depth Pro still faces challenges such as “flying pixels” or artifacts in depth maps that can distort images, along with occasional boundary tracing issues. However, it performs better than previous models in these areas.
Conclusion
What do you think about Depth Pro’s results? Quite impressive, isn’t it? As Depth Pro is a foundational model, it can be extended to other downstream tasks like novel view synthesis, surface normal estimation, semantic segmentation, depth completion, 3D reconstruction, etc. It would have been a great addition to the open-source community if the project had been licensed under MIT or Apache 2.0. If you are planning to use the model for commercial purposes, be sure to check the licensing clauses.
For this article, working alongside our media team on ideation, particularly to understand the cross-functional applications of computer vision (especially depth) in software like Adobe Photoshop and After Effects for content creation, gave us a lot of insight. What interesting applications have you built with depth maps? Let us know in the comments; we would love to hear about them.
References
1. Interesting read by Patricio Gonzalez
2. HuggingFace: Monocular depth estimation guide
3. Surface normal from depth code snippet: Decoding Meta Sapiens
4. Depth Anything V2: Metric Depth
5. Open3D Tutorials by Nicolai Nielsen