AI research made great strides in 2023-2024, including vision-language models like GPT-4o and Gemini; text-to-video diffusion models like Sora and Veo; and humanoid robots like Atlas, Figure 01, and Tesla Optimus. Extensive research goes on every day behind all of these innovations. To showcase the best of this rapid advancement in AI, the IEEE Computer Society (CS) and the Computer Vision Foundation (CVF) organize CVPR, one of the world's premier conferences for the latest research in Computer Vision and AI.
According to a survey by Express Computer, about 13.5K research papers are published daily, roughly 550 every hour. In 2024, CVPR was held from June 17th to 21st in Seattle, and we witnessed some cool research from around the globe. In this article, we cover some of the papers that caught our attention. By the end, you will have an overview of CVPR and of the latest research in advanced Computer Vision and AI happening around the world.
OpenCV at CVPR 2024
Gary Bradski, Anindya Roy, Phil Nelson, and Shiqi Yu represented the OpenCV team at CVPR 2024. Here are some key highlights that the team showcased at booth 1920:
- The team featured OpenCV 5, showcasing features from the latest releases and the roadmap for OpenCV 5. At last count, 600+ attendees had added themselves to OpenCV’s visitor list.
- Many attendees thanked the developers of OpenCV for making graduate courses, MS/PhD projects, and commercial solutions possible. A handful of industry attendees promised support via memberships.
- The team highlighted OpenCV University and OpenCV.ai, which educate people about the latest in AI and Computer Vision.
- Several representatives from semiconductor and camera companies, impressed by recent collaborations with Arm and Qualcomm, have invited OpenCV to join their partnership programs.
- MLOps and CV tooling teams like Voxel51 requested collaborations.
Paper 1: Generative Image Dynamics
A paper by Zhengqi Li, Richard Tucker, et al. from Google Research won the Best Paper Award at CVPR 2024. It is all about generating a seamlessly looping video or an interactive simulation of dynamics from a single image. The model was trained on a collection of motion trajectories extracted from real video sequences depicting natural, oscillatory dynamics of objects such as trees, flowers, candles, and clothes swaying in the wind.
Now, if you give a single image to the model, it will use a frequency-coordinated diffusion sampling process to predict a spectral volume, which can be converted into a motion texture that spans an entire video. Along with an image-based rendering module, the predicted motion representation can be used for a number of downstream applications, such as turning still images into seamlessly looping videos, or allowing users to interact with objects in real images, producing realistic simulated dynamics (like dragging and releasing points).
Model Architecture
Motion prediction module: The model predicts a spectral volume S through a frequency-coordinated denoising model. Each block of the diffusion network θ interleaves 2D spatial layers with attention layers and iteratively denoises latent features z_n. The denoised features are fed to a decoder D to produce S. During training, the model concatenates the downsampled input I_0 with noisy latent features encoded from a real motion texture via an encoder E; during inference, the noisy features are replaced with Gaussian noise z_N.
Predicting Motion
Motion Representation
- Concept: A “motion texture” comprises sequences of time-variant 2D displacement maps F_t, where each map specifies the displacement of pixels in an input image I_0 over time t.
- Implementation: At inference, we pass the input image I_0 together with Gaussian noise z_N to the model, which predicts the spectral volume S via reverse diffusion.
- Spectral Volume Representation: Motions are encoded in the frequency domain by transforming per-pixel trajectories into spectral volumes using a Fourier transform. This allows complex motion dynamics to be represented compactly and processed efficiently.
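To make the spectral-volume idea concrete, here is a minimal NumPy sketch (our own toy example, not the authors' code): per-pixel displacement trajectories are converted to the frequency domain, only the first K low-frequency terms are kept, and an inverse FFT recovers an approximate motion texture.
import numpy as np

# Toy motion texture: T frames of 2D (x, y) displacements for an H x W image.
T, H, W = 150, 64, 64
motion_texture = np.random.randn(T, H, W, 2).astype(np.float32)

# FFT along the time axis; natural oscillatory motion is dominated by low
# frequencies, so keeping only the first K terms gives a compact spectral volume.
K = 16
spectral_volume = np.fft.rfft(motion_texture, axis=0)[:K]      # (K, H, W, 2), complex

# Inverse direction: zero-pad the discarded frequencies and apply the inverse
# FFT to get back a (low-pass) motion texture spanning the whole video.
padded = np.zeros((T // 2 + 1, H, W, 2), dtype=np.complex128)
padded[:K] = spectral_volume
reconstructed = np.fft.irfft(padded, n=T, axis=0)              # (T, H, W, 2)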
Predicting Motion with a Diffusion Model
- Latent Diffusion Model (LDM): Chosen for its computational efficiency and the ability to maintain synthesis quality. The model consists of:
- Variational Autoencoder (VAE): This component compresses the input image into a latent space and reconstructs the input from these features.
- U-Net-Based Diffusion Model: It learns to iteratively denoise latent features, starting from Gaussian noise, producing spectral volumes through denoising steps.
- Training and Normalization:
- Frequency Adaptive Normalization: Critical in preparing the data for the model, this step normalizes the Fourier coefficients to prevent extremes by using statistics (like the 95th percentile) from training data.
- Motion Prediction: Utilizes the trained model to predict spectral volumes by applying a frequency-coordinated denoising process. This method ensures that predictions across frequency bands are harmonized, avoiding unrealistic motion artifacts.
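As a rough illustration of frequency-adaptive normalization, here is a minimal sketch assuming the spectral-volume shapes from the previous example and the 95th-percentile statistic mentioned above; this is our own illustration, not the released implementation.
import numpy as np

def fit_frequency_scales(train_spectral_volumes, percentile=95.0):
    # train_spectral_volumes: (N, K, H, W, 2) complex coefficients from training clips.
    # One scale per frequency band, so dominant low frequencies do not swamp the loss.
    magnitudes = np.abs(train_spectral_volumes)
    return np.percentile(magnitudes, percentile, axis=(0, 2, 3, 4)) + 1e-8   # (K,)

def normalize_spectral_volume(spectral_volume, scales):
    # spectral_volume: (K, H, W, 2); divide each band by its scale (undone at inference).
    return spectral_volume / scales[:, None, None, None]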
Image-Based Rendering
Rendering module: The method fills in missing content and refines the warped input image using a deep image-based rendering module: multi-scale features are extracted from the input image I_0, softmax splatting is applied to these features with the motion field F_t mapping frame 0 to time t (subject to per-pixel weights W), and the warped features are fed to an image synthesis network to produce the rendered image I_t.
- Forward Warping: This involves transforming the predicted spectral volume back into a time-domain motion texture using an inverse FFT, and then using that texture to warp the input image I_0 into a new frame I_t.
- Softmax Splatting:
- Technique: Used for refining the warped image by blending multiple source pixels that map to the same output location. This is achieved using a feature pyramid and softmax weights to manage overlapping and smooth transitions.
- Implementation: The image I_0 is encoded into multi-scale feature maps. The motion field F_t is applied to these features, which are then synthesized into the output image I_t using a neural synthesis network (a toy version of the splatting step is sketched after this list).
- Perceptual Loss: During training, a perceptual loss based on the VGG network compares the synthesized frames against actual video frames to fine-tune the model’s output for higher fidelity and realism.
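Below is a toy version of the splatting step referenced above, written in plain NumPy purely for intuition: nearest-neighbor splatting with per-pixel softmax-style weights, not the feature-pyramid softmax splatting used in the paper.
import numpy as np

def toy_softmax_splat(image, flow, weights):
    # image: (H, W, C), flow: (H, W, 2) displacement in pixels, weights: (H, W) scores.
    # Each source pixel is "splatted" to its displaced location; pixels landing on the
    # same target are blended with softmax-style weights instead of overwriting.
    H, W, C = image.shape
    out = np.zeros((H, W, C))
    norm = np.zeros((H, W, 1))
    ys, xs = np.mgrid[0:H, 0:W]
    tx = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    ty = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
    w = np.exp(weights - weights.max())[..., None]     # stabilized softmax numerator
    np.add.at(out, (ty, tx), w * image)
    np.add.at(norm, (ty, tx), w)
    return out / np.maximum(norm, 1e-8)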
Examples
Image-to-Video Conversion
- Basic Idea: The model transforms a single still photo into a dynamic video by first calculating a “motion spectral volume” from the photo. This volume captures all possible movements in the scene.
- Animation Process: These calculated motions are then turned into a smooth video using the image-based rendering module described above.
Seamless Looping Videos
- Challenge: Looping videos are those that start and end at the same point smoothly, so they can run continuously without you noticing the beginning or end. Creating these types of videos is challenging because it’s hard to find examples to teach the model.
- Solution: The authors use the motion prediction model with additional guidance at generation time so that the start and end of the video match up. During sampling, each pixel’s motion is adjusted so that the first and last frames are nearly identical, which makes the video loop seamlessly (a toy version of this constraint is sketched below).
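As a toy illustration of the looping constraint (a big simplification of the paper's sampling-time guidance), one can remove any per-pixel drift so the displacement at the last frame matches that of the first:
import numpy as np

def make_looping(motion_texture):
    # motion_texture: (T, H, W, 2). Subtract a linear ramp of the end-to-start
    # residual from each pixel's trajectory so the first and last frames match.
    T = motion_texture.shape[0]
    drift = motion_texture[-1] - motion_texture[0]              # (H, W, 2)
    ramp = np.linspace(0.0, 1.0, T)[:, None, None, None]        # (T, 1, 1, 1)
    return motion_texture - ramp * drift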
Interactive Dynamics from a Single Image
- Concept: This technique allows us to make a still photo react to virtual forces, like poking or pushing an object in the photo, and see it move realistically.
- How It Works: The model treats the motions in an image like natural vibrations: by assigning different frequencies to movements in the image, it can predict how each part should respond to a force based on how objects naturally oscillate (see the toy sketch after this list).
- Application: This can be particularly useful in simulations or interactive applications where you want to create realistic movements without needing actual video footage as input.
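The sketch below is a loose, hypothetical illustration of that idea, treating each frequency band of the spectral volume as a damped harmonic oscillator excited by a user impulse; it is not the authors' exact formulation, and all names are ours.
import numpy as np

def modal_response(t, freqs, mode_shapes, impulse, damping=0.05):
    # freqs: (K,) modal frequencies in Hz; mode_shapes: (K, H, W, 2) per-band motion
    # patterns (e.g. real-valued magnitudes derived from the spectral volume);
    # impulse: scalar strength of the user's "poke". Each mode rings like a damped
    # sinusoid, and the pixel displacement at time t sums all modal contributions.
    displacement = np.zeros(mode_shapes.shape[1:])              # (H, W, 2)
    for k, f in enumerate(freqs):
        omega = 2.0 * np.pi * f
        envelope = impulse * np.exp(-damping * omega * t) * np.sin(omega * t)
        displacement += envelope * mode_shapes[k]
    return displacement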
Our Take
- The paper is unique in introducing a frequency-coordinated diffusion sampling process to predict a spectral volume from a single image.
- We like how it uses diffusion to generate motion frequencies from an image.
- The authors haven’t released the code yet, but you can try the demo here.
Paper 2: Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation
Traditional autonomous vehicle (AV) systems are built from many stages, such as detection, tracking, trajectory prediction, and path planning. Zhenxin Li, Kailin Li, Shihao Wang, et al. at NVIDIA Research introduce Hydra-MDP, an end-to-end driving system that handles perception and decision-making under a unified transformer model. With this model, NVIDIA won the CVPR Autonomous Grand Challenge for End-to-End Driving this year.
Using Hydra-MDP, the researchers obtained better predictions with less code and a less complicated architecture. The transformer-based model replaces the traditional modular deep learning pipeline, outperforms prior methods, and won the CVPR Grand Challenge. They also introduced another notable work, Producing and Leveraging Online Map Uncertainty in Trajectory Prediction, which is one of the main parts of the Hydra pipeline.
Many online HD vector map estimation methods operate by encoding multi-camera images, transforming them into a common BEV feature space, and regressing map element vertices. This work augments the common output structure with a probabilistic regression head, modeling each map vertex as a Laplace distribution. To assess the resulting downstream effects, the researchers further extend downstream prediction models to encode map uncertainty, augmenting both GNN-based and Transformer-based map encoders. You can read the above-mentioned paper for more detailed explanations.
Using this method and Hydra-MDP together, the researchers not only generated better HD maps but also improved overall AV system performance. Now, let’s move to the main architecture of Hydra-MDP.
Architecture Overview
Perception Network:
- The model takes two inputs: 2D images from multi-view cameras and a Bird’s Eye View depth map from a LiDAR sensor.
- Both inputs are passed through an image backbone and a LiDAR backbone separately; these backbone encoders generate embeddings for each input.
- The model then fuses the two modality embeddings (2D and 3D) to generate new environment embeddings.
- Simultaneously, the model passes the LiDAR tokens to a perception head to predict perception outputs (supervised with ground truth), which are used later when scoring trajectory paths for the AV.
Trajectory Decoder:
- In this part, the model uses a planning vocabulary (a set of all plausible trajectories a vehicle can follow in an environment) to generate query embeddings (Q) using another encoder.
- Later, this Query (Q) is passed with Key (K) and Value (V) in a transformer decoder block.
- After the decoder block, the output goes into an MLP block, which yields an imitation score for each trajectory in the vocabulary.
- With these imitation scores, the model learns from a human teacher (data collected from human drivers: the trajectories they take and how they brake, accelerate, and turn in different situations) using a distance-based cross-entropy loss that pushes the model to imitate human drivers.
- Finally, the model favors trajectory proposals that are close to human driving behavior by turning L2 distances into a probability distribution with a softmax (see the sketch below).
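A minimal PyTorch sketch of that step (our own, with made-up shapes, not the released code) could look like this:
import torch

def imitation_targets(vocabulary, human_trajectory, temperature=1.0):
    # vocabulary: (V, T, 2) candidate trajectories; human_trajectory: (T, 2) logged path.
    # Trajectories closer to the human path get higher probability via a softmax
    # over negative L2 distances, giving soft targets for the imitation scores.
    dists = torch.linalg.norm(vocabulary - human_trajectory[None], dim=-1).sum(dim=-1)  # (V,)
    return torch.softmax(-dists / temperature, dim=0)

# During training, the predicted imitation scores are matched to these targets
# with a cross-entropy-style loss.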
Multi-target Hydra-Distillation:
- The model's candidate trajectories are simulated in a virtual environment (think of it as a 3D driving simulator), and the model is trained with the resulting feedback.
- The output embeddings from the transformer block are passed to the Hydra prediction heads (a layer made of multiple MLP blocks), which produce one set of scores per teacher metric (three in total, described below).
- Simultaneously, we use the planning vocabulary with the ground truth perceptions in a simulator with rule-based teachers. The rule-based teacher uses simulations to generate many driving scenarios.
- For each simulated driving path, the rule-based teacher gives scores based on how well the driving adheres to various safety and rule-based criteria. These scores are about avoiding collisions, staying within the drivable area, how comfortable the ride is, etc. These scores act like feedback for the student model, showing how well it performed according to the rules.
- Rule-based teachers generate scores for different trajectories by running detailed simulations. These scenarios include:
- No at-fault Collisions (S_NC) – The AV must not cause a collision; collisions that are not the fault of the AV are not penalized.
- Drivable Area Compliance (S_DAC) – Keeping the car on the correct part of the road according to the traffic rules.
- Comfort (S_C) – Ensuring the ride is smooth, without sudden jerks or harsh movements.
- Now, the most interesting part: the simulator with rule-based teachers produces these three sets of scores for every trajectory in the planning vocabulary.
- During training, the scores predicted by the Hydra prediction heads are optimized against the simulation scores produced by the rule-based teachers for the corresponding trajectories (a minimal sketch of this distillation loss follows this list).
- At inference time, the prediction heads score the trajectories directly, and the AV uses the best-scoring predicted trajectory to navigate through the environment.
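Here is a compact sketch of how such a multi-target distillation loss could be assembled; this is our own simplification, and names like pred_scores and the metric keys are ours, not the paper's.
import torch
import torch.nn.functional as F

def hydra_distillation_loss(pred_scores, teacher_scores):
    # pred_scores / teacher_scores: dicts keyed by metric ("no_collision",
    # "drivable_area", "comfort"), each a (V,) tensor over the planning vocabulary.
    # Every prediction head is supervised with the simulator's rule-based score
    # for the corresponding trajectory.
    return sum(F.binary_cross_entropy_with_logits(pred_scores[k], teacher_scores[k])
               for k in teacher_scores)

# Example usage with a vocabulary of V = 4096 trajectories:
V = 4096
pred = {k: torch.randn(V) for k in ("no_collision", "drivable_area", "comfort")}
teach = {k: torch.rand(V) for k in ("no_collision", "drivable_area", "comfort")}
loss = hydra_distillation_loss(pred, teach)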
Code
The authors haven’t released the codebase yet. You can check the official GitHub repo here for updates.
Paper 3: pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction
Over the past two decades, “3D” has been one of the most popular terms in AI research, and 3D reconstruction remains one of the most active areas in computer vision. We all know about the groundbreaking NeRF paper, which introduced a neural network to generate 3D renders. Then came pixelNeRF, which uses only a few images as input and a CNN-based encoder on top of NeRF to generate far better 3D renders. And then we have 3D Gaussian Splatting, which uses 3D Gaussians and gradient descent to generate better 3D renders than prior methods.
Authors David Charatan, Sizhe Lester Li, et al. take this research a step further and introduce pixelSplat, which combines 3D Gaussian splatting with a reparameterization trick and a neural network. It takes just two images of a scene from two different viewpoints and generates a 3D render with minimal inference time. We can think of this approach as a combination of 3D Gaussian Splatting and NeRF.
Architecture Overview
- Two-View Image Encoding: PixelSplat begins by processing a pair of input images through a feature extraction network, which generates a high-dimensional representation of each image. This neural network, often structured similarly to those used in NeRF architectures, extracts crucial visual and spatial features from the images, setting the stage for understanding the scene’s geometry.
- Epipolar Geometry and Scale Ambiguity Resolution: The extracted features are then processed using an epipolar transformer, a component that leverages the geometric relationship between the two views to resolve scale ambiguity—an inherent challenge in reconstructing 3D scenes from 2D images. This step ensures that the 3D positions inferred from different images are consistent relative to each other, addressing variations in camera positioning and orientation.
- Probabilistic Sampling of Gaussian Parameters: With scale and geometry calibrated, the next step involves a novel application of 3D Gaussian splatting, where the model predicts a dense probability distribution over the potential locations of Gaussian primitives. This is facilitated by the reparameterization trick, which allows the network to sample these locations differentiably. Each Gaussian’s position (mean), shape (covariance), and visibility (opacity) are determined in this way, enabling gradients to be propagated back through the network during training and thus optimizing Gaussian placement efficiently (a toy version of this sampling is sketched after this list).
- Rendering and Output Generation: Finally, the parameterized 3D scene, now represented as a collection of Gaussian splats, is rendered to produce novel views. This rendering process is optimized for speed and memory efficiency, making use of the Gaussian splatting technique’s light computational footprint. The output is a set of new images, or novel views, generated from perspectives not originally captured by the input images, showcasing the model’s ability to interpolate and extrapolate 3D space from limited data.
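To make the reparameterized sampling idea more tangible, here is a toy PyTorch sketch (our own, with hypothetical shapes, not the pixelSplat code): the network predicts a probability over discrete depth buckets per pixel, a depth is sampled, and routing the bucket's probability into the Gaussian's opacity keeps the sampling step differentiable.
import torch

def sample_gaussian_depths(depth_logits, near=0.5, far=20.0):
    # depth_logits: (N, D) per-pixel scores over D candidate depth buckets.
    probs = torch.softmax(depth_logits, dim=-1)
    buckets = torch.linspace(near, far, depth_logits.shape[-1])
    idx = torch.multinomial(probs, num_samples=1)              # (N, 1) sampled bucket
    depth = buckets[idx.squeeze(-1)]                           # (N,) Gaussian center along each ray
    opacity = probs.gather(-1, idx).squeeze(-1)                # (N,) gradients flow through opacity
    return depth, opacity

depths, opacities = sample_gaussian_depths(torch.randn(64 * 64, 128))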
More Examples
ACID Dataset
Real Estate 10k Dataset
Code Walkthrough
First, go to the official GitHub repo mentioned above, then clone the repo:
git clone https://github.com/dcharatan/pixelsplat.git
Then, create a virtual environment and install all the required libraries:
python3.10 -m venv venv
source venv/bin/activate
# Install these first! Also, make sure you have python3.11-dev installed if using Ubuntu.
pip install wheel torch torchvision torchaudio
pip install -r requirements.txt
Then, download and place the model checkpoints and dataset from the official GitHub repository.
And you are good to go. The project uses Hydra, a configuration framework by Meta, so arguments are passed as config overrides:
# Real Estate 10k
python3 -m src.main +experiment=re10k mode=test dataset/view_sampler=evaluation dataset.view_sampler.index_path=assets/evaluation_index_re10k.json checkpointing.load=checkpoints/re10k.ckpt
# ACID
python3 -m src.main +experiment=acid mode=test dataset/view_sampler=evaluation dataset.view_sampler.index_path=assets/evaluation_index_acid.json checkpointing.load=checkpoints/acid.ckpt
Paper 4: Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models
In computer vision, we have long studied optical illusions, such as the checker shadow illusion, but have you ever thought that we could use diffusion to create multi-view illusion images? Cool, right?
Authors Daniel Geng, Inbum Park, and Andrew Owens propose Visual Anagrams, a method that uses diffusion to create visual illusions from simple prompts. It not only creates the illusion image but also supports a wide range of transformations, including rotations, flips, color inversions, skews, jigsaw rearrangements, random permutations, and multiple views.
Method Overview
- First, it takes the noisy image at the current denoising step, passes it to the diffusion model with one prompt, and obtains a noise estimate as in standard reverse diffusion.
- The interesting part is that, in parallel, it applies a transformation to that noisy image (for example, the identity, an image flip, or a permutation of pixels) and passes the result to the diffusion model with a second prompt.
- The second noise estimate is mapped back through the inverse transformation, the two estimates are averaged, and the denoising step proceeds with this average; repeating this over all steps yields the final illusion image (see the sketch below).
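The whole trick fits in a few lines; here is a hypothetical sketch around a generic noise-prediction model eps_model (names and signature are ours, not the released code):
import torch

def anagram_noise_estimate(eps_model, x_t, t, prompt_a, prompt_b, view, inverse_view):
    # x_t: current noisy image; view / inverse_view: the transformation and its inverse.
    eps_a = eps_model(x_t, t, prompt_a)                       # noise estimate for view A
    eps_b = inverse_view(eps_model(view(x_t), t, prompt_b))   # estimate for view B, mapped back
    return 0.5 * (eps_a + eps_b)                              # averaged estimate drives denoising

# Example view pair: a vertical flip, which is its own inverse.
flip = lambda x: torch.flip(x, dims=[-2])
# averaged = anagram_noise_estimate(eps_model, x_t, t, "a painting of a dog",
#                                   "a painting of a teapot", flip, flip)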
More Examples
Jigsaw Permutations
Flips and 180° Rotations
90° Rotations
Color Inversions
Miscellaneous Transformations
Random Patch Permutations
Three Views
Four Views
Code Pipeline
The authors have provided Colab notebooks for both the free and pro tiers.
If you want to run it locally, follow the instructions in the GitHub repository mentioned above.
Paper 5: RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models
This CVPR, Ozgur Kara, Bariscan Kurtkaya, et al. published RAVE, a zero-shot video editing method using diffusion. RAVE takes an input video and a text prompt and produces a high-quality edited video while preserving the original motion and semantic structure. It employs a novel noise shuffling strategy, leveraging spatio-temporal interactions between frames to produce temporally consistent videos. We can think of it as image inpainting for videos: given an input video, you can edit or change it completely with a prompt.
Model Architecture
The RAVE architecture begins by performing a DDIM inversion with the pre-trained T2I model and condition extraction with an off-the-shelf condition preprocessor applied to the input video V_K. These conditions are subsequently input into ControlNet. In the RAVE video editing process, diffusion denoising is performed for T timesteps using the condition grids C_L, the latent grids G_t^L, and the target text prompt as input for ControlNet. Random shuffling is applied to the latent grids and condition grids at each denoising step. After T timesteps, the latent grids are rearranged, and the final output video V*_K is obtained.
- First, the preprocessing step takes the video frames and performs condition preprocessing to break them into condition grids. Simultaneously, it applies DDIM inversion with the pre-trained T2I model to the same input frames to obtain latent grids. This grid trick lets the model denoise several frames jointly, which improves temporal consistency.
- These two sets of grids are then fed into ControlNet. Before each pass through the model, the latent and condition grids are shuffled, and the target text prompt is supplied as conditioning (see the sketch below).
- The model then runs the denoising (reverse diffusion) process for T timesteps. After the diffusion process, the latent grids are rearranged back into frames, and the final output video is obtained.
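A rough PyTorch sketch of the shuffling step mentioned above (our own simplification with made-up shapes, not the official code):
import torch

def shuffle_into_grids(latents, grid=3):
    # latents: (F, C, H, W) per-frame latents, with F divisible by grid * grid.
    # Frames are randomly regrouped and tiled into grid x grid "latent grids" so the
    # T2I model denoises several frames jointly; re-shuffling at every step mixes
    # information across the whole video, which is what keeps it temporally consistent.
    perm = torch.randperm(latents.shape[0])
    tiles = latents[perm].view(-1, grid, grid, *latents.shape[1:])   # (F/g^2, g, g, C, H, W)
    grids = torch.stack([
        torch.cat([torch.cat(list(row), dim=-1) for row in group], dim=-2)
        for group in tiles
    ])                                                               # (F/g^2, C, g*H, g*W)
    return grids, perm   # keep perm so frames can be un-shuffled after the step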
Code Pipeline
You can run the code in very simple steps:
First, create a virtual environment and install the requirements from requirements.txt:
conda create -n rave python=3.8
conda activate rave
conda install pip
pip cache purge
pip install -r requirements.txt
Then, install PyTorch and xFormers as well:
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
pip install xformers==0.0.20
Then run the webui.py file:
python webui.py
Or, you can run the inference within your Terminal, following these steps:
- Put the video you want to edit under data/mp4_videos as an MP4 file. Note that we suggest using videos with a size of 512×512 or 512×320.
- Prepare a config file under the configs directory. Change the name of the video_name parameter to the name of the MP4 file. There, you can find detailed descriptions of the parameters and example configurations.
- Run the following command:
python scripts/run_experiment.py [PATH OF CONFIG FILE]
Examples
All the examples are taken from the project page.
Paper 6: SpatialTracker: Tracking Any 2D Pixels in 3D Space
Traditionally, we use optical flow or feature tracking to track pixels in a video. However, optical flow only produces motion for adjacent frames, whereas feature tracking only tracks sparse pixels. So, what if we could track dense and long-range pixel trajectories in a video sequence?
Yuxi Xiao, Qianqian Wang, et al. propose SpatialTracker, which lifts 2D pixels into 3D using monocular depth estimators, represents the 3D content of each frame efficiently with a triplane representation, and performs iterative updates with a transformer to estimate 3D trajectories. In other words, it tracks all the 2D pixels of the video in 3D space. Tracking in 3D allows the method to leverage as-rigid-as-possible (ARAP) constraints while simultaneously learning a rigidity embedding that clusters pixels into different rigid parts. Using this method, the authors were able to track long videos containing rigid or occluded parts without any issues.
Model Overview
- First, using a triplane encoder, the model encodes each frame into a tri-plane representation.
- Then, it initializes and iteratively updates point trajectories in 3D space using a transformer that takes features extracted from these triplanes as input.
- These 3D point trajectories are trained with ground-truth annotations and regularized by an as-rigid-as-possible (ARAP) constraint with learned rigidity embedding.
- The ARAP constraint enforces that 3D distances between points with similar rigidity embeddings remain constant over time. Here, d_ij represents the distance between points i and j, while s_ij denotes their rigidity similarity. This method produces accurate long-range motion tracks even under fast movements and severe occlusion.
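A compact sketch of such an ARAP regularizer, written in our own notation to match the text (the real loss has additional weighting, and the softmax-based similarity is our simplification):
import torch

def arap_loss(tracks, rigidity_embeddings):
    # tracks: (T, N, 3) estimated 3D trajectories; rigidity_embeddings: (N, E).
    d = torch.cdist(tracks, tracks)                                          # (T, N, N) pairwise d_ij per frame
    s = torch.softmax(rigidity_embeddings @ rigidity_embeddings.T, dim=-1)   # (N, N) similarity s_ij
    # Points with similar rigidity embeddings should keep their 3D distance over time.
    return (s * (d - d[:1]).abs()).mean()

loss = arap_loss(torch.randn(24, 256, 3), torch.randn(256, 32))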
More Examples
2D Tracking
Rigid Part Segmentation from Videos
Code Pipeline
The authors provided a GitHub repository, which is mentioned above. To run the code, you need to create a virtual env using:
conda create -n SpaTrack python==3.10
conda activate SpaTrack
Then, install PyTorch using:
pip install torch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 --index-url https://download.pytorch.org/whl/cu118
Then clone the repository using:
git clone https://github.com/henry123-boy/SpaTracker.git
Then, install the requirements:
pip install -r requirements.txt
Then, download and place the model checkpoints and dataset as mentioned in the GitHub readme.
And you are ready to go. Run the demo.py script:
python demo.py --model spatracker --downsample 1 --vid_name sintel_bandage --len_track 1 --fps_vis 15 --fps 1 --grid_size 60 --gpu ${GPU_id} --point_size 1 --rgbd # --vis_support
All the flags are available in the GitHub repo itself.
Paper 7: ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Real Image
The paper by Kyle Sargent, Zizhang Li, et al. introduces ZeroNVS, a new 3D diffusion model designed to create 360-degree views from just a single image. It specifically tackles the tough challenge of handling scenes with multiple objects and complex backgrounds. Unlike traditional models focusing on simple scenes, ZeroNVS is trained on a mix of different scene types, from indoor to outdoor, helping it perform better in real-world settings. It uses a clever method called “SDS anchoring” to improve how diverse the backgrounds look in the generated views, which helps make the scenes look more realistic.
Method Overview
Representing Objects for View Synthesis:
Traditionally, object poses in view synthesis are modeled with three degrees of freedom (3DoF): elevation, azimuth, and distance from the camera, which is suitable for simple, isolated objects. ZeroNVS extends this to six degrees of freedom (6DoF), capturing all rotational and translational movements, which lets it handle scenes with multiple complex objects and varied orientations and thus render 3D scenes more realistically from single images.
Representing Scenes for View Synthesis:
Older models often drop the ball when they have to deal with complicated scenes because they ignore part of the camera geometry. ZeroNVS changes the game by conditioning on the full camera setup, including the field of view of the lens. This lets it capture the small details and perspective changes that make a scene look just right, much like what you’d see with your own eyes.
Addressing Scale Ambiguity with a New Normalization Scheme:
Ever notice how, in photos, it’s hard to tell how big things are or how far away they are? ZeroNVS tackles this head-on with a new normalization scheme: it uses depth information about the scene to figure out the right scale, ensuring everything looks the size it should. This is a big deal because it helps the model handle data from different sources more reliably, keeping things looking real no matter where they come from.
Improving Diversity with SDS Anchoring:
Conventional NVS techniques often produce limited background diversity, resulting in synthesized views that can appear monotonous or unrealistic. ZeroNVS tackles this problem with SDS anchoring: it first generates a set of diverse, preliminary views using DDIM sampling, and then uses these views as reference points, or “anchors,” to guide the distillation process, ensuring that the backgrounds in the synthesized images are varied and more reflective of real-world variability.
More Examples
Code Walkthrough
First, clone the official repo:
git clone https://github.com/kylesargent/zeronvs.git
cd zeronvs
Create a virtual env and install all the required libraries:
conda create -n zeronvs python=3.8 pip
conda activate zeronvs
pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
conda install -c "nvidia/label/cuda-11.8.0" cuda-toolkit
pip install ninja git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch
pip install -r requirements-zeronvs.txt
pip install nerfacc -f https://nerfacc-bucket.s3.us-west-2.amazonaws.com/whl/torch-2.0.0_cu118.html
Finally, initialize and pull the code in the zeronvs_diffusion submodule.
cd zeronvs_diffusion
git submodule init
git submodule update
cd zero123
pip install -e .
cd ..
cd ..
Download and unzip the model and data:
gdown --fuzzy https://drive.google.com/file/d/1q0oMpp2Vy09-0LA-JXpo_ZoX2PH5j8oP/view?usp=sharing
gdown --fuzzy https://drive.google.com/file/d/1aTSmJa8Oo2qCc2Ce2kT90MHEA6UTSBKj/view?usp=drive_link
gdown --fuzzy https://drive.google.com/file/d/17WEMfs2HABJcdf4JmuIM3ti0uz37lSZg/view?usp=sharing
unzip dtu_dataset.zip
Finally, run the launch_inference.sh script to run inference on your own images.
CVPR 2024 – Special Mentions
As this article is already quite long, it is not possible to cover all the papers in detail here. We may cover more of them in detail in future articles, one by one, so keep an eye on LearnOpenCV. But we are not going to leave you with FOMO!
Here is a list of more interesting papers for you. Go and check those out, too!
DiffusionLight: Light Probes for Free by Painting a Chrome Ball
SpiderMatch: 3D Shape Matching with Global Optimality and Geometric Consistency
BioCLIP: A Vision Foundation Model for the Tree of Life
Oryon: Open-Vocabulary Object 6D Pose Estimation
Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation
I’M HOI: Inertia-aware Monocular Capture of 3D Human-Object Interactions
Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images
Loopy-SLAM: Dense Neural SLAM with Loop Closures
VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence
Conclusion
CVPR 2024 knocked it out of the park, offering a deep dive into the latest and greatest in AI and computer vision. We saw everything from turning single still images into dynamic videos to cutting-edge video editing and 3D modeling techniques pushing the boundaries. This roundup touched on just a few standout papers, but the conference was packed with a lot more worth exploring.
In the second part of this article, we will cover a few more research papers and some good datasets. This year, 2,719 papers were accepted out of 11,532 submissions. We tried our best to explore as much as we could. Let us know if you find any other good papers we have missed; we would love to include them, too!
References
CVPR 2024 Best Paper Award Winners