CVPR (Computer Vision and Pattern Recognition) is an annual conference, and its 2024 edition, held from June 17th to 21st at the Seattle Convention Center, USA, was a huge success. With an acceptance rate of just ~23.6%, IEEE CVPR 2024 maintained its high research standards. The conference offered many interesting papers, workshops, datasets, and benchmarks for the computer vision community, some of which may lay the foundation for the next decade.
In this article, we primarily aim to focus on:
- What problem statement existed in each category?
- What were the novel methodologies the authors carried out?
- And finally, impressive demos along with the GitHub repository links for the respective papers.
This is the second part of our series on noteworthy papers from CVPR 2024. In our last article, we covered a wide variety of papers that drive current research in 3D Diffusion, Autonomous Vehicles, NeRF, and more.
If you landed directly on this article, bookmark Part 1 of CVPR 2024: An Overview to read later.
Here is a quick overview of 11 papers that we will cover.
- Florence 2
- DocRes: A Generalist Model Toward Unifying Document Image Restoration Tasks
- DiffMOT: A Real-time Diffusion-based Multiple Object Tracker with Non-linear Prediction
- From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
- Object Recognition as Next Token Prediction
- MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild
- ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation
- MemSAM: Taming Segment Anything Model for Echocardiography Video Segmentation
- EventPS: Real-Time Photometric Stereo Using an Event Camera
- Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods
- LEAP-VO: Long-term Effective Any Point Tracking for Visual Odometry
- Key Datasets
- Special Mention
- Conclusion
1. Florence 2
Arxiv: https://arxiv.org/abs/2311.06242
Problem statement: Unified Architecture for Vision tasks.
Category: Vision, Language, and Reasoning
Florence-2, by Bin Xiao et al. from Azure AI, Microsoft, is a strong foundational VLM that outshines its competitors with task-agnostic zero-shot performance. Florence-2 was pre-trained on the FLD-5B dataset of 126M images. The authors point out that unfreezing the vision backbone enhances the model's ability to learn from region- and pixel-level signals. They also found that language pre-trained weights had less impact on purely vision-based tasks.
The datasets were prepared and refined using specialist models and services like Mask R-CNN, DINO, Azure OCR, etc., which excel at specific task categories and are trained with weak supervision.
Model Architecture
Understanding global semantics and local features is vital for image comprehension. Florence 2 excels at this and adopts a sequence-to-sequence framework to address various vision tasks in a unified manner.
Vision or Image Encoder
Florence-2 uses a DaViT vision encoder to convert input images I ∈ R^(H×W×3) (height, width, channels) into flattened visual token embeddings V ∈ R^(Nv×Dv), where Nv and Dv denote the number and dimensionality of the vision tokens, respectively. Along with this, multi-task prompts are tokenized as text + location embeddings.
Multi-modality encoder decoder
Following the Image Encoder, a standard transformer encoder block’s cross-attention captures the relationship between visual and textual queries. Then, the decoder’s higher-dimensional output is projected into interpretable text, visual, and location representations for downstream tasks.
Model Configuration:
Inference Results: (Florence-2-large-ft)
Here, FT means a Fine-tuned model on a collection of downstream tasks.
Let's perform some experiments on an RTX 4050 GPU with an i5 CPU machine to test Florence-2-large-ft's capabilities on various downstream tasks (a minimal inference sketch follows the task examples below). You can also test with your own images using the HuggingFace Spaces listed on the model's page.
Task: <MORE_DETAILED_CAPTION>
Task: <OPEN_VOCABULARY_DETECTION>
Prompt: Camel
Task: <OCR>
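Below is a minimal inference sketch for Florence-2-large-ft, following the usage pattern published on the Hugging Face model card (AutoProcessor / AutoModelForCausalLM with trust_remote_code). The image path and the chosen task token are placeholders, and exact arguments may differ slightly across transformers versions.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "microsoft/Florence-2-large-ft"

# Florence-2 ships custom modeling/processing code, hence trust_remote_code=True.
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("camel.jpg")                 # placeholder test image
task = "<OPEN_VOCABULARY_DETECTION>"
prompt = task + "Camel"                         # task token + optional text input

inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
raw_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
result = processor.post_process_generation(
    raw_text, task=task, image_size=(image.width, image.height)
)
print(result)   # dict with bounding boxes / labels for the prompted "Camel"
```

Swapping the task token for <MORE_DETAILED_CAPTION> or <OCR> reuses the same pipeline, which is exactly what makes the unified sequence-to-sequence design convenient.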
Highlights of Paper:
- By extending the vocab size of the tokenizer to include location tokens, the model performed better in both spatial coverage and semantic granularity. This eliminates the need for task-specific heads, making Florence-2 a good generalist model.
- Despite their small sizes (base – 0.23B and large – 0.77B), the models deliver neck-and-neck performance against large models like Flamingo, PALI, and Kosmos-2.
- Because of its unified architecture, Florence-2 is capable of tasks such as visual grounding, object detection, referring expression segmentation, open-vocabulary detection, detailed captioning, and more.
💡 Interesting Fact: Earlier, in 2018, Project Florence by Microsoft aimed to develop a plant-human interface using light and electrical signals.
Observation and Takeaways
From our initial testing, we found that Florence-2 excels at OCR and detailed captioning. However, on some images with difficult scenarios, it struggles with prompt-specified object detection or segmentation compared to supervised, task-specific models like YOLO and Mask R-CNN.
The authors suggest that further fine-tuning Florence-2 can improve its domain and task adaptation.
- Florence-2 Inference Notebook [ Link ]
- Fine-tune Florence-2 Blog [ Link ]
2. DocRes: A Generalist Model Toward Unifying Document Image Restoration Tasks
Arxiv: https://arxiv.org/abs/2405.04408
Problem statement: Single network capable of doing five document restoration tasks.
Category: Document analysis and understanding
- DocRes, by Jiaxin Zhang et al. from South China University, is a generalist model for document restoration that eliminates the need for multiple task-specific models, which miss the synergies among tasks present in the input images. It handles five tasks: dewarping, deshadowing, appearance enhancement, deblurring, and binarization.
- Existing methods rely heavily on image-to-image pair visual prompts, ProRes, and Masked Image Modeling (MIM). These methods are resource intensive as they follow a ViT framework limited to 448×448 inputs, which prevents them from adapting to the variable resolutions (commonly up to 1K) of document images.
- DocRes addresses this through an effective visual prompting approach called Dynamic Task-Specific Prompt (DTSPrompt). DocRes analyzes the input image to extract task-specific prior features, and based on these priors, DTSPrompt dynamically generates a prompt for each task, resulting in superior model performance.
The DTSPrompt dynamically adapts to the input image: DTSPrompt = G(I), where G is the DTSPrompt generator and I is the input document image.
Unlike Florence-2, which is a task-agnostic generalist model, DocRes is a task-oriented generalist model, an essential aspect for document restoration tasks.
Dynamic task-specific prompt:
1. Dewarping: The network uses a simple text-line mask algorithm for dewarping, which assists the document segmentation model in generating document masks. Additionally, the authors incorporate the x and y coordinates of each pixel as positional information (prior features) to facilitate backward mapping, enabling the model to better understand and correct spatial distortions.
The DTSPrompt for flattening documents is formed by concatenating the prior document mask and the positional information (x and y coordinate maps) along the channel dimension.
2. Deshadowing:
The DocRes pipeline uses the background of the shadowed document as the prior feature. The authors mention that to estimate this background they apply dilation operations followed by a median filter to remove text and smooth out artifacts.
The DTSPrompt for shadow removal is therefore built from this estimated document background.
3. Appearance Enhancement: Usually, background light, shadow maps, or white-balance kernels are used as prior features for clean appearance restoration. Here, however, the authors opt for a simpler approach: the difference between the input image and the estimated document background (Pbg, as in the previous task) serves as a guidance cue for the initial enhancement.
Clean appearance restoration then follows an empirical formula built from this difference.
4. Deblurring: When fixing a blurred image, methods traditionally use the gradient distribution of the image as a prior feature, which shows how brightness varies across the image. In this paper as well, the gradient map of the image, Pg(Is) ∈ R^(h×w), is taken as the prior.
Deblurring is achieved using a DTSPrompt built from this gradient map.
5. Binarization: As we know, binarization converts a grayscale or color image into a binary mask that separates the text from the background. For this, DocRes first uses the Sauvola binarization algorithm to decide which pixels should be black (0) or white (255), denoted Pb(Is). Along with this, a threshold map (Pt) and gradient information (Pg) are used as prior features to refine the network's decision.
For the binarization task, the DTSPrompt is formulated by concatenating Pb(Is), Pt, and Pg (a toy sketch of this channel-wise prompt construction follows this list).
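To make the prompt-construction idea concrete, here is a toy NumPy sketch (not the authors' code) of how task-specific priors could be stacked with the input image along the channel dimension for the binarization case; the real repository computes these priors with Sauvola binarization, threshold estimation, and gradient extraction.

```python
import numpy as np

def build_dtsprompt_input(image_rgb: np.ndarray, priors: list[np.ndarray]) -> np.ndarray:
    """Stack task-specific prior maps with the input image along the channel axis."""
    expanded = [p[..., None] if p.ndim == 2 else p for p in priors]   # HxW -> HxWx1
    return np.concatenate([image_rgb] + expanded, axis=-1)

h, w = 512, 384
image = np.random.randint(0, 256, (h, w, 3), dtype=np.uint8)     # stand-in document image
p_b = np.random.randint(0, 2, (h, w), dtype=np.uint8) * 255      # binarization prior (Sauvola)
p_t = np.random.randint(0, 256, (h, w), dtype=np.uint8)          # threshold map prior
p_g = np.random.randint(0, 256, (h, w), dtype=np.uint8)          # gradient map prior

network_input = build_dtsprompt_input(image, [p_b, p_t, p_g])
print(network_input.shape)   # (512, 384, 6) -> fed to the restoration network (Restormer)
```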
Highlights of the Paper
- DocRes's ingenuity lies in its prompt fusion and restoration network. The authors integrate the DTSPrompt with the input image along the channel dimension to create a new input for the restoration network (a Restormer model).
- DocRes shows excellent performance across multiple tasks, often surpassing unified models like De-GAN and DocDiff, as well as task-specific SOTA models like DocGeo for dewarping, BGSNet for deshadowing, and UDoc-GAN for appearance enhancement and deblurring. However, for the binarization task, GDB holds the lead, with DocRes closely trailing behind.
- DocRes can be adapted to various image resolutions by replacing the backbone framework (e.g., ViT). The authors also show through ablation studies that DocRes generalizes to out-of-domain data.
Inference Results
Now it's time for real testing; inference is performed on an RTX 3080Ti GPU and a 12-core i7-13700K CPU.
Note: In inference.py, replace np.bool with bool to avoid a NumPy error in Colab.
!python inference.py --im_path ./input/151_in.png --task end2end --save_dtsprompt 1
Observation and Takeaways
Based on our initial round of testing, we found that the DocRes end2end task requires nearly 10 GB of GPU VRAM. However, the inference results are quite promising. Future work can focus on running DocRes in a more optimized way.
Repository: [ Link ]
HuggingFace Spaces [ DocRes ]
For similar work, you may find it interesting to read our earlier posts on document restoration using deep learning and building a document scanner with OpenCV.
You can access the inference notebook for the above project from the Download Code section.
3. DiffMOT: A Real-time Diffusion-based Multiple Object Tracker with Non-linear Prediction
Arxiv: https://arxiv.org/abs/2403.02075
Problem statement: Realtime and accurate diffusion-based non-linear tracker.
Category: Video: Low-level Analysis, motion, and tracking
DiffMOT by Weiyi Lv et al. from Shanghai University is a first-of-its-kind diffusion-probabilistic model for real-time Multi-Object Tracking (MOT), focusing on the challenge of predicting non-linear motion.
MOT scenarios involving linear motion, such as pedestrian tracking, are handled well by heuristic methods like the Kalman filter. Kalman filters assume that an object's velocity and direction remain constant within small intervals of time. As a result, KF trackers don't work well in complex scenarios with non-linear motion (i.e., non-uniform velocity and direction).
For example, dancers on a stage or players in a sport perform different movements at varying speeds.
DiffMOT tackles this kind of movement effectively by predicting the next position of an object's bounding box. It does so by conditioning on the object's trajectories from the previous n frames, which guide the denoising process for the current frame.
Diffusion probabilistic models are typically inefficient for MOT because they start from a rough guess and require thousands of samples and iterative refinement to reach precise predictions, demanding heavy computation. To overcome this, DiffMOT uses a Decoupled Diffusion-based Motion Predictor (D²MP). From previous trajectories and motion information, the motion predictor uses just one-step sampling, reducing inference time while maintaining high accuracy. The association of bounding boxes over time uses the Hungarian algorithm (similar to ByteTrack); a minimal sketch of this association step follows.
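As an illustration of the association step (not DiffMOT's actual implementation), here is a minimal IoU-cost matching sketch using SciPy's Hungarian-algorithm solver; the predicted boxes would come from the motion predictor, and the detections from the detector.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(predicted, detections, iou_threshold=0.3):
    """Match predicted track boxes to new detections with the Hungarian algorithm."""
    cost = np.array([[1.0 - iou(p, d) for d in detections] for p in predicted])
    rows, cols = linear_sum_assignment(cost)          # optimal one-to-one assignment
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < 1.0 - iou_threshold]

tracks = [[100, 100, 150, 200], [300, 80, 360, 180]]   # boxes predicted by the motion model
dets = [[305, 85, 362, 178], [98, 102, 152, 204]]      # boxes from the detector
print(associate(tracks, dets))   # [(0, 1), (1, 0)]
```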
Architecture
Unlike a typical diffusion model with only a data-to-noise mapping, D²MP decouples it into data-to-zero (forward process) and zero-to-noise (reverse process) mappings over time. An HMINet (Historical Memory Information Network) is used in the reverse process of the motion predictor. It uses Multi-Head Self-Attention (MHSA) to capture long-range dependencies across previous frames and summarizes them into a conditional embedding for predicting the motion in the next frame.
Highlights of the Paper:
- DiffMOT achieves state-of-the-art performance on non-linear datasets like DanceTrack and SportsMOT with HOTA scores of 62.3% and 76.2%, respectively, and a real-time inference speed of nearly 22.7 FPS on an RTX 3090 machine.
- It also outperforms widely used trackers like SORT, FairMOT, QDTrack, and ByteTrack in terms of accuracy.
- The detector can be easily replaced with any object detection model to increase speed and detection accuracy, indicating DiffMOT’s flexibility.
Tip
The HOTA (Higher Order Tracking Accuracy) metric combines the detection accuracy of the detector (YOLO-X), the association accuracy, and the localization accuracy of the tracker (D²MP).
Inference Results (Courtesy: DiffMOT Project)
Observation and Takeaways:
From the above inference results, we can see that DiffMOT performs excellently at detection. However, it still faces challenges in videos with sudden changes or complex movements, which lead to ID switching. Despite this, as the authors rightly note in the paper, it clearly outperforms KF trackers. DiffMOT is a good starting point for developing more accurate diffusion-based trackers.
Note: As we have not tested DiffMOT extensively, we are refraining from making a qualitative comparison between DiffMOT and other state-of-the-art trackers. If you are interested in this further, please check the supplementary section of the paper on page 13.
Repository: [ Link ]
4. From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
Arxiv: https://arxiv.org/abs/2401.01885
Problem statement: Generate 3D avatars with just a single audio
Category: Humans: Face, body, pose, gesture, movement.
The Audio to Photoreal framework by Evonne Ng et al. from Meta proposes a novel approach for generating photorealistic avatars that produce realistic conversational motions and gestures for the face, body, and hands from just an audio input.
The team achieved this by combining the diverse gesture possibilities offered by Vector Quantization (VQ) with the nuanced enhancements, such as eye gaze and smirks, provided by the diffusion network.
To better understand this, let’s look at an example: Let’s say we are animating a virtual person in a meta world to wave their hand.
- Without VQ and Diffusion, the wave might look stiff and repetitive like a robot.
- But with VQ, we can give the wave varying patterns or styles each time, making it look more human (see the toy quantization sketch after this list).
- Additionally, with a diffusion network, subtle realistic hand movements, such as bending fingers or hands, will make the avatar appear more natural and lifelike.
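Here is a toy sketch of the quantization step alone (the real model is an autoregressive transformer over a learned codebook; the codebook size below is an assumption): a continuous pose feature gets snapped to its nearest codebook entry, which is what yields discrete, diverse gesture "styles".

```python
import torch

def vector_quantize(pose_features: torch.Tensor, codebook: torch.Tensor):
    """Snap each continuous pose feature (N, D) to its nearest codebook entry (K, D)."""
    distances = torch.cdist(pose_features, codebook)   # (N, K) pairwise distances
    indices = distances.argmin(dim=1)                  # index of the closest code
    return codebook[indices], indices

codebook = torch.randn(512, 64)        # hypothetical learned gesture codebook
poses = torch.randn(8, 64)             # continuous pose features for 8 time steps
quantized, ids = vector_quantize(poses, codebook)
print(ids.tolist())                    # discrete gesture codes chosen per time step
```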
How does it work?
A rich set of dyadic conversations between two people is captured for training. The motion model comprises three major parts:
- a) Face Motion Model: This network is a diffusion model conditioned on conversational audio and lip movements. It generates facial expressions to reconstruct the facial mesh.
- b) Guide Pose Predictor: This autoregressive transformer-based VQ network takes audio as input and outputs coarse guide pose at 1 FPS.
- c) Pose Motion Predictor: The coarse poses are used as extra conditioning to this diffusion network to fill in higher frequency details of the motion.
Finally, the face and body pose are fed into an avatar render network, which generates a photorealistic avatar.
Highlights of Paper:
- The paper presents an alternative way to create synthesized motions of interpersonal conversation with photorealism, addressing the shortcomings of mesh-based or skeletal avatars.
- For the same input audio, the network generates diverse samples, resulting in more peaky and dynamic motions like pointing. Despite being trained on specific individuals, the input features to the network are person-agnostic, so it can adapt to any persona for unseen audio without retraining.
- The team open-sourced a multi-view dyadic (two-person) conversation dataset for accurate body and face tracking and photorealistic 3D reconstruction.
Repository: [ Link ]
Colab Notebook: [ Link ]
💡 DEMO
You may be interested in seeing Real-Time Automatic Speech Recognition and Diarization results with OpenAI Whisper from our earlier article.
5. Object Recognition as Next Token Prediction
Arxiv: https://arxiv.org/abs/2312.02142
Problem Statement: Object recognition with language decoders
Category: Recognition: Categorization, detection, retrieval
This paper, by Kaiyu Yue et al. from Meta, presents a thoughtful idea: performing object recognition in an autoregressive manner with LLMs.
Traditional linear classification networks like ResNet are pretrained on the ImageNet dataset, which contains 1k classes, and therefore have a fixed final-layer output dimension of 1000. This limits the ability of models pretrained on a particular dataset to extend to other classes.
Modern architectures like CLIP can overcome this limitation to some extent by creating a flexible set of object embeddings to detect any class in the input image. However, CLIP requires a predefined set of object descriptions (a gallery) to function as intended, and this gallery can cover only a subset of all possible objects and their variations.
In simple terms, if an image has a dog or cat, CLIP can identify them but may not be able to detect specific features like the breed (like a Dalmatian dog or Angora cat). Thus, even CLIP is limited and can cover only a portion of textual space in practical scenarios. Additionally, increasing the gallery size of CLIP results in performance degradation.
So, an ideal approach is to use an LLM as a decoder to recognize any object and its variations in textual space. Google's Flamingo follows a similar approach, but it requires few-shot examples for each downstream task prior to the inference prompt.
To address this, the authors suggest a more straightforward approach: aligning LLM for recognition tasks only.
Here, a pretrained CLIP or ViT image encoder projects the image embeddings into the higher-dimensional embedding space of the language decoder (LLM).
Model Architecture
- Instead of training the model on Visual Question Answering triplets, the approach uses image-caption pairs. The model was trained on the G70M dataset, assembled by gathering image-caption pairs from CC3M, COCO Captions, SBU, LAION-Synthetic-115M, etc.
- The model generates short and concise tags or labels only rather than a descriptive sentence about the image.
- The authors' ingenuity lies in the model's tokenization mechanism. Tokens of different object labels are treated independently, while tokens from the same label remain conditional on each other; all labels depend mainly on the image embeddings to determine their coexistence within an image. Then, with one-shot sampling, the model generates labels for all objects in the image at the same time (a conceptual sketch of this attention masking follows this list).
- To decrease inference time and improve efficiency, the authors take an LLM like LLaMA and retain only the first few transformer blocks plus the final layer, which are essential for recognition. This makes the LLM decoder more compact, resulting in 4.5x faster inference.
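The sketch below illustrates the kind of non-causal attention mask this tokenization implies (a conceptual reconstruction, not the authors' code): every label token can attend to all image tokens and to the tokens of its own label, but not to tokens belonging to other labels.

```python
import torch

def label_attention_mask(num_image_tokens: int, label_token_counts: list[int]) -> torch.Tensor:
    """Boolean mask (True = attention allowed) for image tokens followed by label tokens."""
    total = num_image_tokens + sum(label_token_counts)
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:, :num_image_tokens] = True                  # everyone attends to the image tokens
    start = num_image_tokens
    for count in label_token_counts:
        end = start + count
        mask[start:end, start:end] = True              # a label's tokens see only each other
        start = end
    return mask

# Example: 4 image tokens and two labels tokenized into 2 and 3 tokens respectively.
print(label_attention_mask(4, [2, 3]).int())
```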
Highlights of Paper:
- Techniques like decoupling tokens of different labels to be independent via a non-causal attention mask, which avoids repetition issues, are quite impressive. Additionally, exploiting the transformer's strong parallelization with one-shot sampling to process multiple labels simultaneously by choosing top-k tokens is a unique approach.
- Benchmarks suggest that the model surpasses GPT-4V Preview, LLaVA, Flamingo, CLIP, etc. for recognition tasks on the COCO validation split.
Inference Results
Observation and Takeaways
- The proposed architecture can be an excellent choice for open-vocabulary recognition, overcoming the limitations of CLIP. However, using an LLM for object recognition can be an overhead on machines with limited hardware resources.
Colab Notebook: [ Link ]
Getting started with Large Language Models can be overwhelming, yet it is an in-demand skill in the current job market. Have a look at our three-part series on LLMs.
6. MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild
Arxiv: https://arxiv.org/abs/2406.01595
Problem statement: 3D Reconstruction from Monocular video
Category: Humans: Face, body, pose, gesture, movement
MultiPly is a novel framework by Zeren Jiang et al. from ETH Zurich and Microsoft for reconstructing multiple people in 3D from monocular RGB videos. Typically, in-the-wild 3D reconstruction demands multi-view setups and specialized equipment. The task also brings additional challenges such as depth ambiguities, human-human occlusions, and dynamic human movements.
MultiPly takes all of these into account and recovers 3D humans with high-fidelity shape, surface geometry, and appearance through pixel-level decomposition (accurate instance-level segmentation) and plausible multi-person pose estimation.
Method
- The process begins by taking each subject's video frames and pose initializations as input and fusing them into a single, temporally consistent human representation per person in a canonical space.
Note
The human points were sampled along a camera ray with Sparse Pixel Matching Loss (SPML) using NeRF++.
- Then, these canonical human models are parameterized by a learnable MLP network that predicts signed distance and radiance values (a toy stand-in for such an MLP is sketched after this list).
- Following that, layer-wise differentiable volume rendering over the entire scene (frame) is applied to extract human meshes.
- To enhance clean separation even in close interactions or occluded scenes, progressive input prompts to SAM are given to dynamically update the instance segmentation masks until the whole human body is covered.
- In addition, a confidence-guided optimization formulation is employed to avoid harmful shape updates due to inaccurate poses, resulting in spatially coherent 3D reconstruction.
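As a toy stand-in for the learnable per-person field described above (layer sizes and dimensions are assumptions, not the paper's configuration), here is a minimal MLP that maps a 3D point in canonical space to a signed distance value and an RGB radiance:

```python
import torch
import torch.nn as nn

class CanonicalHumanField(nn.Module):
    """Maps canonical-space points (N, 3) to a signed distance (N, 1) and radiance (N, 3)."""
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),              # 1 signed-distance + 3 radiance channels
        )

    def forward(self, xyz: torch.Tensor):
        out = self.net(xyz)
        sdf, rgb = out[..., :1], torch.sigmoid(out[..., 1:])
        return sdf, rgb

points = torch.rand(1024, 3)                   # points sampled along camera rays
sdf, rgb = CanonicalHumanField()(points)
print(sdf.shape, rgb.shape)                    # torch.Size([1024, 1]) torch.Size([1024, 3])
```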
Highlights of the Paper
- MultiPly eliminates the need for high-quality 3D data and outperforms contemporary SOTA approaches like ECON and Vid2Avatar in monocular videos with highly occluded scenes.
- Multiple Loss metrics including Reconstruction loss, Instance Mask Loss, Eikonal Loss, Depth Order Loss and Interpenetration loss are considered by the authors to generate highly accurate 3D humans with the MultiPly framework.
Observations and Takeaways
With the advent of spatial-computing devices like the Apple Vision Pro, Meta Oculus Quest, etc., frameworks such as Audio2Photoreal and MultiPly can have a huge impact on creating realistic avatars and virtual agents in AR/VR.
Repository: [ Link ]
💡 DEMO
7. ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation
Arxiv: https://arxiv.org/abs/2312.16217
Problem Statement: Manipulate anything with MLLM
Category: Robotics
ManipLLM, developed by Xiaoqi Li et al. from Peking University, focuses on bringing the reasoning capabilities of Multimodal LLMs (MLLMs) to robots so they can effectively handle objects with a hand gripper.
As we have seen, recent advances in integrating VLMs for robotic perception and reasoning, like 3D-VLA, can adapt dynamically to unseen environments thanks to their generalization capability. Traditionally, robots learn to control the end effector or gripper only after extensive training and simulation, and even then the resulting policy might fail on unseen objects or out-of-domain events.
ManipLLM addresses this problem very effectively by bringing a multi-modal LLM with backbones like LLaMA into the loop.
System Pipeline
During inference, the system takes an RGB image captured by an Intel RealSense camera, projects it into the higher-dimensional embedding space of the LLM, and feeds it into LLaMA together with the user's text prompt to predict the initial pose of the gripper via a chain-of-thought reasoning approach. The network then returns its understanding of the image based on the given instructions, together with a set of coordinates that establish contact at a precise location determined by the LLaMA model.
The chain of thought is to the point, with three main objectives:
- To determine the category of the object.
- Think about how to complete the given task.
- To return the end effector's pose (coordinates and rotation angle).
After making initial contact, an Active Impedance Adaptation Policy within the network plans waypoints to achieve the task gradually with the end effector in a closed loop.
Highlights of the Paper:
- In general, an LLM doesn't have the capability to manipulate objects, so vision and language adapters are injected into the network. These adapters are fine-tuned for the manipulation task while still retaining the reasoning capability of the MLLM.
- From the 2D pixel coordinates and gripper rotation angle returned by the MLLM, the network uses depth maps to project the 2D coordinates into 3D space (a generic pinhole back-projection sketch follows this list).
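The back-projection itself is standard pinhole-camera geometry; the sketch below is a generic illustration (the intrinsics are made up, not taken from the paper's setup):

```python
import numpy as np

def pixel_to_3d(u: float, v: float, depth: float, fx: float, fy: float, cx: float, cy: float):
    """Back-project pixel (u, v) with depth Z (metres) into 3D camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# Hypothetical RealSense-like intrinsics and a contact point predicted at pixel (320, 240).
print(pixel_to_3d(u=320, v=240, depth=0.55, fx=615.0, fy=615.0, cx=320.0, cy=240.0))
# -> [0.   0.   0.55]  (a point 55 cm straight in front of the camera)
```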
Observations and Takeaways
Many research institutes and robotics companies have adopted MLLMs as the reasoning mind that helps robots interpret their perception and plan motion. Notable examples include the integration of OpenAI's GPT-4 into Figure AI robots and demonstrations with Boston Dynamics robots. We believe that as the language and vision spaces fuse, there is great scope for cross-domain knowledge transfer, enabling robots to adapt to new tasks with just a few examples.
Repository: [ Link ]
CVPR Talk [ Link ]
💡 DEMO
Having a career in robotics is a “Pursuit of Happiness.” For a foundational understanding, explore our Getting Started with Robotics Series.
8. MemSAM: Taming Segment Anything Model for Echocardiography Video Segmentation
Paper: [ Link ]
Problem Statement: Robust echocardiography segmentation model
Category: Medical and biological vision, cell microscopy
MemSAM by Deng et al. from Shenzhen University is a novel echocardiography segmentation model aimed at tackling challenges like speckle noise, artifacts, blurred contours, and the shape variations of heart structures over time in ultrasound images.
Tip
Echocardiography segmentation aims to segment key structures of the heart in ultrasound videos.
Examining and manually assessing echocardiograms is time-consuming. Even expert medical practitioners sometimes find it hard to write a perfect evaluation report on the condition. Automating medical imaging with AI is a crucial and pressing need in clinical practice. The main challenge, however, is limited access to perfectly annotated segmentations, and for video there is a need to annotate every frame of an echocardiographic sequence.
That's where MemSAM excels, proposing temporally consistent segmentations for fast-changing echocardiography videos. We know that SAM stands apart due to its excellent feature representation and zero-shot generalizability on natural images. Existing SAM-derived medical segmentation models like MedSAM, SAMed, and SAMUS show promising results, but they haven't yet been explored for video segmentation tasks.
We can still use these variants by passing videos as frames for segmentation, but these methods heavily rely on a large number of prompts and annotations.
MemSAM requires just a point prompt for the first and last frames (the annotated frames), while the in-between frames are tracked using a Memory Prompt generated by the network. Finally, the loss is calculated only on the annotated frames.
MemSAM Framework
- In the SAM block (in gray), the image is first converted into image embeddings. A positive point prompt is then encoded to guide the model, and together they are passed to the mask decoder to generate a predicted probability segmentation map. If the image is not the first frame in the video, its embedding is projected into the memory feature space for Memory Reading.
- The second block (in orange) queries multiple feature memories (Sensory Memory, Working Memory, and Long-term Memory) and generates the Memory Prompt.
- Finally, the predicted probability map from the mask decoder is encoded and used to update the multiple feature memories after memory reinforcement.
Highlights of the Paper
- The addition of Memory Reading and Memory Reinforcement is a unique approach. In Memory Reading, the image embedding is projected to a query (q), which performs an affinity query (W) against the memory bank (long-term and working memory) containing key-value pairs to obtain the memory readout (F). The image embedding, sensory memory, and memory readout are fused to generate a memory embedding. Finally, to avoid noise during memory updates, a memory reinforcement module is employed (a toy sketch of this affinity-based readout follows the list).
- MemSAM achieves state-of-the-art performance on the EchoNet-Dynamic and CAMUS-Semi video datasets, outperforming five medical foundation models in a semi-supervised setting (fewer annotations).
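Here is a toy sketch of an affinity-based memory readout of this kind (shapes and scaling are illustrative assumptions, not MemSAM's exact implementation): the projected image-embedding queries attend over the memory keys and return a weighted sum of the memory values.

```python
import torch
import torch.nn.functional as F

def memory_readout(query: torch.Tensor, mem_keys: torch.Tensor, mem_values: torch.Tensor):
    """query: (N, C) pixels/tokens; mem_keys/mem_values: (M, C) memory bank entries."""
    affinity = query @ mem_keys.T / mem_keys.shape[1] ** 0.5   # (N, M) similarity scores
    weights = F.softmax(affinity, dim=-1)                      # normalized affinity
    return weights @ mem_values                                # (N, C) memory readout

q = torch.randn(4096, 64)      # projected image-embedding queries
k = torch.randn(128, 64)       # memory bank keys (working + long-term memory)
v = torch.randn(128, 64)       # memory bank values
print(memory_readout(q, k, v).shape)   # torch.Size([4096, 64])
```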
Observation and Takeaways
Medical imaging with AI is expected to thrive and offer great value in the coming years. Institutes and companies like Google DeepMind, Microsoft, and OpenAI are actively looking to integrate AI-assisted evaluation reports to ease the process, potentially saving many lives through early-stage detection. At the same time, an expert clinician should always remain in the loop to watch for hallucinations and adverse effects.
Repository: [ Link ]
💡 DEMO
9. EventPS: Real-Time Photometric Stereo Using an Event Camera
Paper: [ Link ]
Problem Statement: Surface normal estimation with an event camera
Category: Physics-based vision and shape from X
EventPS by Bohan Yu et al. from Peking University is an exceptional technique for estimating the surface normal of an object, taking advantage of the phenomenal characteristics of an event camera: high temporal resolution, high dynamic range, and low latency.
What is an Event Camera?
In layman's terms, frame cameras capture sequences of frames at regular intervals of time, whereas event cameras only fire when there is motion or a change in the scene, somewhat like a motion detector, which removes the need to store redundant information about parts of the scene that didn't change. This makes event cameras an excellent choice for real-time applications. An event camera records only logarithmic scene-radiance changes, i.e., events indicating when and where a brightness change occurred.
When the change in brightness of a pixel reaches a trigger threshold, the event camera hardware triggers an event.
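To make the triggering rule concrete, here is a toy single-pixel simulation (a simplification of real event-camera behaviour, with an arbitrary threshold): an event with polarity ±1 is emitted every time the log intensity drifts a full threshold away from the last event's reference level.

```python
import numpy as np

def events_from_log_intensity(log_intensity: np.ndarray, threshold: float = 0.2):
    """Emit (sample_index, polarity) events whenever |log I - reference| >= threshold."""
    events, ref = [], log_intensity[0]
    for i, value in enumerate(log_intensity[1:], start=1):
        while value - ref >= threshold:       # brightness rose enough -> positive event
            ref += threshold
            events.append((i, +1))
        while ref - value >= threshold:       # brightness fell enough -> negative event
            ref -= threshold
            events.append((i, -1))
    return events

# A pixel whose brightness ramps up, then drops sharply (e.g., the slit light passing by).
log_I = np.concatenate([np.linspace(0.0, 1.0, 20), np.linspace(1.0, 0.3, 10)])
print(events_from_log_intensity(log_I))
```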
Are you wondering what a photometric stereo (PS) is?
Photometric stereo keeps the object and the camera static while changing the position of the light source around the object. By capturing multiple images of the object under different lighting, we can estimate its surface normal.
Prior to this work, frame-based cameras were the only choice for PS. Conventional photometric stereo uses one of two lighting setups:
- The first setup holds the light source in a robotic arm and captures images while moving the light densely around the object; this is accurate but time-consuming and not real-time.
- The second uses multiple flashlights at fixed locations and turns them on consecutively; this is real-time but less accurate.
Understanding EventPS setup and Working
EventPS is built on the Lambertian image formation model, I = ρ (l · n), where ρ is the (unknown) albedo, l is the calibrated lighting direction, and n is the surface normal.
EventPS's ingenuity lies in its setup, which addresses two main questions.
How to illuminate an object for an event camera?
The setup uses an event camera at a fixed location observing the object under a continuously rotating slit light (in green). The light, driven by a DC motor, rotates at 30 rotations per second, which is comparable to the frame rate of a conventional camera.
How do we estimate surface normal without absolute intensity?
The proposed idea is to convert each pair of consecutive events into a null space vector.
For two consecutive events at lighting directions l₁ and l₂, the change in log radiance equals the trigger threshold ε, so ρ (l₂ · n) = e^(±ε) ρ (l₁ · n). The albedo ρ cancels, giving the null space vector v = l₂ − e^(±ε) l₁ with v · n = 0, where ε is the event trigger threshold.
The null space vector is perpendicular to the object’s surface normal.
Interesting Fact
Null space vectors also find use in:
- Data analysis: finding the directions in which data varies the least, which underlies dimensionality-reduction methods like PCA.
- Linear programming: finding feasible directions that improve the objective function.
- Computer vision: 3D reconstruction and camera calibration.
💡 DEMO
Highlight of the Paper
- To cope with noise in the null space vectors of real events, multiple events are converted into null space vectors and combined using Singular Value Decomposition (SVD); a minimal version of this step is sketched after this list.
- EventPS setup achieves very accurate surface normal estimation of both static and moving objects.
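The SVD step can be illustrated as follows (synthetic vectors, illustrative only): since every null space vector v satisfies v · n ≈ 0, the surface normal is the right singular vector associated with the smallest singular value of the stacked vectors.

```python
import numpy as np

def normal_from_null_space_vectors(null_vectors: np.ndarray) -> np.ndarray:
    """null_vectors: (K, 3) noisy vectors with v . n ~ 0; returns the unit normal n."""
    _, _, vt = np.linalg.svd(null_vectors)
    n = vt[-1]                                 # direction minimizing ||V n||
    return n / np.linalg.norm(n)

# Synthetic test: vectors roughly perpendicular to n = [0, 0, 1], plus noise.
rng = np.random.default_rng(0)
true_n = np.array([0.0, 0.0, 1.0])
v = rng.normal(size=(50, 3))
v -= np.outer(v @ true_n, true_n)              # remove the component along the true normal
v += 0.01 * rng.normal(size=v.shape)           # add measurement noise
print(normal_from_null_space_vectors(v))       # approx. +/- [0, 0, 1]
```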
Observation and Takeaways
Neuromorphic event-camera-based systems can hugely impact robotics and autonomous-vehicle systems, which must deal with many unpredictable events. The authors suggest there is ample scope for future research in AR/VR face rendering. A prime industrial application would be finding defects in manufactured products using the estimated surface normals.
10. Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods
Arxiv : https://arxiv.org/abs/2212.06872
Problem Statement: Understanding decision principles of Transformers and CNNs.
Category: Explainable Computer Vision
Award: Best Student Paper – Runner-up (Honorable Mention)
Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods, by Mingqi Jiang et al. from Oregon State University, won a CVPR award in the Explainable Computer Vision category. The paper discusses interesting methodologies for understanding how deep black-box models like Transformers and CNNs recognize images.
We know from prior work by Zhuang Liu et al. that, by employing the training strategies used for Transformer models, CNN-based models like ConvNeXt were able to match the performance of Transformer models like ViT and Swin Transformer. This poses further questions: is the attention mechanism of Transformers specifically responsible for their robustness, or have the design principles of ConvNeXt alone contributed to the increase in performance? Traditionally, attribution-based model explanations include gradient-based (Grad-CAM), perturbation-based (RISE), and optimization-based (iGOS++) methods, which highlight the attribution map of a deep network on an image for classification.
However, it is limited to only one explanation per image. So, in 2021, earlier work of Mingqi et al. proposed a search-based method called Structural Attention Graphs (SAG) that goes beyond just one explanation per image.
Search-based algorithms produce Minimally Sufficient Explanations (MSEs): minimal sets of patches that are sufficient to make predictions with confidence similar to that of the full image. SAG uses beam search to find combinations of patches that yield high classification confidence, so we get multiple different explanations per image.
Usually, explanation methods are applied to a single image. Here, the SAG explanation algorithms are run on thousands of ImageNet images to learn the differences among network backbones (CNNs and Transformers). The authors propose two approaches:
a) Sub-Explanation Counting:
An image is divided into non-overlapping patches, followed by beam search to find MSEs at more than 90% confidence. Sub-explanation counting then expands each MSE by deleting one patch (marked in red) of the parent MSE at a time, creating multiple child nodes (subsets) with different classification confidence.
When a child node of an MSE falls below a 50% likelihood ratio, the tree expansion is stopped.
b) Cross Testing
The second approach evaluates how similar the decision principles of CNN and Transformer models are. An attribution map is generated from model A, e.g., VGG-19 (CNN-based), patches are masked via insertion/deletion, and the masked image is passed to model B, e.g., Swin-T (Transformer). If both models make decisions based on the same features, they score high in cross-testing; a conceptual sketch follows.
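The sketch below is a simplification of the paper's protocol (which uses attribution methods such as iGOS++ and proper insertion/deletion curves); here `model_b` is assumed to be any image classifier returning logits, and the attribution map, patch size, and keep fraction are illustrative choices.

```python
import torch

def cross_test_confidence(attribution_a: torch.Tensor, image: torch.Tensor,
                          model_b, keep_fraction: float = 0.2, patch: int = 16) -> float:
    """Keep only the patches model A found most important, mask the rest,
    and measure model B's top-class confidence on the masked image."""
    _, h, w = image.shape                      # assumes h and w are divisible by patch
    # Average model A's (h, w) attribution map into per-patch scores.
    scores = attribution_a.reshape(h // patch, patch, w // patch, patch).mean(dim=(1, 3))
    k = max(1, int(keep_fraction * scores.numel()))
    threshold = scores.flatten().topk(k).values.min()
    mask = (scores >= threshold).float()
    mask = mask.repeat_interleave(patch, 0).repeat_interleave(patch, 1)   # back to (h, w)
    masked_image = image * mask                # zero out the "unimportant" patches
    with torch.no_grad():
        probs = model_b(masked_image.unsqueeze(0)).softmax(dim=-1)
    return probs.max().item()
```

If each model stays confident on the evidence the other model highlighted, the two are judged to rely on similar features.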
From the results, the model’s behavior falls under two categories:
- Compositional Behavior: Models like ConvNext and non-distilled transformers make decisions based on multiple features in an image. If some parts are deleted, there is only a small change in the decision-making confidence.
- Disjunctive Behavior: On the contrary, traditional CNNs and distilled transformers make decisions based on a few parts of the image. Even if a large part of the image is missing, the model still makes accurate predictions.
The authors also conducted ablation studies, decreasing the receptive field to a 3×3 kernel in ConvNeXt-T and 4×4 in Swin-T, and observed a 40% drop in the total number of sub-explanations. To better understand this significant drop, they ran multiple experiments; the biggest effect came from replacing layer normalization with batch normalization and training with a small receptive field, which reduced the sub-explanation count by 88%.
Fewer sub-explanations mean the model is more disjunctive in behavior. The authors therefore conclude that normalization layers have a great impact on a model's behavior: layer normalization/group normalization encourages compositional behavior, while batch normalization pushes the model toward disjunctive behavior.
Highlights of the Paper
- Instead of relying on anecdotes or assumptions, the paper shows actual experimental results on the first 5k images of the ImageNet validation set.
- The paper answers our initial questions: models of the same type use similar features for their predictions.
Observation and Takeaways:
This study is one of a kind and helps bridge the gap in our understanding of how deep black-box models base their decisions on certain features or design principles. More research of this kind opens new possibilities for building optimal neural models for production and research while reducing carbon footprints.
Repository [ Link ]
11. LEAP-VO: Long-term Effective Any Point Tracking for Visual Odometry
Arxiv: https://arxiv.org/abs/2401.01887
Problem statement: Tracking dynamic scenes, occlusion, and low texture areas with point trackers.
Category: Video: Low-level Analysis, motion, and tracking
LEAP-VO by Weirong Chen et al. from TU Munich is a new method for enhancing motion estimation and handling track uncertainty in dynamic environments. LEAP-VO makes use of temporal context through long-term point tracking.
Visual odometry (VO) estimates the motion of a moving camera based on visual cues. In simple terms, given a sequence of images, VO determines the location and orientation of the capturing camera.
Feature-based Visual Odometry
Classical approaches like feature-based monocular visual SLAM extract feature points in the first image and track them throughout the video. The camera pose is then recovered by minimizing the reprojection error, i.e., bundle adjustment.
Note
A reprojection error is the distance between a keypoint detected in an image and the projection of its corresponding 3D world point into the same image.
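For reference, here is a minimal pinhole-model sketch of that quantity (the intrinsics and pose below are arbitrary placeholders):

```python
import numpy as np

def reprojection_error(point_3d, keypoint_2d, K, R, t):
    """Pixel distance between a detected keypoint and the projection of its 3D point,
    using the pinhole model x ~ K (R X + t)."""
    p_cam = R @ point_3d + t                   # world -> camera coordinates
    p_img = K @ p_cam                          # camera -> homogeneous image coordinates
    projected = p_img[:2] / p_img[2]           # perspective divide
    return np.linalg.norm(projected - keypoint_2d)

K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])                # placeholder intrinsics
R, t = np.eye(3), np.zeros(3)                  # identity pose for the example
print(reprojection_error(np.array([0.1, -0.05, 2.0]), np.array([350.0, 225.0]), K, R, t))
```

Bundle adjustment then jointly optimizes camera poses and 3D points to minimize the sum of such errors over all observations.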
However, feature-based VO is unreliable in the following cases:
- when the scene is dynamic
- when there is an occlusion
- for a low-texture area.
Prior works use two-view or pair-wise trackers on subsequent frames, which don't handle occlusion properly. Taking all this into account, LEAP-VO follows an effective learning-based approach with rich temporal context and continuous motion patterns to tackle these challenges in multi-viewpoint tracking, even under partial occlusion.
Methodology
The point tracking front end (LEAP) handles:
- Occlusion handling with Multi-frame tracking
- Dynamic detection with Anchor-based motion estimation
- Reliable estimation with Temporal probabilistic formulation
In this pipeline, feature maps are extracted from images captured over time. Query points are then sampled from an image, along with additional anchor points based on image gradients, and tracked over time. These points are refined iteratively with a refinement network that exploits the following:
- Channel: Uses channel information between feature maps
- Inter-track: uses the relationship between points being tracked
- Temporal: Utilizes temporal information of the image sequence.
The refinement network outputs the point distributions and motion over time, which indicate visibility and dynamic motion in the video.
In a video, feature keypoints are extracted from a sequence of RGB images and fed into the LEAP front end for tracking. Trajectories are then created for these keypoints, which the LEAP module tracks across frames within the LEAP window. Not all tracked keypoints are useful; some might be invisible or unreliable, so it's better to remove them, as they would add noise when averaged.
Finally, the local Bundle Adjustment (BA) module is applied across frames in the current BA Window to update the camera pose and 3D positions, effectively minimizing reprojection error.
The green points represent the static scene in an image, the yellow points are ambiguous or unreliable, and the red points are for dynamic scenes.
💡 DEMO
Highlights of the Paper
- LEAP-VO outperforms state-of-the-art methods, including VO and SLAM-based approaches, across different datasets for both static and dynamic scenes, thanks to its effective long-term point tracking.
Observation and Takeaways
LEAP-VO can find extensive application in autonomous vehicle navigation, robot path planning, and movement tracking in AR/VR.
Repository: [ Link ]
Key Datasets
Datasets | Project |
1. 4D-DRESS: A 4D Dataset of Real-world Human Clothing with Semantic Annotations | Link |
2. VideoCon: Robust Video-Language Alignment via Contrast Captions | Link |
3. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark | Link |
4. SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes | Link |
5. DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision | Link |
6. Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild | Link |
7. EgoGen: An Egocentric Synthetic Data Generator | Link |
8. MM-AU: Multi-modal accident video understanding | Link |
9. DiVa-360: The Dynamic Visual Dataset for Immersive Neural Fields | Link |
10. MCD: Diverse Large-Scale Multi-Campus Dataset for Robot Perception | Link |
11. MARS: MultiAgent, multitRaverSal, multimodal autonomous vehicle dataset | Link |
Special Mention🔥
- The Longuet-Higgins Prize recognizes CVPR papers from ten years ago that have made a significant impact on computer vision research. This year it was awarded to the famous R-CNN paper on object detection and semantic segmentation from 2014. Check our post on how region proposals in R-CNN work.
- Rich Human Feedback for Text-to-Image Generation from Google got the Best Paper Award of CVPR 2024.
- Mip-Splatting: Alias-free 3D Gaussian Splatting by Zehao Yu et al. got the Best Student Paper Award.
Key Takeaways of CVPR 2024 Research Papers
- Conferences like CVPR, NeurIPS, IROS never fail to surprise us by bringing innovative research into the limelight in the Deep Learning and Robotics domain.
- Any single research paper can be game-changing and define the pace of technology for the next 10 to 20 years, like “Attention is All You Need.” In addition to these groundbreaking papers, there were notable workshops on autonomous vehicles by Wayve.ai, GenAI Sora by OpenAI, and others.
Conclusion
We hope you found it intriguing and insightful to read the essential gist of the latest research trends in AI with demos. In a two-part series, we gave a comprehensive overview of CVPR 2024, covering major categories to the best of our knowledge. Which of the CVPR 2024 research papers do you think was a showstopper and had an absolute visual treat? We would love to hear from you in the comments.
Looking ahead, we will try to provide an in-depth review of the latest research and conferences in the fields of AI and Computer Vision.
Stay tuned by subscribing to get notifications.🔔