
VLM for Video Understanding with Spatial and Temporal Context: NVIDIA Cosmos Reason1


NVIDIA’s Cosmos Reason1 is a family of Vision Language Models trained to understand the physical world and make decisions for embodied reasoning. What makes Cosmos Reason1 a promising contender for video understanding and embodied reasoning is mainly its dataset and training strategy.

Existing LLMs or VLMs are disembodied intelligent systems trained on web-scale internet data. While they may acquire the essential knowledge to reason about the real world, they often struggle to ground that knowledge in real-world dynamics and interactions. They lack the spatial, temporal, and physical understanding of how actions performed by a physical system affect its environment, something humans grasp instinctively and effortlessly, for example spatial puzzles, object permanence, or the arrow of time. This makes them less suitable for complex systems such as robots or autonomous vehicles that need to perceive, understand, and interact with the physical world.

  1. Training Dataset
    1. Physical common sense
    2. Embodied reasoning
  2. Multimodal Architecture
  3. Cosmos Reason1: Training Strategy
    1. Supervised Fine-tuning
    2. Reinforcement Learning
  4. Benchmark Results
  5. Code Walkthrough of Cosmos Reason1 Inference
  6. Conclusion
  7. References

To generate physically grounded responses, Cosmos Reason1 interprets the physical world from the given video input and reasons about it with long chain-of-thought reasoning.

Simply put, understanding how the world works while respecting physical constraints is called physical common sense whereas embodied reasoning refers to how an agent acts or interacts within that world to plan actions that achieve goals.

Let’s take a closer look at them.

Training Dataset

The model was trained on a carefully curated set of ~4M video-text pairs that focuses on:

I. Physical common sense

This part of the dataset has 16 subcategories covering:

  • Space: Relationships between objects, what’s plausible (can a football fit in a teacup?), what objects afford, and understanding different environments.
  • Time: The sequence of actions, forward or backward, cause and effect (causality), etc.
  • Fundamental Physics: Object permanence, states, mechanics, thermodynamics. It even includes anti-physics to help the model recognize impossible scenarios.
Fig: Physical common sense subcategories

The dataset includes captions for physical understanding, multiple-choice questions, and long chain-of-thought reasoning traces. The data preparation was developed as two pipelines:

1. High-quality and detailed narratives of videos created by human annotators and pre-trained VLMs.
2. Model distillation from DeepSeek-R1 for Physical AI supervised fine-tuning, which captures DeepSeek’s long chain-of-thought reasoning traces and teaches Cosmos Reason1 how to think.

Fig: Categorical Distribution of Physical Common Sense Benchmark (604 Questions)

II. Embodied reasoning

Understanding the world is one thing; knowing how to act within it as an embodied agent (like a robot or a self-driving car) is another important aspect of intelligent systems. The second part of the dataset was designed around NVIDIA’s 2D embodied reasoning ontology, focusing on four key capabilities across five types of embodied agents.

The types of embodied agents are:

  • Humans
  • Animals
  • Robot Arms
  • Humanoid Robots
  • Autonomous vehicles

The data aimed to teach reasoning for tasks such as:

  • Task-completion verification: “Did the robot successfully pick the block?”
  • Action affordance: “Is it possible for the car to make this turn safely?”
  • Next plausible action prediction: “Given that a pedestrian is crossing the road, what is the next likely action the car should take?”

Embodied reasoning enables physical systems to ground their actions so they can operate intelligently in uncertain and dynamic environments.

Multimodal Architecture: Transformers and Mamba MLP

Cosmos VLM uses a vision encoder and an LLM decoder. The vision encoder processes the visual information while the text is tokenized into text tokens. Both the visual and text representations are projected into a common embedding space, concatenated and then passed to the LLM decoder.

Fig: Cosmos-Reason1-7B and 56B Model Architecture

The 7B model uses a standard dense Transformer architecture, whereas the 56B employs a hybrid Mamba-MLP-Transformer decoder. This design choice is due to the Mamba architecture’s efficient handling of long video sequences, where traditional Transformers suffer from quadratic complexity.
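To make the data flow concrete, here is a minimal, hypothetical sketch of the vision-encoder / projector / decoder pipeline described above. The module names and shapes are purely illustrative and are not the actual Cosmos Reason1 implementation.

import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    def __init__(self, vision_encoder: nn.Module, llm_decoder: nn.Module,
                 vision_dim: int, text_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder              # encodes video frames into visual features
        self.projector = nn.Linear(vision_dim, text_dim)  # projects visual features into the LLM embedding space
        self.llm_decoder = llm_decoder                    # dense Transformer (7B) or hybrid Mamba-MLP-Transformer (56B)

    def forward(self, video_frames, text_embeddings):
        visual_feats = self.vision_encoder(video_frames)  # (batch, num_visual_tokens, vision_dim)
        visual_tokens = self.projector(visual_feats)      # (batch, num_visual_tokens, text_dim)
        # Concatenate visual and text tokens and let the decoder attend over both
        fused = torch.cat([visual_tokens, text_embeddings], dim=1)
        return self.llm_decoder(fused)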

Fig: Cosmos Reason1-56B – Hybrid Mamba MLP Transformer backbone

The following table summarizes the configuration details of the Cosmos-Reason1 models.

Table 1: Configuration details of Cosmos-Reason1 model

Cosmos Reason1: Training Strategy

The models were trained in two stages:

  • Physical AI SFT
  • Physical AI RL

I. Supervised Fine-tuning (SFT)

The objective of this stage is to take a pretrained Vision Language Model (Qwen2.5-VL) and specialize it for Physical AI tasks using the curated dataset, which consists of:

  • 1.82M understanding QA pairs
  • 1.94M long Chain-of-thought reasoning traces

These models are designed to take a video input along with text prompts and generate natural language responses. The output responses aren’t just simple answers; they can include:

  • Explanations: Describing why something is happening.
  • Embodied decisions: Suggesting the next action to take.
  • Reasoning: Going through long chain-of-thought reasoning (mental steps) to arrive at a conclusion.
Fig: Embodied Reasoning SFT Data Curation Pipeline

II. Reinforcement Learning (RL)

To further enhance the model’s physical common sense and embodied reasoning capabilities for decision making, the authors applied reinforcement learning with simple rule-based rewards: an accuracy reward (did the model pick the correct MCQ option?) and a format reward (did the model correctly use <think> and <answer> tags for its reasoning and final answer?).

The authors used GRPO (Group Relative Policy Optimization), an efficient algorithm similar to the one used in DeepSeek-R1. It doesn’t require training a separate critic model, and the training framework is fully asynchronous. For each prompt, a group of responses is sampled, and the advantage of each response is computed relative to the group’s mean and standard deviation of rewards:

A_i = \frac{R(o_i) - \mathrm{mean}(\mathcal{G})}{\mathrm{std}(\mathcal{G})}
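The following is a minimal sketch of how such rule-based rewards and group-relative advantages could be computed for a batch of sampled MCQ responses. The parsing rules and the 1.0/0.5 reward weights are our own simplification for illustration, not NVIDIA’s exact implementation.

import re
import numpy as np

def rule_based_reward(response: str, correct_option: str) -> float:
    # Format reward: did the model follow the <think>...</think> <answer>...</answer> template?
    format_ok = bool(re.search(r"<think>.*?</think>.*?<answer>.*?</answer>", response, re.S))
    # Accuracy reward: does the content of <answer> match the correct MCQ option?
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", response, re.S)
    accuracy_ok = match is not None and match.group(1).strip() == correct_option
    # The 1.0 / 0.5 weights are arbitrary placeholders, not NVIDIA's exact values
    return 1.0 * accuracy_ok + 0.5 * format_ok

def group_relative_advantages(responses, correct_option):
    # A_i = (R(o_i) - mean(G)) / std(G), over the group G of responses sampled for one prompt
    rewards = np.array([rule_based_reward(r, correct_option) for r in responses])
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # epsilon avoids division by zero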

Example 1:

  • Before RL: Given a tricky scenario (single lane, parked cars, oncoming traffic, double yellow lines indicating no passing) and asked for the next action, the model might incorrectly suggest changing to a (non-existent or illegal) left lane.
  • After RL: The model is tuned to correctly identify the constraints and answer “None of the provided options” to abide by road rules and ensure safety. This is the interesting part: rather than simply choosing from the given options or guessing a likely one, the model actually reasons through the given environment and intelligently decides on the appropriate next action.

Example 2:

Fig: Spatial Puzzle
Fig: Spatial Awareness of Cosmos Reason1

The authors also developed a novel, fully asynchronous, and robust RL training framework to make fine-tuning highly efficient and fault-tolerant.

Fig: Architecture of RL Training Framework

In this two-stage approach, SFT provides the broad knowledge and chain-of-thought reasoning capabilities, and RL then refines decision making. Together, these enabled Cosmos Reason1 to outperform proprietary models like Gemini 2.0 Flash and GPT-4o.

Benchmark Results

To evaluate Cosmos Reason1 on Physical AI capabilities, namely physical common sense and embodied reasoning, NVIDIA developed a suite of benchmarks tailored to measure model performance across various task-specific datasets. It consists of:

  • 604 MCQs for common sense reasoning
  • 610 MCQs for embodied reasoning

The benchmark results show that Cosmos-Reason1-7B was able to roughly double its scores compared to the Qwen2.5-VL-7B baseline. The benchmarks also show that post-training with RL yielded an additional gain of around +5 points on average.

Table: Evaluation on physical common sense and embodied reasoning benchmark
Table: Evaluation on embodied reasoning benchmark
Table: Evaluation on intuitive physics benchmark

Code Walkthrough of Cosmos Reason1 Inference

We first tried the official vLLM pipeline for inference on an RTX 3080 Ti (12 GB) and ran into OOM errors. The pipeline also doesn’t support 4-bit or 8-bit quantization. Since Cosmos-Reason1-7B is fine-tuned from the Qwen2.5-VL Instruct model, we decided to use Qwen2_5_VLForConditionalGeneration so we could leverage bitsandbytes 4-bit quantization, which brought VRAM usage down to roughly 8.1 GB.

Qwen-VL Utils contains a set of helper functions for processing and integrating visual-language information with the Qwen-VL series models.

Installing Dependencies

!pip install transformers==4.51.3 accelerate 
!pip install qwen-vl-utils

Next, install the bitsandbytes package for model quantization.

!pip install bitsandbytes

Import Packages

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
from qwen_vl_utils import process_vision_info
import torch

Load the model in 4bit Quantization

We noticed a difference in response quality with quantization; however, it was still better than the base Qwen2.5-VL Instruct model.

# Model and quantization setup
MODEL_PATH = "nvidia/Cosmos-Reason1-7B"
# MODEL_PATH = "Qwen/Qwen2.5-VL-7B-Instruct"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

# Load model
llm = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
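If you want to verify the memory footprint on your own GPU (we observed roughly 8.1 GB with this 4-bit NF4 configuration), you can query PyTorch’s CUDA allocator right after loading the model:

# Check how much GPU memory the 4-bit quantized model occupies (requires a CUDA device)
print(f"Allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
print(f"Reserved:  {torch.cuda.memory_reserved() / 1024**3:.2f} GB")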

The model expects the video input as a list of message dictionaries using Qwen’s reasoning (<think>) chat template.

# Message structure
video_messages = [
    {"role": "system", "content": (
        "You are a helpful assistant. Answer the question in the following format:\n"
        "<think>\nyour reasoning\n</think>\n\n<answer>\nyour answer\n</answer>."
    )},
    {"role": "user", "content": [
        {"type": "text", "text": "Is it safe to turn right?"},
        {"type": "video", "video": "assets/av_example.mp4", "fps": 4},
    ]},
]

The processor receives the video message and applies the chat template to match Cosmos Reason1’s expected input structure.

# Processor and inputs
processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    video_messages,
    tokenize=False,
    add_generation_prompt=True,
)

image_inputs, video_inputs, video_kwargs = process_vision_info(video_messages, return_video_kwargs=True)

inputs = processor(
    text=[prompt],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(llm.device)

Finally, the preprocessed input is passed to the LLM decoder to generate the response.

# Generate
generated_ids = llm.generate(
    **inputs,
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.05,
    max_new_tokens=4096,
)

generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
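Since the system prompt instructs the model to wrap its reasoning in <think> tags and its final answer in <answer> tags, you can optionally split the two for a cleaner display. The small helper below is our own convenience addition, not part of the official pipeline:

import re

def split_reasoning(text: str):
    # Separate the <think> reasoning trace from the final <answer> block
    think = re.search(r"<think>(.*?)</think>", text, re.S)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.S)
    return (think.group(1).strip() if think else None,
            answer.group(1).strip() if answer else text.strip())

reasoning, final_answer = split_reasoning(output_text[0])
print("Final answer:\n", final_answer)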

Inference 1: Testing Decision Making

Query: “Is it safe to turn right?”

Note: The model’s thinking trace is truncated and only the final response (<answer>) is shown in the following image.

Cosmos Reason1-7b Response
Qwen2.5-VL Instruct Response (Non-Reasoning)

Inference 2: Testing Temporal Spatial Awareness

As humans, we can easily see that the following video is playing backward. Initially, with only SFT, Cosmos Reason1 wasn’t right about the arrow of time based on the action and movement in the frames. However, with additional RL post-training, the model correctly comprehends that the video is playing backward, showing the usefulness of the RL stage.

Query: “Is the video playing backward or forward?”

Inference 3: Action Prediction

Question: The overall goal is to “pickup items in the supermarket”. The agent in the video is currently performing one subtask out of many to complete this instruction. For the agent in the video, what is the most plausible next immediate subtask?
 Place the green snack package into the shopping cart.

Inference 4

Query: “Explain the Ad in Detail”
The advertisement for Pepsi begins with a dark, mysterious setting where two cans of Pepsi stand prominently against an almost pitch-black backdrop (Frame 0.0). As the scene progresses, one can is moved into focus by a hand, revealing its sleek design and vibrant colors more clearly while maintaining some anonymity due to blurred figures in the background (Frame 3.29).

Next, attention shifts to the opening process as droplets of condensation form on the can's surface when it is opened (Frame 6.63), emphasizing freshness and refreshment. The camera then transitions smoothly to show a glass filled with ice cubes, bubbles rising vigorously within the liquid, highlighting the carbonation and effervescence typical of Pepsi (Frame 9.97).

In subsequent scenes, different settings illustrate various contexts where Pepsi can be enjoyed—such as at casual gatherings like barbecues or movie nights. A can sits unobtrusively amidst plates of food, suggesting convenience during social events (Frame 13.3); another appears alongside popcorn and drinks being held up enthusiastically among friends in what looks like a lively party atmosphere (Frame 16.64 & Frame 23.31).

The visual narrative continues with abstract representations of energy and excitement using dynamic patterns around the iconic logo (Frames 19.98 and 26.65). These abstractions symbolize how Pepsi enhances moments shared with others.

Finally, the campaign culminates in a bold statement about preference ("BETTER WITH"), accompanied by multiple cans arranged diagonally across a stark black canvas, reinforcing the brand's message that Pepsi complements every occasion better than any other soda option (Frame 29.99). This final frame serves as a strong conclusion, leaving viewers with a clear impression of why they should choose Pepsi over competitors.

Conclusion

Reasoning over videos requires an excellent understanding of the spatial and temporal context across all the available frames of a video. It is a challenging problem where modern Vision Language Models (VLMs) often struggle and fall short in generating accurate responses. Cosmos Reason1’s success lies in its design choices (Mamba), dataset curation, and two-stage fine-tuning (SFT and RL). Cosmos Reason1 represents a significant leap in video understanding, Physical AI, and embodied reasoning. It’s another gem for the open-source community from NVIDIA.

References

  1. Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
  2. Cosmos Reason1: Project
  3. Docscope R1: HuggingFace Spaces

